newLISP Fan Club

Forum => newLISP in the real world => Topic started by: Fritz on October 07, 2009, 02:15:48 PM

Title: How to take one byte from a string
Post by: Fritz on October 07, 2009, 02:15:48 PM
I'm trying to read a string byte by byte (to convert from an 8-bit codepage to UTF-8). But (pop the-string) returns a seemingly random number of bytes, and so does (the-string 0), etc.:

http://img7.imageshost.ru/imgs/091008/3b8db7732c/11005.png



(set-locale "C") did not help either. The only working way I have found is to write a temporary file and then use the read-char function.



; Usage: (cyr-win-utf "text in windows-1251 encoding")
; Decodes text from windows-1251 to utf-8
(define (cyr-win-utf t-linea)
  ; Loading encoding table
  (set 'en-win-1251 '((255 "я") (254 "ю") (253 "э") (252 "ь") (251 "ы")
  (250 "ъ") (249 "щ") (248 "ш") (247 "ч") (246 "ц") (245 "х") (244 "ф")
  (243 "у") (242 "т") (241 "с") (240 "р") (239 "п") (238 "о") (237 "н")
  (236 "м") (235 "л") (234 "к") (233 "й") (232 "и") (231 "з") (230 "ж")
  (184 "ё") (229 "е") (228 "д") (227 "г") (226 "в") (225 "б") (224 "а")
  (223 "Я") (222 "Ю") (221 "Э") (220 "Ь") (219 "Ы") (218 "Ъ") (217 "Щ")
  (216 "Ш") (215 "Ч") (214 "Ц") (213 "Х") (212 "Ф") (211 "У") (210 "Т")
  (209 "С") (208 "Р") (207 "П") (206 "О") (205 "Н") (204 "М") (203 "Л")
  (202 "К") (201 "Й") (200 "И") (199 "З") (198 "Ж") (168 "Ё") (197 "Е")
  (196 "Д") (195 "Г") (194 "В") (193 "Б") (192 "А")))
  ; saving string to a temp file
  ; (crypto:md5 needs the crypto module loaded first, e.g. with (module "crypto.lsp"))
  (set 't-file-name (append "/tmp/" (crypto:md5 (string (random)))))
  (write-file t-file-name t-linea)
  ; loading characters to the t-out
  (set 't-out "")
  (set 't-file (open t-file-name "read"))
  (while (set 't-char (read-char t-file))
    (push (or (lookup t-char en-win-1251) (char t-char)) t-out -1))
  (close t-file)
  t-out)


Maybe there is a shorter way, without writing a file? I need this function on both Linux and Windows, and the Windows temp directory has a different name.
Title:
Post by: cormullion on October 07, 2009, 02:52:55 PM
Does unpack help at all?
Title:
Post by: Fritz on October 07, 2009, 03:14:19 PM
Quote from: "cormullion"Does unpack help at all?


Thank you! Yes, I think "unpack" is the solution. The function is much shorter now:


(define (cyr-koi-utf-2 t-linea)
  ; putting character codes to the list; length already counts bytes,
  ; so one "b" (unsigned byte) per byte of input is enough
  (set 't-list (unpack (dup "b" (length t-linea)) t-linea))
  ; decoding characters from 't-list to the 't-out
  (set 't-out "")
  (dolist (t-char t-list)
    (push (or (lookup t-char en-koi8r) (char t-char)) t-out -1))
  t-out)


It works OK. I found a funny thing, by the way. The manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.
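
The unpack approach can be reduced to a small helper. This is only a sketch: string-bytes is a hypothetical name, not something from the thread, and it assumes that length returns the byte count of the string, which is what (length "one-russian-letter-in-utf-8") returning 2 suggests.

```newlisp
; Hypothetical helper (not from the thread): list the byte values of a string.
; Assumes length counts bytes; "b" is unpack's unsigned 8-bit format.
(define (string-bytes s)
  (unpack (dup "b" (length s)) s))

(string-bytes "abc")   ; → (97 98 99)
```

Because the format string is built from the byte length, the result has one number in the range 0..255 per byte, whatever the encoding of the input.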
Title:
Post by: Jeff on October 07, 2009, 03:14:24 PM
dostring processes a string one char at a time...
Title:
Post by: Fritz on October 07, 2009, 03:20:32 PM
Quote from: "Jeff"dostring processes a string one char at a time...


dostring takes several bytes from the string at a time, and I need only one byte:

http://img7.imageshost.ru/imgs/091008/ee0e3865a2/4489d.png
Title:
Post by: m35 on October 08, 2009, 08:35:15 AM
Quote from: "Fritz"Manual says: "Length... returns... the number of characters in a string". But (length "one-russian-letter-in-utf-8") returns 2, not 1.


What version of the manual are you using? The current manual says


Quote from: "The manual"Returns ... the number of bytes in a string.


There is also utf8len for utf8 strings.



I've run into trouble myself when treating strings as binary data. It would work fine in normal newlisp, then blow up when running in utf8 newlisp. I can't remember what I did to make things universal, though.



Edit

Looked at the functions in the manual and I see 3 functions that work with bytes regardless: unpack (as you know), slice, and get-char.



You could just loop over the bytes with slice or get-char
(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)
Title:
Post by: cormullion on October 08, 2009, 10:00:01 AM
You could even use implicit slicing:


  (offset length utf8-str)

but don't confuse it with implicit indexing:


   (utf8-str offset length)

which does work on characters not bytes.



You can sometimes write code for both UTF8 and non-UTF8. Eg:


(define (string-length s)
    (if unicode (utf8len s) (length s)))
Title:
Post by: Fritz on October 08, 2009, 11:09:20 AM
I think I have an old manual. That is good: now I can be sure my "unpack" will work in future versions too.


Quote from: "m35"
You could just loop over the bytes with slice or get-char
(for (i 0 (- (length s) 1))
   (setq c (slice s i 1))
   ; or
   (setq c (char (get-char (+ i (address s)))))
)


Slice works, at least with the utf8 locale and an ASCII-encoded line. But get-char gives me only strange negative numbers. Only this convoluted construction works:



(dotimes (i (length rln))
  (print (or (lookup (+ 256 (get-char (+ i (address rln)))) en-win-1251) "?")))
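
A possible explanation for the negative numbers, offered as an assumption rather than something confirmed in the thread: on this build get-char appears to return the byte as a signed 8-bit value, so bytes 128..255 come back as -128..-1. Adding 256 repairs those values but would push plain ASCII bytes past 255. Masking with 0xFF covers both ranges. The name byte-at is hypothetical.

```newlisp
; Hypothetical helper (not from the thread): byte at offset i as a value 0..255.
; The mask makes the result the same whether get-char returns signed or unsigned values.
(define (byte-at s i)
  (& 0xFF (get-char (+ i (address s)))))

(byte-at "\255A" 0)   ; → 255
(byte-at "\255A" 1)   ; → 65
```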
Title:
Post by: Fritz on October 08, 2009, 11:42:47 AM
Quote from: "cormullion"You could even use implicit slicing:


  (offset length utf8-str)


But how? ((address str) 1 1) ?
Title:
Post by: cormullion on October 08, 2009, 02:51:10 PM
How about


(set 's "\004\003\002\001")
(for (i 0 3)
  (println (get-char (address (i 1 s)))))
4
3
2
1


where i is the offset, 1 is the length, and s is the string you're slicing...
Title:
Post by: Lutz on October 08, 2009, 06:11:36 PM
You can do without 'address' if the argument is a string. This will do it too:


(for (i 0 3) (println (get-char (i 1 s))))
Title:
Post by: Fritz on October 10, 2009, 02:32:15 PM
Quote from: "Lutz"You can do without 'address' if the argument is a string. This will do it too:


(for (i 0 3) (println (get-char (i 1 s))))


(get-char (address (i 1 s)))



always gives "0" as a result.


(get-char (i 1 s))

works, but only in this strange form:


(+ 256 (get-char (i 1 s)))

PS: it's a pity "explode" cannot work with raw bytes, so I cannot use "map".
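
One way to get a map-based decoder anyway is to map over the list that unpack returns instead of over the string itself. The sketch below is an illustration, not code from the thread: decode-bytes and its table parameter are hypothetical names, and table is assumed to be an association list shaped like en-win-1251, that is ((byte-value "replacement string") ...).

```newlisp
; Hypothetical sketch: decode an 8-bit string by mapping over its byte values.
; Bytes missing from the table fall back to (char b), as in the original function.
(define (decode-bytes s table)
  (join (map (fn (b) (or (lookup b table) (char b)))
             (unpack (dup "b" (length s)) s))))

; ASCII-only demo table so the example does not depend on the source file encoding
(decode-bytes "a\255" '((255 "ya")))   ; → "aya"
```

This needs no temporary file, so nothing in it depends on the name of the temp directory on Linux or Windows.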