Force nth to not use utf8?

Started by dexter, January 02, 2014, 09:04:29 PM

dexter

Well, newLISP supports UTF-8.

(nth index utf8str) will return a UTF-8 string, i.e. a whole character.

(slice utf8str 0 1) will return part of a UTF-8 string, let's say one byte.

Why does nth work like that?

And how can I force nth to work without the UTF-8 behaviour, like slice does, without re-compiling newLISP?
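
To make the difference in the question concrete, here is a minimal sketch, assuming a UTF-8 enabled newLISP build and a UTF-8 encoded source file. The sample string is only an illustration; the comments describe what each call is expected to do under those assumptions.

```newlisp
; Sketch only: assumes a UTF-8 enabled newLISP build.
(set 'str "我能")            ; two characters, three bytes each in UTF-8

(length str)                 ; counts bytes
(utf8len str)                ; counts UTF-8 characters

(nth 0 str)                  ; character-oriented: the first whole character
(slice str 0 1)              ; byte-oriented: a string holding only the first byte
(first (unpack "b" str))     ; the first byte as an unsigned integer
```

So when single bytes are wanted, slice or unpack can be used in place of nth without touching the build.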

dexter

#1
I re-compiled newLISP.

This is not the first time -DSUPPORT_UTF8 has made a mess for me.

I really don't think newLISP should support UTF-8 natively. It could be based on the system: leave UTF-8 to the system.

Isn't that better?

Lutz

#2
Most people in the world speak languages which need Unicode characters, so UTF-8 should be the standard in newLISP rather than the exception ;-)  But we also need to process binary information, so newLISP has some functions for processing UTF-8 and others for processing binary string buffers. The manual indicates which of the two each string processing function does.

All functions working on UTF-8 multi-byte characters are also marked with a "utf8" suffix in the reference section. And there is also a table collecting them all:

http://www.newlisp.org/downloads/newlisp_manual.html#utf8_capable

dexter

#3
I think if these "utf8" functions had names starting with "utf8_", there would be less of a mess.

I know people speak different languages, but today most operating systems support multiple languages natively, and quite well.

If I split a UTF-8 character into binary, which will be three bytes, and then print out these bytes, they will be shown as a UTF-8 string normally.



Just a thought :)
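
The byte-splitting idea in this post can be sketched as follows. This is only an illustration: the sample character and the variable name bytes are assumptions, and the printed result depends on the terminal interpreting the output as UTF-8.

```newlisp
; Sketch only: split one three-byte character into single-byte strings.
(set 'str "我")

; slice works on byte offsets, so this yields one string per byte
(set 'bytes (map (fn (i) (slice str i 1))
                 (sequence 0 (- (length str) 1))))

(length bytes)          ; number of single-byte pieces
(println (join bytes))  ; the bytes written back in order
(= (join bytes) str)    ; the rejoined bytes equal the original string
```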

Lutz

#4

> (set  'str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

; split into 8-bit bytes

> (unpack (dup "b" (length str)) str)
(230 136 145 232 131 189 229 144 158 228 184 139 231 142 187 231 146 131 232 128
 140 228 184 141 228 188 164 232 186 171 228 189 147 227 128 130)

; show a char for each 8-bit byte
> (map char (unpack (dup "b" (length str)) str))
("æ" "ˆ" "‘" "è" "ƒ" "½" "å" "" "ž" "ä" "¸" "‹" "ç" "Ž" "»" "ç"
 "’" "ƒ" "è" "€" "Œ" "ä" "¸" "" "ä" "¼" "¤" "è" "º" "«" "ä" "½"
 "“" "ã" "€" "‚")
>


or do this if you want a unicode number for each Chinese character:


> (explode str)
("我" "能" "吞" "下" "玻" "璃" "而" "不" "伤" "身" "体" "。")
> (map char (explode str))
(25105 33021 21534 19979 29627 29827 32780 19981 20260 36523 20307 12290)
>

winger

#5
> (println(join  ( 0 4 (explode "好汉不吃眼前亏"))) " ")
好汉
" "

This trick is very cool!
Welcome to a newlisper home :)

http://www.cngrayhat.org

Lutz

#6
You mean this:
> (println(join  ( 0 4 (explode "好汉不吃眼前亏"))) " ")
好汉不吃
" "

TedWalther

#7
Since UTF-8 should be the default: for every function that has a binary equivalent, does the manual mention what to use for binary mode instead of UTF-8 mode?

Instead of the utf8_ prefix proposal, I propose the opposite: a bin_ prefix for all the functions, for when you want them to work byte by byte instead of char by char.
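
As a rough idea of what such a bin_ function could look like in user code, here is a minimal sketch. The name bin_nth is hypothetical and not part of newLISP; it simply wraps slice, which works on byte offsets. A UTF-8 enabled build is assumed for the nth comparison.

```newlisp
; Hypothetical helper, not part of newLISP: a byte-wise counterpart of nth.
(define (bin_nth idx str)
  (slice str idx 1))

(set 'str "我能")
(bin_nth 0 str)   ; one-byte string holding the first byte
(bin_nth 3 str)   ; first byte of the second character
(nth 1 str)       ; for comparison: the whole second character
```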



Knowing Lutz, he probably has something even better and more fun up his sleeve.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.