Force nth to not use utf8?

dexter · January 02, 2014, 09:04:29 PM

Well, newLISP supports utf8.

(nth index utf8str) will return a utf8 string

(slice utf8str 0 1) will return part of a utf8 string, let's say one byte

Why does nth work like that?

And how can I force nth to work without the utf8 behaviour, like slice does,

without re-compiling newLISP?

dexter · January 03, 2014, 02:08:53 AM

I re-compiled newLISP.

This is not the first time -DSUPPORT_UTF8 has made a mess for me.

I really don't think newLISP should support utf8 natively. It could rely on the system and leave utf-8 to the system.

Isn't that better?

Lutz · January 03, 2014, 07:35:41 AM

Most people in the world speak languages which need Unicode characters, so UTF-8 should be the standard in newLISP rather than the exception ;-) But we also need to process binary information, so newLISP has functions for processing UTF-8 and others for processing binary string buffers. The manual indicates, for each string processing function, which of the two it does.

All functions working on UTF-8 multi-byte characters are also marked with a "utf8" suffix in the reference section. And there is also a table collecting them all:

http://www.newlisp.org/downloads/newlisp_manual.html#utf8_capable
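
As a minimal sketch of that split, assuming a UTF-8 enabled newLISP build and UTF-8 encoded source text: nth addresses characters, while slice and unpack address bytes, so byte-wise access should not need a recompile. The variable name str is only illustrative.

```newlisp
; Assumes a UTF-8 enabled newLISP build; str is an illustrative name.
(set 'str "我能")       ; two characters, three bytes each in UTF-8

(length str)            ; byte count: 6
(utf8len str)           ; character count: 2 (only defined in UTF-8 builds)

(nth 0 str)             ; character-wise: the whole first character "我"
(slice str 0 3)         ; byte-wise: the first three bytes, which together form "我"
(slice str 0 1)         ; byte-wise: only the first byte, not valid UTF-8 by itself
(unpack "b" str)        ; the first byte as a number in a list: (230)
```

On a build without UTF-8 support, nth would be expected to address single bytes as well, and utf8len would not be available.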

dexter · January 09, 2014, 11:58:21 PM

I think if these "utf8" functions had names starting with "utf8_", there would be less of a mess.

I know people speak different languages,

but today most operating systems support multiple languages natively, and quite well.

If I split a utf8 string into binary, which will be, say, three bytes, and then print out these bytes, they will normally be shown as a utf8 string.

Just a thought :)

Lutz · January 10, 2014, 01:28:53 PM

Code:

> (set  'str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

; split into 8-bit bytes

> (unpack (dup "b" (length str)) str)
(230 136 145 232 131 189 229 144 158 228 184 139 231 142 187 231 146 131 232 128 
 140 228 184 141 228 188 164 232 186 171 228 189 147 227 128 130)

; show a char for each 8-bit byte
> (map char (unpack (dup "b" (length str)) str))
("æ" "" "" "è" "" "½" "å" "" "" "ä" "¸" "" "ç" "" "»" "ç" 
 "" "" "è" "" "" "ä" "¸" "" "ä" "¼" "¤" "è" "º" "«" "ä" "½" 
 "" "ã" "" "")
>

Or do this if you want a Unicode number for each Chinese character:

Code:
> (explode str)
("我" "能" "吞" "下" "玻" "璃" "而" "不" "伤" "身" "体" "。")
> (map char (explode str))
(25105 33021 21534 19979 29627 29827 32780 19981 20260 36523 20307 12290)
>
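
Going the other way, and assuming the same UTF-8 enabled build, pack can reassemble byte values into a string. The sketch below uses the first three byte values from the unpack output above. The name first-char is illustrative, and whether the result displays as a character depends on the terminal interpreting the bytes as UTF-8.

```newlisp
; Rebuild the first character from its three byte values (taken from the
; unpack output above). "b" packs one unsigned 8-bit byte per value.
(set 'first-char (pack "bbb" 230 136 145))

(length first-char)          ; 3 bytes
(= first-char "我")          ; true when this source text is UTF-8 encoded
(char first-char)            ; 25105 on a UTF-8 enabled build
```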

winger · February 08, 2014, 10:24:03 PM

Code:
> (println (join (0 4 (explode "好汉不吃眼前亏"))) " ")
好汉
" "

This trick is very cool!

Lutz · February 09, 2014, 07:17:56 AM

You mean this:

Code:
> (println (join (0 4 (explode "好汉不吃眼前亏"))) " ")
好汉不吃 
" "

TedWalther · February 09, 2014, 11:07:51 AM

Since utf8 should be the default: for every function that has a binary equivalent, does the manual mention what to use for binary mode instead of utf8 mode?

Instead of the utf8_ prefix proposal, I propose the opposite: a bin_ prefix for all the functions, for when you want them to work byte by byte instead of char by char.

Knowing Lutz, he probably has something even better and more fun up his sleeve.
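
To make the bin_ idea concrete, here is a rough sketch of what such wrappers could look like in user code. The names bin_nth and bin_explode are hypothetical and not part of newLISP; they only wrap slice and unpack, which work on bytes as shown earlier in the thread.

```newlisp
; Hypothetical helpers illustrating the bin_ idea; these names are NOT
; part of newLISP. They are thin wrappers over byte-oriented built-ins.

; Byte at position idx, returned as a one-byte string.
(define (bin_nth idx str)
  (slice str idx 1))

; All bytes of str as a list of numbers from 0 to 255.
(define (bin_explode str)
  (unpack (dup "b" (length str)) str))

(bin_nth 1 "abc")        ; "b"
(bin_explode "abc")      ; (97 98 99)
```

Because they only call byte-oriented built-ins, these wrappers should behave the same on builds with and without UTF-8 support.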
