I set a string with CJK characters, like
(setq cn "中文abc")
which contains Chinese characters.
How can I cut this string into a byte array, like a char array in C?
I need to putchar this string byte by byte, but in newLISP,
if I use slice like this:
> (char (slice cn 0 1))
16384
> (char (slice cn 1 1))
184
> (char (slice cn 2 1))
173
> (char (slice cn 3 1))
24576
I think these are not the right code values, right?
DONE
TURN OFF UTF8 SUPPORT
---------------------------------------------
Turn off UTF-8 support in the makefile and
rebuild newLISP without UTF-8.
You will see -DSUPPORT_UTF8 in
makefile_build
makefile_linuxLP64_utf8
....
I just deleted -DSUPPORT_UTF8.
Now if I evaluate (setq cn "中文")
the result is:
> (setq cn "中文")
"\228\184\173\230\150\135"
Not 20013, or else putchar (FCGI_putchar) fails with an error.
The right codes for 中文 are the bytes above, 228 ...,
as Lutz said.
:)
Could you please tell the rest of us what exactly you did?
BTW, the correct codes should be:
中 20013
文 25991
a 97
b 98
c 99
(verified by Python 2.7.2).
There you have to explicitly mark a string as Unicode via u'the string' (this changed in Python 3.x, where
all strings are Unicode by default).
I'm asking because disabling Unicode support while using Unicode strings and then getting correct
results seems a bit strange.
Perhaps you could post the code you wrote.
Me wants to learn :-)
In UTF-8 versions of newLISP indexing on strings works on character rather than single byte boundaries. Although 'slice' slices binary, 'char' will try to convert to Unicode on UTF-8 versions of newLISP. Use 'unpack':
> (unpack (dup "b" (length cn)) cn)
(228 184 173 230 150 135 97 98 99)
In the manual all functions working on UTF-8 character boundaries are marked with a utf8 behind the red function name.
There is a list of all of these functions in this chapter:
http://www.newlisp.org/downloads/newlisp_manual.html#unicode_utf8
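For the original putchar use case, the 'unpack' call above can be wrapped in a small helper. This is only a sketch: string-bytes is a made-up name, not a newLISP built-in, and the example assumes the source is saved as UTF-8. It relies on 'length' counting bytes, as in the 'unpack' call above.

```newlisp
;; string-bytes is a hypothetical helper, not a newLISP built-in.
;; It returns the raw bytes of a string as a list of integers (0..255).
;; 'length' counts bytes, so one "b" format character is used per byte.
(define (string-bytes s)
  (unpack (dup "b" (length s)) s))

(set 'cn "中文abc")
(println (string-bytes cn))   ; (228 184 173 230 150 135 97 98 99)

;; Hand the bytes over one at a time. 'println' stands in for the byte
;; sink here; with an imported FCGI_putchar it would be (FCGI_putchar b).
(dolist (b (string-bytes cn))
  (println b))
```

Because the helper never calls 'char' or indexes the string, it should give the same byte list whether or not newLISP was built with UTF-8 support.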
ps: run this to see how it works:
(set 'str "中文abc")
(println (unpack (dup "b" (length str)) str))
(println (explode str))
(dotimes (i (utf8len str))
  (print (str i) " -> ")
  (println (char (str i))))
gives you this output:
(228 184 173 230 150 135 97 98 99)
("中" "文" "a" "b" "c")
中 -> 20013
文 -> 25991
a -> 97
b -> 98
c -> 99
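The two sets of numbers in this thread describe the same characters: 20013 and 25991 are Unicode code points, while 228 184 173 and 230 150 135 are the UTF-8 bytes that encode them. A minimal sketch of that mapping, assuming a code point in the three-byte range (U+0800 to U+FFFF); utf8-bytes-3 is a hypothetical name and other ranges are not handled:

```newlisp
;; utf8-bytes-3 is a hypothetical helper, not a newLISP built-in.
;; It encodes one code point from the three-byte UTF-8 range
;; (U+0800..U+FFFF) into its three bytes:
;;   1110xxxx 10xxxxxx 10xxxxxx
(define (utf8-bytes-3 cp)
  (list (| 0xE0 (>> cp 12))            ; lead byte: top 4 bits
        (| 0x80 (& (>> cp 6) 0x3F))    ; continuation: middle 6 bits
        (| 0x80 (& cp 0x3F))))         ; continuation: low 6 bits

(println (utf8-bytes-3 20013))   ; (228 184 173)
(println (utf8-bytes-3 25991))   ; (230 150 135)
```

This uses only integer arithmetic, so it does not depend on whether newLISP was built with UTF-8 support.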