How to do string like binary?

Started by dexter, November 16, 2011, 12:46:49 AM


dexter

I set a string with CJK characters like this:



(setq cn "中文abc")



which contains Chinese characters.

How can I cut this string into a binary (byte) array, like cn in C?

I need to putchar this string, but in newLISP, if I use slice like this:


> (char (slice cn 0 1))
16384
> (char (slice cn 1 1))
184
> (char (slice cn 2 1))
173
> (char (slice cn 3 1))
24576



I think these are not the right code values, right?

dexter

#1
DONE



TURN OFF UTF8 SUPPORT

---------------------------------------------





Turn off UTF-8 support in the makefile and rebuild newLISP without UTF-8.



You will see -DSUPPORT_UTF8 in

makefile_build

makefile_linuxLP64_utf8

....

I just deleted -DSUPPORT_UTF8.



Now, if you (setq cn "中文"), it will be:

> (setq cn "中文")
"\228\184\173\230\150\135"


Not 20013, which would cause a putchar (FCGI_putchar) error.

The right codes for 中文 are the ones above, 228..., like Lutz said.



:)

sunmountain

#2
Could you please tell the rest of us what exactly you did?

BTW, the correct codes should be:



中 20013

文 25991

a 97

b 98

c 99



(verified by Python 2.7.2).

There you have to explicitly mark a string as unicode via u'the string' (this changed in Python 3.x, where

all strings are unicode by default).
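
For anyone who wants to repeat the check, here is a minimal Python 3 sketch (no u prefix needed there). The variable names are only for illustration. It prints the Unicode code points listed above and, for comparison, the raw UTF-8 bytes of the same string:

```python
# Python 3: str is Unicode by default, so no u'...' prefix is required.
s = "中文abc"

# One integer per character (Unicode code point).
code_points = [ord(ch) for ch in s]
print(code_points)   # [20013, 25991, 97, 98, 99]

# One integer per byte of the UTF-8 encoding.
utf8_bytes = list(s.encode("utf-8"))
print(utf8_bytes)    # [228, 184, 173, 230, 150, 135, 97, 98, 99]
```

So 20013 and 228 184 173 are both "correct" for 中; the first is the character code, the second is its byte sequence.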



I'm asking because disabling unicode support while using unicode strings and then getting correct

results seems a bit strange.

Perhaps you could post the code you wrote.



Me wants to learn :-)

Lutz

#3
In UTF-8 versions of newLISP indexing on strings works on character rather than single byte boundaries. Although 'slice' slices binary, 'char' will try to convert to Unicode on UTF-8 versions of newLISP. Use 'unpack':


> (unpack (dup "b" (length cn)) cn)
(228 184 173 230 150 135 97 98 99)
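
The 16384 and 24576 in the first post are consistent with decoding only the lead byte of a three-byte UTF-8 sequence. That is an interpretation of the numbers, not something checked against the newLISP source. The Python sketch below just does the arithmetic; decode3 is a hypothetical helper, not a newLISP or Python library function:

```python
# Assumption: a 3-byte UTF-8 sequence is laid out as 1110xxxx 10yyyyyy 10zzzzzz.
def decode3(b0, b1=0x80, b2=0x80):
    """Combine the payload bits of a 3-byte UTF-8 sequence.

    Missing continuation bytes default to 0x80, which contributes zero bits.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

data = "中文abc".encode("utf-8")

full = decode3(data[0], data[1], data[2])  # all three bytes of the first character
lead_only_1 = decode3(data[0])             # only the lead byte 228
lead_only_2 = decode3(data[3])             # only the lead byte 230

print(full, lead_only_1, lead_only_2)      # 20013 16384 24576
```

The two lead bytes alone give exactly the values seen with (char (slice cn 0 1)) and (char (slice cn 3 1)), while 184 and 173 are continuation bytes reported as-is.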


In the manual, all functions working on UTF-8 character boundaries are marked with a utf8 after the red function name.



There is a list of all of these functions in this chapter:



http://www.newlisp.org/downloads/newlisp_manual.html#unicode_utf8



ps: run this to see how it works:


(set 'str "中文abc")
(println (unpack (dup "b" (length str)) str))
(println (explode str))
(dotimes (i (utf8len str))
    (print (str i) " -> ")
    (println (char (str i))))


gives you this output:


(228 184 173 230 150 135 97 98 99)
("中" "文" "a" "b" "c")
中 -> 20013
文 -> 25991
a -> 97
b -> 98
c -> 99