UTF considerations

Started by pjot, October 01, 2004, 03:42:50 PM

pjot

It appears that GTK widgets use UTF-8 encoded text by default. I thought I could use the newLISP 'utf8' function to convert (extended) ASCII to UTF-8, but this did not work the way I expected: the utf8 function always assumes its input is a 4-byte UCS encoded string.



What I was actually looking for was a function that converts a character to UTF-8.



I have written a small function myself that performs this task for a whole string, assuming byte values 0-255:



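(define (utf str)
  (set 't 0)
  (while (< t (length str))
    (begin
      (set 'x (nth t str))
      ; byte values above 127 need two bytes in UTF-8
      (if (> (char x) 127)
        (begin
          ; lead byte: 192 plus the top two bits of the value
          (set 'b1 (+ (/ (& (char x) 192) 64) 192))
          ; continuation byte: 128 plus the low six bits
          (set 'b2 (+ (& (char x) 63) 128))
          (set-nth t str (append (char b1) (char b2)))
          ; skip the extra byte that was just inserted
          (inc 't)
        )
      )
      (inc 't)
    )
  )
  str)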
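
To check the bit arithmetic by hand, here is a minimal sketch for a single byte. It assumes the input is ISO-8859-1 (Latin-1), where byte value 252 is "ü"; the names c, b1 and b2 are only for illustration.

```newlisp
; worked example for one byte, assuming Latin-1 input
(set 'c 252)                          ; "ü" in ISO-8859-1
(set 'b1 (+ 192 (/ (& c 192) 64)))    ; lead byte: 192 + top two bits
(set 'b2 (+ 128 (& c 63)))            ; continuation byte: 128 + low six bits
(println b1 " " b2)                   ; prints: 195 188
```

195 188 is C3 BC in hex, which is the two-byte UTF-8 sequence for "ü", so the formula gives the right result for values in the 128-255 range.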



This can probably be optimized, but a character-by-character conversion is still very slow. I wonder whether it would be convenient to have a built-in UTF-8 conversion command, like this:



(utf "Kein überraschung") -> "Kein überraschung"



How about that?