reading utf16 files?

Started by cormullion, May 28, 2007, 02:56:29 PM

cormullion

Is it possible to read the contents of UTF-16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...

Lutz

#1
Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand each piece to a 4-byte Unicode integer using 'unpack' with the "u" format, and then convert the result to UTF-8 using the newLISP 'utf8' function.



Lutz



PS: I'm too busy with the GUI stuff at the moment to give you a solution; remind me next week.
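
A minimal sketch of the steps described above, under several assumptions: a UTF-8 enabled newLISP, little-endian input without a byte order mark, and only characters from the Basic Multilingual Plane (no surrogate pairs). The function name utf16le->utf8 is made up for illustration. Instead of building a 4-byte string for 'utf8', it uses 'char', which on UTF-8 enabled builds returns the UTF-8 string for a code point.

```newlisp
; Hypothetical helper, not part of newLISP.
; s holds raw UTF-16LE bytes, for example the result of read-file.
(define (utf16le->utf8 s)
  ; one "<u" per 16-bit unit: little-endian unsigned 16-bit integers
  (let (units (unpack (dup "<u" (/ (length s) 2)) s))
    ; on a UTF-8 enabled newLISP, (char n) is the UTF-8 string for code point n
    (join (map char units))))

; usage: the four bytes below are "hi" encoded as UTF-16LE
(utf16le->utf8 "h\000i\000")
```

For big-endian data the format would be ">u" instead of "<u".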

Dmi

#2
If conversion would help you (to UTF-8 or to some national encoding), you can use the iconv() libc call.

Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.

Under *nixes other than Linux, the path to libc may need correcting.

Under Win* I have seen an iconv.dll somewhere, but it had slightly different function names.
WBR, Dmi

cormullion

#3
thanks guys, i'll check these ideas out...!

cormullion

#4
Yeah - I found a few MacOS X libraries, but they didn't seem to work:



libc.dylib

libiconv.dylib

libiconv.2.2.0.dylib

libiconv.2.dylib

libiconv.dylib



It seemed a bit easier to try this:



(exec "iconv -f -t " etc....)



But in the end, I used something else altogether, just to get it done. :-(



Thanks though...
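
For reference, the exec route mentioned above might look roughly like this. It is only a sketch: it assumes an iconv command-line tool on the PATH that accepts the encoding names UTF-16 and UTF-8, and a Unix-style shell. The function name iconv-file and the file name notes.txt are made up for illustration.

```newlisp
; Hypothetical helper: run the iconv command-line tool and collect its output.
; 'exec' returns the command's standard output as a list of lines.
(define (iconv-file path from to)
  (join (exec (string "iconv -f " from " -t " to " '" path "'")) "\n"))

; usage: read a UTF-16 file as a single UTF-8 string
(iconv-file "notes.txt" "UTF-16" "UTF-8")
```

The single quotes only protect spaces in the path; a path that itself contains a single quote would need more care.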

Dmi

#5
What does "man 3 iconv" show about linking and about the function specifications?

Using the "iconv" shell command is not a good idea because it doesn't handle invalid characters; it just stops processing immediately.

On Linux I use "recode -f" for that.
WBR, Dmi

m35

#6
This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.


(define (utf16->utf8 s)
  (join
    (map
      (fn (c)
        ; 2 swapped bytes + 2 zero bytes = one 4-byte character, then a 4-byte terminator
        (utf8 (append (reverse c) "\000\000\000\000\000\000"))
      )
      (find-all ".." s)
    )
  )
)


And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?



Edit: Much faster version (2x), and leaves it as unicode (for you to call utf8 if desired).
(define (utf16->utf32 s)
  (append
    (join
      (map
        ; (curry pack "u") ;identical speed
        (fn (c)
          (pack "u" c)
        )
        (unpack (dup ">u" (>> (length s) 1)) s)
      )
      "\000\000"
    )
    "\000\000\000\000\000\000"
  )
)


Edit:

Note the ">u" may also be "<u", depending on whether it is BE or LE encoded.
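
One way to choose between the two is to look for a byte order mark (BOM) at the start of the data. The following is a sketch only: the helper name utf16-byte-order is hypothetical, and data without a BOM still has to be guessed or specified by hand.

```newlisp
; Hypothetical helper: return the unpack format matching the BOM, or nil.
; FF FE (255 254) marks little-endian, FE FF (254 255) marks big-endian.
(define (utf16-byte-order s)
  (if (< (length s) 2)
      nil
      (let (bom (unpack "bb" s))    ; first two bytes as unsigned integers
        (cond
          ((= bom '(255 254)) "<u")
          ((= bom '(254 255)) ">u")
          (true nil)))))

; usage: a little-endian BOM followed by "hi"
(utf16-byte-order "\255\254h\000i\000")
```

When a BOM is found, its two bytes should be skipped before the rest of the data is unpacked.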

Lutz

#7
If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.



Lutz

m35

#8
Thanks Lutz, I gave that a try but still didn't have any luck.



I have a file with the path

"F:\test\梶浦由記\file.txt"

I ran the following (in the "test" directory) with the following result.
F:\test>newlispw -e "(directory)"
("." ".." "????")
(note: newlispw = UTF8 enabled newlisp)

Hoping it's just a console limitation, I also ran this
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4
Opening the "dir.txt" file, again all I see is ????

Finally, trying to read the file
F:\test>newlispw -e "(read-file {????\file.txt})"
nil

F:\test>newlispw -e "(open {????\file.txt} {r})"
nil

Lutz

#9
It works on MacOS X:


newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>


I don't know what's different on Win32.



Lutz

Lutz

#10
... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they turned into ???? (in the first 'print' statement)



Lutz

Lutz

#11
... the point is not to 'print', but to get the unprinted string to work with. In newLISP you see the raw string in the return values, where the UTF-8 bytes are shown as numbers. I guess if you do exactly what I did on Mac OS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.



Lutz
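
To see where the numbers in such a string come from, one can look at the bytes of a single character. A small sketch, assuming a UTF-8 enabled newLISP; the variable name kaji is made up. U+68B6 (decimal 26806) is the first character of the file name, and its UTF-8 encoding is the byte sequence 230 162 182, which newLISP displays as "\230\162\182".

```newlisp
; Assumes a UTF-8 enabled newLISP, where (char n) returns the UTF-8 string
; for Unicode code point n. 26806 is U+68B6.
(set 'kaji (char 26806))

(length kaji)        ; number of bytes in the UTF-8 encoding
(unpack "bbb" kaji)  ; the three bytes as unsigned integers
```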

m35

#12
Wow, things look interesting with that same code on Windows:
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記"梶浦由記"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "梶浦由記")
> !dir
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\test2

06/04/2007  05:45 PM    <DIR>          .
06/04/2007  05:45 PM    <DIR>          ..
06/04/2007  05:45 PM                13 梶æµ▌ç"±è"~
               1 File(s)             13 bytes
               2 Dir(s)      12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>


I am left with a file named
梶浦ç"±è¨˜ in the directory.

Lutz

#13
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.



What you would need on Windows is a cmd.exe which does UTF-8.



Lutz

jp

#14
Quote from: "Lutz"
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.



What you would need on Windows is a cmd.exe which does UTF-8.



Lutz


Actually, on Win2k and above, to set the command line to UTF-8 you simply set the code page with the command chcp 65001 before executing your command. The only caveat: make sure the command prompt's properties are not set to Raster Fonts.