Is it possible to read the contents of UTF-16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...
Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand each to a 4-byte Unicode integer using 'unpack' with the "u" format, and then convert the result to UTF-8 using the newLISP 'utf8' function.
Lutz
ps: too busy at the moment on the GUI stuff, to give you a solution, remind me next week.
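For readers who want the steps above spelled out, here is a minimal sketch in Python rather than newLISP. The function name utf16_to_utf8 and the little_endian parameter are made up for this illustration. It assumes the input has no byte order mark and contains no characters above U+FFFF, since surrogate pairs are not combined.

```python
import struct


def utf16_to_utf8(data: bytes, little_endian: bool = True) -> bytes:
    """Widen each 16-bit unit to a code point, then encode as UTF-8."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    # One unsigned 16-bit integer per 2-byte piece of the input.
    fmt = ("<" if little_endian else ">") + "H" * (len(data) // 2)
    units = struct.unpack(fmt, data)
    # Each unit is treated as a complete code point. A surrogate unit
    # (0xD800-0xDFFF) makes the final encode step raise an error.
    return "".join(chr(u) for u in units).encode("utf-8")
```

Passing little_endian=False handles big-endian input in the same way.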
If conversion would help you (to UTF-8 or to some national encoding), you can use the iconv() libc call.
Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.
Under *nixes other than Linux, the path to libc may need correction.
Under Win* I've seen an iconv.dll somewhere, but it had slightly different function names.
thanks guys, i'll check these ideas out...!
Yeah - I found a few MacOS X libraries, but they didn't seem to work:
libc.dylib
libiconv.dylib
libiconv.2.2.0.dylib
libiconv.2.dylib
libiconv.dylib
It seemed a bit easier to try this:
(exec "iconv -f -t " etc....)
But in the end, I used something else altogether, just to get it done. :-(
Thanks though...
What does "man 3 iconv" show about linking and about the function specifications?
Using the "iconv" shell command is not a good idea because it doesn't handle invalid characters; it just stops processing immediately.
On Linux I use "recode -f" for that.
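The difference between stopping at the first bad character and carrying on can be illustrated in Python, which offers both behaviours through the errors argument of bytes.decode. This is only an analogy for the two command-line tools, not a description of how they work internally, and the function name decode_utf16le is made up for the example.

```python
def decode_utf16le(data: bytes, lenient: bool = True) -> str:
    """Decode UTF-16LE, either replacing bad units or failing on them."""
    errors = "replace" if lenient else "strict"
    return data.decode("utf-16-le", errors=errors)


# "a", then a lone low surrogate (invalid on its own), then "b"
damaged = b"a\x00\x00\xdcb\x00"

# Lenient mode substitutes U+FFFD for the bad unit and keeps going.
print(repr(decode_utf16le(damaged)))

# Strict mode gives up at the first invalid unit.
try:
    decode_utf16le(damaged, lenient=False)
except UnicodeDecodeError as err:
    print("stopped:", err.reason)
```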
This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.
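(define (utf16->utf8 s)
  (join
    (map
      (fn (c)
        ; swap the two bytes, pad to a 4-byte character, add a 4-byte terminator
        (utf8 (append (reverse c) "\000\000\000\000\000\000"))
      )
      ; split the input into 2-byte pieces
      (find-all ".." s)
    )
  )
)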
And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?
Edit: Much faster version (2x), which leaves the result as UTF-32 (call utf8 on it if desired).
(define (utf16->utf32 s)
  (append
    (join
      (map
        ; (curry pack "u") ;identical speed
        (fn (c)
          (pack "u" c)
        )
        (unpack (dup ">u" (>> (length s) 1)) s)
      )
      "\000\000"
    )
    "\000\000\000\000\000\000"
  )
)
Edit:
Note the ">u" may also be "<u", depending on whether it is BE or LE encoded.
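When the file starts with a byte order mark, the choice between ">u" and "<u" can be made automatically. The Python sketch below shows the idea; the function name and its return convention are made up for this illustration, and it assumes the data is UTF-16 rather than UTF-32 (whose little-endian mark begins with the same two bytes).

```python
import codecs


def detect_utf16_byte_order(data: bytes):
    """Return (prefix, payload): '<' for LE, '>' for BE, None if no BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "<", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return ">", data[2:]
    # No mark present: the caller has to know or guess the byte order.
    return None, data
```

The returned prefix corresponds to the "<" or ">" used in the unpack format, and the payload is the data with the mark removed.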
If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.
Lutz
Thanks Lutz, I gave that a try but still didn't have any luck.
I have a file with the path
"F:\test\梶浦由記\file.txt"
I run the following (in the "test" directory) with the following result.
F:\test>newlispw -e "(directory)"
("." ".." "????")
(note: newlispw = UTF8 enabled newlisp)
Hoping it's just a console limitation, I also run this
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4
Opening the "dir.txt" file, again all I see is ????
Finally, trying to read the file
F:\test>newlispw -e "(read-file {????\file.txt})"
nil
F:\test>newlispw -e "(open {????\file.txt} {r})"
nil
nil
It works on MacOS X:
newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
I don't know what's different on Win32.
Lutz
... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they got ???? (in the first 'print' statement)
Lutz
... the thing is not to 'print', but to get the unprinted string to work with. In newLISP you see the raw string, where UTF-8 is shown with numbers, in the return values. I guess if you do exactly the same thing I did on MacOS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.
Lutz
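To see why the raw string is a run of numbers, note that the twelve decimal values in the session above are simply the UTF-8 bytes of the four characters, three bytes each. A short Python check, offered as an illustration only:

```python
# The twelve byte values from the raw string, in order.
raw = bytes([230, 162, 182, 230, 181, 166, 231, 148, 177, 232, 168, 152])

# Interpreting them as UTF-8 yields the four-character name.
name = raw.decode("utf-8")
print(name, len(raw), len(name))
```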
Wow, things look interesting with that same code on windows
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記"梶浦由記"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "梶浦由記")
> !dir
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\test2
06/04/2007 05:45 PM <DIR> .
06/04/2007 05:45 PM <DIR> ..
06/04/2007 05:45 PM 13 梶æµ▌ç"±è"~
1 File(s) 13 bytes
2 Dir(s) 12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
I am left with a file named
æ¢¶æµ¦ç”±è¨˜
in the directory.
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.
What you would need on Windows is a cmd.exe which does UTF-8
Lutz
Quote from: "Lutz"
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.
What you would need on Windows is a cmd.exe which does UTF-8
Lutz
Actually, for Win2k and above, to set the command line to UTF-8 you simply have to set the code page with the following command, chcp 65001, before running your command. The only caveat is: make sure the command prompt's properties are not set to Raster Fonts.
Quote from: "jp"
set the code page with the following command, chcp 65001
Thanks jp! I wasn't aware of that one.
Now here is that same process after changing the code page.
F:\temp>chcp 65001
Active code page: 65001
...
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.
> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
Volume in drive F has no label.
Volume Serial Number is C458-D3A7
Directory of F:\temp
[.] [..] 梶浦ç"±è¨˜
1 File(s) 13 bytes
2 Dir(s) 12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
Note that the 梶浦由記 appear as rectangles in the console (but I assume that's just because the Lucida Console font doesn't have those characters).
The behavior of that (directory) entry is interesting...
> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's \"\")\r\n\r\n"
> (print s)
梶浦由記""
Unfortunately I'm still left with the æ¢¶æµ¦ç”±è¨˜ file, and not the proper Unicode one.
Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.
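One possible explanation for the garbled name, offered as a guess rather than something confirmed in this thread: if the UTF-8 bytes of the file name reach a Windows call that interprets them in the ANSI code page, each byte turns into a character of its own. The Python sketch below assumes that code page is Windows-1252; the function name is made up for the illustration.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


garbled = misread_utf8_as_cp1252("梶浦由記")
print(garbled, len(garbled))  # twelve single-byte characters

# The damage is reversible as long as every byte survived the round trip.
restored = garbled.encode("cp1252").decode("utf-8")
print(restored)
```

If this guess is right, it would also fit the observation that the directory entry has length 12 while still reading back correctly through the same byte string.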
Quote from: "m35"
Unfortunately I'm still left with the æ¢¶æµ¦ç”±è¨˜ file, and not the proper Unicode one.
Perhaps it is worth mentioning that on Win2k and above, names are represented internally as Unicode (UTF-16LE). You can change the DOS code page of the console arbitrarily, but in Windows proper the internal character representation remains fixed.
Also, the name 梶浦由記 strikes me as a Japanese name (Kajiura Yuki) rather than a Chinese one. Nonetheless, Windows will need to have its Chinese/Japanese fonts enabled in order to render those characters properly.
Quote from: "jp"
Also, the name 梶浦由記 strikes me as a Japanese name (Kajiura Yuki) rather than a Chinese one.
Good eye jp. Read Japanese? Other languages?
ps I'm a big fan of Yuki Kajiura's work (//http) :)
Quote from: "m35"
Good eye jp. Read Japanese? Other languages?
Pleased to oblige!
Yes indeed, I read Japanese. And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.
ご免なさい I know only a little Japanese because I work with Japanese people (and like あにめ ^_^). The カリフォニア typo is part 日本語 accent, and part Arnold Schwarzenegger accent (´∀`)
Quote
And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.
Speaking about good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled California... ;-) No offense btw... it just caught my eye too, and I did not know there was perhaps a reason for it..
Quote
Speaking about good eyes?? That must be a secret hint..
Well, there is nothing too esoteric about it!
Japanese has no phonetic equivalent to the L and R consonants, but it has a consonant that sits somewhere between those two sounds. Hence, even when they have known all the common place names perfectly well since childhood, Japanese speakers are often at a loss when writing down in English the names containing L and R that they know with certainty in Japanese, because that phonetic distinction is missing.