newLISP Fan Club

Forum => Anything else we might add? => Topic started by: cormullion on May 28, 2007, 02:56:29 PM

Title: reading utf16 files?
Post by: cormullion on May 28, 2007, 02:56:29 PM
Is it possible to read the contents of UTF-16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...
Title:
Post by: Lutz on May 28, 2007, 04:28:20 PM
Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand those to 4-byte Unicode integers using 'unpack' and the "u" format, and then convert them to UTF-8 using the newLISP 'utf8' function.



Lutz



PS: I'm too busy with the GUI stuff at the moment to give you a solution; remind me next week.
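
The steps described above can be sketched as follows. This is Python rather than newLISP, and only an illustration under stated assumptions: the function name utf16_to_utf8 and the big_endian flag are hypothetical, the input is assumed to carry no byte-order mark, and surrogate pairs (which a plain 2-byte to 4-byte expansion would miss) are combined explicitly.

```python
import struct


def utf16_to_utf8(data: bytes, big_endian: bool = False) -> bytes:
    """Convert BOM-less UTF-16 bytes to UTF-8, one 2-byte unit at a time."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must contain an even number of bytes")

    # Step 1: read the data in 2-byte pieces as unsigned 16-bit integers.
    count = len(data) // 2
    units = struct.unpack((">" if big_endian else "<") + "H" * count, data)

    # Step 2: expand the units to full Unicode code points.
    code_points = []
    i = 0
    while i < count:
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < count and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # A high/low surrogate pair encodes one code point above U+FFFF.
            unit = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 1
        code_points.append(unit)
        i += 1

    # Step 3: encode the code points as UTF-8.
    # (An unpaired surrogate makes this step raise UnicodeEncodeError.)
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


if __name__ == "__main__":
    sample = "梶浦由記".encode("utf-16-le")
    print(list(utf16_to_utf8(sample)))
```

Printing the result as a list of integers makes it easy to compare with the decimal byte values newLISP shows for non-ASCII strings.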
Title:
Post by: Dmi on May 28, 2007, 10:52:35 PM
If conversion would help you (to UTF-8 or to some national encoding), you can use the iconv() libc call.

Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.

Under Unixes other than Linux, the path to libc may need correcting.

Under Windows I have seen an iconv.dll somewhere, but it had slightly different function names.
Title:
Post by: cormullion on May 28, 2007, 11:54:35 PM
thanks guys, i'll check these ideas out...!
Title:
Post by: cormullion on May 30, 2007, 01:34:48 PM
Yeah - I found a few MacOS X libraries, but they didn't seem to work:



libc.dylib

libiconv.dylib

libiconv.2.2.0.dylib

libiconv.2.dylib

libiconv.dylib



It seemed a bit easier to try this:



(exec "iconv -f -t " etc....)



But in the end, I used something else altogether, just to get it done. :-(



Thanks though...
Title:
Post by: Dmi on May 30, 2007, 02:07:06 PM
What does "man 3 iconv" show about linking and about the function specifications?

Using the "iconv" shell command is not a good idea, because it doesn't handle invalid characters: it just stops processing immediately.

On Linux I use "recode -f" for that.
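
The difference between stopping at the first invalid sequence and carrying on past it can be illustrated in Python (not newLISP, and not iconv or recode themselves). This is only a sketch: the sample bytes are made up, the function names are hypothetical, and errors="replace" is Python's own mechanism, shown as an analogy to a forgiving converter rather than as a description of what "recode -f" does internally.

```python
# Valid UTF-16LE for "hi", followed by one stray byte (a truncated code unit).
damaged = "hi".encode("utf-16-le") + b"\x00"


def decode_strict(data: bytes) -> str:
    """Stop at the first invalid sequence by raising UnicodeDecodeError."""
    return data.decode("utf-16-le")  # errors="strict" is the default


def decode_lenient(data: bytes) -> str:
    """Keep going: invalid input becomes U+FFFD, the replacement character."""
    return data.decode("utf-16-le", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(damaged)
    except UnicodeDecodeError as err:
        print("strict decoding stopped:", err.reason)
    print(decode_lenient(damaged).encode("utf-8"))
```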
Title:
Post by: m35 on June 04, 2007, 04:09:24 PM
This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.


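(define (utf16->utf8 s)
  ; split s into 2-byte pieces, byte-swap each one, pad it with
  ; null bytes and convert it with utf8
  (join
    (map
      (fn (c)
        (utf8 (append (reverse c) "\000\000\000\000"))
      )
      (find-all ".." s)
    )
  )
)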


And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?



Edit: Much faster version (2x), and leaves it as unicode (for you to call utf8 if desired).
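(define (utf16->utf32 s)
  (append
    (join
      (map
        ; (curry pack "u") ;identical speed
        (fn (c)
          (pack "u" c)
        )
        ; unpack every 2-byte unit as a big-endian unsigned 16-bit integer
        (unpack (dup ">u" (>> (length s) 1)) s)
      )
      ; two null bytes are inserted between the repacked units
      "\000\000"
    )
    "\000\000\000\000"
  )
)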


Edit:

Note that the ">u" may need to be "<u", depending on whether the data is big-endian (BE) or little-endian (LE) encoded.
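
Whether ">u" or "<u" is the right choice can often be decided from the byte-order mark (BOM) at the start of the file, when one is present. A small sketch in Python (not newLISP); the function name split_bom and the little-endian default are assumptions made for illustration, and files without a BOM still need the byte order supplied by other means.

```python
import codecs


def split_bom(data: bytes, default: str = "<"):
    """Return (prefix, payload): prefix is "<" for LE data, ">" for BE data.

    The prefix follows the same convention as the "<u" / ">u" formats.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "<", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return ">", data[2:]
    return default, data  # no BOM: fall back to the caller's guess


if __name__ == "__main__":
    raw = codecs.BOM_UTF16_BE + "newLISP".encode("utf-16-be")
    prefix, payload = split_bom(raw)
    codec = "utf-16-le" if prefix == "<" else "utf-16-be"
    print(prefix, payload.decode(codec))
```

Stripping the BOM before conversion also keeps it from showing up as a stray character at the start of the converted text.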
Title:
Post by: Lutz on June 04, 2007, 04:32:08 PM
If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.



Lutz
Title:
Post by: m35 on June 04, 2007, 05:10:16 PM
Thanks Lutz, I gave that a try but still didn't have any luck.



I have a file with the path

"F:\test\梶浦由記\file.txt"



I run the following (in the "test" directory) with the following result.
F:\test>newlispw -e "(directory)"
("." ".." "????")
(note: newlispw = UTF8 enabled newlisp)





Hoping it's just a console limitation, I also run this
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4
Opening the "dir.txt" file, again all I see is ????





Finally, trying to read the file
F:\test>newlispw -e "(read-file {????\file.txt})"
nil

F:\test>newlispw -e "(open {????\file.txt} {r})"
nil
Title:
Post by: Lutz on June 04, 2007, 05:33:46 PM
It works on MacOS X:


newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>


I don't know what's different on Win32.



Lutz
Title:
Post by: Lutz on June 04, 2007, 05:35:05 PM
... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they got ???? (in the first 'print' statement)



Lutz
Title:
Post by: Lutz on June 04, 2007, 05:43:00 PM
... the trick is not to 'print' the string, but to work with the unprinted one. In newLISP's return values you see the raw string, with the UTF-8 bytes shown as numbers. I guess that if you do exactly what I did on Mac OS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.



Lutz
Title:
Post by: m35 on June 04, 2007, 06:05:06 PM
Wow, things look interesting with that same code on Windows:
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記"梶浦由記"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "梶浦由記")
> !dir
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\test2

06/04/2007  05:45 PM    <DIR>          .
06/04/2007  05:45 PM    <DIR>          ..
06/04/2007  05:45 PM                13 梶æµ▌ç"±è"~
               1 File(s)             13 bytes
               2 Dir(s)      12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>


I am left with a file named
梶浦ç"±è¨˜ in the directory.
Title:
Post by: Lutz on June 04, 2007, 07:11:52 PM
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.



What you would need on Windows is a cmd.exe which does UTF-8.



Lutz
Title: Windows and UTF-8
Post by: jp on June 05, 2007, 09:05:13 PM
Quote from: "Lutz"I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.



What you would need on Windows is a cmd.exe which does UTF-8.



Lutz


Actually, on Win2k and above, to set the command line to UTF-8 you simply have to set the code page with the command chcp 65001 before executing your command. The only caveat: make sure the command prompt's properties are not set to Raster Fonts.
Title:
Post by: m35 on June 06, 2007, 11:15:41 AM
Quote from: "jp"set the code page with the following command, chcp 65001

Thanks jp! I wasn't aware of that one.



Now here is that same process after changing the code page.

F:\temp>chcp 65001
Active code page: 65001

...

newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\temp

[.]            [..]           梶浦ç"±è¨˜
               1 File(s)             13 bytes
               2 Dir(s)      12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>
Note that the 梶浦由記 appear as rectangles in the console (but I assume that's just because the Lucida Console font doesn't have those characters).



The behavior of that (directory) entry is interesting...
> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's \"\")\r\n\r\n"
> (print s)
梶浦由記""


Unfortunately I'm still left with the 梶浦ç"±è¨˜ file, and not the proper Unicode one.



Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.
Title: Windows and UTF-8
Post by: jp on June 06, 2007, 05:49:26 PM
Quote from: "m35"Unfortunately I'm still left with the 梶浦ç"±è¨˜ file, and not the proper Unicode one.


Perhaps it is worth mentioning that on Win2k and above the internal representation is Unicode UTF-16LE. Although one can arbitrarily change the DOS code page, in Windows proper the internal character representation remains fixed.

Also, the name 梶浦由記 strikes me as a Japanese name (Kajiura Yuki) rather than a Chinese one. Nonetheless, Windows will need to have its Chinese/Japanese fonts enabled in order to render those characters properly.
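
One plausible explanation for the garbled 梶浦ç"±è¨˜ file name (an assumption; nothing in the thread confirms it) is that the 12 UTF-8 bytes were handed to a byte-oriented Windows file call and interpreted in the ANSI code page instead of being converted to UTF-16LE first. The Python sketch below (not newLISP) reproduces that effect, assuming the ANSI code page is Windows-1252; the function names are hypothetical.

```python
NAME = "梶浦由記"


def as_misread_ansi(name: str, ansi_codec: str = "cp1252") -> str:
    """Show how the UTF-8 bytes of name look when read as single-byte text."""
    return name.encode("utf-8").decode(ansi_codec)


def as_utf16le(name: str) -> bytes:
    """The UTF-16LE byte sequence for name (two bytes per character here)."""
    return name.encode("utf-16-le")


if __name__ == "__main__":
    garbled = as_misread_ansi(NAME)
    print(len(garbled), "characters:", garbled)
    print(as_utf16le(NAME).hex())
```

Under that assumption, each of the 12 UTF-8 bytes turns into one single-byte character, so a four-character name becomes a twelve-character one.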
Title:
Post by: m35 on June 07, 2007, 05:06:40 AM
Quote from: "jp"Also the name 梶浦由記 strikes me more as being a Japanese name (Kajiura Yuki) rather than a Chinese.


Good eye jp. Read Japanese? Other languages?



ps I'm a big fan of Yuki Kajiura's work (//http) :)
Title:
Post by: jp on June 07, 2007, 04:54:04 PM
Quote from: "m35"Good eye jp. Read Japanese? Other languages?

Pleased to oblige!

Yes indeed, I read Japanese. And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.
Title:
Post by: m35 on June 07, 2007, 08:23:39 PM
ご免なさい I know only a little Japanese because I work with Japanese people (and like あにめ ^_^). The カリフォニア typo is part 日本語 accent, and part Arnold Schwarzenegger accent (´∀`)
Title:
Post by: newdep on June 08, 2007, 12:12:15 PM
QuoteAnd I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.


Speaking of good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled California... ;-) No offence btw... it just caught my eye too, and I did not know there was perhaps a reason for it..
Title:
Post by: jp on June 08, 2007, 08:26:02 PM
QuoteSpeaking about good eyes?? That must be a secret hint..

Well, there is nothing too esoteric about it!

Japanese has no phonetic equivalent of the L and R consonants, but it has a consonant that sits somewhere between those two sounds. Hence, even though they have known all the common place names perfectly well since childhood, Japanese speakers are often at a loss when writing down in English names containing L and R that they know with certainty in Japanese, because that phonetic distinction is missing.