Title: reading utf16 files?
Post by: cormullion on May 28, 2007, 02:56:29 PM

is it possible to read the contents of UTF16 files with newLISP? I'm just getting a couple of strange characters when I use read-file...

Title:
Post by: Lutz on May 28, 2007, 04:28:20 PM

Only UTF-8 encoded files are supported directly. You would have to read the file in 2-byte pieces, expand each piece to a 4-byte Unicode integer using 'unpack' with the "u" format, and then convert the result to UTF-8 using the newLISP 'utf8' function.

Lutz

ps: too busy at the moment on the GUI stuff, to give you a solution, remind me next week.
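
The recipe above (split into 2-byte units, widen each to a code point, re-encode as UTF-8) can be sketched as follows. This is Python rather than newLISP, purely to show the steps; the function name `utf16_units_to_utf8` and the `big_endian` parameter are made up for illustration. Like the recipe, it treats every 2-byte unit as a whole character, so it is only right for the Basic Multilingual Plane and will raise an error on surrogate pairs.

```python
import struct

def utf16_units_to_utf8(data: bytes, big_endian: bool = False) -> bytes:
    """Convert UTF-16 bytes to UTF-8 one 2-byte unit at a time (BMP only)."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    order = ">" if big_endian else "<"
    # Unpack every 2-byte unit as an unsigned 16-bit integer.
    units = struct.unpack(f"{order}{len(data) // 2}H", data)
    # Treat each unit as a code point and re-encode the result as UTF-8.
    return "".join(chr(u) for u in units).encode("utf-8")
```

The byte order has to be known in advance; nothing here looks for a byte order mark.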

Title:
Post by: Dmi on May 28, 2007, 10:52:35 PM

If conversion would help (to UTF-8 or to some national encoding), you can use the iconv() libc call.

Look at http://en.feautec.pp.ru/store/libs/doc/iconv.lsp.html.

Under *nixes other than Linux, the path to libc may need correcting.

Under Win* I've seen an iconv.dll somewhere, but it had slightly different function names.

Title:
Post by: cormullion on May 28, 2007, 11:54:35 PM

thanks guys, i'll check these ideas out...!

Title:
Post by: cormullion on May 30, 2007, 01:34:48 PM

Yeah - I found a few MacOS X libraries, but they didn't seem to work:

libc.dylib

libiconv.dylib

libiconv.2.2.0.dylib

libiconv.2.dylib

libiconv.dylib

It seemed a bit easier to try this:

(exec "iconv -f -t " etc....)

But in the end, I used something else altogether, just to get it done. :-(

Thanks though...

Title:
Post by: Dmi on May 30, 2007, 02:07:06 PM

What does "man 3 iconv" show about linking and about the function specifications?

Using the "iconv" shell command is not a good idea, because it doesn't handle invalid characters: it just stops processing immediately.

On Linux I use "recode -f" for that.
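
For comparison, a conversion that substitutes a replacement character for bad input instead of stopping can be sketched in Python with the built-in UTF-16 codec and `errors="replace"`. The function name and the sample bytes are made up for illustration, and this is not a claim about what "recode -f" does internally.

```python
def utf16_to_utf8_lenient(data: bytes, encoding: str = "utf-16-le") -> bytes:
    """Re-encode UTF-16 bytes as UTF-8, replacing undecodable input with U+FFFD."""
    text = data.decode(encoding, errors="replace")
    return text.encode("utf-8")

# Made-up sample: "abc", a lone low surrogate (invalid on its own), then "d".
sample = "abc".encode("utf-16-le") + b"\x00\xdc" + "d".encode("utf-16-le")
converted = utf16_to_utf8_lenient(sample)
```

The valid text on both sides of the bad unit survives, and the bad unit itself becomes U+FFFD.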

Title:
Post by: m35 on June 04, 2007, 04:09:24 PM

This may not be the fastest approach, or even the most accurate, but it seemed to work in my tests.

Code:
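; Convert big-endian UTF-16 to UTF-8, one 2-byte piece at a time (BMP only).
(define (utf16->utf8 s)
	(join
		(map
			(fn (c)
				; swap the two bytes, pad to 4 bytes, add a 4-byte terminator
				(utf8 (append (reverse c) "\000\000\000\000\000\000"))
			)
			(find-all ".." s) ; split into 2-byte pieces
		)
	)
)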

And speaking of Unicode files, does anyone (Lutz ;) know how to open a file with Unicode characters in the path (on Windows)? I tried it using a utf8 string, but open just returned nil. Do I need to dig into the Win32 API on this one?

Edit: Much faster version (2x), and leaves it as unicode (for you to call utf8 if desired).

Code:
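; Convert UTF-16 to a 4-byte-per-character string (BMP only).
(define (utf16->utf32 s)
	(append 
		(join 
			(map 
				; (curry pack "u") ;identical speed 
				(fn (c) 
					(pack "u" c)
				)
				; one 16-bit integer per 2 input bytes
				(unpack (dup ">u" (>> (length s) 1)) s)
			)
			"\000\000" ; pads each 2-byte unit to 4 bytes
		)
		"\000\000\000\000\000\000" ; pads the last unit, then 4-byte terminator
	)
)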

Edit:

Note the ">u" may also be "<u", depending on whether it is BE or LE encoded.
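
When the file starts with a byte order mark, the choice between ">u" and "<u" can be made automatically. A minimal Python sketch of that check, using the BOM constants from the standard `codecs` module; the function name is hypothetical, and files without a BOM still need the byte order supplied by hand.

```python
import codecs

def detect_utf16_byte_order(data: bytes):
    """Return (encoding, payload) based on a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be", data[2:]
    # No BOM: the caller has to know or guess the byte order.
    return None, data
```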

Title:
Post by: Lutz on June 04, 2007, 04:32:08 PM

If I have a file-path name with strange character encoding in it, I try to open it using the string shown from a 'directory' statement. That may show you how the filename characters have to be translated.

Lutz

Title:
Post by: m35 on June 04, 2007, 05:10:16 PM

Thanks Lutz, I gave that a try but still didn't have any luck.

I have a file with the path

"F:\test\梶浦由記\file.txt"

I run the following (in the "test" directory) with the following result.

Code:
F:\test>newlispw -e "(directory)"
("." ".." "????")

(note: newlispw = UTF8 enabled newlisp)

Hoping it's just a console limitation, I also run this

Code:
F:\test>newlispw -e "(write-file {dir.txt} ((directory) 3))"
4

Opening the "dir.txt" file, again all I see is ????

Finally, trying to read the file

Code:
F:\test>newlispw -e "(read-file {????\file.txt})"
nil

F:\test>newlispw -e "(open {????\file.txt} {r})"
nil

Title:
Post by: Lutz on June 04, 2007, 05:33:46 PM

It works on MacOS X:

Code:
newLISP v.9.1.7 on OSX UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
????"\230\162\182\230\181\166\231\148\177\232\168\152"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." ".DS_Store" "\230\162\182\230\181\166\231\148\177\232\168\152")
> !ls
????????????
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

I don't know what's different on Win32.

Lutz

Title:
Post by: Lutz on June 04, 2007, 05:35:05 PM

... before posting I saw the Chinese characters in the post/edit box of the browser, and also in the terminal window, but after posting they got ???? (in the first 'print' statement)

Lutz

Title:
Post by: Lutz on June 04, 2007, 05:43:00 PM

... the thing is not to 'print', but to get the unprinted string to work with. In newLISP you see the raw string, with the UTF-8 bytes shown as numbers, in the return values. I guess if you do exactly what I did on MacOS X, it will work for you on Win32 too. What it proves is that both OSs seem to encode filenames in UTF-8.

Lutz
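
The run of digits in the transcript can be read as twelve three-digit decimal byte values. A quick check, in Python rather than newLISP and only as an illustration, that those bytes are the UTF-8 encoding of the file name from the earlier post:

```python
digits = "230162182230181166231148177232168152"

# Split the digit run into groups of three and read each as one byte value.
raw = bytes(int(digits[i:i + 3]) for i in range(0, len(digits), 3))

# Decode the twelve bytes as UTF-8: four characters of three bytes each.
name = raw.decode("utf-8")
print(len(raw), name)
```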

Title:
Post by: m35 on June 04, 2007, 06:05:06 PM

Wow, things look interesting with that same code on Windows

Code:
newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
µó╢µ╡ªτö▒Φ¿ÿ"µó╢µ╡ªτö▒Φ¿ÿ"
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "µó╢µ╡ªτö▒Φ¿ÿ")
> !dir
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\test2

06/04/2007  05:45 PM    <DIR>          .
06/04/2007  05:45 PM    <DIR>          ..
06/04/2007  05:45 PM                13 æ¢¶æµ▌ç"±è"~
               1 File(s)             13 bytes
               2 Dir(s)      12,067,328 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

I am left with a file named

Code:
æ¢¶æµ¦ç"±è¨˜

in the directory.

Title:
Post by: Lutz on June 04, 2007, 07:11:52 PM

I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.

What you would need on Windows is a cmd.exe which does UTF-8

Lutz

Title: Windows and UTF-8
Post by: jp on June 05, 2007, 09:05:13 PM

Quote from: "Lutz"
I believe notepad.exe has a UTF-8 option and you could paste those characters into it to have the Chinese chars back.

What you would need on Windows is a cmd.exe which does UTF-8

Lutz

Actually, on Win2k and above you can set the command line to UTF-8 simply by changing the code page with the command chcp 65001 before running your command. The only caveat: make sure the command prompt's properties are not set to Raster Fonts.

Title:
Post by: m35 on June 06, 2007, 11:15:41 AM

Quote from: "jp"
set the code page with the following command, chcp 65001

Thanks jp! I wasn't aware of that one.

Now here is that same process after changing the code page.

Code:

F:\temp>chcp 65001
Active code page: 65001

...

newLISP v.9.1.1 on Win32 UTF-8, execute 'newlisp -h' for more info.

> (print "\230\162\182\230\181\166\231\148\177\232\168\152")
梶浦由記""
> (write-file "\230\162\182\230\181\166\231\148\177\232\168\152" "Hello Unicode")
13
> (directory)
("." ".." "")
> !dir /w
 Volume in drive F has no label.
 Volume Serial Number is C458-D3A7

 Directory of F:\temp

[.]            [..]           æ¢¶æµ¦ç"±è¨˜
               1 File(s)             13 bytes
               2 Dir(s)      12,066,816 bytes free
> (read-file "\230\162\182\230\181\166\231\148\177\232\168\152")
"Hello Unicode"
>

Note that the 梶浦由記 appear as rectangles in the console (but I assume that's just because the Lucida Console font doesn't have those characters).

The behavior of that (directory) entry is interesting...

Code:
> (directory)
("." ".." "")
> (length ((directory) 2))
12
> (setq s ((directory) 2))
""
> s
""
> (length s)
12
> (source 's)
"(set 's \"\")\r\n\r\n"
> (print s)
梶浦由記""

Unfortunately I'm still left with the æ¢¶æµ¦ç"±è¨˜ file, and not the proper Unicode one.

Since I'm not having any luck, I went ahead and implemented UTF-16 versions of functions that refer to path names (using the Win32 API). I'll post them on the "newlisp for Win" board when I'm done.

Title: Windows and UTF-8
Post by: jp on June 06, 2007, 05:49:26 PM

Quote from: "m35"
Unfortunately I'm still left with the æ¢¶æµ¦ç"±è¨˜ file, and not the proper Unicode one.

Perhaps it is worth mentioning that on Win2k and above the internal representation is Unicode, UTF-16LE. While you can change the DOS code page arbitrarily, in Windows proper the internal character representation remains fixed.

Also, the name 梶浦由記 strikes me as a Japanese name (Kajiura Yuki) rather than a Chinese one. Either way, Windows will need its Chinese/Japanese fonts enabled in order to render those characters properly.
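
The garbled names earlier in the thread are consistent with UTF-8 bytes being read through single-byte code pages. The Python sketch below illustrates that reading; it is a guess at the mechanism, not something confirmed in the thread, and the two code pages (1252 for Windows itself, 437 for the console) are assumptions about m35's system.

```python
name = "梶浦由記"
utf8_bytes = name.encode("utf-8")   # the 12 bytes newLISP passes around

# The same bytes interpreted as single-byte code pages (assumed: 1252 and 437).
as_ansi = utf8_bytes.decode("cp1252")   # resembles the file name shown by Windows
as_oem = utf8_bytes.decode("cp437")     # resembles the console output before chcp

print(as_ansi)
print(as_oem)

# Nothing is lost: reversing the wrong decoding gives the original name back.
assert as_ansi.encode("cp1252").decode("utf-8") == name
```

If that is what happens, the bytes on disk are intact and only their interpretation differs, which would also explain why read-file with the same byte string still finds the file.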

Title:
Post by: m35 on June 07, 2007, 05:06:40 AM

Quote from: "jp"
Also the name 梶浦由記 strikes me more as being a Japanese name (Kajiura Yuki) rather than a Chinese.

Good eye jp. Read Japanese? Other languages?

ps I'm a big fan of Yuki Kajiura's work :)

Title:
Post by: jp on June 07, 2007, 04:54:04 PM

Quote from: "m35"
Good eye jp. Read Japanese? Other languages?

Pleased to oblige!

Yes indeed, I read Japanese. And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.

Title:
Post by: m35 on June 07, 2007, 08:23:39 PM

ご免なさい I know only a little Japanese because I work with Japanese people (and like あにめ ^_^). The カリフォニア typo is part 日本語 accent, and part Arnold Schwarzenegger accent （´∀｀）

Title:
Post by: newdep on June 08, 2007, 12:12:15 PM

Quote:
And I believe you are a Japanese native speaker since you inadvertently inverted the L for an R in your login summary.

Speaking about good eyes?? That must be a secret hint.. I was indeed wondering why he misspelled California... ;-) No offence btw... it just caught my eye too and I did not know there was perhaps a reason for it..

Title:
Post by: jp on June 08, 2007, 08:26:02 PM

Quote:
Speaking about good eyes?? That must be a secret hint..

Well, there is nothing too esoteric about it!

Japanese has no phonetic equivalent of the L and R consonants, but has one consonant that sits somewhere between those two sounds. Hence, even when they have known a place name perfectly well since childhood, Japanese speakers are often at a loss when writing names that contain L or R in English, because that distinction is missing from their phonetic register.

newLISP Fan Club

Forum => Anything else we might add? => Topic started by: cormullion on May 28, 2007, 02:56:29 PM