newLISP Fan Club

Forum => newLISP in the real world => Topic started by: TedWalther on January 25, 2009, 09:52:27 PM

Title: read-file doesn't work with /dev/stdin
Post by: TedWalther on January 25, 2009, 09:52:27 PM
How can I make read-file work with /dev/stdin? When my application doesn't have a file specified by the user, I want to fall back to stdin to slurp my data in. The input is potentially UTF-8 data, and I will want to iterate over each character in the stream.



Writing to /dev/stdout works just fine.



Perhaps the simplest solution is to let read-file etc. accept file descriptors (integers) as well as file-name strings?



Example code:



; choose formats and files from the environment, with defaults
(setq from-format (sym (or (env "FROM") "utf8")))
(setq   to-format (sym (or (env "TO") "neo-paleo-hebrew")))
(setq output-file (or (env "SAVEAS") "/dev/stdout"))
(setq  input-file (or (env "OPEN")   "/dev/stdin"))

; copy the input character by character; this is where
; read-file fails when input-file is /dev/stdin
(dostring (c (read-file input-file))
   (append-file output-file (char c)))
Title:
Post by: Lutz on January 26, 2009, 04:37:52 AM
'read-file' can only be used to read from files.



You could use 'read-char' to read from stdin, using 0 as the device; it returns one byte at a time. Or use 'read-line' and then 'explode' the line into UTF-8 multi-byte characters. In both cases, processing does not start until a line feed is received.


#!/usr/bin/newlisp

(while (!= (setq ch (read-char 0)) (char "q"))
    (println "->" ch))

(exit)


Or use 'read-key', which will process immediately after each keystroke:


#!/usr/bin/newlisp

(while (!= (setq ch (read-key)) (char "q"))
    (println "->" ch))

(exit)
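
For the 'read-line' plus 'explode' route mentioned above, a minimal sketch (assuming a UTF-8 enabled build of newLISP, where 'explode' splits on character boundaries) might look like this:

#!/usr/bin/newlisp

; read stdin line by line, then split each line into
; UTF-8 characters; characters arrive only after a full line
(while (setq line (read-line 0))
    (dolist (c (explode line))
        (println "->" c)))

(exit)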
Title:
Post by: TedWalther on January 30, 2009, 11:07:14 PM
Lutz, could it be time to rename read-char to read-byte, and make read-char do the UTF-8 thing?



Would it be hard to slurp in fd 0 for read-file?  I mean, generally, I cat a file into the program, and it is no different from reading a regular file.  The only difference between read-char and read-file is that one takes an already opened file descriptor, and the other wants to do the open itself.  Once read-file had accepted all available input from fd 0, it would make sense for it to close it.
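
In the meantime, something like this hypothetical slurp-stdin helper (the name and approach are illustrative, not built in) could emulate read-file on fd 0 by accumulating lines:

; collect everything arriving on fd 0 into one string;
; read-line strips the newline, so it is re-appended (a trailing
; newline may be added even if the input had none)
(define (slurp-stdin)
    (let (buf "" line nil)
        (while (setq line (read-line 0))
            (setq buf (append buf line "\n")))
        buf))

; usage, iterating characters as in the original example:
; (dostring (c (slurp-stdin)) (println (char c)))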
Title:
Post by: Lutz on January 31, 2009, 05:35:17 AM
I see 'read-char', 'read-buffer' and 'read-line' as the low-level API, using file descriptors and dealing with octets, and 'read-file', 'write-file' and 'append-file' as the high-level API, dealing with file names.
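
A side-by-side sketch of the two levels (the file name "data.txt" is illustrative):

; low-level: explicit descriptor, octets into a buffer
(setq fd (open "data.txt" "read"))
(read-buffer fd buf 4096)   ; fills buf with up to 4096 octets
(close fd)

; high-level: file name in, whole contents out
(setq contents (read-file "data.txt"))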
Title:
Post by: Lutz on February 01, 2009, 09:43:30 AM
A future version of newLISP will have a 'read-utf8' which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character boundaries in UTF-8 enabled versions of newLISP.
Title: utf8, pack, char, and iso8859-8 (hebrew)
Post by: TedWalther on February 02, 2009, 12:31:19 PM
Quote from: "Lutz"A future version of newLISP will have a 'read-utf8' which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character boundaries in UTF-8 enabled versions of newLISP.


Thank you for the explanation.  I'm meditating now on how the whole UTF-8 vs. byte-oriented paradigm works in newLISP.  I should have spent more time meditating on the pack and unpack functions.



Does (char ...) use (pack ...) internally?  Should it?  I recently had some problems when trying to output some Hebrew characters in codepage ISO-8859-8.  I think (char ...) was interpreting them as UTF-8 characters.
Title:
Post by: Lutz on February 02, 2009, 04:23:50 PM
'char' does not use 'pack' or 'unpack' internally, but 'char' is UTF-8 sensitive when using the UTF-8 enabled version of newLISP. If a Windows installation is working with ISO-8859-x character sets, then one should not use the UTF-8 enabled version for Hebrew and most Eastern European (Cyrillic) character sets.



When the UTF-8 enabled version of newLISP sees Hebrew ISO-8859-8 characters (greater than 127), it sees them not as Hebrew but as something else. All characters below 128 will be interpreted as ASCII in both ISO-8859-x and UTF-8/Unicode. All characters above 127 are one byte wide in ISO-8859-x character sets, but could initiate different one- or multi-byte-wide characters in UTF-8.
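
A small illustration of the difference (results shown assume the UTF-8 enabled build; treat it as a sketch):

; aleph is code point 1488 (U+05D0) in Unicode, but the
; single byte 224 (0xE0) in ISO-8859-8
(char 1488)     ; UTF-8 build: the two octets 0xD7 0x90 (aleph)
(char 224)      ; UTF-8 build: U+00E0 "à", NOT aleph
(pack "b" 224)  ; one raw 0xE0 octet: aleph under ISO-8859-8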



Here is a list of all functions sensitive to UTF-8 characters when using the UTF-8 enabled version: http://www.newlisp.org/newlisp_manual.html#utf8_capable



PS: by default, the Windows installer ships the non-UTF-8 version, and the Mac OS X installer the UTF-8 version.
Title:
Post by: newdep on February 03, 2009, 02:53:14 AM
Lutz,



Why do you make a difference in function names between UTF-8 and non-UTF-8?



It's somewhat awkward to have that inside a language...

I would expect global functionality in a function instead of separate behaviour...



Just a thought...
Title:
Post by: Lutz on February 03, 2009, 04:22:56 AM
Some functions work differently in utf8 and non-utf8 versions like these: http://www.newlisp.org/newlisp_manual.html#utf8_capable They are meant to work strictly on displayable strings and can display this global behavior. But there are also functions where you need both version like 'read-char' and 'read-utf8'. If you let 'read-char' switch behavior, like the others in the link, you would not be able to read binary files or ISO-8859 files correctly.