read-file doesn't work with /dev/stdin

Started by TedWalther, January 25, 2009, 09:52:27 PM


TedWalther

How can I make read-file work with /dev/stdin? When my application doesn't have a file specified by the user, I want to fall back to stdin to slurp my data in. The input is potentially UTF-8 data, and I will want to iterate over each character in the stream.



Writing to /dev/stdout works just fine.



Perhaps the simplest solution is to let read-file etc. accept file descriptors (integers) as well as file-name strings?



Example code:



(setq from-format (sym (or (env "FROM") "utf8")))
(setq   to-format (sym (or (env "TO") "neo-paleo-hebrew")))
(setq output-file (or (env "SAVEAS") "/dev/stdout"))
(setq  input-file (or (env "OPEN")   "/dev/stdin"))

(dostring (c (read-file input-file))
   (append-file output-file (char c)))
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Lutz

#1
'read-file' can only be used to read from files.



You could use 'read-char' to read from stdin, using 0 as the device; it returns values one byte at a time. Or use 'read-line' and then 'explode' the line into UTF-8 multi-byte characters. In both cases input is line-buffered, so processing does not start until a line-feed arrives.


#!/usr/bin/newlisp

(while (!= (setq ch (read-char 0)) (char "q"))
    (println "->" ch))

(exit)
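
A minimal sketch of the 'read-line' plus 'explode' route, assuming a UTF-8 enabled newLISP:


#!/usr/bin/newlisp

; read-line takes descriptor 0 for stdin; explode splits each
; line on multi-byte character borders in UTF-8 enabled versions
(while (setq line (read-line 0))
    (dolist (c (explode line))
        (println "->" c)))

(exit)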


Or use 'read-key', which processes immediately after each keystroke:


#!/usr/bin/newlisp

(while (!= (setq ch (read-key)) (char "q"))
    (println "->" ch))

(exit)

TedWalther

#2
Lutz, could it be time to rename 'read-char' to 'read-byte', and make 'read-char' do the UTF-8 thing?



Would it be hard to make 'read-file' slurp in fd 0? Generally, I cat a file into the program, and it is no different from reading a regular file. The only difference between 'read-file' and 'read-char' is that one takes an already opened file descriptor, and the other wants to do the open itself. Once 'read-file' had accepted all available input from fd 0, it would make sense for it to close it.
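
Until something like that exists, a user-land sketch of the fallback (the helper name slurp-input is hypothetical, not a newLISP built-in):


#!/usr/bin/newlisp

; hypothetical helper: read a whole named file with read-file,
; or drain stdin (descriptor 0) line by line when no name is
; given, since read-file cannot open /dev/stdin
(define (slurp-input file-name)
    (if file-name
        (read-file file-name)
        (let (buf "")
            (while (setq line (read-line 0))
                ; read-line strips the line-feed, so add it back
                (setq buf (append buf line "\n")))
            buf)))

(setq data (slurp-input (env "OPEN")))  ; (env "OPEN") is nil when unset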

Lutz

#3
I see 'read-char', 'read-buffer' and 'read-line' as the low-level API, using file descriptors and dealing with octets, and 'read-file', 'write-file' and 'append-file' as the high-level API, dealing with file names.
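
For illustration, the two levels side by side (a sketch; the file name data.txt is made up):


; low-level API: integer file descriptors, octets
(setq fd (open "data.txt" "read"))
(setq line (read-line fd))          ; one line of octets
(close fd)

; high-level API: file names, whole files in one call
(setq all (read-file "data.txt"))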

Lutz

#4
A future version of newLISP will have a 'read-utf8', which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character borders in UTF-8 enabled versions of newLISP.
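
Applied to the loop from the first post, the interim approach would look like this (a sketch, assuming the UTF-8 enabled version):


; iterate per UTF-8 character instead of per byte; explode
; returns strings, so no (char ...) round-trip is needed
(dolist (c (explode (read-file input-file)))
    (append-file output-file c))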

TedWalther

#5
Quote from: "Lutz"
A future version of newLISP will have a 'read-utf8', which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character borders in UTF-8 enabled versions of newLISP.


Thank you for the explanation. I'm meditating now on how the whole UTF-8 vs. byte-oriented paradigm works in newLISP. I should have spent more time meditating on the 'pack' and 'unpack' functions.



Does (char ...) use (pack ...) internally? Should it? I recently had some problems when trying to output some Hebrew characters in code page ISO-8859-8. I think (char ...) was interpreting them as UTF-8 characters.

Lutz

#6
'char' does not use 'pack' or 'unpack' internally, but 'char' is UTF-8-sensitive when using the UTF-8 enabled version of newLISP. If a Windows installation is working with ISO-8859-x character sets, then one should not use the UTF-8 enabled version for Hebrew and most Eastern European (Cyrillic) character sets.



When the UTF-8 enabled version of newLISP sees Hebrew ISO-8859-8 characters (greater than 127), it sees them not as Hebrew but as something else. All characters below 128 are interpreted as ASCII in both ISO-8859-x and UTF-8/Unicode. All characters above 127 are one-byte-wide in ISO-8859-x character sets, but may start different one- or multi-byte-wide characters in UTF-8.
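
A small sketch of the difference at the byte level, assuming the UTF-8 enabled version (224 is aleph in ISO-8859-8):


(char 224)       ; UTF-8 version: "à" encoded as two bytes, 0xC3 0xA0
(pack "b" 224)   ; one raw byte, 0xE0, which an ISO-8859-8 stream needs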



Here is a list of all functions sensitive to UTF-8 characters when using the UTF-8 enabled version: http://www.newlisp.org/newlisp_manual.html#utf8_capable



PS: by default the Windows installer ships the non-UTF-8 version, and the Mac OS X installer the UTF-8 version.

newdep

#7
Lutz,



Why do you make a difference in function names between UTF-8 and non-UTF-8?



It's somewhat awkward to have that inside a language...

I would expect global functionality in a function instead of separate behaviour...



Just a thought...
-- (define? (Cornflakes))

Lutz

#8
Some functions work differently in the UTF-8 and non-UTF-8 versions, like these: http://www.newlisp.org/newlisp_manual.html#utf8_capable They are meant to work strictly on displayable strings, so they can switch behavior globally. But there are also functions where you need both versions, like 'read-char' and 'read-utf8'. If you let 'read-char' switch behavior, like the others in the link, you would not be able to read binary files or ISO-8859 files correctly.
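
A sketch of why both are needed ('read-utf8' is the future function mentioned above, so treat it as hypothetical here):


#!/usr/bin/newlisp

; read-char always returns one octet (0-255), safe for binary
; and ISO-8859 data; read-utf8 would return one code point,
; consuming one or more octets from the stream
(setq b (read-char 0))    ; first byte of input
(setq c (read-utf8 0))    ; next UTF-8 character as an integer code point
(println b " " c)
(exit)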