newLISP Fan Club

Forum => newLISP in the real world => Topic started by: TedWalther on January 25, 2009, 09:52:27 PM

Title: read-file doesn't work with /dev/stdin
Post by: TedWalther on January 25, 2009, 09:52:27 PM
How can I make read-file work with /dev/stdin? When my application doesn't have a file specified by the user, I want to fall back to stdin to slurp my data in. The input is potentially UTF-8 data, and I will want to iterate over each character in the stream.



Writing to /dev/stdout works just fine.



Perhaps the simplest solution is to let read-file etc. accept file descriptors (integers) as well as file-name strings?



Example code:



; choose formats and files from the environment, with defaults
(setq from-format (sym (or (env "FROM") "utf8")))
(setq   to-format (sym (or (env "TO") "neo-paleo-hebrew")))
(setq output-file (or (env "SAVEAS") "/dev/stdout"))
(setq  input-file (or (env "OPEN")   "/dev/stdin"))

; copy the input character by character; this is where
; read-file fails when input-file is /dev/stdin
(dostring (c (read-file input-file))
   (append-file output-file (char c)))
Title:
Post by: Lutz on January 26, 2009, 04:37:52 AM
'read-file' can only be used to read from files.



You could use 'read-char' to read from stdin, using 0 as the device; it returns one byte at a time. Or use 'read-line' and then 'explode' the line into UTF-8 multi-byte characters. In both cases, processing does not start until a line feed is received.


#!/usr/bin/newlisp

(while (!= (setq ch (read-char 0)) (char "q"))
    (println "->" ch))

(exit)


Or use 'read-key', which will process immediately after each keystroke:


#!/usr/bin/newlisp

(while (!= (setq ch (read-key)) (char "q"))
    (println "->" ch))

(exit)
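
For the 'read-line' plus 'explode' route mentioned above, a minimal sketch (assuming a UTF-8 enabled build of newLISP, where 'explode' splits on character boundaries) might look like this:

#!/usr/bin/newlisp

; read stdin line by line, then split each line into
; UTF-8 characters; characters arrive only after a full line
(while (setq line (read-line 0))
    (dolist (c (explode line))
        (println "->" c)))

(exit)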
Title:
Post by: TedWalther on January 30, 2009, 11:07:14 PM
Lutz, could it be time to rename read-char to read-byte, and make read-char do the UTF-8 thing?



Would it be hard to slurp in fd 0 for read-file?  I mean, generally, I cat a file into the program, and it is no different from reading a regular file.  The only difference between read-char and read-file is that one takes an already opened file descriptor, and the other wants to do the open itself.  Once read-file had accepted all available input from fd 0, it would make sense for it to close it.
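
In the meantime, something like this hypothetical slurp-stdin helper (the name and approach are illustrative, not built in) could emulate read-file on fd 0 by accumulating lines:

; collect everything arriving on fd 0 into one string;
; read-line strips the newline, so it is re-appended (a trailing
; newline may be added even if the input had none)
(define (slurp-stdin)
    (let (buf "" line nil)
        (while (setq line (read-line 0))
            (setq buf (append buf line "\n")))
        buf))

; usage, iterating characters as in the original example:
; (dostring (c (slurp-stdin)) (println (char c)))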
Title:
Post by: Lutz on January 31, 2009, 05:35:17 AM
I see 'read-char', 'read-buffer' and 'read-line' as the low-level API, using file descriptors and dealing with octets, and 'read-file', 'write-file' and 'append-file' as the high-level API, dealing with file names.
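
A side-by-side sketch of the two levels (the file name "data.txt" is illustrative):

; low-level: explicit descriptor, octets into a buffer
(setq fd (open "data.txt" "read"))
(read-buffer fd buf 4096)   ; fills buf with up to 4096 octets
(close fd)

; high-level: file name in, whole contents out
(setq contents (read-file "data.txt"))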
Title:
Post by: Lutz on February 01, 2009, 09:43:30 AM
A future version of newLISP will have a 'read-utf8' which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character boundaries in UTF-8 enabled versions of newLISP.
Title: utf8, pack, char, and iso8859-8 (hebrew)
Post by: TedWalther on February 02, 2009, 12:31:19 PM
Quote from: "Lutz"A future version of newLISP will have a 'read-utf8' which works like 'read-char' but reads UTF-8 characters from a file handle or stdin. Until then I suggest using (explode (read-line ...)) instead; 'explode' splits a string on multi-byte character boundaries in UTF-8 enabled versions of newLISP.


Thank you for the explanation.  I'm meditating now on how the whole UTF-8 vs. byte-oriented paradigm works in newLISP.  I should have spent more time meditating on the pack and unpack functions.



Does (char ...) use (pack ...) internally?  Should it?  I recently had some problems when trying to output some Hebrew characters in codepage ISO-8859-8.  I think (char ...) was interpreting them as UTF-8 characters.
Title:
Post by: Lutz on February 02, 2009, 04:23:50 PM
'char' does not use 'pack' or 'unpack' internally, but 'char' is UTF-8 sensitive when using the UTF-8 enabled version of newLISP. If a Windows installation is working with ISO-8859-x character sets, then one should not use the UTF-8 enabled version for Hebrew and most Eastern European (Cyrillic) character sets.



When the UTF-8 enabled version of newLISP sees Hebrew ISO-8859-8 characters (greater than 127), it sees them not as Hebrew but as something else. All characters below 128 will be interpreted as ASCII in both ISO-8859-x and UTF-8/Unicode. All characters above 127 are one byte wide in ISO-8859-x character sets, but could initiate different one- or multi-byte-wide characters in UTF-8.
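
A small illustration of the difference (results shown assume the UTF-8 enabled build; treat it as a sketch):

; aleph is code point 1488 (U+05D0) in Unicode, but the
; single byte 224 (0xE0) in ISO-8859-8
(char 1488)     ; UTF-8 build: the two octets 0xD7 0x90 (aleph)
(char 224)      ; UTF-8 build: U+00E0 "à", NOT aleph
(pack "b" 224)  ; one raw 0xE0 octet: aleph under ISO-8859-8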



Here is a list of all functions sensitive to UTF-8 characters when using the UTF-8 enabled version: http://www.newlisp.org/newlisp_manual.html#utf8_capable



PS: by default, the Windows installer ships the non-UTF-8 version, and the Mac OS X installer the UTF-8 version.
Title:
Post by: newdep on February 03, 2009, 02:53:14 AM
Lutz,



Why do you make a difference in function names between UTF-8 and non-UTF-8?



It's somewhat awkward to have that inside a language...

I would expect global functionality in a function instead of separate behaviour...



Just a thought...
Title:
Post by: Lutz on February 03, 2009, 04:22:56 AM
Some functions work differently in utf8 and non-utf8 versions like these: http://www.newlisp.org/newlisp_manual.html#utf8_capable They are meant to work strictly on displayable strings and can display this global behavior. But there are also functions where you need both version like 'read-char' and 'read-utf8'. If you let 'read-char' switch behavior, like the others in the link, you would not be able to read binary files or ISO-8859 files correctly.