Strange problem with dolist

Started by Darth.Severus, August 23, 2015, 05:34:06 AM

Darth.Severus

In a script I run (read-file <path>) and (parse str "\n") to split the content of a file into lines, and store the resulting list in the symbol input-list. Then I run the following code:


(dolist (temp input-list)
  (when (not (or
               (starts-with temp "#")
               (starts-with temp "\t")))
    (setq temp (replace " " temp "&nbsp;"))
    (push (string temp "<br>") result-list -1))
  (when (starts-with temp "#")
    (push (heading temp) result-list -1)))


It does what it should, but not for the first line. The script has no (exit) at the end, so I can inspect input-list afterwards. The first line is "### whatever" and, for example, the fifth is "### whatever-again". But it applies the first (when) to the first line and the second (when) to all the other lines starting with "#". This is completely crazy.



Linux my-notebook 3.14-0.bpo.2-686-pae #1 SMP Debian 3.14.15-2~bpo70+1 (2014-08-21) i686 GNU/Linux

newLISP v.10.6.2 32-bit on Linux IPv4/6 UTF-8 libffi

TedWalther

#1

   (dolist (temp input-list)
          (when (not (or
                   (starts-with temp "#")
                   (starts-with temp "\t")))
             (replace " " temp "&nbsp;")
             (extend temp "<br>")
             (push temp result-list -1))
          (when (starts-with temp "#")
                (push (heading temp) result-list -1)))


Here, cleaned it up a little for you.



Here is a question: are you intentionally skipping lines that start with a tab "\t"?



I don't know why that is a bug; if you send me a fuller code sample I'll run it and take a look.  Tell me if this works:



   (dolist (temp input-list)
          (cond
          ((starts-with temp "\t") nil) ; do nothing
          ((starts-with temp "#") (push (heading temp) result-list -1))
          (default (replace " " temp "&nbsp;") (extend temp "<br>") (push temp result-list -1))))


(this isn't a bug-fixed version, just how I would have implemented it)
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Darth_Severus

#2
Quote: Here, cleaned it up a little for you.
Thanks, but you also put an error in it:


(push temp result-list-1))
should be (push temp result-list -1))


Quote: Here is a question: are you intentionally skipping lines that start with a tab "\t"?
Yes, for the moment. I wanted to decide later what to do instead.


Quote: if you send me a fuller code sample I'll run it and take a look.
Thanks. I can't send you my current input file. I'll first try it on my own a bit, with another input file.


Quote: Tell me if this works:
No, sorry. It's not working.


Quote: (this isn't a bug-fixed version, just how I would have implemented it)
Interesting, thanks. Only your lines are too long for my taste; I'm using huge letters in my editor. You also put in the same error as described above. There's only one result-list. Never mind.

Darth_Severus

#3
Update



I found the nerve to try it again. I created a new file and it worked correctly, until I used the option in Geany called "Write Unicode BOM". I always activated this without knowing for sure whether I need it. It turns out that for this use case I don't. I'm quite sure I started activating it because I had problems in another program without it.



https://en.wikipedia.org/wiki/Byte_order_mark

TedWalther

#4
Weird.  I think BOM is supposed to only happen once, at the very beginning of the file.



About the error: my eyesight.  Didn't see that the -1 was detached from the variable name.  I THOUGHT it was a weird variable name... :)



So even in the (cond ...) style, the code doesn't work if BOM is in the file?

TedWalther

#5
Ok, try this code with the BOM enabled:



(dolist (temp result-list)
  (println (char (first temp)) {(} (first temp) {) } temp))

Darth_Severus

#6
Quote from: "TedWalther"
Weird.  I think BOM is supposed to only happen once, at the very beginning of the file.
Right, and that matches the error I got exactly. Only the first line causes trouble.


Quote: So even in the (cond ...) style, the code doesn't work if BOM is in the file?
Yes. I think it's clearly a bug. Unlike other programs, newLISP handles the BOM as if it were part of the text.


Quote: Ok, try this code with the BOM enabled:
Before I even mentioned the problem here, I checked what newLISP gives me with print and by accessing the data by index. With print, the first line is always shown as it should be, but when newLISP processes the data it sees the BOM as the start of the first line.



I looked again into it, and could see the problem:
Quote
(setq data (read-file "/pathdeleted/untitled"))
(println data)
### Unicode?
### Test0
### lülülü
'''Test1
   Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n   Test2"

(data 0)
""

(starts-with data "#")
nil

(println (char (first data)) {(} (first data) {) } data)
65279() ### Unicode?
### Test0
### lülülü
'''Test1
   Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n   Test2"

(char 65279)
""

TedWalther

#7
Quote from: "Darth_Severus"
Quote from: "TedWalther"
Weird.  I think BOM is supposed to only happen once, at the very beginning of the file.
Right, and that matches the error I got exactly. Only the first line causes trouble.


Oh!  I read your post as saying the opposite; I thought the first line worked, and the others didn't.  In that case, the fix is easy.  newLISP is doing the right thing and leaving the BOM alone.  So, add another starts-with clause that includes the BOM.  Like this:



(starts-with temp "\xFE\xFF") ; UTF-16 big-endian BOM: bytes FE FF ("\xFF\xFE" for little-endian)
(starts-with temp "\xEF\xBB\xBF#") ; UTF-8 BOM: bytes EF BB BF, followed by "#"


Or, before you even enter your loop, do this:



(setf (input-list 0) (2 (input-list 0)))


That chops off the BOM, assuming UTF-16; for UTF-8 change it to (3 (input-list 0)).
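
Putting the two ideas from this thread together (strip the BOM once, then run the cond-style loop), a minimal sketch could look like the following. The heading function is a hypothetical stand-in, since the real one is never shown here, and the sample text only imitates a file saved with a UTF-8 BOM.

```newlisp
;; Sketch only. `heading` is a hypothetical stand-in for the real function.
(define (heading line)
  (string "<h3>" line "</h3>"))

;; UTF-8 BOM written as three raw bytes (decimal escapes).
(setq bom "\239\187\191")

;; Sample text standing in for (read-file <path>) on a file saved with a BOM.
(setq text (string bom "### whatever\nplain line\n\tskipped\n### whatever-again"))

;; Drop the BOM once, before splitting into lines. slice counts bytes.
(if (starts-with text bom)
    (setq text (slice text (length bom))))

(setq input-list (parse text "\n"))
(setq result-list '())

(dolist (temp input-list)
  (cond
    ((starts-with temp "\t") nil)   ; skip tab-indented lines for now
    ((starts-with temp "#") (push (heading temp) result-list -1))
    (true (push (string (replace " " temp "&nbsp;") "<br>") result-list -1))))

(println result-list)
```

With the BOM removed up front, the first line takes the "#" branch like every other heading line.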

Darth_Severus

#8
I think programs are not supposed to read the BOM when reading a file (this way). When I do
$> cat file in Linux, the BOM is not shown, nor in any other program. As I've shown above, in newLISP it even makes a difference whether you use println or starts-with. This makes no sense at all; people may run into the same problem as me over and over again.

TedWalther

#9
Quote from: "Darth_Severus"
I think programs are not supposed to read the BOM when reading a file (this way). When I do
$> cat file in Linux, the BOM is not shown, nor in any other program. As I've shown above, in newLISP it even makes a difference whether you use println or starts-with. This makes no sense at all; people may run into the same problem as me over and over again.


newLISP isn't just a program; it is a general purpose programming language.  Some things NEED to see the BOM.  You are the one writing the program; it is up to you to handle the BOM.  As I just showed you with that one-liner, BOM handling can be done fairly simply.  Yes, it is something to watch out for.  Not sure where that info belongs in the manual; that is language independent general Unicode knowledge.

Darth_Severus

#10
Even if this were the right way to do it, the print function would also have to show it. You can't be serious about just keeping it how it is. It's invisible. A programmer can't know which files a user will use as input, so this is one more thing to think of when writing a program. Your line, or something like it, would have to be part of every script that reads user-generated or third-party text files. Not to forget, non-advanced programmers won't have Unicode knowledge.



Is it really done this way in other languages, like Python?



I'd strongly prefer to have it not handled like this. I also see no need for it. Finding out what kind of file it is should only be necessary when it really matters, and some function like "file" in Linux would be better for that.  Maybe also some possibility to write a file with a BOM.



Yeah, but I see - Linux does it the same way:
> ((exec "cat ~/untitled")0)
"### Unicode?"
(((exec "cat ~/untitled")0)0)
""
Horrible, but it seems to be standard.
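
Since the BOM prints as nothing, looking at raw byte values is one way to make it visible. A minimal sketch, assuming a UTF-8 BOM; the string below only stands in for the file content, and lead-bytes is a name chosen for illustration.

```newlisp
;; Sketch only: this string stands in for a file saved with a UTF-8 BOM.
;; "\239\187\191" spells the three BOM bytes as decimal escapes.
(setq data "\239\187\191### Unicode?\n### Test0")

;; Unpack the first three bytes as unsigned integers.
(setq lead-bytes (unpack "bbb" data))
(println lead-bytes)

;; The text only "starts with #" once the BOM bytes are accounted for.
(println (starts-with data "#"))
(println (starts-with data "\239\187\191#"))
```

The three byte values 239 187 191 (EF BB BF in hex) are the UTF-8 encoding of code point 65279, the number reported earlier in the thread by (char (first data)).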

TedWalther

#11
Yeah.  I find in general, newLISP doesn't put any burden on you that it doesn't have to.  When dealing with Unicode, there are lots of characters that don't show up when you print.  Even with regular ASCII, there are codes like "" that don't show up at all.  If you don't know where your data is coming from, you have to do checks to sanitize it.  Just a fact of life.  newLISP does make it really easy to check and sanitize data.  But binary data is binary data; only you know how you are going to interpret it.  So newLISP couldn't practically be changed to handle every type of data format.  Instead it gives us a small set of very powerful tools so we can handle every type of binary data format.



That said, I would make a "pop-bom" function, that would strip the bom out of a data stream.  In fact, I've written a bunch of small scripts where I go character by character, and convert or drop specific unicode characters depending on what I'm interested in.  newLISP has been the ideal language for my work on the text of the Dead Sea Scrolls and other old manuscripts that are in Unicode.



One of my most useful scripts, just reads in a stream a character at a time, and makes a histogram; it counts every unique character, and prints out the count, with the FULL unicode name of that character, plus the hex and decimal value of that character.  I call it unicode-histogram.lsp.  If you're interested, I could post it here.
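
The "pop-bom" idea mentioned above could look roughly like this. It is only a sketch: pop-bom and bom-table are illustrative names, not part of newLISP, and only three common BOMs are checked (a UTF-32 BOM is not handled).

```newlisp
;; Sketch only: each entry pairs a label with the raw BOM bytes.
(setq bom-table
  '(("UTF-8"    "\239\187\191")
    ("UTF-16BE" "\254\255")
    ("UTF-16LE" "\255\254")))

;; Returns (label rest-of-data), or (nil data) when no known BOM is found.
(define (pop-bom data)
  (let (found nil)
    ;; The third element of the dolist header stops the loop once found is set.
    (dolist (entry bom-table found)
      (if (starts-with data (entry 1))
          (setq found (list (entry 0) (slice data (length (entry 1)))))))
    (or found (list nil data))))

(println (pop-bom "\239\187\191### Unicode?"))
(println (pop-bom "### Test0"))
```

The first call should return the label "UTF-8" together with the BOM-free text; the second returns nil as the label and the text unchanged. Returning the label as well leaves the decision about what to do with UTF-16 data to the caller.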





TedWalther

#12
Never mind, it is simple enough, here is my script, it helps with debugging unicode issues.



#!/usr/bin/newlisp

(load "unicode-names.lsp")
(define histogram:histogram)

(define (hex n) (push "0x" (upper-case (format "%x" n))))

(while (setq c (read-utf8 0))
  (setq c (char c))
  (if (histogram c)
      (++ (histogram c))
      (histogram c 1)))

(dolist (i (sort (histogram) (fn (x y) (< (char (x 0)) (char (y 0))))))
  (println (format "%s (decimal %d) %s (%s) occurs %d times."
    (hex (char (i 0))) (char (i 0)) (i 0) (unicode-name (i 0)) (i 1 0))))

(exit)


And here is some output from a project I recently did:


Quote
0xA (decimal 10)
 (LINE FEED (LF)) occurs 17645 times.
0x20 (decimal 32)   (SPACE) occurs 377553 times.
0x26 (decimal 38) & (AMPERSAND) occurs 3 times.
0x28 (decimal 40) ( (LEFT PARENTHESIS) occurs 282 times.
0x29 (decimal 41) ) (RIGHT PARENTHESIS) occurs 282 times.
0x2A (decimal 42) * (ASTERISK) occurs 96 times.
0x2D (decimal 45) - (HYPHEN-MINUS) occurs 148 times.
0x2E (decimal 46) . (FULL STOP) occurs 9086 times.
0x30 (decimal 48) 0 (DIGIT ZERO) occurs 4429 times.
...
0x1372 (decimal 4978) ፲ (ETHIOPIC NUMBER TEN) occurs 84 times.
0x1373 (decimal 4979) ፳ (ETHIOPIC NUMBER TWENTY) occurs 77 times.
0x1374 (decimal 4980) ፴ (ETHIOPIC NUMBER THIRTY) occurs 67 times.
0x1375 (decimal 4981) ፵ (ETHIOPIC NUMBER FORTY) occurs 30 times.
0x1376 (decimal 4982) ፶ (ETHIOPIC NUMBER FIFTY) occurs 49 times.
0x1377 (decimal 4983) ፷ (ETHIOPIC NUMBER SIXTY) occurs 30 times.
0x1378 (decimal 4984) ፸ (ETHIOPIC NUMBER SEVENTY) occurs 43 times.
0x1379 (decimal 4985) ፹ (ETHIOPIC NUMBER EIGHTY) occurs 11 times.
0x137A (decimal 4986) ፺ (ETHIOPIC NUMBER NINETY) occurs 6 times.
0x137B (decimal 4987) ፻ (ETHIOPIC NUMBER HUNDRED) occurs 372 times.


The unicode-names.lsp file is structured very simply, this should give you the idea:



;; names of unicode code points provided by
;;   http://www.fileformat.info/info/unicode/block/

(define unicode-name:unicode-name)
(unicode-name (char 0x0000) "NULL")
(unicode-name (char 0x0001) "START OF HEADING")
(unicode-name (char 0x0002) "START OF TEXT")
(unicode-name (char 0x0003) "END OF TEXT")
(unicode-name (char 0x0004) "END OF TRANSMISSION")
(unicode-name (char 0x0005) "ENQUIRY")
(unicode-name (char 0x0006) "ACKNOWLEDGE")
(unicode-name (char 0x0007) "BELL")
(unicode-name (char 0x0008) "BACKSPACE")
...

Darth_Severus

#13
Thanks for your help so far. I might look into your code to see if I can use it.