Strange problem with dolist

Darth.Severus · August 23, 2015, 05:34:06 AM

In a script I run (read-file <path>) and (parse str "\n") to get the content of a file split into lines, stored as a list in the symbol input-list. Then I run the following code:

Code:
(dolist (temp input-list)
  (when (not (or
              (starts-with temp "#")
              (starts-with temp "\t")))
    (setq temp (replace " " temp "&nbsp;"))
    (push (string temp "<br>") result-list -1))
  (when (starts-with temp "#")
    (push (heading temp) result-list -1)))

It does what it should do, but not with the first line. The script has no (exit) at the end, so I can look at what is in input-list. The first line is "### whatever" and, e.g., the fifth is "### whatever-again". But it applies the first (when) to the first line and the second (when) to all the other lines starting with "#". This is completely crazy.

Linux my-notebook 3.14-0.bpo.2-686-pae #1 SMP Debian 3.14.15-2~bpo70+1 (2014-08-21) i686 GNU/Linux

newLISP v.10.6.2 32-bit on Linux IPv4/6 UTF-8 libffi

TedWalther · August 23, 2015, 11:14:27 AM

Code:

    (dolist (temp input-list)
          (when (not (or
                   (starts-with temp "#")
                   (starts-with temp "\t")))
             (replace " " temp "&nbsp;")
             (extend temp "<br>")
             (push temp result-list -1))
          (when (starts-with temp "#")
                (push (heading temp) result-list -1)))

Here, cleaned it up a little for you.

Here is a question: are you intentionally skipping lines that start with a tab "\t"?

I don't know why that is a bug; if you send me a fuller code sample I'll run it and take a look. Tell me if this works:

Code:

    (dolist (temp input-list)
          (cond
          ((starts-with temp "\t") nil) ; do nothing
          ((starts-with temp "#") (push (heading temp) result-list -1))
          (default (replace " " temp "&nbsp;") (extend temp "<br>") (push temp result-list -1))))

(this isn't a bug-fixed version, just how I would have implemented it)

Darth_Severus · August 24, 2015, 04:01:40 AM

Quote: Here, cleaned it up a little for you.

Thanks, but you also put an error in it:

Code:
(push temp result-list-1))
should be (push temp result-list -1))

Quote: Here is a question: are you intentionally skipping lines that start with a tab "\t"?

Yes, for the moment. I wanted to decide later what to do instead.

Quote: if you send me a fuller code sample I'll run it and take a look.

Thanks. I can't send you my current input file. I'll first try it on my own a bit, with another input file.

Quote: Tell me if this works:

No, sorry. I mean, it's not working.

Quote: (this isn't a bug-fixed version, just how I would have implemented it)

Interesting, thanks. Only your lines are too long for my taste; I'm using huge letters in my editor. You also put in the same error as described above. There's only one result-list. Never mind.

Darth_Severus · August 24, 2015, 08:19:29 AM

Update

I found the nerve to try it again. I created a new file and it worked correctly, until I enabled the option in Geany called "Write Unicode BOM". I had always activated this without knowing for sure whether I needed it. It turns out that for this use case I don't. I'm quite sure I started activating it because I had problems in another program without it.

https://en.wikipedia.org/wiki/Byte_order_mark

TedWalther · August 24, 2015, 09:54:49 AM

Weird. I think the BOM is supposed to occur only once, at the very beginning of the file.

About the error: my eyesight. I didn't see that the -1 was detached from the variable name. I THOUGHT it was a weird variable name... :)

So even in the (cond ...) style, the code doesn't work if BOM is in the file?

TedWalther · August 24, 2015, 09:58:49 AM

Ok, try this code with the BOM enabled:

Code:

(dolist (temp result-list)
  (println (char (first temp)) {(} (first temp) {) } temp))

Darth_Severus · August 25, 2015, 03:42:35 AM

Quote from: "TedWalther": Weird. I think the BOM is supposed to occur only once, at the very beginning of the file.

Right, and that fits the error I got exactly. Only the first line causes trouble.

Quote: So even in the (cond ...) style, the code doesn't work if BOM is in the file?

Yes. I think it's clearly a bug. Unlike other programs, newLISP handles the BOM as if it were part of the text.

Quote: Ok, try this code with the BOM enabled:

Before I even mentioned the problem here, I checked what newLISP gives me with print or when accessing the data by index. With print, the first line is always shown as it should be, but when newLISP processes the data it sees the BOM as the start of the first line.
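A short sketch of how the hidden bytes can be made visible, assuming the file is UTF-8: printing shows nothing unusual, but unpack with the "b" (unsigned byte) format returns the leading bytes as numbers. The sample string is only a stand-in for the real file content, and first-bytes is a name made up for this example.

```newlisp
;; Stand-in for (read-file ...): a UTF-8 BOM followed by the first line.
(setq data "\xEF\xBB\xBF### Unicode?")

;; The first three bytes as unsigned numbers; a UTF-8 BOM is 239 187 191.
(setq first-bytes (unpack "bbb" data))
(println first-bytes)

;; Byte-wise check against the BOM itself rather than against "#".
(println (starts-with data "\xEF\xBB\xBF"))
```

If the first three numbers are 239, 187 and 191, the file starts with a UTF-8 BOM, even though print gives no hint of it.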

I looked again into it, and could see the problem:

Quote:
> (setq data (read-file "/pathdeleted/untitled"))
> (println data)
### Unicode?
### Test0
### lülülü
'''Test1
Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n Test2"
> (data 0)
""
> (starts-with data "#")
nil
> (println (char (first data)) {(} (first data) {) } data)
65279() ### Unicode?
### Test0
### lülülü
'''Test1
Test2
"### Unicode?\n### Test0\n### lülülü\n'''Test1\n Test2"
> (char 65279)
""

TedWalther · August 25, 2015, 09:46:28 AM

Quote from: "Darth_Severus":
Quote from: "TedWalther": Weird. I think the BOM is supposed to occur only once, at the very beginning of the file.
Right, and that fits the error I got exactly. Only the first line causes trouble.

Oh! I read your post as saying the opposite; I thought the first line worked, and the others didn't. In that case, the fix is easy. newLISP is doing the right thing and leaving the BOM alone. So, add another starts-with clause that includes the BOM. Like this:

Code:

(starts-with temp "\xFE\xFF#")     ; for UTF-16 (BOM bytes FE FF, big-endian)
(starts-with temp "\xEF\xBB\xBF#") ; for UTF-8 (BOM bytes EF BB BF)

Or, before you even enter your loop, do this:

Code:

(setf (input-list 0) (2 (input-list 0)))

That chops off the BOM (assuming UTF-16; for UTF-8 change it to (3 (input-list 0))).
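Put together, the approach could look like the sketch below. It assumes a UTF-8 file, checks for the BOM before slicing so that files without one are left alone, and uses a made-up heading function because the real one is never shown in this thread. The sample input is invented for illustration.

```newlisp
;; Hypothetical stand-in for the heading function, which the thread never shows.
(define (heading str) (string "<h3>" str "</h3>"))

;; Invented sample input: UTF-8 BOM, two heading lines, one text line.
(setq input-list (parse "\xEF\xBB\xBF### Unicode?\n### Test0\nTest 1" "\n"))
(setq result-list '())

;; Chop the three BOM bytes off the first line, but only if they are there.
;; Implicit slicing counts bytes, so 3 is right for a UTF-8 BOM.
(when (starts-with (input-list 0) "\xEF\xBB\xBF")
  (setf (input-list 0) (3 (input-list 0))))

(dolist (temp input-list)
  (cond
    ((starts-with temp "\t") nil)   ; skip tab lines for now
    ((starts-with temp "#") (push (heading temp) result-list -1))
    (true (replace " " temp "&nbsp;")
          (extend temp "<br>")
          (push temp result-list -1))))

(println result-list)
```

With the BOM gone, the first line should take the same branch as every other line that starts with "#".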

Darth_Severus · August 25, 2015, 11:17:26 AM

I think programs are not supposed to read the BOM when reading a file (this way). When I do

Code:
$> cat file

in Linux, the BOM is not shown, nor is it shown in any other program. As I've shown above, in newLISP it even makes a difference whether you use println or starts-with. This makes no sense at all; people may run into the same problem as me over and over again.

TedWalther · August 25, 2015, 12:03:28 PM


newLISP isn't just a program; it is a general-purpose programming language. Some things NEED to see the BOM. You are the one writing the program; it is up to you to handle the BOM. As I just showed you with that one-liner, BOM handling can be done fairly simply. Yes, it is something to watch out for. Not sure where that info belongs in the manual; it is language-independent, general Unicode knowledge.

Darth_Severus · August 25, 2015, 02:00:16 PM

Even if this were the right way to do it, the print function would then have to show it too. You can't be serious about just keeping it how it is. It's invisible. A programmer can't know which files a user will use as input, so this is one more thing to think about when writing a program. Your line, or something like it, would have to be part of every script that reads user-generated or third-party text files. And don't forget that less experienced programmers won't have this Unicode knowledge.

Is it really done this way in other languages, like Python?

I'd strongly prefer it not to be handled like this, and I see no need for it. Finding out what kind of file it is should only be necessary when you really need to know, and a function like "file" in Linux would be a better way to do that. Maybe there should also be a way to write a file with a BOM.

Yeah, but I see - Linux does it the same way:

Code:
> ((exec "cat ~/untitled") 0)
"### Unicode?"
> (((exec "cat ~/untitled") 0) 0)
""

Horrible, but it seems to be standard.

TedWalther · August 25, 2015, 02:08:54 PM

Yeah. I find in general, newLISP doesn't put any burden on you that it doesn't have to. When dealing with Unicode, there are lots of characters that don't show up when you print. Even with regular ASCII, there are codes like "" that don't show up at all. If you don't know where your data is coming from, you have to do checks to sanitize it. Just a fact of life. newLISP does make it really easy to check and sanitize data. But binary data is binary data; only you know how you are going to interpret it. So newLISP couldn't practically be changed to handle every type of data format. Instead it gives us a small set of very powerful tools so we can handle every type of binary data format.

That said, I would make a "pop-bom" function that strips the BOM out of a data stream. In fact, I've written a bunch of small scripts where I go character by character and convert or drop specific Unicode characters depending on what I'm interested in. newLISP has been the ideal language for my work on the text of the Dead Sea Scrolls and other old manuscripts that are in Unicode.
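A minimal sketch of what such a pop-bom helper might look like. Only the name comes from the post; the body is an assumption, not code from the thread. It works on a whole string rather than a stream and compares raw bytes, so it recognizes the UTF-8 BOM and both UTF-16 byte orders.

```newlisp
;; Hypothetical helper: return str without a leading byte order mark.
;; Implicit slicing, as in (3 str), counts bytes rather than characters.
(define (pop-bom str)
  (cond
    ((starts-with str "\xEF\xBB\xBF") (3 str))   ; UTF-8 BOM
    ((starts-with str "\xFE\xFF") (2 str))       ; UTF-16 BOM, big-endian
    ((starts-with str "\xFF\xFE") (2 str))       ; UTF-16 BOM, little-endian
    (true str)))                                 ; no BOM: leave untouched

;; Usage with an invented sample instead of a real file.
(setq cleaned (pop-bom "\xEF\xBB\xBF### Unicode?\n### Test0"))
(println (starts-with cleaned "#"))
```

Calling it once on the result of read-file, before parse, keeps the check out of the per-line loop.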

One of my most useful scripts, just reads in a stream a character at a time, and makes a histogram; it counts every unique character, and prints out the count, with the FULL unicode name of that character, plus the hex and decimal value of that character. I call it unicode-histogram.lsp. If you're interested, I could post it here.


TedWalther · August 25, 2015, 02:26:34 PM

Never mind, it is simple enough. Here is my script; it helps with debugging Unicode issues.

Code:

#!/usr/bin/newlisp

(load "unicode-names.lsp")
(define histogram:histogram)

(define (hex n) (push "0x" (upper-case (format "%x" n))))

(while (setq c (read-utf8 0))
  (setq c (char c))
  (if (histogram c)
      (++ (histogram c))
      (histogram c 1)))

(dolist (i (sort (histogram) (fn (x y) (< (char (x 0)) (char (y 0))))))
  (println (format "%s (decimal %d) %s (%s) occurs %d times."
    (hex (char (i 0))) (char (i 0)) (i 0) (unicode-name (i 0)) (i 1 0))))

(exit)

And here is some output from a project I recently did:

Quote:
0xA (decimal 10)
(LINE FEED (LF)) occurs 17645 times.
0x20 (decimal 32) (SPACE) occurs 377553 times.
0x26 (decimal 38) & (AMPERSAND) occurs 3 times.
0x28 (decimal 40) ( (LEFT PARENTHESIS) occurs 282 times.
0x29 (decimal 41) ) (RIGHT PARENTHESIS) occurs 282 times.
0x2A (decimal 42) * (ASTERISK) occurs 96 times.
0x2D (decimal 45) - (HYPHEN-MINUS) occurs 148 times.
0x2E (decimal 46) . (FULL STOP) occurs 9086 times.
0x30 (decimal 48) 0 (DIGIT ZERO) occurs 4429 times.
...
0x1372 (decimal 4978) ፲ (ETHIOPIC NUMBER TEN) occurs 84 times.
0x1373 (decimal 4979) ፳ (ETHIOPIC NUMBER TWENTY) occurs 77 times.
0x1374 (decimal 4980) ፴ (ETHIOPIC NUMBER THIRTY) occurs 67 times.
0x1375 (decimal 4981) ፵ (ETHIOPIC NUMBER FORTY) occurs 30 times.
0x1376 (decimal 4982) ፶ (ETHIOPIC NUMBER FIFTY) occurs 49 times.
0x1377 (decimal 4983) ፷ (ETHIOPIC NUMBER SIXTY) occurs 30 times.
0x1378 (decimal 4984) ፸ (ETHIOPIC NUMBER SEVENTY) occurs 43 times.
0x1379 (decimal 4985) ፹ (ETHIOPIC NUMBER EIGHTY) occurs 11 times.
0x137A (decimal 4986) ፺ (ETHIOPIC NUMBER NINETY) occurs 6 times.
0x137B (decimal 4987) ፻ (ETHIOPIC NUMBER HUNDRED) occurs 372 times.

The unicode-names.lsp file is structured very simply; this should give you the idea:

Code:

;; names of unicode code points provided by
;;   http://www.fileformat.info/info/unicode/block/

(define unicode-name:unicode-name)
(unicode-name (char 0x0000) "NULL")
(unicode-name (char 0x0001) "START OF HEADING")
(unicode-name (char 0x0002) "START OF TEXT")
(unicode-name (char 0x0003) "END OF TEXT")
(unicode-name (char 0x0004) "END OF TRANSMISSION")
(unicode-name (char 0x0005) "ENQUIRY")
(unicode-name (char 0x0006) "ACKNOWLEDGE")
(unicode-name (char 0x0007) "BELL")
(unicode-name (char 0x0008) "BACKSPACE")
...

Darth_Severus · August 26, 2015, 05:43:09 AM

Thanks for your help so far. I might look into your code to see whether I can use it.
