UTF8 and regular expressions in newLISP

Started by Lutz, August 09, 2015, 09:47:22 AM

Lutz

For patterns that don't address UTF8 characters as single characters, working with or without the PCRE_UTF8 flag (2048 or "u") makes no difference, but the flag is needed when multibyte sequences should be treated as UTF8 characters:



> (set 'utf8str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"

> (regex "(.)(.)(.)" utf8str)
("我" 0 3 "?" 0 1 "?" 1 1 "?" 2 1)


Without the flag, the matched string consists of 3 single octets, each matched by a dot, which UTF8-enabled newLISP and a UTF8-enabled terminal then combine into one displayable UTF8 character. The individual captures are shown as "?" because on their own they are neither valid UTF8 nor ASCII characters.



Now the same using the PCRE_UTF8 option flag:



> (regex "(.)(.)(.)" utf8str 2048)
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)

> (regex "(.)(.)(.)" utf8str "u")
("我能吞" 0 3 "我" 0 1 "能" 1 1 "吞" 2 1)
>


... the dot "." now represents a UTF8 character.



The above examples work on versions 10.6.2, 10.6.3 and 10.6.4 of newLISP.
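
The octet-versus-character distinction behind these results can be checked directly. The following is only a sketch, assuming a UTF8-enabled newLISP (utf8len is not available otherwise): length counts octets, utf8len counts characters, and unpack shows the raw octets that an unflagged dot would match one at a time.

```newlisp
(set 'utf8str "我能吞下玻璃而不伤身体。")

; length counts octets: 12 characters, 3 octets each
(println (length utf8str))        ; 36

; utf8len counts UTF8 characters (UTF8-enabled newLISP only)
(println (utf8len utf8str))       ; 12

; the three octets of the first character, as unsigned bytes
(println (unpack "bbb" utf8str))  ; (230 136 145)
```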



The error message "invalid UTF8 string" is only generated by the functions first, rest, last and pop and by implicit indexing of strings, when the string, seen as a UTF8 string, would occupy more bytes than are allocated or terminated by 0, i.e. for a string not meant to be a UTF8 string.
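
To make "would occupy more bytes than allocated" concrete: in UTF8 the lead octet of a sequence announces how many octets the sequence has. The sketch below illustrates that rule only. utf8-seq-length is a hypothetical helper written for this example, not a newLISP built-in, and it is not claimed to be the check newLISP performs internally.

```newlisp
; utf8-seq-length is a hypothetical helper, not a newLISP built-in.
; It returns how many octets a UTF8 sequence should have, judged
; only from its lead octet, or nil for an octet that cannot lead.
(define (utf8-seq-length lead)
  (cond
    ((< lead 0x80) 1)                ; 0xxxxxxx: plain ASCII
    ((= (& lead 0xE0) 0xC0) 2)       ; 110xxxxx
    ((= (& lead 0xF0) 0xE0) 3)       ; 1110xxxx
    ((= (& lead 0xF8) 0xF0) 4)       ; 11110xxx
    (true nil)))                     ; continuation or invalid octet

(println (utf8-seq-length 0xE6))     ; lead octet of "我": 3
(println (utf8-seq-length 0xF0))     ; 4

; a one-octet string whose only octet announces four octets
(set 's (pack "b" 0xF0))
(println (< (length s) (utf8-seq-length (get-char s))))  ; true
```

A string like s is the kind of input where the announced length runs past the bytes actually present.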

TedWalther

#1
Thanks for the explanation Lutz.  I was trying to duplicate the error with version 10.6.2, but it didn't show up:


Quote
> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))

"abcd�"

> (regex "(r|n)$" foo 0)

nil

> (set 'foo (string "abcd" (pack "b" (int "0b11101111")) "e"))

"abcd�e"

> (regex "(r|n)$" foo 0)

nil

> (regex "(r|n)$" foo)

nil




Then I thought: "implicit indexing, ah"


Quote
> (regex "r|n" (foo -1))

nil

> (set 'foo (string "abcd" (pack "b" (int "0b11001111"))))

"abcd�"

> (regex "r|n" (foo -1))

nil


So still not sure exactly how my code triggered the exception.  I'd like to duplicate the bug so I can fix it in my code.



Update



You mentioned a string containing a 0 byte, so I'll test that.  And, still not triggering the exception.


Quote
> (set 'foo (string "abcd" (pack "b" 0) "e"))

"abcde"

> (regex "r|n" (foo -1))

nil

> (regex "r|n" foo)

nil

> (set 'foo (string "abcrd" (pack "b" 0) "en"))

"abcrden"

> (regex "r|n" foo)

("r" 3 1)

> (regex "n" foo)

("n" 6 1)


OpenBSD



OpenBSD recently added support for Lua patterns to their web server; I read the manpage.  The patterns are almost like regular expressions, but smaller, simpler, very fast to implement, and include some nice things like paren-matching.  700 lines of code.



http://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man7/patterns.7?query=patterns



http://comments.gmane.org/gmane.os.openbsd.tech/42569


Quote
there is some great interest in getting support for rewrites and better matching in httpd.  I refused to implement this using regex, as regex is extremely complicated code, there have been lots of bugs, they allow, if not specified carefully, dangerous recursions and ReDOS, and I would add another potential attack surface in httpd.

Thanks to tedu <at> 's hint at BSDCan, I stumbled across Lua's pattern matching implementation.  It is relatively small (less than 700loc), powerful, portable C code, MIT-licensed, and doesn't suffer from some of regex' problems (eg., it doesn't allow recursive captures).  I ported it on my flight back from Ottawa, KNF'ed it, and turned it into a C API without the Lua bindings.  No, this diff does not bring the Lua language to httpd!
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

cormullion

#2
try PEG!



https://github.com/dahu/nlpeg

TedWalther

#3
Quote from: "cormullion"
try PEG!

https://github.com/dahu/nlpeg


PEG looks neat.  Have you tried this implementation?  Does it work?



Update



Until I read the Wiki page, I didn't realize that PEG is a formalism for recursive descent parsers.  Awesome!

rrq

#4
E.g.,
> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex

TedWalther

#5
I guess the confusion is that the difference between character streams and byte streams isn't always obvious.   Both are useful.

TedWalther

#6
Quote from: "ralph.ronnquist"
E.g.,
> (setf b (pack "b" (+ 0xc0 0x30)))
"�"
> (regex "x" (b -1))

ERR: invalid UTF8 string in function regex


This is why it is hard to chase down: the interactions between UTF8 mode and octet (raw byte) mode.


Quote
> (b -1)



ERR: invalid UTF8 string

> b

"�"

> (char b)

2827

> (bits b)



ERR: value expected in function bits : b

> (bits (char b))

"101100001011"

>


Perhaps the get-char or unpack functions would do the trick.  They usually aren't the first things I think of.  I find my brain having to work to do the shift between character and byte oriented streams, each with their different API.  It wants to use the same API for both, with perhaps the occasional boolean flag or two to disambiguate.
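
A minimal sketch of that byte-oriented route, using the same one-octet string as in the quoted example. As far as I know, get-char and unpack both read raw octets and do no UTF8 decoding, so they should not care whether the string is valid UTF8:

```newlisp
(setf b (pack "b" (+ 0xc0 0x30)))  ; a single octet, 0xF0

(println (length b))       ; 1, length counts octets
(println (get-char b))     ; 240, the raw octet as an integer
(println (unpack "b" b))   ; (240), the same octet via unpack
```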



In this case, the char function is silently converting a byte value to... to what?  As a 16 bit quantity, it is valid UTF8.

abaddon1234

#7
Thanks for the info