bug in regex; option 256 error, option 32 unrespected

Started by TedWalther, August 07, 2015, 05:21:12 PM


TedWalther

I'm trying to match a \r or \n at the end of a string.



Regex with option 0 isn't working.



Tried option 32, which says that $ only matches end of string, not newline.  That also doesn't work.



I try option 256, and it errors out:


Quote
> (set 'a "foo\rb" 'b "barn\n")

> (regex "\r|\n$" a 256)



ERR: regular expression in function regex : "offset 0 unknown option bit(s) set"
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

rrq

#1
Hmm. This is the same for all versions 10.6.[234] that I have... options 128, 256 and 1024 contradict the documentation in that way.

Lutz

#2

> (set 'a "foo\rb" 'b "barn\n")
"barn\n"
> (regex "\r|\n$" a 32)
("\r" 3 1)
> (regex {\r|\n$} a 32)
("\r" 3 1)
> (regex "\r|\n$" b 32)
("\n" 4 1)
> (regex {\r|\n$} b 32)
("\n" 4 1)
> (regex "\r|\n$" a 0)
("\r" 3 1)
> (regex "\r|\n$" b 0)
("\n" 4 1)
> (regex {\r|\n$} a)
("\r" 3 1)
> (regex {\r|\n$} b)
("\n" 4 1)
> (regex {\r|\n$} a 0)
("\r" 3 1)
> (regex {\r|\n$} b 0)
("\n" 4 1)


Read about the usage of backslashes in regex patterns here:

http://www.newlisp.org/downloads/newlisp_manual.html#regex

rrq

#3
There seems to be an options check at pcre.c:4520-4524 that makes it puke on a few of the documented PCRE options. It has nothing to do with newlines or backslashes. E.g.


> (regex "a" "aa" 256)

ERR: regular expression in function regex : "offset 0 unknown option bit(s) set"

Maybe one can just remove that options check code?

TedWalther

#4
Lutz, in the example, the "\r" shouldn't be matched; the $ anchor for end of string is being ignored.  That is why I wanted to use option 32.  But it doesn't seem to help.  Backslashing isn't changing that.


Quote
1439004065 :cameron.freenode.net 322 dvxd ##dashvapes 6 : www.dashvapes.com

1439004065 :cameron.f



ERR: invalid UTF8 string in function regex

called from user function recv&print


Now I've had this crash in my little IRC client when doing a LIST of all the channels on freenode.



Source code for the IRC client here: https://github.com/djwalther/nrc

rrq

#5
I think you mis-read the grouping. Try
(regex "(\r|\n)$" a 0)
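
For reference, a minimal sketch of why the grouping matters, assuming the test string is "foo\rb" (a carriage return at offset 3, followed by "b"). Alternation binds more loosely than the $ anchor, so the ungrouped pattern reads as "\r anywhere, or \n at the end of the string".

```newlisp
; Assumed test string: carriage return at offset 3, followed by "b".
(set 'a "foo\rb")

; Ungrouped: "\r" anywhere, OR "\n" at the end of the string.
(set 'ungrouped (regex {\r|\n$} a 0))

; Grouped: ("\r" OR "\n"), and only at the end of the string.
(set 'grouped (regex {(\r|\n)$} a 0))

(println "ungrouped: " ungrouped)   ; a match on the \r in the middle
(println "grouped:   " grouped)     ; nil, because "b" is the last character
```

With the group in place the anchor applies to both alternatives, which is the behaviour the original post was after.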

TedWalther

#6
Thanks Ralph, that worked.  Lutz, I withdraw half of my original bug report; the error was mine.

TedWalther

#7
Lutz, the UTF8 error I found is unrelated to the regex problem I had earlier. The problem is that people are putting malicious codes into the topics of their IRC channels, so when you list all channels there is a chance of being compromised.



This is the error message, don't click the link to that scammy site:


Quote
1439004065 :cameron.freenode.net 322 dvxd ##dashvapes 6 : www.dashvapes.com

1439004065 :cameron.f



ERR: invalid UTF8 string in function regex

called from user function recv&print


Do you mind taking a look at the code I put on github?  I'd like to find a way around this problem without giving up UTF-8 support.



The way to duplicate this bug is to connect to irc.mozilla.org, wait 60 seconds, then run the @list command.



https://github.com/djwalther/nrc



I did this irc client all by myself.  I was happy when I did a search, and there is a very similar IRC client with source code listed in the wikibooks project.  I've never seen such a short, small, and easy to understand IRC client.

Lutz

#8
You could try to avoid regular expressions in the send&print function, either by using find without the regex option number or by using simple string comparisons when the search is at the beginning or at the end:



> (set 'str "PING : abcdefgh\n")
"PING : abcdefgh\n"
> (= (last str) "\n")
true
> (= "PING :" (0 6 str))
true
>
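
As a sketch of the same suggestion using other built-ins: find without an option number does a plain substring search, and starts-with and ends-with compare plain strings when no option is given. The sample string below is illustrative only and is not taken from the IRC client.

```newlisp
; Illustrative input: an IRC-style line ending in a newline.
(set 'str "PING : abcdefgh\n")

; Plain substring search: no option number, so no regex is involved.
(set 'pos (find "PING" str))            ; offset of the first occurrence, or nil

; Prefix and suffix tests with plain string comparison.
(set 'is-ping (starts-with str "PING :"))
(set 'has-newline (ends-with str "\n"))

(println pos " " is-ping " " has-newline)
```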

TedWalther

#9
The problem is in the output from the IRC server.  The IRC server doesn't sanitize output, and it isn't guaranteed to be UTF-8.  Only ASCII is guaranteed.


Quote
http://xchat.org/encoding/

The IRC protocol doesn't define a specific character-set that should be used, so over the years, lots of different character-sets have become popular on IRC. This presents a big problem, because, if you're using a different character-set than someone else on IRC, you'll receive their text as garbage. In western countries, we generally use CP1252 (Windows Latin), this is the most popular character-set on IRC.


I note that in the documentation, regex has a "UTF-8" option to enable UTF8 matching.  Since it is enabled by default in UTF8 enabled newlisp, is there a way to disable it, so it works on raw 8bit octets?



Without regex I'll have to take it to the next level of complexity and have a full blown IRC parser.  It is easy to make a railroad diagram for IRC protocol, but I've never been good at taking a railroad diagram and turning it into a state machine.

Lutz

#10
Even on UTF8 enabled newLISP, regex and other regex enabled functions will still allow searching for raw octets or combinations of octets:



> (set 'str "我能吞下玻璃而不伤身体。")
"我能吞下玻璃而不伤身体。"
> (unpack (dup "b" (length str)) str)
(230 136 145 232 131 189 229 144 158 228 184 139 231 142 187 231 146 131 232 128
 140 228 184 141 228 188 164 232 186 171 228 189 147 227 128 130)
> (regex "\189" str)
("?" 5 1)
> (regex "\189" (6 str))
("?" 25 1)
>


this works even when the searched string takes bytes from different adjacent UTF8 characters. The following string is composed from the last byte of the first UTF8 and the first byte of the second UTF8 character.



> (map length (explode str))
(3 3 3 3 3 3 3 3 3 3 3 3)

> (regex "\145\232" str)
("??" 2 2)
>

TedWalther

#11
Ok, so the regex stuff shouldn't be crashing newlisp when "invalid utf8" is detected?



I went back on irc.freenode.net and this time the @list command succeeded, and the ##dashvapes channel says their topic has been the same since February.  Perhaps the crash was caused by the next channel in the list.  But since the order of the channels in the list varies constantly, I have no way of telling which one it was.  Guess I'll have to create a safe_regex function that wraps regex in a catch block.
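
A minimal sketch of such a wrapper, assuming the two-argument form of catch (which returns nil when the expression raises an error and stores the message in the given symbol). The name safe-regex and its default option of 0 are hypothetical choices, not part of newLISP or of the IRC client.

```newlisp
; Hypothetical helper: returns what regex returns, or nil if regex raises an error.
(define (safe-regex pattern text (option 0))
  (let (result nil)
    ; catch with a symbol returns true on success and nil on error;
    ; on success the value of the expression is stored in result.
    (if (catch (regex pattern text option) 'result)
        result
        nil)))

(println (safe-regex "b" "abc"))   ; a normal match
(println (safe-regex "(" "abc"))   ; invalid pattern: nil instead of an error
```

Note that a failed match and a caught error both come back as nil here; a caller that needs to tell them apart would have to return the error message as well.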

Lutz

#12
The "ERR: invalid UTF8 string" error is only thrown by newLISP when a string finishes before a multibyte UTF8 character has finished.



All other regex errors are generated when the PCRE lib regex call comes back with an error and messages like "offset 0 unknown option bit(s) set" which are generated by PCRE.
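
A string that ends in the middle of a multibyte character can be detected at the byte level before it is handed to any UTF8-aware function. The following is only a sketch: the helper names byte-at and incomplete-utf8-tail? are hypothetical, and it checks only the last character of the string, not the validity of the whole string.

```newlisp
; Hypothetical helper: value (0..255) of the byte at byte offset i.
(define (byte-at s i)
  (first (unpack "b" (slice s i 1))))

; Hypothetical helper: true if s ends with a UTF-8 lead byte that is
; followed by fewer continuation bytes than the lead byte announces.
(define (incomplete-utf8-tail? s)
  (let (len (length s) k 1 found nil result nil)
    ; Walk back from the end over at most 4 bytes.
    (while (and (not found) (<= k 4) (<= k len))
      (let (b (byte-at s (- len k)))
        (cond
          ((< b 0x80) (set 'found true))        ; ASCII byte: tail is complete
          ((>= b 0xC0)                          ; lead byte of a multibyte character
             (set 'found true)
             (set 'result
                  (< k (cond ((< b 0xE0) 2) ((< b 0xF0) 3) (true 4)))))
          (true (inc k)))))                     ; continuation byte: keep walking back
    result))

(println (incomplete-utf8-tail? "abc"))            ; plain ASCII
(println (incomplete-utf8-tail? "\230\136\145"))   ; one complete 3-byte character
(println (incomplete-utf8-tail? "\230\136"))       ; the same character cut short
```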

rrq

#13
My 2 cents after looking into the code.



The input buffer (buf) is clipped at 4096 characters, and this quite easily may result in the appearance of a composite UTF8 character being cut off at the end. Thus, since regex has its UTF8 glasses solidly stuck on, one can't apply it to buf.



So I thought that, instead of immediately cleaning the parse result when extracting lines, one could let partial always be its last element, chopped off (or cleaning lines afterwards). That would keep to the same functionality and avoid using regex on buf.



EDIT: actually, my "quite easily" assertion might be wrong... but the fix is still valid...



EDIT 2: the more I study this the more I realize how wrong I was. As far as I can see, that particular error code only arises in quite particular cases: such as functions first and rest, and implicit string indexing. The latter would be the candidate cause here, except that the regex expressions don't have implicit indexing.
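
The idea of keeping the unfinished tail as the last element of the parse result could look roughly like this. It is a sketch only: split-lines is a hypothetical name, the sample input is made up, and the actual variable handling in the IRC client may differ. parse is called with a plain string delimiter and no option number, so no regex is applied to the buffer.

```newlisp
; Hypothetical helper: split buf on "\n" without regex.
; Returns (complete-lines partial), where partial is the unfinished tail
; (an empty string when buf ends exactly on a line break).
(define (split-lines buf)
  (let (parts (parse buf "\n"))
    (if (empty? parts)
        (list '() "")
        (list (chop parts) (last parts)))))

(set 'res (split-lines "PING :a\r\nNOTICE par"))
(println (first res))   ; list of complete lines, each still ending in "\r"
(println (last res))    ; tail to prepend to the next chunk read from the socket
```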

TedWalther

#14
Quote from: "ralph.ronnquist"
The input buffer (buf) is clipped at 4096 characters, and this quite easily may result in the appearance of a composite UTF8 character being cut off at the end. Thus, since regex has its UTF8 glasses solidly stuck on, one can't apply it to buf.


Right.  I'm taking raw bytes from the network layer, and trying to reassemble them into something usable at the application layer.


Quote
So I thought that, instead of immediately cleaning the parse result when extracting lines, one could let partial always be its last element, chopped off (or cleaning lines afterwards). That would keep to the same functionality and avoid using regex on buf.


Ah.  So the problem wasn't some nasty channel having a bad topic; it was perhaps that the network layer broke the packet in the middle of a UTF-8 character.  If there were people using the Asian encodings, the bug would still come up, of course.


Quote
EDIT 2: the more I study this the more I realize how wrong I was. As far as I can see, that particular error code only arises in quite particular cases: such as functions first and rest, and implicit string indexing. The latter would be the candidate cause here, except that the regex expressions don't have implicit indexing.


I guess my question is, why is regex assuming a UTF8 string when I didn't set the UTF8 option?  And since it is set by default now, could there be a flag to disable it?  When searching a string for \r and \n, I am in raw byte mode.  IRC allows UTF8, but as a subset of raw byte streams.



Would specifying "0" as the option be enough to disable UTF8 parsing in regex and eliminate that error message?



Ralph, thanks for reading the source. I appreciate the review.  Next I will try to add readline support.  I think I'll make two apps; one for typing into, one for viewing output.  Then you run them both inside tmux, each one inside its own window. :)