Segmentation fault in replace using regex

Started by cormullion, October 10, 2010, 11:32:24 AM


cormullion

I've been struggling with some code, and I think it's crashing with a segmentation fault when a replace call attempts a complex regex replacement on a large amount of text. After much investigation, I suspect the limit is a string around 225,000 characters long. This turns out to be more than two-thirds but less than three-quarters of the length of a document such as the Introduction to newLISP. (I.e. the code crashes when applied to the whole document but not to a substantial proportion of it. A hard bug to find... :))



First I thought I'd ask to see if there might be some hard limit in replace? The regex is a nested thing, so I'm assuming there's some hard limit to how much you can do once the regexen get a bit complicated...

Lutz

#1
The only limit from the side of newLISP is a maximum of 16 parenthesized subexpressions in your regex pattern. There is no limit from newLISP's side on the size of the text searched or of the string you are searching for. Just now, I did a few tests with a 1 Mbyte text, doing about 10,000 (simple pattern) replacements, and had no problems. I have played a lot with regular expressions and the newLISP manual of about 850,000 bytes, and have never seen a crash using find, find-all and replace with regular expressions.



There may be limits in the PCRE regex implementation (newLISP uses version 5.0), but I am not aware of any. A quick look through the PCRE source only revealed a MATCH_LIMIT of 10,000,000, which is the default number of times the internal match function can be called during a single regex execution, and a stack limit for subexpressions, POSIX_MALLOC_THRESHOLD, which is predefined by the platform operating system, and I don't really understand what that means.



I have seen replace/regex crashing after an "invalid UTF-8 string" message, a few years back, when parsing internet pages. The message is one of the PCRE error messages.



The best way to debug these situations is to try to isolate the substring/segment in the text which causes the crash.
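One way to narrow this down could look like the sketch below. It is only an illustration, not a tested recipe: the name probe-prefixes and the step of 50,000 characters are made up here. It runs the replacement on ever longer prefixes of the text and prints the prefix length before each attempt, so that after a crash the last line printed shows roughly where the trouble starts.

```newlisp
;; Hypothetical helper (not part of newLISP): try the replacement on
;; growing prefixes of txt, announcing each prefix length first.
;; A segmentation fault ends the process, so the last length printed
;; before the crash marks the first prefix that fails.
(define (probe-prefixes pattern txt option (step 50000))
  (let (len step)
    (while (< len (+ (length txt) step))
      (let (n (min len (length txt)))
        (println "trying the first " n " characters")
        ;; slice returns a new string, so txt itself is left untouched
        (replace pattern (slice txt 0 n) "" option))
      (inc len step))
    (println "no crash")))

;; Example call on a small made-up string:
(probe-prefixes {<p>.*?</p>} "<p>one</p>\n\n<p>two</p>" 0 10)
```

Once a failing length is known, the same call with a smaller step over that range narrows it further.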

cormullion

#2
Thanks - that's good news! I'll take a closer look at the regex pattern in that case!

cormullion

#3
The following code gives the same (at least I think it's the same) error as my original. I can't exactly remember what the regex is doing, but it produces a segmentation fault (v.10.2.8 on OSX IPv4 UTF-8).


(set 'txt (read-file {/usr/share/doc/newlisp/newlisp_manual.html}))
(set 'nested-block-regex  [text](^<(p|div|h[1-6]|blockquote|pre)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))[/text])
(println (replace nested-block-regex txt "OK!" 2))


The length of txt is 844,537 - that's a lot higher than the 230,000 or so that I was getting the error with originally.



Perhaps this isn't a problem with PCRE - it seems to work with Perl...



Hide your eyes, Perl-lovers :)


#!/usr/bin/perl

open FILE, "/usr/share/doc/newlisp/newlisp_manual.html" or die "Couldn't open file: $!";
while (<FILE>){
 $string .= $_;
}
close FILE;

$string =~ s{(^<(p|div|h[1-6]|blockquote|pre)\b(.*\n)*?</\2>[ \t]*(?=\n+|\Z))}{print "$1\n"}egmx;

print "$string";

TedWalther

#4
Perl doesn't use PCRE.  It has its own implementation of regex.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Lutz

#5
I could verify that the crash occurs in the PCRE function pcre_exec(), and it also occurs with later PCRE libraries, including the one compiled and shipped by Apple on Mac OS X Snow Leopard. When linking to the Mac OS X library instead of the one compiled into newLISP, the same error occurs.



Note that "PCRE" means "Perl Compatible Regular Expressions"; it does not mean they are using the same code base (as Ted also noted). They just handle the same regex syntax. Perl's implementation seems to handle the nested recursive pattern better. Your pattern seems to hit a limit in the PCRE library.



At first I thought that a later version of PCRE might solve it, but the limitation persists even in the latest version shipped with Mac OS X.



Perhaps you can work around this by partitioning the text into smaller portions, and then do replacement on those.
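A minimal sketch of that workaround, under some assumptions: the helper name replace-by-parts is made up for illustration, and the text is split at blank lines, which only works if no block that the pattern should match contains a blank line itself (a pre block with empty lines would be cut in two).

```newlisp
;; Hypothetical helper (not part of newLISP): split txt at blank lines,
;; run the regex replacement on each part, then put the parts back
;; together with the same separator.
(define (replace-by-parts pattern txt replacement option)
  (join
    (map (fn (part) (replace pattern part replacement option))
         (parse txt "\n\n"))
    "\n\n"))

;; Example call on a small made-up string:
(set 'sample "<p>one</p>\n\n<p>two</p>")
(println (replace-by-parts {<p>.*?</p>} sample "OK!" 0))
```

For the manual, the call would take the pattern and option from post #3 in place of the small example. Each replace call then sees only a short string, which should keep PCRE well away from whatever limit the full text runs into.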



ps: see also here



http://stackoverflow.com/questions/3613121/regular-expression-crashes-apache-due-to-pcre-limitations-need-some-help-optimis

cormullion

#6
OK, I see now. It's not a problem to split the string up - I'm glad to know that the problem isn't in my code or in yours!



Thanks Lutz.