find-all and empty strings

Started by Fritz, April 08, 2010, 10:53:46 AM

Previous topic - Next topic

Fritz

For some reason find-all operator stops on empty strings:


> (find-all "(?<=,)[^,]*(?=,)" ",1,10,,100,1000,")
("1" "10" "")


I did expect here:


("1" "10" "" "100" "1000")

I think, somewhere inside find-all operator is a check: (if (= found-string prev-string) stop-search)



Is it a feature to make programmers use "parse" operator for empty string searching?

Lutz

#1
'find-all' stops searching if a regular expression doesn't advance the search-offset and the instance found is of length zero. Without this condition 'find-all' would run into an infinite loop on certain regular expressions.



When using 'find-all', the safest method is, to use the regular expression to define the contents of the token, not the borders surrounding it, e.g:
(find-all "[^,]+" ",1,10,,100,1000," ) => ("1" "10" "100" "1000")
When using 'parse' define the borders in-between the tokens:
(parse ",1,10,,100,1000," ",") => ("" "1" "10" "" "100" "1000" "")
... then you could use 'clean' to get rid empty strings:
(clean empty? (parse ",1,10,,100,1000," ",")) => ("1" "10" "100" "1000")

Fritz

#2
Here is an example from Friddle`s book:


(find-all {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
 {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K})


It will be very hard to make parse operator to process quotes correctly. And I can not just ignore empty fields, becouse their place is important. (I use some complex regexp when parsing data in csv, rtf etc).



However, it is not too important. I can just replace ",," to the ",_empty_," before applying find-all.

Lutz

#3
Here is a solution using 'find' with offset:
(set 'offset 0)
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
(push $0 lst -1)
        (println "->" (offset $0) "<-")
(inc offset (+ 1 (length $0)))
)


lst => ("Ten Thousand" "10000" " 2710 " "" ""10,000"" ""It's ""10 Grand"", baby"" "10K")

(lst -2) => ""It's ""10 Grand"", baby""

but this solution also shows the dilemma, we are in. After each successful  'find', we calculate the new offset adding the length to the old offset +1. The +1 operation is, what avoids the infinite loop, but it also has the potential to leave out an important character in a different situation.



In the current example, the character skipped is the comma field delimiter. But in another case the character might be part of the following token and should not be skipped. In above example if we don't skip, we end up in an infinite loop.

Fritz

#4
I guess additional check for zero-length will help:


(set 'offset 0 'lst '())
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
  (if (and lst (= (last lst) $0 ""))
    (inc offset)
    (begin
      (set 'offset (+ offset (length $0)))
      (push $0 lst -1))))


As far as I know, there is a metacharacter G in the Perl:



"G  Match only where previous m//g left off (works only with /g)"



I have no experience with PCRE standarts, but, may be, there are some preset rules — how should search engine process empty strings?

Lutz

#5
Unfortunately this strategy will swallow legitimate empty strings, if they follow each other. This example will produce the same result:


{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K}

Fritz

#6
Quote from: "Lutz"Unfortunately this strategy will swallow legitimate empty strings


Yep, I have made an error: I did not took into account difference between starting search position and actual offset of the found item:


(set 'offset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (and lst (= (list offset $0) (last lst)))
    (inc offset)
    (push (list offset $0) lst -1)))

(silent
  (println "--- Found: " (length lst) " ---")
  (map println lst))

Lutz

#7
Checking the previous offset seems to be sufficient to detect the loop condition, as it implies that a length zero string was found at the same offset.



This is how it will be implemented in C in the next development version:


(set 'offset 0 'lastOffset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (= offset lastOffset)
    (inc offset)
    (begin
        (set 'lastOffset offset)
        (push $0 lst -1))))


... so far it has passed all tests.

Fritz

#8
As far as I can see, we will not be able to locate two different strings (empty and non-empty) at the same offset.


(find-all "((?<=FILE:)[0-9]*)|(sample.gif)" "FILE:sample.gif")

But, probably, there will be no such situations in real life.