newLISP Fan Club

Forum => newLISP in the real world => Topic started by: Fritz on April 08, 2010, 10:53:46 AM

Title: find-all and empty strings
Post by: Fritz on April 08, 2010, 10:53:46 AM
For some reason find-all operator stops on empty strings:


> (find-all "(?<=,)[^,]*(?=,)" ",1,10,,100,1000,")
("1" "10" "")


I did expect here:


("1" "10" "" "100" "1000")

I think, somewhere inside find-all operator is a check: (if (= found-string prev-string) stop-search)



Is it a feature to make programmers use "parse" operator for empty string searching?
Title: Re: find-all and empty strings
Post by: Lutz on April 08, 2010, 12:43:36 PM
'find-all' stops searching if a regular expression doesn't advance the search-offset and the instance found is of length zero. Without this condition 'find-all' would run into an infinite loop on certain regular expressions.



When using 'find-all', the safest method is, to use the regular expression to define the contents of the token, not the borders surrounding it, e.g:
(find-all "[^,]+" ",1,10,,100,1000," ) => ("1" "10" "100" "1000")
When using 'parse' define the borders in-between the tokens:
(parse ",1,10,,100,1000," ",") => ("" "1" "10" "" "100" "1000" "")
... then you could use 'clean' to get rid empty strings:
(clean empty? (parse ",1,10,,100,1000," ",")) => ("1" "10" "100" "1000")
Title: Re: find-all and empty strings
Post by: Fritz on April 08, 2010, 01:16:44 PM
Here is an example from Friddle`s book:


(find-all {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
 {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K})


It will be very hard to make parse operator to process quotes correctly. And I can not just ignore empty fields, becouse their place is important. (I use some complex regexp when parsing data in csv, rtf etc).



However, it is not too important. I can just replace ",," to the ",_empty_," before applying find-all.
Title: Re: find-all and empty strings
Post by: Lutz on April 09, 2010, 01:02:55 AM
Here is a solution using 'find' with offset:
(set 'offset 0)
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
{Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
(push $0 lst -1)
        (println "->" (offset $0) "<-")
(inc offset (+ 1 (length $0)))
)


lst => ("Ten Thousand" "10000" " 2710 " "" ""10,000"" ""It's ""10 Grand"", baby"" "10K")

(lst -2) => ""It's ""10 Grand"", baby""

but this solution also shows the dilemma, we are in. After each successful  'find', we calculate the new offset adding the length to the old offset +1. The +1 operation is, what avoids the infinite loop, but it also has the potential to leave out an important character in a different situation.



In the current example, the character skipped is the comma field delimiter. But in another case the character might be part of the following token and should not be skipped. In above example if we don't skip, we end up in an infinite loop.
Title: Re: find-all and empty strings
Post by: Fritz on April 09, 2010, 01:46:20 AM
I guess additional check for zero-length will help:


(set 'offset 0 'lst '())
(while (set 'pos (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,"10,000","It's ""10 Grand"", baby",10K} 0 offset ))
  (if (and lst (= (last lst) $0 ""))
    (inc offset)
    (begin
      (set 'offset (+ offset (length $0)))
      (push $0 lst -1))))


As far as I know, there is a metacharacter G in the Perl:



"G  Match only where previous m//g left off (works only with /g)"



I have no experience with PCRE standarts, but, may be, there are some preset rules — how should search engine process empty strings?
Title: Re: find-all and empty strings
Post by: Lutz on April 09, 2010, 02:28:34 AM
Unfortunately this strategy will swallow legitimate empty strings, if they follow each other. This example will produce the same result:


{Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K}
Title: Re: find-all and empty strings
Post by: Fritz on April 09, 2010, 02:56:54 AM
Quote from: "Lutz"Unfortunately this strategy will swallow legitimate empty strings


Yep, I have made an error: I did not took into account difference between starting search position and actual offset of the found item:


(set 'offset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (and lst (= (list offset $0) (last lst)))
    (inc offset)
    (push (list offset $0) lst -1)))

(silent
  (println "--- Found: " (length lst) " ---")
  (map println lst))
Title: Re: find-all and empty strings
Post by: Lutz on April 09, 2010, 05:02:08 AM
Checking the previous offset seems to be sufficient to detect the loop condition, as it implies that a length zero string was found at the same offset.



This is how it will be implemented in C in the next development version:


(set 'offset 0 'lastOffset 0 'lst '())
(while (set 'offset (find {(^|(?<=,))("(([^"]|"")*)"|([^",]*))}
  {Ten Thousand,10000, 2710 ,,,,,,,"10,000","It's ""10 Grand"", baby",10K} 0 offset))
  (set 'offset (+ offset (length $0)))
  (if (= offset lastOffset)
    (inc offset)
    (begin
        (set 'lastOffset offset)
        (push $0 lst -1))))


... so far it has passed all tests.
Title: Re: find-all and empty strings
Post by: Fritz on April 09, 2010, 05:26:43 AM
As far as I can see, we will not be able to locate two different strings (empty and non-empty) at the same offset.


(find-all "((?<=FILE:)[0-9]*)|(sample.gif)" "FILE:sample.gif")

But, probably, there will be no such situations in real life.