About parse and scannig

Started by Maurizio, February 24, 2005, 08:07:27 AM

Previous topic - Next topic

Maurizio

I try to parse a text  file separating every "word"

according to a regular expression.

The code is reported below

What I see, is that even with a moderate text file (190k)

it is very slow.

I suppose this is due to the continuos string splitting after each

succesfull parse.

Any suggestion ?

Btw, should be fine to have a "scan" function, that works like

"parse", but in which you specify the " token" syntax instead of the separator one.

Regards

Maurizio.





(define (scan expr str)
  (let  ((lst '())  
         (res '()))
     (while (set 'res (regex expr str))
       (push (first res) lst -1)
       (set 'str (slice str (+ (first(rest res)) (first (rest(rest res)))))))
     lst))
     
     
(define (go)
   (let  ((expr  {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} ))
     (scan expr (read-file "myfile.txt"))))
     
[/code]

Lutz

#1
from a program http://www.newlisp.org/code/get-normans.txt">http://www.newlisp.org/code/get-normans.txt on the contributions page, the following trick:

;; get the text for scanning
(set 'txt (read-file "myfile.txt"))

;; set the the regular expression pattern
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} )

;; push all tokens onto 'tokens'
(replace expr txt (push $1 tokens -1) 0)

;; the list in 'tokens' contains all tokens found


This should be many times faster, because 'replace' walks internally through all the text. The work is done in the replacement expression which pushes the token found to the end of the list and returns the token itself, so 'txt' will be unchanged afterwards.



Basically 'replace' is the scanner function in this solution.



Lutz

Maurizio

#2
Thank you very much.



btw, it seems to me

  (push $0 tokens -1)



and not

  (push $1 tokens -1)



Regards

Maurizio

Lutz

#3
Yes, of course $0, in the original program I extracted a subexpression. I wonder how big the speed improvement will be for you?



Lutz

Maurizio

#4
The speed improvement is about 4/5 times

I mean, before your suggestion the parse time was about 5 seconds,

now only 1 sec.

(I'm just experimenting, I really dont need to parse very big files)



Thanks,



Maurizio

eddier

#5
That's interesting. I use replace on regular expressions very often and I find that this operation is very fast. I would have thought that the time would go down drastically, like maybe to  1/100 of a second.



Eddie

Maurizio

#6
I've seen a certain improvement reorganizing the alternatives in the

regex so that the patterns that recognize the more frequents tokens are

the firsts in the expression.

anyway I'm trying with a 130kb file with 15600 tokens.

Now the scan time is about 1 second.

(pentium 4 - 2.80 ghz, + hyperthreading)





Maurizio

eddier

#7
Yes, different ways of writing regular expressions can drastically change performance. Complex expressions involving a bunch of "|"s will obviously take longer.



Eddie