I am trying to parse a text file, separating every "word"
according to a regular expression.
The code is reported below.
What I see is that even with a moderately sized text file (190 KB)
it is very slow.
I suppose this is due to the continuous string slicing after each
successful match.
Any suggestions?
By the way, it would be nice to have a "scan" function that works like
"parse", but in which you specify the token syntax instead of the separator syntax.
Regards
Maurizio.
(define (scan expr str)
  (let ((lst '())
        (res '()))
    ;; 'regex' returns (match offset length ...) or nil when nothing matches
    (while (set 'res (regex expr str))
      ;; collect the matched token
      (push (first res) lst -1)
      ;; cut off everything up to the end of the match and rescan;
      ;; this repeated re-slicing of the whole string is the slow part
      (set 'str (slice str (+ (first (rest res)) (first (rest (rest res)))))))
    lst))
(define (go)
  (let ((expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]}))
    (scan expr (read-file "myfile.txt"))))
Borrowing from the program http://www.newlisp.org/code/get-normans.txt on the contributions page, try the following trick:
;; get the text for scanning
(set 'txt (read-file "myfile.txt"))
;; set the regular expression pattern
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} )
;; push all tokens onto 'tokens'
(replace expr txt (push $1 tokens -1) 0)
;; the list in 'tokens' contains all tokens found
This should be many times faster, because 'replace' walks through all the text internally. The work is done in the replacement expression, which pushes the token found onto the end of the list and returns the token itself, so 'txt' will be unchanged afterwards.
Basically, 'replace' is the scanner function in this solution.
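To see the mechanism in isolation, here is a toy example (the 'begin' just makes the replacement expression explicitly return the token, so the text is replaced with itself):

(set 'tokens '())
(set 'txt "a1 b22 c333")
(replace {[0-9]+} txt (begin (push $0 tokens -1) $0) 0)
tokens  ;=> ("1" "22" "333")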
Lutz
Thank you very much.
By the way, it seems to me it should be
(push $0 tokens -1)
and not
(push $1 tokens -1)
Regards
Maurizio
Yes, of course, $0; in the original program I was extracting a subexpression. I wonder how big the speed improvement will be for you?
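For reference, after a regex operation '$0' always holds the whole match, while '$1' holds the first parenthesized subexpression:

(regex {([[:alpha:]]+)[[:digit:]]+} "abc123")
$0  ;=> "abc123"  the whole match
$1  ;=> "abc"     the first subexpression

Wrapped into a reusable function, the corrected scanner looks something like this sketch ('scan2' is just an illustrative name):

(define (scan2 expr str)
  (let ((tokens '()))
    ;; replace each match with itself, collecting it as a side effect
    (replace expr str (begin (push $0 tokens -1) $0) 0)
    tokens))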
Lutz
The speed improvement is about 4 to 5 times:
before your suggestion the parse time was about 5 seconds,
now it is only 1 second.
(I'm just experimenting; I don't really need to parse very big files.)
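In case anyone wants to reproduce the measurement, I'm timing it roughly like this ('time' returns the evaluation time in milliseconds; 'expr' is the pattern from above):

(set 'txt (read-file "myfile.txt"))
(time (scan expr txt))   ; original slicing version
(time (scan2 expr txt))  ; replace-based version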
Thanks,
Maurizio
That's interesting. I use 'replace' with regular expressions very often, and I find that this operation is very fast. I would have expected the time to go down drastically, maybe to 1/100 of a second.
Eddie
I've seen a noticeable improvement from reorganizing the alternatives in the
regex so that the patterns that recognize the most frequent tokens come
first in the expression, roughly as sketched below.
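The idea (the token frequencies here are assumed for illustration; the single-character [[:graph:]] catch-all must stay last either way, since alternatives are tried left to right):

;; if identifiers are the most common token, try them first:
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]})
;; if quoted strings dominated instead, they would move to the front:
(set 'expr {"(""|[^"])*"|[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|'(''|[^'])*'|[[:graph:]]})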
Anyway, I'm testing with a 130 KB file containing 15,600 tokens.
The scan time is now about 1 second.
(Pentium 4, 2.80 GHz, with Hyper-Threading)
Maurizio
Yes, different ways of writing regular expressions can drastically change performance. Complex expressions involving a lot of '|' alternatives will obviously take longer to match.
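A rough way to see the effect (assuming a large string in 'txt'; the numbers will vary by machine):

(time (replace {[[:alpha:]]+} txt $0 0))                            ; one alternative
(time (replace {[[:alpha:]]+|[[:digit:]]+|[[:punct:]]+} txt $0 0))  ; several alternatives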
Eddie