I am trying to parse a text file, separating every "word"
according to a regular expression.
The code is reported below.
What I see is that even with a moderately sized text file (190 KB)
it is very slow.
I suppose this is due to the continuous string slicing after each
successful match.
Any suggestions?
By the way, it would be nice to have a "scan" function that works like
"parse", but in which you specify the token syntax instead of the separator syntax.
Regards
Maurizio.
(define (scan expr str)
  (let ((lst '())
        (res '()))
    ;; 'regex' returns (match offset length ...) or nil when nothing matches
    (while (set 'res (regex expr str))
      ;; collect the matched token
      (push (first res) lst -1)
      ;; cut off everything up to the end of the match and rescan;
      ;; this repeated re-slicing of the whole string is the slow part
      (set 'str (slice str (+ (first (rest res)) (first (rest (rest res)))))))
    lst))
(define (go)
  (let ((expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]}))
    (scan expr (read-file "myfile.txt"))))
Borrowing from the program http://www.newlisp.org/code/get-normans.txt on the contributions page, try the following trick:
;; get the text for scanning
(set 'txt (read-file "myfile.txt"))
;; set the regular expression pattern
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]} )
;; push all tokens onto 'tokens'
(replace expr txt (push $1 tokens -1) 0)
;; the list in 'tokens' contains all tokens found
This should be many times faster, because 'replace' walks through all the text internally. The work is done in the replacement expression, which pushes the token found onto the end of the list and returns the token itself, so 'txt' will be unchanged afterwards.
Basically, 'replace' is the scanner function in this solution.
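To see the mechanism in isolation, here is a toy example (the 'begin' just makes the replacement expression explicitly return the token, so the text is replaced with itself):

(set 'tokens '())
(set 'txt "a1 b22 c333")
(replace {[0-9]+} txt (begin (push $0 tokens -1) $0) 0)
tokens  ;=> ("1" "22" "333")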
Lutz
Thank you very much.
By the way, it seems to me it should be
(push $0 tokens -1)
and not
(push $1 tokens -1)
Regards
Maurizio
Yes, of course, $0; in the original program I was extracting a subexpression. I wonder how big the speed improvement will be for you?
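For reference, after a regex operation '$0' always holds the whole match, while '$1' holds the first parenthesized subexpression:

(regex {([[:alpha:]]+)[[:digit:]]+} "abc123")
$0  ;=> "abc123"  the whole match
$1  ;=> "abc"     the first subexpression

Wrapped into a reusable function, the corrected scanner looks something like this sketch ('scan2' is just an illustrative name):

(define (scan2 expr str)
  (let ((tokens '()))
    ;; replace each match with itself, collecting it as a side effect
    (replace expr str (begin (push $0 tokens -1) $0) 0)
    tokens))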
Lutz
The speed improvement is about 4 to 5 times:
before your suggestion the parse time was about 5 seconds,
now it is only 1 second.
(I'm just experimenting; I don't really need to parse very big files.)
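In case anyone wants to reproduce the measurement, I'm timing it roughly like this ('time' returns the evaluation time in milliseconds; 'expr' is the pattern from above):

(set 'txt (read-file "myfile.txt"))
(time (scan expr txt))   ; original slicing version
(time (scan2 expr txt))  ; replace-based version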
Thanks,
Maurizio
That's interesting. I use 'replace' with regular expressions very often, and I find that this operation is very fast. I would have expected the time to go down drastically, maybe to 1/100 of a second.
Eddie
I've seen a noticeable improvement from reorganizing the alternatives in the
regex so that the patterns that recognize the most frequent tokens come
first in the expression, roughly as sketched below.
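The idea (the token frequencies here are assumed for illustration; the single-character [[:graph:]] catch-all must stay last either way, since alternatives are tried left to right):

;; if identifiers are the most common token, try them first:
(set 'expr {[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|"(""|[^"])*"|'(''|[^'])*'|[[:graph:]]})
;; if quoted strings dominated instead, they would move to the front:
(set 'expr {"(""|[^"])*"|[[:alpha:]]+[[:alnum:]]*|[[:digit:]]+|'(''|[^'])*'|[[:graph:]]})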
Anyway, I'm testing with a 130 KB file containing 15,600 tokens.
The scan time is now about 1 second.
(Pentium 4, 2.80 GHz, with Hyper-Threading)
Maurizio
Yes, different ways of writing regular expressions can drastically change performance. Complex expressions involving a lot of '|' alternatives will obviously take longer to match.
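A rough way to see the effect (assuming a large string in 'txt'; the numbers will vary by machine):

(time (replace {[[:alpha:]]+} txt $0 0))                            ; one alternative
(time (replace {[[:alpha:]]+|[[:digit:]]+|[[:punct:]]+} txt $0 0))  ; several alternatives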
Eddie