parsing large files (> 5GB)

Started by jopython, December 21, 2011, 01:22:26 PM


jopython

Parsing large log files line by line takes a looong time (several times slower compared to Perl).



------------------

(while (read-line file)
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))



------------------

Is there anything I could use within newLISP to shorten the time for reading files? Say, is it possible to read files in big chunks (say 10MB) and then parse that portion in memory for faster access?

Lutz

#1
In the next development or stable release version (January 2012), at least (read-line) from STDIN will be 3 times faster than it is in the old version. If you can process your log files like this: process < thelogfile.txt, this will help you.
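For reference (not from the post), switching the snippet above to STDIN only means dropping the open call and calling read-line with no channel argument; a minimal sketch, assuming p1 is the precompiled pattern from the first post:

; minimal sketch: read from STDIN instead of a file channel,
; run as: ./apache.lsp < thelogfile.txt
; p1 is assumed to be the precompiled pattern from the snippet above
(while (read-line)                          ; no channel argument -> reads STDIN
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))
(if foo (println foo) (println 0))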



Until then, yes, reading big chunks into memory and doing a:
(dolist (line (parse chunk "\n"))
    ...
)
will be much faster.
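For completeness (not from the post), a fuller sketch of that chunked loop might look as follows; the 1MB buffer size, the file name, and the carry-over handling of a partial last line are illustrative assumptions, and p1/foo are as in the first post (read-buffer can also be written as read in newer releases):

; minimal sketch: process the file in ~1MB chunks instead of line by line
(set 'file (open "thelogfile.txt" "read"))
(set 'carry "")
(while (read-buffer file chunk 0x100000)        ; read roughly 1MB per call
    (set 'chunk (append carry chunk))
    (set 'lines (parse chunk "\n"))
    (set 'carry (pop lines -1))                 ; keep a trailing partial line for the next round
    (dolist (line lines)
        (if (regex p1 line 0x10000) (inc foo))))
(if (> (length carry) 0)                        ; handle a last line without a trailing newline
    (if (regex p1 carry 0x10000) (inc foo)))
(close file)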



ps: even in the old version, (read-line) via STDIN is already about 3 times faster than reading from a file channel.

jopython

#2
The difference is really huge: 8 minutes in newLISP vs. 2 seconds in Perl for a 100MB log file.

I am not a fan of Perl. But I am forced to use it because of its text-processing performance.



$ time ./apache.lsp xaa

3885



real    8m2.346s

user    1m28.495s

sys     6m30.581s



$ cat apache.lsp



;(set 'yesterd (date (date-value) -480 "%d/%b/%Y.+"))
(set 'yesterd [text]10/Dec/2011.+[/text])           ; date part of the pattern
(set 'reg [text]GET /index.html  HTTP/1.1[/text])   ; request part of the pattern
(set 'pattern_str (append yesterd reg))
(set 'p1 (regex-comp pattern_str))                  ; precompile the regex
(set 'file (open ((main-args) 2) "read")) ; the argument to the script
(while (read-line file)
        (if (regex p1 (current-line) 0x10000)       ; 0x10000 -> p1 is precompiled
          (inc foo)))
(if foo (println foo) (println 0))
(exit)


$ time perl -lne 'BEGIN{$h=0;}if (m/10\/Dec\/2011\S+\s+\S+\]\s+\x22GET\s+\/index.html\s+HTTP\/1.1/ox){$h++;}END{print $h}' xaa
3885



real    0m2.704s

user    0m2.285s

sys     0m0.387s

Lutz

#3
That is really a big difference, but I don't believe that newLISP is generally much slower than Perl in text processing. It's not what you hear, and if you look at the benchmarks here: http://www.newlisp.org/benchmarks/, you will see that the differences are minor, and wherever you do see them, line-by-line file reading is involved. But that doesn't explain the huge difference you are seeing.



I calculate about 120ms processing time per line! Even with slower read-line I/O that just doesn't make sense.



I generated the following test file test.txt:

(set 'file (open "test.txt" "w"))
(dotimes (i 100000) (write-line file (join (map string (rand 100 100)))))


using this program:
#!/usr/bin/newlisp

(set 'chan (open (main-args -1) "r"))
(println "time:" (time (while (read-line chan)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)

and run it:
~> ./readchannel test.txt
time:11787.971
read 100000 lines and 18999043 characters
~>

Which is about 120 microseconds per line. Adding simple regular expressions made less than a 1% difference.



Now using STDIN to feed the file:
#!/usr/bin/newlisp

(println "time:" (time (while (read-line)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)

running it:
~> ./readstdin < test.txt
time:264.203
read 100000 lines and 18999043 characters


which is about 2.6 microseconds per line.



The difference between the two methods is that the fast one uses stream reading with fgetc(), while the slower one uses a file-handle-based read(handle, &chr, 1).



What is the experience of others doing text processing with newLISP?



Ps: all measurements were done with 10.3.3 on Mac OS X 10.7.2; on 10.3.10, the faster method takes 1 microsecond or less per line.

jopython

#4
Hmm..



 ./readchannel test.txt

time:87445.189

read 100000 lines and 19000792 characters



$ ./readstdin < test.txt

time:1704.849

read 100000 lines and 19000792 characters



This is an UltraSPARC 25.

jopython

#5
Now, using the STDIN method for the original script, it went down from 8 minutes to 13 seconds. Phew.



time ./apache.lsp < xaa
3885

real    0m13.088s
user    0m12.662s
sys     0m0.243s



Looks like the file-handle read(handle, &chr, 1) method is a bad idea.

Lutz

#6
Yes, and on 10.3.10 and later you will get down to about 4 seconds versus about 2.7 seconds in Perl, which is more in line with the benchmarks done earlier.

cormullion

#7
Quote:
$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters

This is an UltraSPARC 25.

That still seems very slow. Cf:


$ ./readstdin < test.txt
time:186.978
read 100000 lines and 18999056 characters

on an iMac. Perhaps something went wrong with your newLISP installation...



Generally newLISP is not quite as quick as Perl if you write Perl-y newLISP, but it does better if you write newLISP-y newLISP. Even when it's not quite as quick, it seems more fun to write.

jopython

#8
Quote: That still seems very slow.



Yes, they (UltraSPARC III) are slow. They belong to the 2001 era. Sun (now Oracle) SPARCs are generally not optimized for single-threaded performance. In fact, the SPARC CPUs didn't even feature out-of-order execution until recently.