parsing large files (> 5GB)

Started by jopython, December 21, 2011, 01:22:26 PM


jopython

Parsing large log files line by line takes a looong time (several times slower compared to Perl).



------------------

(while (read-line file)
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))



------------------

Is there anything I could use within newLISP to shorten the time for reading files? Say, is it possible to read files in big chunks (say 10MB) and then parse that portion in memory for faster access?

Lutz

#1
In the next development or stable release version (January 2012), at least (read-line) from STDIN will be 3 times faster than it is in the old version. If you can process your log files like this: process < thelogfile.txt, this will help you.
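For reference (not from the post), switching the snippet above to STDIN only means dropping the open call and calling read-line with no channel argument; a minimal sketch, assuming p1 is the precompiled pattern from the first post:

; minimal sketch: read from STDIN instead of a file channel,
; run as: ./apache.lsp < thelogfile.txt
; p1 is assumed to be the precompiled pattern from the snippet above
(while (read-line)                          ; no channel argument -> reads STDIN
    (if (regex p1 (current-line) 0x10000)
        (inc foo)))
(if foo (println foo) (println 0))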



Until then, yes, reading big chunks into memory and doing a:
(dolist (line (parse chunk "\n"))
    ...
)
will be much faster.
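For completeness (not from the post), a fuller sketch of that chunked loop might look as follows; the 1MB buffer size, the file name, and the carry-over handling of a partial last line are illustrative assumptions, and p1/foo are as in the first post (read-buffer can also be written as read in newer releases):

; minimal sketch: process the file in ~1MB chunks instead of line by line
(set 'file (open "thelogfile.txt" "read"))
(set 'carry "")
(while (read-buffer file chunk 0x100000)        ; read roughly 1MB per call
    (set 'chunk (append carry chunk))
    (set 'lines (parse chunk "\n"))
    (set 'carry (pop lines -1))                 ; keep a trailing partial line for the next round
    (dolist (line lines)
        (if (regex p1 line 0x10000) (inc foo))))
(if (> (length carry) 0)                        ; handle a last line without a trailing newline
    (if (regex p1 carry 0x10000) (inc foo)))
(close file)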



ps: even in the old version, (read-line) via STDIN is already about 3 times faster than reading from a file channel.

jopython

#2
The difference is really huge: 8 minutes in newLISP vs. 2 seconds in Perl for a 100MB log file.

I am not a fan of Perl. But I am forced to use it because of its text-processing performance.



$ time ./apache.lsp xaa

3885



real    8m2.346s

user    1m28.495s

sys     6m30.581s



$ cat apache.lsp



;(set 'yesterd (date (date-value) -480 "%d/%b/%Y.+"))
(set 'yesterd [text]10/Dec/2011.+[/text])           ; date part of the pattern
(set 'reg [text]GET /index.html  HTTP/1.1[/text])   ; request part of the pattern
(set 'pattern_str (append yesterd reg))
(set 'p1 (regex-comp pattern_str))                  ; precompile the regex
(set 'file (open ((main-args) 2) "read")) ; the argument to the script
(while (read-line file)
        (if (regex p1 (current-line) 0x10000)       ; 0x10000 -> p1 is precompiled
          (inc foo)))
(if foo (println foo) (println 0))
(exit)


$ time perl -lne 'BEGIN{$h=0;}if (m/10\/Dec\/2011\S+\s+\S+\]\s+\x22GET\s+\/index.html\s+HTTP\/1.1/ox){$h++;}END{print $h}' xaa
3885



real    0m2.704s

user    0m2.285s

sys     0m0.387s

Lutz

#3
That is really a big difference, but I don't believe that newLISP is generally much slower than Perl in text processing. It's not what you hear, and if you look at the benchmarks here: http://www.newlisp.org/benchmarks/, you will see that the differences are minor, and wherever you do see them, line-by-line file reading is involved. But that doesn't explain the huge difference you are seeing.



I calculate about 120ms processing time per line! Even with slower read-line I/O that just doesn't make sense.



I generated the following test file test.txt:

(set 'file (open "test.txt" "w"))
(dotimes (i 100000) (write-line file (join (map string (rand 100 100)))))


using this program:
#!/usr/bin/newlisp

(set 'chan (open (main-args -1) "r"))
(println "time:" (time (while (read-line chan)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)

and run it:
~> ./readchannel test.txt
time:11787.971
read 100000 lines and 18999043 characters
~>

Which is about 120 microseconds per line. Adding simple regular expressions made less than a 1% difference.



Now using STDIN to feed the file:
#!/usr/bin/newlisp

(println "time:" (time (while (read-line)
    (inc lcnt)
    (inc cnt (length (current-line)))
)))

(println "read " lcnt " lines and " cnt " characters")
(exit)

running it:
~> ./readstdin < test.txt
time:264.203
read 100000 lines and 18999043 characters


which is about 2.6 microseconds per line.



The difference between the two methods is that the fast one uses stream reading with fgetc(), while the slower one uses a file-handle-based read(handle, &chr, 1).



What is the experience of others doing text processing with newLISP?



Ps: all measurements were done with 10.3.3 on Mac OS X 10.7.2; on 10.3.10, the faster method takes 1 microsecond or less per line.

jopython

#4
Hmm..



 ./readchannel test.txt

time:87445.189

read 100000 lines and 19000792 characters



$ ./readstdin < test.txt

time:1704.849

read 100000 lines and 19000792 characters



This is an UltraSPARC 25.

jopython

#5
Now, using the STDIN method for the original script, it went down from 8 minutes to 13 seconds. Phew.



time ./apache.lsp < xaa
3885

real    0m13.088s
user    0m12.662s
sys     0m0.243s



Looks like the file-handle read(handle, &chr, 1) method is a bad idea.

Lutz

#6
Yes, and on 10.3.10 and later you will get down to about 4 seconds versus about 2.7 seconds in Perl, which is more in line with the benchmarks done earlier.

cormullion

#7
Quote:
$ ./readstdin < test.txt
time:1704.849
read 100000 lines and 19000792 characters

This is an UltraSPARC 25.

That still seems very slow. Cf:


$ ./readstdin < test.txt
time:186.978
read 100000 lines and 18999056 characters

on an iMac. Perhaps something went wrong with your newLISP installation...



Generally newLISP is not quite as quick as Perl if you write Perl-y newLISP, but it does better if you write newLISP-y newLISP. Even when it's not quite as quick, it seems more fun to write.

jopython

#8
Quote: That still seems very slow.



Yes, they (UltraSPARC III) are slow. They belong to the 2001 era. Sun (now Oracle) SPARCs are generally not optimized for single-threaded performance. In fact, the SPARC CPUs didn't even feature out-of-order execution until recently.