(a) tuning "search" I/O (b) bug?

Started by lithper, February 25, 2008, 08:44:05 PM


lithper

Quote: This post is:

(a) (hopefully) a hint on how to optimize one of newLISP's file I/O operations so that it at least begins to compete with Perl's

(b) a bug report (an unintended behaviour pattern that becomes visible in a certain case)


1. Scenario 1. Suppose we want to write an app with a web user interface that is to be used by lots of people. The first task is to find a language that lets us pack the app as one piece and eliminates the need to install and configure programs, libraries and systems.



Scenario 2. If our application is also to be used in a multiuser environment, it is best written in such a way that the only installation step is copying one file into a cgi-bin directory. It must be light enough, however, to serve many people - say, keep the web server's delays to about 1 second when it is pounded by peak loads of 20-70 requests per second.



newLISP is somewhat unique in that it could probably do this.



2. For these scenarios - and especially the first one - it's vital not to get tied to a separate DB engine. So the data our app serves and keeps should be available, fast, from the filesystem, possibly with some simple text-file indices.

In my experience flat text files are OK for up to tens of thousands of records; when the numbers grow, an indexer should be used, or the flat text-file indices split into multiple levels.



3. So newLISP could do that. However, its I/O speed when searching in files is not on par with Perl's.



4. The (read-line) - (process) - (print your_line) sequence is very, very slow in newLISP.



On my test file (text) of 31k lines ("records") it takes 1.5 seconds, or even several seconds, to pull, say, 20 records from somewhere near the end. Obviously, that is beyond any usable range.
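
For reference, a minimal sketch of such a line-by-line scan in newLISP (the file name and search term are placeholders, not my actual test setup):

        ; naive (read-line) scan -- the slow variant described above
        (set 'fh (open "data.pro" "read"))
        (while (read-line fh)                      ; read each record line by line
            (if (find "allan" (current-line))      ; plain substring match
                (println (current-line))))         ; print the matching record
        (close fh)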

Perl scans the same file with
perl -ne 'm/term_to_match/ && print' filename
in 0.150 seconds, i.e. 10 times faster


5. Probably because of this, newLISP includes another operator created specifically to search in files: "search".



It loads a chunk of the file, checks for the pattern, and if the pattern is found, seeks back to its first character. Next you can use (read-line), and it will show you what remains of the line.
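
To illustrate, a minimal sketch using the plain-string form of "search" (file name and search word are placeholders):

        (set 'fh (open "data.pro" "read"))
        (if (search fh "allan")            ; leaves the file pointer at the start of the match
            (println (read-line fh)))      ; so only the tail of the matching record is printed
        (close fh)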

To display the full line it is possible to use a regexp, e.g. ".*my_search_word" - but again at a great cost to speed.



One solution would be to write the indices in such a way that the truncation does not spoil getting at the result, but that is another question.



Another approach is to slurp the whole file into a memory buffer and match lines in memory, a method that would eliminate many I/O operations.
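
A minimal sketch of that slurp-and-scan variant (again with placeholder names; read-file pulls the whole file into one string, parse splits it into lines):

        (set 'buf (read-file "data.pro"))      ; slurp the whole index file into memory
        (dolist (line (parse buf "\n"))        ; split the buffer into lines
            (if (find "allan" line)            ; match in memory, no further file I/O
                (println line)))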





6. Tuning the "search" operation. The "search" operator is better than the (read-line ...) cycle, but it is still slower than Perl.



Perl scans my 31k-line file in flat time, 0.120 seconds, picking any number of records from any place in it.

newLISP slows down towards the end of the file, and competes with Perl only for the first, say, 100 records.



Is it possible to find optimizations?



I checked the source code, and it looks like adjusting the size of the buffer "search" uses to read the file can speed it up roughly 3 times; at least that's what it did on my machine.
Quote: Look in the source code file "nl-string.c".

The SEARCH_SIZE constant is set to 0x4000 by default.



Reset it to 0x1000, then recompile newLISP. This speeds up the "search" operator roughly three times.



Possibly because 4k is the page size on my system.

Further shrinking of the buffer did not improve the search operation in any considerable way, so 0x1000 is the size I kept (actually, because memory is allocated as SEARCH_SIZE + 1, I decided, rightly or wrongly, to set it to 0xFFF).


Now my scans of line-records in an "index" file could compete with Perl scans up to roughly 1000 records, and at 10k records Perl was only three times faster, not 10 times.



As I am unfamiliar with the code, the question is: is this adjustment harmless? At first sight, from reading the source for p_search in nl-string.c, it seems so. Could someone reassure me, just in case?


7. Comparisons.



 Perl on a text file of 31k records, looking for 20 matches at 1400th "depth":
time perl -ne 'BEGIN{$i=0}; (m/allan/) && ($i++) && ( ($i > 1380) && (print)) && ($i > 1400 ) && (exit)' data.pro|wc -l
........real    0m0.137s


newLISP on the same file (before tuning):
time ./lgrep.lsp  1380 1400 'allan' data.pro |wc -l
........real    0m0.407s


newLISP after tuning the SEARCH_SIZE parameter:
time ./lgrep.lsp 1380 1400 allan data.pro |wc -l
........real    0m0.137s


And, finally, a search for 20 records around the 10000th match, tucked in the far end of the 31k-record file:


Quote: Perl, same invocation, getting 20 records from the 9980th to the 10000th match

........real    0m0.157s

Perl scans files in flat time



newLISP: same parameters -

........real    0m1.480s -- before optimization

........real    0m0.516s -- after tuning



newLISP's time adds up as we dive deeper into the file or as the number of extracted records grows. This lag might be further reduced if the need for a dummy I/O operation were eliminated.


8. Bug or Feature?

Realistic queries into this text file will not, however, accumulate all matching records in the manner of "grep", which prints everything it finds. Realistically, in such an application I would need to get, for example, 20-item chunks of records (e.g. to display a TOC of available documents on the site, page by page).



So, I would need to do a query like

"get me info on records between 980 and 1000 from the index file"



This will involve doing a "search", skipping its result, then the next, until the 980th match is found; only then will I start doing (read-line) and print - until match 1000, when I abort the procedure.


Quote: The implementation of "search" does not allow another search to move forward unless some read from the file was performed: I have not found it in the code, but it seems the position in the file is not advanced when "search" is repeated - unless some operation happens that advances it forward a bit, after which the next search works fine.


It might as well be a feature,  I do not know, but to me it looks like unintended behaviour.


9. Can one code around it?  - Yes, but..



 -- One can do a fake (read-line)  -- tested, it slows down the whole thing

 -- One could seek forward one byte or so - but the newLISP implementation of "seek" does not allow a relative step, and a fix like importing a "seek" from libc seems too cumbersome a solution (a possible emulation is sketched after this list)

 -- One could do a fake "search" on a different pattern that would advance, say, one character or two, so that the next real search would work again. That is way too slow

 -- And, finally, the solution I use for now: do a fake (read-char), which seems the most lightweight.
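
On the seek-based idea above: as far as I can tell from the manual, (seek fh) with no position argument returns the current offset, so a one-byte forward step could perhaps be emulated with two absolute calls. I have not benchmarked this against the (read-char) workaround, so treat it as an untested sketch:

        ; untested sketch: emulate a one-byte relative step with two absolute seeks
        (seek fh (+ (seek fh) 1))   ; read the current offset, then set offset + 1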



Let's look at a simplistic "test-grep" script to see what this is about:


        (set 'fh (open processed_file "read"))
        (set 'v_cnt 0)

        (while (search fh (string str_pattern) 0)
            (inc 'v_cnt)
            (if (> v_cnt (int from_rec_num))

                ; yes, the match is in the requested range
                (println (read-line fh))

                ; no, skip this match -- here's the ugly part
                (read-char fh) ; --> other workarounds are slower
            )

            (and
                (>= v_cnt (int to_rec_num))
                (close fh)
                (println "\n\t first " from_rec_num " to " v_cnt " matches in " processed_file)
                (exit))

        ) ;-- end of while = grep for given file --
        (close fh)





As I said, this code (after the SEARCH_SIZE tuning) competes with Perl's scan (also with a cutoff after the needed number of matches is extracted) up to, say, 1000 records in roughly any place in a 31000-line file.



newLISP begins to lag when the number of needed matches grows (e.g. you need 1000 of them), and/or when you try to pick up records starting in the thousands (the 4000th or 10000th). At 10k matches (scattered in various places inside that 31k file) newLISP seems to run at 1/3 of Perl's speed, after the suggested tuning.



I believe that if the need to "jerk" the search operator with an I/O operation is eliminated, the newLISP "search" operator can be sped up further.

Jeff

#1
You shouldn't be storing data for a web application in a flat text file.  It's not efficient.  And besides, it's *soooooo* perl cgi from the 90s.



Anyway, if you want a scalable, cross-database solution, write an ORM and include an interface for "drivers", which provide a consistent API to a database for the ORM to use.

=====

Old programmers don't die. They just parse on...

Artful code: http://artfulcode.net

lithper

#2
Quote from "Jeff": You shouldn't be storing data for a web application in a flat text file. It's not efficient. And besides, it's *soooooo* perl cgi from the 90s.


No! The fastest db application on earth stores data in flat text files, as far as I know.



If you drop the religious view of things, databases are an intermediary application, just like your local Fry's Electronics salesman, who raises prices by almost exactly 50% compared to the wholesale price level. A database simply duplicates what the operating system provides anyway - just to stick itself in and collect its dues.

If there is a need for anything, it's for small standalone indexers.



Anyway, I gave the background just to explain that a standalone app for a naive user must be written as a one-piece tiny app (say, up to a hundred kilobytes), and anything else is not acceptable.


Quote: But MY POST WAS ABOUT SOMETHING DIFFERENT:

(a) the I/O provided by newLISP is not efficient

(b) one CAN TUNE the "search" operator by simply changing one parameter in the source code before compiling newLISP

(c) and that it is implemented with a strange behaviour pattern in a certain case: it demands that after a successful search a read from the file happens, or it won't continue to search forward. I believe this is a bug, and it slows the operator down.


This is a bug report and a hint on how to tune some I/O to at least begin to compete with Perl.



.. and my use of files to keep information - given just as background to describe the problem in full - is perfectly correct and efficient.

This is easily proven by a 30-50k app that does about the same as a 1.2 MB PHP app with a DB at the back - and does it much faster.

Lutz

#3
Thanks for this thorough analysis of the 'search' function, and welcome to newLISP. Speed is always a big priority in newLISP.


Quote: As I am unfamiliar with the code, the question is - is this adjustment harmless? At first sight, from reading the source for p_search in nl-string.c, it seems so. Could someone reassure me, just in case?


Yes, this adjustment is harmless, and I will set it to 0x1000 in the next development version, v9.3.3. 4K seems to be the buffer size on most OSs.


Quote: I believe that if the need to "jerk" the search operator with an I/O operation is eliminated, the newLISP "search" operator can be sped up further.


For many applications you want 'search' to position the file pointer in front of the found string, for reading in a matching record. But I agree, there are other applications where positioning the file pointer at the end is what is needed, and that would most likely increase speed too.



We could change the syntax to include a true/nil flag to move or not move the file pointer after the search operation. We would then have the following:



(search <pattern> <string> <flag> [<regex>])



and



(search <pattern> <string> [<regex>]) ; the old pattern



The old pattern would assume the old functionality, without moving the pointer.
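
A hypothetical call with the new flag (not available in the current release; fh and the search word are placeholders) might then look like:

        (search fh "allan" true)   ; true = leave the file pointer just past the match,
                                   ; so a repeated search moves forward without a dummy read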



Lutz