lazy iteration over strings in memory???

Started by unixtechie, June 29, 2009, 09:54:31 PM


unixtechie

Programming language operators are, as a rule, very inconsistent: some return meaningful values, others exist for "side effects"; some are greedy, others are lazy or can be made lazy, and so on.



Functionally aware languages, such as Lisps, and newLISP in particular, are great: they have nice uniformity.



However, there is one task I do not seem to be able to solve easily: lazy iteration over strings.



That is, for a great number of tasks I want the ability to do with strings in memory exactly what I can do with newLISP's "search" and files: lazy processing.

"search" gives me a result and advances the offset in the file (which I can read and use later if I need it). The next invocation of the same search, say in a loop, gives me the second match, and I can pick up its offset.

The file can be arbitrarily large, but the operator is not "greedy": it does not try to process all of it. It is "lazy"; it yields a result only when asked to.
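For reference, a minimal sketch of that file-based lazy loop (the file name is hypothetical, and I believe the optional true flag positions the file pointer after the match so the loop advances):

(set 'fh (open "origin.txt" "read"))
(while (set 'pos (search fh "species" true)) ; true: leave the pointer after the match
  (println "next match ends at offset " pos))
(close fh)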



I cannot seem to do the same thing easily and efficiently with strings in memory. E.g., if I read that file (or a part of it) into memory, I do not know how to iterate over the string in such a way that a second invocation of the same matching operator would yield the second match and advance an offset I could pick up.



My attempts at constructing such behaviour from existing newLISP primitives were clumsy and terribly inefficient (e.g. there is an operator that gives only the first find, but then truncating the string for the next call is very costly, etc.).



Am I missing something? Is there a way to process largish strings in memory efficiently, à la files, in a lazy manner?

Lutz

#1
You mean stream-like processing of strings, with the stream remembering its position? For many operations, newLISP already does this internally to speed up processing. For example, a stream-like find can be done with 'replace':



(replace pattern mystring (process-found-string $0) 0) ; the expression is evaluated once per match, with $0 bound to the match


This example does a successive find, internally remembering the positions, and processes each match in a user-defined function. Make sure that 'process-found-string' returns $0, so mystring is not changed.
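For instance, a runnable version of this pattern might look like the following ('process-found-string' and the sample string are placeholders):

(define (process-found-string s)
  (println "processing: " s)
  s) ; returning $0 unchanged keeps mystring intact

(set 'mystring "one fish two fish red fish")
(replace "fish" mystring (process-found-string $0) 0)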

Kazimir Majorinc

#2
unixtechie, why don't you save the position at which you stopped your string function in some global variable, and the next time the function is called, resume from that position?



Are you unsatisfied with that obvious solution for functional or aesthetic reasons?
WWW site: http://kazimirmajorinc.com/ ; blog: http://kazimirmajorinc.blogspot.com

unixtechie

#3
Lutz: if I got it right, there is no way to call "replace", get the first match, then make the same invocation a second time and get the second match, etc. I.e., "replace" will provide ALL of the matches in a list, not one by one on demand. As far as I understand it, I'd need "find" and subindexing to obtain my matches explicitly one after another, and this is very slow.



Kazimir: no, efficiency.

I tried searching with "find" (or regex, or something; I tried several operators), then getting the position, and the next time doing "find" on a subindexed string, trying to ensure no copying is done.
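Reconstructed with hypothetical names, that approach would look something like this (each (slice ...) in fact makes a fresh copy of the string's tail, which is presumably where the time went):

(set 'text (read-file "origin.txt"))
(set 'offset 0 'count 0)
(while (set 'pos (find "species" (slice text offset)))
  (inc count)                              ; count the match
  (inc offset (+ pos (length "species")))) ; move past it - but slice copied the tail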

This is terribly inefficient. In place of a stream, I tested it on a large enough file (looking for "species" in Darwin's book as the search word), and it was the slowest version, slower even than "search" on the file, or than greedily obtaining a list of all matches and dealing with it later.

I'll try to find that snippet (or redo the test) later.



This is how one could do this lazy iteration over strings with regexps in Perl (code from the Perl documentation):
In a coding region, codons are 3-letter sequences, so we can think of the DNA snippet as a sequence of 3-letter records:
    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
...............................
The solution is to use \G to anchor the match to the codon alignment:
    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

I.e., the anchor \G starts a new lookup from the place where the last one stopped, and the pos() function reports the position (and can be assigned to, to reset it).



Instead of "while", one could issue the match line once and get the first match, then a second time, etc.

This seems different from the built-in processing of operators like "replace" for strings in newLISP; that one gobbles everything up and produces a list of all matches, if I understand it right.



It might be that the PCRE library provides all of that under newLISP; I do not know yet.



But the idea is that "lazy" processing of streams or strings is very convenient if you wish to do coroutines (ping-ponging processing between two functions), or if your input is too large to fit into memory at once, so you set up in-memory buffering of parts of it (and hope to gain speed with this technique), or if you are dealing with a stream that is "infinite" in practical terms.

Lutz

#4
Quote: "that one gobbles all up and produces a list of all matches, if I understand it right."


This is not the case for 'replace', but it is for 'find-all'.



The point I am trying to make is that 'replace' has step-wise traversal of the matches built in and accessible to the user. So you do obtain your matches one after the other, under your program's control (just without being shown the position explicitly).



Look at this example:


(set 'dna "ATCGTTGAATGCAAATGACATGAC")

(replace "[ATCG]{3}" dna  (println "->" $0) 0)

; this will output:

->ATC
->GTT
->GAA
->TGC
->AAA
->TGA
->CAT
->GAC


... without changing 'dna' itself, because the expression (println "->" $0) returns the found piece. Except for showing the position, this code is equivalent to the Perl code, and the handling of each found piece is under program control.



If your request for the position was only a means of getting hold of the found piece and coding the incremental search efficiently, then the above form of 'replace' already does this.



If you still insist on needing the position explicitly, you would have to use the 'regex' function, which does return the position and length of the match.



We could then add an optional offset parameter, which permits starting 'regex' and 'find' at an offset into the string (it wouldn't copy anything). Perhaps we should do this in any case. Like this:



(regex <pattern> <string> [<option> [<offset>]])

(find <pattern> <string> [<option> [<offset>]])



With both, you could code a loop that advances the offset by the last position and length found.
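Such a loop might look like this (a sketch relying on the proposed offset parameter; 'regex' returns the match, its position, and its length):

(set 'dna "ATCGTTGAATGCAAATGACATGAC")
(set 'offset 0)
(while (set 'm (regex "TGA" dna 0 offset))
  (println "TGA at position " (m 1))
  (set 'offset (+ (m 1) (m 2)))) ; new offset = position + length of match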



But again, look at the 'replace' example above, which covers this in most cases already.

unixtechie

#5
Lutz, thanks for the explanations.



The "lazy" example from Perl will produce ONE, first match and stop. A second rerunning will produce the SECOND match and stop. And so on..



To get the "lazy" behaviour, the modifications of "find" and "regex" you suggested may be exactly what is needed (if, internally, the offset is remembered as some kind of pointer and not recalculated every time, which I believe is the case).





Why might such "lazy" behaviour be important?



This is the task I was thinking about when I came to consider "lazy" processing of data buffered in strings.



There are plenty of "relational database" systems, from the smallest ones like SQLite to behemoths. They, however, are optimized by their very design for writing data (roughly equivalent to "append whole lines to an existing file"), not for reading it efficiently.

The reason for the inefficiency is that reading a relational db drags along whole rows of table data (and often even scans through all of it), while the query may in fact touch only a few columns. This eats up memory and slows things down.



So more recently, "column-oriented" databases were "invented". There are free and practical ones too: e.g., one research column-oriented db, a university project released as free software, is comparable to MySQL. Direct comparison to MySQL using TPC-H benchmarks shows something like an order-of-magnitude speedup.



Existing relational dbs cannot simply be adjusted to operate on columns. The first idea would be to create tables exactly one or two columns wide and then adjust the necessary queries. Yes, this would eliminate going through table data irrelevant to a given query, but internal processing tuned to wide tables makes these operations very slow. I tried it, was disappointed, and then found a paper on the Net in which well-known university people had done exactly the same tests, only they were thorough, finished them, and published the completed results, with the same disappointing conclusion.



So database vendors began to produce "column-oriented back-engines", as is the case with MySQL, for example. You install that add-on, then create tables of a new type. This still presents the customary SQL interface to the user/programmer, but internally operates on columns, sometimes cutting memory use drastically and improving speed.



The question is: can column operations on data be modelled with shell, awk, Perl, or newLISP? You bet they can, and all those "selects" and "joins" are not conceptually that difficult. They are just a script per operation, and not a very large one. I think it would be very interesting to see how such processing compares to relational MySQL and to specialized column-oriented engines. I expect the scripting solution to give a speedup over traditional relational processing on TPC-H benchmark data.
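As an illustration, a toy "select name where age > 30" over two hypothetical column files (one value per line) could be as small as:

(set 'names (parse (read-file "name.col") "\n"))
(set 'ages  (parse (read-file "age.col") "\n"))
(dolist (age ages)
  (if (> (int age 0) 30)
      (println (names $idx)))) ; $idx lines the two columns up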





But when you deal with tens of megabytes of data in one column file, which lives inside a directory on a regular Unix filesystem, and wish to process it -- depending on a find or match in one column, you'd pick a line from another column file -- you meet a situation where you cannot slurp the whole column file into memory: there is not enough memory.

You do not want to process it while reading directly from the file; that would be too slow.

Then one might try to buffer the data, reading it from disk in large enough chunks and processing it in memory as buffered strings.

newLISP does not do efficient built-in buffering for I/O (Perl, for example, does, but then lies to the user, who is presented with language operators that appear to do line-by-line processing, etc.). I'd have to set up buffering manually.
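A manual-buffering loop might be sketched like this (hypothetical file name; note that a match spanning two chunks would be missed, so real code would carry each chunk's tail over into the next read):

(set 'fh (open "column.dat" "read"))
(while (read-buffer fh buffer 65536) ; pull 64 KB chunks into 'buffer'
  (replace "species" buffer (println "-> " $0) 0))
(close fh)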



And that was when I came to the next problem: if processing is "greedy", I can still do such buffered processing, but can I easily "ping-pong" to other columns to do something with them (in the simplest case, just pick up the line with the same number) at each match?



Of course, there may be ways to code around the limitation with greedy processing of chunks of a certain standard size, but I believe the "lazy" technique, one result per request, comes naturally in the context of such a script.



Of course, there may be other cases where "lazy" processing of strings or streams would come in handy.

Jeff

#6
Another quick note: storing the string in a context's default functor will let you pass the string by reference rather than by value.
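A minimal sketch of that idiom (assuming built-ins resolve a passed context to its default functor, as newLISP's reference semantics describe):

(define Buf:Buf "ATCGTTGAATGCAAATGACATGAC")

(define (count-codons ref) ; ref receives the context Buf, not a copy of the string
  (length (find-all "[ATCG]{3}" ref)))

(count-codons Buf) ; => 8, with no copy of Buf:Buf made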
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

Excalibor

#7
Hi there,



The database you are describing reminds me of NoSQL (not the current, famous one), an ASCII-text-based relational database which used Perl scripts for the operations and files for tables... it was pretty interesting, because you needed zero SQL to do all the usual relational db chores.



As for your problem, which, being a Perlite myself, I understand pretty well: it brings to mind a couple of ideas which may be feasible (I don't actually know, and I'm afraid I'm not good enough to even try them at the moment, time aside, although I'm soon going on holiday without internet and may use some of that time to hack away at some programs :-)



One is ropes, which have recently resurfaced on Reddit: storing text in binary trees instead of arrays. It allows for some really neat things.



The other one is Scheme promises. Ideally, you would get the first match of your search and a promise to get more if you insist on it.



; WARNING: fictional Scheme code in here, don't try at home
(define mymatch
  (lambda (rope)
    (cons (match "[ATCG]{3}" (first rope))   ; the first match...
          (delay (mymatch (rest rope))))))   ; ...and a promise to keep matching


for adequate definitions of match, first, and rest. Or just do a depth-first tree traversal of the rope, step by step.



Let's imagine:



- we have the string dna;
- in one go we transform it into a rope that
  * matches the search terms, and
  * returns a match and a promise to keep matching on the rest of the rope.



It's then just a matter of walking over this at the appropriate steps...



I'm not saying this is efficient or even feasible. I suspect promises would be pretty easy in newLISP; I have thought about implementing them (maybe they are even out there without my knowing!), and it seems easy, for an open enough definition of easiness. And ropes are, by default, pretty efficient and, well, they are trees... :-)
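For what it's worth, a toy delay/force could probably be assembled from newLISP's first-class lambda lists (a sketch only, without the memoization a real promise needs):

(define-macro (my-delay expr)
  (append (lambda) (list '() expr))) ; wrap the unevaluated expr in a no-argument thunk

(define (my-force thunk) (thunk))

(set 'p (my-delay (begin (println "computing...") 42)))
(my-force p) ; prints "computing..." and returns 42 - each force recomputes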



(Actually, a rope implementation in newLISP should work nicely with red-black trees, and considering they are by definition immutable, it would work nicely with ORO memory management; mmm, what an idea... ;-) )



I can imagine the trouble of getting this huge string into memory and not being able to chunk it efficiently (e.g., DNA strands), but the overhead of a rope instead of an array seems reasonable, and the benefits could be great. As for promises, they are one of the best ways I have ever seen to implement lazy streams in eager languages, so it may be worth a try.



Yet another approach, which again may be worthless, would be treating the string like a linked list. Using promises, you can store indexes to the elements you have already found, and the tail would be a promise to keep advancing through the list. Considering that anything you do on the list would be at least O(N), it does seem efficient enough, right? It's like a flexible, lazy splicing of the list...



Well, I've written this on the spot because I'm stuck at work and needed to vent; please don't take any nonsense personally :-)



good luck!