uses of NL - tar indexing

Started by lithper, March 02, 2008, 10:51:35 PM


lithper

Quote
Shouldn't we talk not only about abstractions and principles, but also see if and how NewLisp could be used for scripting in the spirit of perl?


..some time ago I accidentally read a message from some sysadmins on the main BBC website. They were talking about developing another "perl framework" to take care of multiple problems, one of which seemed to be that there were too many files in their web directories, and they had become unmanageable.



There is a very simple solution, but for some incomprehensible reason few people seem to be aware of it: use tar files, and index them. You'll be able to access contents in a tiny fraction of a second, so in effect your tar becomes an equivalent of a compressed read-only filesystem with random file access.



When "tar" itself reads the archive, it linearly scans the full length, and therefore it is way too slow.



I myself use it as a backend for storing archives of, for example, last year's postings on a blogging site, or to keep other collections of documents.

One obvious advantage is that you stop relying on unbundled software (many assume a relational db for storage), and with the ability of NewLisp to create tiny standalone executables one can create a really portable cgi application with a web interface.



One might add indexed tar storage to the NewLisp wikis to manage large file and postings collections.





1. The idea

Tar files are a concatenation of their member files, each preceded by a header and written in 512-byte blocks, so the end of a content section can be padded with 00s.

Usually you'll want to keep gzip-compressed files inside (or alternatively make a tar of uncompressed files and compress the whole bundle; that is discussed later).



To index a tar, therefore, one needs to read the archive in 512-byte pieces and check for the "magic string" - the word "ustar" at position 257 of a block signals the header of the next member file (this is what the skeleton below tests). Then the indexer should read the first 100 bytes of the header, which contain the name of the included file, padded with 00s.

There is more information in the header - Wikipedia is the quickest reference, if one wishes to avoid dense descriptions.
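
(As an aside, here is a minimal sketch of peeking at just the first header - untested, assuming an uncompressed ustar archive with the hypothetical name "archive.tar", and assuming 'int' accepts a base argument for converting the octal size field at offset 124 of the header. The filename is truncated with the string-cast trick described further down.)

#!/usr/bin/newlisp
; minimal header peek - a sketch, not part of the indexer below
(set 'fh (open "archive.tar" "read"))      ; hypothetical file name
(read-buffer fh 'hdr 512)                  ; the very first 512-byte header block
(if (= (257 5 hdr) "ustar")                ; magic string at offset 257 of the header
    (println (string (0 100 hdr))          ; member name, 00-padded to 100 bytes
             " "
             (int (124 12 hdr) 0 8))       ; size field: 12 bytes of octal text at offset 124
    (println "not a ustar archive"))
(close fh)
(exit)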



Next the indexer counts the 512-byte blocks until the next header with a filename, and the result, written to a file in a form convenient for the reader program, will look like:
Quote...

archive/dir/some/file/name 1235 14

archive/dir2/some/other/file/name 3214 3

...

Meaning, for the reader program: seek 1235*512 bytes from the beginning, then read 14 blocks (up to where the next member file starts), and process them to strip the header and the padding garbage.
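
For the first line above, for example, the reader would seek to 1235 * 512 = 632,320 bytes and read 14 * 512 = 7,168 bytes, which hold the member's header, its content and the trailing padding.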





A skeleton indexer

I offer a simple "skeleton script" below. My primary uses of the scripts are not as command-line utilities, so the argument reading is primitive etc.


#!/usr/bin/newlisp
;;@module index-tar

(if ( != (length (main-args)) 3)
( and (println "\n\tUSAGE: " (main-args 1) " Tar_file_name\n") (exit))  )
 
(set 'l_a (2 (main-args)))
 
(if (file? (set 'tarball_file (pop l_a)))
(set 'fh_tarfile (open tarball_file  "read"))
(and (println "\n\tcould not open tar file\n") (exit))  )

(set 'cnt 0)
(set 'prev_offset 0)
(set 'prev_name "dummy")
(set 'str_accumulator "" )


Now the main part of the processing begins.

After experimentation, I found that the fastest output NewLISP can provide is writing into a memory buffer (a string).


Quote from: "redundant string cast"
One "trick" I use is - after reading the filename from the first 100 bytes of the header  - which is a string - I cast it as strng again:
(set 'prev_name (string (0 100 str_processed_block)) )
The extracted filename is padded with 00s:



"archive/my/file/name000000000000....00

 0---------------------------------------------------------100



Casting this string as string again seemed an inexpensive way to truncate the padding.



P.S. Correction and speedup: the fastest way to create the string is through the "format" command, without explicit casts. The script below is updated accordingly.


(while (read-buffer fh_tarfile 'tar_chunk 4096)
(dotimes (block_cnt 8)

(set 'str_processed_block ((* block_cnt 512) 512 tar_chunk))
(and
(= (257 5 str_processed_block) "ustar") ; --> this is a header block
(if  ( = cnt 0 )
(set 'prev_name (string (0 100 str_processed_block)) )
(write-buffer str_accumulator
(format "%s %d %d\n" prev_name prev_offset (- cnt prev_offset) ) )
 
)
(set 'prev_name (string (0 100 str_processed_block) ))
(set 'prev_offset cnt)  
)
(inc 'cnt)

); /end of dotimes/ - inspection of the 512-byte blocks in a chunk
);end of while - 4k chunks gobbling

(write-buffer str_accumulator
(format "%s %d %d\n" prev_name prev_offset (- (- cnt 1) prev_offset) ) )

(write-buffer 1 str_accumulator) ; /*result to STDOUT*/

(close fh_tarfile)
(exit)

..and write the result to STDOUT.



The length of the last file in the archive will be wrong (too large, since there is no following header and the trailing zero blocks get counted in), but it won't interfere with correct file extraction.



USAGE: scriptname tar_file_name > index_file_name



The indexer is sort of slow. For my tests I use a 90MB tar of tiny (5-50kb) text/html files, gzipped before they were put into the archive. The whole archive uncompressed was 250MB and consists of roughly 31k files.



The indexer goes through 90MB and 37000 items (dirnames and some trash add to the 31000 useful files) in roughly 3.9 seconds on an old machine (Pentium 500 MHz).

Indexers written in C do that in roughly 1.0-1.3 seconds.



Because indexing is a one-time and infrequent job, I decided I could live with that, as long as the main script for the project - file listing/search and extraction from this as-if "read-only tar filesystem" - was fast enough.







2. List and get files quickly from large tar archives



I found two ways to scan files for their contents quickly in NewLISP:

(a) use the "search" operator. To speed it up roughly 3 times one might want to reset a buffer size in the newlisp code (see my previous posting), and then it becomes comparable to perl for up to thousands of lines. It will lag behind after that.

(b)   slurp the contents of the whole file into memory, and use string operators to search for information in it.



Below are the two short functions,  "f_search_in_file" and "f_slurp_and_search"  that do that.



The remaining code snippets can be concatenated into one working skeleton script, which is fed the index file in the format shown above and the tar file.

Depending on the options, the script will either find filenames (by substrings etc., using regexps; the range of matches to output is given as options too) or cut the files out. So first adjust your regexp to "list" what you need, then change the option to "get" the file.


Quote
USAGE:

<prg> list|get search|slurp from_match_num to_match_num 'pattern' index_file tar_file




for example:

 (a)  ./script.lsp list slurp 100 120 '2006-.*'  index.txt archive.toz.tar

(find the 100th through 120th matches for the string 2006-)



(b) ./script.lsp get slurp 10000 10001 '/2006-.*' index.txt archive.toz.tar |tar xOf - |zcat

cut out the 10000th match by reading the index file into memory (the faster find in this case), getting the record, then using the offset and length to cut the file out




In this skeleton script I simply cut the file out. With the header and trailing padding it's just a small one-file tar archive itself (or of as many files as you decided to extract at once).

I could untar in NL, but because tar itself is a tiny 140k utility, I simply redirect into "tar xOf -" to get the content stream on STDOUT, and because the files inside the archive are gzipped, one might decompress them with zcat (also tiny), or serve them as "gz" files.

Most web browsers support gzip compression.
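
A tiny sketch of that serving idea (purely illustrative: gz_bytes is a hypothetical variable assumed to already hold one member's gzipped content with the tar header and padding stripped off) - a CGI script only needs to announce the encoding before writing the raw bytes:

; hypothetical CGI fragment - serve the still-gzipped member as-is
(print "Content-Type: text/html\r\n")     ; or whatever type the member really is
(print "Content-Encoding: gzip\r\n\r\n")  ; tell the browser the body is gzip-compressed
(write-buffer 1 gz_bytes)                  ; raw gzip stream straight to STDOUT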


#!/usr/bin/newlisp
;;@module .......

#------FUNCS---------

(define (f_search_in_file)

(set 'v_cnt 0)
(while (search fh_index (string str_pattern) 0)
(inc 'v_cnt)
(if ( > v_cnt (int from_rec_num))
(apply f_output (list (read-line fh_index)) )
(read-char fh_index)
)
(and
( >= v_cnt (int to_rec_num))
(close fh_index)
(exit)  )   );--end of while = grep for given file--

(close fh_index)
); /* end of f_search */


The "search" grepping inside files is very fast for files at the beginning of the archive - faster than slurping the index file into memory.


(define (f_slurp_and_search)

(set 'cnt 0)
(replace (string str_pattern) ; -- replace what
(read-file index_file) ; -- replace where
(and ; -- pseudo-replace with what actions
(inc 'cnt)
(if (> (int cnt) (int from_rec_num))
(apply f_output (list $0)))
;(println $0))
;(f_get $0))

(and (>= (int cnt) (int to_rec_num))
(exit)  ); --end of and
); end of "and" = replace-actions
0); end of replace  
) ; /*end of f_slurp_and_search*/


This one is faster if the number of matches is high (e.g. you are grepping for 4000, or 10000 files in the index), or when (even a few) files lie at the end of the tar archive.



Next comes the primitive function to cut the contents out of the tar:


(define (f_get str_tar_entry)

(set 'str_parsed (parse str_tar_entry))
(seek fh_tarball ( * (int (nth -2 str_parsed)) 512 )) ; offset field, in 512-byte blocks
(read-buffer fh_tarball 'tar_item (* (int (nth -1 str_parsed)) 512)) ; length field, in blocks

; writing the cutout (which is tar itself) on STDOUT
(write-buffer 1 tar_item )
; writing into a tar file
;....
; writing and untarring into fsystem  
;....

); /*end of f_get*/




Main sets command-line options.



#------MAIN----------

(if ( != 9 (length (main-args)))
( and
(println "\n\tUSAGE: " (main-args 1) " 'list|get' 'slurp|search' 'From_rec_num' 'To_rec_num' 'RegexpPattern' 'Indexname' 'Tarballname'\n")
(exit))  
)


; -- these are globals: no args passing to the funcs
(set 'l_a (2 (main-args)))
(set 'grep_untar (pop l_a))
(set 'slurp_search (pop l_a))
(set 'from_rec_num (pop l_a))
(set 'to_rec_num (pop l_a))
(set 'str_pattern (pop l_a) )

(if
(= grep_untar "list") (constant 'f_output 'println)
(= grep_untar "get") (constant 'f_output 'f_get)
(println "\n\t action not defined: list|get\n\t (list = grep-like find files -- get = a cutout tar of file(s))\n")
)


(if (file? (set 'index_file (pop l_a)))
(set 'fh_index (open index_file "read"))
(println "nt could not find the index file")
)

(if (file? (set 'tarball (pop l_a)))
(set 'fh_tarball (open tarball "read"))
(println "nt could not find the tar file")
)

(if
(= slurp_search "slurp") (f_slurp_and_search)
(= slurp_search "search") (f_search_in_file)
(println "nt type of processing not defined: slurp|search")
)


(exit)




3. The results

The results of this experiment are good; the script is quite usable.



I know of 2 other utilities for indexing tars.



One is called "tarix" and its home is sourceforge.net. It's written in C, but the author made an unfortunate choice: claiming a need for compatibility, he did not read the index file line-by-line, but rather character-by-character. As a result extraction is slow: on my machine with the 90MB tar (and a 2MB index file of 37k entries, 31k useful entries) it gets the needed line from the index in approx. 0.300 seconds (and then seeking and cutting out is fast enough).



The "tarix" author, however, also implemented an idea put forward by the author of the (well-known) "zlib" library. It's also possible to index in the similar manner compressed files!

Basically, if instead of a tar of many compressed file1.gz ... fileX.gz you use a tar.gz, it's possible to create a double index. So when extracting a file, (a) find it in the index, (b) start stream decompression from a point before the file begins in the tar and stop at a point after it ends, and cut out the file's tar blocks roughly the way we have done here.

The zlib author has a prototype utility to show how it can be done, and "tarix" is the only program I know of that actually picked up the idea.

Another note: tarix's invocation is quirky and not easy to use manually; however, it may be OK in scripts.



My tests of the scripts above are much better. Extraction takes

0.035 - 0.070 - 0.090 - 0.110 seconds for the majority of the files.

Getting files from the far end of the tar archive takes 0.200 seconds for "slurp" and 0.600 seconds for "search".



The third set of utilities for such applications I found in some scientific software for working with biology data. I had to tear the sources out of the project tree; they use some interesting tricks with memory management and are therefore not usable without their accompanying library, around 200k (the utilities themselves are a tiny 50-60k).

Those can extract files in tens of milliseconds (e.g. 0.025 - 0.070 up to 0.100 seconds), even for the farthest files.



So, this humble and rather quick NewLisp hack comes in a strong second, and is especially good for wiki and blog uses, where new files are accessed frequently and the farthest, older ones are read much less.



I believe using indexed tars is a really good solution for managing file collections, for websites, or, if anyone adds the necessary scripts to a file manager (such as "Midnight Commander", which many use on Linux), as a means to browse and read from tars without the need to linearly scan (slooow) and/or extract their contents into temporary directories (overfilling your file systems).



Some results on an old 500 MHz machine:


Quote
1. Grepping for files:

list/grep one file (end of index) -- "search":    0m0.675s

list/grep  one file (end of index)  -- "slurp":             0m0.243s



This is not really what the utility is meant to do. In real web usage extraction is done by pages (say, 20 at a time):

grep/find/list 12300 filenames - "search"    0m3.072s

grep/find/list 12300 filenames - "slurp"       0m1.246s



2. Extracting files

get/extract one file from the end  (13000th match) -- "search"  0m0.675s

get/extract one file from the end (13000th match) -- "slurp"      0m0.243s



Extraction from beginning or the middle of the file:

get/extract the 122nd matching file (search)   0m0.043s

get/extract the 122nd matching file (slurp)      0m0.065s



get/extract the 633rd matching file (search)     0m0.088s

get/extract the 633rd matching file (slurp)        0m0.089s



get/extract the 1220th matching file (search)    0m0.100s

get/extract the 1220th matching file (slurp)        0m0.108s


P.S. Comparison with an sqlite indexer (an excellent indexer program) shows that unless you know precisely the filename you're looking for, you (or the web site user) are likely to search for a substring.

While indexed sqlite access is very quick (0.015s for my test file), any search with "LIKE username%" will simply scan the index linearly, and in my tests sqlite gave worse results than the utility described here.



P.P.S. One can index files not only by their filenames, but by other keys (categories, dates, some key words, etc.). The speed of reading the index obviously depends on the size of the index, so all unnecessary information should be cut out.
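
For example (a purely hypothetical layout), an index line could carry a date and a category in front of the offset, so the same list/get code can grep by any of those keys; f_get keeps working because it takes the offset and block count from the end of the line:

Quote
archive/dir/some/file/name 2006-11-02 blogposts 1235 14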

Lutz

#1
Very nice idea! A couple of comments:


(set 'prev_name (string (0 100 str_processed_block)) )

there is no need for this, as slicing already returns a string copy:


(set 'prev_name (0 100 str_processed_block))

QuoteTo speed it up roughly 3 times one might want to reset a buffer size in the newlisp code


'search' in 9.3.3 (due by the end of this week) has a shorter buffer size and lets you position the filepointer at the end of the searched string using an optional flag.

lithper

#2
Quote from: "Lutz"Very nice idea! A couple of comments:


(set 'prev_name (string (0 100 str_processed_block)) )
there is no need for this, as slicing already returns a string copy:
(set 'prev_name (0 100 str_processed_block))


..as is buried somewhere in the posting, I use a redundant cast into string as a cheap way to cut the trailing 00s, which in the tar header pad the filename out to the full 100 bytes.

"my/file/name000000 ... 00" --> "my/file/name" (with possibly one 00 that ends the string)


Quote from: "Lutz" 'search' in 9.3.3 (due by the end of this week) has a shorter buffer size and lets you position the filepointer at the end of the searched string using an optinal flag.


Thanks a lot - it means we would be able to write all kinds of tiny "databases" as "key1 key2 key3 data", where "data" is what we are really interested in. Then a search, for example for "key2", would get us "key3 data", destroying the line - which is perfectly OK for such uses.



And the lines that match our queries but that we are not interested in (e.g. we want only results 30 to 40, skipping the first 30) won't need an additional operator in the script to move the search forward, which would decrease the speed of the search.
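
A minimal sketch of that pattern, modeled on f_search_in_file above (the file name and keys are hypothetical, and the optional flag is assumed to behave as described for 9.3.3):

; lines in flatdb.txt look like:  key1 key2 key3 data ...
(set 'fh (open "flatdb.txt" "read"))
(while (search fh "key2" true)     ; true = leave the file pointer right after the match
    (println (read-line fh)))      ; prints the rest of the matching line: " key3 data ..."
(close fh)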



Looking forward to trying that one ;))

Lutz

#3
QuoteI use a redundant cast into string as a cheap way to cut trailing 00s


Oops, I overlooked this. Here is another trick, to do it faster:


> (set 's "abc\000\000")
"abc\000\000"
> (time (string s) 1000000)
518
> (time (get-string s) 1000000)
226
> (get-string s)
"abc"


The 'get-string' function is normally used when importing C-library functions (it also works when passing it an integer memory address), but it works well for your purpose and is about twice as fast as 'string' ;-)
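
Applied to the indexer above, the filename extraction would then read (just a sketch using this tip):

(set 'prev_name (get-string (0 100 str_processed_block)))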

cormullion

#4
Nice post - thanks!



I like the way that, in some areas of this forum, there's talk about newLISP and functional programming concepts, and in other areas like here newLISP is hard at work doing scripting 'in the spirit of Perl' and being kind of discussed alongside C...



I've always been interested in the interplay between efficient scripts and elegant ones. In newLISP it's sometimes true, I think, that the most elegant solution isn't always the fastest. But for my purposes the language scores highly on both elegance and performance so I'm quite happy with different compromises between the two.

lithper

#5
Lutz: aha! - thanks for the tip



cormullion:  - I've come from the perl world, and am just trying nlisp.



I've been looking for a language that would (a) have higher-level operators, like perl or even php has, that allow much faster scripting or prototyping, and (b) at the same time would preserve the logical programming or functional programming paradigm, i.e. the way those operators are tied together.



Why? - because the structure of a programming language MUST go along with the innate patterns of our thinking. Our thinking is best expressed by natural languages.



There is something very, very appealing in lisps' way of putting things as
     (drink (fill glass water))
         1. fill what? with what?
         2. drink what?
         3. fold into one nested "Verb+Object (+obj...)" structure


rather than
temp_var_water_container = (fill glass water)
 temp_var_resulting_state = drink temp_var_water_container


This economy - eliminating the need to create and keep track of intermediate results - makes for faster programming and fewer errors.



To give you one example: when I started to rewrite the indexer in Perl (which I have known for a number of years), I hit an error, and could not immediately trace it.



With nLISP because of its logical economy I can produce (incrementally) scripts that work the first time I write them, an experience considered magical and rare among procedural programmers.

With nLisp all I have to do is test a snippet - and immediately insert it into a longer sequence, which is very often immune from errors because a produced "nil" could be treated nicely by the enclosing operator.



A useful essay on the logic of programming languages, written in comparison with human languages, is here:

http://steve-yegge.blogspot.com/2006/03/execution-in-kingdom-of-nouns.html

itistoday

#6
Neat stuff lithper!  Oh, and I just wanted to say that I love your name, that is awesome. :-)
Get your Objective newLISP groove on.

Ryon

#7
I'm in favor of the RPN 'water glass fill drink' syntax of Forth, which, like Lisp, also deals elegantly with intermediate results, yet doesn't require Lots of Irritating Silly Parentheses.



Not relevant in a newLISP discussion, but I like to mention it anyway.
\"Give me a Kaypro 64 and a dial tone, and I can do anything!\"

xytroxon

#8
Very interesting!!!



But I have some thoughts here, to be the proverbial "devil's advocate" :)



1: Is using a mature, optimized SQLite3 "blobs" database faster??? You would create a simplified user interface to clean up the normal SQL table noise...



2: Have you seen Daniel J. Bernstein's "cdb", Constant Database concept??? For example, as used recently by lua-tinycdb module...

http://asbradbury.org/projects/lua-tinycdb/



lua-tinycdb



A binding to the tinycdb library by Michael Tokarev, which is a public domain implementation of Daniel J. Bernstein's Constant Database (cdb). A cdb is a key-value store (much like BDB/gdbm), but it cannot be updated at runtime (only rebuilt). Reading is very fast, and rebuilding is done atomically by building the new database and then renaming. lua-tinycdb includes the tinycdb source so there are no external dependencies.

Features



I include below a list of some of the features of the cdb database structure, stolen (and slightly modified) from D.J. Bernstein's cdb page:



    * Fast lookups: A successful lookup in a large database normally takes just two disk accesses. An unsuccessful lookup takes only one.

    * Low overhead: A database uses 2048 bytes, plus 24 bytes per record, plus the space for keys and data.

    * Fast atomic database replacement



---------


Quote from: "Ryon"I'm in favor of the RPN 'water glass fill drink' syntax of Forth, which like Lisp, also deals elegantly with intermediate results, yet doesn't require Lots of Irritating Silly Parenthises.



Not relevant in a newLISP discussion, but I like to mention it anyway.


Maybe it is relevant... Just how similar are languages like Lisp and Forth???



Much like newLISP's challenge to older Lisps, Ron Aaron has created Reva, a "Batteries Included" dialect of Forth that claims to be almost as fast as C/C++... It also, coincidentally/interestingly, has "Contexts"...



Wiki:

http://ronware.org/reva/wiki/Main_Page



Forum:

http://ronware.org/reva/



Would a silly idea like a newLISP to Reva translator work??? That could then be used to run newLISP scripts at machine code speeds...



Oh no!!! Thinking of all these newLISP inspired ideas is making my head spin again :) LOL
\"Many computers can print only capital letters, so we shall not use lowercase letters.\"

-- Let\'s Talk Lisp (c) 1976

Ryon

#9
Reva seems to be a compiled Forth, which means it's no longer an interactive language. Some of the nicest features of Forth, and NewLISP, come from their interactivity. Code can be translated into C or assembly for speed after the program is stable. The elegance of both languages helps to make this a straightforward job.



I remember a saying that 90% of the time will be spent in 10% of the code. Make it work first, then optimize where needed.
\"Give me a Kaypro 64 and a dial tone, and I can do anything!\"

xytroxon

#10
No... Reva is a REPL interpreter that is "subroutine threaded" for speed, but it also has available inline and macro capabilities. It also does generalized tail-call elimination... In a 31,284 byte .exe file... With no C/C++ source used...



In Reva, when you define "words" they can be compiled after you enter the code for them (by file or keyboard and then it usually finishes before you can lift your fingertips from the keyboard...)



So it's a real time incremental compiler that lets you compile all, part, or none of your defined "word" as optimization is needed... The trick is that you build new "words" from the hundreds of other words that have been defined and optimized by you, or come as part of the Reva language, until the final "word" is just your program...



And if you need it, you can use pure assembly language for even more speed...



These endlessly customizable "words" are the true power of Reva... And is why I think a very powerful highlevel language could be built from them... Instead of endlessly compiling C/C++, you experiment from the keyboard and focus on only what needs to be fixed or optimized...



But of course, Common Lisp has solved all these problems :)



And similar to newLISP, you can then bind your Reva code to the interpreter executable for an all-in-one package... And it comes with 20 or so .dlls from pcre3 and pdcurses to sdl and sqlite3... So Reva programs can be written that achieve very high levels of functionality...



But the bad part is that Reva only supports Windows and Linux... And programming in Reva takes real dedication to understand and learn and makes LISP programming look like 1970's BASIC :)



When you can understand this code:
Quote
: gcd   ( a b -- c) | return the greatest common divisor of 2 numbers on the stack

    dup if swap over mod gcd else drop then ;


Then, and only then, have you ascended to the next level of being "one with the machine"...
\"Many computers can print only capital letters, so we shall not use lowercase letters.\"

-- Let\'s Talk Lisp (c) 1976

Ryon

#11
The top of stack is treated as a test for the if else then loop, which interestingly is the logic of this definition. gcd then calls itself recursively until it drops out of the loop. The rest is simply stack handling. Also interesting is that the commenting takes more ink than the definition itself.



I may have been wrong about Reva being a compiled-only language. The article that I googled called it a "forth compiler", and I jumped to that conclusion.



I'd love to find an explanation of why Forth didn't make it as a major language, in its own discussion thread. Just terming it "weird" because it isn't immediately understandable isn't an acceptable answer to me; just look at any of the C's and their spawn if you want weird!
\"Give me a Kaypro 64 and a dial tone, and I can do anything!\"

Lutz

#12
Quote
I may have been wrong about Reva being a compiled-only language


Forth cannot be classified along the lines of compiled or not compiled; it's a totally different concept. Both of you are right  ;-)



In Forth, pre-compiled pieces of assembly, called words, are simply concatenated to form new words. This is also the reason that Forth uses reverse Polish notation, pushing intermediate results on a stack. Only this way is it possible to simply knit pieces together in a serial fashion to form new user-defined words.



Only for flow control words like 'if' and 'while' do you need to do a little bit of compilation to calculate jump offsets/addresses in the resulting code.



Forth syntax is bound to reverse Polish notation and must be a stack language. A specific implementation is also always bound to a certain CPU architecture and calling conventions of the specific OS (if it wants to be callable or call other libraries). E.g. you have to preserve the contents of certain registers or restore them if you change them.



I wrote a simple Forth called GokuScript in 1998 for the x86 family of CPU's, trying to make Forth friendlier. You can see the manual here: http://newlisp.org/Projects/Gokumanual.html and the source here: http://www.newlisp.org/Projects/goku-asm.txt ; there is also a standard library here: http://www.newlisp.org/Projects/std-goku.txt .



Forth is very fast (close to C speed), but also hard to read, debug and maintain. In the end I felt that my intentions to domesticate Forth had failed. Forth is great for BIOS monitor and/or boot software and small scripts, I think it is still used in the BIOS of Sun Sparc workstations and Apple computers (?).

Elica

#13
My first experience with Forth was with a dialect called GraForth, some 20 years ago. It was Forth + some 3D graphical primitives. This was the first time ever in my life I saw a 3D animation on a computer screen.



The only significant problem I had at the time was editing my GraForth programs. There was a command to forget some definition, but because they were stacked, forgetting ABC would also forget everything defined after ABC.

cormullion

#14
I was going to mention Factor. But I want to mention newLISP instead... :)