Parse very big XML file (Openstreetmap)

Started by hilti, May 28, 2013, 01:15:23 PM

hilti

Hi!



Does anyone have experience parsing large OSM (OpenStreetMap) files? I'm trying to parse them with (xml-parse), but newLISP reports that there is not enough memory for (read-file).



The file is 32GB (gigabytes!).



Here's the error message:



newlisp -m 4096 -s 10000 parse.lsp
newlisp(18433) malloc: *** mmap(size=4258476032) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

ERR: not enough memory in function read-file


Thanks for any suggestion.

Marc
--()o Dragonfly web framework for newLISP

http://dragonfly.apptruck.de

conan

#1
DISCLAIMER: I haven't worked with big XML files.



It seems you will have to split your XML file, but I don't know how that will affect xml-parse.



However, from the manual:
Quote
Using a call back function



Normally, xml-parse will not return until all parsing has finished. Using the func-callback option, xml-parse will call back after each tag closing with the generated S-expression and a start position and length in the source XML:


Maybe that could help.
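A minimal sketch of the callback style, assuming a smaller extract named "map.osm" and a hypothetical handler called node-handler (both names are mine, not from the manual). Note that xml-parse still needs the whole source string in memory, so the callback avoids building one giant result S-expression but does not by itself fix the read-file limit:

(define (node-handler sexpr start size)
  ;; called after each closing tag with the parsed S-expression,
  ;; plus its start offset and length in the source
  (when (= (sexpr 0) 'node)   ; option 8 below turns tag names into symbols
    (println sexpr)))

;; option 15 = 1+2+4+8: no whitespace strings, no empty attribute
;; lists, no comments, tags as symbols
(xml-parse (read-file "map.osm") 15 MAIN node-handler)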

rickyboy

#2
If you know something about the makeup of the file, using search with regexes might help:



http://www.newlisp.org/downloads/newlisp_manual.html#search



Maybe this way you can grab certain "chunks" from the file, pass each chunk to xml-parse, and then (optionally) write the chunk out to another file (save?). If you're still memory constrained, you may not want to accumulate chunks on the heap (I guess you could bind each chunk to the same symbol in the loop; I'm not sure how quickly the old chunks would be garbage collected).
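A sketch of that chunking idea, using search and read-buffer on an open file handle so the 32GB file is never loaded whole. The filename "planet.osm" and the use of "&lt;node" / "&lt;/node&gt;" as chunk delimiters are assumptions for illustration; real OSM files contain many self-closing &lt;node .../&gt; elements, so a production version would need to handle those too:

(setq fh (open "planet.osm" "read"))
(while (setq start (search fh "<node"))
  ;; search leaves the file pointer at the match
  (when (setq finish (search fh "</node>"))
    (seek fh start)
    ;; 7 = (length "</node>"); read one element into 'chunk'
    (read-buffer fh chunk (- (+ finish 7) start))
    (println (xml-parse chunk 15))
    (seek fh (+ finish 7))))   ; continue after the closing tag
(close fh)

Because only one element's text is in memory at a time, and 'chunk' is rebound on each iteration, the working set stays small regardless of the file size.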



Good luck.  I'll bet someone has a better idea though. :)
(λx. x x) (λx. x x)