I ran into some problems while trying to write an (html-parse) function. Here are two functions that should do the same thing: both are supposed to extract the data from "td" tags.
(define (sacar-td linea)
  (set 'alveolos (find-all "(<td)(.*?)(</td>)" linea $0 1))
  (map (fn (x) (replace "</?td(.*?)>" x "" 1)) alveolos))

(define (crash-td linea)
  (find-all "(<td)(.*?)(</td>)" linea (replace "</?td(.*?)>" $0 "" 1) 1))

(set 'testrow "<tr><td class='kin'>Alpha</td><td>Gamma</td></tr>")
The longer one, "(sacar-td testrow)", works fine. The shorter one, "(crash-td testrow)", crashes the shell:
> (sacar-td testrow)
("Alpha" "Gamma")
> (crash-td testrow)
*** glibc detected *** /usr/bin/newlisp: double free or corruption (fasttop): 0x080cc808 ***
======= Backtrace: =========
/lib/tls/i686/cmov/libc.so.6[0xb7e31a85]
/lib/tls/i686/cmov/libc.so.6(cfree+0x90)[0xb7e354f0]
...
That is the first strange thing.
The second problem: my regexps only work if I replace every "\n" with " " before searching.
Update: I accidentally found the solution to the second problem. It is the "(?s)" option, which lets "." match newlines as well. Now my "parse-html" function works:
; Usage: (parse-html (get-url "http://www.newlisp.org/downloads/newlisp_manual.html"))
(define (parse-html texto)
  (map sacar-table (find-all "(?s)(<table)(.*?)(</table>)" texto $0 1)))

(define (sacar-td linea)
  (set 'alveolos (find-all "(<t[dh])(.*?)(</t[dh]>)" linea $0 1))
  (map (fn (x) (replace "</?t[dh](.*?)>" x "" 1)) alveolos))

(define (sacar-table linea)
  (map sacar-td (find-all "(?s)(<tr)(.*?)(</tr>)" linea $0 1)))
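For anyone hitting the same newline problem, here is a minimal sketch of what "(?s)" changes. The string 'cell' is made up for illustration, and the explicit regex option 0 is passed only to be on the safe side across versions.

```newlisp
; hypothetical test string with a line break inside the cell
(set 'cell "<td>Alpha\nBeta</td>")

; without (?s) the dot does not match "\n", so the pattern cannot span the line break
(println (find-all "<td>(.*?)</td>" cell $1 0))

; with (?s) the dot also matches "\n", so the whole cell content is captured
(println (find-all "(?s)<td>(.*?)</td>" cell $1 0))
```

The first call should come back empty, and the second should return the single cell with its line break intact.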
Do it this way:
(define (crash-td linea)
  (find-all "(<td)(.*?)(</td>)" linea (replace "</td>" (copy $0) "" 1) 1))
> (set 'testrow "<tr><td>Alpha</td><td>Gamma</td></tr>")
> (crash-td testrow)
("Alpha" "Gamma")
>
'replace' is trying to make the replacement in $0 while at the same time copying the found piece into it. In the future this will throw a protection error.
(define (mangle str)
  (replace "</td>" str "" 1))

(define (crash-td linea)
  (find-all "(<td)(.*?)(</td>)" linea (mangle $0) 1))
Use (copy $0) or (copy $it).
Quote from: "Lutz"
In a future version only the anaphoric system variable $it, not $0, will contain the found piece. Trying to change $it will cause a protection error. You would then use (copy $it). Today both $0 and $it contain the found piece.
Not sure what this means? Are you proposing to change the operation of $0 in replace?
Sorry, I mistyped; it is corrected now.
Nothing will change for 'replace' or 'find' or any of the other functions using regular expressions.
Currently in 'set-ref' and 'set-ref-all', both $0 and $it are set to the found item. For the next version I mention only the usage of $it for these functions and have taken the usage of $0 out of the documentation. $0 will still work there, but it is deprecated, and its usage for 'set-ref' and 'set-ref-all' will be removed in 10.2 or 10.3, sometime in 2010 or 2011. This will be mentioned in the deprecation chapter (2) of the manual.
Doing 'replace' on $0 can cause a crash and will be flagged with a protection error in the future.
In other words, in the future the usage of $0 to $15 will be limited to regular expression searches; all other situations will use the anaphoric $it.
There is one other usage of $0, as a count in 'replace' and 'read-expr', and I haven't decided yet whether this is good or not. Perhaps a more descriptive $count should be introduced?
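As a rough illustration of the anaphoric $it in 'set-ref-all' described above, here is a small sketch. The list 'lst' is a made-up example, and it assumes a version in which 'set-ref-all' sets $it to the found item.

```newlisp
; hypothetical data
(set 'lst '(1 2 3 2))

; $it refers to each found element while the replacement expression is evaluated,
; so this is meant to multiply every 2 in the list by 10
(set-ref-all 2 lst (* $it 10))

(println lst)
```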
Thank you, now it is a bit shorter:
(define (parse-html texto)
  (map (fn (x)
         (map (fn (y)
                (find-all "(?si)(<t[dh])(.*?)(</t[dh]>)" y
                  ; the trailing 0 asks 'replace' to treat the key as a regular expression
                  (replace "(?si)</?t[dh](.*?)>" (copy $it) "" 0)))
              (find-all "(?si)(<tr)(.*?)(</tr>)" x)))
       (find-all "(?si)(<table)(.*?)(</table>)" texto)))