Title: de-tag
Post by: newdep on March 07, 2004, 04:02:20 AM

Hello All,

Anyone tried to build a html de-tagger from string/buffer?

Norman.

Title:
Post by: Lutz on March 07, 2004, 08:12:13 AM

Try this:

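(define (strip-html buff)
    (set 'page '())
    ; split the buffer into lines on CR-LF or LF (option 0 = regex split)
    (dolist (lne (parse buff "\r\n|\n" 0))
        ; replace each tag in the line with a space
        (push (replace "<.+?>" lne " " 0) page))
    ; push prepends, so restore the original order before joining
    (join (reverse page) "\n"))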

(strip-html (get-url "http://newlisp.org"))

On smaller buffers (replace "<.+?>" buff " " 0) may be just enough code, but on bigger buffers the line-by-line solution will be much faster.

You may also want to try the 'greedy' option 512 instead of 0 for better/faster results.

Lutz

Title:
Post by: newdep on March 07, 2004, 08:27:10 AM

Thank you Lutz, you also answered a small question from a previous post with the strip-html example. I always forget to move from strings towards lists.

I should think more in lists when using newLISP :-)

Norman.

Title:
Post by: Lutz on March 07, 2004, 09:07:18 AM

I did some benchmarks of (strip-html buff) versus (replace "<.+?>" buff " " 512), and it turns out that the simpler solution without splitting into lines is also the fastest on 'newlisp_manual.html', a 400 Kbyte file.

The biggest speedup is using the greedy option 512, cutting the time to a quarter:

(strip-html buff) -> 12 seconds

(replace "<.+?>" buff " " 0) -> 13 seconds

(replace "<.+?>" buff " " 512) -> 3 seconds !!!

Lutz

ps: still don't forget the push/join method, which sometimes is superior (see base64 example). Also: 'greedy' gives different output!
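
To make the "different output" warning concrete, here is a minimal sketch, assuming option 512 is PCRE's ungreedy flag, which inverts the greediness of the quantifiers. The sample strings `a` and `b` are made up for illustration and are not from the benchmark above.

```newlisp
; Sketch only: `a` and `b` are made-up sample strings, not from the thread.
(setq a "Be <b>very</b> careful.")
(setq b "Be <b>very</b> careful.")

; Option 0: "<.+?>" stays non-greedy, so each tag is replaced separately.
(replace "<.+?>" a " " 0)

; Option 512: the same pattern now runs from the first "<" to the
; last ">" on the line, so the text between the tags is lost as well.
(replace "<.+?>" b " " 512)

(println a)
(println b)
```

After this runs, `a` should still contain the word "very" while `b` should not, which is probably also part of why the 512 run is faster: it makes far fewer, much longer matches.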

Title:
Post by: William James on June 16, 2006, 08:49:30 PM

> (regex {<.*>} "Be <b>very</b> careful.")
("<b>very</b>" 3 11)
> (regex {<.*?>} "Be <b>very</b> careful.")
("<b>" 3 3)
> (regex {<.*>} "Be <b>very</b> careful." 512)
("<b>" 3 3)
> (regex {<.*?>} "Be <b>very</b> careful." 512)
("<b>very</b>" 3 11)

In .*?, ? makes the * non-greedy. Since option 512 inverts greediness, (regex {<.*?>} "..." 512) is equivalent to (regex {<.*>} "..."). Both of these may strip out too much from the html string. However, note that the dot, as in Ruby and Perl, won't match a newline unless you select that option.

> (setq str "<foo\nbar>")
"<foo\nbar>"
> (replace {<.*>} str "" 0)
"<foo\nbar>"
> str
"<foo\nbar>"
> (replace {<.*>} str "" 4)
""
> str
""

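Putting the two points together, a de-tagger that removes one tag at a time and still copes with tags broken across lines could be sketched as below. This is only an illustration: the sample string `html` is invented, and it assumes option 4 is the "dot matches newline" option used in the session above.

```newlisp
; Sketch only: `html` is a made-up sample string, not from the thread.
(setq html "Be <b\nclass=\"x\">very</b> careful.")

; {<.*?>} is non-greedy, so each match stops at the first ">".
; Option 4 lets the dot match "\n", so the tag split over two lines
; is still matched. replace changes `html` in place.
(replace {<.*?>} html "" 4)

(println html)   ; prints: Be very careful.
```

Option 512 is deliberately left out here, since combining it with `.*?` would make the pattern greedy again and strip the text between the tags.
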
newLISP Fan Club

Forum => Anything else we might add? => Topic started by: newdep on March 07, 2004, 04:02:20 AM