how to xml-parse HTML pages?

Started by ino-news, March 27, 2007, 05:10:09 AM


ino-news

i would like to xml-parse the type of HTML pages one finds "in the wild".  some tags are used in HTML which lead to non-well-formed XML, most notably <p> or <br>.  i could replace them with something like an XML NOOP or a simple space before running xml-parse.

is there a recipe you could recommend to turn "normal" HTML pages into something xml-parseable?  --clemens
regards, clemens

Lutz

#1
This utility will try to fix bad HTML automatically; if it can't, it will still emit errors. You could use 'exec' to capture the returned content and error messages in a list.

http://www.w3.org/People/Raggett/tidy/

From my own experience with parsing man-made HTML, I would say it is better to use regular expressions to isolate certain content than to try XML parsing, which expects well-formed markup. It all depends on how well 'tidy' works on your pages.
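
For example, a minimal sketch of the 'exec' approach (assuming the tidy binary is installed and on the PATH; "page.html" is a placeholder file name):

```newlisp
; 'exec' runs a shell command and returns its stdout as a list of lines;
; -asxml asks tidy for well-formed XHTML, -q suppresses non-error chatter
(set 'xhtml (join (exec "tidy -asxml -q page.html") "\n"))
; if tidy succeeded, the result should now survive xml-parse
(xml-parse xhtml)
```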



Lutz

Dmi

#2
Try http://en.feautec.pp.ru/store/libs/tags.lsp

This isn't an XML tool, but it was written mainly for parsing complex HTML.

There is no separate documentation, but there is an example at the end of the module.

Pay particular attention to the "parse-tags" and "structure-tags" functions.
WBR, Dmi

ino-news

#3
Quote from: "Dmi"
Try http://en.feautec.pp.ru/store/libs/tags.lsp

This isn't an XML tool, but it was written mainly for parsing complex HTML.  There is no separate documentation, but there is an example at the end of the module.  Pay particular attention to the "parse-tags" and "structure-tags" functions.


that's what i had in mind.  lutz:  using "tidy" is not a bad idea, but i was thinking more along the lines of Dmi's solution.  he uses regular expressions in his recognizers, all well kept together in a table at the beginning of his code.

Dmi:  your code returns a flat "tag soup".  it recognizes the (few) tags it has recognizers for and leaves the rest as strings.  not bad, but not what i need.  actually, i want code to fill in a form.  for this, i could ignore (leave as strings to be discarded after a cursory check) everything _except_

"<input type=(checkbox|hidden|submit) (name|value) ..." and
"<textarea name=...>"

your central piece of code is:

; tags: ((tag open? (close-tag close-tag ...) (tag ...))
(define (parse-tags tags str)
  (let (res (list str))
    (dolist (t tags)
      (set 'res (map (fn (x)
       (if (string? x)
         (letn (lst (parse x (t 1) 0)
          len (length lst))
           (if (< len 1) (set 'len 1))
           (dotimes (l (lesser len))
             (push (t 0) lst (greater (* 2 l))))
           (filter (fn (x) (or (not (string? x))
             (not (empty? x))))  lst))
         x))
         (flat res))))
    (flat res)))
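
as far as i understand it, it's used roughly like this (guessing the rule format from the examples at the end of your module; "lesser"/"greater" seem to be your one-off helpers for (- x 1) and (+ x 1), and the second rule element is a regex separator handed to 'parse'):

```newlisp
; guessed usage - each rule: (name "separator-regex" type (tags-it-closes))
(set 'rules '((p "<p> *" tag-open ())
              (br "<br> *" tag-open ())))
; splits on each pattern and pushes the rule's name symbol
; back in at every split point
(parse-tags rules "one<p>two<br>three")
; -> a flat soup like ("one" p "two" br "three")
```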


what happens if i leave out the "(flat res)" in "(parse-tags)"? do you say "(list str)" only to have a list for "map" to work on?

also, my current version is an awk script.  it does its job well, but frequently name/value pairs in <input...> tags span multiple lines, so i have to set flags stating "collect in progress", because awk doesn't handle regular expressions spanning multiple lines.  now i could do away with all that if i can get the proper regular expression into the recognizer item.  example:

<input type="checkbox" name="send3" checked>
<input type="hidden" name="type3" value="www">
<input type="hidden" name="master3" value="abuse-gd@china-netcom.com">
<input type="hidden" name="info3"
 value="http%3A%2F%2Fdestine.descabc.com%2F%3Fcharitable%2F">To: <a href="mailto:abuse-gd@china-netcom.com">abuse-gd@china-netcom.com</a>

does the pcre lib built into newlisp allow pattern matching across lines?  --clemens
regards, clemens

ino-news

#4
Quote from: "ino-news"
does the pcre lib built into newlisp allow pattern matching across lines?

also, in addition to just recognizing the tags, i see no immediate possibility to extract their attributes, which would eliminate another step for my program.  Dmi, do you see a possibility to use parenthesized groups in the regular expressions of the tags table to have this info ready in the output?  --clemens
regards, clemens

Lutz

#5
Quote
does the pcre lib built into newlisp allow pattern matching across lines?

yes, it does so by default, and it can be limited to one line using a special options flag described here:

http://newlisp.org/downloads/newlisp_manual.html#regex

The same flags also work with 'find', 'replace', 'parse' and all other functions using regular expressions. By default matching always spans multiple lines, but be careful with the "." dot in regex patterns, which does not match newline characters unless you use option flag bit 4 (PCRE_DOTALL).
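
For example:

```newlisp
(set 'page "<input type=\"hidden\" name=\"info3\"\n value=\"www\">")
; the dot does not match the newline, so this pattern fails here
(regex "<input .*>" page)       ; -> nil
; with option 4 (PCRE_DOTALL) the dot crosses the line break
(regex "<input .*>" page 4)     ; -> matches the whole tag
```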



Lutz

Dmi

#6
Clemens,

I will respond as I remember the questions :-)

"parse-tags" really does return tag soup; that is its goal - to parse a plain HTML string into parts.

Next, you must use "structure-tags" on the result, which builds a nested list structure according to the rules described in the html-tags variable. So you'll get a table with table rows and table data inside them, and so on.

If you want more tags to be recognized, extend the html-tags list with your own rules and enjoy.

Btw, you may also define similar rules for pretty much all text reports, not only HTML-like ones.

About attributes... When I wrote this, tag attributes were not a subject of my interest. I'll think about this tomorrow.
WBR, Dmi

Dmi

#7
Quote
what happens if i leave out the "(flat res)" in "(parse-tags)"? do you say "(list str)" only to have a list for "map" to work on?

(list str) is the starting condition for 'res', which is iteratively rewritten as sublists of 'str'.

Using 'flat' is mandatory for the algorithm. Simply insert (println res) just before (flat res) and you'll see the difference.

In fact the current algorithm isn't good at telling whether the current closing angle bracket ">" closes "<input>", "<textarea>" or another tag.

You have several ways here:

1. Preparse the HTML contents to transform all "<input attrs>" into "<input> attrs </input>" and so on ('replace' will help you),

and use rules like
(input "<input> " tag-open ())
(input/ "</input> " tag-close (input))

etc.

2. Preparse the HTML contents to transform all "<[^ ]* [^ ]*>" into

"<$1> <attrs> $2 </attrs>"

and use one universal rule:
(attrs "<attrs> " tag-open ())
(attrs/ "</attrs> " tag-close (input))

And you'll have _all_ tag attributes wrapped as (attrs attribute-string)
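
A sketch of variant 2 (the exact pattern and the $1/$2 group references are just an illustration):

```newlisp
(set 'html "<input type=\"hidden\" name=\"type3\" value=\"www\">")
; wrap the attribute part of every tag:
;   <input attrs>  ->  <input> <attrs> attrs </attrs>
(replace "<([^ >]+) ([^>]*)>" html
         (string "<" $1 "> <attrs> " $2 " </attrs>") 0)
; html is now:
;   <input> <attrs> type="hidden" name="type3" value="www" </attrs>
```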



Btw, about awk: you may redefine FS to something unusual and regexps will work across "\n".

Also, you may remember the rest of "$0" from the previous iteration and concatenate it into the current one...
WBR, Dmi

ino-news

#8
Quote from: "Dmi"
In fact the current algorithm isn't good at telling whether the current closing angle bracket ">" closes "<input>", "<textarea>" or another tag.

i thought so.  the problem with your current approach lies in the puny little ">" which closes tags like "<input ...>", because a single ">" closes any tag anywhere.  i was thinking about making your code more selective.  for every tag recognized there is a list of tags it closes, and i could add specific tags after "tag-open" to tell it which tags will definitely close it.  anyway, your code defines a tag stack, which can be useful to attack the structure.


Quote from: "Dmi"
You have several ways here:

1. Preparse the HTML contents to transform all "<input attrs>" into "<input> attrs </input>" and so on ('replace' will help you),

and use rules like

(input "<input> " tag-open ())
(input/ "</input> " tag-close (input))

etc.


that's a good idea!  i had already thought of preprocessing the input to make it more legible.


Quote from: "Dmi"
2. Preparse the HTML contents to transform all "<[^ ]* [^ ]*>" into "<$1> <attrs> $2 </attrs>"

and use one universal rule:

(attrs "<attrs> " tag-open ())
(attrs/ "</attrs> " tag-close (input))

And you'll have _all_ tag attributes wrapped as (attrs attribute-string)


another good one, which i could combine with the previous suggestion.


Quote from: "Dmi"
Btw, about awk: you may redefine FS to something unusual and regexps will work across "\n".  Also, you may remember the rest of "$0" from the previous iteration and concatenate it into the current one...

this one i don't understand; you'd have to explain from "remember the rest of $0 ..." onwards.



currently, i'm playing with the following idea:

since there are only a few tags needed for the task, i can preprocess the input to delete all the tags that don't have to be interpreted.  in that same step i'd make the input uniform for "(parse...)" by inserting spaces or other markers.  then, only "<input...>" and "<textarea>" tags will be left.  they vary in the number and type of attributes used.  then i'd run a table like the one from your code against them, but use "(match...)" to find and extract the information i need.

since i don't control the markup on the page and it can change anytime, i need to be flexible, but only on a small number of items.
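
something like this sketch of the extraction step (pattern and test data invented for illustration):

```newlisp
(set 'page "<input type=\"hidden\" name=\"type3\"\n value=\"www\">")
; negated classes like [^>] also match newlines, so the pattern
; copes with name/value pairs spanning several lines
(find-all "<input[^>]*name=\"([^\"]*)\"[^>]*value=\"([^\"]*)\"" page
  (list $1 $2))
; -> (("type3" "www"))
```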



there are a number of leads i looked up on the 'net.  once i'm finished with the parsing, i need to cook up the "multipart/form-data" content-type for the PUT method.

thanks for your suggestions,  --clemens
regards, clemens

ino-news

#9
Quote from: "ino-news"
Quote from: "Dmi"
Btw, about awk: you may redefine FS to something unusual and regexps will work across "\n".  Also, you may remember the rest of "$0" from the previous iteration and concatenate it into the current one...

this one i don't understand; you'd have to explain from "remember the rest of $0 ..." onwards.

sorry, i had overlooked that you were talking about awk.  now it makes sense to me.

--clemens
regards, clemens

ino-news

#10
Quote from: "ino-news"
there are a number of leads i looked up on the 'net.  once i'm finished with the parsing, i need to cook up the "multipart/form-data" content-type for the PUT method.

(i meant "POST" there.)



there are two promising lisp html parsers out there:

http://www.neilvandyke.org/htmlprag/
http://www.stklos.org/Doc/extensions/htmlprag/htmlprag.html

(the second is a stklos port of the former).  htmlprag is a bunch of functions: a tokenizer generates a function which delivers the next tag on successive invocations, and there is a structure parser and an HTML emitter.  the structure comes out as:



(html->shtml
 "<html><head><title></title><title>whatever</title></head><body>
<a href=\"url\">link</a><p align=center><ul compact style=\"aa\">
<p>BLah<!-- comment <comment> --> <i> italic <b> bold <tt> ened</i>
still < bold </b></body><P> But not done yet...")
=>
(*TOP* (html (head (title) (title "whatever"))
             (body "\n"
                   (a (@ (href "url")) "link")
                   (p (@ (align "center"))
                      (ul (@ (compact) (style "aa")) "\n"))
                   (p "BLah"
                      (*COMMENT* " comment <comment> ")
                      " "
                      (i " italic " (b " bold " (tt " ened")))
                      "\n"
                      "still < bold "))
             (p " But not done yet...")))


then there's a file named "html4each.scm" in the famous slib: http://www.google.com/search?q=html4each .  if your system allows installing the slib, look at this file.  it scans HTML, calling a user-supplied procedure with a tag string as its sole argument.

since the guile scheme interpreter is quite user-friendly and the slib was present as well, i did my program in guile instead of newlisp.  it took about three days until it was alpha-ready, just like the newlisp program before.  seems to be standard ...  --clemens
regards, clemens