html parser

Started by Dmi, April 29, 2006, 02:34:21 PM


Dmi

I wrote one a few days ago: http://en.feautec.pp.ru/store/libs/tags.lsp



It can parse structured tagged text such as HTML, copes with unclosed tags, and uses regular expressions to define the tags.



example:

for html:
<html><body>
test
<table align=center><tr><td>test1</td><tr><td>test2<td>test3</table>
</body></html>

Both closed and unclosed tags are present here.



with syntax rules:
; tag format: (tag-sym tag-pattern tag-open|tag-close|tag-self (closes-tag closes-tag ...))
; tag-open  - opens a sublist and becomes its head
; tag-close - closes the listed open sublists and is not kept itself
; tag-self  - closes the listed open sublists and is kept itself
(set 'html-tags '(
  (table  "<table(| [^>]*)>" tag-open  ())
  (table/ "</table>"         tag-close (table th tr td))
  (tr     "<tr(| [^>]*)>"    tag-open  (tr th td))
  (tr/    "</tr>"            tag-close (tr th td))
  (th     "<th(| [^>]*)>"    tag-open  (th td))
  (th/    "</th>"            tag-close (th td))
  (td     "<td(| [^>]*)>"    tag-open  (th td))
  (td/    "</td>"            tag-close (th td))
  (br     "<br>"             tag-self  ())
  (hr     "<hr(| [^>]*)>"    tag-self  ())
  (p      "<p>"              tag-self  ())))
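
To see what a tag pattern like the ones above accepts, here is a minimal sketch that feeds one of them to newLISP's built-in regex function. It only illustrates the pattern syntax; how tags.lsp applies the patterns internally is not shown in this post, and the name table-pattern is made up for the example.

```newlisp
; Sketch only: try one tag pattern from the rules with the built-in regex.
; table-pattern is a hypothetical name, not something defined in tags.lsp.
(set 'table-pattern "<table(| [^>]*)>")

; Opening tag with attributes: regex returns the match, its offset and
; length, followed by the same three values for the parenthesized group.
(println (regex table-pattern "<table align=center>"))
; prints ("<table align=center>" 0 20 " align=center" 6 13)

; Bare opening tag: the empty alternative in (| [^>]*) lets this match too.
(println (true? (regex table-pattern "<table>")))

; A different tag name does not match, so regex returns nil.
(println (regex table-pattern "<tablet>"))
```

The group (| [^>]*) is either empty or a space followed by anything up to the closing bracket, which is why both <table> and <table align=center> are recognized while <tablet> is not.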


You can get the following:
> (set 'htm (TAGS:parse-tags TAGS:html-tags (read-file "example.html")))

("<html><body>\ntest\n" TAGS:table TAGS:tr TAGS:td "test1" TAGS:td/ TAGS:tr
  TAGS:td "test2" TAGS:td "test3" TAGS:table/ "\n</body></html>\n")

The text is parsed and the defined tags are replaced with symbols. The result is a flat, one-dimensional list.


> (TAGS:structure-tags TAGS:html-tags htm)

(TAGS:data "<html><body>\ntest\n"
  (TAGS:table (TAGS:tr (TAGS:td "test1"))
    (TAGS:tr (TAGS:td "test2") (TAGS:td "test3")))
  "\n</body></html>\n")

The preparsed list is converted to a nested list according to the defined tagging rules.

With such a nested list, working with HTML tables becomes relatively easy...
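
As a rough illustration of that last point, here is a small sketch that pulls the cell texts out of a nested list shaped like the structure-tags output. The helpers tagged?, children, cell-text and table-rows are hypothetical and not part of tags.lsp. The tree is typed in by hand with plain symbols so the snippet runs without the library; with the real output the symbols would be TAGS:table, TAGS:tr and TAGS:td.

```newlisp
; Sketch only: hand-written copy of the nested list, with plain symbols
; instead of the TAGS: context symbols the library returns.
(set 'tree
  '(data "<html><body>\ntest\n"
     (table (tr (td "test1"))
            (tr (td "test2") (td "test3")))
     "\n</body></html>\n"))

; True when node is a non-empty list headed by the symbol tag.
(define (tagged? node tag)
  (and (list? node) (not (empty? node)) (= (first node) tag)))

; Direct children of node that are headed by the given tag.
(define (children node tag)
  (filter (fn (child) (tagged? child tag)) (rest node)))

; Text of one cell: its string children joined together.
(define (cell-text cell)
  (join (filter string? (rest cell))))

; One list of cell texts per tr row of a table node.
(define (table-rows table)
  (map (fn (row) (map cell-text (children row 'td)))
       (children table 'tr)))

(println (table-rows (first (children tree 'table))))
; prints (("test1") ("test2" "test3"))
```

Because the unclosed tr and td tags were already resolved when the list was structured, the row walk needs no special cases for them.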
WBR, Dmi