I wrote one a few days ago: http://en.feautec.pp.ru/store/libs/tags.lsp
It can parse structured tagged text such as HTML, copes with unclosed tags, and uses regular expressions to define the tags.
Example, for this HTML:
<html><body>
test
<table align=center><tr><td>test1</td><tr><td>test2<td>test3</table>
</body></html>
Both closed and unclosed tags are present here.
With these syntax rules:
; tag format: (tag-sym tag-pattern tag-open|close|self (closes-tag closes-tag))
; tag-open  - closes the listed sublists, then opens a new sublist and becomes its head
; tag-close - closes the listed sublists and is not kept itself
; tag-self  - closes the listed sublists and is kept as a standalone element
(set 'html-tags '(
(table "<table(| [^>]*)>" tag-open ())
(table/ "</table>" tag-close (table th tr td))
(tr "<tr(| [^>]*)>" tag-open (tr th td))
(tr/ "</tr>" tag-close (tr th td))
(th "<th(| [^>]*)>" tag-open (th td))
(th/ "</th>" tag-close (th td))
(td "<td(| [^>]*)>" tag-open (th td))
(td/ "</td>" tag-close (th td))
(br "<br>" tag-self ())
(hr "<hr(| [^>]*)>" tag-self ())
(p "<p>" tag-self ())))
You get the following:
> (set 'htm (TAGS:parse-tags TAGS:html-tags (read-file "example.html")))
("<html><body>\ntest\n" TAGS:table TAGS:tr TAGS:td "test1" TAGS:td/ TAGS:tr
 TAGS:td "test2" TAGS:td "test3" TAGS:table/ "\n</body></html>\n")
The text is parsed and the defined tags are replaced with symbols. The result is a flat, one-dimensional list.
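For readers who do not use newLISP, the flattening step can be sketched in Python. This is not the code in tags.lsp, only a rough analogue of what the output above shows: `Sym`, `RULES` and `tokenize` are hypothetical names, and only a subset of the rules is copied over.

```python
import re
from collections import namedtuple

# Hypothetical stand-in for a newLISP symbol such as TAGS:td.
Sym = namedtuple("Sym", "name")

# Subset of the html-tags rules above: (symbol name, regular expression).
RULES = [
    ("table",  r"<table(| [^>]*)>"),
    ("table/", r"</table>"),
    ("tr",     r"<tr(| [^>]*)>"),
    ("tr/",    r"</tr>"),
    ("td",     r"<td(| [^>]*)>"),
    ("td/",    r"</td>"),
]

def tokenize(rules, text):
    """Return a flat list of plain strings and Sym objects, in document order."""
    # One alternation with a named group per rule, so each match
    # can be traced back to the rule that produced it.
    combined = re.compile("|".join(
        "(?P<r%d>%s)" % (i, pattern) for i, (_, pattern) in enumerate(rules)))
    out, pos = [], 0
    for m in combined.finditer(text):
        if m.start() > pos:
            out.append(text[pos:m.start()])   # text between two known tags
        idx = next(i for i in range(len(rules))
                   if m.group("r%d" % i) is not None)
        out.append(Sym(rules[idx][0]))        # the tag itself becomes a symbol
        pos = m.end()
    if pos < len(text):
        out.append(text[pos:])                # trailing text after the last tag
    return out

HTML = ("<html><body>\ntest\n"
        "<table align=center><tr><td>test1</td><tr><td>test2<td>test3</table>\n"
        "</body></html>\n")
tokens = tokenize(RULES, HTML)
```

Tags without a rule (`<html>`, `<body>`) stay inside the surrounding strings, just as in the list above.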
> (TAGS:structure-tags TAGS:html-tags htm)
(TAGS:data "<html><body>\ntest\n"
 (TAGS:table (TAGS:tr (TAGS:td "test1"))
  (TAGS:tr (TAGS:td "test2") (TAGS:td "test3")))
 "\n</body></html>\n")
The pre-parsed list is converted to a nested list according to the defined tagging rules.
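The nesting step can be sketched in Python with a stack of open sublists. Again, this is not the code in tags.lsp: `Sym`, `RULES` and `structure` are hypothetical names, and the rule "close every open sublist whose head is in the closes list, innermost first" is my reading of the rule comments above.

```python
from collections import namedtuple

# Hypothetical stand-in for a newLISP symbol such as TAGS:td.
Sym = namedtuple("Sym", "name")

# Subset of the html-tags rules: symbol name -> (kind, tags it closes).
RULES = {
    "table":  ("open",  ()),
    "table/": ("close", ("table", "th", "tr", "td")),
    "tr":     ("open",  ("tr", "th", "td")),
    "tr/":    ("close", ("tr", "th", "td")),
    "td":     ("open",  ("th", "td")),
    "td/":    ("close", ("th", "td")),
    "br":     ("self",  ()),
}

def structure(rules, flat):
    """Turn a flat list of strings and Syms into a nested list."""
    root = [Sym("data")]
    stack = [root]                      # stack[-1] is the innermost open sublist
    for item in flat:
        if not isinstance(item, Sym):
            stack[-1].append(item)      # plain text goes into the current sublist
            continue
        kind, closes = rules[item.name]
        # Assumption: close open sublists while their head is in `closes`.
        while len(stack) > 1 and stack[-1][0].name in closes:
            stack.pop()
        if kind == "open":
            node = [item]               # the tag becomes the head of a new sublist
            stack[-1].append(node)
            stack.append(node)
        elif kind == "self":
            stack[-1].append(item)      # kept as an element, opens nothing
        # kind == "close": the tag itself is dropped
    return root

flat = ["<html><body>\ntest\n",
        Sym("table"), Sym("tr"), Sym("td"), "test1", Sym("td/"),
        Sym("tr"), Sym("td"), "test2", Sym("td"), "test3",
        Sym("table/"), "\n</body></html>\n"]
tree = structure(RULES, flat)
```

With this simple rule a closing `</table>` would also close an enclosing table's cells, so nested tables would need a stricter stop condition.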
With such a nested list, parsing HTML tables becomes relatively easy...
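As an illustration of that last point, here is a small Python sketch that pulls rows and cells out of such a nested list. It assumes a simplified representation in which each sublist starts with its tag name as a plain string; `is_node`, `cell_text` and `tables` are hypothetical helpers, not functions from tags.lsp.

```python
def is_node(x, tag):
    """True if x is a sublist whose head is the given tag name."""
    return isinstance(x, list) and bool(x) and x[0] == tag

def cell_text(cell):
    """Join the direct text children of a cell and trim whitespace."""
    return "".join(c for c in cell[1:] if isinstance(c, str)).strip()

def tables(tree):
    """Return every table under tree as a list of rows of cell texts."""
    found = []
    for child in tree[1:]:              # skip the head of the sublist
        if is_node(child, "table"):
            found.append([
                [cell_text(c) for c in row[1:]
                 if is_node(c, "td") or is_node(c, "th")]
                for row in child[1:] if is_node(row, "tr")])
        elif isinstance(child, list):
            found.extend(tables(child)) # look for tables further down
    return found

tree = ["data", "<html><body>\ntest\n",
        ["table",
         ["tr", ["td", "test1"]],
         ["tr", ["td", "test2"], ["td", "test3"]]],
        "\n</body></html>\n"]
rows = tables(tree)
```

For the example document, `rows` holds one table with two rows: the first with the single cell "test1", the second with "test2" and "test3".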