Separating markup from text

Started by Tim Johnson, November 20, 2009, 07:51:00 AM


Tim Johnson

Rebol has a function/refinement called load/markup that parses a string, URL or file into alternating text and tags. I would like to be able to do that in newLISP.

I've tried xml-parse with no luck, although xml-parse does a wonderful job of parsing individual markup tags into a data structure.
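
To make the request concrete, here is a rough sketch of the flat, alternating list being asked for, next to what xml-parse produces. The sample string and the find-all pattern are assumptions for illustration only. The pattern is naive and will split in the wrong place on comments, CDATA sections, or a > inside an attribute value.

```newlisp
; Hypothetical sample input, not taken from a real page.
(set 'sample "<p>Hello <b>world</b></p>")

; Naive split: either a whole <...> run or a run of non-< characters.
(set 'flat (find-all {<[^>]+>|[^<]+} sample))
(println flat)
; → ("<p>" "Hello " "<b>" "world" "</b>" "</p>")

; xml-parse, by contrast, returns a nested list of ELEMENT and TEXT nodes.
(println (xml-parse sample))
```

The flat list keeps tags and text in document order, which is the shape load/markup is described as returning; the xml-parse result is a tree instead.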



Did I miss something about xml-parse, or is there another way?

thanks

tim
Programmer since 1987. Unix environment.

cormullion

#1
I have this in one of my files somewhere - it might be markdown.lsp ...


(define (tokenize-html xhtml)
; return list of tag/text portions of xhtml text
  (letn (
       (tag-match [text]((?s:<!(-- .*? -- \s*)+>)|
(?s:<\?.*?\?>)|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>])*>))*>))*>))*>))*>))*>))[/text]) ; yeah, well...
      (str xhtml)
      (len (length str))
      (pos 0)
      (tokens '())
      )
 (while (set 'tag-start (find tag-match str 8))
    (if (< pos tag-start)
       (push (list 'text (slice str pos (- tag-start pos))) tokens -1))
    (push (list 'tag $0) tokens -1)
    (set 'str (slice str (+ tag-start (length $0))))
    (set 'pos 0))
 ; leftovers: str now holds only the text after the last tag
  (if (> (length str) 0)
     (push (list 'text str) tokens -1))
  tokens)
)

(set 'tokens (tokenize-html (get-url {http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=3386})))


I'm not sure how well it works in general, and I know it struggles with some input, such as JavaScript embedded in script elements.

Tim Johnson

#2
I just did a test and it seems to work fine. I included a simple JavaScript function between <script></script> tags.

This is great. I'd recommend this as a native function. Between that and xml-parse there would be a powerful tool.



BTW, this is the beginning of a project for me: decomposing HTML text into a data structure that allows modification in a pseudo-DOM fashion, like loading records into forms, setting form actions, etc. I have such functionality with Rebol and Python and need the same for newLISP.

Thanks very much cormullion, you've saved me a bunch of time.



cheers

tim (a boob when it comes to regex)

cormullion

#3
Cool - it's a start!



I think it fails on this page because there's a greater-than sign in the JavaScript code and the script starts with less-than, bang, bracket, CDATA. It's possible that the regexes could be tweaked, but I wonder whether that would be the start of a never-ending job. Your big problem might be not with this kind of valid HTML but with invalid HTML...
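
This kind of early cut-off can be reproduced in isolation. The sketch below uses a simplified, un-nested version of the tag pattern (an assumption made for brevity, not the full expression from the earlier post) on a made-up CDATA fragment:

```newlisp
; Made-up fragment: a CDATA section containing a greater-than sign.
(set 'sample "<![CDATA[ if (a > b) go(); ]]>")

; Simplified tag pattern: "<", one of a-z / ! $, then anything up to the
; first ">". The trailing 0 asks find for a regex search with no options.
(find {<[a-z/!$][^<>]*>} sample 0)

(println $0)
; → <![CDATA[ if (a >
```

The match stops at the > inside the script text, so the rest of the CDATA section would be handed on as ordinary text.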



- I hate regexes more than you! :)

Tim Johnson

#4
One could use a brute-force, iterate-on-every-character approach that would overcome this problem by consolidating a tag, identifying its type, and ignoring certain '>' and '<' characters. It would be a performance hit, but the data structure could be stored with 'save and rebuilt only when an mtime check shows that the source document has changed. Because I do work for those who like to push the limits and play with ideas, no doubt I'm going to end up using the "brute force" method, but you've given me a starting point.
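
A minimal sketch of what such a hand-written scanner might look like, together with the save/mtime caching idea. All function and symbol names here (end-of-markup, scan-markup, cached-tokens, markup-cache) are hypothetical. For brevity it jumps ahead with find instead of visiting every character, and it does not handle a > inside a quoted attribute value.

```newlisp
; Sketch only: end-of-markup, scan-markup, cached-tokens and markup-cache
; are hypothetical names, not part of newLISP.

; s starts with "<"; return the length of the markup item at its head.
; Comments, CDATA sections and processing instructions end at their own
; closing delimiter, so a ">" inside them is ignored.
(define (end-of-markup s)
  (letn ((closer (cond ((starts-with s "<!--") "-->")
                       ((starts-with s "<![CDATA[") "]]>")
                       ((starts-with s "<?") "?>")
                       (true ">")))
         (idx (find closer s)))
    (if idx (+ idx (length closer)) (length s))))

; Return a list of (tag "...") and (text "...") items in document order.
(define (scan-markup str)
  (let ((tokens '()) (n 0) (item ""))
    (while (> (length str) 0)
      (if (starts-with str "<")
          (begin
            (set 'n (end-of-markup str))
            (set 'item (slice str 0 n))
            (push (list 'tag item) tokens -1)
            (set 'str (slice str n))
            ; the body of a script element is kept as one text item
            (if (starts-with (lower-case item) "<script")
                (begin
                  (set 'n (or (find "</script" (lower-case str)) (length str)))
                  (if (> n 0) (push (list 'text (slice str 0 n)) tokens -1))
                  (set 'str (slice str n)))))
          (begin
            (set 'n (or (find "<" str) (length str)))
            (push (list 'text (slice str 0 n)) tokens -1)
            (set 'str (slice str n)))))
    tokens))

; Rebuild the token list only when the source file is newer than the cache.
; file-info index 6 is the modification time.
(define (cached-tokens src cache)
  (if (and (file? cache) (>= (file-info cache 6) (file-info src 6)))
      (load cache)
      (begin
        (set 'markup-cache (scan-markup (read-file src)))
        (save cache 'markup-cache)))
  markup-cache)

(println (scan-markup "<p>a</p><script>if (a > b) x();</script>"))
; → ((tag "<p>") (text "a") (tag "</p>") (tag "<script>") (text "if (a > b) x();") (tag "</script>"))
```

Because the script body is consumed as a single text item, the > in the comparison no longer ends a tag. The cache file written by save is plain newLISP source, so load restores the list without re-scanning.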

thanks again

tim