Parsing Markup Tags. Code Optimization

Started by Tim Johnson, December 26, 2009, 12:07:40 PM


Tim Johnson

I've written a newLISP function that processes a string and returns a list of elements in which markup tags are separated from plain text.

There is an implementation featured here:

http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=3386

But it doesn't handle JavaScript code.

The function works for me. I expect that, once I put it to work, it will need some tweaking. However, I have thus far used newLISP only intermittently, and I would deeply appreciate it if some of you newLISP veterans would review this code and suggest optimizations. I have based this on a function that I wrote for Python (Rebol has this feature built in), but I would like suggestions as to how to make the code more "newlispish". Such suggestions would certainly contribute to my overall grasp of newLISP.

And also, I want to wish you all the best for this holiday season and for the New Year.

Code follows:

;; <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
;;  @syntax [-parse-markup <str>-]
;;  @Description Parse 'str' into alternating plain text and markup elements
;;  @Returns a list. Adjacent tags are separate elements.
;; >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
(define (parse-markup str)
  (let ((res '())(buf "")(inTag)(inScript)(chr)(nxt)(ndx -1)(endp (- (length str) 1))
      ;; "Private" functions
      (data (fn()(not(empty? buf))))             ;; test for data in temporary buffer
      (add2buf (fn()(set 'buf(append buf chr)))) ;; append char to buffer
      (sflag (fn()(set 'inScript (if (find "javascript" (lower-case buf))))))
      (add2res
        (fn(c)  ;; Add buffer and re-initialize. Set 'inScript flag
          (sflag)              ;; set/unset 'inScript
          (push buf res)       ;; Add buffer to results
          (set 'buf "")        ;; Reinitialize the buffer
          (if c (add2buf))))) ;; End 'let initialization form
    (dostring (c str)     ;; scan string char-by-char
      (inc ndx)           ;; position of char
      (set 'chr (char c)) ;; one-char string
      (if (< ndx endp)    ;; keep track of next char to process
        (set 'nxt (str (+ ndx 1))))
      (cond
        ((= chr "<")      ;; Begin a tag insertion if not javascript
          (cond
            (inScript       ;; Still processing javascript code
              (cond
                ((= nxt "/")  ;; Finishing javascript code block.
                  (set 'inTag true
                       'inScript nil)    ;; set boolean flags
                  (cond
                    ((data)              ;; if buffer has data, push and clear
                      (add2res chr))     ;; Add buffer to results and re-initialize with char
                     (true (add2buf))))  ;; add char to empty buffer
                 (true (add2buf))))      ;; Keep filling 'buf
            (true            ;; Not in javascript code block. Starting new tag
              (set 'inTag true)
              (cond
                ((data)               ;; If 'buf has data.
                  (add2res chr))      ;;    push and re-initialize with char
                (true (add2buf))))))  ;; Buffer is empty, keep filling 'buf
        ((and (= chr ">") (not inScript)) ;; finishing a tag.
          (set 'inTag nil)
          (sflag)       ;; set flags
          (add2buf)     ;; add char to buffer
          (add2res))    ;; Push buffer and reinitialize
        (true           ;; still in script block
          (add2buf))))  ;; just add to 'buf, end dostring/outermost cond
    (if (data)        ;; If data in 'buf, add to result
      (add2res))
    (reverse res)))
Programmer since 1987. Unix environment.

unixtechie

#1
do not understand at all what you are trying to do here.



1. If you need to separate tags from text, then:

(a) "canonize" the text by adding "\n" (newline) after each ">"

(b) consider each line in the substituted text your needed outcome.



That's it.



This is implemented with exactly 2 operators, "read-file" and

"replace", then read the modified buffer string line-by-line with something like "regex" with offsets.



2. If  you do not wish to slurp the file into memory,

(a) use "search"  to get to needed positions and "seek" to keep a list of offsets,

(b) next read your strings jumping between the known offsets.



That's all.
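
A minimal sketch of option 1 above, applied to a string already in memory rather than to a file. The helper name `split-after-tags` is hypothetical. Because a newline is added only after each ">", text that precedes a tag stays on the same line as that tag, and any newlines already present in the input also split the result.

```newlisp
;; Hypothetical helper: put a newline after every ">" and split on newlines.
(define (split-after-tags str)
  (replace ">" str ">\n")            ; destructive, but str is a local copy here
  (clean empty? (parse str "\n")))   ; drop the empty pieces left by the split

(split-after-tags "<p>Hello <b>world</b></p>")
;; → ("<p>" "Hello <b>" "world</b>" "</p>")
```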



You write in abstractions - can you give primitive examples (input - expected output) of what you are trying to achieve?

cormullion

#2
Tim - nice code, and a pleasant xmas gift! :)



I've yet to look at it closely, but it looks good. I'll see if I can run it over some sample pages sometime.



When I see (reverse..) I look to see if there's been pushing to the front or end of lists or strings - if you push to the end you can sometimes omit the reverse.



unixtechie - I think the problem is that simple parsing of HTML by angle brackets usually breaks when the page contains Javascript code. (Not sure what the standards say, but for practical reasons it doesn't matter...)

Tim Johnson

#3
Hopefully by this time unixtechie groks the issue with javascript....

the code here http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=3386 is both shorter and much faster, but doesn't handle the JavaScript.

And I'm sure that is the solution that  unixtechie refers to.

A more complete  (language agnostic) solution might be something like this pseudo-code:

(define (load-markup s)
  (if (find "<script" s 1)
      (parse-markup-the-hard-way s)
      (parse-markup-with-regexes s)))
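
One way the regex branch of the pseudo-code above could look, as a sketch only. `parse-markup-with-regexes` is the placeholder name from the pseudo-code, and the pattern is an assumption, not the implementation from the linked topic. Each match is either one complete tag or one run of text, so it is only safe when no attribute value or script body contains a stray "<" or ">".

```newlisp
;; Sketch of the fast path: find-all returns every match of the pattern,
;; which is either "<...>" (a tag) or a run of characters up to the next "<".
(define (parse-markup-with-regexes s)
  (find-all {<[^>]*>|[^<]+} s))

(parse-markup-with-regexes "<p>Hello <b>world</b></p>")
;; → ("<p>" "Hello " "<b>" "world" "</b>" "</p>")
```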

And cormullion, I note your comment about using 'reverse on the result set:

That was deliberate. I pondered whether it would be faster to use reverse once than to use (push item list -1) every time.



thanks folks.

xytroxon

#4
Quote from: "Tim Johnson"
And cormullion, I note your comment about using 'reverse on the result set:

That was deliberate. I pondered whether it would be faster to use reverse once than to use (push item list -1) every time.


Lutz has optimized newLISP  for the (push item list -1) form... (So you don't need to use coding tricks ;)
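
A small sketch of the two styles being compared, with made-up variable names `front` and `back`. Both build the same list; the second simply skips the final reverse. No timings are claimed here; `time` can be wrapped around either loop to measure on a given machine.

```newlisp
(set 'front '() 'back '())

(dotimes (i 5) (push i front))     ; push to the front ...
(set 'front (reverse front))       ; ... then reverse once at the end

(dotimes (i 5) (push i back -1))   ; push to the end, no reverse needed

(= front back)
;; → true
```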



-- xytroxon
"Many computers can print only capital letters, so we shall not use lowercase letters."

-- Let's Talk Lisp (c) 1976

TedWalther

#5
There is an O'Reilly book called "Javascript: The Good Parts". It includes a nice parser in the back, only takes 3 pages of very clean code.  Perhaps it would be best to include a javascript parser inside your parser?



How about the built-in xml-parse function?  Would it be easier for you to do that, and then manipulate the s-ml expression tree directly?



Ted
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Tim Johnson

#6
Quote from: "xytroxon"

Lutz has optimized newLISP  for the (push item list -1) form... (So you don't need to use coding tricks ;)

-- xytroxon

Understood.

Thanks!

Tim Johnson

#7
Quote from: "TedWalther"
There is an O'Reilly book called "Javascript: The Good Parts". It includes a nice parser in the back, only takes 3 pages of very clean code.  Perhaps it would be best to include a javascript parser inside your parser?

Ted

I do a lot with JavaScript <sigh!>

I have JavaScript functions that load data into forms, but there are gotchas when the HTML source is rendered dynamically. For example, when a page is rendered and then an interior form is rendered via AJAX, I have had problems finding an event to attach a handler to. Also, JavaScript makes me cautious. After all, when one is writing client-side code, one has to consider any number (potentially millions) of interpreters running inside any number (and my customers do hope millions) of browsers.

Whereas, when I write server-side code, I only have to consider one interpreter, given that the interpreters should behave identically across a small number of different operating systems.
Quote from: "TedWalther"
How about the built-in xml-parse function?  Would it be easier for you to do that, and then manipulate the s-ml expression tree directly?

Ted

Consider the following code: (And look out for wrapped strings)

;; Braces delimit the strings so the embedded double quotes need no escaping
(set 'res (xml-parse {abcdefghijk<Script type="Javascript">var a=1; if(a > 1)alert("Yes");else alert("No");</script><div>lmnopq</font>rstuvwxyz}))
(println "Using xml-parse: " res)
(set 'res (parse-markup {abcdefghijk<Script type="Javascript">var a=1; if(a > 1)alert("Yes");else alert("No");</script><div>lmnopq</font>rstuvwxyz}))
(println "Using parse-markup: " res)

Result:

Using xml-parse: nil
Using parse-markup: ("abcdefghijk" "<Script type="Javascript">" "var a=1; if(a > 1)alert("Yes");else alert("No");"
 "</script>" "<div>" "lmnopq" "</font>" "rstuvwxyz")

But then, maybe this is a function of my lack of familiarity with xml-parse, because it has a very complex

interface. Preliminary tests that I made indicated that I could use 'xml-parse to process the separated tags, which is

another piece in my objective.
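
A hedged sketch of why xml-parse may return nil for input like the above: the fragment is not well-formed XML (for example, "<div>" is closed by "</font>"). The short strings below are made-up test inputs; xml-error is the built-in that reports the last xml-parse failure.

```newlisp
;; A well-formed fragment parses into a nested list.
(set 'ok (xml-parse "<div>lmnopq</div>"))

;; A fragment whose closing tag does not match is rejected with nil.
(set 'bad (xml-parse "<div>lmnopq</font>"))

;; After a failure, xml-error returns a description and a position.
(println "well-formed: " ok)
(println "mismatched:  " bad)
(println "last error:  " (xml-error))
```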

thanks

tim

unixtechie

#8
still there is much talk "about" the issue, but no specifications.

Tell using very short one-line examples what is input and what is the expected output - otherwise all talk is useless.



Supposing you got this as input:



<fieldset><legend><a href="javascript:;" onmousedown="toggleCombined('18');">
<font class='lnum'><i>(18)</i></font>&nbsp; Markup of code and documentation sections </a>&nbsp;<font class='lnum' size=-1><sub><i>(line 962)</i></sub></font> <font size=-2><i><a href='#tocancor'>toc</a></i></font><a name='18'></a></legend></fieldset>
<p>
<div id="18" style="display:none">  
<p>
    <b> <i> Markup </i> </b><br>
</div>
</fieldset>


What do you expect as "correct"  output for your task?

Please explain what you are expecting.

Tim Johnson

#9
Quote from: "unixtechie"
still there is much talk "about" the issue, but no specifications.

Tell using very short one-line examples what is input and what is the expected output - otherwise all talk is useless.

 

Supposing you got this as input:



<fieldset><legend><a href="javascript:;" onmousedown="toggleCombined('18');">
<font class='lnum'><i>(18)</i></font>&nbsp; Markup of code and documentation sections </a>&nbsp;<font class='lnum' size=-1><sub><i>(line 962)</i></sub></font> <font size=-2><i><a href='#tocancor'>toc</a></i></font><a name='18'></a></legend></fieldset>
<p>
<div id="18" style="display:none">  
<p>
    <b> <i> Markup </i> </b><br>
</div>
</fieldset>


What do you expect as "correct"  output for your task?

Please explain what you are expecting.

The output is as follows:
res ==> ("<fieldset>" "<legend>" "<a href="javascript:;" onmousedown="toggleCombined('18');">" " <font class='lnum'><i>(18)" "</i>" "</font>" "&nbsp; Markup of code and documentation sections " "</a>" "&nbsp;" "<font class='lnum' size=-1>" "<sub>" "<i>" "(line 962)" "</i>" "</sub>" "</font>" " " "<font size=-2>" "<i>" "<a href='#tocancor'>" "toc" "</a>" "</i>" "</font>" "<a name='18'>" "</a>" "</legend>" "</fieldset>" "<p>" "<div id="18" style="display:none">" "<p>" " " "<b>" " " "<i>" " Markup " "</i>" " " "</b>" "<br>" " " "</div>" " " "</fieldset>" "'")

And is correct to my specs.

Here is a shorter input example:

(set 'res (parse-markup {<form method ="POST" action="http://localhost/cgi-bin/render.lsp">Password:&nbsp;<input type="password" name="pwd"></form>}))

And here is the result, which is what I want:

res ==> ("<form method ="POST" action="http://localhost/cgi-bin/render.lsp">" "Password:&nbsp;" "<input type="password" name="pwd">" "</form>")

My original intent was to solicit comments on the correctness, efficiency and appropriate style of my code.

cormullion

#10
Presumably you can use (push chr buf -1) rather than (append ... Haven't checked but it might be OK.



Also, I think dostring has a built-in indexing - $idx - this might be usable and save you running your own counter.



That cond structure is deep - but why not!? :)
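
The two suggestions above can be sketched in isolation as follows. The function name `chars-with-index` is hypothetical and only demonstrates the idioms: push onto a string appends in place, and $idx holds the current offset inside dostring. The example assumes plain ASCII input.

```newlisp
;; Hypothetical demo: rebuild the string with push and record each offset.
(define (chars-with-index str)
  (let (buf "" res '())
    (dostring (c str)
      (push (char c) buf -1)                ; replaces (set 'buf (append buf chr))
      (push (list $idx (char c)) res -1))   ; $idx replaces a hand-kept counter
    (list buf res)))

(chars-with-index "a<b")
;; → ("a<b" ((0 "a") (1 "<") (2 "b")))
```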

Tim Johnson

#11
Quote from: "cormullion"
Presumably you can use (push chr buf -1) rather than (append ... Haven't checked but it might be OK.

Cool. Would save some 'set forms
Quote from: "cormullion"
Also, I think dostring has a built-in indexing - $idx - this might be usable and save you running your own counter.

Of course!
Quote from: "cormullion"
That cond structure is deep - but why not!? :)

Would there be another approach that you would recommend? (other than going so deep into 'cond)

Thanks.

-----------

I will implement your suggestions and look forward to your further comments.

Cheers

tim

Tim Johnson

#12
Maybe I should elaborate further on my original intent for this posting. I see that cormullion "gets it", but I fear that others may not:



First some history: As a web programmer, I have written a lot of modules in both python and rebol.

I need that same functionality from newlisp if I am going to step out with some serious web programming

assets in newlisp.



One of the modules that I have written in both rebol and python inputs an html document or a portion of an html document

and outputs a writable data structure. This module has many applications for me, my company and my clientele.



The starting point for such a module is to reduce this input to a list in which plain text is

separated from tags and individual tags are separated from each other. Anyone reading this should be

able to see the examples that I illustrated for unixtechie.



Rebol provides a native (part of the binary) function, called load/markup that does exactly what is described

in the previous paragraph. Python does not. Therefore I had to write my own function for that purpose.



In writing my first draft of this functionality in newlisp, I used my python code as a "prototype". In fact,

those of you who have been around this business for some years might remember when python was introduced

as a "prototyping" tool. However, the result is "pythonish" rather than "newlispish", i.e., it is not idiomatic to newlisp.



When cormullion introduces the suggestion of using $idx instead of a counter, he is pointing me in the

"newlispish" direction. Then, in turn, I can apply his suggestions to other newlisp code that I might write

and this becomes a valuable tutorial for me and hopefully is helpful to others.

How am I doing so far? Do you all understand what I am after? :)

thanks

tim