A Proposal for a new 'expr2xml' function

Started by rickyboy, April 27, 2006, 03:12:11 AM


rickyboy

Recall the function 'expr2xml' from http://newlisp.org/index.cgi?page=S-expressions_to_XML:
;; translate s-expr to XML
;;
(define (expr2xml expr level)
 (cond
   ((or (atom? expr) (quote? expr))
       (print (dup "  " level))
       (println expr))
   ((list? (first expr))
       (expr2xml (first expr) (+ level 1))
       (dolist (s (rest expr)) (expr2xml s (+ level 1))))
   ((symbol? (first expr))
       (print (dup "  " level))
       (println "<" (first expr) ">")
       (dolist (s (rest expr)) (expr2xml s (+ level 1)))
       (print (dup "  " level))
       (println "</" (first expr) ">"))
   (true
      (print (dup "  " level))
      (println "<error>" (string expr) "</error>"))
 ))

It's a beautiful exposition of how you can do some powerful programming in (new)Lisp.  But we need an "ugly" version :-) that handles element attributes and childless elements.  I propose the following.
(context 'SXML)

(define (element? maybe-element)
  (and (list? maybe-element)
       (> (length maybe-element) 0)
       (symbol? (maybe-element 0))))

(define (has-attrs? maybe-element)
  (and (SXML:element? maybe-element)
       (> (length maybe-element) 1)
       (list? (maybe-element 1))
       (= '@ (maybe-element 1 0))))

(define (get-attr-string maybe-element)
  (if (SXML:has-attrs? maybe-element)
      (let ((attr-alist (1 (maybe-element 1))))
        (join (map (lambda (attr-pair)
                     (string (attr-pair 0) "="
                             "\"" (attr-pair 1) "\""))
                   attr-alist)
              " "))
    ""))

(define (return-sans-attrs maybe-element)
  (if (SXML:has-attrs? maybe-element)
      (pop maybe-element 1))
  maybe-element)

(define (childless? maybe-element)
  (= 1 (length (SXML:return-sans-attrs maybe-element))))

(define (get-children element)
  (if (SXML:has-attrs? element) (2 element) (1 element)))

(define (name-in-MAIN symbul)
  (if (starts-with (string symbul) "MAIN:")
      (name symbul)
    (string symbul)))

;; The following function is modified from 'expr2xml' in:
;;   http://newlisp.org/index.cgi?page=S-expressions_to_XML
(define (print-xml sxml level)
  (let ((level (or level 0)))
    (cond ((or (atom? sxml) (quote? sxml))
           (print (dup "  " level))
           (println sxml))
          ((list? (first sxml))
           (dolist (s sxml) (print-xml s (+ level 1))))
          ((symbol? (first sxml))
           (let ((attr-string (SXML:get-attr-string sxml))
                 (tag-name (SXML:name-in-MAIN (sxml 0))))
             (print (dup "  " level))
             (println "<" tag-name
                      (if (= attr-string "") "" " ")
                      attr-string
                      (if (SXML:childless? sxml) "/" "")
                      ">")
             (unless (SXML:childless? sxml)
               (let ((kids (SXML:get-children sxml)))
                 (dolist (k kids) (print-xml k (+ level 1)))
                 (print (dup "  " level))
                 (println "</" tag-name ">")))))
          (true
           (print (dup "  " level))
           (println "<error>" (string sxml) "</error>")))))

(context MAIN)

And the usage is something like:
(SXML:print-xml sxml)
where 'sxml' is some SXML expression, e.g. '(html (body (p "Hello, World!")))'.



Please let me know what you think.  I sometimes get "wrapped around the axle" during coding time, so I may have done things in a less than stellar way.
(λx. x x) (λx. x x)

Lutz

#1
No need to prefix functions, which are defined inside the SXML context, when using them inside. But it doesn't do any harm either.



Inside SXML you just can say: element?, has-attrs? etc., without the SXML: prefix. You only would need to prefix if you define the default function SXML:SXML or when overwriting built-in functions, i.e. SXML:println for overwriting println etc.
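To illustrate the point, here is a minimal sketch. The context name DEMO and the helper 'tag-of' are made up for this example and are not part of the SXML code above.

```newlisp
;; Hypothetical context DEMO, for illustration only.
(context 'DEMO)

(define (element? x)
  (and (list? x)
       (> (length x) 0)
       (symbol? (x 0))))

;; Calls element? without the DEMO: prefix, since we are inside DEMO.
(define (tag-of x)
  (if (element? x) (x 0) nil))

(context MAIN)

;; From MAIN, the prefix is required.
(println (DEMO:element? '(p "hi")))
(println (DEMO:tag-of '(p "hi")))
```

Inside DEMO the unprefixed 'element?' resolves to DEMO:element? on its own; only callers outside the context have to spell out the prefix.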



Else I cannot find anything wrong with it ;) . It is a natural extension of the 'symbol?' part of the original function.



Lutz

noah

#2
Hi, Rickyboy.



Your additions to Lutz's expr2xml function can make it possible for me to roundtrip xml into and out of newLisp data structures. With that, I can take advantage of the scripting capabilities of newLisp, while still being able to filter xml with a feature-complete XSLT processor (saxon) accessible in my shell.



There may be a bug in the output of your XSLT test case into your print-xml function. I got the following when running your script on my XP machine from my copy of win32 8.8.0 newLisp:




<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="1.0">
    <body>
      <table>
        <xsl:for-each select="/persons/person">
          <tr>
            <td>
              <xsl:value-of select="first-name"/>
            </td>
            <td>
              <xsl:value-of select="last-name"/>
            </td>
            <td>
              <xsl:value-of select="phone"/>
            </td>
          </tr>
        </xsl:for-each>
      </table>
    </body>
  </html>
">"


The extra ending angle-bracket shows up with more general HTML code, as well. Given the following s-expression:


(setq s-expression '(html
  (body
    (table
      (@ (bgcolor "black"))
        (tr
          (td "hi, this is a simple test case" (hr) "With some more content." (hr (@ (width "1")))
)))))
      )


(context 'SXML)

;;rickyboy's print-xml function definition with subfunction definitions

(context 'MAIN)

(SXML:print-xml s-expression 0)


It creates output:


<html>
  <body>
    <table MAIN:bgcolor="black">
      <tr>
        <td>
          hi, this is a simple test case
          <hr/>
          With some more content.
          <hr MAIN:width="1"/>
        </td>
      </tr>
    </table>
  </body>
</html>
">"


Here, both the ending bracket and the 'MAIN' prefixes are probably bugs.



In addition to handling basic xml, a complete round-tripping needs to handle:



* processing instructions (for example, <?xml-stylesheet href="mystyle.css" type="text/css"?> appearing at the root level of the xml document)



* the xml declaration (For example, <?xml version="1.0" encoding="UTF-8"?> appearing (always) at the top of the document)



* xml content according to the encoding declared in the xml header (for example, the UTF-8 declaration in the example above. Other content encodings include ISO-8859-1, windows-1252, et cetera.)



* namespace declarations (for example, the prefix associated with an imaginary namespace declared for an xmlized version of newlisp code: xmlns:lsp="http://www.newlisp.org/2006/CODE/")



* mixed namespace associations with declarations (For example,
<html xmlns:lsp="http://www.newlisp.org/2006/code/" xmlns="http://www.w3.org/1999/xhtml/"><head><title>newLisp powering CGI</title></head><body>

<h1><lsp:println>Hello, World</lsp:println></h1><p>The above is an example of embedded newLisp.</p></body></html>
.)



Thank you for your efforts! You've made it easier for someone to write s-expressions rather than xml. I absolutely agree that your code is a useful extension to Lutz's expr2xml and let's hope that it makes its way into the newlisp code snippets.



-Noah

noah

#3
Hi, Rickyboy.



Your additions to Lutz's expr2xml function can make it possible for me to roundtrip xml into and out of newLisp data structures. With that, I can take advantage of the scripting capabilities of newLisp, while still being able to filter xml with a feature-complete XSLT processor (saxon) accessible in my shell.



There may be a bug in the output of your XSLT test case into your print-xml function. I got the following when running your script on my XP machine from my copy of win32 8.8.0 newLisp:




<html xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xsl:version="1.0">
    <body>
      <table>
        <xsl:for-each select="/persons/person">
          <tr>
            <td>
              <xsl:value-of select="first-name"/>
            </td>
            <td>
              <xsl:value-of select="last-name"/>
            </td>
            <td>
              <xsl:value-of select="phone"/>
            </td>
          </tr>
        </xsl:for-each>
      </table>
    </body>
  </html>
">"


The extra ending angle-bracket shows up with more general HTML code, as well. Given the following s-expression:


(setq s-expression '(html
  (body
    (table
      (@ (bgcolor "black"))
        (tr
          (td "hi, this is a simple test case" (hr) "With some more content." (hr (@ (width "1")))
)))))
      )


(context 'SXML)

;;rickyboy's print-xml function definition with subfunction definitions

(context 'MAIN)

(SXML:print-xml s-expression 0)


It creates output:


<html>
  <body>
    <table MAIN:bgcolor="black">
      <tr>
        <td>
          hi, this is a simple test case
          <hr/>
          With some more content.
          <hr MAIN:width="1"/>
        </td>
      </tr>
    </table>
  </body>
</html>
">"


Here, both the ending bracket and the 'MAIN' prefixes are probably bugs.



In addition to handling basic xml, a complete round-tripping needs to handle:



* processing instructions (for example, <?xml-stylesheet href="mystyle.css" type="text/css"?> appearing at the root level of the xml document)



* the xml declaration (For example, <?xml version="1.0" encoding="UTF-8"?> appearing (always) at the top of the document)



* xml content according to the encoding declared in the xml header (for example, the UTF-8 declaration in the example above. Other content encodings include ISO-8859-1, windows-1252, et cetera.)



* namespace declarations (for example, the prefix associated with an imaginary namespace declared for an xmlized version of newlisp code: xmlns:lsp="http://www.newlisp.org/2006/template/")



* mixed namespace associations with declarations (For example,
<?xml version="1.0" encoding="windows-1252"?>
<?xml-stylesheet href="mystyle.css" type="text/css"?>
<!--xml shows its guns-->
<html
xmlns:lsp="http://www.newlisp.org/2006/template/"
xmlns="http://www.w3.org/1999/xhtml/"
xml:base="http://newlisp.org/"
xmlns:xlink="http://www.w3.org/1999/xlink">
<head>
<title>newLisp powering CGI</title>
</head>
<body>
<h1><lsp:print>Hello, World</lsp:print></h1>
<p>The above is an example of <a xlink:type="simple" xlink:href="xml-roundtripping.xml">embedded newLisp</a>.</p>
</body>
</html>
.)



You can imagine s-expressions embedded in xmlized newlisp expressions, all output as either an s-expression or an xml document.  What a mess! Right now, though, I really wish I could do it, along with roundtripping xml into and out of newlisp.



-Noah

rickyboy

#4
Hello Noah!



Thank you so much for the feedback.  Please feel free to correct or expand that code.  I think it would be wonderful to write it together as a community (hey I schlepped this code from somebody else.  Lutz?).  I enjoy very much this kind of exchange.  Besides, I am no great shakes at writing production code.  ;-)



I believe that I can, in some feeble fashion, explain the "stray '>'."  The ">" at the end of the interpreter output is the actual value returned by the 'print-xml' function call.  That is,
(SXML:print-xml s-expression 0)
evaluates to
">"
only.  The lines which occur above it in the output are the side-effect of displaying to the screen (stdout) via all the 'print(ln)' calls in the function.  (The value of the function being ">" is due to the value of the last expression the function evaluates before it drops out -- I think that's the '(println "</" tag-name ">")' form and the value of that is the last string argument to it, namely the ">".)
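The same effect can be reproduced in isolation. This is a small sketch; 'show-tag' is a made-up helper, and the return value is the one described above (the last argument given to 'println').

```newlisp
;; show-tag is a made-up helper for this illustration.
(define (show-tag tag-name)
  (println "<" tag-name ">"))

;; The call prints <p> as a side effect; its value is the
;; last argument handed to println, i.e. the string ">".
(set 'result (show-tag "p"))
(println "value returned: " result)
```

At the interactive prompt the interpreter echoes that return value after the printed lines, which is where the trailing ">" comes from.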



But anybody could change 'print-xml' to output the XML as a string that the function returns (instead of ">") and never side-effect 'print' anything.  (If so, I'd suggest renaming the function to something like 'sxml->xml', since it would no longer do the 'print'ing suggested by the name 'print-xml'.)  I like this solution.   Then it is easier/cleaner to re-direct this string output to a file.
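As a rough sketch of why a string-returning function is easier to redirect: the helper 'tag->string' and the file name "hello.xml" below are invented for the example, while 'write-file' and 'read-file' are the built-in file functions.

```newlisp
;; tag->string is a hypothetical helper: it builds markup as a
;; string and prints nothing.
(define (tag->string tag-name text)
  (append "<" tag-name ">" text "</" tag-name ">\n"))

;; Because the result is an ordinary string, sending it to a file
;; is a single call to write-file.
(write-file "hello.xml" (tag->string "p" "Hello, World!"))

;; Read the file back to check what was written.
(print (read-file "hello.xml"))
```

The caller decides where the text goes (screen, file, another function), instead of the function deciding for it.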



The MAIN prefixes are annoying, aren't they.  Yes, this is a bug in my code, and it looks like we need to beef up the capability of the 'get-attr-string' function to fix this.



The other things you mentioned are all worthy of being addressed.  I am no expert (as I said before) with XML, but I'd be willing to give it a try.  What motivated me to write this bit of code in the first place was Lutz's post (http://www.alh.net/newlisp/phpbb/viewtopic.php?t=1067#6162), wherein he explains how clean it is to use certain SXML expressions as actual function calls to do the XSLT-type processing.  I thought that was cool.

rickyboy

#5
Quote from: "rickyboy"
The MAIN prefixes are annoying, aren't they.  Yes, this is a bug in my code, and it looks like we need to beef up the capability of the 'get-attr-string' function to fix this.


Just as I thought -- 'get-attr-string' is the culprit:
> (s-expression '(1 1))
(table (@ (bgcolor "black")) (tr (td "hi, this is a simple test case"
   (hr) "With some more content."
   (hr (@ (width "1"))))))
> (SXML:get-attr-string (s-expression '(1 1)))
"MAIN:bgcolor=\"black\""


This new version of 'get-attr-string' has a call to 'name-in-MAIN', judiciously placed, which fixes this problem:
(define (get-attr-string maybe-element)
  (if (SXML:has-attrs? maybe-element)
      (let ((attr-alist (1 (maybe-element 1))))
        (join (map (lambda (attr-pair)
                     (string (name-in-MAIN (attr-pair 0))
                             "=\"" (attr-pair 1) "\""))
                   attr-alist)
              " "))
    ""))

> (SXML:get-attr-string (s-expression '(1 1)))
"bgcolor=\"black\""
> (SXML:get-attr-string '(table (@ (my:attr "my-value"))))
"my:attr=\"my-value\""


Now 'print-xml' yields:
> (SXML:print-xml s-expression)
<html>
  <body>
    <table bgcolor="black">
      <tr>
        <td>
          hi, this is a simple test case
          <hr/>
          With some more content.
          <hr width="1"/>
        </td>
      </tr>
    </table>
  </body>
</html>
">"

rickyboy

#6
Quote from: "rickyboy"
But anybody could change 'print-xml' to output the XML as a string that the function returns (instead of ">") and never side-effect 'print' anything.  (If so, I'd suggest renaming the function to something like 'sxml->xml', since it would no longer do the 'print'ing suggested by the name 'print-xml'.)


Here is the string-only, non-side-effecting analogue of 'print-xml' -- as promised, I call it 'sxml->xml':
(define (sxml->xml sxml level)
  (let ((level (or level 0)))
    (cond ((or (atom? sxml) (quote? sxml))
           (append (dup "  " level) (string sxml) "\n"))
          ((list? (first sxml))
           (mappend (fn (s) (sxml->xml s (+ level 1))) sxml))
          ((symbol? (first sxml))
           (let ((attr-string (SXML:get-attr-string sxml))
                 (tag-name (SXML:name-in-MAIN (sxml 0))))
             (append (dup "  " level)
               "<" tag-name
               (if (= attr-string "") "" " ")
               attr-string
               (if (SXML:childless? sxml)
                   "/>\n"
                 (append ">\n"
                   (mappend (fn (kid) (sxml->xml kid (+ level 1)))
                            (SXML:get-children sxml))
                   (dup "  " level)
                   "</" tag-name ">\n")))))
          (true
           (append (dup "  " level)
             "<error>" (string sxml) "</error>\n")))))

Here's a quick comparison test:
> (SXML:print-xml s-expression)
<html>
  <body>
    <table bgcolor="black">
      <tr>
        <td>
          hi, this is a simple test case
          <hr/>
          With some more content.
          <hr width="1"/>
        </td>
      </tr>
    </table>
  </body>
</html>
">"
> (silent (print (SXML:sxml->xml s-expression)))
<html>
  <body>
    <table bgcolor="black">
      <tr>
        <td>
          hi, this is a simple test case
          <hr/>
          With some more content.
          <hr width="1"/>
        </td>
      </tr>
    </table>
  </body>
</html>

Here's the string that 'sxml->xml' evaluates to (for our input):
> (SXML:sxml->xml s-expression)
"<html>\n  <body>\n    <table bgcolor=\"black\">\n      <tr>\n        <td>\n          hi, this is a simple test case\n          <hr/>\n          With some more content.\n          <hr width=\"1\"/>\n        </td>\n      </tr>\n    </table>\n  </body>\n</html>\n"

That's why I had to use '(silent (print ...))' earlier for the "eyeball test."  :-)

rickyboy

#7
Oops!  I almost forgot ...



For 'sxml->xml' you're going to need the following:
(define (mappend) (apply append (apply map (args))))
(constant (global 'mappend))

'mappend' is a handy utility from the Common Lisp world which we newLISPers should have in our toolbox also.  Whenever you find yourself saying '(apply append (map ...))', just say '(mappend ...)' instead.  --Ricky
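A couple of quick examples of what 'mappend' does, using the definition above. The lambdas and sample data are invented for illustration.

```newlisp
(define (mappend) (apply append (apply map (args))))

;; Lists: each call returns a list, and the results are spliced together.
(println (mappend (fn (x) (list x x)) '(1 2 3)))

;; Strings: append also concatenates strings, which is what
;; sxml->xml relies on.
(println (mappend (fn (s) (append "<" s "/>")) '("hr" "br")))
```

The first call yields one flat list rather than a list of lists; the second yields a single string.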

noah

#8
Hi, Rickyboy.




Quote from: "rickyboy"
Thank you so much for the feedback. Please feel free to correct or expand that code. I think it would be wonderful to write it together as a community (hey I schlepped this code from somebody else. Lutz?). I enjoy very much this kind of exchange. Besides, I am no great shakes at writing production code. ;-)


Please excuse the double-posted feedback. Your response to it was more than generous: thank you! I've no delusions about my stature and ability as a newlisp programmer, or any kind of programmer. It's important to time code, from what I've read. Other than that, as soon as you posted your code, it became production code. I at least, plan on using it.



There are two things I'd like to do with your code.



1. allow roundtripping that preserves namespaces and additional types of xml data. Types of xml data to preserve include:



*xml declarations

*processing instructions

*doctype declarations

*xml comments

*CDATA sections

*general entities

*namespace identifications

*namespace-prefix associations

*embedded dtd content



2. allow roundtripping that preserves contexts and additional types of newlisp data. Types of newlisp data to preserve include:



*symbols

*lists

*contexts

*comments

*operators

*implicit indexing





Lutz's server code creates embedded newlisp inside <%,%> tags. That's great, and good enough for many uses that I could put XSLT to as well.  The second goal, if accomplished, provides a couple advantages over asp-style embedded content:



*you can embed xml directly inside your embedded newlisp code, without quoting it. You can even use existing xml tools (xquery, xslt) on what your embedded newlisp manipulates before you evaluate the newlisp.



*you don't have to use quotes to designate values when you write your newlisp code.



Whatever work I accomplish wrt goal 1 I'll certainly share on this board. If there's interest in work on the 2nd goal, I'll share that as well. I suspect that most newlispers will not welcome an xmlized newlisp, but the 2nd goal might actually be easier, thanks to xml-parse.  When I produce additions to your production-ready (IMO) sxml->xml code, I'll share them here. If you continue to work on it, great!



-Noah

rickyboy

#9
Hello, Noah!


Quote from: "noah"
1. allow roundtripping that preserves namespaces and additional types of xml data. Types of xml data to preserve include:



...

*processing instructions

*doctype declarations

...

I'm not really sure about the others on the list above (I have to crack open an XML book or two to remember what they are), but processing instructions and doctype declarations are definitely skipped by 'xml-parse', as you probably well know.



For the roundtripping to work on this side, we need to beg Lutz to change the 'xml-parse' processing to start emitting some sxml for processing instructions and doctype declarations.  Right now, it looks like it's not too big of a problem to change it to emit something.  Currently, for the code to skip processing instructions and doctype declarations, it has to know their syntax, and Lutz already has this code in place, cf. functions 'parseProcessingInstruction()' and 'parseDTD()', both in 'nl-xml.c'.  The punchlines of these functions look like 'source = source + ...' which effectively skip over these components, but the rest of the statements in the function are basically a validation routine.



The question now is: what should the sxml look like for processing instructions and doctype declarations?  Like this?
((?xml (@ (version "1.0") (encoding "windows-1252")))
 (?xml-stylesheet (@ (href "mystyle.css") (type "text/css")))
 (!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd")
 (html (@ (xmlns:lsp "http://www.newlisp.org/2006/template/")
          (xmlns "http://www.w3.org/1999/xhtml/")
          (xml:base "http://newlisp.org/")
          (xmlns:xlink "http://www.w3.org/1999/xlink"))
   (head (title "newLisp powering CGI"))
   (body (h1 (lsp:print "Hello, World"))
         (p "The above is an example of "
            (a (@ (xlink:type "simple")
                  (xlink:href "xml-roundtripping.xml"))
              "embedded newLisp")
            "."))))

Hmm...  I wonder if Lutz is up for it.
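If the sxml did look like that, turning a processing-instruction node back into text could be sketched roughly as follows. This is only a guess at one possible shape: 'pi->string' is a hypothetical helper, 'xml-parse' does not emit such nodes today, and the code assumes it runs in MAIN so that 'string' yields symbol names without a context prefix.

```newlisp
;; pi->string is a hypothetical helper, not part of xml-parse or the
;; SXML code above. It assumes a node shaped like
;;   (?target (@ (attr-name "value") ...))
(define (pi->string node)
  (let ((target (string (node 0)))      ; e.g. "?xml"
        (attrs (1 (node 1))))           ; drop the leading @
    (append "<" target " "
            (join (map (fn (pair) (string (pair 0) "=\"" (pair 1) "\""))
                       attrs)
                  " ")
            "?>")))

(println (pi->string '(?xml (@ (version "1.0") (encoding "windows-1252")))))
(println (pi->string '(?xml-stylesheet (@ (href "mystyle.css") (type "text/css")))))
```

The '!DOCTYPE' form has no attribute list, so it would need its own small printer; that part is left out here.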



I'll "talk" to you tomorrow (later today, that is) as it's after midnight here and I need to get some sleep.

noah

#10
Hi, Rickyboy.



The question of how the imported SXML looks is important. To answer that question, we might need to answer the following:



* do you want to modify the content or structure of every section of the imported s-expression? (for example, a <![CDATA[...]]> section usually contains xml mark-up that an xml processor treats as a text node, non-xml content.)



* will imported namespace prefixes conflict with existing context prefixes needed to process the imported SXML?



* do you want to preserve xml nodes that use the reserved prefix "xml:" during roundtripping? (for example, xml:base.)



A possible problem with including processing instructions, et cetera, is keeping them within the scope of the document. Your example surrounds the whole s-expression with a set of parentheses, and that might be enough to solve this problem.



However, a newlisp function that operates on only part of an imported sxml document's s-expression might need to know the following:

* that the s-expression is actually imported xml, not arbitrary s-expressions.

* what its associated DOCTYPE is, if any.

* what its associated namespaces are, if any.



You could move a couple parentheses from the xml and DOCTYPE declaration to the end of the s-expression:


((?xml (@ (version "1.0") (encoding "windows-1252"))
 (?xml-stylesheet (@ (href "mystyle.css") (type "text/css")))
 (!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
 (html (@ (xmlns:lsp "http://www.newlisp.org/2006/template/")
          (xmlns "http://www.w3.org/1999/xhtml/")
          (xml:base "http://newlisp.org/")
          (xmlns:xlink "http://www.w3.org/1999/xlink"))
   (head (title "newLisp powering CGI"))
   (body (h1 (lsp:print "Hello, World"))
         (p "The above is an example of "
            (a (@ (xlink:type "simple")
                  (xlink:href "xml-roundtripping.xml"))
              "embedded newLisp")
            "."))))
))


Not much difference, really, but maybe worth doing anyway.



Preserving namespaces might be a trickier issue during roundtripping. In newLisp, you identify a namespace with a prefix. In xml, you can choose whatever prefix for all tags that you want to associate with that namespace, so long as the tags are within the scope of that namespace-prefix association declaration.



For example, take xmlns:xlink="http://www.w3.org/1999/xlink" in the code sample above. You could use xl or lk rather than xlink, or whatever prefix, so long as you identify a namespace that the processor recognizes.



This is when I start wishing for newlisp to allow you to associate context prefixes with namespace declarations. Like:




(load "your_lib.lsp" "some_other_lib.lsp")

(context 'foo "your_foo_namespace")
(foo:that_func "an argument")

(context 'foo "some_other_foo_namespace")
(foo:that_func "another argument")



This works even if the 'foo prefix was used in one or both of the loaded newlisp libraries. The loaded libraries came with namespaces that you were aware of before you loaded them. I wonder what the newLisp alternative is to this.
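One possible approximation, sketched under assumptions: a variable that holds a context can, as far as I understand newLISP's context variables, be used as the prefix in a call. The contexts LibA and LibB and the functions 'greet' and 'call-greet' are invented for this example, and this is not the two-argument 'context' form wished for above.

```newlisp
;; Two made-up libraries that both define a function named greet.
(context 'LibA)
(define (greet who) (append "A says hello to " who))
(context MAIN)

(context 'LibB)
(define (greet who) (append "B says hello to " who))
(context MAIN)

;; ctx is an ordinary parameter that receives a context at run time;
;; ctx:greet then refers to greet in whichever context was passed in.
(define (call-greet ctx who)
  (ctx:greet who))

(println (call-greet LibA "Noah"))
(println (call-greet LibB "Noah"))
```

The prefix 'ctx' plays the role of 'foo' in the wish above: the same name points at different contexts depending on what was bound to it.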



-Noah