newLISP interface to Tidy

Started by rickyboy, June 16, 2007, 12:21:48 PM


rickyboy

Tidy (http://www.w3.org/People/Raggett/tidy/) is a nice program that cleans up malformed (or just plain ugly) HTML.  I've always wanted access to it in newLISP without having my newLISP script run it as a command-line program.  Well, today I finally saw that there is a library version of Tidy (called TidyLib).



Here is a first cut of a Tidy interface module for newLISP.  The interface code is based on the example C code given at http://tidy.sourceforge.net/libintro.html.


;;;; tidy.lsp -- A module to interface TidyLib
;;;; Author: Rick Hanson
;;;; Date: 17 June 2007

(context 'tidy)

;;;---------------------------------------------------------
;;;       U S E R    C O N F I G U R A T I O N
;;;
;;; Read the descriptions of the following two variables,
;;; and change as appropriate for your needs.

;; This is the location of your TidyLib shared library
;; On Macs it's called libtidy.dylib, on Win32 machines
;; it's called libtidy.dll, on the Penguin and Unices it's
;; called libtidy.so.

(define libtidy "/usr/lib/libtidy.dylib")

;; According to Lutz, you probably don't need to change this.
;; Change it to 64, ONLY IF you know your TidyLib (and probably
;; the rest of your system + newLISP) is LP64.

(define machine-address-size-in-bits 32)

;;;---------------------------------------------------------
;;;    B O I L E R P L A T E   C O D E   F O L L O W S
;;;
;;; (meaning that, if you're
;;;     (a) just a user of this module AND
;;;     (b) you're lucky,
;;; then you won't need to change the code below this line.)
;;; :-)

(import libtidy "tidyCreate")
(import libtidy "tidyOptSetBool")
(import libtidy "tidySetErrorBuffer")
(import libtidy "tidyParseString")
(import libtidy "tidyCleanAndRepair")
(import libtidy "tidyRunDiagnostics")
(import libtidy "tidySaveBuffer")
(import libtidy "tidyBufFree")
(import libtidy "tidyRelease")
(import libtidy "tidyReleaseDate")

(define machine-address-size-in-bytes
  (/ machine-address-size-in-bits 8))
(define size-of-u_int machine-address-size-in-bytes)
(define size-of-address-pointer machine-address-size-in-bytes)

(define tidy-release-date
  (let ((pd (parse (get-string (tidy:tidyReleaseDate))))
        (months '("Month0" "January" "February" "March" "April"
                  "May" "June" "July" "August" "September"
                  "October" "November" "December")))
    (if (= (length pd) 4)
        (date-value (int (pd 3)) (find (pd 2) months) (int (pd 0)))
      (date-value (int (pd 2)) (find (pd 1) months) (int (pd 0))))))

;;; Since TidyBuffer (in buffio.h) changed on 2006-12-29, this code
;;; checks to see if your TidyLib's release date is before or
;;; on-or-after this date, and tries to do the right thing.  This
;;; would all be easier if the Tidy developers used version numbers.
;;;
;;; The right thing is the setup of the following two variables:
;;;
;;;    empty-TidyBuffer: an allocation of enough space to account
;;;    for the size of a TidyBuffer.
;;;
;;;    bp-offset: the offset from the start of the TidyBuffer
;;;    struct to struct member `bp', where the TidyLib text output
;;;    is stored.

(let ((TidyBuffer-change-date (date-value 2006 12 29)))
  (cond
    ((< tidy-release-date TidyBuffer-change-date)
     ;; struct _TidyBuffer
     ;; {
     ;;     byte* bp;           /**< Pointer to bytes */
     ;;     uint  size;         /**< # bytes currently in use */
     ;;     uint  allocated;    /**< # bytes allocated */
     ;;     uint  next;         /**< Offset of current input position */
     ;; };
     (define empty-TidyBuffer
       (dup "\000" (+ size-of-address-pointer
                      (* 3 size-of-u_int))))
     (define bp-offset 0))
    (true
     ;; struct _TidyBuffer
     ;; {
     ;;     TidyAllocator* allocator;  /**< Memory allocator */
     ;;     byte* bp;           /**< Pointer to bytes */
     ;;     uint  size;         /**< # bytes currently in use */
     ;;     uint  allocated;    /**< # bytes allocated */
     ;;     uint  next;         /**< Offset of current input position */
     ;; };
     (define empty-TidyBuffer
       (dup "\000" (+ (* 2 size-of-address-pointer)
                      (* 3 size-of-u_int))))
     (define bp-offset size-of-address-pointer))))

;;; The following flags are recovered from tidyenum.h of
;;; TidyLib. (Fortunately, the developers did not change the enums
;;; -- the old ones should stay the same from version to version.)

(define TidyXmlOut 22)      ; Output XML.
(define TidyXhtmlOut 23)    ; Output extensible HTML.
(define TidyHtmlOut 24)     ; Output plain HTML, even for XHTML input.
(define TidyForceOutput 64) ; Output document even if errors were found.
(define no 0)
(define yes 1)

(define (tidy:tidy output-type input)
  (let ((output empty-TidyBuffer)
        (output-contents nil)
        (errbuf empty-TidyBuffer)
        (rc -1)
        (ok nil)
        (tdoc (tidyCreate)))
    (setq ok (tidyOptSetBool tdoc output-type yes))
    ;; 0 counts as true in newLISP, so compare against 0 explicitly.
    (if (!= ok 0) (setq rc (tidySetErrorBuffer tdoc errbuf)))
    (if (>= rc 0)
        (setq rc (tidyParseString tdoc input)))
    (if (>= rc 0)
        (setq rc (tidyCleanAndRepair tdoc)))
    (if (>= rc 0)
        (setq rc (tidyRunDiagnostics tdoc)))
    (if (> rc 1)
        (setq rc (if (not (= 0 (tidyOptSetBool tdoc
                                               TidyForceOutput
                                               yes)))
                     rc -1)))
    (if (>= rc 0)
        (setq rc (tidySaveBuffer tdoc output)))
    (if (>= rc 0)
        (setq output-contents
              (get-string
               (first
                (unpack "lu"
                  (bp-offset size-of-address-pointer output)))))
      (println (format "A severe error (%d) occurred.\n" rc)))
    (tidyBufFree output)
    (tidyBufFree errbuf)
    (tidyRelease tdoc)
    output-contents))

(define xml<- (curry tidy TidyXmlOut))
(define xhtml<- (curry tidy TidyXhtmlOut))
(define html<- (curry tidy TidyHtmlOut))

(context MAIN)
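
For anyone wondering how bp-offset and the unpack call in tidy:tidy fit together, here is a minimal sketch that does not call TidyLib at all. It builds a fake image of the newer TidyBuffer layout with pack and then pulls the bp field back out. The field values are made up, 32-bit fields are assumed, and explicit slice is used where the module uses implicit slicing.

```newlisp
;; Sketch only: a fake TidyBuffer image, not a real one from TidyLib.
;; Assumption: 32-bit pointers and uints (machine-address-size-in-bits = 32).
(set 'ptr-size 4)

;; Newer layout: allocator, bp, size, allocated, next.
;; The values are invented for illustration.
(set 'fake-buffer (pack "lu lu lu lu lu" 0 0x1234 10 64 0))

;; bp sits one pointer past the start of the struct, so take
;; ptr-size bytes at offset ptr-size and decode them as one uint.
(set 'bp (first (unpack "lu" (slice fake-buffer ptr-size ptr-size))))

(println (length fake-buffer))  ; 20 bytes = 2 pointers + 3 uints
(println (format "0x%x" bp))    ; the made-up bp value, 0x1234
```

In the module, the value recovered this way is a real address, which is why it is handed to get-string.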


Here's an example (assume that the above code is in file "tidy.lsp"):
> (load "tidy.lsp")
MAIN
> (print (tidy:xhtml<- "<title>Foo</title><p>Foo!"))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 1st December 2004), see www.w3.org" />
<title>Foo</title>
</head>
<body>
<p>Foo!</p>
</body>
</html>

Enjoy! (as Norman would say)

--Rick
(λx. x x) (λx. x x)

rickyboy

#1
Here's a real example (as opposed to the toy example I gave in the previous post).  Say you want to use newLISP to scrape HTML off the web.  It would be nice if all pages were in XHTML (i.e. XMLized HTML), but most are not.  So using xml-parse is not going to get you anywhere:
> (silent (setq nytimes (get-url "http://nytimes.com")))

> (setq nytimesx (xml-parse nytimes))
nil
:-(  Boo!  Hiss!



Enter tidy:
> (load "tidy.lsp")
MAIN
> (setq nytimesx (xml-parse (tidy:xhtml<- nytimes)))
... (bunch of output) ...

Yeah!  :-)

cormullion

#2
Good stuff. Thanks!



I think these tools should be in a library/contributions section of the installation... I don't know whether Tidy is pre-installed on all OSs, though, so perhaps a platform check would be a useful addition.
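
A platform check along those lines could be sketched as follows. The candidate paths are assumptions for illustration (only /usr/lib/libtidy.dylib appears in the module above), so adjust the list for your system.

```newlisp
;; Candidate locations are assumptions; adjust for your system.
(set 'candidates '("/usr/lib/libtidy.dylib"     ; Mac OS X
                   "/usr/lib/libtidy.so"        ; Linux and other Unices
                   "/usr/local/lib/libtidy.so"
                   "libtidy.dll"))              ; Win32, relative path

;; Index of the first candidate that exists, or nil if none does.
(set 'hit (find true (map file? candidates)))

(if hit
    (println "TidyLib found at " (candidates hit))
    (println "TidyLib not found on " ostype))
```

Inside the module, (candidates hit) would take the place of the hard-coded libtidy path.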

rickyboy

#3
Good point, cormullion.  Perhaps Lutz can tell us the best way to do this.  (TidyLib was already installed on my Mac when I bought it -- way cool.)



Another piece of "cheese" in my code is the portion that sets up machine-address-size-in-bits and friends.  Perhaps someone can tell us about a good way to do this also, so that our 64-bit friends can use the tidy module without changing the code any (or very little).



Adthanksvance  for any help!  --Rick

Lutz

#4
There have been several very useful and well crafted library modules on this board recently. I suggest that all of them go into a central place either on http://www.alh.net/newlisp/wiki or on http://newlisp-on-noodles.org or some other site.



They should not be part of the binary installers or the distribution. Every module in the distribution is maintained, tested and documented in a standard way. All software hosted on the main site and in the distribution packages has to be tested, maintained and kept up to date with every release. This takes considerable time, and all software in the distribution and on newlisp.org ends up being my responsibility.



newLISP-wiki is useful for small installations and websites, but I don't think it has enough features for the task at hand. Perhaps software from http://www.mediawiki.org/ is more appropriate, with its facilities for history keeping, discussion pages etc. newLISP code, being very compact, can be nicely published as text, and MediaWiki's history feature is a nice way to keep track of changes and bug fixes in modules. Installer links like (load "http://newlisp-super-wiki/get-all-packages.lsp") can be used for downloading and installing.



But I don't want to push a certain way of doing things here. The point is: to do it right, somebody has to take the time and responsibility to organize content, keep in contact with contributors, maintain, make backups, fight spammers etc.

 

So who will step up to the task :-?



Lutz

rickyboy

#5
Thanks Lutz!  I'm sure we'll figure something out.  I didn't envision the third-party library portion to be that formal anyway, but a good idea.



On a more specific (and smaller scale) note, what is the best way to handle this in newLISP parlance?
Quote from: "rickyboy"
Another piece of "cheese" in my code is the portion that sets up machine-address-size-in-bits and friends.  Perhaps someone can tell us about a good way to do this also, so that our 64-bit friends can use the tidy module without changing the code any (or very little).

Is there a way for the program to figure out, by itself, how big address sizes are that get passed back to it by a library?



Thanks!  --Rick

cormullion

#6
I think we should sort out something. There's lots of useful code floating about here, and the forum search doesn't seem to find it easily.  There's also the stuff in http://www.newlisp.org/downloads/newlisp-9.1.1-modex.tgz. This isn't downloaded with the binary installers - I didn't find it for ages since I don't do source code downloads much, and there are some cool things!



It would indeed be cool to have a single place where this stuff could be easily found. I'm happy to help as far as my skills allow! :-)

Lutz

#7
Quote: Is there a way for the program to figure out, by itself, how big address sizes are that get passed back to it by a library?


Virtually all MacOS X, LINUX and Win32 systems we deal with today are programmed in the ILP32 programming model:



http://www-128.ibm.com/developerworks/library/l-port64.html



The newLISP library import and execution routines (nl-import.c) push all values and pointers as 32-bit entities. For floats, two 32-bit values are pushed. 64-bit integers are coerced into 32-bit integers. To pass 64-bit integers one would have to push the lower and higher 32-bit portions separately.
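
As a rough illustration of that last point, the sketch below splits a 64-bit integer into its lower and higher 32-bit portions and joins them again. The function names split-int64 and join-int64 are made up for this example, and the order in which the two halves would have to be pushed for a given library call is not covered here.

```newlisp
;; Split a 64-bit integer into (low high) unsigned 32-bit halves.
;; split-int64 and join-int64 are hypothetical helper names.
(define (split-int64 n)
  (list (& n 0xFFFFFFFF)
        (& (>> n 32) 0xFFFFFFFF)))

;; Recombine the halves, to check the split.
(define (join-int64 low high)
  (| low (<< high 32)))

(set 'halves (split-int64 0x1122334455667788))
(println (format "low = 0x%x, high = 0x%x" (halves 0) (halves 1)))
(println (= 0x1122334455667788 (apply join-int64 halves)))
```

This prints low = 0x55667788, high = 0x11223344, followed by true.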



All values coming back are 32 bit.



Even though many computers today (e.g. PPC G5, Dual Core 2 Intel) have 64-bit CPUs, the programming model is still ILP32 with a stack managing 32-bit entities. In the LINUX world some people install LINUX-64, which expects an LP64 programming model, but most standard (e.g. Redhat) installations of LINUX-64 install two sets of libraries, one for ILP32 and the other for LP64, into two different library paths. Some of the minor distributions are completely messed up in this respect.



So currently you can use the normal newlisp binary on virtually any computer and expect 32-bit entities to be passed to and from library functions.



If you must deal with LP64 libraries, then there is a 64-bit flavor of newLISP for Linux, Solaris and MacOS X.
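
A module could try to tell the two flavors apart at load time. The sketch below relies on an assumption that should be checked against the manual of the newLISP version in use: that the last element of sys-info carries build flags and that 256 is added to it on 64-bit builds, as documented for later newLISP releases.

```newlisp
;; Assumption: the last sys-info element holds the OS constant plus
;; build flags, with 256 added on 64-bit builds. Verify this against
;; the manual for your newLISP version before relying on it.
(set 'flags (sys-info -1))

(set 'machine-address-size-in-bits
     (if (= 0 (& flags 256)) 32 64))

(println "address size: " machine-address-size-in-bits " bits")
```

If the flag is not available, the hand-set variable in the module's user configuration section remains the fallback.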



MacOS X on the new Intel machines has 2 sets of C libraries installed, one for ILP32, the other for LP64. There is a makefile_osxLP64 and others for LINUX-64 and Solaris-64 in the distribution.



There is a special issue with passing floating-point values (all floats in newLISP are IEEE 754 64-bit doubles) on the PPC architecture. They are not passed through the normal C stack; special registers on the PPC CPU are used instead. For this reason it is a problem to execute library routines that use floats from newLISP on the PPC. Because of this, the opengl-demo.lsp file from the source distribution works well on Intel machines but not on the older PPC architectures.



BTW, when you do modules for newLISP it is also worth reading this chapter

   http://newlisp.org/CodePatterns.html#extending



As everybody who has written library modules for newLISP knows, good C knowledge is important for this work. Although this part is not as easy to master as other areas of newLISP, the result of importing C libraries into newLISP is extremely fast and flexible.



Lutz

Dmi

#8
The tidy interface will be very helpful! Thanks Rick!



A package/site with all known libraries is a good idea!

The only thing about this is that all the "official" code issuers must be trusted.



Probably a good idea for now would be to have the libraries spread across personal sites and a page on a wiki with a general index.
WBR, Dmi

Lutz

#9
Quote: The only thing about this is that all the "official" code issuers must be trusted.


which is also a reason that somebody must monitor/administer such a site.


Quote: Probably a good idea for now would be to have the libraries spread across personal sites and a page on a wiki with a general index.


yes, and I will be happy to link to any of those sites. You don't even need your own site provider. As the http://newlisper.blogspot.com site shows, you can publish code perfectly well on a free Google blog account.



Or one could have a newlisp-index.blogspot.com site which just catalogs and indexes (and reviews ?) existing scripts.



Technically all this is not a problem. What we need is initiative ;-)



Lutz

Dmi

#10
Probably we need 2 new tags in newlispdoc: one for a brief module description and one for the module's original location.



Then the site maintainer's work would consist of maintaining the module list at the contributors' requests, plus the general script, which can vary quite a bit.



...moreover, we can leave out the script ;-) a list alone would be sufficient.
WBR, Dmi

cormullion

#11
We're all too nice - we need a dictator... :-)



What about this business of


(load "http://harmless-looking-url.com/nice-code.lsp")

This is probably not a good idea, is it?

cormullion

#12
Quote from: "Dmi"
Probably we need 2 new tags in newlispdoc: one for a brief module description and one for the module's original location.



Then the site maintainer's work would consist of maintaining the module list at the contributors' requests, plus the general script, which can vary quite a bit.



...moreover, we can leave out the script ;-) a list alone would be sufficient.


This seems more sensible!

Lutz

#13
Quote:

(load "http://harmless-looking-url.com/nice-code.lsp")



This is probably not a good idea, is it?


No more dangerous than any other download of a program from the internet. On the contrary, you can inspect the code right away by clicking on the link portion of the load statement. I have come to appreciate this method with nodep's and fanda's postings on this board. It is a quick way to try out the program without having to copy the source into an editor, save it etc.
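
For those who prefer to read the code before it runs, a more cautious variant could look like the sketch below. The URL and the local file name are hypothetical, chosen only for illustration.

```newlisp
;; Hypothetical URL and file name, for illustration only.
(set 'url "http://example.com/nice-code.lsp")
(set 'src (get-url url))

;; get-url reports failures as a string starting with "ERR:".
(if (starts-with src "ERR:")
    (println "Download failed: " src)
    (begin
      (println src)                     ; read the code first
      (write-file "nice-code.lsp" src)  ; keep a local copy
      (println "Saved; review it, then (load \"nice-code.lsp\")")))
```

Loading the saved local copy, instead of the URL, also guarantees that the code you run is the code you reviewed.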



It boils down to what Dmi said:


Quote: The only thing about this is that all the "official" code issuers must be trusted.


Quote: Probably we need 2 new tags in newlispdoc: one for a brief module description and one for the module's original location.


yes, this seems to be a good idea. I will put these tags in the next version:



@description A one line module description goes here

@location http://asite.com/somefile.lsp



I will also add a -http option to newlispdoc:


newlispdoc -http file-with-urls.txt

A simple script that can assemble, from a collection of URLs, a nice module index and documentation plus highlighted source from remotely stored programs. As most of the tags in newlispdoc are optional, even utilities and games can be indexed this way. The only thing the maintainer has to do is run the script and verify the trustworthiness of the URLs. I love this idea!
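
As an illustration, a module header using the two proposed tags alongside the usual newlispdoc tags might look something like this. The location URL is a placeholder, and the parameter name and descriptions are assumptions written for this example, not taken from the tidy module itself.

```newlisp
;; @module tidy.lsp
;; @description A module to interface TidyLib
;; @location http://example.com/tidy.lsp
;; @author Rick Hanson

;; @syntax (tidy:xhtml<- <str-html>)
;; @param <str-html> The HTML text to clean up.
;; @return The cleaned-up document as XHTML text, or nil on a severe error.
;; @example
;; (tidy:xhtml<- "<title>Foo</title><p>Foo!")
```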



Lutz

rickyboy

#14
I like this idea too!



BTW, you can put me on your "don't trust" list -- I just changed the tidy module code above (rather than post another message).  Turns out that I was running an old version of TidyLib and the current version has TidyBuffer (the struct that stores the Tidyfied output) in a different format.  :-(  The code I edited above should fix the problem though, i.e. it should work for old and new TidyLibs.  I tested it on both my Mac's original version (release date 1 Dec 2004) and a CVS version I just pulled down today and built (release date 14 Jun 2007).



Let me know if you have any problems with it -- just post a question/issue on this site or PM me on this site.  Better yet, let me know if you can use it successfully.  Good news is always better than bad.  :-)



Cheers, --Rick