a standard way to parse the newlisp manual

Started by Tim Johnson, October 30, 2006, 05:13:03 PM


Tim Johnson

Hi:

Lutz and I have been having a conversation off the forum and we've

decided that it is worth sharing.

FYI: I have written an application in elisp to create a newlisp syntax mode

for emacs. Part of this project was to provide on-demand documentation

from a keyword context in three different forms:

1) Verbose documentation in a popup window.

2) Verbose documentation in a temporary buffer.

3) One-liner description of the interface in the echo area.



To do this I had to make some modifications to newlisp_manual.html.

Here is what I wrote to Lutz:

"""

Lutz said:

> Regarding the doc strings: it should not be very difficult to write a script

> for extracting them from newlisp_manual.html, which strictly uses a limited

> HTML subset. You also could just use 'lynx -dump newlisp_manual.html' to

> generate a pure text file, which then is even easier to parse.



I said:

  Actually, I've used that method. But I couldn't really find a

  pattern to use as a delimiter, so I manually inserted some

  markup - specifically a pseudo-tag <break>.



  Do you have any ideas for a search pattern?



  I think that, given some time (:-) and guidance),



  I could come up with a method that generated a file that could

  be used as an intermediary for any number of editors. Certainly

  emacs and vim, and for emacs a text file parsed by emacs itself

  would get around some of the problems with escaping.



  Alternatively here's an approach that would use the website and

  the documentation directly:



  Rebol has a feature called load/markup that auto-parses a document or

  a string into an array of alternating tag and string datatypes -  all

  it takes is:



  load/markup http://www.newlisp.org/downloads/newlisp_manual.html



  from the command line.



  Anything like that in newlisp?



  Maybe you would like to take this discussion to the forum?

  :-) I've got the time.

"""

Lutz replied again:

"""

finding some kind of standard way to parse the newlisp_manual.html into

pieces sounds like a wonderful idea. This idea should definitely be brought

to the discussion group.



Some short newLISP script using regular expressions should do the thing. As

you mentioned, perhaps some additions/changes to the manual will facilitate

it further.



Mention this idea on the discussion group. There are several people

experimenting with newLISP development environments based on emacs, vi, gtk,

etc. They all could benefit from a method to quickly extract relevant help

from the manual.



"""

I'll add a couple of other thoughts:

1) There is one situation where keyword documentation is combined, and that is for the arithmetic operators.

2) One should be thinking about ways to include documentation for

   user libraries, third-party contributions, etc.



I'm sure there are many other ideas.

thanks

tim
Programmer since 1987. Unix environment.

cormullion

#1
Sounds good. I've had some skirmishes with this myself. For the TextMate bundle I tinkered with, I wrote something like this:


; load the whole manual. Gulp.

(set 'file (read-file  "/usr/share/newlisp/doc/newlisp_manual.html"))

; we're looking for the selected text

(set 'func-name "atan2")

; find the matching bit with regex

(set 'doc-section (find (string {(<h2><span>)(} func-name {)(</span></h2>)(.*?)(<h2><span>)} ) file 4))

; found it, output it to Show as HTML

(if doc-section
   (println  $1 $2 $3 $4)
   (println "couldn't find it"))
(exit)


TextMate has a nice HTML window available for online documents, so no need to strip out the markup.



It's obviously a sledgehammer approach. I'm looking forward to seeing the scalpel version.


Tim Johnson

#2
<GRIN> That easy huh?

I don't grok it all, but you have a pattern to parse on, right?

cormullion

#3
It was looking for five stretches of text:



<h2><span>

atan2

</span></h2>

.*?

<h2><span>



and giving you back the first four. The fifth pattern was the start of the next function so unwanted.
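
For anyone following along, here is a minimal sketch of how those five groups behave. The sample text and the `lookup` helper are made up for illustration; the real manual's markup may differ.

```newlisp
; a tiny stand-in for the manual (made-up sample text)
(set 'sample (string
  "<h2><span>atan2</span></h2>\n<p>Arctangent of y/x.</p>\n"
  "<h2><span>begin</span></h2>\n<p>Groups expressions.</p>\n"))

; lookup is a hypothetical helper wrapping the same five-group pattern
; option 4 is PCRE_DOTALL, so . also matches newlines
(define (lookup func-name text)
  (if (find (string {(<h2><span>)(} func-name {)(</span></h2>)(.*?)(<h2><span>)}) text 4)
      (list $2 $4)   ; $2 is the function name, $4 the text up to the next heading
      nil))

(println (lookup "atan2" sample))
(println (lookup "begin" sample))
```

Note that the second call finds nothing: the last entry has no following `<h2><span>` for the fifth group to match, so this pattern as written will probably miss the final function in the file.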



In fact, looking at the manual again, the source text is different now, there's a span class="function" to cater for now.



I dunno, I'm no regex wiz...

Tim Johnson

#4
Quote from: "cormullion"
In fact, looking at the manual again, the source text is different now, there's a span class="function" to cater for now.

I'm seeing this pattern consistently:
<a></a>
<h2><span>*function-name-here*</span></h2>

NOTE: This forum is obfuscating the anchor name and span class attributes, but I see a usable pattern emerging
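
A rough sketch of how that pattern might be used from newLISP, assuming the headings look like `<a name="..."></a>` followed by `<h2><span class="function">...</span></h2>`. The attribute values, the sample text, and the `doc-section` helper are assumptions for illustration, not taken from the manual itself.

```newlisp
; made-up sample; the real manual's attribute values may differ
(set 'sample (string
  {<a name="atan2"></a>} "\n"
  {<h2><span class="function">atan2</span></h2>} "\n"
  {<p>Arctangent of y/x.</p>} "\n"
  {<a name="begin"></a>} "\n"
  {<h2><span class="function">begin</span></h2>} "\n"
  {<p>Groups expressions.</p>} "\n"))

; [^>]* tolerates whatever attributes sit inside the span tag
(set 'names (find-all {<h2><span[^>]*>(.*?)</span></h2>} sample $1))
(println names)

; hypothetical helper: return the text between one heading and the next
; \Q...\E quotes the name so that names like nil? or + are taken literally
; the lookahead stops at the next anchor/heading pair or at the end of the text
(define (doc-section func-name text)
  (if (find (string {<h2><span[^>]*>\Q} func-name
                    {\E</span></h2>(.*?)(?=(?:<a[^>]*></a>\s*)?<h2>|\z)})
            text 4)
      $1))

(println (doc-section "atan2" sample))
(println (doc-section "begin" sample))
```

Using a lookahead instead of a fifth capture group means the last entry in the file is found as well, since `\z` matches at the end of the text.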
Quote
I dunno, I'm no regex wiz...

:-) Me neither, but count yer blessings, regexes are more of a headache

in elisp

Tim Johnson

#5
See http://www.johnsons-web.com/demo/newlisp/parse-nl-docs.r.txt



The following labels=>

char-entities:   ;; data structure

clean:              ;; subroutine

parse-all:         ;; subroutine



are the operational components. It's kind of quick-and-dirty, but

I tried to write it in a way that a newlisper could easily follow,

and provided some documentational comments.



If one can follow the logic:

1) I'd appreciate Lutz evaluating the accuracy of the logic.

2) It should be easy to write a newlisp script to accomplish the same.
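
As a starting point for such a script, here is a minimal newLISP sketch of what a `clean` step with a `char-entities` table could look like. It only borrows the label names from the list above. The entity list and the logic are assumptions for illustration and are not a port of the Rebol script.

```newlisp
; hypothetical entity table; extend as needed
; &amp; comes last so that "&amp;lt;" ends up as "&lt;" rather than "<"
(set 'char-entities
  '(("&lt;" "<") ("&gt;" ">") ("&quot;" "\"") ("&nbsp;" " ") ("&amp;" "&")))

(define (clean text)
  ; strip tags first, so a decoded < or > is never mistaken for markup
  (replace {<[^>]+>} text "" 0)
  ; then decode the character entities one by one
  (dolist (pair char-entities)
    (replace (first pair) text (last pair)))
  (trim text))

(println (clean {<p>if x &lt; y &amp;&amp; y &gt; 0</p>}))
```

The order matters here: decoding entities before stripping tags could turn `&lt;` into a `<` that the tag pattern then eats.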



See http://www.johnsons-web.com/demo/newlisp/newlisp-docs.txt

for the output.