baby steps...

Started by tom, July 16, 2004, 03:57:57 PM

tom

May I see some trivial examples of newlisp in action?



w3m is a text-mode browser that handles tables well. You can effectively strip the tags from an HTML file while preserving (pretty much) all the whitespace caused by the tags:

$ w3m file.html > file.txt

How can I loop through a directory full of .html files, converting them all to .txt files using w3m,
1. in the same directory
2. in a different, new directory?



Thanks, off to read the manual!

Lutz

#1
you can do the following:



(dolist (fle (directory)) (exec (append "w3m " fle " > " fle ".txt")))





The 'directory' function can also take a directory path as an argument.
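For the two cases in the original question (same directory, or a different new one), a minimal sketch along the same lines might first filter for .htm/.html names and build the output name from the part before the extension. The helper names w3m-command and convert-all are made up for illustration, both directory arguments are assumed to end in /, and file names are assumed to contain no spaces or shell metacharacters.

```newlisp
; Hypothetical helper: build the shell command for one file name,
; or return nil when the name does not end in .htm or .html.
(define (w3m-command indir outdir fle)
  (let (m (regex {^(.*)\.html?$} fle 0))
    ; (nth 3 m) is the first captured group: the name without extension
    (if m (append "w3m " indir fle " > " outdir (nth 3 m) ".txt"))))

; Hypothetical driver: run w3m once per matching file in indir.
(define (convert-all indir outdir)
  (make-dir outdir)                 ; returns nil if the directory already exists
  (dolist (fle (directory indir))
    (let (cmd (w3m-command indir outdir fle))
      (if cmd (exec cmd)))))

; Inspect the command string without running w3m:
(println (w3m-command "./" "/tmp/" "test.html"))
```

With this sketch, (convert-all "./" "./") would cover the same-directory case and (convert-all "./" "txt/") the new-directory case; neither call has been tried here against a real w3m install.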



Lutz

nigelbrown

#2
I've not fully tested it but it worked once:



(define (htm2txt indir outdir)

   (map (fn (x) (exec (append "w3m " indir (first x) " > " outdir (nth 3 x) ".txt")))

            (filter 'list? (map (fn (x) (regex "(.*).html*$" x 1)) (directory indir)))))



indir and outdir should both end in /, e.g.

(htm2txt "./" "/tmp/")





Nigel

nigelbrown

#3
Seeing the thread was 'baby steps' I'll step through the code:



Directory returns a list of files in the directory

> (directory "./")

("." ".." "3.htm" "bab-ubhelp.htm" "exlfus.htm" "gtk-server.log"

 "images" "License.txt" "models" "quit.blend" "Readme.txt" "stdout.txt"

 "test.html")

>



Next, map applies a regex to each name in the directory list. The regex matches names ending in .htm or .html and, because of the parentheses in the pattern, also captures the file name up to that extension. regex returns nil if the extension is not found, so the overall result is a list of nils and sublists:

> (map (fn (x) (regex "(.*).html*$" x 1)) (directory "./"))

(nil nil ("3.htm" 0 5 "3" 0 1) ("bab-ubhelp.htm" 0 14 "bab-ubhelp"

  0 10)

 ("exlfus.htm" 0 10 "exlfus" 0 6) nil nil nil nil nil nil nil

 ("test.html" 0 9 "test" 0 4))

>



filtering by whether it is a list removes the nils:

> (filter 'list? (map (fn (x) (regex "(.*).html*$" x 1)) (directory "./")))

(("3.htm" 0 5 "3" 0 1) ("bab-ubhelp.htm" 0 14 "bab-ubhelp" 0 10)

 ("exlfus.htm" 0 10 "exlfus" 0 6)

 ("test.html" 0 9 "test" 0 4))

>

The next step maps the anonymous function:

(fn (x) (exec (append "w3m " indir (first x) " > " outdir (nth 3 x) ".txt")))

which looks at one sublist, e.g. ("test.html" 0 9 "test" 0 4), and uses first to get the full htm(l) file name, e.g.

> (first '("test.html" 0 9 "test" 0 4))

"test.html"

>

and nth with index 3 to get the fourth element, which is the captured name without the extension (nth indexes start at zero), e.g.

> (nth 3 '("test.html" 0 9 "test" 0 4))

"test"

>

The function then uses append to build the full command string, including the directories, and passes it to exec for execution.

map applies this function to every sublist that survived the filter.



The define, of course, bundles this all up into a callable function.



Hope this makes it clear. dolist is another approach but I'm more used to map.



Regards

Nigel



PS I was not aware of w3m prior to this, but it is a useful addition to my Linux setup (it was easier to install from the binary rpm, as the gc6 dependencies were a bit tricky when trying to build from source).

nigelbrown

#4
As a further refinement, taking the length of the final list returned by map gives the number of files processed, viz.

(define (htm2txt indir outdir)
  (length
    (map (fn (x) (exec (append "w3m " indir (first x) " > " outdir (nth 3 x) ".txt")))
         (filter 'list? (map (fn (x) (regex "(.*).html*$" x 1)) (directory indir))))))

then

> (htm2txt "./" "./")

4





If you don't have w3m, you can test the workings of the function by substituting some other command for w3m, such as type on Windows, viz.

(define (htm2txt indir outdir)
  (length
    (map (fn (x) (exec (append "type " indir (first x) " > " outdir (nth 3 x) ".txt")))
         (filter 'list? (map (fn (x) (regex "(.*).html*$" x 1)) (directory indir))))))



to generalise the function:



> (define (any2any indir outdir progname fileregex outext)

  (length
    (map (fn (x) (exec (append progname " " indir (first x) " > " outdir (nth 3 x) outext)))
         (filter 'list? (map (fn (x) (regex fileregex x 1)) (directory indir))))))



thus

> (any2any "./" "./" "type" "(.*).html*$" ".txt")

4

>



then htm2txt can be defined as

(define (htm2txt indir outdir) (any2any indir outdir "w3m" "(.*).html*$" ".txt"))





Regards

Nigel

tom

#5
thanks guys, I'm studying your solutions. More baby steps to follow!

tom

#6
I'm putting my question here because I'm still at an

infant level in my newlisp understanding.  Anyway,

I have some questions.



I would like to process the output of a command.  I

think I need to put the output into a list to start

with (correct me if I'm wrong).



here's some example output. The first question is

obvious, how do I get it into a list?



(sorry, very basic stuff)



~ >> pacman -Qi frozen-bubble
Name           : frozen-bubble
Version        : 1.0.0-6
Groups         : None
Packager       : Arch Linux (http://www.archlinux.org)
URL            : http://www.frozen-bubble.org
License        : None
Architecture   : i686
Size           : 12116129
Build Date     : Mon Jun  6 17:17:58 2005 UTC
Install Date   : Wed Dec 21 07:43:32 2005 UTC
Install Script : No
Reason:        : explicitly installed
Provides       : None
Depends On     : sdl_mixer sdl_perl
Required By    : None
Conflicts With : None
Description    : A game in which you throw colorful bubbles and build
                 groups to destroy the bubbles

Lutz

#7
Put the command into an 'exec' call:



(exec "pacman -Qi frozen-bubble")


Everything the command writes to standard output will come back as a list of strings, one per line.
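Once the output is a list of strings, one possible next step is to split each "Key : Value" line into a pair. The following is only a sketch: parse-fields is a made-up helper name, the sample list stands in for what exec would return (copied from the pacman output above), and the pattern assumes keys and values are separated by whitespace, a colon, and more whitespace. Continuation lines without that separator are simply skipped.

```newlisp
; Sample lines standing in for (exec "pacman -Qi frozen-bubble")
(set 'sample-lines '("Name           : frozen-bubble"
                     "URL            : http://www.frozen-bubble.org"
                     "Architecture   : i686"
                     "Reason:        : explicitly installed"
                     "                 groups to destroy the bubbles"))

; Hypothetical helper: turn "Key : Value" lines into an association list.
(define (parse-fields text-lines)
  (let (result '())
    (dolist (ln text-lines)
      (let (m (regex {^(.*?)\s+:\s+(.*)$} ln 0))
        ; (nth 3 m) is the key, (nth 6 m) is the value
        (if m (push (list (nth 3 m) (nth 6 m)) result -1))))
    result))

(set 'info (parse-fields sample-lines))
(println (lookup "Architecture" info))   ; prints i686
```

Replacing sample-lines with the exec call from the post above should give the same kind of list for a real package, assuming pacman keeps this output layout.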



Lutz

cormullion

#8
If you want to find a particular piece of information somewhere in the list of strings returned by, say, exec, you can use the 'replace' command.



I recently discovered that this command can do something clever - instead of just replacing the text matched by a regular expression, you can 'replace' it with another newLISP expression altogether.



(dolist (_line _list)

    (replace "(Architecture.*: )(.*)" _line (println $0 "\n" $1 "\n" $2) 0))



This loops through _list, a list of strings, and when a string contains the matched regular expression it prints out the stored matches. For example, this looks for the Architecture line in your example, so that you could get the "i686" string.
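To keep the matched value instead of only printing it, the replacement expression could store it in a variable. This is a sketch under assumptions: sample-lines stands in for the list returned by exec, and found is a made-up variable name. The expression returns $0 so that the text of the line is replaced with itself.

```newlisp
; Sample lines standing in for the list of strings returned by exec
(set 'sample-lines '("Name           : frozen-bubble"
                     "Architecture   : i686"))

(set 'found '())
(dolist (ln sample-lines)
  ; The replacement expression is evaluated only when the pattern matches.
  ; It records the second group, then returns $0 to leave the text as it was.
  (replace "(Architecture.*: )(.*)" ln (begin (push $2 found -1) $0) 0))

(println found)
```

After the loop, found should hold one string per matching line, here just the architecture value.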



There are probably many alternatives!



hope this helps - I'm an infant too!