counting words quickly

Started by cormullion, March 23, 2007, 09:21:43 AM

Previous topic - Next topic

cormullion

An unfamiliar sight, perhaps. A Perl script to cound the number of 'and's in a novel:


#!/usr/bin/perl -w
@ARGV = ("/Users/me/study-in-scarlet.txt");
my $line = 0;
$count = 0;
$word = 0;
while (<>) {
    foreach $word (m/band/ig) {
      ++$count;
      }
    }
print $count,"n";


I want a newLISP script that runs as faster than this Perl one and uses a similar approach. But I can't do it. My best attempt is 3 times slower. What am I doing wrong?



(set 'path "/Users/me/study-in-scarlet.txt")

(set 'count_ 0)
(dolist (w (parse (read-file path) "\s" 1))
  (if (= w "and")
    (inc 'count_)))
(println count_)
(exit)

Lutz

#1
Perl is hard to beat if you do it the Perl way. Do it the newLISP way and you are more than 2 times faster:


#!/usr/bin/newlisp

(set 'argv "/Users/Lutz/mrdemo/Sherlock.txt")
(set 'cnt (length (find-all "(?i)\band" (read-file argv))))

(println cnt)
(exit)


I used http://www.gutenberg.org/dirs/etext99/advsh12.txt">http://www.gutenberg.org/dirs/etext99/advsh12.txt for Sherlock.txt.


~> time ./andcount.pl
3065

real    0m0.136s
user    0m0.090s
sys     0m0.008s

~> time ./andcount.lsp
3065

real    0m0.059s
user    0m0.015s
sys     0m0.009s
~>


Lutz

cormullion

#2
:-) Yes. That's a nice version.



Why do you think my version was slower, though...? Is parse working harder than find-all - I suppose it is returning much more info...

ino-news

#3
Quote from: "cormullion"
Why do you think my version was slower, though...? Is parse

working harder than find-all - I suppose it is returning much more

info...


i guess find-all lets the pcre-lib do most of the work.  --clemens
regards, clemens

Jeff

#4
Quote(set 'path "/Users/me/study-in-scarlet.txt")

(set 'count_ 0)
(dolist (w (parse (read-file path) "\s" 1))
  (if (= w "and")
    (inc 'count_)))
(println count_)
(exit)


My take: what takes so long is that you first expand the file to a list using parse, this assigns a huge number of strings (which according to the description of the memory management on the site, is not what newlisp is optimized for) to the list, then iterates over it.  This means that the internal dolist macro (or built in c function or whatever) is having to index its way through the list and then test for equality on each item against another string that is being assigned each iteration: "and".  Then, it has to increment 'count_, although that is probably pretty easy on resources.



Lutz's is only assigns the regexp string once and probably is not using dolist for iteration.  The best way to write lisp (at least from my experience with older lisps) is to think of the code in terms of the expansion/evaluation of each list element.



This is what makes lisps both more cryptic and more interesting to use.  The inside-out evaluation allows you to do things and express code in ways that other languages would never permit.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

cormullion

#5
Hi Jeff - thanks for the explanations!



Glad to see your recent contributions, too. What do you think of newLISP, given your experience with other languages?

rickyboy

#6
Also, did you notice that Lutz's newlisp program is much more concise than the perl program?  It beats perl at it's own game there too.  He, he, he.  :-)



It can also be made more concise by removing the unneeded symbol 'set'ting; however, I don't think you can get away with the same thing in the perl version (but I may be wrong).  So, chalk up another one for newlisp!  --Ricky
(λx. x x) (λx. x x)

Jeff

#7
I like it. It's neat to have a lisp that does regex natively (and more cleanly than clisp) and abstracts so many things that are a real pain in older lisps.  And, of course, I think it's much more elegant and clean than most interpreted imperative languages.



I do have a bit of trouble, though, because all variables and symbols that are not explicitly declared appear to evaluate to nil.  This has huge effects on recursion, although I understand that recursion isn't always the preferred method with newLISP.



Not that I am a lisp expert; I just think it's a better way of doing things.  I started on Perl and came to lisp last and with no degree in CS, so it's often difficult for me to pick up some things in the language.



One thing I do miss from common lisp, though, are plists (a la (:foo bar :blah yadda) rather than ((foo bar) (blah yadda))).  They made things much more concise, in my (albeit uneducated) opinion.



The scoping is also a bit odd for me, especially where contexts are concerned.  I'm used to using symbols rather than strings where possible and only converting them to strings when output is needed.  If I do this in a separate context, that context needs to know which context is calling it in order to use symbols passed to an argument. For example:



(context 'FOO)
(define (hello person) (println "Hello " (string person)))
(context 'BAR)
(FOO:hello 'Jeff) => "Hello BAR:Jeff"


This can be important if you are trying to make a module that can be used from anywhere.



Lutz, is there something I'm missing on this?



EDIT:  Fixed the code to demonstrate what I was intending to.  Wrote too quickly and forgot what I was trying to illustrate.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#8
You could do this:


(context 'FOO)
(define (hello person)
    (println "Hello " (name person) " from "  (name person true)))
(context 'BAR)

(FOO:hello 'Jeff) => Hello Jeff from BAR


See: http://newlisp.org/downloads/newlisp_manual.html#name">http://newlisp.org/downloads/newlisp_manual.html#name



Lutz

Jeff

#9
Thanks, I missed that boolean option on the end of name my first time through the docs.  That helps a ton!
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code