read-file speed

Started by cormullion, February 04, 2006, 12:03:39 AM


cormullion

I'm trying to create a newLISP version of a little Ruby script that I quite liked, with the same structure and at least equivalent speed:


class Histogram
  def initialize
    @tally = Hash.new(0)
  end
  def analyze(s)
    s.split.each do |word|
      myword = word.downcase.gsub(/[^a-z0-9]/, "")
      @tally[myword] = @tally[myword] + 1 if !myword.empty?
    end
    @tally.sort
  end
end

analysis = Histogram.new.analyze(File.new(ARGV[0]).read)
print analysis.length


The hash table/symbol stuff isn't the problem, but I've found that the way I'm using parse to build the word hash is much slower, because I'm using regular expressions.



This is what I have so far:


(context 'HASH)
(context 'MAIN)

(dolist (file-name (2 (main-args)))
  (set 'buff (parse (read-file file-name) {\s} 8))
  (dolist (the-word buff)
    (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
    (if (!= (context HASH (string "_" the-word)) nil)
        (begin
          (set 'occurrences (context HASH (string "_" the-word)))
          (if (!= occurrences nil)
              (context HASH (string "_" the-word) (inc 'occurrences))))
        (context HASH (string "_" the-word) 1))))

(sort (symbols HASH))
(dolist (s (symbols HASH))
  (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))
(println (length word-list))


This sort of works, but it's so much slower than the Ruby one that I'd like to know how to make it faster. Any clues or suggestions welcome!

Fanda

#1
I tried something... Maybe you'll get some ideas...


(setq word-list '())

(context 'HASH)
(context 'MAIN)

(define (word-count1 files , buff occurrences)
  (set 'word-list '())
  (dolist (file-name files)
     (set 'buff (parse (read-file file-name) {\s} 8))
     (dolist (the-word buff)
        (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
        (if (!= the-word "")
          (if (set 'occurrences (context HASH (append "_" the-word)))
            (context HASH (append "_" the-word) (+ 1 occurrences))
            (context HASH (append "_" the-word) 1)))))

  (dolist (s (sort (symbols HASH)))
     (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))

  (length word-list))


(define (word-count2 files , buff uniq-buff)
  (set 'word-list '())
  (let (not-empty? (lambda (s) (!= s ""))
        cleanup    (lambda (s) (replace "[^a-z0-9]" (lower-case s) "" 0)))
    (dolist (file-name files)
      (set 'buff (filter not-empty? (map cleanup (parse (read-file file-name) {\s} 0))))
      (set 'uniq-buff (unique buff))
      (set 'word-list (append word-list (map list uniq-buff (count uniq-buff buff))))))

  (set 'word-list (sort (unique word-list)))
  (length word-list))


(println (time (setq cnt (word-count1 (2 (main-args))))) " ms")
(println cnt " words")
(println)
(println (time (setq cnt (word-count2 (2 (main-args))))) " ms")
(println cnt " words")


Fanda

cormullion

#2
Hey Fanda - thanks for the post. I worked through your code, and thought there were some great ideas! It's always good to see an expert at work...



It's interesting to explore the trade-off between coding elegance and performance. It often seems that the most elegant and attractive high-level code is the least efficient. I noticed that the filter and map functions seemed to add some time to the execution, although I like the way they clarify the code. I suppose this is what expert programmers spend their time doing.



Anyway, I eventually found that the main reason for the relative slowness of my newLISP version was the regex parsing of the source file, which I didn't really need to do in this particular case in this way. This version is nearly twice as fast as the Ruby original, but slowed down whenever I tried to introduce filter or map:


(context 'HASH)
(context 'MAIN)

(dolist (file-name (rest (rest (main-args))))
  (set 'data (parse (replace "\r\n" (read-file file-name) " ") " "))
  (dolist (the-word data)
    (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
    (set 'occurrences (context HASH the-word))
    (if (> occurrences 0)
        (set (sym the-word HASH) (+ 1 occurrences))
        (set (sym the-word HASH) 1))))

(sort (symbols HASH))
(dolist (s (symbols HASH))
  (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))
(println (length word-list))
(exit)


Thanks for your help!

Fanda

#3
Your version is much faster than mine, so who is the expert now?  ;-)





Anyway, I can see three things happening in your code:



1) (sort (symbols HASH)) isn't really going to sort the symbols - it only sorts the temporary list returned by (symbols HASH).

Currently it's sorted because symbols are implicitly sorted inside the context. It's better to write:
(dolist (s (sort (symbols HASH)))

2) parse and replace are returning "" (empty string) and this string is being counted too. Try doing:
(println {"": } (context HASH ""))

(It means that (length word-list) is off by 1.)
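For comparison, the Ruby original dodges this with its `if !myword.empty?` guard. A small Ruby sketch of the off-by-one (the sample string is just for illustration):

```ruby
# Counting words from a split, with and without an empty-string guard.
words = "a  b".split(/ /)      # regex split keeps the empty field: ["a", "", "b"]

unguarded = Hash.new(0)
guarded   = Hash.new(0)
words.each do |w|
  unguarded[w] += 1
  guarded[w] += 1 unless w.empty?
end

puts unguarded.length   # 3 - the empty string gets counted as a "word"
puts guarded.length     # 2
```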



3) (set 'occurrences (context HASH the-word)) will return either a number or nil.

You could rewrite
(if (> occurrences 0) ...
to just
(if occurrences ...
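This is also the branch that Hash.new(0) removes in the Ruby original - a missing key simply reads as 0, so there is no nil check at all. A tiny Ruby illustration:

```ruby
tally = Hash.new(0)                # missing keys read as 0, no nil branch needed
%w[the cat the].each { |word| tally[word] += 1 }
p tally.sort                       # prints [["cat", 1], ["the", 2]]
```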



I just wanted you to be aware of it, don't take it seriously ;-)



Fanda

Fanda

#4
map and filter can run fast, but not always. If you are using map, you can apply some optimizations.



Let's use a timing function:
;;
;; Time expressions for N times
;;
(define (time-expr _expressions _N)
  (if (not (apply = (map eval _expressions)))
    (begin
      (map (lambda (_e) (println _e " => " (eval _e))) _expressions)
      (throw-error "Return values of timed expressions do not match!")))

  (println)

  (letn (times (map (lambda (_e) (time (eval _e) _N)) _expressions)
         lentimes (apply max (map (lambda (x) (length (string x))) times))
         maxtime (apply max times))
    (if (= maxtime 0)
      (println "All times are 0!")
      (map (lambda (t _e) (println (format (append "%" (string lentimes) "d") t) " ms : " (format "%5.1f" (mul (div t maxtime) 100.0)) "% | " _e )) times _expressions))))


If you need the first element of every sublist in a nested list 'lst', you can do:
(map first lst)
This is quite fast. The rule is: try to stay away from mapping a lambda function. Mapping a lambda means calling a function, passing arguments, and returning from it for every element, and all of that takes time. Let's see timings for different cases:
(setq lst (dup '(1 2 3) 1000))

(time-expr '((map first lst)
             (map (lambda (x) (first x)) lst)
             (begin (setq r '()) (dolist (x lst) (push (first x) r -1)) r)) 500)

----------------
266 ms :  60.7% | (map first lst)
438 ms : 100.0% | (map (lambda (x) (first x)) lst)
406 ms :  92.7% | (begin
 (setq r '())
 (dolist (x lst)
  (push (first x) r -1)) r)


Another example - add 2 to every element of a list:
(setq lst (sequence 1 1000))

(time-expr '((map (lambda (x) (+ 2 x)) lst)
             (map + (dup 2 (length lst)) lst)
             (begin (setq r '()) (dolist (x lst) (push (+ 2 x) r -1)) r)) 800)

----------------
500 ms : 100.0% | (map (lambda (x) (+ 2 x)) lst)
328 ms :  65.6% | (map + (dup 2 (length lst)) lst)
328 ms :  65.6% | (begin
 (setq r '())
 (dolist (x lst)
  (push (+ 2 x) r -1)) r)

The trick is: instead of using a lambda function, make an additional list filled with constants:
(map + (dup 2 (length lst)) lst)

And another example - multiply list elements by 2:
(setq lst (sequence 1 1000))

(time-expr '((map (lambda (x) (* 2 x)) lst)
             (map * (dup 2 (length lst)) lst)
             (map + lst lst)
             (begin (setq r '()) (dolist (x lst) (push (+ x x) r -1)) r)) 800)

----------------
484 ms : 100.0% | (map (lambda (x) (* 2 x)) lst)
328 ms :  67.8% | (map * (dup 2 (length lst)) lst)
328 ms :  67.8% | (map + lst lst)
375 ms :  77.5% | (begin
 (setq r '())
 (dolist (x lst)
  (push (+ x x) r -1)) r)


Timings vary between runs, but they still show the general picture of what is faster and what is slower.
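Ruby's standard Benchmark library can sketch the same kind of side-by-side timing for anyone who wants to try the comparison on the Ruby side. This harness is illustrative, not from the thread, and draws no performance conclusion itself:

```ruby
require "benchmark"

lst = [[1, 2, 3]] * 1000

# Both forms compute the same result; only the per-element call overhead differs.
raise unless lst.map(&:first) == lst.map { |x| x.first }

Benchmark.bm(16) do |b|
  b.report("map(&:first)")   { 500.times { lst.map(&:first) } }
  b.report("map with block") { 500.times { lst.map { |x| x.first } } }
end
```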



Fanda

cormullion

#5
Hey Fanda - cool! I've got some homework to study this week... :-)



Thanks!

cormullion

#6
Quote from "cormullion": Anyway, I eventually found that the main reason for the relative slowness of my newLISP version was the regex parsing of the source file


By some spooky coincidence, the latest version of newLISP (8.8) now runs regex parsing of a source file much faster. So I can revert to the earlier version, which is now nearly three times faster than the Ruby original...



Great work, Lutz!