read-file speed

Started by cormullion, February 04, 2006, 12:03:39 AM


cormullion

I'm trying to create a newLISP version of a little Ruby script that I quite liked, with the same structure and at least equivalent speed:


class Histogram
  def initialize
    @tally = Hash.new(0)
  end
  def analyze(s)
    s.split.each do |word|
      myword = word.downcase.gsub(/[^a-z0-9]/, "")
      @tally[myword] = @tally[myword] + 1 if !myword.empty?
    end
    @tally.sort
  end
end

analysis = Histogram.new.analyze(File.new(ARGV[0]).read)
print analysis.length


The hash table/symbol stuff isn't the problem, but I've found that the way I'm using parse to build the word hash is much slower, because I'm using regular expressions.



This is what I have so far:


(context 'HASH)
(context 'MAIN)

(dolist (file-name (2 (main-args)))
  (set 'buff (parse (read-file file-name) {\s} 8))
  (dolist (the-word buff)
    (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
    (if (!= (context HASH (string "_" the-word)) nil)
        (begin
          (set 'occurrences (context HASH (string "_" the-word)))
          (if (!= occurrences nil)
              (context HASH (string "_" the-word) (inc 'occurrences))))
        (context HASH (string "_" the-word) 1))))

(sort (symbols HASH))
(dolist (s (symbols HASH))
  (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))
(println (length word-list))


This sort of works, but it's so much slower than the Ruby one that I'd like to know how to make it faster. Any clues or suggestions welcome!

Fanda

#1
I tried something... Maybe you'll get some ideas...


(setq word-list '())

(context 'HASH)
(context 'MAIN)

(define (word-count1 files , buff occurrences)
  (set 'word-list '())
  (dolist (file-name files)
     (set 'buff (parse (read-file file-name) {\s} 8))
     (dolist (the-word buff)
        (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
        (if (!= the-word "")
          (if (set 'occurrences (context HASH (append "_" the-word)))
            (context HASH (append "_" the-word) (+ 1 occurrences))
            (context HASH (append "_" the-word) 1)))))

  (dolist (s (sort (symbols HASH)))
     (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))

  (length word-list))


(define (word-count2 files , buff uniq-buff)
  (set 'word-list '())
  (let (not-empty? (lambda (s) (!= s ""))
        cleanup    (lambda (s) (replace "[^a-z0-9]" (lower-case s) "" 0)))
    (dolist (file-name files)
      (set 'buff (filter not-empty? (map cleanup (parse (read-file file-name) {\s} 0))))
      (set 'uniq-buff (unique buff))
      (set 'word-list (append word-list (map list uniq-buff (count uniq-buff buff))))))

  (set 'word-list (sort (unique word-list)))
  (length word-list))


(println (time (setq cnt (word-count1 (2 (main-args))))) " ms")
(println cnt " words")
(println)
(println (time (setq cnt (word-count2 (2 (main-args))))) " ms")
(println cnt " words")


Fanda

cormullion

#2
Hey Fanda - thanks for the post. I worked through your code, and thought there were some great ideas! It's always good to see an expert at work...



It's interesting to explore the trade-off between coding elegance and performance. It often seems that the most elegant and attractive high-level code is the least efficient. I noticed that the filter and map functions seemed to add some time to the execution, although I like the way they clarify the code. I suppose this is what expert programmers spend their time doing.



Anyway, I eventually found that the main reason for the relative slowness of my newLISP version was the regex parsing of the source file, which I didn't really need to do in this particular case in this way. This version is nearly twice as fast as the Ruby original, but slowed down whenever I tried to introduce filter or map:


(context 'HASH)
(context 'MAIN)

(dolist (file-name (rest (rest (main-args))))
  (set 'data (parse (replace "\r\n" (read-file file-name) " ") " "))
  (dolist (the-word data)
    (set 'the-word (replace "[^a-z0-9]" (lower-case the-word) "" 0))
    (set 'occurrences (context HASH the-word))
    (if (> occurrences 0)
        (set (sym the-word HASH) (+ 1 occurrences))
        (set (sym the-word HASH) 1))))

(sort (symbols HASH))
(dolist (s (symbols HASH))
  (push (list (name (sym s HASH)) (eval (sym s HASH))) word-list -1))
(println (length word-list))
(exit)


Thanks for your help!

Fanda

#3
Your version is much faster than mine, so who is the expert now?  ;-)





Anyway, I can see three things happening in your code:



1) (sort (symbols HASH)) isn't really going to sort the symbols - it only sorts the temporary list returned by (symbols HASH).

Currently it's sorted because symbols are implicitly sorted inside the context. It's better to write:
(dolist (s (sort (symbols HASH)))

2) parse and replace are returning "" (empty string) and this string is being counted too. Try doing:
(println {"": } (context HASH ""))

(It means that (length word-list) is off by 1.)
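For comparison, the Ruby original dodges this with its `if !myword.empty?` guard. A small Ruby sketch of the off-by-one (the sample string is just for illustration):

```ruby
# Counting words from a split, with and without an empty-string guard.
words = "a  b".split(/ /)      # regex split keeps the empty field: ["a", "", "b"]

unguarded = Hash.new(0)
guarded   = Hash.new(0)
words.each do |w|
  unguarded[w] += 1
  guarded[w] += 1 unless w.empty?
end

puts unguarded.length   # 3 - the empty string gets counted as a "word"
puts guarded.length     # 2
```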



3) (set 'occurrences (context HASH the-word)) will return either a number or nil.

You could rewrite
(if (> occurrences 0) ...
to just
(if occurrences ...
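This is also the branch that Hash.new(0) removes in the Ruby original - a missing key simply reads as 0, so there is no nil check at all. A tiny Ruby illustration:

```ruby
tally = Hash.new(0)                # missing keys read as 0, no nil branch needed
%w[the cat the].each { |word| tally[word] += 1 }
p tally.sort                       # prints [["cat", 1], ["the", 2]]
```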



I just wanted you to be aware of it, don't take it seriously ;-)



Fanda

Fanda

#4
map and filter can run fast, but not always. If you are using map, you can apply some optimizations.



Let's use a timing function:
;;
;; Time expressions for N times
;;
(define (time-expr _expressions _N)
  (if (not (apply = (map eval _expressions)))
    (begin
      (map (lambda (_e) (println _e " => " (eval _e))) _expressions)
      (throw-error "Return values of timed expressions do not match!")))

  (println)

  (letn (times (map (lambda (_e) (time (eval _e) _N)) _expressions)
         lentimes (apply max (map (lambda (x) (length (string x))) times))
         maxtime (apply max times))
    (if (= maxtime 0)
      (println "All times are 0!")
      (map (lambda (t _e) (println (format (append "%" (string lentimes) "d") t) " ms : " (format "%5.1f" (mul (div t maxtime) 100.0)) "% | " _e )) times _expressions))))


If you need the first element of every sublist in a nested list 'lst', you can do:
(map first lst)
This is quite fast. The rule is: try to stay away from mapping a lambda function. Mapping a lambda means calling a function, passing arguments, and returning from it for every element, and all of that takes time. Let's see timings for different cases:
(setq lst (dup '(1 2 3) 1000))

(time-expr '((map first lst)
             (map (lambda (x) (first x)) lst)
             (begin (setq r '()) (dolist (x lst) (push (first x) r -1)) r)) 500)

----------------
266 ms :  60.7% | (map first lst)
438 ms : 100.0% | (map (lambda (x) (first x)) lst)
406 ms :  92.7% | (begin
 (setq r '())
 (dolist (x lst)
  (push (first x) r -1)) r)


Another example - add 2 to every element of a list:
(setq lst (sequence 1 1000))

(time-expr '((map (lambda (x) (+ 2 x)) lst)
             (map + (dup 2 (length lst)) lst)
             (begin (setq r '()) (dolist (x lst) (push (+ 2 x) r -1)) r)) 800)

----------------
500 ms : 100.0% | (map (lambda (x) (+ 2 x)) lst)
328 ms :  65.6% | (map + (dup 2 (length lst)) lst)
328 ms :  65.6% | (begin
 (setq r '())
 (dolist (x lst)
  (push (+ 2 x) r -1)) r)

The trick is: instead of using a lambda function, make an additional list filled with constants:
(map + (dup 2 (length lst)) lst)

And another example - multiply list elements by 2:
(setq lst (sequence 1 1000))

(time-expr '((map (lambda (x) (* 2 x)) lst)
             (map * (dup 2 (length lst)) lst)
             (map + lst lst)
             (begin (setq r '()) (dolist (x lst) (push (+ x x) r -1)) r)) 800)

----------------
484 ms : 100.0% | (map (lambda (x) (* 2 x)) lst)
328 ms :  67.8% | (map * (dup 2 (length lst)) lst)
328 ms :  67.8% | (map + lst lst)
375 ms :  77.5% | (begin
 (setq r '())
 (dolist (x lst)
  (push (+ x x) r -1)) r)


Timings vary between runs, but they still show the general picture of what is faster and what is slower.
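Ruby's standard Benchmark library can sketch the same kind of side-by-side timing for anyone who wants to try the comparison on the Ruby side. This harness is illustrative, not from the thread, and draws no performance conclusion itself:

```ruby
require "benchmark"

lst = [[1, 2, 3]] * 1000

# Both forms compute the same result; only the per-element call overhead differs.
raise unless lst.map(&:first) == lst.map { |x| x.first }

Benchmark.bm(16) do |b|
  b.report("map(&:first)")   { 500.times { lst.map(&:first) } }
  b.report("map with block") { 500.times { lst.map { |x| x.first } } }
end
```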



Fanda

cormullion

#5
Hey Fanda - cool! I've got some homework to study this week... :-)



Thanks!

cormullion

#6
Quote from "cormullion": Anyway, I eventually found that the main reason for the relative slowness of my newLISP version was the regex parsing of the source file


By some spooky coincidence, the latest version of newLISP (8.8) now runs regex parsing of a source file much faster. So I can revert to the earlier version, which is now nearly three times faster than the Ruby original...



Great work, Lutz!