Code Select
# trigrams in clojure (author: Ric Szopa)
(defn- trigrams [str]
(let [n 3
padding (repeat (dec n) $)
cleaned (remove #(Character/isWhitespace %) (concat padding
(clojure.string/lower-case str)))
freqs (frequencies (partition n 1 cleaned))
norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]
(into {} (for [[k v] freqs] [k (/ v norm)]))))
(defn similarity [left right]
(if (= left right)
1.0
(let [left (trigrams left)
right (trigrams right)]
(apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))
I decided to try writing it in newlisp...
Code Select
(define (ngrams d str, (norm 0) (str-clean (append " " (lower-case (replace " " str "")))))
(dotimes (i (- (length str-clean) 2))
(bayes-train (unpack "s3" (+ i (address str-clean))) d))
(dotree (x d true)
(inc norm (pow ((context d x) 0))))
(setq norm (sqrt norm))
(dotree (x d true)
(context d x (div ((context d x) 0) norm)))
d)
(define (similarity left right, (accum 0))
(if (= left right)
1.0
(let ((left (ngrams 'Left left))
(right (ngrams 'Right right)))
(dotree (s left true)
(inc accum (mul (context Left s) (or (context Right s) 0))))
accum)))
# (similarity "banana" "bananana") => 0.9722718241
...which turned out to be only a few lines longer.
In order to speed it up I tried the following strategies:
* store data set in a context and iterate over it using dotree (instead of list + dolist)
* i used low-level "unpack" for faster string iteration
* use built-in methods when possible (eg bayes-train)
One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').
My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?
Take care
Nick