approximate string matching using trigrams

Started by nallen05, April 30, 2012, 01:21:31 PM

Previous topic - Next topic

nallen05

My friend Ric Szopa wrote a "similarity" function in clojure to compare how similar two strings are using the trigrams approach (find all the sequential 3-letter combos in both words and measure the overlap):



# trigrams in clojure (author: Ric Szopa)
(defn- trigrams [str]
 (let [n 3
       padding (repeat (dec n) $)
       cleaned (remove #(Character/isWhitespace %) (concat padding
(clojure.string/lower-case str)))
       freqs (frequencies (partition n 1 cleaned))
       norm (Math/sqrt (reduce + (map #(Math/pow % 2) (vals freqs))))]
   (into {} (for [[k v] freqs] [k (/ v norm)]))))

(defn similarity [left right]
 (if (= left right)
   1.0
   (let [left (trigrams left)
         right (trigrams right)]
     (apply + (for [[trigram value] left] (* value (or (right trigram) 0)))))))


I decided to try writing it in newlisp...



(define (ngrams d str, (norm 0) (str-clean (append "  " (lower-case (replace " " str "")))))
  (dotimes (i (- (length str-clean) 2))
(bayes-train (unpack "s3" (+ i (address str-clean))) d))
  (dotree (x d true)
    (inc norm (pow ((context d x) 0))))
  (setq norm (sqrt norm))
  (dotree (x d true)
    (context d x (div ((context d x) 0) norm)))
d)

(define (similarity left right, (accum 0))
  (if (= left right)
 1.0
 (let ((left (ngrams 'Left left))
(right (ngrams 'Right right)))
  (dotree (s left true)
 (inc accum (mul (context Left s) (or (context Right s) 0))))
accum)))

#  (similarity "banana" "bananana") => 0.9722718241


...which turned out to be only a few lines longer.



In order to speed it up I tried the following strategies:

* store data set in a context and iterate over it using dotree (instead of list + dolist)

* i used low-level "unpack" for faster string iteration

* use built-in methods when possible (eg bayes-train)



One consequence of all the dotree's is that I felt like I ended up doing a lot more stateful variable manipulation (eg creating a variable and then calling "inc" on it within the block of the 'dotree') than I would have in other dialects of lisp (where I would have used eg 'reduce').



My question for the forum is: does anyone have any tips or tricks for optimizing newlisp code (using trigrams as an example) or corrections to my approach?



Take care



Nick