New word stemmer, now with multiple language support!

Started by methodic, December 09, 2009, 11:49:14 AM


methodic

Hi, a while back I wrote a newLISP word stemmer that used a stemming library:

http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=2678&p=14985&hilit=stemmer#p14985



I had to revisit this and decided to replace that library with a different one. The library I'm using now is located here:

http://snowball.tartarus.org/dist/libstemmer_c.tgz



Here is my code, with comments:
;; stemmer.lsp by tony lambiris <tony@libpcap.net>
;;
;; supported languages:
;; danish dutch english finnish french german hungarian italian norwegian
;; porter portuguese romanian russian spanish swedish turkish
;;
;; HOW TO USE:
;;
;; 1) download http://snowball.tartarus.org/dist/libstemmer_c.tgz
;; 2) extract the archive and, in the libstemmer_c directory, run 'make CFLAGS=-fPIC'
;; 3) build the shared library:
;;    gcc -shared -o libstemmer.so libstemmer/*.o src_c/*.o runtime/*.o
;; 4) copy libstemmer.so to the path of your choice (defined by STEMLIB)
;; 5) follow the code examples below to use
;;
;; > (load "stemmer.lsp")
;; MAIN
;; > (STEM:stemmer "english" '("who" "directed" "taxi" "driver"))
;; ("who" "direct" "taxi" "driver")
;; > (STEM:stemmer "french" (parse "Je vous en prie" " "))
;; ("Je" "vous" "en" "pri")

(context 'STEM)

;; change this to the location where you installed the shared library
(constant 'STEMLIB "/home/tlambiris/Code/libstemmer.so")

;; imported function names
(import STEMLIB "sb_stemmer_new")
(import STEMLIB "sb_stemmer_stem")
(import STEMLIB "sb_stemmer_delete")

;; takes 2 parameters (respectively):
;; 1) the language to use for word stemming
;; 2) a list of words
;;
;; if word stemming is successful, a list of stemmed words will be returned
;; otherwise the original list of words will be returned
(define (stemmer lang words)
  ;; create the stemmer once per call; 0 (NULL) selects the default UTF-8 encoding
  (let (s (sb_stemmer_new lang 0) new_words '())
    (if (= s 0)
      ;; unknown language or out of memory: return the original list of words
      words
      (begin
        (dolist (w words)
          ;; (length w) is the byte length, which is what sb_stemmer_stem expects
          (let (p (sb_stemmer_stem s w (length w)))
            ;; a NULL result means out of memory; keep the word unchanged,
            ;; otherwise copy the stemmed string before the next call reuses it
            (push (if (= p 0) w (get-string p)) new_words -1)
          )
        )
        (sb_stemmer_delete s)
        new_words
      )
    )
  )
)

(context MAIN)
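
For anyone who wants to stem a whole sentence rather than a prepared list, here is a minimal sketch. It assumes libstemmer.so has been built as described above and that STEMLIB points at it. The helper name stem-text is hypothetical and not part of stemmer.lsp. It lowercases the text first, on the assumption that the Snowball stemmers expect lowercase input, and splits on single spaces only, so punctuation stays attached to the words.

```newlisp
(load "stemmer.lsp")

;; hypothetical helper, not part of stemmer.lsp:
;; lowercase the text, split it on spaces, and stem each word
(define (stem-text lang text)
  (STEM:stemmer lang (parse (lower-case text) " ")))

(println (stem-text "english" "Who directed Taxi Driver"))
```

If the language name is not recognised, STEM:stemmer hands back the word list it was given, so the helper would then return the lowercased, unstemmed words.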


Here is a direct link:

http://libpcap.net/newlisp/stemmer.lsp



Questions/comments welcome! :) Again, thanks to Lutz for creating such a powerful language!

joejoe

hi methodic,



can't do much yet but say thanks! it looks like a very useful script.



thanks big for throwing it out there for us. :D