Difficult one

Started by Jeff, May 03, 2007, 02:04:00 PM

Previous topic - Next topic

Jeff

I wrote a little US-style proper name parser for some common name formats (to help with some large lists of data I needed to import, with names in various formats).  I wrote one in Python, which is not very optimized for this sort of thing, and one in newLISP, which I believe should be, and the newLISP one is running much slower.



The problem seems to be with my matching algorithm.  It runs through around ten regexp patterns on each name (after 'normalizing' the name) until it hits one that matches.  I've ordered the patterns as best I can so they match as efficiently as possible.  However, there is a huge speed difference between the normalization function, which does a bunch of replaces, and the parse-name function, which does a bunch of regex searching.



After running through nearly 8000 names, the Python version has about a half-second lead on my quad-core G5.  They both use nearly the same algorithm to do matching and replacing (in fact, I copied and pasted most of the regexps).



Any ideas?


#!/usr/bin/env newlisp

(define (int->roman num , result)
  (set 'result "")
  (let ((numerals '((1 "I") (4 "IV") (5 "V") (9 "IX") (10 "X") (40 "XL")
                  (50 "L") (90 "XC") (100 "C") (400 "CD") (500 "D") (900 "CM")
                  (1000 "M"))))
       (dolist (numeral-set (reverse numerals))
               (let ((value (first numeral-set)) (numeral (last numeral-set)))
                    (while (>= num value)
                      (begin
                        (set 'result (join (list result numeral)))
                        (set 'num (- num value))))))) result)

(constant '*patterns*
'(({^\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$} (first-name last-name))
 ({^\b([\w-']{2,})\b\s\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b(\w\.)\s\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b(\w\.)\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,}\b\s\b[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.\s\w\.)\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b([\w-']{2,}\b\s\b[\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b([\w-']{2,}\b\s\b[\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$} (first-name middle-name last-name))))

(constant '*romans-list* (map 'int->roman (sequence 1 25)))
(constant '*romans* (format {\s(?P<suffix>%s)$}
  (join (map (lambda (i)
                     (format "(?:%s)" i))
        *romans-list*) "|")))
(constant '*suffixes* {\b(?P<suffix>[JjSs][Rr]).*\b})
(set '*parse-errors* '())

(define (normalize nm)
  (let ((comma-matches (find-all {,(?!(\s*[JjSs][Rr]\b\.))} nm)))
       (cond ((= (length comma-matches) 2)
              (replace {^\s*(.+?)\s*,\s*(.+?)\s*,\s*(.+?)\s*$} nm (format "%s %s %s" $3 $1 $2) 0))
             ((= (length comma-matches) 1)
              (replace {^\s*(.+?)\s*,\s*(.+?)\s*$} nm (format "%s %s" $2 $1) 0)))
       (replace "," nm "")
       (replace {(\.)\s*} nm ". " 0)
       (replace {\b(\w)\b[^$.]} nm (format "%s. " (upper-case $1)) 0)
       (replace {\b([\w-']+?)\b} nm (title-case $1 true) 0)
       (replace {\b([JjSs][Rr])\b.*} nm (format "%s." $1) 0)
       (replace *romans* nm (format " %s" (upper-case $1)) 1)
       (set 'nm (trim nm))
       (replace {\s+} nm " " 0)) nm)

(define (match-pattern str pattern , p-match)
  (set 'data '((first-name nil) (middle-name nil) (last-name nil)))
  (set 'p-match (regex (first pattern) str))
  (if (not (nil? p-match))
      (begin
        (dolist (x (last pattern))
          (replace-assoc x data (list x (p-match (* 3 (+ 1 $idx))))))) nil))

(define (parse-name nm , data original-name)
  (set 'original-name nm)
  (set 'nm (normalize nm))
  (set 'data '((first-name nil) (middle-name nil)
              (last-name nil) (suffix nil)))
  (let ((suffix-match (regex {\b(?P<suffix>[JjSs][Rr]).*\b} nm))) ;("Jr" 18 2 "Jr" 18 2)
       (if (not (nil? suffix-match))
           (begin
             (replace-assoc 'suffix data (list 'suffix (suffix-match 3)))
             (replace (format " %s.*" (last (assoc 'suffix data))) nm "" 1))
           (begin
             (set 'suffix-match (regex *romans* nm 1))
             (if (not (nil? suffix-match))
                 (begin
                   (replace-assoc 'suffix data (list 'suffix (suffix-match 3)))
                   (replace (format " %s" (last (assoc 'suffix data))) nm "" 1))))))
  (let ((good-match (catch
                      (dolist (pattern *patterns*)
                        (let ((possible-match (match-pattern nm pattern)))
                             (if (not (nil? possible-match))
                               (throw possible-match)))))))
       (if (nil? good-match)
           (push original-name *parse-errors* -1)
           (dolist (cell good-match)
                   (replace-assoc (first cell) data cell)))) data)

(load "sample-names.lsp")

(map parse-name nm-list)
(println *parse-errors*)
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

cormullion

#1
I ask these questions sometimes, too. The usual outcome is that if you write Python (or Perl) in newLISP, it doesn't perform quite as well as Python written in Python, Perl in perl, or newLISP in newLISP. Having said that, though, the problem comes when there's no other obvious way to write things.



If I get time I'd like to run through it - perhaps you could show a bit of sample data... newLISP has a few useful timing functions...
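Something along these lines (untested; it just wraps the built-in 'time' around Jeff's own functions and name list) would at least show where the milliseconds go:

;; rough timing sketch -- 'normalize', 'parse-name' and 'nm-list' are from Jeff's script;
;; 'time' returns the evaluation time in milliseconds
(println "normalize:  " (time (map normalize nm-list)) " ms")
(println "parse-name: " (time (map parse-name nm-list)) " ms")
;; or repeat a single call many times to smooth out noise
(println (time (parse-name "Some P. Body") 1000) " ms for 1000 calls")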

Lutz

#2
I re-did the Debian language shootout regex benchmark on my MacMini G4 and it shows newLISP being faster in regex pattern matching. So I suppose the slower behaviour you see in newLISP is not caused by regex pattern matching but by something else. The code is too much and the time too little ;-) for me to analyze this further, but there must be a way to make this faster.



 time ./regexmatch.python 1000  < regexmatch



real    0m0.676s

user    0m0.443s

sys     0m0.076s





 time ./regexmatch.newlisp 1000  < regexmatch.input



real    0m0.373s

user    0m0.311s

sys     0m0.008s





The benchmarks from here: http://newlisp.org/benchmarks/ are for Intel CPUs on Win32, but they give a few hints where the differences in performance are for Perl, Python and newLISP.





Lutz

nigelbrown

#3
Perhaps a function timing would show the slow point?

These links suggest approaches

Modify function defs to collect timing data when called: http://www-users.cs.umn.edu/~gini/lisp/metering.cl



Interrupt the program and record what function it is in: http://wwwcgi.rdg.ac.uk:8081/cgi-bin/cgiwrap/wsi14/poplog/lisp/help/profile and http://www.franz.com/support/documentation/6.2/doc/profiling.htm

Maybe this could be done in newlisp using the (timer ) function if there was a way of checking what function has been interrupted?
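A rough version of the first approach, using 'time' rather than 'timer' (untested sketch; the *normalize-ms* and *parse-ms* names are made up, and parse-name's total still includes the normalize work it does internally):

(set '*normalize-ms* 0)
(set '*parse-ms* 0)

;; total the milliseconds spent in each phase over the whole list
(dolist (nm nm-list)
  (set '*normalize-ms* (add *normalize-ms* (time (normalize nm))))
  (set '*parse-ms* (add *parse-ms* (time (parse-name nm)))))

(println "normalize: " *normalize-ms* " ms, parse-name: " *parse-ms* " ms")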



Nigel

Jeff

#4
Corm:  It's not very python-esque code.  In python, I tend to rely on list-comprehensions and then use pretty lispy code, because lispy code tends to outperform (except in PHP, where function overhead is huge and you can't pass functions as args [only strings that get eval'd into functions or PHP's crippled lambda, which is not subject to garbage collection]).  The *technique* is the same - read the file in, normalize the names so matching is consistent (capitalization, periods after initials, order of names, etc).



Lutz:  I know, I've done my own tests and it is much faster at both reading in file data and regexp matching.  That's what is so frustrating about the problem.



Nigel:  I know where the problem is.  newLisp outperforms python in the first stages of reading in the file and then normalizing the names with regexp replacements.  The problem is in the short-circuit logic that runs through the series of regexps in *patterns* and stops at the first good match.  It is taking significantly longer in newLisp than in python - enough that it kills the performance gains on the first section.



I'm pretty sure the problem is not in the technique (although parsing by whitespace and lopping off slices would be faster, it is not extensible or as elegant as a solution using pattern recognition; besides, emulating how my brain recognizes a name is much more lispy ;)).  The problem, I think, is in the implementation.



In common lisp, I would use lots of nested let statements and recursion, but newlisp's docs say that often, dolist and the other iterators are faster than recursion (since they may be using recursion internally) so I end up using do and begin.  I also have a difficult time in newlisp deciding when to use nil params in the function def or let statements.  Something about how I wrote parse-name is completely inefficient :(



I just noticed - in match-pattern, I did not declare 'data anywhere.  It is reassigning the same global symbol over and over.  That couldn't be causing the slowdown, could it?
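Probably not the whole story, but making it local would be a one-line change -- the same comma trick int->roman uses for result.  Something like this (untested):

(define (match-pattern str pattern , p-match data)
  ;; 'data' after the comma is now local to each call; newLISP initializes it to nil
  (set 'data '((first-name nil) (middle-name nil) (last-name nil)))
  (set 'p-match (regex (first pattern) str))
  (if p-match
      (begin
        (dolist (x (last pattern))
          (replace-assoc x data (list x (p-match (* 3 (+ 1 $idx))))))
        data)))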
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

Lutz

#5
Several recommendations to speed up the code and other comments:



instead of:


(if (not (nil? suffix-match)) ...

say:


(if (not suffix-match) ...)
; or even faster
(unless suffix-match ...)


instead of:


(last (assoc 'suffix data))

say:


(lookup 'suffix data)

'lookup' will always pick the last in the association (if not given an index)
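For example, with a record shaped like the 'data' list:

(lookup 'suffix '((first-name "John") (suffix "Jr.")))  ;=> "Jr."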



instead of:


(if (nil? good-match) ...)

try also 'not', which will trigger on 'nil' and (), while 'nil?' means strictly 'nil'. Not sure about the logic here and the impact is probably minimal, but it's better English :-)


(if (not good-match) ...)


Quote: "I also have a difficult time in newlisp deciding when to use nil params in the function def or let statements."


All parameters p1, p2, p3 in (define (foo p1 p2 p3) ...) are set to 'nil' if not filled in by the caller.
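For example:

(define (foo p1 p2 p3) (list p1 p2 p3))
(foo 1)      ;=> (1 nil nil)
(foo 1 2 3)  ;=> (1 2 3)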



If you want the params initialized to 'nil', it is better to use 'local' instead of a 'let' statement (much less overhead/faster):



instead of:
(let (param nil) ....)
do:
(local (param) ...)

Quote: "It is reassigning the same global symbol over and over. That couldn't be causing the slowdown, could it?"


If that global variable always gets the same value, I would set it only once. But which var/statement in match-pattern do you mean?



I am not sure how much the above changes will influence the speed, but in any case they will make the code shorter and more readable.



You seem to compose the name records by doing 'replace-assoc' in the 'data' list:


(set 'data '((first-name nil) (middle-name nil) (last-name nil)))

Would it not be better to compose the whole name record using 'list' statements instead of replace-assoc statements? (perhaps I am not understanding the code well)
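Roughly what I mean -- build the whole record in one go from the captured pieces instead of patching a template (untested sketch; 'make-record' is just an invented name, and the field names are taken from your *patterns* entries):

(define (make-record fname mname lname sfx)
  (list (list 'first-name fname)
        (list 'middle-name mname)
        (list 'last-name lname)
        (list 'suffix sfx)))

(make-record "Some" "P." "Body" nil)
;=> ((first-name "Some") (middle-name "P.") (last-name "Body") (suffix nil))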



To understand the code better, it would be helpful if you could give us a little "sample-names.lsp" with just a few records.  Perhaps then we can better understand the logic of 'parse-name' and 'match-pattern', try out those functions in isolation to see what is going on, and suggest a more efficient algorithm.



Lutz

Jeff

#6
Lutz,



Thanks for the advice, particularly with local and lookup. As far as assigning data at the entrance of match-pattern goes, that function evolved quite a bit and I didn't clean it up properly. In the original form I needed it to exist whether or not the rest of the function proceeded, as a shortcut for regexps that did not capture one of the elements (i.e. no middle name/initial).



The sample names file is just what it sounds like: a list of strings representing names in various formats:


(set 'nm-list '("Some P. Body", "Guy, Some Other", "John Q Programmer", "Adams, Sweet Fanny", "I. P. Freely", "Name That Won't Match", "Name That Matches, Jr."))

The idea is to first normalize them, so that they are all properly capitalized, abbreviations end in periods, and the parts are listed in first name (or initial), middle name (or initial), last name order.  That is what the normalize function does.
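So, for instance, I expect:

(normalize "Guy, Some Other")    ;=> "Some Other Guy"
(normalize "Adams, Sweet Fanny") ;=> "Sweet Fanny Adams"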



The parse-name function then attempts to use a series of regexp patterns to recognize the parts of the name.  I wanted to use regexp so that it could be easily expanded later and because when writing lisp, it's always clever to try and emulate the human mind (although not always efficient ;)).



parse-name normalizes the name, then calls match-pattern on the result, which tries the name against each regexp in *patterns* until one matches, short-circuiting at that point.



My purpose in splitting parse-name and match-pattern was to lower the overhead of any one function, since parse-name would be used in iteration.
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

Jeff

#7
I've gotten it down to matching the speed of python:


#!/usr/bin/env newlisp

(define (int->roman num , result)
  (set 'result "")
  (let ((numerals '((1 "I") (4 "IV") (5 "V") (9 "IX") (10 "X") (40 "XL")
                  (50 "L") (90 "XC") (100 "C") (400 "CD") (500 "D") (900 "CM")
                  (1000 "M"))))
       (dolist (numeral-set (reverse numerals))
               (let ((value (first numeral-set)) (numeral (last numeral-set)))
                    (while (>= num value)
                      (begin
                        (set 'result (join (list result numeral)))
                        (set 'num (- num value))))))) result)

(constant '*patterns*
'(({^\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name last-name))
 ({^\b([\w-']{2,})\b\s\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b(\w\.)\s\b([\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b(\w\.)\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,}\b\s\b[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b(\w\.\s\w\.)\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b([\w-']{2,}\b\s\b[\w-']{2,})\b\s\b(\w\.)\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))
 ({^\b([\w-']{2,})\b\s\b([\w-']{2,}\b\s\b[\w-']{2,})\b\s\b((?:St\. )*[\w-']{2,})\b$}
   (first-name middle-name last-name))))

(constant '*romans-list* (map 'int->roman (sequence 1 25)))
(constant '*romans* (format {\s(?P<suffix>%s)$}
  (join (map (lambda (i)
                     (format "(?:%s)" i))
        *romans-list*) "|")))
(constant '*suffixes* {\b(?P<suffix>[JjSs][Rr]).*\b})
(set '*parse-errors* '())

(define (normalize nm)
  (let ((comma-matches (find-all {,(?!(\s*[JjSs][Rr]\b\.))} nm)))
       (cond ((= (length comma-matches) 2)
              (replace {^\s*(.+?)\s*,\s*(.+?)\s*,\s*(.+?)\s*$} nm
                (format "%s %s %s" $3 $1 $2) 0))
             ((= (length comma-matches) 1)
              (replace {^\s*(.+?)\s*,\s*(.+?)\s*$} nm
                (format "%s %s" $2 $1) 0)))
       (replace "," nm "")
       (replace {(\.)\s*} nm ". " 0)
       (replace {\b(\w)\b[^$.]} nm (format "%s. " (upper-case $1)) 0)
       (replace {\b([\w-']+?)\b} nm (title-case $1 true) 0)
       (replace {\b([JjSs][Rr])(.*)} nm (format "%s." $1) 1)
       (replace *romans* nm (format " %s" (upper-case $1)) 1)
       (set 'nm (trim nm))
       (replace {\s+} nm " " 0)) nm)

(define (match-pattern str pattern name-if-error) ; name-if-error can be orig.
  (if pattern                                     ; if suffix parsed off str
    (map list (last pattern) (rest (filter string?
                                   (regex (first pattern) str))))
    (begin
      (push (or name-if-error str) *parse-errors* -1)
      (list '(first-name nil) '(middle-name nil) '(last-name nil)))))

(define (parse-name str , data original-str)
  (set 'original-str str)
  (set 'str (normalize str))
  (push (list 'suffix (if (regex {\b(?P<suffix>[JjSs][Rr].*)$} str) $1 nil))
        data -1)
  (append (match-pattern (replace {\s+\b(?P<suffix>[JjSs][Rr].*)$} str "" 0)
            (exists (lambda (pattern) (regex (first pattern) str)) *patterns*)
                    original-str) data))

(define (format-name name-list , str)
  (if (string? name-list) (set 'name-list (parse-name name-list)))
  (if (lookup 'first-name name-list)
    (begin
      (set 'str (lookup 'first-name name-list))
      (if (lookup 'middle-name name-list) (write-buffer str
        (format " %s" (lookup 'middle-name name-list))))
      (write-buffer str (format " %s" (lookup 'last-name name-list)))
      (if (lookup 'suffix name-list) (write-buffer str
        (format ", %s" (lookup 'suffix name-list))))) nil))

(define (format-name name-list)
  (join (filter string? (map last
    (if (string? name-list) (parse-name name-list) name-list))) " "))

(load "sample-names.lsp")

(map parse-name nm-list)
(println *parse-errors*)


However, if you run (map format-name nm-list), Python beats the pants off of newLISP.  I also tried format-name as a single (format) expression (with conditionals to avoid extra spaces (and errors) in the case of nil values), but that took *much* longer (about 3/4 of a second over 8000 names on a quad-core G5) than just joining them.
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

rickyboy

#8
Who wins at (map parse-name nm-list)?  Python or newLISP?
(λx. x x) (λx. x x)

Jeff

#9
newLISP by a hair (about 0.1-0.2 seconds over 8000 names on a quad-core G5).
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

rickyboy

#10
So there's something in the rest of format-name which causes "python [to beat] the pants off of newlisp" when running (map format-name nm-list)?



BTW, are you using map in Python or list comprehensions?



--Rick
(λx. x x) (λx. x x)

Jeff

#11
The python code to format a name is slightly more complex:


def format(name_obj):
    if isinstance(name_obj, str):
        name_obj = parse(name_obj)
    formatted = ' '.join([format_first(name_obj), format_last(name_obj)])
    return formatted

def format_first(name_obj):
    if name_obj['middle']:
        return ' '.join([name_obj['first'], name_obj['middle']])
    else:
        return name_obj['first']

def format_last(name_obj):
    if name_obj['suffix']:
        if name_obj['suffix'] in ROMANS_LIST:
            return ' '.join([name_obj['last'], name_obj['suffix']])
        else:
            return ', '.join([name_obj['last'], name_obj['suffix']])
    else:
        return name_obj['last']


I separated out first and last in the python (which is my production version atm) so I could more easily compose records with first and last name sections separated.  In newlisp, though, it seems like there is a proportionally higher function overhead than in python - I've found that python speeds up when you break operations up, but sometimes newlisp can get bogged down if you factor a function too finely.



Here is the code I am comparing against:


def main():
    from sample_names import name_list

    for name in name_list:
        try:
            parse(name)
        except ParseError, e:
            print e


Using a list comprehension, like [parse(name) for name in name_list] actually makes it quite a bit faster, but doesn't allow for graceful exception handling, since a bad parse throws a ParseError.
Jeff

=====

Old programmers don't die. They just parse on...



Artful code: http://artfulcode.net

rickyboy

#12
Quote from: "Jeff"In newlisp, though, it seems like there is a proportionally higher function overhead than in python - I've found that python speeds up when you break operations up, but sometimes newlisp can get bogged down if you factor a function too finely.

Yes, I've noticed this too.  If I could have low lambda overhead and tail call optimization, newLISP might be my favorite language of all time!  :-)



On the original subject, I don't see anything beyond lambda overhead, then, that could be the cause of the problem.  I *was* gonna say that newLISP might be slower because your newLISP calling code was list-building and your Python's wasn't, but you said that the list comprehension call was actually faster.  :-(
(λx. x x) (λx. x x)

Lutz

#13
Quote: "In newlisp, though, it seems like there is a proportionally higher function overhead than in python"


The Fibonacci benchmark here: http://newlisp.org/benchmarks/



which mostly measures call overhead, would suggest the opposite.



To really figure out where the speed differences between two languages are is difficult when using a relatively complex program like Jeff's name parser. One has to get down to very small code snippets and measure their execution time to nail down where the differences are.



Regarding  "tail call optimization":



Tail call optimization is good if you prefer expressing algorithms via recursion rather than iteration. But performance-wise, iteration can be implemented more efficiently than tail-recursion optimization, which is usually implemented using continuations.



The paradox about tail-recursion optimization is that you can only optimize the tail part of the recursion, which is exactly the part that is easy to express or transform as iteration (because it is the last call in the function).



In the end, tail-recursion optimization versus iteration comes down to programmer preference for one method over the other when expressing algorithms.
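A trivial illustration of that equivalence -- the same list-length computation written with a tail call and with 'dolist' (a sketch; without tail-call optimization the recursive version is limited by newLISP's call stack, the iterative one is not):

(define (len-rec lst acc)
  (if (empty? lst)
      acc
      (len-rec (rest lst) (+ acc 1))))  ; tail position: nothing left to do after the call

(define (len-iter lst , acc)
  (set 'acc 0)
  (dolist (x lst) (set 'acc (+ acc 1)))
  acc)

(len-rec '(a b c) 0)  ;=> 3
(len-iter '(a b c))   ;=> 3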



Lutz

rickyboy

#14
Quote from: "Lutz"In the end tail-recursion-optimization versus iteration comes down to a programmer preference of rather using one versus the other method when expressing algorithms.

Absolutely!  This is the reason I prefer tail-recursion over iteration -- it's not an implementation issue for me, it's an expression issue.  The biggest reason is that recursive expressions are *much* easier to reason about, concerning both proofs of program correctness and getting as much of the context of the program loaded into programmer headspace as possible.  These are also the reasons I like using FP HOFs: if I can prove the correctness of 'map' or 'fold', any function that uses 'map' or 'fold' has a relatively easy proof of correctness.  And the expression is concise and clear.



Since newLISP doesn't support tail call optimization, I have coded my HOFs iteratively -- I just have to live with that (*in my opinion*) ugliness.  But I wall the loops off inside well-used HOF definitions; the functions at a higher level in the system that use the HOFs can be devoid of iterative code, thank God.  :-)
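Something like this is what I mean -- the loop lives in one well-tested definition and the callers never see it (untested sketch; 'my-fold' is just a name chosen to avoid clashing with anything built in):

(define (my-fold f init lst , acc)
  (set 'acc init)
  (dolist (x lst)
    (set 'acc (f acc x)))
  acc)

(my-fold + 0 '(1 2 3 4))                         ;=> 10
(my-fold (lambda (a x) (cons x a)) '() '(1 2 3)) ;=> (3 2 1)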
(λx. x x) (λx. x x)