Counting chars in a string

Started by Jeff, August 14, 2007, 01:55:35 PM

Previous topic - Next topic

Jeff

I've got a program I wrote in Python/C that aggregates counts of chars in a string, but I thought I'd take a crack at it in newLisp as well.  I want to end up with a list (or array, hash, whatever) of (character (count of character)) for a string.  The string I'm testing with is War and Peace (here's a free copy - http://artfulcode.nfshost.com/charcounter/war_and_peace.txt">//http://artfulcode.nfshost.com/charcounter/war_and_peace.txt).  The function I'm using is:


(define (count-chars-in-string str)
  (let ((seq (sequence 0 127)))
       (map (fn (i) (cons (char i) (length (find-all (format "\x%x" i) str))))
            seq)))


I've tried it by iterating through the string (which is the tack I took in C, since a string is really an int array), incrementing an 128-member array with the char's int cast as the index.  That takes minutes in newLisp.  In C, I'm doing this in under a tenth of a second.  I should at least be able to get it under half a second.  The current implementation, above, gives me about 3.5 seconds, which is still ridiculously slow.



I also tried using read-char and read-buffer, but neither gave me the results I wanted.  Anyone have any ideas?



By the way, here is a url of the type of results I am planning on ending up with:  http://artfulcode.nfshost.com/charcounter">//http://artfulcode.nfshost.com/charcounter.  This uses Python with a C module that does the counting.  The python only pushes the string to the module.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#1
This solution is slightly faster, and may scale better too. The counts are in a hash Lex -> Lex:_a Lex:_b etc.



(define (count-chars-in-string str)
(bayes-train (explode str) 'Lex)
(map list (map name (symbols 'Lex)) (map eval (symbols 'Lex)))
)


normally 'bayes-train is used on several lists at the same time, putting all counts found in a list, but it also can be used counting in just one list as shown here.



Lutz

Jeff

#2
Lutz,



That's a helpful use of the function to know about, but that still doesn't give me a real performance boost.  Exploding the string is too expensive.



Is there no way to iterate quickly over the chars in a string?  What is newLisp's internal representation of a string like?  If there isn't a way to do so, could a dostring function be added?



Also, I tried to import my C library I used with Python to do the work for me and had some real difficulty.  Here is the C lib:


#include <stdio>
#include <stdlib>
#include <string>
#define CHARS 128

typedef struct tally
{
char *text;
int counts[CHARS];
} TALLY;

TALLY init_counts(char *str);
void count_chars(TALLY *t);
int count_of_char(TALLY *t, char c);

TALLY init_count(char *str)
{
int i;
TALLY t;
t.text = str;
for (i = 0; i < CHARS; i++)
{
t.counts[i] = 0;
}
count_chars(&t);
return t;
}

void count_chars(TALLY *t)
{
int i;
for (i = 0; i <strlen>text); i++)
{
t->counts[(int) t->text[i]]++;
}
}

int count_of_char(TALLY *t, char c)
{
return t->counts[(int) c];
}


After importing init_counts, I run (set 'foo (init_counts very-large-string)).  This does not seem to work at all.  I get nothing back and the action hangs.  I also tried a C function that only takes a string and prints the first 100 chars, just to test the newLisp code, and it seems like passing a string to a library function takes an inordinate amount of time (which is why I was wondering about newLisp's internal string representation).



Any ideas?  I would love to get rid of python's overhead in my app by using newLisp and additionally have the ability to run the software in an easy-to-create gui, but 3.5 seconds is way too expensive here.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#3
On this page you find a small C program for the character counter and how to compile and import it into newLISP:



http://newlisp.org/index.cgi?CountCharacters">http://newlisp.org/index.cgi?CountCharacters



I tried to post it here but the formatting got all mixed up.



Lutz

jrh

#4
Quote from: "Jeff"Is there no way to iterate quickly over the chars in a string?  What is newLisp's internal representation of a string like?


Strings and text appear as contiguous areas of memory in newLISP.  Fastest iteration I've found while staying in newLISP is to index into them from the base address:



(time (setq buf (dup "00" 3000000)))

(time (for (x 0 (- (length buf) 1)) (get-char (+ (address buf) x))))


Ryon

#5
QuoteI tried to post it here but the formatting got all mixed up.


Does this look right? Some wrapping, but readable?



#include <stdio.h>
#include <memory.h>

int cnts[128];

int * count_characters(char * str, int length)
   {
   int i;

   memset((void *)cnts, 0, 128 * sizeof(int));

   for(i = 0; i < length; i++)
       cnts[str[i]] += 1;

   return(cnts);
   }


compile it to a shared library


on Mac OS X:
   gcc count-characters.c -bundle -o count-characters.so

on other UNIX
   gcc count-characters.c -shared -o count-characters.so


now use it:

newLISP v.9.2.0 on OSX UTF-8, execute 'newlisp -h' for more info.

> (define count-chars (import "count-characters.so" "count_characters"))
count_characters <8CEF0>

> (unpack (dup "lu" 128) (count-chars (read-file "war-and-piece.txt") 3217389))
(0 0 0 0 0 0 0 0 0 0 67418 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 511976 3938 17974 2 1 3 0 7526 640 640 339 1 39921 5850 30686
 26 230 348 163 55 25 49 51 38 169 54 1003 1148 1 2 1 3138 2 6204
 3638 1783 2021 1867 1908 1247 4016 7404 320 1191 687 3288 3627 1638
 6117 35 2704 2974 6464 278 939 2896 348 1265 108 47 0 47 0 1 0 199012
 30984 59225 116122 312451 52818 49877 162871 166004 2214 19194 95740
 58261 180228 190868 38854 2300 144967 159746 219083 65006 25940
 56197 3719 44945 2282 0 0 0 1 0)
>

does it on my MacMini/PPC in 0.129 seconds.

Lutz
\"Give me a Kaypro 64 and a dial tone, and I can do anything!\"

Jeff

#6
Thank you all for your help.



Jrh:  I tried that, too.  I just couldn't get the speed I needed.  Especially when on each iteration I needed to increment the value at an array index.  With a quad-core g5 (2 * 2.5 GHz), it still takes around 3.5 seconds :(



Lutz:  Thanks!  That is what I was looking for.  I was still using the struct as a way of housing the array so that it could be returned from the function (as you can see, I'm not hugely experienced with C; additionally, the lib was originally built to work with the Python API so that it would have a more object-like return value that could be extended as needed).



I was only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct.  When I tried to return a locally created struct itself, the interpreter died with no messages.  Any idea what I did wrong there?
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#7
Quotewas only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct.


Yes, that is the only way to do it. All return values from C library imports are 32 bit, so for bigger data objects a pointer has to be returned.



Thanks Ryon for re-posting the code here in readable form.



Lutz

Jeff

#8
Thanks again for all the help.  I just added a version of the character counter form on my website using newLisp as a cgi.



The host only supports cgi (since the virtual host doesn't actually exist until it gets a request).  That means that with each request I have the overhead of launching the interpreter.



The newlisp version, while it takes about the same amount of time to process the text itself, takes more than a half of a second less to actually return the data.



Here is the newLisp version: http://artfulcode.nfshost.com/lex/">//http://artfulcode.nfshost.com/lex/



(don't bookmark that url, it will disappear)
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Jeff

#9
PPS: 'sort is excellent.  This was the first time I ever implemented a sort using s-expressions rather than imperative code, and it is much, much cleaner.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Jeff

#10
Yay!  I figured out a way to iterate quickly over a string.  It's not nearly as fast as C, but it seems to out-perform the other ways so far.  First, I get the address of the string as a pointer.  I can get the char at that address using get-char.  Then I can get the next char using (get-char (+ 1 (address string))).  Here is a function to count chars moderately quickly (beats bayes-train because it doesn't explode the string):


(define (count-chars str , ptr chars c)
  (set 'chars (array 128 (dup 0 128)))
  (set 'ptr (address str))
  (for (i 0 (length str))
       (set 'c (get-char (+ ptr i)))
       (nth-set (chars c) (+ 1 (chars c))))
  chars)

(set 'str (read-file "/path/to/war_and_peace.txt"))
(set 'chars (count-chars str))
(for (i 0 127)
     (println (char i) ": " (chars i)))


One question, though... since inc expects a symbol, is there a way to use inc on an array index?  I.e., (inc (arr x y))?
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#11
The use of an arrray for counting is an important step, even more so when using a few hundred or thousand elements in the counting array.



There is another 2 modifications to get it another 20% faster using 'unpack' to create a vector of ASCII numbers from the string, and using a feature of 'nth-set', which gives the old contents in $0



(define (count-chars str , nums chars c)
  (set 'chars (array 128 (dup 0 128)))
  (set 'nums (unpack (dup "c" (length str)) str))
  (dolist (c nums)
       (nth-set (chars c) (+ $0 1)))
  chars)


Lutz

Jeff

#12
I get another 18% beyond that using only unsigned integers ("b" instead of "c").  Thanks!  This is getting pretty fun :)
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code

Lutz

#13
The rewritten 'count in 9.2.1 is about 3-4 times faster than before when counting 100 in 1,000,000 elements. On the 'war_and_peace.txt' character count about 30% faster on a Core Duo 2 1.8Ghz when testing against counting with 'unpack and char-indexed 'array, and of course a lot easier to code.


(set 'str (read-file "war_and_peace.txt"))
; old way on 9.2.0
(set 'cnts (array 128 (dup 0 128)))
(dolist (i (unpack (dup "b" (length str)) str)))
(nth-set (cnts i) (+ $0 1)))

-> 1270 ms

; using count on 9.2.1
(count (sequence 0 127) (unpack (dup "b" (length str)) str))

-> 812 ms


Lutz

Jeff

#14
Wow!  Where can I get some of that?!  Thanks a bunch, Lutz.  It's nice to work in a language where the author takes a personal interest in how the language is used.
Jeff

=====

Old programmers don\'t die. They just parse on...



http://artfulcode.net\">Artful code