I've got a program I wrote in Python/C that aggregates counts of chars in a string, but I thought I'd take a crack at it in newLisp as well. I want to end up with a list (or array, hash, whatever) of (character (count of character)) for a string. The string I'm testing with is War and Peace (here's a free copy - //http://artfulcode.nfshost.com/charcounter/war_and_peace.txt). The function I'm using is:
(define (count-chars-in-string str)
(let ((seq (sequence 0 127)))
(map (fn (i) (cons (char i) (length (find-all (format "\x%x" i) str))))
seq)))
I've tried it by iterating through the string (which is the tack I took in C, since a string is really an int array), incrementing an 128-member array with the char's int cast as the index. That takes minutes in newLisp. In C, I'm doing this in under a tenth of a second. I should at least be able to get it under half a second. The current implementation, above, gives me about 3.5 seconds, which is still ridiculously slow.
I also tried using read-char and read-buffer, but neither gave me the results I wanted. Anyone have any ideas?
By the way, here is a url of the type of results I am planning on ending up with: //http://artfulcode.nfshost.com/charcounter. This uses Python with a C module that does the counting. The python only pushes the string to the module.
This solution is slightly faster, and may scale better too. The counts are in a hash Lex -> Lex:_a Lex:_b etc.
(define (count-chars-in-string str)
(bayes-train (explode str) 'Lex)
(map list (map name (symbols 'Lex)) (map eval (symbols 'Lex)))
)
normally 'bayes-train is used on several lists at the same time, putting all counts found in a list, but it also can be used counting in just one list as shown here.
Lutz
Lutz,
That's a helpful use of the function to know about, but that still doesn't give me a real performance boost. Exploding the string is too expensive.
Is there no way to iterate quickly over the chars in a string? What is newLisp's internal representation of a string like? If there isn't a way to do so, could a dostring function be added?
Also, I tried to import my C library I used with Python to do the work for me and had some real difficulty. Here is the C lib:
#include <stdio>
#include <stdlib>
#include <string>
#define CHARS 128
typedef struct tally
{
char *text;
int counts[CHARS];
} TALLY;
TALLY init_counts(char *str);
void count_chars(TALLY *t);
int count_of_char(TALLY *t, char c);
TALLY init_count(char *str)
{
int i;
TALLY t;
t.text = str;
for (i = 0; i < CHARS; i++)
{
t.counts[i] = 0;
}
count_chars(&t);
return t;
}
void count_chars(TALLY *t)
{
int i;
for (i = 0; i <strlen>text); i++)
{
t->counts[(int) t->text[i]]++;
}
}
int count_of_char(TALLY *t, char c)
{
return t->counts[(int) c];
}
After importing init_counts, I run (set 'foo (init_counts very-large-string)). This does not seem to work at all. I get nothing back and the action hangs. I also tried a C function that only takes a string and prints the first 100 chars, just to test the newLisp code, and it seems like passing a string to a library function takes an inordinate amount of time (which is why I was wondering about newLisp's internal string representation).
Any ideas? I would love to get rid of python's overhead in my app by using newLisp and additionally have the ability to run the software in an easy-to-create gui, but 3.5 seconds is way too expensive here.
On this page you find a small C program for the character counter and how to compile and import it into newLISP:
http://newlisp.org/index.cgi?CountCharacters
I tried to post it here but the formatting got all mixed up.
Lutz
Quote from: "Jeff"
Is there no way to iterate quickly over the chars in a string? What is newLisp's internal representation of a string like?
Strings and text appear as contiguous areas of memory in newLISP. Fastest iteration I've found while staying in newLISP is to index into them from the base address:
(time (setq buf (dup " 00" 3000000)))
(time (for (x 0 (- (length buf) 1)) (get-char (+ (address buf) x))))
Quote
I tried to post it here but the formatting got all mixed up.
Does this look right? Some wrapping, but readable?
#include <stdio.h>
#include <memory.h>
int cnts[128];
int * count_characters(char * str, int length)
{
int i;
memset((void *)cnts, 0, 128 * sizeof(int));
for(i = 0; i < length; i++)
cnts[str[i]] += 1;
return(cnts);
}
compile it to a shared library
on Mac OS X:
gcc count-characters.c -bundle -o count-characters.so
on other UNIX
gcc count-characters.c -shared -o count-characters.so
now use it:
newLISP v.9.2.0 on OSX UTF-8, execute 'newlisp -h' for more info.
> (define count-chars (import "count-characters.so" "count_characters"))
count_characters <8CEF0>
> (unpack (dup "lu" 128) (count-chars (read-file "war-and-piece.txt") 3217389))
(0 0 0 0 0 0 0 0 0 0 67418 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 511976 3938 17974 2 1 3 0 7526 640 640 339 1 39921 5850 30686
26 230 348 163 55 25 49 51 38 169 54 1003 1148 1 2 1 3138 2 6204
3638 1783 2021 1867 1908 1247 4016 7404 320 1191 687 3288 3627 1638
6117 35 2704 2974 6464 278 939 2896 348 1265 108 47 0 47 0 1 0 199012
30984 59225 116122 312451 52818 49877 162871 166004 2214 19194 95740
58261 180228 190868 38854 2300 144967 159746 219083 65006 25940
56197 3719 44945 2282 0 0 0 1 0)
>
does it on my MacMini/PPC in 0.129 seconds.
Lutz
Thank you all for your help.
Jrh: I tried that, too. I just couldn't get the speed I needed. Especially when on each iteration I needed to increment the value at an array index. With a quad-core g5 (2 * 2.5 GHz), it still takes around 3.5 seconds :(
Lutz: Thanks! That is what I was looking for. I was still using the struct as a way of housing the array so that it could be returned from the function (as you can see, I'm not hugely experienced with C; additionally, the lib was originally built to work with the Python API so that it would have a more object-like return value that could be extended as needed).
I was only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct. When I tried to return a locally created struct itself, the interpreter died with no messages. Any idea what I did wrong there?
Quote
was only ever able to return a struct to newLisp successfully when I only returned a pointer to a global instance of the struct.
Yes, that is the only way to do it. All return values from C library imports are 32 bit, so for bigger data objects a pointer has to be returned.
Thanks Ryon for re-posting the code here in readable form.
Lutz
Thanks again for all the help. I just added a version of the character counter form on my website using newLisp as a cgi.
The host only supports cgi (since the virtual host doesn't actually exist until it gets a request). That means that with each request I have the overhead of launching the interpreter.
The newlisp version, while it takes about the same amount of time to process the text itself, takes more than a half of a second less to actually return the data.
Here is the newLisp version: //http://artfulcode.nfshost.com/lex/
(don't bookmark that url, it will disappear)
PPS: 'sort is excellent. This was the first time I ever implemented a sort using s-expressions rather than imperative code, and it is much, much cleaner.
Yay! I figured out a way to iterate quickly over a string. It's not nearly as fast as C, but it seems to out-perform the other ways so far. First, I get the address of the string as a pointer. I can get the char at that address using get-char. Then I can get the next char using (get-char (+ 1 (address string))). Here is a function to count chars moderately quickly (beats bayes-train because it doesn't explode the string):
(define (count-chars str , ptr chars c)
(set 'chars (array 128 (dup 0 128)))
(set 'ptr (address str))
(for (i 0 (length str))
(set 'c (get-char (+ ptr i)))
(nth-set (chars c) (+ 1 (chars c))))
chars)
(set 'str (read-file "/path/to/war_and_peace.txt"))
(set 'chars (count-chars str))
(for (i 0 127)
(println (char i) ": " (chars i)))
One question, though... since inc expects a symbol, is there a way to use inc on an array index? I.e., (inc (arr x y))?
The use of an arrray for counting is an important step, even more so when using a few hundred or thousand elements in the counting array.
There is another 2 modifications to get it another 20% faster using 'unpack' to create a vector of ASCII numbers from the string, and using a feature of 'nth-set', which gives the old contents in $0
(define (count-chars str , nums chars c)
(set 'chars (array 128 (dup 0 128)))
(set 'nums (unpack (dup "c" (length str)) str))
(dolist (c nums)
(nth-set (chars c) (+ $0 1)))
chars)
Lutz
I get another 18% beyond that using only unsigned integers ("b" instead of "c"). Thanks! This is getting pretty fun :)
The rewritten 'count in 9.2.1 is about 3-4 times faster than before when counting 100 in 1,000,000 elements. On the 'war_and_peace.txt' character count about 30% faster on a Core Duo 2 1.8Ghz when testing against counting with 'unpack and char-indexed 'array, and of course a lot easier to code.
(set 'str (read-file "war_and_peace.txt"))
; old way on 9.2.0
(set 'cnts (array 128 (dup 0 128)))
(dolist (i (unpack (dup "b" (length str)) str)))
(nth-set (cnts i) (+ $0 1)))
-> 1270 ms
; using count on 9.2.1
(count (sequence 0 127) (unpack (dup "b" (length str)) str))
-> 812 ms
Lutz
Wow! Where can I get some of that?! Thanks a bunch, Lutz. It's nice to work in a language where the author takes a personal interest in how the language is used.