I tried to implement the threadring task from the alioth language benchmark (//http), and found my code very slow. It's my first attempt at "multithreading", so I hope someone can help me make it faster.
Here is my code:
(set 'nThreads 503
     'limit 500000
     'channels '())
; the worker function
(define (worker id chanR chanW , r)
  (catch
    (while (set 'r (int (read-line chanR)))
      (if (= r limit)
          (throw id)
          (write-line chanW (string (inc r)))))))
; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))
; spawn workers
(dotimes (i nThreads)
  (spawn 'p (worker (+ i 1)
                    ((channels i) 0)
                    ((channels (% (+ 1 i) nThreads)) 1))))
; start the process by giving the token to the first thread
(write-line ((channels 0) 1) "0")
; wait for result
(while (not p) (sleep 100) (sync 1))
(println "DONE BY " p)
(abort)
(exit)
It's slower than all the other implementations; newLISP can do better.
Thanks for your help.
Renaud.
Threads in newLISP are really separate processes, not threads in the same process space. That is the reason they are slower. On the upside, when running on modern multicore CPUs, newLISP 'spawn' or 'fork' will take advantage of multiple cores. Switching processes is slower than switching threads, but processes scale better because they can be distributed over multiple cores.
I believe that in the modern world of multicore CPUs, multi-process is the way to go, because in the end it scales better. Performance of process switching varies widely between platforms, so you will see very different results from one platform to the next. You will also see CPUs getting better at this in the future, as multiprocessing becomes more important.
ps: I would not loop on p, because p will be true each time a process finishes. Rather, switch on 'sync', which will be true when all processes have finished, e.g. (until (sync long-time)), or (until (sync short-time) (do-something)).
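For example, the tail of the program could wait like this (a minimal sketch; it assumes every child eventually terminates, and the wait times are placeholders to tune):

(until (sync 100)  ; true once all spawned children have finished
  (sleep 100))     ; or do some other useful work between polls
(println "DONE BY " p)
(exit)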
With a few changes the following takes about 490 milliseconds on a Mac Mini with a 1.83 GHz Intel Core Duo. With 'limit' set to 10000, that is about 49 microseconds per pipe message from one process to the next in the ring. With nThreads set to 100, the ring gets circled 100 times.
On my ISP's FreeBSD machine I could do 14 microseconds per pipe message, but I don't know that machine's clock speed or number of cores.
#!/usr/bin/newlisp
; processes pass and increment a number token in a ring
; communication is via pipes
; to see how it all works, enable all the println's
; and set nThreads to 5 and limit to 10
(set 'nThreads 100  ; these are actually processes
     'limit 10000   ; must be a multiple of nThreads
     'channels '())
; worker
(define (worker id chanR chanW , r)
  (catch
    (while (set 'r (int (read-line chanR)))
      ;(println "id:" id " read:" r " write:" (+ r 1))
      (write-line chanW (string (inc r)))
      (when (>= r (+ (- limit nThreads) id))
        (throw (string "id:" id " finished with r:" (- r 1)))))))
; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))
;(println "channels:" channels)
; spawn workers
(dotimes (i nThreads)
  (let (id  (+ i 1)
        in  (channels i 0)
        out (channels (% (+ 1 i) nThreads) 1))
    (spawn 'p (worker id in out))
    ;(println "started id:" id " in:" in " out:" out)
  ))
; start the process by sending the token to the first process
(set 'start (time-of-day))
(write-line (channels 0 1) "0")
(until (sync 10)) ; wait for all to finish
(println p)
(println "duration: " (- (time-of-day) start) "ms")
(exit)
gives this output:
~> ./process-ring-spawn
id:100 finished with r:9999
duration: 490ms
~>
You could do the same thing with 'fork' and use slightly less memory, but you would not have the luxury of getting return values, and you would need to register the process pids and use 'wait-pid' on them to reap the zombie processes after exiting.
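A minimal sketch of that fork variant, reusing the worker and channels from above ('pids' is a name introduced here):

(set 'pids '())
(dotimes (i nThreads)
  (push (fork (worker (+ i 1)
                      (channels i 0)
                      (channels (% (+ i 1) nThreads) 1))) pids -1))
(write-line (channels 0 1) "0")
; 'fork' returns the child pid; reap each one to avoid zombies
(dolist (pid pids) (wait-pid pid))
(exit)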
ps: see also here for some other benchmarks: http://www.newlisp.org/benchmarks/
Thanks a lot Lutz.
I spent some time rewriting my code with your idea (not waiting for the result, and letting the threads end themselves), while still conforming to the benchmark specification, and the best I can get is:
#!/usr/bin/newlisp
(set 'nThreads 503
     'limit 500000
     'channels '())
; The worker code
(define (worker id chanR chanW , r)
  (while (< r limit)
    (write-line chanW (string (+ 1 (set 'r (int (read-line chanR)))))))
  (when (= r limit) (println "Done by id:" id)))
; make communication channels
(dotimes (i nThreads) (push (pipe) channels))
; spawn workers
(dotimes (i nThreads)
  (let (id  (+ i 1)
        in  (channels i 0)
        out (channels (% (+ 1 i) nThreads) 1))
    (spawn 'p (worker id in out))))
(set 'start (time-of-day))
; put the token in the ring
(write-line ((channels 0) 1) "0")
; wait for threads end
(until (sync 10))
(println "Duration: " (- (time-of-day) start) "ms")
(exit)
That's a bit better (about 15-20% faster), but still a lot slower than the slowest code from the Alioth benchmark, in which I think some languages use real processes too. I wonder if the difference is not more in pipe reading/writing than in process switching.
BTW, that's fast enough for me, and even if one always wants improvements, I think that overall newLISP is really a brilliant piece of software.
Interesting demo, Lutz.
$ newlisp process-ring-spawn.lsp
id:100 finished with r:9999
duration: 121ms
on iMac...
I ran the GHC benchmark on my box, then the newLISP code you pasted here.
The GHC benchmark is the fastest one on Alioth. Yet, newLISP came out ahead:
18 seconds for newLISP, 25 seconds for Haskell (./threadring.ghc_run 50000000)
How is your benchmark different? I notice that you specify 500 thousand for "limit", while you are passing in 50 million for the Haskell benchmark.
I don't know Haskell that well, and you are using some different variable names, which makes it hard to map the code and verify that it is doing the same thing.
Quote from: "TedWalther"
I notice that you specify 500 thousand for "limit", while you are passing in 50 million for the Haskell benchmark.
Eh ... what's a couple of orders of magnitude, between friends? ;)
Looking at the code and the definition of the benchmark, I notice some potential problems.
1) send and receive don't have blocking versions; it is all non-blocking. So you have to use a spinlock to wait for your messages, instead of doing a "poll" or "select" to wake up when notified.
2) send, receive, and share don't work between peers. All traffic must be routed through the controlling parent.
Point number two actually prevents us from implementing the benchmark. It is required for the benchmark that each thread be linked to the next in a ring and communicate directly with it.
I don't know how send and receive are implemented, but I'm sure that on Unix, the shared memory API allows any process to sign in and start accessing a segment of shared memory; the shm_open man page doesn't mention any such restriction. I'm not sure if mmap can be used multiple times on the same file, but that may also be a good way to share between peers.
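The spinlock idiom I have in mind looks something like this (just a sketch; 'child-pid' and 'reply' are placeholder names, and as I understand it 'send'/'receive' return nil until the message goes through):

(until (send child-pid "token"))  ; spin until the message is queued
(until (receive child-pid reply)) ; spin until an answer arrives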
newLISP has 'send' and 'receive' for communication between the parent and 'spawn'ed children; child-to-child communication is currently managed with the parent functioning as a proxy. The interface can be used blocking - nothing wrong with a spin-lock interface, if it's short ;) - and non-blocking. The API is based on mmap(), can handle binary info (zeros) and unlimited sizes (using files). The API also does not need semaphores.
Although I think the current 'send'/'receive' API is nice from a programmer's perspective, unfortunately it is also slow on some platforms.
'send' and 'receive' are being reworked for a 10.3.4 development version without changing the API, but to gain better speed. Until then, pipes or the mmap()-based 'share' together with 'semaphore' are the most efficient ways to do inter-process communication in newLISP, but they are also more complex to use. See also the example file prodcons.lsp in newlisp-x.x.x/examples.
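As a minimal sketch of the 'share'-plus-'semaphore' style (this is not prodcons.lsp; the names mem, sid and pid are made up here), a forked child passes one value back to the parent:

(set 'mem (share))       ; mmap()-based shared memory page
(set 'sid (semaphore))   ; semaphore, initial value 0
(share mem 0)
(set 'pid (fork
  (begin
    (share mem 42)         ; child writes the payload
    (semaphore sid 1))))   ; and signals the parent
(semaphore sid -1)       ; parent blocks until the child signals
(println "got: " (share mem))
(wait-pid pid)           ; reap the child
(semaphore sid 0)        ; release the semaphore
(exit)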
I see that Ormente was using pipes to communicate between child processes, without going through the parent. He was using a message-passing style rather than a shared-memory style of communication.
read and write do block, so the pipe method is good; it avoids the busy-wait loops that suck up CPU.
I wonder if his code is slow because he is converting things to and from strings all the time. Perhaps if I set up a "share" in the parent process, the pipes would only have to transmit a signal that means "hey, you child, you have exclusive access to the share right now..."
When a parent sets up a shared memory segment, all children spawned from that point on can access it, right?
I'm just guessing at what the slowness is. I have a hard time imagining that the pipes can be slow; surely they can't be slower than IPv4 or Unix domain sockets?
Lutz, how do I use send/receive blocking? I mean truly blocking, not the (until (send)) (until (receive)) idiom. It is beautifully short and easy to code, but I fear it sucks up a lot of CPU, and if you have 500 threads all doing this, that is major CPU wastage. In the benchmark at hand, that is exactly what happens: since it is a token-passing "ring" structure, it really is a serial structure, but using threads. Perhaps it was designed to test a worst-case scenario for threading.
Either you have interrupts for event-driven I/O, or you have a polling loop somewhere looking for data to arrive and causing the block. Any blocking scheme has this polling loop somewhere; it may be implemented on a lower level and not be visible, but it's always there.
The faster your potential message turnaround, the tighter your polling loop and the higher the CPU usage. The current implementation introduces a wait internally, tuned to the Mac OS X platform to avoid mutual locks. Lifting the loop into user space lets you fine-tune your polling mechanism with 'sleep's and do something else inside the loop too, if you wish. Without user adjustment of the loop with 'sleep's, there is only a few percent of CPU usage per child, depending on the number of children started.
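Lifted into user space, such a polling loop might look like this (a sketch; 'child' holds a child pid, 'msg' receives the message, and do-other-work is a hypothetical task):

(until (receive child msg) ; non-blocking check for a message from 'child'
  (sleep 50)               ; trade message latency for CPU usage
  (do-other-work))         ; hypothetical: anything else worth doing between polls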
When a parent spawns a child, it sets up two shared memory segments, one for each direction of communication, and passes the addresses to the child. The parent keeps a list of all children started and the memory segments used.
OK, I got this benchmark down to 19 minutes. For reference, the Haskell benchmark took 2.8 times as long on my machine as it did on the Debian benchmark machine. So, dividing my time by 2.8, I get 405 seconds, or almost 7 minutes. This is at least in the ballpark now. Here is the code. It was tough to get it down this fast:
#!/usr/bin/newlisp
(setq
  nThreads 503
  nTransactions 50000000
  channels (list)
  start-time (time-of-day)
  token "a" ltoken (length token))
(define (pass-token) (write write-to-pipe token ltoken))
(define (recv-token)
  (if (and (= (length token) (read read-from-pipe c ltoken))
           (= c token))
      true nil))
(define (worker id read-from-pipe write-to-pipe n)
  (catch
    (while (recv-token)
      (-- n nThreads)
      (if (< 0 n)
        (pass-token)
        (begin
          (println (format "Done by id %d\nSeconds: %.2f"
                           id (div (- (time-of-day) start-time) 1000)))
          (throw nil)))
      (when (< n nThreads) (throw nil))))
  (write write-to-pipe "b" 1))
;; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))
;; spawn workers
(dotimes (i nThreads)
  (fork
    (worker (+ 1 i) (channels i 0) (channels (% (+ 1 i) nThreads) 1)
            (+ nTransactions nThreads (- i) -1))))
;; put the token in the ring
(setq write-to-pipe (channels 0 1))
(pass-token)
(dotimes (i nThreads) (wait-pid -1 0))
(exit)
I tried using spawn and sync, but was getting segfaults. Also, I noticed sync was eating up almost 90% of the CPU time in the parent.
I'm looking at the sweet timings for Go and Haskell; even SBCL Lisp was able to do it in under 5 minutes.
I have heard from Linux and BSD kernel developers that the pthreads API is the right way to go; the library itself is responsible for forking enough times that the kernel can schedule threads on appropriate CPUs. I remember that is why Linus developed the "clone" API in the Linux kernel. And in OpenBSD, the rthreads implementation of pthreads is supposed to do this too.
If there is any way to speed this up, please show us here! I did my best.
For reference, here is the spawn-based code that was segfaulting. It segfaults once the number of threads goes above 339. Lutz is right, the API is sweet; I think the code done this way is smaller and sweeter. I don't know why it segfaults, though. And it is slightly slower, about 20% slower.
I am pasting it here in running form (with nThreads set to 339). To see the segfault, just bump up the nThreads value.
#!/usr/bin/newlisp
(setq
  ; nThreads 503
  nThreads 339
  nTransactions 50000000
  channels (list)
  start-time (time-of-day))
(define (write-int fd i) (write fd (pack "ld" i) 4))
(define (read-int fd) (read fd buff 4) (first (unpack "ld" buff)))
(define (worker id read-from-pipe write-to-pipe)
  (until (zero? (setq r (- (read-int read-from-pipe) 1)))
    (write-int write-to-pipe r))
  id)
;; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))
;; spawn workers
(dotimes (i nThreads)
  (spawn 'throwaway
         (worker (+ 1 i) (channels i 0) (channels (% (+ 1 i) nThreads) 1))))
;; put the token in the ring
(write-int (channels 0 1) nTransactions)
(until (integer? throwaway) (sync 1000))
(println "Done by id " throwaway)
(println (format "Seconds: %.2f" (div (- (time-of-day) start-time) 1000)))
(abort)
(exit)
Observing the results here: http://shootout.alioth.debian.org/u64q/performance.php?test=threadring
I see that the pthread libraries, on Linux at least, DO handle distribution over multiple CPUs, because pthreads on Linux are based on the underlying clone call. That is why the C implementation of the benchmark is able to do this in 208 seconds, even though it uses pthreads.
The way you can tell is that the "elapsed" seconds are less than the "CPU seconds". To use more CPU seconds than elapsed seconds, you have to be using more than one CPU.
If it is any consolation, at least we are beating out Ruby, Perl, and C# Mono, although this isn't a problem with Mono itself; F# Mono is as good as Erlang.
Are these times comparing interpreted and compiled languages? Seems odd to allow compiled languages to do their homework before being asked to run their code, whereas the interpreted languages have to respond to the problem without prior notice... :)
Quote
Seems odd to allow compiled languages to do their homework before being asked to run their code, whereas the interpreted languages have to respond to the problem without prior notice... :)
...best explanation of compilation versus interpretation, I have ever seen :)
To Ted: what version of newLISP did you use? I am surprised by the good results; I thought that spawn'ed (fork'ed internally) processes were more heavyweight than pthreads. The advantage of using different processes instead of threads (running in the same process space) is that multiple processes can run on multiple cores. But it seems that pthreads are really forked processes on Linux?
Lutz, I used your latest release, 10.3.4, under Ubuntu. I installed from dpkg.reactor-core.org.
Actually, I don't have a way to compare spawn with pthreads. The C implementation on the shootout site uses pthreads, and it is more than twice as fast as the newLISP implementation using fork.
Have you found the same segfault behavior in the spawn version of my code? Any idea why the parent process is burning so much CPU inside the (sync) function?
Yes, the compiled and interpreted languages are all tested together. I like seeing how they stack up against each other. Now that we have byte code, virtual machines, and on-the-fly compilation into semantic trees, the line between compiled and interpreted languages is blurry anyway. It doesn't make much sense to quibble about it.
You can control the time spent in the parent process by changing the proportion of time in the sync statement and a sleep time in the until loop:
(until (integer? throwaway)
  (sync 20)   ; 20ms in internal sync loop
  (sleep 400) ; 400ms giving up time for child processes
)
As you are only waiting for one of the child processes to finish and set the 'throwaway' variable, the parent process can spend most of its time sleeping and giving up time to the child processes.
The crash in development version 10.3.4 happens when 'spawn' hits a resource limit. This will be addressed in development version 10.3.5.
Using (sync 100)(sleep 400), I found that the fork version is 7% faster than the spawn version. I did another run of the fork version, this time with 339 threads instead of 503, and found that it was only 3% faster. There must have been some system load slowing it down.
ted@ted-desktop:~/local/src$ time ./threadring4.lsp
Done by id 180
Seconds: 107.50
real 1m47.509s
user 0m47.800s
sys 0m55.270s
ted@ted-desktop:~/local/src$ time ./threadring3.lsp
Done by id 89
Seconds: 115.45
real 1m55.778s
user 0m55.370s
sys 1m22.130s
That crash which happened to you in 'spawn' in development version 10.3.4 is now fixed in pre-10.3.5:
http://www.newlisp.org/downloads/development/inprogress
I'll test it now. Another error is in the time-of-day code. I save the time-of-day at the start, then subtract at the end and divide by 1000 to get the number of seconds elapsed. This mostly works, but sometimes, in both the spawn and the fork versions, I get a large negative number. Any ideas?
ted@ted-desktop:~/local/src/newlisp-10.3.5-inprogress$ ./threadring3.lsp
ERR: cannot open socket pair in function spawn
I bumped the number of threads back down to 339, and the error went away. I therefore think that the error message is wrong, possibly because of an off-by-one error somewhere. It isn't segfaulting, but how come I can't spawn 503 threads, but I can fork 503 times?
In addition to the pipes created by the threadring program, 'spawn' creates a socket pair (based on Unix local domain sockets) for each process spawned, to be used by the message API with 'send' and 'receive'.
I will make the creation of these sockets optional in the 'spawn' call. This will allow applications to not use the message API, or to use it only for selected spawned children:
(spawn <sym-variable> <child-process> [true])
The optional 'true' parameter would cause the socket pair to be created; without it, no socket pair would be created for that child.
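Under this proposal, code could opt in per child (a sketch of the suggested option, not final syntax):

(spawn 'p (worker id in out))      ; no socket pair; message API not usable
(spawn 'q (worker id in out) true) ; socket pair created for 'send'/'receive'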
See http://www.newlisp.org/downloads/development/inprogress for these changes now implemented.
The threadring program will now run unchanged with a higher number of child processes in the ring.
The negative timing numbers will be addressed at a later time.
The socketpair is how the child process returns its final value to the parent, right? What happens if you spawn a process with the socketpair absent? Isn't that the same as a fork?
The final return value of the child process is transferred via shared memory, not via sockets. Perhaps in the future the low-level 'fork' and 'wait-pid' should be taken out in favor of the more high-level and practical 'spawn' plus the message API.
Or at least, don't take fork and wait-pid out entirely; put them in the posix module?
I just tested, and the spawn version is 2% slower than the fork version. So, they are almost equivalent.
After meditating a bit about being able to "send" and "receive" between spawned processes, I had this idea. Could you expose the send/receive mechanism just a little bit? That way I could "send" a message to the parent, requesting the communications port/socket/handle for some other child. Then I could communicate directly with that other child. And I could instruct the parent how to handle such a request from the child. That would save having to make a socket pair for every single child/child pair.
Have you found that mmap is faster or slower than the SysV shm API? Any thoughts on switching to the pthreads API to double the speed?
I found the shm API to be the slowest, followed by memory sharing with semaphores, followed by pipes, with socket pairs being the fastest. I implemented all of these completely and compared speeds on different platforms.
The pthreads API runs on only one core on most platforms, and in the same process space, and because of that it doesn't scale well on multicore CPUs. Look at the cores-used stats in the Debian benchmarks. Also, when running in the same process space, newLISP's internal memory structures would get messed up if not duplicated for each thread.
Regarding child-to-child communications, for now I suggest using either the proxy approach as outlined in the docs, or creating pipes, which work almost as fast as socket pairs. I understand that the best solution would be if the same 'send' and 'receive' could be used, but I haven't yet found a viable way to do it. Ideas for how to do this abound, but the devil lies in the details. Any solution needs some central registry for resources, maintained either by the OS kernel or by the newLISP parent process.
At the moment, the most important thing is to use the existing API and find typical usage patterns for solving real-world programming problems. From there we will know how important it is to extend send/receive with child-to-child communication, and how much can be done with the existing API.
On the Tips & Tricks page, there is a new category "Parallel and Distributed Processing" with an example of how to do parallel web page retrieval. We need more real-world examples showing the 'spawn' API.
Quote
On the Tips & Tricks page, there is a new category "Parallel and Distributed Processing" with an example, how to do parallel web page retrieval. We need more real world examples to show using the 'spawn' API.
My YubiKey verification library (//http) uses (spawn) to query verification servers in parallel. It's quite basic, but real-world anyway.
Thanks for the hint. Parallel network access is definitely a good candidate for 'spawn', as is anything else that waits on I/O.
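For instance, a minimal sketch of parallel page retrieval with 'spawn' (the URLs and the 10-second timeout are arbitrary choices here):

(spawn 'page1 (get-url "http://www.newlisp.org"))
(spawn 'page2 (get-url "http://www.newlisp.org/benchmarks/"))
(sync 10000) ; both requests run concurrently; wait up to 10 seconds
(println (length page1) " and " (length page2) " bytes retrieved")
(exit)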
My only use of spawn is in my irc-bot:
(spawn 'p1 (eval-string tmp))
(if (sync 2000)
    ; looking good
    (set 'result (string p1))
    ; no luck
    (begin
      (set 'result (string "error:" p1))
      (abort)))
What is it you have in tmp?
Usually just any newLISP string; e.g., on #newlisp at the moment:
cormullion: (+ 2 3)
newlithp: 5
still a work in progress, though :)