threadring in newlisp

Started by Ormente, September 20, 2011, 11:45:27 AM


Ormente

I tried to implement the threadring task from the alioth language benchmark (http://shootout.alioth.debian.org/u32/performance.php?test=threadring), and found my code very slow. It's my first attempt at "multithreading", so I hope someone can help me make it faster.



Here is my code:



(set 'nThreads  503
     'limit     500000
     'channels  '())

; each worker does this: read the token, increment it, pass it on
(define (worker id chanR chanW , r)
  (catch
      (while (set 'r (int (read-line chanR)))   ; read-line blocks on the pipe
          (if (= r limit)
              (throw id)                        ; this worker holds the final token
              (write-line chanW (string (inc r)))))))

; make communication channels
(for (i 0 (- nThreads 1)) (push (pipe) channels -1))  ; 'for' is inclusive, so stop at nThreads-1

; spawn workers
(for (i 0 (- nThreads 1))   ; nThreads workers, not nThreads+1
    (spawn 'p (worker (+ i 1)
                  ((channels i) 0)
                  ((channels (% (+ 1 i) nThreads)) 1) )) )

; start the process by giving the token to the first worker
(write-line ((channels 0) 1) "0")

; wait for result
(while (not p) (sleep 100) (sync 1))

(println "DONE BY " p)

(abort)

(exit)


It's slower than all the other implementations; newLISP can do better.



Thanks for your help.



Renaud.

Lutz

#1
Threads in newLISP are really separate processes, not threads in the same process space. That is the reason they are slower. On the upside, on modern multicore CPUs, newLISP's spawn or fork will take advantage of multiple cores. Switching processes is slower than switching threads, but processes scale better because they can be distributed across multiple cores.



I believe that in the modern world of multicore CPUs, multi-process is the way to go, because in the end it scales better. The performance of process switching varies a lot between platforms, so you will see very different results on different machines. You will also see CPUs getting better at this in the future, as multiprocessing becomes more important.





ps: I would not loop on p, because p is set as each process finishes. Rather, switch on sync, which returns true only when all processes have finished, e.g. (until (sync long-time)), or (until (sync short-time) (do-something)).
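
For example, this little sketch (with made-up child tasks) shows the pattern:

(spawn 'a (+ 1 2))
(spawn 'b (+ 3 4))
(until (sync 100))      ; sync returns true only when all children have finished
(println (list a b))    ; a and b now hold the children's results
(exit)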

Lutz

#2
With a few changes, the following takes about 490 milliseconds on a Mac Mini with an Intel 1.83 GHz Core Duo. With 'limit' set to 10000, that is about 49 microseconds per pipe message from one process to the next in the ring. With nThreads set to 100, the ring gets circled 100 times.



On my ISP's FreeBSD machine I could get 14 microseconds per pipe message, but I don't know that machine's clock speed or number of cores.



#!/usr/bin/newlisp

; processes pass and increment a number token in a ring
; communications is via pipes

; to see how it all works, enable all println's
; and put nThreads to 5 and limit to 10

(set 'nThreads  100      ; these are actually processes
     'limit     10000    ; must be multiple of nThreads
     'channels  '())

; worker
(define (worker id chanR chanW , r)
    (catch  (while (set 'r (int (read-line chanR)))
            ;(println "id:" id " read:" r " write:" (+ r 1))
                (write-line chanW (string (inc r)))
                (when (>= r (+ (- limit nThreads) id))
                    (throw (string "id:" id " finished with r:" (- r 1))))
            )
    )
)

; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))
;(println "channels:" channels)

; spawn workers
(dotimes (i nThreads)
    (let (  id (+ i 1)
            in (channels i 0)
            out (channels (% (+ 1 i) nThreads) 1) )
        (spawn 'p (worker id in out))
        ;(println "started id:" id " in:" in " out:" out)
    )
)

; start the ring by sending the token to the first process
(set 'start (time-of-day))
(write-line (channels 0 1) "0")

(until (sync 10)) ; wait for all to finish

(println p)

(println "duration: " (- (time-of-day) start) "ms")
(exit)


gives this output:



~> ./process-ring-spawn
id:100 finished with r:9999
duration: 490ms
~>


You could do the same thing with 'fork' and use slightly less memory, but you would not have the luxury of getting return values, and you would need to register the process pids and use 'wait-pid' on them to reap the process zombies after exiting.
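
An untested sketch of the same ring using 'fork' (channels and worker as in the listing above):

(set 'pids '())
(dotimes (i nThreads)
    (push (fork (worker (+ i 1)
                    (channels i 0)
                    (channels (% (+ 1 i) nThreads) 1))) pids))

(write-line (channels 0 1) "0")         ; start the token
(dolist (pid pids) (wait-pid pid))      ; reap the zombies by hand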



ps: see also here for some other benchmarks: http://www.newlisp.org/benchmarks/

Ormente

#3
Thanks a lot Lutz.



I spent some time rewriting my code along the lines of your idea (not waiting for the result, and letting the workers end themselves), while still conforming to the benchmark specification, and the best I can get is:


#!/usr/bin/newlisp

(set 'nThreads  503
     'limit     500000
     'channels  '())

; The worker code. Note: r starts out nil, and nil sorts below any number
; in newLISP, so (< r limit) is true on the first pass. Each worker passes
; the token once more after r reaches limit, which sweeps the shutdown
; around the ring; only the worker that read exactly limit reports.
(define (worker id chanR chanW , r)
    (while (< r limit)
        (write-line chanW (string (+ 1 (set 'r (int (read-line chanR)))))))
    (when (= r limit) (println "Done by id:" id))
)

; make communication channels
(dotimes (i nThreads) (push (pipe) channels))

; spawn workers
(dotimes (i nThreads)
    (let (  id (+ i 1)
            in (channels i 0)
            out (channels (% (+ 1 i) nThreads) 1) )
        (spawn 'p (worker id in out))))


(set 'start (time-of-day))
; put the token in the ring
(write-line ((channels 0) 1) "0")
; wait for threads end
(until (sync 10))

(println "Duration: " (- (time-of-day) start) "ms")

(exit)


That's a bit better (about 15-20% faster), but still a lot slower than the slowest code from the Alioth benchmark, in which I think some languages use real processes too. I wonder if the difference is not more in the pipe reading and writing than in the process switching.
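
One way to check would be to micro-benchmark the pipe alone, with something like this sketch (nRounds and the token value are just for illustration):

(set 'nRounds 100000)
(map set '(rd wr) (pipe))       ; one pipe, written and read by this process
(set 'start (time-of-day))
(dotimes (i nRounds)
    (write-line wr "12345")
    (read-line rd))
(println "microseconds per write+read: "
    (div (mul 1000.0 (- (time-of-day) start)) nRounds))
(exit)

That would at least separate the cost of the pipe I/O itself from the cost of the process switches.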



BTW, that's fast enough for me, and even if one always wants improvements, I think overall newLISP is really a brilliant piece of software.

cormullion

#4
Interesting demo, Lutz.


$ newlisp process-ring-spawn.lsp
id:100 finished with r:9999
duration: 121ms


on iMac...

TedWalther

#5
I ran the GHC benchmark on my box, then the newLISP code you pasted here.



The GHC benchmark is the fastest one on Alioth. Yet newLISP came out ahead:



18 seconds for newLISP, 25 seconds for Haskell (./threadring.ghc_run  50000000)



How is your benchmark different?  I notice that you specify 500 thousand for "limit", while you are passing in 50 million for the Haskell benchmark.



I don't know Haskell that well, and you are using some different variable names, which makes it hard to map the two programs onto each other and check that they are doing the same things.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

rickyboy

#6
Quote from TedWalther: "I notice that you specify 500 thousand for 'limit', while you are passing in 50 million for the Haskell benchmark."

Eh ... what's a couple of orders of magnitude, between friends?   ;)
(λx. x x) (λx. x x)

TedWalther

#7
Looking at the code and the definition of the benchmark, I notice some potential problems.



1) send and receive don't have blocking versions; it is all non-blocking, so you have to use a spinlock to wait for your messages instead of doing a "poll" or "select" to wake up when notified.



2) send, receive, and share don't work between peers.  All traffic must be routed through the controlling parent.



Point number two actually prevents us from implementing the benchmark: the benchmark requires that each thread be linked to the next in a ring and communicate directly with it.



I don't know how send and receive are implemented, but I'm fairly sure that on Unix the shared memory API allows any process to attach to a segment of shared memory; the shm_open man page doesn't mention any such restriction. I'm not sure whether mmap can be used multiple times on the same file, but that may also be a good way to share between peers.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Lutz

#8
newLISP has 'send' and 'receive' for communication between a parent and its 'spawn'ed children; child-to-child communication is currently managed with the parent functioning as a proxy. The interface can be used blocking - there is nothing wrong with a spin-lock interface, if it's short ;) - or non-blocking. The API is based on mmap(), can handle binary info (zeros) and unlimited sizes (using files). The API also does not need semaphores.



Although I think the current 'send'/'receive' API is nice from a programmer's perspective, unfortunately it is also slow on some platforms.



'send' and 'receive' are being reworked for the 10.3.4 development version without changing the API, to gain better speed. Until then, pipes or the mmap()-based 'share' together with 'semaphore' are the most efficient ways of inter-process communication in newLISP, but they are also more complex to use. See also the example file prodcons.lsp in newlisp-x.x.x/examples.
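
In a nutshell, the share/semaphore pattern looks like this bare-bones sketch (prodcons.lsp shows the real thing):

(set 'mem (share))              ; mmap() based shared memory
(set 'sid (semaphore))          ; semaphore created with value 0

(set 'pid (fork (begin
    (share mem 123)             ; producer: write into shared memory
    (semaphore sid 1))))        ; then signal the consumer

(semaphore sid -1)              ; consumer: block until signaled
(println "got: " (share mem))
(wait-pid pid)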

TedWalther

#9
I see that Ormente was using pipes to communicate between child processes, without going through the parent.  He was using a message passing style rather than a shared memory style of communication.



read and write do block, so the pipe method is good; it avoids the busy-wait loops that suck up CPU.



I wonder if his code is slow because it is converting things to and from strings all the time. Perhaps I could set up a "share" in the parent process, so the pipes only transmit a signal that means "hey, you child, you have exclusive access to the share right now..."



When a parent sets up a shared memory segment, all children spawned from that point on can access it, right?



I'm just guessing at what the slowness is. I have a hard time imagining the pipes can be slow; surely they can't be slower than IPv4 or Unix domain sockets?
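
Something like this sketch is what I have in mind (untested; I'm assuming a share created before fork is inherited by the child):

(set 'mem (share))
(map set '(rd wr) (pipe))

(set 'pid (fork (begin
    (read-line rd)                      ; block until the parent rings the bell
    (share mem (+ (share mem) 1)))))    ; then touch the shared value directly

(share mem 41)
(write-line wr "go")                    ; the pipe carries only the signal
(wait-pid pid)
(println (share mem))                   ; 42 -- the payload never became a string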
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

TedWalther

#10
Lutz, how do I use send/receive blocking? I mean truly blocking, not the (until (send)) (until (receive)) idiom. It is beautifully short and easy to code, but I fear it sucks up a lot of CPU, and if you have 500 threads all doing this... major CPU wastage. In the benchmark at hand, that is exactly what happens: since it is a token-passing "ring" structure, it really is a serial structure, but using threads. Perhaps it was designed to test a worst-case scenario for threading.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

Lutz

#11
Either you have interrupts for event-driven I/O, or you have a polling loop somewhere looking for data to arrive and causing the block. Any blocking scheme has this polling loop somewhere; it may be implemented on a lower level and not visible, but it's always there.



The faster your potential message turnaround, the tighter your polling loop, and the more CPU you use. The current implementation introduces an internal wait tuned to the Mac OS X platform to avoid mutual locks. Lifting the loop into user space lets you fine-tune your polling mechanism with 'sleep's and do something else inside the loop too, if you wish. Without user adjustment of the loop with 'sleep's, there is only a few percent CPU usage per child, depending on the number of children started.
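
The generic shape of such a user-space polling loop (check-for-data and do-other-work are hypothetical placeholders for your own non-blocking test and work):

(until (check-for-data)     ; e.g. a non-blocking 'receive'
    (sleep 5)               ; tune this: shorter sleep = faster turnaround, more CPU
    (do-other-work))        ; optional work done while waiting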



When a parent spawns a child, it sets up two shared memory segments, one for each direction of communication, and passes the addresses to the child. The parent keeps a list of all children started and the memory segments used.

TedWalther

#12
OK, I got this benchmark down to 19 minutes. For reference, the Haskell benchmark took 2.8 times as long on my machine as it did on the Debian benchmark machine. So, dividing my time by 2.8, I get 405 seconds, or almost 7 minutes. This is at least in the ballpark now. Here is the code; it was tough to get it down this fast:



#!/usr/bin/newlisp

(setq
 nThreads  503
 nTransactions      50000000
 channels (list)
 start-time (time-of-day)
 token "a" ltoken (length token)
 )

(define (pass-token) (write write-to-pipe token ltoken))
(define (recv-token) (if (and (= (length token) (read read-from-pipe c ltoken)) (= c token)) true nil))

(define (worker id read-from-pipe write-to-pipe n)
  (catch
      (while (recv-token)
          (-- n nThreads)
          (if (< 0 n)
              (pass-token)
              ; else branch: println is truthy, so the throw fires (multi-branch if)
              (println (format "Done by id %d\nSeconds: %.2f"
                  id (div (- (time-of-day) start-time) 1000)))
              (throw nil))
          (when (< n nThreads) (throw nil))))
  ; write a non-token byte so the next worker's recv-token fails and it exits too
  (write write-to-pipe "b" 1))

;; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))

;; spawn workers
(dotimes (i nThreads)
  (fork (worker (+ 1 i)
                (channels i 0)
                (channels (% (+ 1 i) nThreads) 1)
                (+ nTransactions nThreads (- i) -1))))

;; put the token in the ring
(setq write-to-pipe (channels 0 1)) (pass-token)
(dotimes (i nThreads) (wait-pid -1 0))
(exit)



I tried using spawn and sync, but was getting segfaults. Also, I noticed sync was eating up almost 90% of the CPU time in the parent.



I'm looking at the sweet timings for Go and Haskell; even SBCL Lisp was able to do it in under 5 minutes.



I have heard from Linux and BSD kernel developers that the pthreads API is the right way to go; the library itself is responsible for forking enough times so that the kernel can schedule threads on the appropriate CPUs. I remember that is why Linus developed the "clone" API in the Linux kernel. And in OpenBSD, the rthreads implementation of pthreads is supposed to do this too.



If there is any way to speed this up, please show us here! I did my best.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

TedWalther

#13
For reference, here is the spawn-based code that was segfaulting. It segfaults once the number of threads goes above 339. Lutz is right, the API is sweet, and I think the code done this way is smaller and sweeter. I don't know why it segfaults, though. And it is slightly slower, about 20% slower.



I am pasting it here in running form (with nThreads set to 339). To see the segfault, just bump up the nThreads value.



#!/usr/bin/newlisp

(setq
; nThreads  503
 nThreads  339
 nTransactions      50000000
 channels (list)
 start-time (time-of-day)
 )

(define (write-int fd i) (write fd (pack "ld" i) 4))  ; 32-bit binary int, no string conversion
(define (read-int fd) (read fd buff 4) (first (unpack "ld" buff)))

(define (worker id read-from-pipe write-to-pipe)
  (until (zero? (setq r (- (read-int read-from-pipe) 1)))
         (write-int write-to-pipe r))
  id)

;; make communication channels
(dotimes (i nThreads) (push (pipe) channels -1))

;; spawn workers
(dotimes (i nThreads)
  (spawn 'throwaway (worker (+ 1 i) (channels i 0) (channels (% (+ 1 i) nThreads) 1))))

;; put the token in the ring
(write-int (channels 0 1) nTransactions)

(until (integer? throwaway) (sync 1000))
(println "Done by id " throwaway)
(println (format "Seconds: %.2f" (div (- (time-of-day) start-time) 1000)))
(abort)
(exit)

Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.

TedWalther

#14
Observing the results here: http://shootout.alioth.debian.org/u64q/performance.php?test=threadring



I see that the pthread libraries, on Linux at least, DO handle distributing threads over the various CPUs, because pthreads on Linux are based on the underlying clone call. That is why the C implementation of the benchmark is able to do this in 208 seconds, even though it uses pthreads.



The way you can tell is that the "elapsed" seconds are less than the "CPU" seconds. To use more CPU seconds than elapsed seconds, you have to be using more than one CPU; for instance, a run that burned 400 CPU seconds in 208 elapsed seconds must have kept roughly two cores busy on average.
Cavemen in bearskins invaded the ivory towers of Artificial Intelligence.  Nine months later, they left with a baby named newLISP.  The women of the ivory towers wept and wailed.  "Abomination!" they cried.