[racket] place: terrible performance of place-channel-get?

From: Matthew Flatt (mflatt at cs.utah.edu)
Date: Wed Nov 12 11:52:29 EST 2014

I'll push a repair to the development version.


The problem isn't so much that message copying/transfer is slow, but
that the rule to trigger an all-places GC doesn't accommodate a large,
not-yet-delivered message. I'll repair that rule.

Most of the process time in your example shows up as GC time, because
the GC was continuously firing while the message waited for the new
place to start and receive it (and the constant GCs slowed the place
start-up).


If upgrading is not an option, you can work around the problem by
waiting for a "ready" message from the new place before sending the
vector as a message. For example, change `test-places1` to

 (define (test-places1)
   (define p1
     (place ch1
            (place-channel-put ch1 'ready)
            (define v (place-channel-get ch1))
            (define w (long-computation v))
            (place-channel-put ch1 w)))  
   (place-channel-get p1) ; => 'ready
   (place-channel-put p1 v1)
   (time (place-channel-get p1)))

That way, `v1` doesn't sit in the message channel long enough to cause
a problem.
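
The same handshake should carry over to the two-place test. Here is a
minimal sketch, not taken from the original example: `start-worker`,
`run-with-handshake`, and `vector-sum` are names made up for
illustration, and `vector-sum` is only a stand-in for
`long-computation` so that the sketch is self-contained.

```racket
#lang racket
(require racket/place)

(provide run-with-handshake)

;; Stand-in for the thread's `long-computation` (an assumption, kept
;; small so that the sketch is self-contained).
(define (vector-sum v)
  (for/fold ([acc 0.0]) ([x (in-vector v)])
    (+ acc x)))

;; Hypothetical helper: start a worker place that announces 'ready
;; before it blocks waiting for its input vector.
(define (start-worker)
  (place ch
    (place-channel-put ch 'ready)
    (define v (place-channel-get ch))
    (place-channel-put ch (vector-sum v))))

;; Start one worker per vector, wait for every 'ready message, and
;; only then send the vectors; results come back in input order.
(define (run-with-handshake vs)
  (define ps (for/list ([v (in-list vs)]) (start-worker)))
  (for ([p (in-list ps)])
    (place-channel-get p))              ; => 'ready
  (for ([p (in-list ps)] [v (in-list vs)])
    (place-channel-put p v))
  (for/list ([p (in-list ps)])
    (place-channel-get p)))
```

Call `run-with-handshake` from the REPL or from a `main` submodule
rather than from the module's top level, since each new place
instantiates the enclosing module again.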

At Tue, 11 Nov 2014 17:41:11 -0700, Matthew Flatt wrote:
> This does seem extremely slow. A place-message send must copy the
> vector to send it as a message, but the copy shouldn't take so long.
> I'll investigate further.
> 
> Meanwhile, an option in this case might be to create a "shared
> flvector", which can be passed directly (i.e., without copying) to
> another place. I've enclosed a variant of your example to illustrate.
> 
> At Mon, 10 Nov 2014 11:58:21 +0200, Alexey Cherkaev wrote:
> > Hi,
> > 
> > I am looking at parallelising some numerical computation with Racket. I’ve 
> > tried future/touch first. However, the data for computation is passed as 
> > vectors and in my experiments with future/touch it would always find 
> > "synchronisation task" upon which all multicore-threads collapse into one 
> > core serialised computation.
> > 
> > Now, I decided to try place. My idea is to make it similar to Common Lisp’s 
> > LPARALLEL: create workers <= number of cores and distribute tasks into those 
> > workers. The problem I have encountered, however, is that place-channel-get 
> > seems to take forever to compute. Here is an example of some simulated 
> > computation on a vector using two places and trying to run them in parallel:
> > 
> > #lang racket
> > 
> > (require racket/place)
> > 
> > (provide test-places1 test-places2 long-computation v1 v2 random-vector)
> > 
> > ;;; Utilities: 
> > (define (random-list n)
> >   (let loop ((i n) (r '()))
> >     (if (zero? i)
> >         r
> >         (loop (sub1 i) (cons (random) r)))))
> > 
> > (define (random-vector n)
> >   (let ((l (random-list n)))
> >     (list->vector l)))
> >   
> > (define (vector-reduce f init v)
> >   (let ((n (vector-length v)))
> >     (let loop ((i 0) (r init))
> >       (if (= i n)
> >           r
> >           (loop (add1 i) (f r (vector-ref v i)))))))
> > 
> > ;;; This is the computation to be run in each place:
> > (define (long-computation v)
> >   (let ((n (vector-length v))
> >         (v1 (vector-copy v)))  ; v is immutable, if want to mutate, must copy it
> >     (let loop ((i 0))
> >       (if (= i n)
> >           (begin
> >             (sleep 2)         ; make it work for a bit longer
> >             (vector-reduce + 0.0 v1)) ; to make result printable
> >           (begin
> >             (vector-set! v1 i (* (exp (- (vector-ref v1 i)))
> >                                  (sin (* pi (vector-ref v1 i)))))   ;flonum computation
> >             (loop (add1 i)))))))
> >       
> > ;;; two vectors to be sent to long-computation
> > (define v1 (random-vector 100000))
> > (define v2 (random-vector 100000))
> > 
> > ;;; Test using one place:
> > (define (test-places1)
> >   (define p1
> >     (place ch1
> >            (define v (place-channel-get ch1))
> >            (define w (long-computation v))
> >            (place-channel-put ch1 w)))  
> >   (place-channel-put p1 v1)
> >   (time (place-channel-get p1)))
> > 
> > ;;; Test using 2 places:
> > (define (test-places2)
> >   (define p1
> >     (place ch1
> >            (define v (place-channel-get ch1))
> >            (define w (long-computation v))
> >            (place-channel-put ch1 w)))
> >   (define p2
> >     (place ch2
> >            (define v (place-channel-get ch2))
> >            (define w (long-computation v))
> >            (place-channel-put ch2 w)))
> >   (place-channel-put p1 v1)
> >   (place-channel-put p2 v2)
> >   (sleep 2) ; hypothetically, after this results should be ready immediately!
> >   (time (list (place-channel-get p1) (place-channel-get p2))))
> > 
> > Execution from racket on MacBook Pro with Intel Core 2 Duo:
> > 
> > -> (time (long-computation v1))
> > cpu time: 42 real time: 2043 gc time: 0
> > 39523.12275516648
> > -> (test-places1)
> > cpu time: 7593 real time: 7475 gc time: 7001
> > 39523.12275516648
> > -> (test-places2)
> > cpu time: 16591 real time: 12492 gc time: 15485
> > '(39523.12275516648 39505.415738171105)
> > 
> > So, the time of execution of (long-computation v1) and the time of getting 
> > the result out of the channel in (test-places1) should be more or less the 
> > same, but it is not. Furthermore, (test-places2) takes almost twice as long 
> > as (test-places1) (note, I put (time …) around just getting the value, so it 
> > does not include the time of creating the place).
> > 
> > Am I doing something wrong?
> > 
> > Cheers, Alexey
> > 
> > 
> > ____________________
> >   Racket Users list:
> >   http://lists.racket-lang.org/users
> ------------------------------------------------------------------------------
> [application/octet-stream "shared-flvector-example.rkt"]
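
The attachment mentioned in the quoted message is not reproduced in
this archive. As a rough illustration of the shared-flvector idea, here
is a minimal sketch that is not the attached file: `random-shared-flvector`
and `sum-in-place` are hypothetical names. It relies on
`make-shared-flvector` from `racket/flonum`, whose result is shared
between places instead of being copied when sent as a message.

```racket
#lang racket
(require racket/place racket/flonum)

(provide random-shared-flvector sum-in-place)

;; Hypothetical helper: a shared flvector holding n random flonums.
(define (random-shared-flvector n)
  (define fv (make-shared-flvector n 0.0))
  (for ([i (in-range n)])
    (flvector-set! fv i (random)))
  fv)

;; Send the shared flvector to a new place. The place overwrites each
;; element x with exp(-x) * sin(pi * x), as in the quoted example, and
;; sends back only the sum. Since the flvector is shared, the caller
;; sees the updated elements too.
(define (sum-in-place fv)
  (define p
    (place ch
      (define v (place-channel-get ch))
      (for ([i (in-range (flvector-length v))])
        (define x (flvector-ref v i))
        (flvector-set! v i (* (exp (- x)) (sin (* pi x)))))
      (place-channel-put ch
                         (for/fold ([acc 0.0]) ([x (in-flvector v)])
                           (+ acc x)))))
  (place-channel-put p fv)
  (place-channel-get p))
```

For example, `(sum-in-place (random-shared-flvector 100000))` at the
REPL. Because the memory is shared, no `vector-copy` is needed inside
the place, but both places must then coordinate any concurrent
mutation themselves.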


Posted on the users mailing list.