[racket] performance problem in math/matrix

From: Neil Toronto (neil.toronto at gmail.com)
Date: Sun Jan 20 13:59:44 EST 2013

On 01/20/2013 11:14 AM, Jens Axel Søgaard wrote:
> 2013/1/20 Berthold Bäuml <berthold.baeuml at dlr.de>:
>
>> With the ffi-binding example from Jens (thank you!) I get for the 1000x1000
>> multiply 450ms -- only 2x slower than Mathematica or Numpy. So integrating
>> such a binding in math/matrix looks promising.
>
> Huh? I am surprised it is twice as slow.
>
> Ah! The Numpy example uses floats and not doubles.

It may be FFI overhead as well.

>> Nevertheless, my original motivation for the little test was to get an impression of
>> what performance could be achieved in purely Typed Racket for numerical
>> algorithms. Would it in principle be possible to come close to pure C-code
>> performance (using the same algorithm) when using a floating-point-
>> implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.

I just did that. Here are the types:

   real-matrix* : (Array Real) (Array Real) -> (Array Real)

   flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)

   flmatrix* : FlArray FlArray -> FlArray

Results so far, measured in DrRacket with debugging off:

Function           Size              Time
-----------------------------------------
matrix*            100x100          340ms
real-matrix*       100x100           40ms
flonum-matrix*     100x100           10ms
flmatrix*          100x100            6ms

matrix*           1000x1000      418000ms
real-matrix*      1000x1000       76000ms
flonum-matrix*    1000x1000        7000ms
flmatrix*         1000x1000        4900ms
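(For anyone who wants to reproduce numbers like these, a minimal sketch using Racket's built-in `time' form — the matrix names and the use of `build-matrix' here are my own, not the benchmark harness actually used above:)

```racket
#lang typed/racket
(require math/matrix)

;; Two random 100x100 Flonum matrices; `random' returns a Flonum in [0, 1).
(define A (build-matrix 100 100 (lambda ([i : Index] [j : Index]) (random))))
(define B (build-matrix 100 100 (lambda ([i : Index] [j : Index]) (random))))

;; `time' prints cpu/real/gc milliseconds. Run with debugging disabled
;; (DrRacket: Language -> Choose Language -> Show Details) to get
;; timings comparable to the table above.
(time (void (matrix* A B)))
```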

The only difference between `real-matrix*' and `flonum-matrix*' is that 
the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if 
I can get `real-matrix*' to inline, TR's optimizer will specialize the 
former into the latter at call sites with Flonum arguments, making 
`flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup 
TR's optimizer has ever shown me.)

It looks like the biggest speedup comes from doing only flonum ops in 
the inner-loop sum, which keeps all the intermediate flonums unboxed 
(i.e. not heap-allocated and later garbage-collected).
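As a sketch of what that difference looks like — this is illustrative code of my own, not the actual math/matrix implementation, and it assumes n x n matrices stored row-major in FlVectors:

```racket
#lang typed/racket
(require racket/flonum)

;; Dot product of row i of `a' with column j of `b'.
;; Generic version: `+' and `*' produce Real intermediates, so each
;; partial sum gets boxed on the heap.
(define (dot/real [a : FlVector] [b : FlVector]
                  [i : Index] [j : Index] [n : Index]) : Real
  (for/fold ([sum : Real 0]) ([k (in-range n)])
    (+ sum (* (flvector-ref a (+ (* i n) k))
              (flvector-ref b (+ (* k n) j))))))

;; Flonum version: `fl+' and `fl*' let TR keep every intermediate
;; unboxed for the whole loop.
(define (dot/fl [a : FlVector] [b : FlVector]
                [i : Index] [j : Index] [n : Index]) : Flonum
  (for/fold ([sum : Flonum 0.0]) ([k (in-range n)])
    (fl+ sum (fl* (flvector-ref a (+ (* i n) k))
                  (flvector-ref b (+ (* k n) j))))))
```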

Right now, `flmatrix*' is implemented a bit stupidly, so I could speed 
it up further. I won't yet, because I haven't settled on a type for 
matrices of unboxed flonums. The type has to work with LAPACK if it's 
installed, which `FlArray' doesn't do because its data layout is 
row-major and LAPACK expects column-major.
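(The two layouts differ only in how (i, j) maps to a flat index — a sketch, with function names of my own choosing, for an m x n matrix:)

```racket
#lang typed/racket

;; Row-major (FlArray, C): elements of a row are adjacent in memory.
(define (row-major-index [i : Index] [j : Index]
                         [m : Index] [n : Index]) : Nonnegative-Integer
  (+ (* i n) j))

;; Column-major (LAPACK, Fortran): elements of a column are adjacent.
(define (col-major-index [i : Index] [j : Index]
                         [m : Index] [n : Index]) : Nonnegative-Integer
  (+ (* j m) i))
```

So handing an FlArray's backing storage directly to LAPACK would effectively operate on the transpose.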

I'll change `matrix*' to look like `real-matrix*'. It won't give the 
very best performance, but it's a 60x speedup for 1000x1000 matrices.

Neil ⊥


Posted on the users mailing list.