[racket] performance problem in math/matrix
On 01/20/2013 11:14 AM, Jens Axel Søgaard wrote:
> 2013/1/20 Berthold Bäuml <berthold.baeuml at dlr.de>:
>
>> With the ffi-binding example from Jens (thank you!) I get for the 1000x1000
>> multiply 450ms -- only 2x slower than Mathematica or Numpy. So integrating
>> such a binding in math/matrix looks promising.
>
> Huh? I am surprised it is twice as slow.
>
> Ah! The Numpy example uses floats and not doubles.
It may be FFI overhead as well.
>> Nevertheless, my original motivation for the little test was to get an impression of
>> what performance could be achieved in purely Typed Racket for numerical
>> algorithms. Would it in principle be possible to come close to pure C-code
>> performance (using the same algorithm) when using a floating-point-
>> implementation?
>
> Let's try and write a matrix* that works for matrices represented as
> vectors of floats and see how fast it is.
I just did that. Here are the types:
real-matrix* : (Array Real) (Array Real) -> (Array Real)
flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
flmatrix* : FlArray FlArray -> FlArray
Results so far, measured in DrRacket with debugging off:
Function Size Time
-----------------------------------------
matrix* 100x100 340ms
real-matrix* 100x100 40ms
flonum-matrix* 100x100 10ms
flmatrix* 100x100 6ms
matrix* 1000x1000 418000ms
real-matrix* 1000x1000 76000ms
flonum-matrix* 1000x1000 7000ms
flmatrix* 1000x1000 4900ms
The only difference between `real-matrix*' and `flonum-matrix*' is that
the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if
I can inline `real-matrix*', TR's optimizer will change the former to
the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be
the largest speedup TR's optimizer will have ever shown me.)
It looks like the biggest speedup comes from doing only flonum ops in
the inner loop sum, which keeps all the intermediate flonums unboxed
(i.e. not heap-allocated or later garbage-collected).
Right now, `flmatrix*' is implemented a bit stupidly, so I could speed
it up further. I won't yet, because I haven't settled on a type for
matrices of unboxed flonums. The type has to work with LAPACK if it's
installed, which `FlArray' doesn't do because its data layout is
row-major and LAPACK expects column-major.
I'll change `matrix*' to look like `real-matrix*'. It won't give the
very best performance, but it's a 60x speedup for 1000x1000 matrices.
Neil ⊥