[racket] performance problem in math/matrix

From: Neil Toronto (neil.toronto at gmail.com)
Date: Sun Jan 20 12:09:00 EST 2013

On 01/20/2013 07:09 AM, Jens Axel Søgaard wrote:
> Hi Berthold,
>
> 2013/1/20 Berthold Bäuml <berthold.baeuml at dlr.de>:
>> I tried a little test with mildly large Flonum matrices using math/matrix in Typed
>> Racket. Two 1000x1000 dimensional matrices should be multiplied with (array-
>> strictness #t). Running this in DrRacket results in an "out of memory" error (also
>> having set the memory limit to 1GB) and running it with racket on the command
>> line makes the computer start to heavily swap.
>
>> Where is this high memory consumption coming from?
>
> I consider this a bug in matrix*.
>
> The culprit is inline-matrix-multiply in "untyped-matrix-arithmetic.rkt".
> When multiplying an mxp matrix with an pxn  matrix, it builds a
> temporary array of size n*p*m with all products needed, then
> it computes sums of these products.
>
> Replacing this algorithm with the standard O(n^3) ought to be easy.

Aye, that's been done. The fix isn't in the current release candidate, 
though. It'll be in the next one.

(I made arrays strict by default in two stages: one for `math/array', 
and one for `math/matrix'. The release candidate was rolled out between 
stages, so the `math/matrix' changes aren't in it. The change for 
`matrix*' ensures that the n*p*m-size matrix is nonstrict, so it never 
allocates an n*p*m-size vector to store its elements.)

> Numpy (and Mathematica, I believe) uses BLAS and LAPACK, which use
> Strassen's algorithm (or better?) for matrices this large, so there will
> still be a performance gap.
>
> If this performance gaps turns out to be too large, we must figure
> out a way to integrate BLAS and LAPACK bindings into math/matrix.

Yes, this.

I'd like to make an `FlMatrix' type that uses BLAS and LAPACK when 
they're available, and falls back to a Racket floating-point-only 
implementation. The latter would at least fix the biggest performance 
issue with flonum matrices, which is that their elements are heap-allocated.

In the meantime, I'll profile matrix multiply and see if I can speed it 
up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.

Neil ⊥


Posted on the users mailing list.