[racket] performance problem in math/matrix
On 20.01.2013, at 18:09, Neil Toronto <neil.toronto at gmail.com> wrote:
> On 01/20/2013 07:09 AM, Jens Axel Søgaard wrote:
>> Hi Berthold,
>>
>> 2013/1/20 Berthold Bäuml <berthold.baeuml at dlr.de>:
>>> I tried a little test with mildly large Flonum matrices using math/matrix in Typed
>>> Racket: multiplying two 1000x1000 Flonum matrices with (array-strictness #t).
>>> Running this in DrRacket results in an "out of memory" error (even with the
>>> memory limit set to 1GB), and running it with racket on the command line makes
>>> the computer swap heavily.
>>
>>> Where is this high memory consumption coming from?
>>
>> I consider this a bug in matrix*.
>>
>> The culprit is inline-matrix-multiply in "untyped-matrix-arithmetic.rkt".
>> When multiplying an mxp matrix by a pxn matrix, it builds a
>> temporary array of size n*p*m holding all the needed products, and
>> then computes sums of these products.
>>
>> Replacing this algorithm with the standard O(n^3) one ought to be easy.
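As a reference point for that fix, a standard triple-loop multiply over flat, row-major FlVectors might look like the sketch below (my own illustration with assumed names and layout, not the actual math/matrix code):

    #lang typed/racket
    (require racket/flonum)

    ;; Multiply an mxp matrix A by a pxn matrix B, both stored
    ;; row-major in FlVectors; only the m*n result is allocated.
    (: fl-matrix* (FlVector FlVector Index Index Index -> FlVector))
    (define (fl-matrix* A B m p n)
      (define C (make-flvector (* m n) 0.0))
      (for* ([i (in-range m)] [j (in-range n)])
        (define c-ij
          (for/fold ([acc : Flonum 0.0]) ([k (in-range p)])
            (fl+ acc (fl* (flvector-ref A (+ (* i p) k))
                          (flvector-ref B (+ (* k n) j))))))
        (flvector-set! C (+ (* i n) j) c-ij))
      C)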
>
> Aye, that's been done. The fix isn't in the current release candidate, though. It'll be in the next one.
>
> (I made arrays strict by default in two stages: one for `math/array', and one for `math/matrix'. The release candidate was rolled out between stages, so the `math/matrix' changes aren't in it. The change for `matrix*' ensures that the n*p*m-size matrix is nonstrict, so it never allocates an n*p*m-size vector to store its elements.)
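Just to check my understanding of the strictness mechanism: with the parameter set to #f, operations return nonstrict arrays whose elements are recomputed on access instead of being stored, as in this toy example of mine:

    #lang typed/racket
    (require math/array)

    ;; With array-strictness #f, math/array operations return
    ;; nonstrict arrays: no backing vector is allocated, and each
    ;; element is recomputed every time it is accessed.
    (define A
      (parameterize ([array-strictness #f])
        (array* (array #[#[1.0 2.0] #[3.0 4.0]])
                (array #[#[5.0 6.0] #[7.0 8.0]]))))
    (array-strict? A)  ; => #f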
>
>> Numpy (and Mathematica, I believe) uses BLAS and LAPACK, which use
>> Strassen's algorithm (or better?) for matrices this large, so there will
>> still be a performance gap.
>>
>> If this performance gap turns out to be too large, we must figure
>> out a way to integrate BLAS and LAPACK bindings into math/matrix.
>
> Yes, this.
>
> I'd like to make an `FlMatrix' type that uses BLAS and LAPACK when they're available, and falls back to a Racket floating-point-only implementation. The latter would at least fix the biggest performance issue with flonum matrices, which is that their elements are heap-allocated.
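The heap-allocation issue is easy to see: a generic matrix over Flonum stores a pointer to a separately allocated box per element, while an FlVector stores the raw doubles inline. A tiny illustration (my own):

    #lang typed/racket
    (require racket/flonum)

    ;; Each element of v is a pointer to a heap-allocated flonum;
    ;; fv stores the raw 8-byte doubles inline, unboxed.
    (define v : (Vectorof Flonum) (make-vector 1000000 0.0))
    (define fv : FlVector (make-flvector 1000000 0.0))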
With the FFI-binding example from Jens (thank you!), the 1000x1000 multiply takes 450ms for me -- only about 2x slower than Mathematica or NumPy. So integrating such a binding into math/matrix looks promising.
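For anyone curious, such a binding essentially just calls cblas_dgemm through ffi/unsafe. A rough sketch of the shape it might take (the library name, versions, and helper names are my guesses, not necessarily Jens's actual code):

    #lang racket/base
    (require ffi/unsafe ffi/unsafe/define ffi/vector)

    ;; The shared-library name is platform-dependent; "libcblas" is a guess.
    (define-ffi-definer define-cblas (ffi-lib "libcblas" '("3" #f)))

    ;; CBLAS constants: row-major storage, no transposition
    (define CblasRowMajor 101)
    (define CblasNoTrans  111)

    ;; C := alpha*A*B + beta*C, with A: mxk, B: kxn, C: mxn (row-major)
    (define-cblas cblas_dgemm
      (_fun _int _int _int      ; order, transA, transB
            _int _int _int      ; m, n, k
            _double             ; alpha
            _f64vector _int     ; A, lda
            _f64vector _int     ; B, ldb
            _double             ; beta
            _f64vector _int     ; C, ldc
            -> _void))

    ;; Multiply an mxk matrix A by a kxn matrix B, both flat,
    ;; row-major f64vectors; beta = 0.0, so C need not be zeroed.
    (define (dgemm* A B m k n)
      (define C (make-f64vector (* m n)))
      (cblas_dgemm CblasRowMajor CblasNoTrans CblasNoTrans
                   m n k 1.0 A k B n 0.0 C n)
      C)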
Nevertheless, my original motivation for the little test was to get an impression of the performance that can be achieved in pure Typed Racket for numerical algorithms. Would it in principle be possible to come close to the performance of pure C code (using the same algorithm) with such a floating-point-only implementation?
> In the meantime, I'll profile matrix multiply and see if I can speed it up. My 1000x1000 multiply just finished at 418sec, which is... pitiful.
>
> Neil ⊥
>
Berthold
--
-----------------------------------------------------------------------
Berthold Bäuml
DLR, Robotics and Mechatronics Center (RMC)
Münchner Str. 20, D-82234 Wessling
Phone +49 8153 282489
http://www.robotic.de/Berthold.Baeuml