[racket] performance problem in math/matrix

From: Berthold Bäuml (berthold.baeuml at dlr.de)
Date: Tue Jan 22 16:09:45 EST 2013

On 22.01.2013, at 06:19, Neil Toronto <neil.toronto at gmail.com> wrote:

> On 01/21/2013 04:33 PM, Berthold Bäuml wrote:
>> 
>>> 
>>> I just did that. Here are the types:
>>> 
>>>  real-matrix* : (Array Real) (Array Real) -> (Array Real)
>>> 
>>>  flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
>>> 
>>>  flmatrix* : FlArray FlArray -> FlArray
>>> 
>>> Results so far, measured in DrRacket with debugging off:
>>> 
>>> Function           Size              Time
>>> -----------------------------------------
>>> matrix*            100x100          340ms
>>> real-matrix*       100x100           40ms
>>> flonum-matrix*     100x100           10ms
>>> flmatrix*          100x100            6ms
>>> 
>>> matrix*           1000x1000      418000ms
>>> real-matrix*      1000x1000       76000ms
>>> flonum-matrix*    1000x1000        7000ms
>>> flmatrix*         1000x1000        4900ms
>>> 
>>> The only difference between `real-matrix*' and `flonum-matrix*' is that the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can inline `real-matrix*', TR's optimizer will change the former to the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup TR's optimizer will have ever shown me.)
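[A minimal sketch of the `+' vs `fl+' distinction described above, using a plain inner-product loop. The helper names are hypothetical, not the actual math/matrix implementation; the point is that the two bodies are identical except for the arithmetic operators, which is exactly the rewrite TR's optimizer can do when the argument types are known to be Flonum:]

```racket
#lang typed/racket
(require math/flonum)

;; Generic version: `+' and `*' on Real box every intermediate result.
(: real-dot ((Vectorof Real) (Vectorof Real) -> Real))
(define (real-dot xs ys)
  (for/fold ([acc : Real 0])
            ([x (in-vector xs)] [y (in-vector ys)])
    (+ acc (* x y))))

;; Flonum version: `fl+' and `fl*' keep intermediates unboxed.
(: fl-dot ((Vectorof Flonum) (Vectorof Flonum) -> Flonum))
(define (fl-dot xs ys)
  (for/fold ([acc : Flonum 0.0])
            ([x (in-vector xs)] [y (in-vector ys)])
    (fl+ acc (fl* x y))))
```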
>>> 
>>> It looks like the biggest speedup comes from doing only flonum ops in the inner loop sum, which keeps all the intermediate flonums unboxed (i.e. not heap-allocated or later garbage-collected).
>>> 
>>> Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up further. I won't yet, because I haven't settled on a type for matrices of unboxed flonums. The type has to work with LAPACK if it's installed, which `FlArray' doesn't do because its data layout is row-major and LAPACK expects column-major.
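[The data-layout mismatch mentioned above comes down to how a 2-d index maps to a flat offset. A sketch, assuming an m-by-n matrix stored in one flat vector: `FlArray' lays rows out contiguously, while LAPACK (being Fortran-based) lays columns out contiguously, so handing an `FlArray''s backing storage straight to LAPACK would effectively transpose the matrix:]

```racket
#lang racket

;; Flat offset of element (i, j) in an m x n matrix:
;; row-major    (FlArray, C):      offset = i*n + j
;; column-major (LAPACK, Fortran): offset = j*m + i
(define (row-major-index m n i j)
  (+ (* i n) j))

(define (column-major-index m n i j)
  (+ (* j m) i))
```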
>>> 
>>> I'll change `matrix*' to look like `real-matrix*'. It won't give the very best performance, but it's a 60x speedup for 1000x1000 matrices.
>>> 
>> 
>> These results look very promising, especially if, as you mentioned, `real-matrix*' will in the end automatically reach `flonum-matrix*' performance for Flonums, and `flmatrix*' will automatically switch to a LAPACK-based variant when available. For the latter it would be great if one could even change the library used, e.g., to redirect to an installation of Intel's highly efficient MKL library.
>> 
>> Looking forward to benchmarking it against NumPy and Mathematica (which is MKL-based) again!
> 
> The next release candidate will have a `matrix*' that runs as fast as `flonum-matrix*' on (Matrix Flonum) arguments, and as fast as `real-matrix*' on (Matrix Real) arguments.

Cool!

> (Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit stupidly" implementation was actually the fastest I could do, which probably means it's as fast as it can be done in Racket without dropping to C.)
> 
> Additionally, all the matrix functions will have more precise types to make it easy for TR to prove that operations on (Matrix Flonum) values yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real) and (Matrix Number).) Example:
> 
>  > (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
>  (Values (Array Flonum) (Array Flonum))
>  > (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
>                                     [3.0+0.0i 4.0+0.0i]])))
>  (Values (Array Float-Complex) (Array Float-Complex))
> 
> The current release candidate returns two (Array Real) and (Array Number) instead.
> 
> Since I've brought up decompositions, you should know that these, `matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation except multiplication will be slow on large matrices. I can't think of a way to inline them in a way that TR can optimize. For fast versions, which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/

Will the LAPACK bindings be in the git HEAD sooner, maybe? ;-)

> Is the MKL library binary-compatible with LAPACK?

I have not tried it myself yet, but the Intel documentation does say so:
"If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures."
http://software.intel.com/en-us/intel-mkl/
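[A sketch of what Intel's "simply re-link" claim means in practice. This assumes a standard MKL install with its single dynamic library on the linker path; library names and paths will vary by system, and `cc'/`myprog.c' are placeholders:]

```shell
# Application originally linked against the reference LAPACK:
cc myprog.c -o myprog -llapack

# Same source, re-linked against MKL's single dynamic library instead:
cc myprog.c -o myprog -lmkl_rt
```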

Berthold


> Neil ⊥
Posted on the users mailing list.