[racket] performance problem in math/matrix
On 22.01.2013, at 06:19, Neil Toronto <neil.toronto at gmail.com> wrote:
> On 01/21/2013 04:33 PM, Berthold Bäuml wrote:
>>
>>>
>>> I just did that. Here are the types:
>>>
>>> real-matrix* : (Array Real) (Array Real) -> (Array Real)
>>>
>>> flonum-matrix* : (Array Flonum) (Array Flonum) -> (Array Flonum)
>>>
>>> flmatrix* : FlArray FlArray -> FlArray
>>>
>>> Results so far, measured in DrRacket with debugging off:
>>>
>>> Function         Size        Time
>>> -----------------------------------------
>>> matrix*          100x100        340ms
>>> real-matrix*     100x100         40ms
>>> flonum-matrix*   100x100         10ms
>>> flmatrix*        100x100          6ms
>>>
>>> matrix*          1000x1000   418000ms
>>> real-matrix*     1000x1000    76000ms
>>> flonum-matrix*   1000x1000     7000ms
>>> flmatrix*        1000x1000     4900ms
>>>
>>> The only difference between `real-matrix*' and `flonum-matrix*' is that the former uses `+' and `*' and the latter uses `fl+' and `fl*'. But if I can inline `real-matrix*', TR's optimizer will change the former to the latter, making `flonum-matrix*' unnecessary. (FWIW, this would be the largest speedup TR's optimizer will have ever shown me.)
>>>
>>> It looks like the biggest speedup comes from doing only flonum ops in the inner loop sum, which keeps all the intermediate flonums unboxed (i.e. not heap-allocated or later garbage-collected).
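>>>
>>> For instance, a minimal sketch of the inner-loop difference (`real-dot' and `fl-dot' are made-up names here, not the actual math/matrix code):
>>>
>>> #lang typed/racket
>>> (require racket/flonum)
>>>
>>> ;; Generic arithmetic: intermediate sums may be boxed on the heap.
>>> (: real-dot ((Vectorof Real) (Vectorof Real) -> Real))
>>> (define (real-dot xs ys)
>>>   (for/fold ([sum : Real 0])
>>>             ([x (in-vector xs)] [y (in-vector ys)])
>>>     (+ sum (* x y))))
>>>
>>> ;; Flonum-only arithmetic: TR can keep `sum' unboxed in the loop.
>>> (: fl-dot ((Vectorof Flonum) (Vectorof Flonum) -> Flonum))
>>> (define (fl-dot xs ys)
>>>   (for/fold ([sum : Flonum 0.0])
>>>             ([x (in-vector xs)] [y (in-vector ys)])
>>>     (fl+ sum (fl* x y))))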
>>>
>>> Right now, `flmatrix*' is implemented a bit stupidly, so I could speed it up further. I won't yet, because I haven't settled on a type for matrices of unboxed flonums. The type has to work with LAPACK if it's installed, which `FlArray' doesn't do because its data layout is row-major and LAPACK expects column-major.
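>>>
>>> (For reference, the layouts differ only in indexing convention; a tiny sketch with 0-based indices:)
>>>
>>> #lang racket
>>> ;; Element (i, j) of an m-by-n matrix:
>>> (define (row-major-index i j m n) (+ (* i n) j)) ; FlArray's layout
>>> (define (col-major-index i j m n) (+ (* j m) i)) ; what LAPACK expects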
>>>
>>> I'll change `matrix*' to look like `real-matrix*'. It won't give the very best performance, but it's a 60x speedup for 1000x1000 matrices.
>>>
>>
>> These results look very promising, especially if, as you mentioned, `real-matrix*' will in the end automatically reach `flonum-matrix*' performance for Flonums, and `flmatrix*' will automatically switch to a LAPACK-based variant when one is available. For the latter it would be great if one could even change the library that is used, e.g., to redirect to an installation of Intel's highly efficient MKL library.
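>>
>> (Hypothetically, if the backend were loaded via ffi-lib, redirecting to MKL could be as simple as trying a different library name first, since MKL exports the standard LAPACK symbols -- just a sketch, not anything in math/matrix today:)
>>
>> #lang racket
>> (require ffi/unsafe)
>>
>> ;; Prefer Intel MKL's single dynamic library when present; otherwise
>> ;; fall back to the reference LAPACK. Both export e.g. `dgesv_'.
>> (define lapack
>>   (with-handlers ([exn:fail? (lambda (e) (ffi-lib "liblapack"))])
>>     (ffi-lib "libmkl_rt")))
>>
>> ;; Raises an error if the expected Fortran entry point is missing:
>> (void (ffi-obj-ref "dgesv_" lapack))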
>>
>> Looking forward to benchmarking it against NumPy and Mathematica (which is MKL-based) again!
>
> The next release candidate will have a `matrix*' that runs as fast as `flonum-matrix*' on (Matrix Flonum) arguments, and as fast as `real-matrix*' on (Matrix Real) arguments.
Cool!
> (Curiously, I wasn't able to speed up `flmatrix*' at all. The "a bit stupidly" implementation was actually the fastest I could do, which probably means it's as fast as it can be done in Racket without dropping to C.)
>
> Additionally, all the matrix functions will have more precise types to make it easy for TR to prove that operations on (Matrix Flonum) values yield (Matrix Flonum) values. (Currently, it only proves (Matrix Real) and (Matrix Number).) Example:
>
> > (:print-type (matrix-qr (matrix [[1.0 2.0] [3.0 4.0]])))
> (Values (Array Flonum) (Array Flonum))
> > (:print-type (matrix-qr (matrix [[1.0+0.0i 2.0+0.0i]
>                                    [3.0+0.0i 4.0+0.0i]])))
> (Values (Array Float-Complex) (Array Float-Complex))
>
> The current release candidate returns two (Array Real)s for the first and two (Array Number)s for the second instead.
>
> Since I've brought up decompositions, you should know that these, `matrix-inverse', `matrix-solve', and pretty much every O(n^3) operation except multiplication will be slow on large matrices. I can't think of a way to inline them in a way that TR can optimize. For fast versions, which will use LAPACK if it's installed, you'll have to wait for 5.3.3. :/
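>
> (A rough way to see the cost on a given machine -- just a sketch; size and timing will vary:)
>
> #lang typed/racket
> (require math/matrix)
>
> ;; A strictly diagonally dominant (hence invertible) 200x200 matrix:
> (define A
>   (build-matrix 200 200
>                 (lambda ([i : Index] [j : Index])
>                   (if (= i j) 400.0 1.0))))
>
> (time (void (matrix-inverse A)))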
Will the LAPACK bindings be in the git head sooner, maybe? ;-)
>
> Is the MKL library binary-compatible with LAPACK?
>
I have not tried it myself yet, but the Intel documentation says so:
"If your application already relies on the BLAS or LAPACK functionality, simply re-link with Intel® MKL to get better performance on Intel and compatible architectures."
http://software.intel.com/en-us/intel-mkl/
Berthold
> Neil ⊥
>