I'm researching methods for computing expensive vector operations in Java, e.g. dot-products or multiplications between large matrices. There are a few good threads on here on this topic, like this and this.
It appears that there is no reliable way of having the JIT compile code to use CPU vector instructions (SSE2, AVX, MMX...). Moreover, high-performance linear algebra libraries (ND4J, jblas, ...) do in fact make JNI calls to BLAS/LAPACK libraries for the core routines. And I understand BLAS/LAPACK packages to be the de facto standard choices for native linear algebra computations.
On the other hand, others (JAMA, ...) implement algorithms in pure Java without native calls.
My questions are:
- What are the best practices here?
- Is making native calls to BLAS/LAPACK actually a recommended choice? Are there other libraries worth considering?
- Is the overhead of JNI calls negligible compared to the performance gain? Does anyone have experience as to where the threshold lies (e.g. how small an input should be to make JNI calls more expensive than a pure Java routine)?
- How big is the portability tradeoff?
I hope this question could be of help both for those who develop their own computation routines, and for those who just want to make an educated choice between different implementations.
Insights are appreciated!
There are no clear best practices that cover every case. Whether you should use a pure Java solution (without SIMD instructions) or SIMD-optimized native code through JNI depends on your particular application, specifically on the size of your arrays and on any restrictions on the target system.
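Since the answer depends on array size, one way to get a feel for it is to time a pure Java baseline at several sizes and compare against whatever native-backed library you are considering. Below is a minimal sketch of such a baseline; the class name `DotBaseline` and its methods are made up for illustration, the timing loop is deliberately crude (a harness such as JMH would give more trustworthy numbers), and the native side is left out because it depends on the library you pick.

```java
public final class DotBaseline {
    // Written to after timing so the JIT cannot discard the loops as dead code.
    static volatile double sink;

    /** Plain scalar dot product; the JIT may or may not vectorize this loop. */
    public static double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Rough average nanoseconds per call for vectors of length n (reps must be > 0). */
    public static double nanosPerCall(int n, int reps) {
        double[] a = new double[n];
        double[] b = new double[n];
        java.util.Arrays.fill(a, 1.0);
        java.util.Arrays.fill(b, 2.0);
        double acc = 0.0;
        // Warm-up pass so the timed pass is more likely to run compiled code.
        for (int r = 0; r < reps; r++) {
            acc += dot(a, b);
        }
        long start = System.nanoTime();
        for (int r = 0; r < reps; r++) {
            acc += dot(a, b);
        }
        long elapsed = System.nanoTime() - start;
        sink = acc;
        return (double) elapsed / reps;
    }

    public static void main(String[] args) {
        // Sizes 16, 256, 4096, 65536, 1048576.
        for (int n = 16; n <= (1 << 20); n *= 16) {
            System.out.printf("n=%d: %.1f ns per call%n", n, nanosPerCall(n, 200));
        }
    }
}
```

Running the same sizes through a JNI-backed routine and comparing the per-call times should show roughly where the fixed call overhead stops mattering on your machine.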
Pertinent benchmarks have been performed (in no particular order):
These benchmarks can be as confusing as they are informative. One library may be faster for one operation and slower for another. Also keep in mind that more than one implementation of BLAS may be available for your system. I currently have three installed on mine: blas, atlas and openblas. So apart from choosing a Java library that wraps a BLAS implementation, you also have to choose the underlying BLAS implementation.
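Because the fastest choice can differ per operation and per input size, it may help to keep the library behind a small interface so that backends can be benchmarked and swapped without touching the calling code. The sketch below is only an illustration: `MatMul`, `PureJava` and `SizeDispatch` are hypothetical names, not part of any library, and the native backend is represented by whatever `MatMul` implementation you wrap around your chosen library.

```java
public final class BackendChoice {
    /** Hypothetical abstraction over "some way to multiply two matrices". */
    public interface MatMul {
        double[][] multiply(double[][] a, double[][] b);
    }

    /** Pure Java fallback: naive triple loop in i-p-j order for row-wise access. */
    public static final class PureJava implements MatMul {
        @Override
        public double[][] multiply(double[][] a, double[][] b) {
            if (a.length == 0 || b.length == 0 || a[0].length != b.length) {
                throw new IllegalArgumentException("incompatible shapes");
            }
            int n = a.length, k = b.length, m = b[0].length;
            double[][] c = new double[n][m];
            for (int i = 0; i < n; i++) {
                for (int p = 0; p < k; p++) {
                    double aip = a[i][p];
                    for (int j = 0; j < m; j++) {
                        c[i][j] += aip * b[p][j];
                    }
                }
            }
            return c;
        }
    }

    /** Routes small problems to one backend and large ones to another. */
    public static final class SizeDispatch implements MatMul {
        private final MatMul small;
        private final MatMul large;
        private final long threshold; // in multiply-adds; tune from your own benchmarks

        public SizeDispatch(MatMul small, MatMul large, long threshold) {
            this.small = small;
            this.large = large;
            this.threshold = threshold;
        }

        @Override
        public double[][] multiply(double[][] a, double[][] b) {
            if (a.length == 0 || b.length == 0) {
                throw new IllegalArgumentException("empty matrix");
            }
            long work = (long) a.length * b.length * b[0].length;
            return (work < threshold ? small : large).multiply(a, b);
        }
    }
}
```

The threshold value is an assumption to be measured, not a known constant; it will vary with the BLAS implementation, the wrapper and the hardware.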
This answer has a fairly up-to-date list, except that it doesn't mention nd4j, which is rather new. Keep in mind that jeigen depends on Eigen, not on BLAS.