I’ve inherited some code that was written about a year ago, so I guess back then it was using numpy 1.13 (now v1.15.2), scipy 1.0.0rc (now v1.1.0), and sklearn 0.19 (now v0.20.0).
It implements Fisher’s LDA to reduce an n-dimensional space to a 1…(n-1)-dimensional space, which produces a numpy array of complex numbers as its result (due to floating-point imprecision). That array is then cherry-picked and fed into sklearn.cluster.MeanShift, which immediately throws an exception:
File "/…/lib/python3.6/site-packages/sklearn/cluster/mean_shift_.py", line 416, in fit
X = check_array(X)
File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 531, in check_array
_ensure_no_complex_data(array)
File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 354, in _ensure_no_complex_data
"{}\n".format(array))
ValueError: Complex data not supported
I am still learning the mathematical details of what’s going on here, but it strikes me as odd that this code was declared “runnable”.
Am I missing something here? Have version changes brought about this regression, or is there a more fundamental code flaw? How would I go about fixing this issue?
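For reference, the exception is easy to reproduce on its own: check_array rejects any input with a complex dtype before MeanShift ever starts fitting, even if the imaginary parts are numerically negligible. A minimal sketch with hypothetical data:

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical data with a tiny imaginary component, as a complex
# eigendecomposition might produce due to floating-point imprecision.
X = np.array([[1.0 + 1e-12j, 2.0],
              [1.1, 2.1],
              [5.0, 6.0]])

try:
    MeanShift().fit(X)
except ValueError as e:
    print(e)  # complex data is rejected by check_array before fitting
```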
In comments/chat we have identified at least one problem, which is that the numerical eigendecomposition of

    inv(cov_w + I) @ cov_b                                (1)

is not real as it should be but returns significant imaginary components. Here @ is matrix multiplication, cov_w and cov_b are covariance matrices, and I is the identity matrix. This can be fixed by computing the matrix square root of (cov_w + I)^-1, let's call it SQ, and then using the fact that (1) is similar to

    SQ @ cov_b @ SQ                                       (2)

hence has the same eigenvalues, and if V are the eigenvectors of (2) then the (right) eigenvectors of (1) are SQ @ V.
What we have gained is that, because (2) is a symmetric matrix, its eigendecomposition can be computed using numpy.linalg.eigh, which guarantees purely real results. eigh can also be used to compute SQ, see here. Be sure to bypass the inverse and apply eigh directly on cov_w + I, or even cov_w.
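The procedure above can be sketched as follows (cov_w and cov_b are stand-in random covariance matrices; the variable names match the text, everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in symmetric positive-semidefinite covariance matrices.
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 5))
cov_w = A @ A.T
cov_b = B @ B.T
I = np.eye(5)

# SQ = (cov_w + I)^(-1/2) computed via eigh, bypassing the explicit
# inverse: cov_w + I = U diag(w) U^T, so SQ = U diag(w^-0.5) U^T.
w, U = np.linalg.eigh(cov_w + I)
SQ = U @ np.diag(w ** -0.5) @ U.T

# (2) SQ @ cov_b @ SQ is symmetric, so eigh guarantees real results.
eigvals, V = np.linalg.eigh(SQ @ cov_b @ SQ)

# The (right) eigenvectors of (1) inv(cov_w + I) @ cov_b are SQ @ V.
eigvecs = SQ @ V

assert np.isrealobj(eigvals) and np.isrealobj(eigvecs)
```

The resulting real array can then be fed to MeanShift without tripping the complex-data check.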