Python mean shift clustering of complex-number numpy array

I’ve inherited some code that was written about a year ago, so I guess back then it was using numpy 1.13 (now v1.15.2), scipy 1.00rc (now v1.1.0), and sklearn 0.19 (now v0.20.0).

It implements Fisher’s LDA to reduce an n-dimensional space to a 1…n-1 dimensional space, which produces a numpy array of complex numbers as its result (due to floating-point imprecision). That array is then cherry-picked and fed into sklearn.cluster.MeanShift, which immediately throws an exception:

  File "/…/lib/python3.6/site-packages/sklearn/cluster/mean_shift_.py", line 416, in fit
    X = check_array(X)
  File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 531, in check_array
    _ensure_no_complex_data(array)
  File "/…/lib/python3.6/site-packages/sklearn/utils/validation.py", line 354, in _ensure_no_complex_data
    "{}\n".format(array))
ValueError: Complex data not supported

I am still learning the mathematical details of what’s going on here, but it strikes me as odd that this code was declared “runnable”.

Am I missing something here? Have version changes brought about this regression, or is there a more fundamental code flaw? How would I go about fixing this issue?

There is 1 answer

Paul Panzer

In comments/chat we have identified at least one problem, which is that the numerical eigendecomposition of

(cov_w + I)^-1 @ cov_b                       (1)

is not real, as it should be, but returns significant imaginary components. Here @ is matrix multiplication, cov_w and cov_b are covariance matrices, and I is the identity matrix. This can be fixed by computing the matrix square root of (cov_w + I)^-1 (let's call it SQ) and then using the fact that (1) is similar to

SQ @ cov_b @ SQ                              (2)

and hence has the same eigenvalues; if V are the eigenvectors of (2), then the (right) eigenvectors of (1) are SQ @ V.
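For readers who want the similarity argument spelled out, here is a short derivation. The shorthand A for cov_w + I and S for SQ is introduced only for this block, and it assumes cov_w is a symmetric positive semidefinite covariance matrix, so that A is symmetric positive definite and S is symmetric:

```latex
\begin{align*}
A &= \mathrm{cov}_w + I, \qquad S = A^{-1/2} \quad (S = S^{\top},\; S^2 = A^{-1}),\\
A^{-1}\,\mathrm{cov}_b &= S\,S\,\mathrm{cov}_b = S\,\bigl(S\,\mathrm{cov}_b\,S\bigr)\,S^{-1},\\
\bigl(S\,\mathrm{cov}_b\,S\bigr)\,v = \lambda v
  \;&\Longrightarrow\;
  A^{-1}\,\mathrm{cov}_b\,(S v) = S\,\bigl(S\,\mathrm{cov}_b\,S\bigr)\,v = \lambda\,(S v).
\end{align*}
```

The second line is the similarity between (1) and (2); the third shows why multiplying an eigenvector of (2) by SQ gives an eigenvector of (1) with the same eigenvalue.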

What we have gained is that, because (2) is a symmetric matrix, its eigendecomposition can be computed using numpy.linalg.eigh, which guarantees purely real results. eigh can also be used to compute SQ, see here. Be sure to bypass the inverse and apply eigh directly to cov_w + I or even cov_w (the two share eigenvectors, and their eigenvalues differ by exactly 1).
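A minimal sketch of this recipe follows, using only numpy. The function name `lda_directions` and the random test data are made up for illustration and are not from the inherited code; the sketch also assumes cov_w and cov_b are real symmetric covariance matrices.

```python
import numpy as np

def lda_directions(cov_w, cov_b):
    """Real eigenpairs of (cov_w + I)^-1 @ cov_b, using eigh twice."""
    n = cov_w.shape[0]
    # eigh on the symmetric positive definite matrix cov_w + I
    # (no explicit inverse is formed anywhere).
    w, U = np.linalg.eigh(cov_w + np.eye(n))
    # SQ = (cov_w + I)^(-1/2) = U @ diag(1/sqrt(w)) @ U.T
    SQ = (U / np.sqrt(w)) @ U.T
    # Matrix (2); symmetrize to remove round-off asymmetry before eigh.
    M = SQ @ cov_b @ SQ
    M = (M + M.T) / 2
    evals, V = np.linalg.eigh(M)  # real eigenvalues, ascending order
    # Right eigenvectors of matrix (1) are SQ @ V.
    return evals, SQ @ V

# Hypothetical data, only to exercise the function.
rng = np.random.default_rng(0)
cov_w = np.cov(rng.normal(size=(50, 4)), rowvar=False)
cov_b = np.cov(rng.normal(size=(3, 4)), rowvar=False)  # rank-deficient

evals, vecs = lda_directions(cov_w, cov_b)

# Matrix (1), built here only to check the result.
A = np.linalg.solve(cov_w + np.eye(4), cov_b)
print(np.isrealobj(evals), np.isrealobj(vecs))
print(np.allclose(A @ vecs, vecs * evals))
```

Since eigh returns eigenvalues in ascending order, the most discriminative directions would be the last columns of `vecs`. Both returned arrays have a real dtype, so data projected onto them should no longer trip the complex-data check in sklearn.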