I'm pretty new to ML and data science, so my question may be a little silly. I have a dataset in which each row is a vector [a1,a2,a3,...,an]. The vectors differ not only in their component values but also in their length n and their sum A = a1 + a2 + a3 + ... + an.
Most of the vectors have 5-6 dimensions, with some exceptions at 15-20 dimensions. On average, their components have values around 40-50.
I have tried Kmeans, DBSCAN and GMM to cluster them:
- Kmeans overall gives the best result; however, it often misclassifies vectors with 2-3 dimensions and vectors with low A.
- DBSCAN can only separate the vectors with low dimension and low A from the rest of the dataset; everything else it treats as noise.
- GMM separates the vectors with 5-10 dimensions and low A very well, but performs poorly on the rest.
Now I want to include the information of n and A in the process. For example, Vector 1 = [0,1,2,1,0] and Vector 2 = [0,2,4,5,3,2,1,0] differ in both n and A, so they can't be in the same cluster. Each cluster should only contain vectors with similar (close) values of A and n, before their components are even taken into account.
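To make that concrete, here is a tiny snippet (just an illustration, not my actual pipeline) computing n and A for the two example vectors:

```python
import numpy as np

v1 = np.array([0, 1, 2, 1, 0])
v2 = np.array([0, 2, 4, 5, 3, 2, 1, 0])

# n = number of components, A = sum of components
print(len(v1), v1.sum())  # n=5, A=4
print(len(v2), v2.sum())  # n=8, A=17
# v1 and v2 differ in both n and A, so they should never share a cluster
```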
I'm using sklearn in Python, and I'd be glad to hear any suggestions and advice on this problem.
Your main problem is how to measure similarity.
I'm surprised you got the algorithms to run at all, because they usually expect all vectors to have exactly the same length for computing distances. Maybe your vectors were automatically padded with 0 values - and that is likely why the long vectors end up being very far away from all the others.
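Here is a quick sketch of what I suspect is happening, assuming the short vectors were zero-padded to the length of the longest one and plain Euclidean distance was used (both are assumptions; your preprocessing may differ):

```python
import numpy as np

short = np.array([0, 1, 2, 1, 0])               # n = 5
long_ = np.array([30, 40, 50, 60, 50, 40, 30])  # n = 7

# zero-pad the short vector to the common length
padded = np.pad(short, (0, len(long_) - len(short)))

# the extra components of the long vector are compared against padded zeros,
# so the Euclidean distance blows up even though the shapes are similar
print(np.linalg.norm(padded - long_))  # ~115, i.e. "very far away"
```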
Don't use the algorithms as black boxes
You need to understand what they are doing or the result will likely be useless. In your case, they are using a bad distance, so of course the result can't be very good.
So first, you'll need to find a better way of computing the distance between two points of different length. How similar should [0,1,2,1,0] and [30,40,50,60,50,40,30] be? To me, this is a highly similar pattern (ramp up, ramp down).
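As a starting point, you could define your own distance that compares the shape of two series (for example, after resampling them to a common length and normalizing by A), and optionally add penalties for differences in n and A. You can then feed a precomputed pairwise distance matrix to an algorithm in sklearn that accepts one, such as DBSCAN or AgglomerativeClustering. The sketch below is only one way to do this; the resampling length and the weights w_n and w_A are parameters I made up for illustration and would need tuning on your data:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def shape_distance(u, v, length=20, w_n=0.1, w_A=0.01):
    """Distance between two variable-length vectors: compare their shapes
    after resampling, plus penalties for differences in n and A."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)

    # resample both vectors to a common length so they are comparable point-wise
    grid = np.linspace(0, 1, length)
    ru = np.interp(grid, np.linspace(0, 1, len(u)), u)
    rv = np.interp(grid, np.linspace(0, 1, len(v)), v)

    # normalize by A so the shape comparison ignores overall magnitude
    ru /= max(u.sum(), 1e-9)
    rv /= max(v.sum(), 1e-9)

    shape = np.linalg.norm(ru - rv)
    return shape + w_n * abs(len(u) - len(v)) + w_A * abs(u.sum() - v.sum())

# build a full pairwise distance matrix and cluster on it
data = [[0, 1, 2, 1, 0],
        [30, 40, 50, 60, 50, 40, 30],
        [0, 2, 4, 5, 3, 2, 1, 0]]

D = np.array([[shape_distance(u, v) for v in data] for u in data])
labels = DBSCAN(eps=0.5, min_samples=1, metric="precomputed").fit_predict(D)
print(labels)
```

With this setup, eps operates on your custom distance, so you control directly how much n and A matter (via w_n and w_A) versus how much the shape of the series matters.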