KMeans for Sentence Embeddings


K-Means Clustering Between 2D NumPy Arrays

I have been looking for a solution for a while, and I sense there must be something silly I am missing, so here goes. I have obtained sentence embeddings after training an embedding layer built with Keras Sequential layers.

Dummy Example

Let's say we have embeddings that look like this:

Sentence 1: np.array([[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [9, 3], [5, 1]])

Sentence 2: np.array([[2, 5], [5, 7], [6, 5], [3, 1], [1, 1], [6, 2], [2, 1]])

Basically, given a file with several sentences, I want to cluster such sentence embeddings so that similar sentences end up in the same cluster.

I know this is the method we would use when each sample is a 1D feature vector:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1]])

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

I tried this:

x = np.array([ [[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] ,
               [[6, 5], [8, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] ])

kmeans = KMeans(n_clusters=k, random_state=0).fit(x)

which throws ValueError: Found array with dim 3. Estimator expected <= 2.

Is it even possible to do k-means clustering on such data, or is there another methodology I should follow?

The only solution I can think of is to average each sentence's word embeddings (and use np.squeeze to reduce each sentence to a 1D array) before clustering, but that would mean losing all the positional information about the words in a sentence.

"I am a dog" would be same as "Am I a dog" which is wrong

There are 2 answers

alankrit nirjhar (BEST ANSWER)

As correctly suggested by QUANG HOANG in the comments, the idea is simply to flatten each dense sentence embedding matrix.

As needed, this also keeps the positional information about the words intact!

sent1 = np.ndarray.flatten(np.array([[1, 3], [7, 5], [8, 1]]))
sent2 = np.ndarray.flatten(np.array([[3, 2], [4, 2], [2, 2]]))
sent3 = np.ndarray.flatten(np.array([[1, 1], [2, 7], [3, 5]]))
sent4 = np.ndarray.flatten(np.array([[1, 1], [2, 6], [3, 5]]))

X = np.array((sent1, sent2, sent3, sent4))

print (X)

Output:

array([[1, 3, 7, 5, 8, 1],
       [3, 2, 4, 2, 2, 2],
       [1, 1, 2, 7, 3, 5],
       [1, 1, 2, 6, 3, 5]])
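This X can then be passed straight to KMeans, since each row is now one sample. A minimal end-to-end sketch (note the caveat, not stated above, that flattening only lines up features across sentences when every sentence has the same number of words):

```python
import numpy as np
from sklearn.cluster import KMeans

# Flatten each (n_words, emb_dim) sentence matrix into one row vector
sent1 = np.array([[1, 3], [7, 5], [8, 1]]).flatten()
sent2 = np.array([[3, 2], [4, 2], [2, 2]]).flatten()
sent3 = np.array([[1, 1], [2, 7], [3, 5]]).flatten()
sent4 = np.array([[1, 1], [2, 6], [3, 5]]).flatten()

X = np.array([sent1, sent2, sent3, sent4])  # shape (4, 6): 4 samples, 6 features

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)
```

The nearly identical sent3 and sent4 land in the same cluster, which is the behavior the asker wanted.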
Amandeep
x = np.array([ [[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] , 
               [[6, 5], [8, 1], [7, 4],[8, 1], [5, 4], [11, 3], [5, 1]] ])

With reference to this, I am guessing the problem is that scikit-learn's fit expects a 2D NumPy array for the training data, but the array you are passing in is 3D, so you need to reshape it into 2D.
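The reshape suggested above can be sketched like this, using the array from the question; collapsing the last two axes with reshape is equivalent to flattening each sentence:

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([[[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]],
              [[6, 5], [8, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]]])
print(x.shape)  # (2, 7, 2): 2 sentences, 7 words, 2-dim embeddings

# Collapse the word and embedding axes so each sentence is one row
x2d = x.reshape(x.shape[0], -1)  # shape (2, 14)

# Now fit works, since the estimator sees a 2D (n_samples, n_features) array
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(x2d)
```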