KMeans for Sentence Embeddings


K-Means Clustering Between 2D NumPy Arrays

I have been looking for a solution for a while, and I sense there must be something silly I am missing, so here goes. I have obtained sentence embeddings after training an embedding layer built with Keras Sequential layers.

Dummy Example

Let's say we have embeddings that look like this:

Sentence 1: np.array([[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [9, 3], [5, 1]])

Sentence 2: np.array([[2, 5], [5, 7], [6, 5], [3, 1], [1, 1], [6, 2], [2, 1]])

Basically, given a file with several sentences, I want to cluster such sentence embeddings so that similar sentences end up in the same cluster.

I know this is the method we would use when each sample is a 1D feature vector:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1]])

kmeans = KMeans(n_clusters=3, random_state=0).fit(X)

I tried this:

x = np.array([ [[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] ,
               [[6, 5], [8, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] ])

kmeans = KMeans(n_clusters=k, random_state=0).fit(x)

which throws ValueError: Found array with dim 3. Estimator expected <= 2.

Is it even possible to do k-means clustering on such data, or is there another methodology I should follow?

The only solution I can think of is to average each sentence's word embeddings (and use np.squeeze to reduce each sentence to a 1D array) before clustering, but that would mean losing all the positional information about the words in a sentence.

"I am a dog" would be same as "Am I a dog" which is wrong

There are 2 answers

alankrit nirjhar (BEST ANSWER)

As correctly suggested by QUANG HOANG in the comments, the idea is simply to flatten each dense sentence embedding matrix.

As needed, this also keeps the positional information about the words intact!

sent1 = np.ndarray.flatten(np.array([[1, 3], [7, 5], [8, 1]]))
sent2 = np.ndarray.flatten(np.array([[3, 2], [4, 2], [2, 2]]))
sent3 = np.ndarray.flatten(np.array([[1, 1], [2, 7], [3, 5]]))
sent4 = np.ndarray.flatten(np.array([[1, 1], [2, 6], [3, 5]]))

X = np.array((sent1, sent2, sent3, sent4))

print (X)

Output:

array([[1, 3, 7, 5, 8, 1],
       [3, 2, 4, 2, 2, 2],
       [1, 1, 2, 7, 3, 5],
       [1, 1, 2, 6, 3, 5]])
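This X can then be passed straight to KMeans, since each row is now one sample. A minimal end-to-end sketch (note the caveat, not stated above, that flattening only lines up features across sentences when every sentence has the same number of words):

```python
import numpy as np
from sklearn.cluster import KMeans

# Flatten each (n_words, emb_dim) sentence matrix into one row vector
sent1 = np.array([[1, 3], [7, 5], [8, 1]]).flatten()
sent2 = np.array([[3, 2], [4, 2], [2, 2]]).flatten()
sent3 = np.array([[1, 1], [2, 7], [3, 5]]).flatten()
sent4 = np.array([[1, 1], [2, 6], [3, 5]]).flatten()

X = np.array([sent1, sent2, sent3, sent4])  # shape (4, 6): 4 samples, 6 features

kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(kmeans.labels_)
```

The nearly identical sent3 and sent4 land in the same cluster, which is the behavior the asker wanted.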
Amandeep
x = np.array([ [[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]] , 
               [[6, 5], [8, 1], [7, 4],[8, 1], [5, 4], [11, 3], [5, 1]] ])

With reference to this, I am guessing the problem is that scikit-learn's fit expects a 2D NumPy array for the training data, but the array you are passing in is 3D, so you need to reshape it into 2D.
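The reshape suggested above can be sketched like this, using the array from the question; collapsing the last two axes with reshape is equivalent to flattening each sentence:

```python
import numpy as np
from sklearn.cluster import KMeans

x = np.array([[[6, 2], [3, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]],
              [[6, 5], [8, 1], [7, 4], [8, 1], [5, 4], [11, 3], [5, 1]]])
print(x.shape)  # (2, 7, 2): 2 sentences, 7 words, 2-dim embeddings

# Collapse the word and embedding axes so each sentence is one row
x2d = x.reshape(x.shape[0], -1)  # shape (2, 14)

# Now fit works, since the estimator sees a 2D (n_samples, n_features) array
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(x2d)
```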