Sentence Embedding Clustering


I am working on a small project in which I need to eliminate irrelevant information (ads, for instance) from the HTML content I extracted from websites. Since I am a beginner in NLP, I came up with a simple approach after doing some research.

The websites are mainly in Chinese, and I stored each sentence (separated by commas) in a list. I used HanLP to do semantic parsing on my sentences, producing something like this:

[['萨哈夫', '说', ',', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大', '规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'], 
 ['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。']]

I found a pretrained Chinese word embedding database and used it to look up the word embeddings for the tokens in my list. My approach is then to get a sentence embedding by taking the element-wise average of the word embeddings in that sentence. I now have a list containing a sentence embedding vector for each sentence I parsed, for example:

sentence: ['各国', '必须', '“', '大', '规模', '”', '支出', '》', '的', '报道', '称']
sentence embedding: [0.08130878633396192, -0.07660450288941237, 0.008989107615145093, 0.07014013996178453, 0.028158639980988068, 0.01821030060422014, 0.017793822186914356, 0.04148909364911643, 0.019383941353722053, 0.03080177273262631, -0.025636445207055658, -0.019274188523096116, 0.0007501963356679136, 0.00476544528183612, -0.024648051539605313, -0.011124626140702854, -0.0009071269834583455, -0.08850407109341839, 0.016131568784740837, -0.025241035714068195, -0.041586867829954084, -0.0068722023954085835, -0.010853541125966744, 0.03994347004812549, 0.04977656596086242, 0.029051605612039566, -0.031031965550606732, 0.05125975541093133, 0.02666312647687102, 0.0376262941096105, -0.00833959155716002, 0.035523645325817844, -0.0026961421932686458, 0.04742895790629766, -0.07069634984840047, -0.054931600324132225, 0.0727336619218642, 0.0434290729039772, -0.09277284060689536, -0.020194332538680596, 0.0011523241092535582, 0.035080605863847515, 0.13034072890877724, 0.06350403482263739, -0.04108352984555743, 0.03208382343026725, -0.08344872626052662, -0.14081071757457472, -0.010535095733675089, -0.04253014939075166, -0.06409504175694151, 0.04499104322696274, -0.1153958263722333, 0.011868207969448784, 0.032386500388383865, -0.0036963022192305125, 0.01861521213802255, 0.05440248447385701, 0.026148285970769146, 0.011136160687204789, 0.04259885661303997, 0.09219381585717201, 0.06065366725141013, -0.015763109010136264, -0.0030524068596688185, 0.0031816939061338253, -0.01272551697382534, 0.02884035756472837, -0.002176688645373691, -0.04119681418788704, -0.08371328799562021, 0.007803680078888481, 0.0917377421124415, 0.027042210250246255, -0.0168504383076321, -0.0005781924013387073, 0.0075592477594248276, 0.07226487367667934, 0.005541681396690282, 0.001809495755217292, 0.011297995647923513, 0.10331092673269185, 0.0034428672357039018, 0.07364177612841806, 0.03861967177892273, -0.051503680434755304, -0.025596174390309236, 0.014137779785828157, -0.08445698734034192, -0.07401955000717532, 0.05168289600194178, -0.019313615386966954, 0.007136409255591306, -0.042960755484686655, 0.01830706542188471, -0.001172357662157579, -0.008949846103364094, -0.02356141348454085, -0.05277112944432619, 0.006653293967247009, -0.00572453092106364, 0.049479073389771984, -0.03399876727913083, 0.029434629207984966, -0.06990156170319427, 0.0924786920659244, 0.015472117049450224, -0.10265431468459693, -0.023421658562834968, 0.004523425542918796, -0.008990391665561632, -0.06445665437389504, 0.03898039324717088, -0.025552247142927212, 0.03958867977119305, -0.03243451675569469, -0.03848901360338046, -0.061713250523263756, -0.00904815017499707, -0.03730008362750099, 0.02715366007760167, -0.08498009599067947, -0.00397337388924577, -0.0003402943098494275, 0.008005982349542055, 0.05871503853069788, -0.013795949010686441, 0.007956360128115524, -0.024331797295334665, 0.03842244771393863, -0.04393653944134712, 0.02677931230176579, 0.07715398648923094, -0.048624055216681554, -0.11324723844882101, -0.08751555024222894, -0.02469049582511864, -0.08767948790707371, -0.021930147846102376, 0.011519658294591036, -0.08155732788145542, -0.10763703049583868, -0.07967398501932621, -0.03249315629628571, 0.02701333300633864, -0.015305672687563028, 0.002375963249836456, 0.012275356545367024, -0.02917095824060115, 0.02626959386874329, -0.0158629031767222, -0.05546591058373451, -0.023678493686020374, -0.048296650278974666, -0.06167154920033433, 0.004435380412773652, 0.07418209609617903, 0.03524015434297987, 0.063185997529548, -0.05814945189790292, 
0.13036084697920491, -0.03370768073099581, 0.03256692289671099, 0.06808869439092549, 0.0563600350340659, 5.7854774323376745e-05, -0.0793171048333699, 0.03862177783792669, 0.007196083004766313, 0.013824320821599527, 0.02798982642707415, -0.00918149473992261, -0.00839392692697319, 0.040496235374699936, -0.007375971498814496, -0.03586547057652338, -0.03411220566538924, -0.025101724758066914, -0.005714270286262035, 0.07351569867354225, -0.024216756182299418, 0.0066968070935796604, -0.032809603959321976, 0.05006068360737779, 0.0504626590250568, 0.04525104385208, -0.027629732069644062, 0.10429493219337681, -0.021474285961382768, 0.018212029964409092, 0.07260083373297345, 0.026920156976716084, 0.043199389770796355, -0.03641596379351209, 0.0661080302670598, 0.09141866947439584, 0.0157452768815512, -0.04552285996297459, -0.03509725736115466, 0.02857604629190808]
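In case it helps, this is roughly how I compute the average (a simplified sketch; `embedding_lookup` and `parsed_sentences` are stand-in names for my pretrained embedding table and my list of tokenized sentences):

import numpy as np

def sentence_embedding(tokens, embedding_lookup, dim):
    # keep only tokens that exist in the pretrained embedding database
    vectors = [embedding_lookup[w] for w in tokens if w in embedding_lookup]
    if not vectors:
        return np.zeros(dim)          # fallback for sentences with no known words
    return np.mean(vectors, axis=0)   # element-wise average over the word vectors

embedding_dim = 200  # dimensionality of the pretrained word vectors (whatever the database provides)
sentence_vectors = [sentence_embedding(s, embedding_lookup, embedding_dim) for s in parsed_sentences]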

My next step is to cluster these sentence embedding vectors and find the sentences whose content is clearly irrelevant compared to the others.

Does my approach even make sense? If it does, what tools can I use to cluster my sentence embedding vectors? I saw there are approaches such as k-means or computing L2 distances, but I am not sure how to implement them.

Thanks!


There are 2 answers

roddar92

For clustering you can try k-means, but this algorithm uses only the Euclidean metric. If you want another distance (e.g. cosine distance), k-medoids, which is also an EM-style algorithm, is suitable. In Python, you can find KMeans in the scikit-learn library. To try KMedoids, you should install the scikit-learn-extra library (https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html) or this one: https://github.com/letiantian/kmedoids
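A minimal sketch of both options, assuming `sentence_vectors` is the list of sentence embedding vectors from the question (the number of clusters is just an example and has to be tuned):

import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

X = np.array(sentence_vectors)   # shape: (num_sentences, embedding_dim)

# k-means, Euclidean distance only
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.labels_)

# k-medoids with cosine distance
kmedoids = KMedoids(n_clusters=2, metric='cosine', random_state=0).fit(X)
print(kmedoids.labels_)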

ATIF ADIB

The approach makes sense if you are trying to get rid of sentences that do not contribute to the downstream analysis, but an element-wise average may not be the best way to construct the sentence embeddings. A better way is to take the individual word embeddings and combine them using tf-idf weights:

import numpy as np

sentence = [w1, w2, w3]                   # tokens of one sentence
word_vectors = [v1, v2, v3]               # each v is of shape (N,), where N is the size of the embedding

term_frequency_of_word = [t1, t2, t3]     # tf of each word in this sentence
inverse_doc_freq = [idf1, idf2, idf3]     # idf of each word over the whole corpus

# tf-idf weight of each word
word_weights = [tf * idf for tf, idf in zip(term_frequency_of_word, inverse_doc_freq)]

# weighted sum of the word vectors
sentence_vector = np.zeros(N)
for weight, vector in zip(word_weights, word_vectors):
    sentence_vector += weight * vector

By applying tf-idf scaling, your sentence embedding will move towards the embeddings of the most important word(s) in the sentence, which might help when you apply clustering algorithms to filter out unwanted sentences.
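If you don't want to compute the tf and idf values by hand, scikit-learn's TfidfVectorizer can produce them for you (a sketch, assuming `tokenized_sentences` is the list of token lists from the question):

from sklearn.feature_extraction.text import TfidfVectorizer

# the sentences are already tokenized, so pass the token lists through unchanged
vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf_matrix = vectorizer.fit_transform(tokenized_sentences)

vocab = vectorizer.vocabulary_   # token -> column index
idf = vectorizer.idf_            # idf value for each column

# tf-idf weight of a particular token (e.g. '伊拉克') in the first sentence
weight = tfidf_matrix[0, vocab['伊拉克']]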

Here is a quick tutorial on TF-IDF: http://www.tfidf.com