I am working on a small project in which I need to eliminate irrelevant information (ads, for instance) from the HTML content I extracted from websites. Since I am a beginner in NLP, I came up with a simple approach after doing some research.
The language used on the websites is mainly Chinese, and I stored each sentence (split on commas) in a list. I used a library called HanLP to segment my sentences into words. Something like this:
[['萨哈夫', '说', ',', '伊拉克', '将', '同', '联合国', '销毁', '伊拉克', '大', '规模', '杀伤性', '武器', '特别', '委员会', '继续', '保持', '合作', '。'],
['上海', '华安', '工业', '(', '集团', ')', '公司', '董事长', '谭旭光', '和', '秘书', '张晚霞', '来到', '美国', '纽约', '现代', '艺术', '博物馆', '参观', '。']]
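(For reference, I get these token lists with a call roughly like the one below; this is a sketch using the pyhanlp wrapper, and the exact API may differ depending on the HanLP version you have installed.)

```python
# A sketch using the pyhanlp wrapper (HanLP 1.x); the exact call differs
# between HanLP versions, so treat this as illustrative.
from pyhanlp import HanLP

text = '萨哈夫说,伊拉克将同联合国销毁伊拉克大规模杀伤性武器特别委员会继续保持合作。'
tokens = [term.word for term in HanLP.segment(text)]  # list of segmented words
```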
I found a pretrained Chinese word embedding database and looked up the embedding of each word in my lists. My approach is then to get a sentence embedding by taking the element-wise average of the word embeddings in that sentence. Now I have a list with the sentence embedding vector of each sentence I parsed. For example:
sentence: ['各国', '必须', '“', '大', '规模', '”', '支出', '》', '的', '报道', '称']
sentence embedding: [0.08130878633396192, -0.07660450288941237, 0.008989107615145093, 0.07014013996178453, 0.028158639980988068, 0.01821030060422014, 0.017793822186914356, 0.04148909364911643, 0.019383941353722053, 0.03080177273262631, -0.025636445207055658, -0.019274188523096116, 0.0007501963356679136, 0.00476544528183612, -0.024648051539605313, -0.011124626140702854, -0.0009071269834583455, -0.08850407109341839, 0.016131568784740837, -0.025241035714068195, -0.041586867829954084, -0.0068722023954085835, -0.010853541125966744, 0.03994347004812549, 0.04977656596086242, 0.029051605612039566, -0.031031965550606732, 0.05125975541093133, 0.02666312647687102, 0.0376262941096105, -0.00833959155716002, 0.035523645325817844, -0.0026961421932686458, 0.04742895790629766, -0.07069634984840047, -0.054931600324132225, 0.0727336619218642, 0.0434290729039772, -0.09277284060689536, -0.020194332538680596, 0.0011523241092535582, 0.035080605863847515, 0.13034072890877724, 0.06350403482263739, -0.04108352984555743, 0.03208382343026725, -0.08344872626052662, -0.14081071757457472, -0.010535095733675089, -0.04253014939075166, -0.06409504175694151, 0.04499104322696274, -0.1153958263722333, 0.011868207969448784, 0.032386500388383865, -0.0036963022192305125, 0.01861521213802255, 0.05440248447385701, 0.026148285970769146, 0.011136160687204789, 0.04259885661303997, 0.09219381585717201, 0.06065366725141013, -0.015763109010136264, -0.0030524068596688185, 0.0031816939061338253, -0.01272551697382534, 0.02884035756472837, -0.002176688645373691, -0.04119681418788704, -0.08371328799562021, 0.007803680078888481, 0.0917377421124415, 0.027042210250246255, -0.0168504383076321, -0.0005781924013387073, 0.0075592477594248276, 0.07226487367667934, 0.005541681396690282, 0.001809495755217292, 0.011297995647923513, 0.10331092673269185, 0.0034428672357039018, 0.07364177612841806, 0.03861967177892273, -0.051503680434755304, -0.025596174390309236, 0.014137779785828157, -0.08445698734034192, -0.07401955000717532, 0.05168289600194178, -0.019313615386966954, 0.007136409255591306, -0.042960755484686655, 0.01830706542188471, -0.001172357662157579, -0.008949846103364094, -0.02356141348454085, -0.05277112944432619, 0.006653293967247009, -0.00572453092106364, 0.049479073389771984, -0.03399876727913083, 0.029434629207984966, -0.06990156170319427, 0.0924786920659244, 0.015472117049450224, -0.10265431468459693, -0.023421658562834968, 0.004523425542918796, -0.008990391665561632, -0.06445665437389504, 0.03898039324717088, -0.025552247142927212, 0.03958867977119305, -0.03243451675569469, -0.03848901360338046, -0.061713250523263756, -0.00904815017499707, -0.03730008362750099, 0.02715366007760167, -0.08498009599067947, -0.00397337388924577, -0.0003402943098494275, 0.008005982349542055, 0.05871503853069788, -0.013795949010686441, 0.007956360128115524, -0.024331797295334665, 0.03842244771393863, -0.04393653944134712, 0.02677931230176579, 0.07715398648923094, -0.048624055216681554, -0.11324723844882101, -0.08751555024222894, -0.02469049582511864, -0.08767948790707371, -0.021930147846102376, 0.011519658294591036, -0.08155732788145542, -0.10763703049583868, -0.07967398501932621, -0.03249315629628571, 0.02701333300633864, -0.015305672687563028, 0.002375963249836456, 0.012275356545367024, -0.02917095824060115, 0.02626959386874329, -0.0158629031767222, -0.05546591058373451, -0.023678493686020374, -0.048296650278974666, -0.06167154920033433, 0.004435380412773652, 0.07418209609617903, 0.03524015434297987, 0.063185997529548, -0.05814945189790292, 
0.13036084697920491, -0.03370768073099581, 0.03256692289671099, 0.06808869439092549, 0.0563600350340659, 5.7854774323376745e-05, -0.0793171048333699, 0.03862177783792669, 0.007196083004766313, 0.013824320821599527, 0.02798982642707415, -0.00918149473992261, -0.00839392692697319, 0.040496235374699936, -0.007375971498814496, -0.03586547057652338, -0.03411220566538924, -0.025101724758066914, -0.005714270286262035, 0.07351569867354225, -0.024216756182299418, 0.0066968070935796604, -0.032809603959321976, 0.05006068360737779, 0.0504626590250568, 0.04525104385208, -0.027629732069644062, 0.10429493219337681, -0.021474285961382768, 0.018212029964409092, 0.07260083373297345, 0.026920156976716084, 0.043199389770796355, -0.03641596379351209, 0.0661080302670598, 0.09141866947439584, 0.0157452768815512, -0.04552285996297459, -0.03509725736115466, 0.02857604629190808]
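Concretely, the averaging step looks roughly like this (a sketch assuming `embeddings` is a dict-like lookup from word to its pretrained vector, e.g. loaded from the database):

```python
import numpy as np

# A sketch of the averaging step; `embeddings` is a hypothetical dict-like
# lookup from word to its pretrained vector (a NumPy array).
def sentence_embedding(tokens, embeddings):
    # Skip tokens without a pretrained vector (punctuation, OOV words).
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    if not vectors:
        return None  # sentence contains no known words
    return np.mean(vectors, axis=0)  # element-wise average
```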
My next step is to cluster these sentence embedding vectors and find the sentences whose content is clearly irrelevant compared to the others.
Does my approach even make sense? If it does, what tools can I use to cluster my sentence embedding values? I saw there are approaches such as k-means or calculating L2 distances, but I am not sure how to implement them.
Thanks!
For clustering you can try k-means, but this algorithm uses only the Euclidean metric. To use another distance (e.g., cosine distance), k-medoids is also suitable; like k-means, it is an EM-style algorithm. In Python, you can find KMeans in the scikit-learn library. In order to try KMedoids, you should install the scikit-learn-extra library (https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html) or this one: https://github.com/letiantian/kmedoids
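A minimal sketch of both options, assuming `sentence_vectors` is the list of sentence embedding vectors computed in the question:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids  # pip install scikit-learn-extra

# `sentence_vectors` is assumed to be the list of sentence embeddings
# from the question (one fixed-length vector per sentence).
X = np.array(sentence_vectors)

# k-means: Euclidean distance only.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-medoids: supports other metrics, e.g. cosine distance.
kmedoids_labels = KMedoids(n_clusters=2, metric="cosine", random_state=0).fit_predict(X)

print(kmeans_labels)   # cluster index per sentence
print(kmedoids_labels)
```

Sentences that end up in a small or distant cluster could then be candidates for the irrelevant content you want to filter out.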