N-gram creation for TfidfVectorizer by removing words that are not present in the LM model's vectors


I want to cluster 160,000 documents of variable lengths.

Problem: the spaCy LM model "en_core_web_lg" doesn't have vectors for all of the words present in my documents. N-grams built from these documents therefore contain out-of-vocabulary words, which affects the vector of the n-gram.
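For reference, a rough way to check vector coverage with spaCy (a sketch; docs stands for the cleaned document list, and sampling 1,000 documents is an arbitrary choice):

import spacy

nlp = spacy.load('en_core_web_lg')

missing, total = set(), 0
for doc in docs[:1000]:          # sample; the full corpus has 160,000 documents
    for token in doc.split():
        total += 1
        if not nlp.vocab.has_vector(token):
            missing.add(token)

print(f"{len(missing)} distinct out-of-vocabulary tokens out of {total} sampled")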

Solution I tried: I overrode the _word_ngrams method of TfidfVectorizer to handle this.

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer


class NewTfidfVectorizer(TfidfVectorizer):

    def _word_ngrams(self, tokens, stop_words=None):
        """Turn tokens into a sequence of n-grams after stop-word filtering,
        dropping words that are not in the LM model's vectors."""
        nlp = spacy.load('en_core_web_lg')
        tokens = super(TfidfVectorizer, self)._word_ngrams(tokens, None)

        # Keep only the unigrams that have a vector in the spaCy model.
        new_tokens = []
        for token in tokens:
            split_words = token.split(' ')
            if len(split_words) == 1:
                if nlp.vocab.has_vector(token):
                    new_tokens.append(token)

        # Rebuild every n-gram from the surviving unigrams only.
        for token in tokens:
            split_words = token.split(' ')
            new_words = []
            for word in split_words:
                if word in new_tokens:
                    new_words.append(word)
            new_tokens.append(' '.join(new_words))

        # Remove duplicates and return a stable ordering.
        return sorted(set(new_tokens))

New problem: it takes a significant amount of time just to fit this:

NGRAM_RANGE = (1, 3)
tfidf_vectorizer = NewTfidfVectorizer(analyzer='word', norm=None, ngram_range=NGRAM_RANGE, stop_words='english', use_idf=True, smooth_idf=True)
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)  # sparse matrix
vocab = tfidf_vectorizer.get_feature_names_out()
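I suspect part of the cost is that en_core_web_lg is reloaded inside _word_ngrams for every document. A minimal sketch of the same override with the model loaded once and the per-word vector lookup memoized (the class and helper names are my own, and I have not benchmarked this):

# Sketch only (not benchmarked): the spaCy model is loaded a single time and
# the per-word vector check is memoized.
from functools import lru_cache

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

_NLP = spacy.load('en_core_web_lg')   # loaded once, reused for every document


@lru_cache(maxsize=None)
def _has_vector(word):
    return _NLP.vocab.has_vector(word)


class CachedTfidfVectorizer(TfidfVectorizer):

    def _word_ngrams(self, tokens, stop_words=None):
        tokens = super()._word_ngrams(tokens, stop_words)
        # Unigrams that the LM model actually has a vector for.
        kept = {t for t in tokens if ' ' not in t and _has_vector(t)}
        new_tokens = list(kept)
        # Rebuild the longer n-grams from the surviving unigrams only.
        for token in tokens:
            if ' ' in token:
                new_tokens.append(' '.join(w for w in token.split(' ') if w in kept))
        return sorted(set(new_tokens))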
My pipeline:

  • Step 1: preprocess the data: remove stopwords, HTML and XML tags, and apply stemming. Stopwords are removed with the NLTK stopword list plus a personal list of stopwords.

  • Step 2: TF-IDF fitting, using ngram_range=(1, 3)

  • Step 3: normalize the TF-IDF matrix (weighted_tfidf): this step can be skipped

  • Step 4: create a dict in this format (see the sketch after this list): {"doc1": [("word1", 2.45), ("word2", 3.93454)], "doc2": [("word5", 1.395), ("word9", 4.2455)]}

  • Step 5: take the weighted average of the word vectors, weighted by the normalized TF-IDF values, e.g. (vector1*2 + vector2*3 + vector3*1) / 6

  • Step 6: Clustering using HDBSCAN
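For reference, a rough sketch of what steps 4-6 look like in code. The loop structure and variable names are my own illustration (not optimized); tfidf_matrix and vocab come from the fitting snippet above, nlp is the loaded en_core_web_lg model, and the min_cluster_size value is arbitrary.

import numpy as np
import hdbscan

# Step 4: per-document {term: tfidf weight} mapping, read off the sparse matrix.
doc_weights = {}
for i in range(tfidf_matrix.shape[0]):
    row = tfidf_matrix.getrow(i)
    doc_weights[f"doc{i + 1}"] = [(vocab[j], row[0, j]) for j in row.indices]

# Step 5: weighted average of word vectors, weights taken from the TF-IDF row.
dim = nlp.vocab.vectors_length   # 300 for en_core_web_lg
doc_vectors = np.zeros((tfidf_matrix.shape[0], dim))
for i, (doc_id, pairs) in enumerate(doc_weights.items()):
    total_weight = sum(w for _, w in pairs)
    if total_weight == 0:
        continue
    acc = np.zeros(dim)
    for term, weight in pairs:
        # An n-gram term is represented by the mean of its word vectors.
        words = [w for w in term.split(' ') if nlp.vocab.has_vector(w)]
        if words:
            acc += weight * np.mean([nlp.vocab.get_vector(w) for w in words], axis=0)
    doc_vectors[i] = acc / total_weight

# Step 6: cluster the dense document vectors with HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, metric='euclidean')
labels = clusterer.fit_predict(doc_vectors)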

A few other problems:

Step 5 is very compute-intensive, even though I use a sparse matrix for the TF-IDF values at every point.

The data contains people and company names, and these tend to affect cluster formation.
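For illustration only (this is not part of the pipeline above, just an assumption about a possible mitigation): person and organisation names could be masked with the NER component that ships with en_core_web_lg before vectorizing, e.g.:

# Sketch: drop PERSON/ORG tokens before TF-IDF. Running the full spaCy pipeline
# over 160,000 documents is itself expensive, so this is only an illustration.
def strip_names(text, nlp):
    doc = nlp(text)
    return ' '.join(tok.text for tok in doc if tok.ent_type_ not in {"PERSON", "ORG"})

cleaned_docs = [strip_names(d, nlp) for d in docs]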


There are 0 answers