CountVectorizer to build dictionary for removing extra words


I have a list of sentences within a pandas column:

sentence
I am writing on Stackoverflow because I cannot find a solution to my problem.
I am writing on Stackoverflow. 
I need to show some code. 
Please see the code below

I would like to run some text mining and analysis through them, for example to get the word frequency. To do it, I am using this approach:

from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)

How can I apply it to my column, removing extra stopwords after building the vocabulary?

1 Answer

Sergey Bushmanov (best answer):

You can make use of the stop_words parameter in CountVectorizer, which will take care of removing stop words:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
stopwords = stopwords.words("english") # you may add or define your stopwords here
vectorizer = CountVectorizer(stop_words=stopwords)
vectorizer.fit_transform(text)

If you want to do all the preprocessing within a pandas DataFrame:

import pandas as pd
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem.", "I am writing on Stackoverflow."]
df = pd.DataFrame({"text": text})
stopwords = stopwords.words("english") # you may add or define your stopwords here
vectorizer = CountVectorizer(stop_words=stopwords)
df["counts"] = vectorizer.fit_transform(df["text"]).todense().tolist()
df
                                                text              counts
0  I am writing on Stackoverflow because I cannot...  [1, 1, 1, 1, 1, 1]
1                     I am writing on Stackoverflow.  [0, 0, 0, 0, 1, 1]

In both cases you end up with a vocabulary from which the stop words have been removed:

print(vectorizer.vocabulary_)
{'writing': 5, 'stackoverflow': 4, 'cannot': 0, 'find': 1, 'solution': 3, 'problem': 2}
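
The "you may add or define your stopwords here" comment above amounts to extending the list before passing it in. A sketch with a hypothetical hand-written base list (in practice you would start from NLTK's or scikit-learn's):

```python
from sklearn.feature_extraction.text import CountVectorizer

# hypothetical base list; extend it with domain-specific words to drop
base = ["i", "am", "on", "to", "my", "a", "because"]
custom = base + ["stackoverflow", "writing"]

text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
vectorizer = CountVectorizer(stop_words=custom)
vectorizer.fit_transform(text)
print(vectorizer.vocabulary_)  # 'stackoverflow' and 'writing' no longer appear
```

Stop words are matched after lowercasing, so the entries in the list should be lowercase.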