I have a list of sentences within a pandas column:
sentence
I am writing on Stackoverflow because I cannot find a solution to my problem.
I am writing on Stackoverflow.
I need to show some code.
Please see the code below
I would like to run some text mining and analysis on them, for example to get the word frequency. To do this, I am using the following approach:
from sklearn.feature_extraction.text import CountVectorizer
# list of text documents
text = ["I am writing on Stackoverflow because I cannot find a solution to my problem."]
vectorizer = CountVectorizer()
# tokenize and build vocab
vectorizer.fit(text)
How can I apply it to my column, removing extra stopwords after building the vocabulary?
You can use the stop_words parameter of CountVectorizer, which takes care of removing stop words when the vocabulary is built. Alternatively, you can do all the preprocessing within the pandas DataFrame. In both cases you end up with a vocabulary with the stop words removed.
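A minimal sketch of both approaches, assuming the column is named sentence as in the question and that the extra stop word to drop is "stackoverflow" (a hypothetical choice, used only for illustration). The names extra_stop_words, word_freq and word_freq_pandas are likewise made up for this example. The pandas variant reuses CountVectorizer's default token pattern so that the two results are comparable.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# The sentences from the question, in a DataFrame column named "sentence".
df = pd.DataFrame({"sentence": [
    "I am writing on Stackoverflow because I cannot find a solution to my problem.",
    "I am writing on Stackoverflow.",
    "I need to show some code.",
    "Please see the code below",
]})

# Assumption: "stackoverflow" is the extra stop word (hypothetical example).
extra_stop_words = ["stackoverflow"]
# stop_words is passed as a list: built-in English words plus the extras.
stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

# Approach 1: let CountVectorizer drop the stop words while building the vocab.
vectorizer = CountVectorizer(stop_words=stop_words)
counts = vectorizer.fit_transform(df["sentence"])  # sparse document-term matrix
totals = np.asarray(counts.sum(axis=0)).ravel()    # total count per vocab index
word_freq = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}

# Approach 2: do the same preprocessing with pandas string methods.
tokens = (
    df["sentence"]
    .str.lower()
    .str.findall(r"\b\w\w+\b")  # same pattern as CountVectorizer's default
    .explode()                  # one token per row
)
word_freq_pandas = tokens[~tokens.isin(stop_words)].value_counts()

print(word_freq)
print(word_freq_pandas)
```

To remove further words later, one option is to extend extra_stop_words and refit. Filtering the word_freq dictionary directly also works if refitting is not desirable.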