I'm using CountVectorizer from sklearn to tokenize text into 2-grams and build a term-document matrix. How can I make punctuation act as a token boundary? For example, for the input sentence "this is example, with punctuation." I want the tokens "this is", "is example", and "with punctuation". I don't want "example with", which crosses the comma.
Below is my current code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'title': ['this is example, with punctuation'], 'page': [1]})
countvec = CountVectorizer(ngram_range=(2, 2), analyzer="word")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
test_tdm = pd.DataFrame(countvec.fit_transform(df.title).toarray(),
                        columns=countvec.get_feature_names_out())
print(test_tdm)
Thanks!
One way of doing it would be to first split the string you want to tokenise by punctuation. Something like this:
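A minimal sketch of that splitting step, assuming Python's re.split with a character class of common punctuation marks (the splitted_title name is just what the rest of this answer refers to):

```python
import re

title = "this is example, with punctuation."

# Split at punctuation so no 2-gram can span a punctuation mark,
# then strip whitespace and drop empty segments
splitted_title = [s.strip() for s in re.split(r"[,.;:!?]", title) if s.strip()]

print(splitted_title)  # → ['this is example', 'with punctuation']
```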
and then apply the tokenisation to each element of the resulting splitted_title list, so that 2-grams are only ever built within a segment.