How to tokenize text using punctuation as boundaries (Python)

I'm using CountVectorizer from sklearn to tokenize text into 2-grams and build a term-document matrix. How can I make the 2-grams treat punctuation as a boundary? For example, given the input sentence "this is example, with punctuation." I want the tokens to be "this is", "is example", "with punctuation". I don't want "example with", because it spans the comma.

Below is my current code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'title':['this is example, with punctuation'], 'page':[1]})
countvec = CountVectorizer(ngram_range=(2, 2), analyzer="word")

# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
test_tdm = pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names_out())
print(test_tdm)

Thanks!

There is 1 answer

MiKo

One way of doing it would be to first split the string you want to tokenise by punctuation. Something like this:

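import re, string

# Escape the punctuation so characters such as ']' and '\' stay literal inside the character class
patt = '[' + re.escape(string.punctuation) + ']'
# re.split expects a single string, not a pandas Series, so split each title separately
splitted_title = [re.split(patt, title) for title in df.title]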

and then apply the tokenisation to each element of splitted_title separately, so that no 2-gram can span a punctuation mark.
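
For illustration, here is a minimal sketch of the split-then-tokenise idea as a single function. The name `punctuation_bounded_bigrams` is hypothetical, and the sketch assumes plain whitespace tokenisation and lowercasing, which is not identical to CountVectorizer's default word tokenizer (for instance, it keeps single-character words and it splits contractions such as "don't" at the apostrophe).

```python
import re
import string

# Character class matching any single ASCII punctuation mark.
# re.escape keeps characters such as ']' and '\' literal inside the class.
PUNCT = re.compile('[' + re.escape(string.punctuation) + ']')


def punctuation_bounded_bigrams(text):
    """Return word 2-grams that never cross a punctuation mark."""
    bigrams = []
    # Each segment is the text between two punctuation marks.
    for segment in PUNCT.split(text.lower()):
        words = segment.split()
        # Pair each word with its successor inside the same segment only.
        bigrams.extend(' '.join(pair) for pair in zip(words, words[1:]))
    return bigrams


print(punctuation_bounded_bigrams('this is example, with punctuation.'))
```

CountVectorizer accepts a callable for its `analyzer` parameter, so a function like this could be passed as `CountVectorizer(analyzer=punctuation_bounded_bigrams)`. In that case the callable is responsible for producing the final tokens itself, which is why `ngram_range` is not set.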