How to tokenize text using punctuation as boundaries (Python)


I'm using CountVectorizer from sklearn to tokenize text into 2-grams and create a term-document matrix. How can I tokenize into 2-grams with punctuation as boundaries? For example, for the input sentence "this is example, with punctuation." I want the tokens "this is", "is example", and "with punctuation". I don't want "example with", because it crosses the comma.

Below is my current code:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'title': ['this is example, with punctuation'], 'page': [1]})
countvec = CountVectorizer(ngram_range=(2, 2), analyzer="word")

test_tdm = pd.DataFrame(countvec.fit_transform(df.title).toarray(), columns=countvec.get_feature_names())
print(test_tdm)

Thanks!


1 Answer

Answer from MiKo:

One way of doing it would be to first split the string you want to tokenise by punctuation. Something like this:

import re
import string

# re.escape stops characters such as ']' and '\' in string.punctuation
# from breaking the character class
patt = '[' + re.escape(string.punctuation) + ']'
splitted_title = re.split(patt, df.title[0])  # re.split expects a string, not a Series

and then apply the tokenisation to each element of splitted_title.
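
Here is a minimal end-to-end sketch of that idea, assuming a custom analyzer callable is acceptable (the helper name bigrams_within_segments is my own). CountVectorizer accepts a callable for its analyzer argument, so the punctuation split and the 2-gram generation can both happen inside it:

import re
import string

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Character class matching any single punctuation mark.
# re.escape keeps characters such as ']' and '\' from breaking the class.
punct = re.compile('[' + re.escape(string.punctuation) + ']')

def bigrams_within_segments(text):
    """Split on punctuation first, then build 2-grams only inside each segment."""
    bigrams = []
    for segment in punct.split(text.lower()):
        words = segment.split()
        bigrams.extend(' '.join(pair) for pair in zip(words, words[1:]))
    return bigrams

df = pd.DataFrame({'title': ['this is example, with punctuation'], 'page': [1]})

# Passing a callable as analyzer replaces CountVectorizer's built-in
# word n-gram logic (and its preprocessing) with our own tokenisation.
countvec = CountVectorizer(analyzer=bigrams_within_segments)

test_tdm = pd.DataFrame(
    countvec.fit_transform(df.title).toarray(),
    columns=countvec.get_feature_names_out(),  # get_feature_names() on older sklearn
)
print(test_tdm)
# Columns: 'is example', 'this is', 'with punctuation' -- no 'example with',
# because the comma acts as a boundary.

Note that when analyzer is a callable, CountVectorizer skips its own lowercasing and tokenisation, which is why the helper lowercases the text itself.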