CountVectorizer for number

Question

174 views Asked by saraafr At 12 April 2023 at 07:53

I have a list of numbers and I want to use CountVectorizer on it:

from sklearn.feature_extraction.text import CountVectorizer

def x(n):
    return str(n)

sentences = [5, 10, 15, 10, 5, 10]

vectorizer = CountVectorizer(preprocessor=x, analyzer="word")
vectorizer.fit(sentences)

vectorizer.vocabulary_

output:

{'10': 0, '15': 1}

and:

vectorizer.transform(sentences).toarray()

output:

array([[0, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 0],
       [1, 0]], dtype=int64)

But why can't I do this for numbers less than 10?

There is 1 answer

**lifezbeautiful** · Accepted Answer · 2023-04-12T08:26:09+00:00

This is the expected behavior. The documentation for the token_pattern parameter of CountVectorizer says:

token_pattern  str or None, default=r"(?u)\b\w\w+\b"
Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'.

The default regexp select tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

If there is a capturing group in token_pattern then the captured group content, not the entire match, becomes the token. At most one capturing group is permitted.
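To see why '5' is dropped, it can help to run the two patterns directly with Python's re module. This is only a sketch: it assumes CountVectorizer extracts tokens roughly the way re.findall does, and the variable names and the "id_" pattern are made up for illustration.

```python
import re

# Default token_pattern from the documentation: two or more word characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Relaxed pattern: one or more word characters.
relaxed_pattern = re.compile(r"(?u)\b\w+\b")

for text in ["5", "10", "15"]:
    print(text, default_pattern.findall(text), relaxed_pattern.findall(text))

# With one capturing group, findall returns the group content, not the whole match.
# The "id_" prefix is a hypothetical example, not something from the question.
grouped_pattern = re.compile(r"(?u)\bid_(\w+)\b")
print(grouped_pattern.findall("id_5 id_10"))
```

The default pattern finds nothing in "5" because it needs at least two word characters, while the relaxed pattern returns it as a token.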

If you want single-character strings to be counted too, just delete the first \w from the regex. The pattern then matches tokens of one or more characters, whereas the default only matches tokens of two or more characters, as the documentation says.

vectorizer = CountVectorizer(preprocessor=x, analyzer="word", token_pattern=r"(?u)\b\w+\b")

Output of vectorizer.vocabulary_ after fitting on the same list:

{'5': 2, '10': 0, '15': 1}
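
For completeness, here is a minimal end-to-end sketch of the fix. It is a variation on the question's code, not the exact original: the numbers are converted to strings up front instead of through a preprocessor function, and the names numbers, docs and counts are just illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

numbers = [5, 10, 15, 10, 5, 10]
# Assumption: convert to strings up front rather than via a preprocessor.
docs = [str(n) for n in numbers]

# Relaxed token_pattern so that single-character tokens such as "5" are kept.
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

# Column indices follow the sorted string order of the tokens: "10", "15", "5".
print(vectorizer.vocabulary_)
print(counts)
```

Because the tokens are sorted as strings, "5" ends up in the last column rather than the first.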
