I have a list of numbers and I want to use CountVectorizer on it:
from sklearn.feature_extraction.text import CountVectorizer
def x(n):
    return str(n)
sentences = [5,10,15,10,5,10]
vectorizer = CountVectorizer(preprocessor= x, analyzer="word")
vectorizer.fit(sentences)
vectorizer.vocabulary_
output:
{'10': 0, '15': 1}
and:
vectorizer.transform(sentences).toarray()
output:
array([[0, 0],
       [1, 0],
       [0, 1],
       [1, 0],
       [0, 0],
       [1, 0]], dtype=int64)
But why can't I do this for numbers less than 10?
This is the expected behavior. The default regex for the token_pattern parameter of CountVectorizer is (?u)\b\w\w+\b, and the documentation notes that it selects tokens of 2 or more alphanumeric characters, so single-digit strings such as '5' are discarded.
If you want single-character strings to be counted too, delete the first \w from the regex, giving (?u)\b\w+\b. That pattern matches tokens of 1 or more characters instead of the default 2 or more.
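As a minimal sketch of that change, assuming scikit-learn is installed and that the numbers are converted to strings up front (instead of through the preprocessor used in the question), the snippet below passes a token_pattern with the first \w removed:

```python
from sklearn.feature_extraction.text import CountVectorizer

numbers = [5, 10, 15, 10, 5, 10]
# Assumption for this sketch: convert to strings before fitting,
# rather than relying on a custom preprocessor.
docs = [str(n) for n in numbers]

# Default token_pattern is r"(?u)\b\w\w+\b" (2 or more characters).
# Removing the first \w lets single-character tokens through.
vectorizer = CountVectorizer(analyzer="word", token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(docs).toarray()

print(vectorizer.vocabulary_)  # {'5': 2, '10': 0, '15': 1} (key order may vary)
print(counts)
```

With this pattern '5' gets its own column, so the rows for 5 are no longer all zeros. The column indices follow the sorted string order of the tokens ('10', '15', '5'), not numeric order.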