The parameter 'token_pattern' will not be used since 'tokenizer' is not None


I am trying to remove punctuation and spaces (which include newlines), filter for tokens consisting of alphabetic characters only, and return the token text. I first define the function:

import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(doc):
    return [t.text for t in nlp(doc)
            if not t.is_punct
            and not t.is_space
            and t.is_alpha]

And then I vectorize:

vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer)
train_feature_vects = vectorizer.fit_transform(train_data)

The terminal gets stuck and prints "The parameter 'token_pattern' will not be used since 'tokenizer' is not None". What am I doing wrong?

There is 1 answer.

Answered by Andj:

For TfidfVectorizer, CountVectorizer, etc. in scikit-learn, when you supply your own tokenizer, you should also set token_pattern to None:

vectorizer = TfidfVectorizer(tokenizer=spacy_tokenizer, token_pattern=None)

scikit-learn only uses token_pattern when no tokenizer is given; its default value is r"(?u)\b\w\w+\b". When you pass a tokenizer, that function takes precedence and token_pattern is simply ignored, which is exactly what the message is telling you. It is a warning, not an error, so your custom tokenizer is still being used; setting token_pattern=None makes your intent explicit and silences the warning. (The warning itself does not hang anything; running spaCy over a large corpus can simply take a long time.)