When using the scikit-learn library in Python, I can use the CountVectorizer
to create ngrams of a desired length (e.g. 2 words) like so:
from sklearn.feature_extraction.text import CountVectorizer
import nltk
myString = 'This is a\nmultiline string'
countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()
listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)
print(NgramQueryWeights.items())
This prints:
dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])
As one can see from the "is multiline" ngram that was created (the single-character token "a" is dropped by the default token pattern, which only matches words of two or more characters), the engine does not care about the line break within the string.
How can I modify the engine creating the ngrams to respect linebreaks in the string and only create ngrams with all of its words belonging to the same line of text? My expected output would be:
dict_items([('multiline string', 1), ('this is', 1)])
I know that I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer. Moreover, I read somewhere that the default regex used is u'(?u)\\b\\w\\w+\\b'. Still, I think this problem is more about the ngram creation than about the tokenization: the problem is not that tokens are created across the line break, but that ngrams are.
You'll need to overload the analyzer, as described in the documentation.
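As a minimal sketch of that idea: CountVectorizer accepts a callable as its analyzer, so you can wrap the default analyzer in a function that processes each line separately. Ngrams then never span a line break. The function name line_aware_analyzer below is my own, not part of scikit-learn:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default bigram analyzer, built once from a configured vectorizer.
_default_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()

def line_aware_analyzer(doc):
    """Run the default analyzer on each line separately, so that
    no ngram spans a line break."""
    return [ngram
            for line in doc.splitlines()
            for ngram in _default_analyzer(line)]

my_string = 'This is a\nmultiline string'
print(line_aware_analyzer(my_string))
# -> ['this is', 'multiline string']

# The callable can then be plugged back into a vectorizer:
vectorizer = CountVectorizer(analyzer=line_aware_analyzer)
```

Feeding the resulting list into nltk.FreqDist as in the question should then yield only 'this is' and 'multiline string', matching the expected output.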