Create ngrams only from words on the same line (never spanning line breaks) with scikit-learn CountVectorizer


When using the scikit-learn library in Python, I can use the CountVectorizer to create ngrams of a desired length (e.g. 2 words) like so:

from sklearn.feature_extraction.text import CountVectorizer
import nltk

myString = 'This is a\nmultiline string'

countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()

listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)

print(NgramQueryWeights.items())

This prints:

dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])

As one can see from the is multiline ngram that was created (the single-character token a is dropped by the default token pattern, which only matches tokens of two or more characters), the engine does not care about the line break within the string.

How can I modify the engine creating the ngrams to respect linebreaks in the string and only create ngrams with all of its words belonging to the same line of text? My expected output would be:

dict_items([('multiline string', 1), ('this is', 1)])

I know that I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer. Moreover, I read somewhere that the default regex used is u'(?u)\\b\\w\\w+\\b'. Still, I think this problem is more about the ngram creation than about the tokenizer, as the problem is not that tokens are created across the line break, but that the ngrams are.
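To illustrate the point, a quick check using only the documented default regex shows that tokenization already splits cleanly at the line break; it is the subsequent ngram step that joins tokens from different lines:

import re

myString = 'This is a\nmultiline string'

# tokenization itself is not the problem: the default token pattern
# splits at the newline just fine
print(re.findall(r'(?u)\b\w\w+\b', myString.lower()))
# ['this', 'is', 'multiline', 'string']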


There are 3 answers

Fred Foo (accepted answer)

You'll need to supply your own analyzer, as described in the documentation.

import re
from sklearn.feature_extraction.text import CountVectorizer

def bigrams_per_line(doc):
    # process each line separately so that bigrams never span a line break
    for ln in doc.split('\n'):
        terms = re.findall(r'\w{2,}', ln)
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram


cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())  # on scikit-learn >= 1.0, use get_feature_names_out()
# ['This is', 'multiline string']
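One thing to be aware of: when analyzer is a callable, CountVectorizer skips its own preprocessing, including lowercasing, which is why 'This is' keeps its capital letter here. A short sketch of how one might inspect the resulting counts (variable names are illustrative):

X = cv.fit_transform(['This is a\nmultiline string'])

# pair each feature with its count in the first (and only) document
print(dict(zip(cv.get_feature_names(), X.toarray()[0])))
# {'This is': 1, 'multiline string': 1}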
Dirk (question author)

The accepted answer works fine, but it only finds bigrams (ngrams of exactly two words). To generalize this to ngrams of any length range (as in the example code in the question, which used the ngram_range=(min,max) argument), one can use the following code:

import re
from itertools import tee, islice
from sklearn.feature_extraction.text import CountVectorizer

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):

    # analyze each line of the input string separately
    for ln in doc.split('\n'):

        # tokenize the line (customize the regex as desired)
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)

        # loop over every ngram length between min and max
        for ngramLength in range(minNgramLength, maxNgramLength + 1):

            # find and yield all ngrams of this length
            # non-generator alternative (same result, higher memory usage):
            #   for ngram in zip(*[terms[i:] for i in range(ngramLength)]):
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]): # <-- solution using a generator
                yield ' '.join(ngram)

Then use the custom analyzer as argument to CountVectorizer:

cv = CountVectorizer(analyzer=ngrams_per_line)

Make sure that minNgramLength and maxNgramLength are defined in a scope visible to ngrams_per_line (e.g. by declaring them as globals), since CountVectorizer calls the analyzer with the document as its only argument, so they cannot be passed in directly (the closure in the next answer addresses this).
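A minimal usage sketch (the global names mirror the code above; note that the \w+ regex, unlike the question's default \w\w+, keeps one-letter tokens such as a, so an extra 'is a' bigram appears; switch back to the default pattern to reproduce the question's expected output):

import nltk

# globals read by ngrams_per_line, mirroring ngram_range=(2, 2) from the question
minNgramLength = 2
maxNgramLength = 2

cv = CountVectorizer(analyzer=ngrams_per_line)

# the analyzer can also be called directly to inspect its output
print(nltk.FreqDist(ngrams_per_line('This is a\nmultiline string')).items())
# dict_items([('This is', 1), ('is a', 1), ('multiline string', 1)])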

CKLu

Dirk's answer is even more general than the accepted one; here is one more hint on how to pass parameters to that analyzer function: simply use a closure.

def gen_analyzer(minNgramLength, maxNgramLength):
    def ngrams_per_line(doc):
        ...  # same body as in Dirk's answer, reading the bounds from the enclosing scope

    return ngrams_per_line

cv = CountVectorizer(analyzer=gen_analyzer(1, 2))
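An alternative sketch (not from the original answers, but using only the standard library): rewrite ngrams_per_line to take the bounds as keyword arguments and bind them with functools.partial, since CountVectorizer accepts any callable as analyzer:

import re
from functools import partial
from itertools import tee, islice
from sklearn.feature_extraction.text import CountVectorizer

def ngrams_per_line(doc, minNgramLength, maxNgramLength):
    # same per-line ngram logic as in Dirk's answer,
    # with the bounds as explicit parameters instead of globals
    for ln in doc.split('\n'):
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)
        for n in range(minNgramLength, maxNgramLength + 1):
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, n))]):
                yield ' '.join(ngram)

cv = CountVectorizer(analyzer=partial(ngrams_per_line, minNgramLength=1, maxNgramLength=2))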