Create ngrams only for words on the same line (disregarding line breaks) with Scikit-learn CountVectorizer


When using the scikit-learn library in Python, I can use the CountVectorizer to create ngrams of a desired length (e.g. 2 words) like so:

from sklearn.feature_extraction.text import CountVectorizer
import nltk

myString = 'This is a\nmultiline string'

countVectorizer = CountVectorizer(ngram_range=(2,2))
analyzer = countVectorizer.build_analyzer()

listNgramQuery = analyzer(myString)
NgramQueryWeights = nltk.FreqDist(listNgramQuery)

print(NgramQueryWeights.items())

This prints:

dict_items([('is multiline', 1), ('multiline string', 1), ('this is', 1)])

As one can see from the is multiline ngram that was created (the single-character token a is dropped by the default token pattern, which only matches words of two or more characters), the engine does not care about the line break within the string.

How can I modify the ngram creation to respect line breaks in the string and only create ngrams whose words all belong to the same line of text? My expected output would be:

dict_items([('multiline string', 1), ('this is', 1)])

I know that I can modify the tokenizer pattern by passing token_pattern=someRegex to CountVectorizer. Moreover, I read somewhere that the default regex used is u'(?u)\\b\\w\\w+\\b'. Still, I think this problem is more about the ngram creation than about the tokenizer: the tokens themselves are created correctly, it is the ngrams that are built across the line break.
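For example, passing the default pattern explicitly changes nothing (a quick check; the analyzer still builds an ngram across the line break):

countVectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=u'(?u)\\b\\w\\w+\\b')
analyzer = countVectorizer.build_analyzer()
print(analyzer(myString))
# ['this is', 'is multiline', 'multiline string']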


There are 3 answers

Fred Foo (accepted answer)

You'll need to supply your own analyzer, as described in the documentation.

import re

def bigrams_per_line(doc):
    # handle each line separately so that bigrams never span a line break
    for ln in doc.split('\n'):
        # tokenize: words of two or more characters, mirroring the default token_pattern
        terms = re.findall(r'\w{2,}', ln)
        # pair every token with its successor within the same line
        for bigram in zip(terms, terms[1:]):
            yield '%s %s' % bigram


cv = CountVectorizer(analyzer=bigrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())
# ['This is', 'multiline string']
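
To reproduce the counts from the question, the analyzer output can be fed directly into nltk.FreqDist (a small sketch reusing the question's myString; note that this custom analyzer does not lowercase, unlike the default one):

import nltk

myString = 'This is a\nmultiline string'
print(nltk.FreqDist(bigrams_per_line(myString)).items())
# dict_items([('This is', 1), ('multiline string', 1)])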
Dirk

The accepted answer works fine, but it only finds bigrams (ngrams of exactly two tokens). To generalize this to ngrams of arbitrary length (as in my example code in the question, which used the ngram_range=(min,max) argument), one can use the following code:

from sklearn.feature_extraction.text import CountVectorizer
import re
from itertools import tee, islice

# custom ngram analyzer function, matching only ngrams that belong to the same line
def ngrams_per_line(doc):

    # analyze each line of the input string separately
    for ln in doc.split('\n'):

        # tokenize the input string (customize the regex as desired)
        terms = re.findall(u'(?u)\\b\\w+\\b', ln)

        # loop ngram creation for every number between min and max ngram length
        for ngramLength in range(minNgramLength, maxNgramLength+1):

            # find and return all ngrams
            # for ngram in zip(*[terms[i:] for i in range(ngramLength)]): <-- solution without a generator (works the same but has higher memory usage)
            for ngram in zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))]): # <-- solution using a generator
                ngram = ' '.join(ngram)
                yield ngram
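
To see what the tee/islice expression is doing, here is a standalone check with a hypothetical terms list:

from itertools import tee, islice

terms = ['this', 'is', 'multiline']
ngramLength = 2
windows = zip(*[islice(seq, i, len(terms)) for i, seq in enumerate(tee(terms, ngramLength))])
print(list(windows))
# [('this', 'is'), ('is', 'multiline')]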

Then use the custom analyzer as argument to CountVectorizer:

cv = CountVectorizer(analyzer=ngrams_per_line)

Make sure that minNgramLength and maxNgramLength are defined in a scope where ngrams_per_line can see them (e.g. by declaring them as globals), since they cannot be passed to the function as arguments (at least I don't know how).
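For instance, to reproduce the question's setup (a minimal sketch; note that the \b\w+\b regex above also keeps one-letter tokens, so 'is a' appears as well):

minNgramLength = 2  # declared as globals so ngrams_per_line can see them
maxNgramLength = 2

cv = CountVectorizer(analyzer=ngrams_per_line)
cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())
# ['This is', 'is a', 'multiline string']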

CKLu

Dirk's answer is even better than the accepted one; here is just another hint on how to pass parameters to this function: simply use a closure.

def gen_analyzer(minNgramLength, maxNgramLength):
    def ngrams_per_line(doc):
        ...  # same body as Dirk's ngrams_per_line, using the closed-over bounds
    return ngrams_per_line

cv = CountVectorizer(analyzer=gen_analyzer(1, 2))
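
For example, assuming the inner body is Dirk's ngrams_per_line (with its \b\w+\b regex) adapted to use the closed-over bounds:

cv.fit(['This is a\nmultiline string'])
print(cv.get_feature_names())
# ['This', 'This is', 'a', 'is', 'is a', 'multiline', 'multiline string', 'string']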