Gensim Doc2Vec Exception AttributeError: 'str' object has no attribute 'words'


I am learning the Doc2Vec model from the gensim library and using it as follows:

import os

from gensim.models.doc2vec import Doc2Vec, LabeledSentence
from nltk.corpus import stopwords

class MyTaggedDocument(object):
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                print(fname)
                for item_no, sentence in enumerate(fin):
                    # One document per line, with English stopwords dropped,
                    # tagged '<file name without extension>_<line number>'
                    yield LabeledSentence([w for w in sentence.lower().split() if w not in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no])

sentences = MyTaggedDocument(dirname)
model = Doc2Vec(sentences, min_count=2, window=10, size=300, sample=1e-4, negative=5, workers=7)

Here dirname is a directory path which, for the sake of simplicity, contains only 2 files, each with more than 100 lines. I am getting the following exception.

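Output: AttributeError: 'str' object has no attribute 'words' (the full traceback was posted as a screenshot)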

Also, from the print statement I can see that the iterator went over the directory 6 times. Why is that?

Any kind of help would be appreciated.

Answer by gojomo

It looks like one of the text-example objects, which should be shaped like a TaggedDocument (formerly called LabeledSentence, with words and tags properties), is somehow a plain string instead. Are you 100% certain that the error in your screenshot was generated by exactly the iterable code you've included? (The code here looks like it could only emit acceptable LabeledSentence objects.)
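One way to track down a stray string is to walk the corpus yourself before handing it to the model and flag anything that lacks words or tags. The following is only a rough sketch: find_malformed and the sample corpus are made up for illustration, and the namedtuple is a stand-in with the same field names as gensim's TaggedDocument so that the snippet runs without gensim installed.

```python
from collections import namedtuple

# Stand-in with the same field names as gensim's TaggedDocument, so this
# sketch runs without gensim installed. In real code, import it instead:
#   from gensim.models.doc2vec import TaggedDocument
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def find_malformed(corpus):
    """Return (index, item) pairs for items lacking `words` or `tags`."""
    return [
        (i, item)
        for i, item in enumerate(corpus)
        if not (hasattr(item, "words") and hasattr(item, "tags"))
    ]


# Hypothetical corpus: the second item is a bare string by mistake.
corpus = [
    TaggedDocument(words=["hello", "world"], tags=["a_0"]),
    "hello again",
    TaggedDocument(words=["third", "line"], tags=["a_2"]),
]

bad = find_malformed(corpus)
print(bad)

# Attribute access on the stray string fails the same way as in the title.
try:
    corpus[1].words
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

If a check like this reports nothing for the exact iterable you pass to Doc2Vec, the error is probably coming from different code than what is posted.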

Your supplied corpus iterable is read once for an initial scan that discovers all words/tags, then again multiple times for training. The number of training passes is controlled by the iter parameter, with a default value (in recent versions of gensim) of 5. So the initial scan plus 5 training passes equals 6 total iterations. (10 or more iterations are common with Doc2Vec.)
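To make the pass count concrete, here is a small sketch that counts how often a restartable iterable gets iterated. Note that simulate_training is a hypothetical stand-in for the scan-then-train pattern described above, not gensim's actual code, and CountingCorpus is made up for the example. (In gensim 4.x the iter parameter is named epochs.)

```python
class CountingCorpus:
    """Restartable iterable that counts how many times it is iterated."""

    def __init__(self, docs):
        self.docs = docs
        self.passes = 0

    def __iter__(self):
        # Runs once per fresh pass, when the consumer asks for the first item.
        self.passes += 1
        for doc in self.docs:
            yield doc


def simulate_training(corpus, epochs=5):
    """Hypothetical stand-in for the model's passes; not gensim code."""
    for _ in corpus:  # one vocabulary-discovery scan
        pass
    for _ in range(epochs):  # then one full pass per training epoch
        for _ in corpus:
            pass


corpus = CountingCorpus(["doc one", "doc two"])
simulate_training(corpus, epochs=5)
print(corpus.passes)  # 1 scan + 5 training passes = 6

# A plain generator, by contrast, is exhausted after a single pass.
gen = (doc for doc in ["doc one", "doc two"])
first_pass = list(gen)
second_pass = list(gen)
print(len(first_pass), len(second_pass))
```

This is also why the corpus has to be a class with __iter__ like yours, not a one-shot generator: every pass after the first would otherwise see no data.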