I am learning the Doc2Vec model from the gensim library and using it as follows:

    import os
    from gensim.models.doc2vec import Doc2Vec, LabeledSentence
    from nltk.corpus import stopwords

    class MyTaggedDocument(object):
        def __init__(self, dirname):
            self.dirname = dirname

        def __iter__(self):
            for fname in os.listdir(self.dirname):
                with open(os.path.join(self.dirname, fname), encoding='utf-8') as fin:
                    print(fname)
                    # One document per line, tagged '<file stem>_<line number>'
                    for item_no, sentence in enumerate(fin):
                        yield LabeledSentence([w for w in sentence.lower().split() if w in stopwords.words('english')], [fname.split('.')[0].strip() + '_%s' % item_no])

    sentences = MyTaggedDocument(dirname)
    model = Doc2Vec(sentences, min_count=2, window=10, size=300, sample=1e-4, negative=5, workers=7)

The input dirname is a directory path which, for the sake of simplicity, contains only 2 files, each with more than 100 lines. I am getting the following exception.

Also, with the print statement I could see that the iterator iterated over the directory 6 times. Why is this so?
Any kind of help would be appreciated.
It looks like one of the text-example objects, which should be shaped like a TaggedDocument (with words and tags properties, formerly called LabeledSentence), is somehow a plain string instead. Are you 100% certain that the error in your screenshot was generated by exactly the iterable code you've included? (The code here looks like it could only emit acceptable LabeledSentence objects.)

Your supplied corpus iterable is read once to do an initial scan which discovers all words/tags, then again multiple times for training. How many times is controlled by the iter parameter, with a default value (in recent versions of gensim) of 5. So the initial scan plus 5 training passes equal 6 total iterations. (10 or more iterations is common with Doc2Vec.)
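
To make the pass counting concrete, here is a minimal sketch that does not use gensim at all. TaggedDoc, CountingCorpus, check_shape and simulate_training are hypothetical names invented for this illustration; simulate_training only imitates the "one scan, then N training passes" read pattern described above and is not gensim's actual implementation. The shape check shows the kind of test that would catch a plain string slipping into the corpus.

```python
from collections import namedtuple

# Hypothetical stand-in for a TaggedDocument-shaped item (words + tags),
# so the sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


class CountingCorpus:
    """Restartable iterable that records how many times it is iterated."""

    def __init__(self, docs):
        self.docs = docs
        self.passes = 0

    def __iter__(self):
        # Called once per full read of the corpus.
        self.passes += 1
        for doc in self.docs:
            yield doc


def check_shape(doc):
    """Raise if an item is a plain string or lacks words/tags attributes."""
    if isinstance(doc, str) or not (hasattr(doc, "words") and hasattr(doc, "tags")):
        raise TypeError("expected an object with words and tags, got %r" % (doc,))


def simulate_training(corpus, epochs=5):
    """Imitate the read pattern only: one scan, then `epochs` passes."""
    for doc in corpus:  # initial scan that discovers words and tags
        check_shape(doc)
    for _ in range(epochs):  # one full read of the corpus per training pass
        for doc in corpus:
            pass


corpus = CountingCorpus([
    TaggedDoc(words=["the", "and"], tags=["file1_0"]),
    TaggedDoc(words=["of", "in"], tags=["file1_1"]),
])
simulate_training(corpus, epochs=5)
print(corpus.passes)  # prints 6: 1 scan + 5 training passes
```

Because the corpus is a class with __iter__ rather than a one-shot generator, every pass restarts from the first file, which is why the print(fname) in the question fires for each file on every pass.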