Multiple tags for single document in doc2vec. TaggedDocument


Is it possible to train a doc2vec model where a single document has multiple tags? For example, in movie reviews,

doc0 = doc2vec.TaggedDocument(words=review0,tags=['UID_0','horror','action'])
doc1 = doc2vec.TaggedDocument(words=review1,tags=['UID_1','drama','action','romance'])

In such a case, where each document has a unique tag (UID) and multiple categorical tags, how do I access the vectors after training? For example, what would be the proper syntax for a call like

model['UID_1']

There is 1 answer

Answered by gojomo

Yes, it's possible to supply multiple tags per document, and that's why the tags property of TaggedDocument should be a list, and why a key used to refer to learned doc-vectors is called a 'tag' rather than an 'id'. (While the original 'Paragraph Vectors' paper on which gensim Doc2Vec is based only described using one unique identifier per document, this is a natural extension.)

To get any doc-vector, you must access it via the docvecs property of the model, not the model itself. (The model itself, inheriting functionality from Word2Vec, will contain word-vectors, not doc-vectors, and those word-vectors will only be meaningful in some Doc2Vec modes.)

So after training, you'd get the doc-vectors of your example data via operations like the following:

model.docvecs['UID_1']
model.docvecs['action']

Keep in mind that when you're training more vectors, you'll likely need more data. In a rough sense, whatever valuable generalizations can be made from your data come from compressing the original data into a smaller representation. If you train a larger model (more word-vectors and document-tag-vectors as internal tunable parameters) on the same amount of data, the results may be more 'diluted' or even 'overfit'. (That is, the model may come to reflect memorized idiosyncrasies of the training data, rather than generalizable insights useful for downstream purposes or new texts.)
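
To put a rough number on that growth, the sketch below counts only the doc-tag parameters, on the assumption that every distinct tag gets one vector of vector_size floats. The helper name doc_tag_parameter_count is made up for illustration and is not part of gensim; word-vectors and other model weights are left out of the count.

```python
def doc_tag_parameter_count(tag_lists, vector_size):
    """Count trainable doc-tag parameters: one vector per distinct tag."""
    distinct_tags = {tag for tags in tag_lists for tag in tags}
    return len(distinct_tags) * vector_size


# Tag lists from the question's two example documents.
with_categories = [
    ["UID_0", "horror", "action"],
    ["UID_1", "drama", "action", "romance"],
]
# The same documents tagged with their unique IDs only.
uid_only = [["UID_0"], ["UID_1"]]

params_with_categories = doc_tag_parameter_count(with_categories, vector_size=100)
params_uid_only = doc_tag_parameter_count(uid_only, vector_size=100)
print(params_with_categories, params_uid_only)  # 600 200
```

On a real corpus the categorical tags are shared by many documents, so they add far fewer vectors than the per-document UIDs do; the count is only a quick way to see how much the tunable state grows relative to the available text.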