Is it possible to train a doc2vec model where a single document has multiple tags? For example, in movie reviews,
doc0 = doc2vec.TaggedDocument(words=review0,tags=['UID_0','horror','action'])
doc1 = doc2vec.TaggedDocument(words=review1,tags=['UID_1','drama','action','romance'])
In such a case, where each document has a unique tag (UID) and multiple categorical tags, how do I access the vectors after training? For example, what would be the proper syntax to call
model['UID_1']
Yes, it's possible to supply multiple tags per document. That's why the tags property of TaggedDocument should be a list, and why a key used to refer to learned doc-vectors is called a 'tag' rather than an 'id'. (While the original 'Paragraph Vectors' paper on which gensim's Doc2Vec is based only described using one unique identifier per document, this is a natural extension.)
To get any doc-vector, you must access it via the docvecs property of the model (named dv in gensim 4.0 and later), not the model itself. (The model itself, inheriting functionality from Word2Vec, will contain word-vectors, not doc-vectors, and those word-vectors will only be meaningful in some Doc2Vec modes.)
So after training, you'd get the doc-vectors of your example data via lookups such as model.docvecs['UID_1'] for a single review, or model.docvecs['action'] for a shared category tag.
Keep in mind that when you're training more vectors, you'll likely need more data. In a rough sense, the useful generalizations a model can draw from your data come from compressing the original data into a smaller representation. If you train a larger model (more word-vectors or document-tag-vectors as internal tunable parameters) on the same amount of data, the results may be more 'diluted' or even 'overfit'. (That is, the model may come to reflect memorized idiosyncrasies of the training data, rather than generalizable insights useful for downstream purposes or new texts.)
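To put rough numbers on 'larger model', here is a back-of-the-envelope count of the floats stored in the word-vector and tag-vector tables. It assumes the simplified view that every vocabulary word and every distinct tag owns one vector of vector_size floats, and it ignores the model's other internal weights. The helper name and the corpus sizes are hypothetical.

```python
def count_vector_parameters(vocab_size, num_tags, vector_size):
    """Rough count of floats held in the word-vector and doc-tag-vector tables."""
    return (vocab_size + num_tags) * vector_size


# Hypothetical corpus: 10,000 reviews, 20,000-word vocabulary, 100-dimensional vectors.
n_docs, vocab_size, vector_size = 10_000, 20_000, 100

# One unique UID tag per review.
only_uids = count_vector_parameters(vocab_size, n_docs, vector_size)
# UID tags plus 25 shared genre tags: the tag table grows only slightly.
uids_plus_genres = count_vector_parameters(vocab_size, n_docs + 25, vector_size)
# Genre tags alone: far fewer tag vectors to fit from the same text.
genres_only = count_vector_parameters(vocab_size, 25, vector_size)

print(only_uids, uids_plus_genres, genres_only)
```

Under these assumptions, the per-document UID tags account for most of the tag parameters, while a handful of shared category tags adds comparatively little.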