Does Doc2Vec support multiple languages? And can the transvec library be used with a Doc2Vec model?

I'm new to machine learning. I want to calculate the similarity between two documents in different languages (for example, a Vietnamese document and an English document).

I know that to compare words across languages, we can use transvec with Word2Vec. Is the same possible with Doc2Vec? How can I solve this problem with Doc2Vec? (I currently train Doc2Vec with Gensim.)

1 Answer

Answer by gojomo:

The Doc2Vec model in Gensim is oblivious to languages. It just applies the very-Word2Vec-like 'Paragraph Vector' algorithm to learn vectors for runs of tokens (documents) that are helpful in predicting words, either alone (pure PV-DBOW mode) or in combination with nearby-word-to-nearby-word info (PV-DM modes).
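
For illustration, here's a minimal, hypothetical sketch of training Gensim's Doc2Vec in each mode – the tiny corpus and parameter values are placeholders, not recommendations:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Toy corpus: each document is a pre-tokenized list of words plus a tag.
    corpus = [
        TaggedDocument(words=["machine", "learning", "is", "fun"], tags=["doc0"]),
        TaggedDocument(words=["deep", "learning", "is", "fun", "too"], tags=["doc1"]),
    ]

    # Pure PV-DBOW: the doc-vector alone is trained to predict the doc's words.
    dbow_model = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

    # PV-DM: the doc-vector is combined with nearby word-vectors for prediction.
    dm_model = Doc2Vec(corpus, dm=1, vector_size=100, min_count=1, epochs=40)

    # Either model can infer a vector for a new run-of-tokens.
    new_vec = dbow_model.infer_vector(["machine", "learning", "is", "great"])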

So, whether it would work on a multilingual corpus, for any particular purpose – such as detecting when two documents in different languages cover similar topics – will depend entirely on how you train the model, and especially on the kinds of documents and word-to-word correlations it sees in its training set.

While I've not run the experiments, from my understanding of the algorithm I would expect it could work if the model is:

  1. Given lots of data that's a meaningful mix of both languages, so that it has a chance to learn that word A in language 1 and word B in language 2 relate to the same topics. (It might be enough to simply have natural bilingual docs, but it would probably help to include documents that contain equivalent text in both languages. It might even be OK if such documents were rather low-quality mechanistic translations, as long as they overall hint properly at which words relate to each other, even if those words never appear together in the original/natural monolingual docs.)
  2. Right-sized, to force the internal neural network to make use of cross-language correlations. (An oversized model – too many dimensions or too many rare words – would tend instead to learn the two languages without sharing much internal representation – a sort of overfitting.)

A model trained only on monolingual examples, and oversized, could work great on English-to-English doc comparisons – putting all English docs in one giant region of the vector space – and also work great on Vietnamese-to-Vietnamese doc comparisons – putting all Vietnamese docs in an arbitrarily different giant region of the vector space. But even an English doc and a Vietnamese doc about the same thing could get very different vectors – because nothing in the training data ever hinted that their words covered the same things.

Ultimately, though, you'd need to experiment to see how well it would work, and how much you could help it to work, by ensuring it has useful multilingual hints of cross-language topics.
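
As a concrete (but hypothetical) sketch of such an experiment: train one Doc2Vec model on a mixed corpus that includes bilingual hint-documents, keep it modestly sized per point 2 above, then check cross-language similarities. The tags, tokens, and parameter values below are all placeholders:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Mixed corpus: monolingual docs from both languages, plus bilingual docs
    # that place equivalent English & Vietnamese text inside one document.
    mixed_docs = [
        TaggedDocument(["the", "president", "gave", "a", "speech"], ["en_0"]),
        TaggedDocument(["tổng", "thống", "đã", "phát", "biểu"], ["vi_0"]),
        TaggedDocument(["the", "president", "gave", "a", "speech",
                        "tổng", "thống", "đã", "phát", "biểu"], ["bi_0"]),
        # ... many more documents of all three kinds ...
    ]

    # Deliberately modest vector_size, so the network is pressured to share
    # representation across languages rather than memorize each separately.
    model = Doc2Vec(mixed_docs, dm=0, vector_size=50, min_count=1, epochs=100)

    # If training worked, same-topic docs should land near each other even
    # across languages; cosine similarity between trained doc-vectors:
    print(model.dv.similarity("en_0", "vi_0"))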

Update re your Q about transvec:

I wasn't aware of that library; it looks neat, and it seems it may be a variant of, and possibly better than, the TranslationMatrix model in Gensim.

Both transvec & Gensim's TranslationMatrix are general tools for learning mappings between separate vector spaces, once you provide a set of known correlated anchors.

As such, they could allow an alternative approach to your goal:

  1. Create one Doc2Vec model using only English documents and a separate Doc2Vec model using only Vietnamese documents, and ensure they're both individually sensible – trained on enough data and giving reasonable results in ad hoc or rigorous evaluations.

  2. Then, using some good 'gold standard' set of English & Vietnamese document pairs that "should" have the same doc-vector – for example, because they are good translations of each other – use transvec to learn how to translate one model's vectors (either from the original training set or later inferences) into the other model's space, for direct comparison to that space's vectors. (A minimal sketch of this mapping step follows below.)
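
I haven't verified transvec's exact API, so rather than guess at it, here is the underlying idea of step 2 in plain numpy: fit a linear map between the two spaces from anchor doc-pairs – a least-squares 'translation matrix', in the spirit of Gensim's TranslationMatrix. All names here are placeholders; the real libraries add refinements:

    import numpy as np

    def fit_translation(anchors_vi: np.ndarray, anchors_en: np.ndarray) -> np.ndarray:
        """Learn W such that anchors_vi @ W ≈ anchors_en, where each input is
        an (n_pairs, dims) array of doc-vectors for known translation pairs."""
        W, _, _, _ = np.linalg.lstsq(anchors_vi, anchors_en, rcond=None)
        return W

    def translate(vec_vi: np.ndarray, W: np.ndarray) -> np.ndarray:
        # Map a Vietnamese doc-vector into the English model's space, where it
        # can be compared to English doc-vectors (e.g. by cosine similarity).
        return vec_vi @ W

With W in hand, you'd infer a vector for a new Vietnamese document with the Vietnamese model, translate it, and compare it against English doc-vectors by cosine similarity.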

As a vague rule of thumb, I believe you'd want many more anchor pairs than there are dimensions in the model. (That is: a mere 100 1-to-1 examples is unlikely to be sufficient to learn a good mapping between two 300-dimensional spaces – there's too much slack/variance on each end for so few examples to pin down – but a thousand or several thousand examples might work well.)

But of course, the real answer will come through experiments on your data, for your goals.