Are word-vector orientations universal?


I have recently been experimenting with Word2Vec and I noticed whilst trawling through forums that a lot of other people are also creating their own vectors from their own databases.

This has made me curious about how vectors compare across databases, and whether they take on a universal orientation.

I understand that the vectors are created from the contexts in which words appear in the corpus, so in that sense perhaps you wouldn't expect words to have the same orientation across databases. However, if the language of the documents is constant, then the contexts should be at least somewhat similar across different databases (excluding ambiguous words such as "bank", which can refer to money or to a river bank). And if the contexts are somewhat similar, is it plausible that the directions of the more commonly occurring words converge?

There is 1 answer

tripleee On BEST ANSWER

As outlined in the comments, "orientation" is not a well-defined concept in this context. A traditional word vector space has one dimension for each term.

In order for word vectors to be compatible, they will need to have the same term order. This is typically not the case between different vector collections, unless you build them from exactly the same documents in exactly the same order with exactly the same algorithms.
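To illustrate the term-order point, here is a minimal sketch. It assumes each collection stores a vector as a sparse term-to-weight mapping; the helper names `align` and `cosine` and the sample weights are made up for the example and do not come from any particular library.

```python
import math

def align(vec_a, vec_b):
    """Re-express two sparse term->weight dicts over one shared, sorted term order."""
    terms = sorted(set(vec_a) | set(vec_b))
    a = [vec_a.get(t, 0.0) for t in terms]
    b = [vec_b.get(t, 0.0) for t in terms]
    return terms, a, b

def cosine(a, b):
    """Cosine similarity of two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical vectors from two collections that list their terms in different orders.
from_collection_a = {"bank": 2.0, "money": 1.0}
from_collection_b = {"money": 1.0, "bank": 2.0}

terms, a, b = align(from_collection_a, from_collection_b)
similarity = cosine(a, b)
```

Comparing the raw value lists position by position would pair "bank" with "money"; only after both vectors are put on the same term order does the comparison mean anything.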

You could construe "orientation" as "vectors with the same terms in the same order", but the parallel to three-dimensional geometry is already strained. It's probably better to avoid this term.

Given two collections of vectors from reasonably representative input in a known language, the most frequent terms (see Zipf's Law) will probably have similar distributions, so you could perhaps derive a mapping from one representation to the other with some accuracy. Out in the long tail of rare terms, you will certainly not be able to identify any useful mappings.
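One way such a mapping might be derived, if the two collections are dense embeddings of the same dimensionality (as with Word2Vec in the question), is orthogonal Procrustes: fit a rotation on the vectors of the frequent words both collections share. This is a swapped-in technique for illustration, not something the answer prescribes. The data below is synthetic and `fit_rotation` is a hypothetical helper name.

```python
import numpy as np

def fit_rotation(src, dst):
    """Orthogonal matrix W minimising ||src @ W - dst|| (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(src.T @ dst)
    return u @ vt

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 frequent words with 4-dimensional vectors in space B.
dst = rng.normal(size=(50, 4))

# Space A holds the same words, rotated by an unknown orthogonal matrix q.
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
src = dst @ q.T

# Fit the mapping from A to B using the shared words as anchors.
w = fit_rotation(src, dst)
mapped = src @ w
```

In this toy setup the two spaces differ only by a rotation, so the fit is exact. With real corpora the fit would be approximate at best, and it says nothing about rare words that were not used as anchors.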