I used the XLM-RoBERTa tokenizer in order to get the IDs for a bunch of sentences such as:
["loving is great", "This is another example"]
I see that the number of IDs returned does not always match the number of whitespace-separated tokens in my sentences: for example, the first sentence corresponds to [[0, 459, 6496, 83, 6782, 2]], with loving being split into the two IDs 459 and 6496. After getting the matrix of word embeddings from the IDs, I was trying to identify only those embedding vectors corresponding to some specific tokens: is there a way to do that? If the original tokens are sometimes assigned more than one ID, and this cannot be predicted, I do not see how this is possible.
More generally, my task is to get word embeddings for specific tokens within a sentence: I therefore want to feed in the whole sentence first, so that the embeddings of individual tokens are computed within their syntactic context, but then keep the vectors of only some specific tokens rather than those of all tokens in the sentence.
The mapping between tokens and IDs is unique; however, the text is segmented into subwords before you get the token (in this case, subword) IDs.
You can find out what string the IDs belong to:
You will get:
['▁lo', 'ving']
which shows how the first word was actually pre-processed. The pre-processing splits the text on spaces and prepends the first token, and every token that was preceded by a space, with the ▁ sign. In a second step, it splits out-of-vocabulary tokens into subwords for which there are IDs in the vocabulary.
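For your actual goal: if you use the fast tokenizer (XLMRobertaTokenizerFast), the encoding it returns exposes word_ids(), which maps every subword position back to the index of the whitespace-separated word it came from (None for special tokens like <s> and </s>). You can then group the subword vectors by word index and keep only the words you care about, e.g. by mean-pooling each group. A minimal sketch of that grouping step, with hypothetical stand-in values for what the tokenizer and model would return:

```python
# word_ids aligns each subword position to its source word;
# None marks special tokens (<s>, </s>). Stand-in for word_ids()
# of "loving is great": <s> ▁lo ving ▁is ▁great </s>
word_ids = [None, 0, 0, 1, 2, None]

# One vector per subword position (toy 2-d vectors here; in practice
# these come from the model's last hidden state).
embeddings = [
    [0.0, 0.0],   # <s>
    [1.0, 2.0],   # ▁lo
    [3.0, 4.0],   # ving
    [5.0, 6.0],   # ▁is
    [7.0, 8.0],   # ▁great
    [0.0, 0.0],   # </s>
]

def word_vectors(word_ids, embeddings):
    """Group subword vectors by word index and mean-pool each group."""
    groups = {}
    for pos, wid in enumerate(word_ids):
        if wid is not None:
            groups.setdefault(wid, []).append(embeddings[pos])
    return {
        wid: [sum(col) / len(vecs) for col in zip(*vecs)]
        for wid, vecs in groups.items()
    }

vectors = word_vectors(word_ids, embeddings)
# vectors[0] is the mean of the "▁lo" and "ving" vectors: [2.0, 3.0]
```

From the resulting dictionary you can then select just the word indices you are interested in, which sidesteps the unpredictable one-to-many token-to-ID splitting.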