I'm trying to evaluate the performance of the most_similar method (https://spacy.io/api/vectors#most_similar) from spaCy, and I'm curious whether it runs faster on GPU. A function like this:
def spacy_most_similar(word, topn=10):
    ms = nlp_ru.vocab.vectors.most_similar(nlp_ru(word).vector.reshape(1,100), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

spacy_most_similar("дерево", 10)
works correctly on the CPU version, but on GPU (which uses CuPy arrays instead of NumPy) I get an error:
TypeError Traceback (most recent call last)
<ipython-input-8-ea5e049ec55b> in <module>()
7 distances = ms[2]
8 return words, distances
----> 9 spacy_most_similar("дерево", 10)
<ipython-input-8-ea5e049ec55b> in spacy_most_similar(word, topn)
3 print(nlp_ru(word).vector.reshape(1,100).shape)
4 ms = nlp_ru.vocab.vectors.most_similar(
----> 5 nlp_ru(word).vector.reshape(1,100), n=topn)
6 words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
7 distances = ms[2]
vectors.pyx in spacy.vectors.Vectors.most_similar()
TypeError: list indices must be integers or slices, not cupy.core.core.ndarray
I also tried this approach:
import numpy as np

def spacy_most_similar(word, topn=10):
    ms = nlp_ru.vocab.vectors.most_similar(np.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

spacy_most_similar("дерево", 10)
Again, everything works fine on CPU, but for the GPU version (where I changed np to cp):
import cupy as cp

def spacy_most_similar(word, topn=10):
    with cp.cuda.Device(0):
        nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
        ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

spacy_most_similar("дерево", 10)
I get this error:
TypeError Traceback (most recent call last)
<ipython-input-6-876656d5f75d> in <module>()
7 distances = ms[2]
8 return words, distances
----> 9 spacy_most_similar("дерево", 10)
<ipython-input-6-876656d5f75d> in spacy_most_similar(word, topn)
3 with cp.cuda.Device(0):
4 nlp_ru.vocab.vectors.data = cp.asarray(nlp_ru.vocab.vectors.data)
----> 5 ms = nlp_ru.vocab.vectors.most_similar(cp.asarray([nlp_ru.vocab.vectors[nlp_ru.vocab.strings[word]]]), n=topn)
6 words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
7 distances = ms[2]
vectors.pyx in spacy.vectors.Vectors.most_similar()
TypeError: unhashable type: 'cupy.core.core.ndarray'
Could you please help me build the correct CuPy input for the most_similar() method?
I doubt you can run most_similar on GPU given the existing source code.

Note that filled is already a CPU object. It can be indexed properly by an index fetched from a NumPy array, but not by one fetched from a CuPy array. The error

TypeError: list indices must be integers or slices, not cupy.core.core.ndarray

comes from indexing filled in that way.

If you think there is value in finding the most similar words on GPU, you may open an issue on https://github.com/explosion/spaCy/issues or write your own most_similar (which I believe is simple enough).
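As a rough illustration of what "your own most_similar" could look like, here is a minimal sketch of a cosine-similarity top-n lookup written against a generic array module. The function name most_similar_rows and the xp parameter are hypothetical, not part of spaCy. The example runs on NumPy; the assumption is that passing xp=cupy works the same way, since only calls that exist in both libraries are used.

```python
import numpy as np


def most_similar_rows(vectors, queries, n=10, xp=np):
    """Return (rows, scores) of the n most cosine-similar rows per query.

    vectors: (num_vectors, dim) array, queries: (num_queries, dim) array.
    xp is the array module (hypothetical parameter): numpy here,
    cupy is assumed to be usable as a drop-in replacement.
    """
    vectors = xp.asarray(vectors, dtype="float32")
    queries = xp.asarray(queries, dtype="float32")

    # Normalise rows to unit length; zero vectors get a norm of 1 to avoid 0/0.
    v_norms = xp.linalg.norm(vectors, axis=1, keepdims=True)
    v_norms[v_norms == 0] = 1
    q_norms = xp.linalg.norm(queries, axis=1, keepdims=True)
    q_norms[q_norms == 0] = 1

    # Cosine similarity of every query against every stored vector.
    sims = xp.dot(queries / q_norms, (vectors / v_norms).T)

    # A full sort keeps the sketch simple; argpartition would be cheaper
    # for large vector tables.
    rows = xp.argsort(-sims, axis=1)[:, :n]
    scores = xp.take_along_axis(sims, rows, axis=1)
    return rows, scores


# Tiny toy table instead of real word vectors.
table = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
rows, scores = most_similar_rows(table, table[:1], n=2)
print(rows.tolist())  # row 0 itself first, then row 2
```

With CuPy, rows would be a GPU array, so it would need to be copied to the host (for example with cupy.asnumpy(rows)) before using its values to index a Python list or dict, which is exactly the step that fails in the tracebacks above. Mapping rows back to words still has to go through the vocabulary on the CPU.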