I have noticed that inference time in a TensorFlow model scales linearly with the embedding vocabulary size.
This in itself I do not find surprising. However, the steepness of the relationship is surprising and makes it impractical.
I created a simple sequential model with a vocabulary size of 10 million, and running a single inference through it takes 82 seconds on my M1 Pro laptop.
I have created a minimal script that reproduces the effect:
import time
import uuid

import tensorflow as tf


def create_vocab(vocab_size):
    return [str(uuid.uuid4()) for _ in range(vocab_size)]


def run(vocab_size):
    vocabulary = create_vocab(vocab_size)
    model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=vocabulary,
            mask_token=None),
        tf.keras.layers.Embedding(
            vocab_size + 1,
            24)
    ])
    t1 = time.time()
    model.predict([vocabulary[4]])
    t2 = time.time()
    inference_time = t2 - t1
    print(f"Vocab size: {vocab_size} / Inference time: {inference_time}")


if __name__ == '__main__':
    for vocab_size in [1000, 10000, 100000, 1000000, 10000000]:
        run(vocab_size)
The results I get are:

Vocab size    Inference time (s)
1,000         0.041
10,000        0.106
100,000       0.718
1,000,000     7.351
10,000,000    82.48
I can only assume that this is expected behaviour and that I am using the wrong pattern. My question, then, is: how might I build a performant online model that uses a large number of embeddings? My use case is users in a recommender system, where I would like to perform real-time inference in under 200 ms on a model containing at least 10 million user embeddings. There must be many online models that achieve this. Any advice is much appreciated.
UPDATE, SOLVED: The problem was solved by removing the StringLookup layer from the model, as this component appears to be particularly inefficient. The alternative approach is to do the string-to-integer user mapping outside the model and feed the integer IDs in as input instead. With this approach, the 10M-vocab variant returns in 200 ms.
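For anyone hitting the same issue, here is a minimal sketch of that alternative, assuming the string-to-integer mapping can be held in an ordinary Python dict outside the model. The names build_model, id_lookup and oov_id, and the choice to reserve the last index for unknown users, are illustrative assumptions rather than anything prescribed above:

import tensorflow as tf

EMBEDDING_DIM = 24


def build_model(vocab_size):
    # The model now starts at the Embedding layer and takes integer IDs,
    # so inference no longer pays for an in-graph string lookup.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size + 1, EMBEDDING_DIM)
    ])


# Plain Python dict mapping user strings to integer IDs, kept outside the model.
vocabulary = ["user-a", "user-b", "user-c"]  # stand-in for the real 10M-entry vocabulary
id_lookup = {token: idx for idx, token in enumerate(vocabulary)}
oov_id = len(vocabulary)  # the extra embedding row covers unknown users

model = build_model(len(vocabulary))

# At request time: map the incoming string to an int in Python,
# then run only the embedding lookup through the model.
user_id = id_lookup.get("user-b", oov_id)
embedding = model.predict([user_id])
print(embedding.shape)  # (1, 24)

The dict lookup is O(1) per request, so the predict call itself no longer depends on the vocabulary size.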