I have noticed that inference time in a TensorFlow model scales linearly with the embedding vocabulary size.
This in itself I do not find surprising. However, the steepness of the relationship is surprising and makes it impractical.
I created a simple sequential model with a vocabulary size of 10 million, and running a single inference through it takes 82 seconds on my M1 Pro laptop.
I have created a minimal script that reproduces the effect:
import time
import uuid

import tensorflow as tf


def create_vocab(vocab_size):
    return [str(uuid.uuid4()) for _ in range(vocab_size)]


def run(vocab_size):
    vocabulary = create_vocab(vocab_size)
    model = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=vocabulary,
            mask_token=None),
        tf.keras.layers.Embedding(
            vocab_size + 1,
            24)
    ])
    t1 = time.time()
    model.predict([vocabulary[4]])
    t2 = time.time()
    inference_time = t2 - t1
    print(f"Vocab size: {vocab_size} / Inference time: {inference_time}")


if __name__ == '__main__':
    for vocab_size in [1000, 10000, 100000, 1000000, 10000000]:
        run(vocab_size)
The results I get are:

Vocab size    Inference time (s)
1,000         0.041
10,000        0.106
100,000       0.718
1,000,000     7.351
10,000,000    82.48
I can only assume that this is expected behaviour and that I am using the wrong pattern. My question, then, is: how might I build a performant online model that uses a large number of embeddings? My use case is users in a recommender system, where I would like to perform real-time inference in under 200 ms on a model containing at least 10 million user embeddings. There must be many online models that achieve this. Any advice is much appreciated.
UPDATE, SOLVED: The problem was solved by removing the StringLookup layer from the model, as this component appears to be particularly inefficient. The alternative approach is to do the string-to-integer user mapping outside the model and feed the integer IDs in as input instead. With this approach, the 10M-vocab variant returns in 200 ms.
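For anyone hitting the same issue, here is a minimal sketch of that alternative, assuming the string-to-integer mapping can be held in an ordinary Python dict outside the model. The names build_model, id_lookup and oov_id, and the choice to reserve the last index for unknown users, are illustrative assumptions rather than anything prescribed above:

import tensorflow as tf

EMBEDDING_DIM = 24


def build_model(vocab_size):
    # The model now starts at the Embedding layer and takes integer IDs,
    # so inference no longer pays for an in-graph string lookup.
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size + 1, EMBEDDING_DIM)
    ])


# Plain Python dict mapping user strings to integer IDs, kept outside the model.
vocabulary = ["user-a", "user-b", "user-c"]  # stand-in for the real 10M-entry vocabulary
id_lookup = {token: idx for idx, token in enumerate(vocabulary)}
oov_id = len(vocabulary)  # the extra embedding row covers unknown users

model = build_model(len(vocabulary))

# At request time: map the incoming string to an int in Python,
# then run only the embedding lookup through the model.
user_id = id_lookup.get("user-b", oov_id)
embedding = model.predict([user_id])
print(embedding.shape)  # (1, 24)

The dict lookup is O(1) per request, so the predict call itself no longer depends on the vocabulary size.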