I have a large CSV file (35M rows) in the following format:
id, sentence, description
Normally, in inference mode, I'd use the model like so:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)
for row in iter_through_csv:
    encs = model.encode(row[1], normalize_embeddings=True)
```
But since I have GPUs, I'd like to batch the encoding. However, the file is large (35M rows), so I don't want to read the whole thing into memory and batch it there.
I'm struggling to find a template for batching a CSV with Hugging Face. What is the optimal way to do this?
You should convert the CSV to a Hugging Face dataset. This lets you process a large dataset without loading the full thing into memory, since the data is memory-mapped from disk via Arrow. You can then map your embedding function over the dataset in batches, so the embeddings are computed on the GPU in chunks and written to the on-disk cache instead of being kept in memory all at once.
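A minimal sketch of that approach, assuming the file is called `data.csv` with a header row, the sentence column is named `sentence`, and the model name, device, and batch sizes are placeholders you'd tune:

```python
from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device='cuda')

# load_dataset memory-maps the CSV via Arrow, so the full file is never held in RAM
dataset = load_dataset('csv', data_files='data.csv', split='train')

def embed(batch):
    # encode a whole chunk of sentences at once on the GPU
    batch['embedding'] = model.encode(
        batch['sentence'],
        batch_size=256,
        normalize_embeddings=True,
    ).tolist()
    return batch

# batched map feeds the encoder chunks of rows; results go to the Arrow
# cache on disk rather than accumulating in memory
dataset = dataset.map(embed, batched=True, batch_size=1024)
dataset.save_to_disk('embedded_dataset')
```

If you want to spread the work across several GPUs, you can either shard the dataset and run one process per GPU, or use sentence-transformers' multi-process pool (`start_multi_process_pool` / `encode_multi_process`) inside the map function.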