huggingface embedding large csv in batches


I have a large CSV file (35M rows) in the following format:

id, sentence, description

Normally, in inference mode, I'd use the model like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)
for row in iter_through_csv:
    encs = model.encode(row[1], normalize_embeddings=True)

But since I have GPUs, I'd like to batch the work. However, the file is large (35M rows), so I do not want to read it all into memory and then batch it.

I am struggling to find a template for batching a CSV with Hugging Face. What is the optimal way to do this?


There are 2 answers

Karl

You should convert the CSV to a Hugging Face dataset. Datasets are memory-mapped from disk, so you can process a large dataset without loading the whole thing into memory, and you can map your embedding function over it to compute embeddings batch by batch without keeping all of them in memory.
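A minimal sketch of that approach, assuming the CSV has columns named id, sentence, and description as in the question; the model name is the placeholder from the question, and the file paths, device string, and batch sizes are illustrative values to tune for your hardware:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device='cuda:0')

# load_dataset memory-maps the CSV via Arrow, so the 35M rows are never all in RAM
ds = load_dataset('csv', data_files='data.csv', split='train')

def embed(batch):
    # encode a batch of sentences on the GPU; encode() sub-batches internally
    batch['embedding'] = model.encode(
        batch['sentence'],
        batch_size=256,
        normalize_embeddings=True,
    ).tolist()
    return batch

# a batched map processes the dataset chunk by chunk and caches results to disk
ds = ds.map(embed, batched=True, batch_size=1024)
ds.save_to_disk('embedded_sentences')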

Anna Andreeva Rogotulka

I recommend buffered reading from the file, for example via pandas: skip the rows you have already processed and read the next batch_size rows:

pd.read_csv(path,
            skiprows=index * batch_size + 1,
            nrows=batch_size,
            names=['id', 'sentence', 'description'])
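Combining this with the model from the question, here is a sketch that keeps only one chunk in memory at a time; it uses read_csv's chunksize iterator instead of repeated skiprows calls, so already-read rows are not re-scanned. The path, gpu_id, batch_size, and encode batch size are placeholders from the question, not tested values:

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)

# chunksize makes read_csv return an iterator of DataFrames, one batch at a time
for chunk in pd.read_csv(path, chunksize=batch_size, skiprows=1,
                         names=['id', 'sentence', 'description']):
    # encode the whole chunk at once so the GPU receives full batches
    encs = model.encode(chunk['sentence'].tolist(),
                        batch_size=256, normalize_embeddings=True)
    # ... write encs to disk here instead of accumulating them in memory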