huggingface embedding large csv in batches


I have a large CSV file (35M rows) in the following format:

id, sentence, description

Normally, in inference mode, I'd use the model like so:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)
for row in iter_through_csv:
    encs = model.encode(row[1], normalize_embeddings=True)

But since I have GPUs, I'd like to batch the work. However, the file is large (35M rows), so I do not want to read it all into memory and then batch it.

I am struggling to find a template for batching a CSV with Hugging Face. What is the optimal way to do this?


There are 2 answers

Karl

You should convert the CSV to a Hugging Face dataset. Datasets are memory-mapped from disk, so you can process a large dataset without loading the whole thing into memory, and you can map your embedding function over it to compute embeddings batch by batch without keeping all of them in memory.
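A minimal sketch of that approach, assuming the CSV has columns named id, sentence, and description as in the question; the model name is the placeholder from the question, and the file paths, device string, and batch sizes are illustrative values to tune for your hardware:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device='cuda:0')

# load_dataset memory-maps the CSV via Arrow, so the 35M rows are never all in RAM
ds = load_dataset('csv', data_files='data.csv', split='train')

def embed(batch):
    # encode a batch of sentences on the GPU; encode() sub-batches internally
    batch['embedding'] = model.encode(
        batch['sentence'],
        batch_size=256,
        normalize_embeddings=True,
    ).tolist()
    return batch

# a batched map processes the dataset chunk by chunk and caches results to disk
ds = ds.map(embed, batched=True, batch_size=1024)
ds.save_to_disk('embedded_sentences')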

Anna Andreeva Rogotulka

I recommend buffered reading from the file, for example via pandas: skip the rows you have already processed and read the next batch_size rows:

pd.read_csv(path,
            skiprows=index * batch_size + 1,
            nrows=batch_size,
            names=['id', 'sentence', 'description'])
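Combining this with the model from the question, here is a sketch that keeps only one chunk in memory at a time; it uses read_csv's chunksize iterator instead of repeated skiprows calls, so already-read rows are not re-scanned. The path, gpu_id, batch_size, and encode batch size are placeholders from the question, not tested values:

import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('flax-sentence-embeddings/some_model_here', device=gpu_id)

# chunksize makes read_csv return an iterator of DataFrames, one batch at a time
for chunk in pd.read_csv(path, chunksize=batch_size, skiprows=1,
                         names=['id', 'sentence', 'description']):
    # encode the whole chunk at once so the GPU receives full batches
    encs = model.encode(chunk['sentence'].tolist(),
                        batch_size=256, normalize_embeddings=True)
    # ... write encs to disk here instead of accumulating them in memory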