How to extract DistilBERT embeddings from a list containing 5000 records?


After tokenizing the dataset, we tried to extract DistilBERT embeddings for our dataset (5,000 text records in a dataframe), but a memory error occurred with the following code:

outputs = model(**tokenized_inputs0)
bert_embeddings = outputs.last_hidden_state

So we split the dataframe into a list of batches using the following code:

list_train = [final_Data1[i:i+100] for i in range(0, final_Data1.shape[0], 100)]

Now, how can we extract the DistilBERT embeddings for the above list_train?

How can we apply the following code to extract DistilBERT embeddings for each element of the list?

outputs = model(**tokenized_inputs0)
bert_embeddings = outputs.last_hidden_state


1 Answer

Answered by Jesse Sealand

You could use a for loop to iterate over the list of batches you created. Note that each batch is still a slice of the raw dataframe, so it needs to be tokenized before it goes through the model. You will also have to do something with your predictions to get them out of memory, like saving them to a file, otherwise you'll still run out of memory.

batch_list = [final_Data1[i:i+100] for i in range(0, final_Data1.shape[0], 100)]

for batch in batch_list:
    # tokenize the raw text in each batch first (assumes a "text" column)
    tokenized_batch = tokenizer(batch["text"].tolist(), padding=True,
                                truncation=True, return_tensors="pt")

    with torch.no_grad():  # no gradients needed, saves memory
        outputs = model(**tokenized_batch)
    bert_embeddings = outputs.last_hidden_state

    # do something with your outputs (e.g. save them to a file)
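
For the "do something with your outputs" step, one option mentioned above is writing each batch to a file. A minimal sketch of that idea, assuming the model, tokenizer, and batch_list from the code above and a "text" column in the dataframe (the file names are just an example):

import torch

# process each batch and immediately write its embeddings to disk, so that
# only one batch of embeddings is ever held in memory at a time
for batch_idx, batch in enumerate(batch_list):
    tokenized_batch = tokenizer(batch["text"].tolist(), padding=True,
                                truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokenized_batch)

    # detach from the graph and move to CPU before saving
    embeddings = outputs.last_hidden_state.detach().cpu()
    torch.save(embeddings, f"distilbert_embeddings_batch_{batch_idx}.pt")

The saved tensors can later be reloaded with torch.load and, if the batches were padded to the same sequence length, stacked back together with torch.cat.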