I want to generate text using GPT-2 after fine-tuning, using a text file as the input to the model's generate function, but reading it line by line rather than as a single block of text.

At first I tried this code:

text_data = open('/content/drive/My Drive/output_data.txt', 'w')
with open('/content/drive/My Drive/input_data.txt') as lines:
    for line in lines:
        ids = tokenizer.encode(line, add_special_tokens=True, return_tensors='pt')
        final_outputs = model.generate(
            ids,
            do_sample=True,
            max_new_tokens=ids.shape[1] + 1,
            pad_token_id=model.config.eos_token_id,
            top_k=50,
            top_p=0.95,
            num_return_sequences=1,
        )
        a = tokenizer.decode(final_outputs[0], skip_special_tokens=True)
        text_data.write(a)
text_data.close()

However, instead of looping over the lines of input_data.txt and processing them one by one, it seems to take the file as a whole, and it runs for a long time on this line:

final_outputs = model.generate(
    ids,
    do_sample=True,
    max_new_tokens=ids.shape[1] + 1,
    pad_token_id=model.config.eos_token_id,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
)

And in the end, the output file contained only the result for a single line of the input file.

I have tried several ideas, but they all give the same result:

  • reading the input file separately with readlines() and looping over those lines to generate the output;
  • converting input.txt to CSV, reading it as a dataframe, and looping over its rows.
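To check whether the file-reading part of the loop behaves as expected, the line-by-line pattern can be isolated from the model entirely. Here is a minimal, self-contained sketch: `fake_generate` is a hypothetical stand-in for the real `tokenizer.encode` → `model.generate` → `tokenizer.decode` pipeline, and the temp-file paths exist only for the demo.

```python
import os
import tempfile

def fake_generate(line: str) -> str:
    # Hypothetical stand-in for encode -> model.generate -> decode.
    return line.upper()

# Create a small input file for the demo.
fd, in_path = tempfile.mkstemp(suffix=".txt"); os.close(fd)
fd, out_path = tempfile.mkstemp(suffix=".txt"); os.close(fd)
with open(in_path, "w") as f:
    f.write("first line\nsecond line\n\nthird line\n")

# Read line by line, skip blank lines, and write one output line
# per input line (the trailing "\n" keeps the outputs separate).
with open(in_path) as lines, open(out_path, "w") as out:
    for line in lines:
        line = line.strip()   # drop the trailing newline before encoding
        if not line:          # skip empty lines
            continue
        out.write(fake_generate(line) + "\n")

with open(out_path) as f:
    results = f.read().splitlines()

os.remove(in_path)
os.remove(out_path)
```

After this runs, `results` holds one output per non-empty input line, which is the behavior the original loop was meant to have; writing each result without a separating newline, as in the original code, makes all the outputs run together in the output file.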

Any suggestions would help. Thanks in advance.
