I want to generate text with GPT-2 after fine-tuning, using a text file as the input to the model's generate function, but processing it line by line rather than as one block of text.
I started with this code:
text_data = open('/content/drive/My Drive/output_data.txt', 'w')
with open('/content/drive/My Drive/input_data.txt') as lines:
    for line in lines:
        ids = tokenizer.encode(f'{line}', add_special_tokens=True, return_tensors='pt')
        final_outputs = model.generate(
            ids,
            do_sample=True,
            max_new_tokens=ids.shape[1] + 1,
            pad_token_id=model.config.eos_token_id,
            top_k=50,
            top_p=0.95,
            num_return_sequences=1
        )
        a = tokenizer.decode(final_outputs[0], skip_special_tokens=True)
        text_data.write(a)
text_data.close()
However, instead of looping over the lines of input_data.txt and processing them one by one, it treats the file as a single block of text and spends a long time on this line:
final_outputs = model.generate(...)
And in the end, the output file contains the result for only one line of input_data.txt.
I have tried several other ideas, but they all give the same result:
- reading the input file separately with readlines() and looping over those lines to generate the output;
- converting the input .txt file to CSV, reading it as a DataFrame, and looping over its rows.
Any suggestion would help. Thanks in advance.
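To make the expected behaviour concrete, here is a minimal sketch of the line-by-line loop I am aiming for, with a made-up stub (process_line) standing in for the tokenizer.encode -> model.generate -> tokenizer.decode chain, so only the looping and writing logic is shown:

```python
import io

def process_line(line):
    # Hypothetical stub; the real script would encode the line,
    # call model.generate, and decode the result here.
    return f'generated: {line.strip()}\n'

# In-memory stand-ins for input_data.txt and output_data.txt.
input_text = io.StringIO('first line\nsecond line\nthird line\n')
output = io.StringIO()

for line in input_text:      # iterates once per line, not once per file
    if not line.strip():     # skip blank lines
        continue
    output.write(process_line(line))

print(output.getvalue())
```

This is what I expect my loop to do: one generate call and one write per input line, with every result present in the output file.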