Determining the probability of a sequence generated by a T5 model with HuggingFace


I am using T5-Large from HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related. So, if I feed the model the string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed.", it is supposed to return either entailment, neutral, or contradiction.

Though I am able to determine the result, I am unable to determine the probability of the generated sequence. For instance, suppose the model generates entailment for the example above; I also want to know the probability of that entailment. So far, I have been using the following code:

from transformers import T5Tokenizer, T5ForConditionalGeneration

def is_entailment(premise, hypothesis):
    entailment_premise = premise
    entailment_hypothesis = hypothesis

    # Build the MNLI-style prompt that T5 was trained on and tokenize it
    token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
                             return_tensors="pt", return_length=True)
    input_ids = token_output.input_ids

    # Ask generate() to also return the per-step scores (logits)
    output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
    entailment_ids = output["sequences"]

    # Decode the generated token ids into the label string
    entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
    return entailment


tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)


premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."

print(is_entailment(premise, hypothesis))

I have tried using the scores returned in the output, but I am not sure how to calculate a probability from them. The same goes for the last hidden states that can be fetched from the output of generate(). I saw another question on Stack Overflow that suggested using a softmax function on the last hidden states, but I am unsure how to do it.

How can I calculate the probability of the generated sequence? That is, if I get entailment for a premise-hypothesis pair, what is P(entailment)?


1 Answer

Answer by Jindřich

What you get as the scores are the output token distributions before the softmax, the so-called logits. You can get the probabilities of the generated tokens by normalizing the logits with a softmax and indexing the result with the generated token ids, which you can take from the sequences field of what the generate method returns.
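For instance, something like the following sketch, which continues from the output object in the question's code (it assumes torch is installed; note that for an encoder-decoder model output.sequences starts with the decoder start token, so the token generated at step i sits at position i + 1):

import torch

token_probs = []
for step, step_logits in enumerate(output.scores):
    step_probs = torch.softmax(step_logits, dim=-1)    # normalize the logits
    token_id = output.sequences[0, step + 1]           # token id generated at this step
    token_probs.append(step_probs[0, token_id].item())

print(token_probs)  # one probability per generated subword token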

These are, however, not the probabilities you are looking for, because T5 segments your output words into smaller units (e.g., "entailment" gets segmented into ['▁', 'en', 'tail', 'ment'] by the t5-small tokenizer). This is even trickier because different answers get split into different numbers of tokens. You can get an approximate score by averaging the token probabilities (a similar length normalization is typically used during beam search). Such scores do not sum up to one.
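A minimal sketch of that approximation, averaging in log space (the form beam search's length normalization uses) and reusing token_probs from the snippet above:

import math

avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
approx_score = math.exp(avg_log_prob)  # geometric mean of the token probabilities
print(approx_score)  # comparable across answers, but such scores do not sum to one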

If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them so that they sum to one.
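A sketch of that last approach, reusing the tokenizer and model from the question; the helper name label_distribution is my own, and the scoring is done with a plain forward pass (teacher forcing) rather than generate():

import torch

def label_distribution(premise, hypothesis,
                       labels=("entailment", "neutral", "contradiction")):
    text = "mnli premise: " + premise + " hypothesis: " + hypothesis
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    log_likelihoods = []
    for label in labels:
        label_ids = tokenizer(label, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(input_ids=input_ids, labels=label_ids)
        # out.loss is the mean cross-entropy per target token; multiplying by
        # the target length recovers the total sequence log-likelihood
        log_likelihoods.append(-out.loss.item() * label_ids.shape[1])
    # a softmax over the three sequence log-likelihoods sums to one
    probs = torch.softmax(torch.tensor(log_likelihoods), dim=0)
    return dict(zip(labels, probs.tolist()))

print(label_distribution(premise, hypothesis))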