What differentiates Direct Preference Optimization (DPO) from supervised fine-tuning (SFT)


Assume that I want to build a binary classifier using an LLM that takes an input document $x$ and outputs a label $y$, where $y_w$ is the correct answer and $y_l$ is the incorrect answer.

Intuitively, I want to maximize $p(y_w \mid x)$ and minimize $p(y_l \mid x)$. So what difference does it make if we simply do SFT with the cross-entropy loss, as opposed to using DPO?

Cross entropy loss:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y_w)\sim\mathcal{D}}\big[\log p_\theta(y_w \mid x)\big]$$
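
To make this concrete, here is a minimal PyTorch-style sketch of what I mean by the SFT objective (the names `sequence_logprob`, `labels_w`, etc. are just mine for illustration, and I'm assuming the logits are already aligned with the label tokens, ignoring the usual one-token shift for causal LMs):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sum of per-token log-probabilities log p(y | x) for each sequence in the batch.

    logits: (batch, seq_len, vocab), labels: (batch, seq_len) token ids of y.
    """
    logp = F.log_softmax(logits, dim=-1)                            # (B, T, V)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return token_logp.sum(dim=-1)                                   # (B,)

def sft_loss(logits: torch.Tensor, labels_w: torch.Tensor) -> torch.Tensor:
    """Cross-entropy / SFT loss: -log p_theta(y_w | x), averaged over the batch."""
    return -sequence_logprob(logits, labels_w).mean()
```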

The loss function in the DPO paper:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta;\, \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
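
And here is a rough sketch of how I understand the DPO loss computationally (assuming the sequence log-probabilities of $y_w$ and $y_l$ under both the trainable policy and a frozen reference policy have already been computed, e.g. with `sequence_logprob` above; the argument names and the default `beta` are my own choices):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor,   # log pi_theta(y_w | x)
             policy_logp_l: torch.Tensor,   # log pi_theta(y_l | x)
             ref_logp_w: torch.Tensor,      # log pi_ref(y_w | x)
             ref_logp_l: torch.Tensor,      # log pi_ref(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    """-log sigma( beta * [log-ratio(y_w) - log-ratio(y_l)] ), averaged over the batch."""
    margin = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```

So the obvious surface difference is that DPO needs a reference policy and only ever sees the two candidates through the sigmoid of their log-ratio difference, whereas SFT pushes up $\log p_\theta(y_w \mid x)$ directly.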

In this particular scenario of using an LLM as a classifier, can I say that SFT and DPO are equivalent?

I can see that the loss functions are specified differently, but what does the difference mean from a mathematical/computational perspective? In other words, what is the contribution of the DPO method when we already have SFT? Thanks in advance.
