
Reference-free Monolithic Odds Ratio Preference Optimization (ORPO)

Our paper, ORPO: Monolithic Preference Optimization without Reference Model, written with Noah Lee, is now on arXiv! Our best models, 🤗 Mistral-ORPO-$\alpha$ (7B) and 🤗 Mistral-ORPO-$\beta$ (7B), surpass or are on par with state-of-the-art instruction-following large language models on AlpacaEval and MT-Bench, including Zephyr $\beta$ (7B), TULU-2-DPO (13B), and Llama-2-Chat (13B), even though they were obtained by fine-tuning pre-trained language models with ORPO on a single-turn conversation dataset alone, without two separate supervised fine-tuning & alignment phases. In detail, 🤗 Mistral-ORPO-$\alpha$ (7B) was trained on UltraFeedback, and 🤗 Mistral-ORPO-$\beta$ (7B) was trained on the cleaned version of UltraFeedback by Argilla. Find our checkpoints by clicking the model names next to 🤗 in the table, and check the detailed evaluation results on AlpacaEval, MT-Bench, IFEval, and the Open LLM Leaderboard!

| Model Name | Size | Align | MT-Bench | AlpacaEval 1.0 | AlpacaEval 2.0 |
|---|---|---|---|---|---|
| 🤗 Mistral-ORPO-$\alpha$ | 7B | ORPO | 7.23 | 87.92 | 11.33 |
| 🤗 Mistral-ORPO-$\beta$ | 7B | ORPO | 7.32 | 91.41 | 12.20 |
| 🤗 Zephyr ($\beta$) | 7B | DPO | 7.34 | 90.60 | 10.99 |
| 🤗 TULU-2-DPO | 13B | DPO | 7.00 | 89.5 | 10.12 |
| 🤗 Llama-2-Chat | 7B | RLHF | 6.27 | 71.37 | 4.96 |
| 🤗 Llama-2-Chat | 13B | RLHF | 6.65 | 81.09 | 7.70 |

 

Methods

In Section 3 of our paper, we study the role of SFT in the context of preference alignment and show that the negative log-likelihood (NLL) loss in SFT simultaneously increases the log probabilities of both the chosen and the rejected responses. Motivated by the intuitions behind DPO and unlikelihood training, we therefore design a modified NLL loss that induces stronger adaptation to the chosen responses while applying a minor penalty to the rejected responses, thereby questioning the necessity of a separate alignment phase.

Figure 1. General diagram comparing the crucial components of RLHF, DPO, and ORPO. ORPO effectively handles preference learning during SFT by penalizing the rejected response tokens while inducing stronger adaptation to the chosen response tokens.

From this background, we propose ORPO, which penalizes the model for assigning high average probabilities to the rejected responses in the first place. Specifically, $\mathcal{L}_{SFT}$ is the conventional NLL loss used to fine-tune language models with next-token prediction, and $\mathcal{L}_{OR}$ penalizes a low log odds ratio of the chosen response over the rejected one.

\[\mathcal{L}_{ORPO} = \mathbb{E}_{(x, y_w, y_l)}\left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]\] \[\mathcal{L}_{OR} = -\log \sigma \left( \log \frac{\textbf{odds}_\theta(y_w|x)}{\textbf{odds}_\theta(y_l|x)} \right)\]
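For concreteness, here is a minimal PyTorch sketch of this objective. It assumes `chosen_logps` and `rejected_logps` are the length-normalized sequence log-probabilities of the chosen and rejected responses under the policy; the function and variable names are illustrative rather than taken from our released training code.

```python
import torch
import torch.nn.functional as F

def log_odds(avg_logps: torch.Tensor) -> torch.Tensor:
    # log odds(y|x) = log p - log(1 - p), with p = exp(average per-token log-prob).
    # Assumes avg_logps < 0; add clamping for numerical safety in real training code.
    return avg_logps - torch.log1p(-torch.exp(avg_logps))

def orpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    # L_SFT: mean NLL on the chosen responses (here, the per-sequence averaged token NLL).
    l_sft = -chosen_logps.mean()
    # L_OR: negative log-sigmoid of the log odds ratio between chosen and rejected responses.
    log_odds_ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    l_or = -F.logsigmoid(log_odds_ratio).mean()
    return l_sft + lam * l_or  # lam (λ) here is an illustrative default, not a prescribed value
```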

 

Why Odds Ratio?

While most preference alignment algorithms, including DPO, are designed around the log probability ratio, we argue that the properties of the odds ratio make it better suited to the monolithic setting. In detail, the $1-p$ denominator of the odds amplifies the odds as the assigned probability $p$ grows, so the contrast between two probabilities $p_1$ and $p_2$ is more pronounced under the odds ratio than under the probability ratio.

\[\textbf{odds}_\theta(y|x) = \frac{P_\theta(y|x)}{1 - P_\theta(y|x)}\]
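As a concrete illustration of this amplification, take $p_1 = 0.9$ and $p_2 = 0.5$: the probability ratio is only $0.9 / 0.5 = 1.8$, whereas the odds ratio is

\[\frac{\textbf{odds}(p_1)}{\textbf{odds}(p_2)} = \frac{0.9 / 0.1}{0.5 / 0.5} = 9,\]

so the same pair of probabilities is contrasted far more sharply by the odds ratio.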

This is a desirable property when contrasting the likelihoods of the two responses in the monolithic setting, since there is no reference model to anchor the policy. If the margin between the two grows too large, it leads to a degeneration issue, as shown in the ablation study in our paper. To see how the margin between the chosen and the rejected responses changes during training, please check the Wandb reports: Mistral-ORPO-$\alpha$ (7B) or Mistral-ORPO-$\beta$ (7B).

 

AlpacaEval

We report the single-turn instruction-following skills of the two models through AlpacaEval. In AlpacaEval 1.0, 🤗 Mistral-ORPO-$\alpha$ (7B) and 🤗 Mistral-ORPO-$\beta$ (7B) score 87.92$\%$ and 91.41$\%$, exceeding the Llama-2-Chat models at 7B and 13B as well as Zephyr $\alpha$ and Zephyr $\beta$. Furthermore, in AlpacaEval 2.0, 🤗 Mistral-ORPO-$\alpha$ (7B) and 🤗 Mistral-ORPO-$\beta$ (7B) score 11.33$\%$ and 12.20$\%$. We explicitly compare the scores against the Llama-2-Chat and Zephyr models, checkpoints trained with RLHF and DPO, respectively. It is noteworthy that while those models were trained on significantly more data (e.g., Zephyr on 270k conversations in total versus 61k conversations for the 🤗 Mistral-ORPO series), ORPO surpasses the corresponding checkpoints.

Figure 2. AlpacaEval$_{2.0}$ scores of Llama-2-$\texttt{ORPO}$ (7B), 🤗 Mistral-$\texttt{ORPO}$-$\alpha$ (7B), and 🤗 Mistral-$\texttt{ORPO}$-$\beta$ (7B) compared to RLHF and DPO models. They surpass the RLHF- and DPO-tuned models, respectively.

 

MT-Bench

We report comprehensive instruction-following skills in single-turn and multi-turn conversations through MT-Bench. Our best models, 🤗 Mistral-ORPO-$\alpha$ (7B) and 🤗 Mistral-ORPO-$\beta$ (7B), achieve 7.23 and 7.32 on MT-Bench, even though they were trained on the 61k single-turn instances of UltraFeedback alone. Moreover, we ran MT-Bench with Gemini-Pro as the evaluator, and 🤗 Mistral-ORPO (7B) outperforms GPT-3.5-turbo (7.23), Claude-v1 (7.36), and Zephyr $\beta$ (7.40) with a score of 7.60.

 

IFEval

Finally, we assess the strict instruction-following skills of our models with IFEval. The scores are measured using EleutherAI/lm-evaluation-harness with the chat template applied (see the example invocation after the table). The scores for Llama-2-Chat (70B), Zephyr-β (7B), Mixtral-8x7B-Instruct-v0.1, GPT-3.5-Turbo, and GPT-4 were originally reported in this tweet. The highest scores among the non-proprietary models are bolded.

| Model Type | Prompt-Strict | Prompt-Loose | Inst-Strict | Inst-Loose |
|---|---|---|---|---|
| Llama-2-Chat (70B) | 0.4436 | 0.5342 | 0.5468 | 0.6319 |
| Zephyr-β (7B) | 0.4233 | 0.4547 | 0.5492 | 0.5767 |
| Mixtral-8x7B-Instruct-v0.1 | 0.5213 | **0.5712** | 0.6343 | **0.6823** |
| Mistral-ORPO-α (7B) | 0.5009 | 0.5083 | 0.5995 | 0.6163 |
| Mistral-ORPO-β (7B) | **0.5287** | 0.5564 | **0.6355** | 0.6619 |
| GPT-3.5-Turbo | 0.5767 | 0.6414 | 0.6835 | 0.7338 |
| GPT-4 | 0.7542 | 0.8115 | 0.8261 | 0.8681 |
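For reference, below is a rough sketch of how such IFEval numbers can be obtained with lm-evaluation-harness. The model id, dtype, and the `apply_chat_template` argument are assumptions that depend on your checkpoint and harness version, not an exact reproduction script.

```python
# Sketch of an IFEval run with EleutherAI/lm-evaluation-harness (a recent release is
# assumed, since older versions may not expose `apply_chat_template`).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=kaist-ai/mistral-orpo-beta,dtype=bfloat16",  # hypothetical model id
    tasks=["ifeval"],
    apply_chat_template=True,  # the scores above are measured with the chat template applied
    batch_size=8,
)
print(results["results"]["ifeval"])
```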
This post is licensed under Attribution 4.0 International (CC BY 4.0) by Jiwoo Hong.