🎯 Understand DPO and Why It Replaced PPO for Alignment

Trace DPO from the Bradley-Terry preference equation to the closed-form policy and the log-prob loss, so it stops feeling like 'just another trainer' and starts feeling inevitable. By the end, you'll predict, for three preference pairs, which way DPO will push the chosen and rejected log-probs, then check your predictions against a real training run.
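
As a compact preview of that trace, the chain of equations can be sketched as follows. The notation is an assumption made for this sketch and may differ from the lessons: $x$ is the prompt, $y_w$ and $y_l$ are the chosen and rejected responses, $r$ is the reward, $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\pi_\theta$ is the policy being trained, $\beta$ is the KL strength, and $Z(x)$ is the partition function.

```latex
\begin{align*}
% Bradley-Terry: probability that y_w is preferred over y_l
p(y_w \succ y_l \mid x) &= \sigma\big(r(x, y_w) - r(x, y_l)\big) \\[4pt]
% KL-regularized reward maximization
\max_{\pi}\; & \mathbb{E}_{x,\, y \sim \pi}\big[r(x, y)\big]
  - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \\[4pt]
% Closed-form optimum of that objective
\pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big) \\[4pt]
% Solve for the reward: the partition function appears as an additive term
r(x, y) &= \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
% Substitute into Bradley-Terry: beta log Z(x) cancels in the difference
\mathcal{L}_{\mathrm{DPO}}(\theta) &= -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
\end{align*}
```

The partition function depends only on the prompt, which is why it drops out once the two rewards for the same prompt are subtracted.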

Applied · 14 drops · ~2-week path · 5–8 min/day · technology

Phase 1: Why PPO became a four-model bottleneck and DPO let it go

See why PPO needed four models and DPO needs two

4 drops
  1. PPO for alignment is a four-model accounting problem (7 min)
  2. Bradley-Terry is the equation underneath every preference dataset (7 min)
  3. The KL constraint is what keeps fine-tuning from eating its own tail (7 min)
  4. DPO is what happens when you compose two old ideas (8 min)

Phase 2: Derive DPO from Bradley-Terry by hand

Derive the closed-form policy from Bradley-Terry yourself

5 drops
  1. Solve for the reward and the partition function falls out (7 min)
  2. Plug into Bradley-Terry and the partition vanishes (7 min)
  3. The DPO loss is one negative log sigmoid in disguise (7 min)
  4. The gradient tells you exactly what DPO is doing to chosen vs rejected (8 min)
  5. A minimal DPO training step fits on one screen (8 min)
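
To give a feel for how small the core of such a step can be, here is a minimal sketch of the per-pair DPO loss in plain Python. It is not the course's implementation. It assumes the four sequence log-probabilities (summed over response tokens) have already been computed by the policy and the frozen reference; the function names and the `beta=0.1` default are illustrative choices for this sketch.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default, an assumption of this sketch
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# When the policy still equals the reference, both log-ratios are zero,
# so the margin is zero and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)
print(loss_at_init)
```

A real training step would compute this over a batch inside an autodiff framework and backpropagate through the policy only, since the reference model stays frozen.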

Phase 3: Where DPO breaks in real training runs

Diagnose distribution shift, length bias, and beta tuning

4 drops
  1. Distribution shift from the reference degrades DPO silently (7 min)
  2. DPO will learn to be verbose because longer is easier to win (7 min)
  3. Beta is the algorithm, not a knob to leave at default (8 min)
  4. Pick the variant that fixes your specific bug, not the trendiest one (7 min)

Phase 4: Predict three preference pairs and verify

Predict three preference pairs and verify against a training run

1 drop
  1. Predict the gradient direction on three preference pairs (8 min)
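
One way to rehearse that prediction before looking at a training run is to compute the gradient of the per-pair DPO loss with respect to the two policy log-probabilities. The sketch below does this for three made-up pairs; the log-prob values, the pair labels, and `beta=0.1` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dpo_logp_grads(policy_chosen_logp, policy_rejected_logp,
                   ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (dL/d chosen_logp, dL/d rejected_logp, weight) for one pair."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # The weight is large when the implicit reward ranks the pair wrongly.
    weight = sigmoid(-margin)
    return -beta * weight, beta * weight, weight


# Hypothetical pairs: (policy_chosen, policy_rejected, ref_chosen, ref_rejected)
pairs = {
    "already ranked correctly": (-8.0, -20.0, -10.0, -12.0),
    "no preference yet": (-10.0, -12.0, -10.0, -12.0),
    "ranked the wrong way": (-14.0, -9.0, -10.0, -12.0),
}

for name, logps in pairs.items():
    grad_chosen, grad_rejected, weight = dpo_logp_grads(*logps)
    print(f"{name}: dL/dchosen={grad_chosen:.4f}, "
          f"dL/drejected={grad_rejected:.4f}, weight={weight:.4f}")
```

The gradient with respect to the chosen log-prob is always negative and the one with respect to the rejected log-prob is always positive, so a descent step pushes chosen up and rejected down, with the largest push on the misranked pair. These are gradients with respect to the log-probs themselves. In an actual run both responses share model parameters, so the observed log-probs can move differently, which is what checking against a training run is meant to surface.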

Frequently asked questions

Each question below is covered in the “Understand DPO and Why It Replaced PPO for Alignment” learning path, which uses daily 5-minute micro-lessons that build from fundamentals to hands-on application.

What is the actual difference between DPO and PPO for alignment?
Why does DPO not need an explicit reward model?
How does the Bradley-Terry model lead to the DPO loss?
What does the beta hyperparameter actually control in DPO?
When does DPO fail or underperform PPO in practice?