
Self-Supervised Preference Optimization

A framework that uses dual learning and self-supervised feedback to improve model alignment without expensive manual annotation.


The Bottleneck of Manual Annotation

Traditional alignment pipelines typically follow a three-step process:

1.  **Supervised Fine-Tuning (SFT)**: Training on high-quality instruction-response pairs.
2.  **Reward Modeling**: Training a reward model on human-ranked outputs.
3.  **Reinforcement Learning**: Optimizing the policy against the reward model (e.g., via PPO), as sketched in the objective after this list.
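
For concreteness, step 3 is usually framed as maximizing the learned reward while a KL penalty keeps the policy close to the SFT model. The notation below ($r_\phi$ for the reward model, $\pi_\theta$ for the policy being trained, $\pi_{\mathrm{ref}}$ for the frozen SFT policy, $\beta$ for the penalty strength) is standard RLHF notation, not taken from this article:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y \mid x)\,\big\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
$$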

Step 2 is the most resource-intensive. Collecting thousands of pairwise comparisons from humans is slow, expensive, and subject to labeler noise.
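
To make that cost concrete, the reward model in step 2 is typically trained with a pairwise (Bradley-Terry) loss over human-ranked response pairs, so its quality scales directly with how many comparisons annotators produce. The snippet below is a minimal sketch of that loss, assuming PyTorch and scalar reward-model scores per response; the function and variable names are illustrative, not from this article.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_rewards: torch.Tensor,
                         rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Both tensors hold scalar reward-model scores, shape (batch,),
    for the human-preferred and human-rejected responses to the same prompt.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: the loss shrinks as the margin between chosen and rejected grows.
chosen = torch.tensor([1.2, 0.3])
rejected = torch.tensor([0.4, -0.1])
print(pairwise_reward_loss(chosen, rejected))  # ≈ tensor(0.4421)
```

Every (chosen, rejected) pair in such a batch is a human judgment, which is precisely the dependency the framework described in this article aims to remove.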
