A framework for improving model alignment using dual learning and self-supervised feedback, without relying on expensive manual annotations.
Traditional alignment pipelines typically follow a three-step process:
1. **Supervised Fine-Tuning (SFT)**: Training on high-quality instruction-response pairs.
2. **Reward Modeling**: Training a reward model on human-ranked outputs.
3. **Reinforcement Learning**: Optimizing the policy using the reward model (e.g., via PPO).
Step 2 is the most resource-intensive: collecting thousands of pairwise comparisons from human annotators is slow, expensive, and subject to labeler noise.
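
For context, step 2 is usually implemented by training a scalar reward model on human preference pairs with a Bradley-Terry (pairwise logistic) loss. The sketch below is illustrative only and is not this framework's API: the `RewardModel` class, the `gpt2` backbone, and the single hard-coded batch are assumptions for demonstration.

```python
# Minimal sketch of conventional reward modeling (step 2), assuming a
# pretrained transformer backbone with a scalar value head and a dataset of
# human-ranked (chosen, rejected) response pairs. Illustrative names only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Scores a (prompt, response) sequence with a single scalar reward."""
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Summarize each sequence with the hidden state of its last real token.
        last_idx = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.value_head(summary).squeeze(-1)

def pairwise_loss(chosen_rewards, rejected_rewards):
    """Bradley-Terry loss: push the chosen response's reward above the rejected one's."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical single training step over one human-labeled preference pair.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = RewardModel("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

chosen = tokenizer(["Prompt ... preferred response"], return_tensors="pt", padding=True)
rejected = tokenizer(["Prompt ... dispreferred response"], return_tensors="pt", padding=True)

loss = pairwise_loss(model(**chosen), model(**rejected))
loss.backward()
optimizer.step()
```

Each gradient step needs only a relative judgment (chosen vs. rejected); that human-labeled signal is exactly what this framework aims to replace with dual learning and self-supervised feedback.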