
Self-Supervised Preference Optimization

A framework for improving model alignment without expensive manual annotations using dual learning and self-supervised feedback.


Case Study: Reducing Annotation Costs

Recent experiments have shown that self-supervised methods can match the performance of models trained on thousands of human-labeled samples. For instance, a model trained using self-generated feedback on a coding dataset improved its pass@1 rate on HumanEval without seeing a single human-ranked pair during the alignment phase.
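
To make the idea concrete, here is a minimal sketch of how self-generated preference pairs might be constructed for a coding dataset. It assumes a pairwise (chosen/rejected) preference format and uses an execution-based score, such as the fraction of self-generated unit tests a completion passes, as the self-supervised signal; the `generate` and `score` callables are hypothetical placeholders, not part of any specific framework or API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # completion ranked higher by the self-supervised signal
    rejected: str  # completion ranked lower by the same signal

def build_self_supervised_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: sample k completions for a prompt
    score: Callable[[str, str], float],         # hypothetical: e.g. fraction of self-generated tests passed
    k: int = 8,
    min_gap: float = 0.25,
) -> List[PreferencePair]:
    """Construct preference pairs from the model's own outputs.

    No human ranking is involved: each candidate completion is scored by a
    self-supervised signal, and the best and worst completions per prompt
    form a chosen/rejected pair for preference optimization.
    """
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        scored = sorted(((score(prompt, c), c) for c in candidates), key=lambda t: t[0])
        worst_score, worst = scored[0]
        best_score, best = scored[-1]
        # Keep only pairs where the signal clearly separates the two completions,
        # to avoid training on noisy or ambiguous preferences.
        if best_score - worst_score >= min_gap:
            pairs.append(PreferencePair(prompt=prompt, chosen=best, rejected=worst))
    return pairs
```

The resulting pairs can then be fed to any pairwise preference objective; the key point the case study illustrates is that the ranking signal comes entirely from the model's own outputs rather than from human annotators.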
