
Self-Supervised Preference Optimization

A framework for improving model alignment with dual learning and self-supervised feedback, without expensive manual annotations.


Technical Implementation

Loss Function

The loss function in self-supervised preference optimization often combines a standard language modeling loss with a consistency loss.

L_total = L_SFT + lambda * L_consistency

where lambda is a weighting hyperparameter and L_consistency measures the alignment between the generated response and the dual objective (e.g., prompt reconstruction accuracy).
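
As a rough illustration, the PyTorch sketch below computes this combined objective, instantiating the consistency term as a prompt-reconstruction cross-entropy. The function name, argument layout, and the default lambda of 0.1 are assumptions made for the example, not values prescribed by the framework.

```python
import torch
import torch.nn.functional as F

def total_loss(sft_logits, sft_labels, recon_logits, recon_labels, lam=0.1):
    """Combined objective L_total = L_SFT + lambda * L_consistency,
    with the consistency term instantiated as prompt-reconstruction
    cross-entropy (one possible dual objective)."""
    # Standard next-token cross-entropy on the response tokens (L_SFT).
    l_sft = F.cross_entropy(
        sft_logits.view(-1, sft_logits.size(-1)), sft_labels.view(-1)
    )
    # Consistency term: how well the original prompt can be
    # reconstructed from the generated response (L_consistency).
    l_consistency = F.cross_entropy(
        recon_logits.view(-1, recon_logits.size(-1)), recon_labels.view(-1)
    )
    return l_sft + lam * l_consistency
```

In practice, lambda trades off fluency on the supervised data against consistency with the dual objective.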

Training Dynamics

Unlike PPO, which requires a separate frozen reward model and a value network, self-supervised methods can often be implemented as a single-stage training process (similar to Direct Preference Optimization, or DPO), but without requiring labeled preference pairs.
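
To make the comparison with DPO concrete, the sketch below shows one way such a single-stage update could look in PyTorch: a DPO-style logistic loss in which the chosen/rejected roles for two sampled responses are assigned by a self-supervised score (for instance, prompt reconstruction accuracy) instead of human labels. The function name, tensor arguments, and the beta default are hypothetical, not the exact formulation of the framework described here.

```python
import torch
import torch.nn.functional as F

def self_supervised_dpo_loss(policy_logp_a, policy_logp_b,
                             ref_logp_a, ref_logp_b,
                             score_a, score_b, beta=0.1):
    """DPO-style loss where the chosen/rejected roles come from a
    self-supervised score (e.g., prompt-reconstruction accuracy)
    rather than human preference labels."""
    # Rank the two sampled responses with the self-supervised signal.
    a_is_chosen = score_a >= score_b

    # Per-response log-ratios of the policy against the reference model.
    ratio_a = policy_logp_a - ref_logp_a
    ratio_b = policy_logp_b - ref_logp_b

    # Margin of the chosen response over the rejected one.
    margin = torch.where(a_is_chosen, ratio_a - ratio_b, ratio_b - ratio_a)

    # Standard DPO logistic loss on that margin.
    return -F.logsigmoid(beta * margin).mean()
```

Because the preference signal is computed on the fly, no separate reward-model or value-network stage is needed; the policy and reference log-probabilities are all that the update consumes.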
