A framework for improving model alignment through dual learning and self-supervised feedback, avoiding expensive manual annotations.
The loss function in self-supervised preference optimization often combines a standard language modeling loss with a consistency loss.
L_total = L_SFT + lambda * L_consistency
where L_consistency measures how well the generated response satisfies the dual objective (e.g., prompt reconstruction accuracy) and lambda weights the two terms.
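As a rough illustration, here is a minimal PyTorch sketch of the combined objective. The prompt-reconstruction form of L_consistency, the `total_loss` name, the default `lam` weight, and the Hugging Face-style `.logits` interface are assumptions made for the example, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(model, prompt_ids, response_ids, lam=0.1):
    """Sketch of L_total = L_SFT + lambda * L_consistency.

    L_SFT is next-token cross-entropy on the response; L_consistency is
    illustrated here as prompt reconstruction: cross-entropy when the model
    regenerates the prompt conditioned on its own response (one possible
    instantiation of the dual objective). Assumes a causal LM whose forward
    pass returns an object with a .logits attribute.
    """
    # L_SFT: score only the response tokens of [prompt; response].
    sft_input = torch.cat([prompt_ids, response_ids], dim=-1)
    sft_logits = model(sft_input).logits[:, :-1, :]
    sft_labels = sft_input[:, 1:].clone()
    sft_labels[:, : prompt_ids.size(1) - 1] = -100  # mask prompt positions
    loss_sft = F.cross_entropy(
        sft_logits.reshape(-1, sft_logits.size(-1)),
        sft_labels.reshape(-1),
        ignore_index=-100,
    )

    # L_consistency: score only the prompt tokens of [response; prompt].
    dual_input = torch.cat([response_ids, prompt_ids], dim=-1)
    dual_logits = model(dual_input).logits[:, :-1, :]
    dual_labels = dual_input[:, 1:].clone()
    dual_labels[:, : response_ids.size(1) - 1] = -100  # mask response positions
    loss_consistency = F.cross_entropy(
        dual_logits.reshape(-1, dual_logits.size(-1)),
        dual_labels.reshape(-1),
        ignore_index=-100,
    )

    return loss_sft + lam * loss_consistency
```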
Unlike PPO, which requires a separately trained (and then frozen) reward model plus a value network, self-supervised methods can often be implemented as a single-stage training process (similar to Direct Preference Optimization, or DPO), but without labeled preference pairs.
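A hypothetical single-stage step might look like the sketch below: the policy generates its own responses and the combined loss is backpropagated directly, with no reward model, value network, or preference dataset involved. The `train_step` helper, the reuse of `total_loss` from the sketch above, and the generation settings are illustrative assumptions.

```python
import torch

def train_step(model, tokenizer, prompts, optimizer, lam=0.1, max_new_tokens=128):
    """One illustrative single-stage update on a batch of raw prompts."""
    model.train()
    # Assumes a tokenizer configured for left padding so batched generation works.
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    prompt_ids = batch["input_ids"]

    # The model produces its own candidate responses (self-supervised feedback).
    with torch.no_grad():
        generated = model.generate(
            prompt_ids,
            attention_mask=batch["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    response_ids = generated[:, prompt_ids.size(1):]

    # Combined objective from the previous sketch; no external reward signal.
    loss = total_loss(model, prompt_ids, response_ids, lam=lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```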