A framework for improving model alignment through dual learning and self-supervised feedback, avoiding expensive manual annotations.
The loss function in self-supervised preference optimization often combines a standard language modeling loss with a consistency loss.
L_total = L_SFT + lambda * L_consistency
where L_consistency measures how well the generated response satisfies the dual objective (e.g., prompt reconstruction accuracy) and lambda weights the two terms.
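As a rough illustration, here is a minimal PyTorch sketch of the combined objective. The prompt-reconstruction form of L_consistency, the `total_loss` name, the default `lam` weight, and the Hugging Face-style `.logits` interface are assumptions made for the example, not a prescribed implementation.

```python
import torch
import torch.nn.functional as F

def total_loss(model, prompt_ids, response_ids, lam=0.1):
    """Sketch of L_total = L_SFT + lambda * L_consistency.

    L_SFT is next-token cross-entropy on the response; L_consistency is
    illustrated here as prompt reconstruction: cross-entropy when the model
    regenerates the prompt conditioned on its own response (one possible
    instantiation of the dual objective). Assumes a causal LM whose forward
    pass returns an object with a .logits attribute.
    """
    # L_SFT: score only the response tokens of [prompt; response].
    sft_input = torch.cat([prompt_ids, response_ids], dim=-1)
    sft_logits = model(sft_input).logits[:, :-1, :]
    sft_labels = sft_input[:, 1:].clone()
    sft_labels[:, : prompt_ids.size(1) - 1] = -100  # mask prompt positions
    loss_sft = F.cross_entropy(
        sft_logits.reshape(-1, sft_logits.size(-1)),
        sft_labels.reshape(-1),
        ignore_index=-100,
    )

    # L_consistency: score only the prompt tokens of [response; prompt].
    dual_input = torch.cat([response_ids, prompt_ids], dim=-1)
    dual_logits = model(dual_input).logits[:, :-1, :]
    dual_labels = dual_input[:, 1:].clone()
    dual_labels[:, : response_ids.size(1) - 1] = -100  # mask response positions
    loss_consistency = F.cross_entropy(
        dual_logits.reshape(-1, dual_logits.size(-1)),
        dual_labels.reshape(-1),
        ignore_index=-100,
    )

    return loss_sft + lam * loss_consistency
```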
Unlike PPO, which requires a separately trained (and then frozen) reward model plus a value network, self-supervised methods can often be implemented as a single-stage training process (similar to Direct Preference Optimization, or DPO), but without labeled preference pairs.
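A hypothetical single-stage step might look like the sketch below: the policy generates its own responses and the combined loss is backpropagated directly, with no reward model, value network, or preference dataset involved. The `train_step` helper, the reuse of `total_loss` from the sketch above, and the generation settings are illustrative assumptions.

```python
import torch

def train_step(model, tokenizer, prompts, optimizer, lam=0.1, max_new_tokens=128):
    """One illustrative single-stage update on a batch of raw prompts."""
    model.train()
    # Assumes a tokenizer configured for left padding so batched generation works.
    batch = tokenizer(prompts, return_tensors="pt", padding=True)
    prompt_ids = batch["input_ids"]

    # The model produces its own candidate responses (self-supervised feedback).
    with torch.no_grad():
        generated = model.generate(
            prompt_ids,
            attention_mask=batch["attention_mask"],
            max_new_tokens=max_new_tokens,
        )
    response_ids = generated[:, prompt_ids.size(1):]

    # Combined objective from the previous sketch; no external reward signal.
    loss = total_loss(model, prompt_ids, response_ids, lam=lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```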