A framework that uses dual learning and self-supervised feedback to improve model alignment without expensive manual annotation.
Recent approaches leverage dual learning principles to bypass the need for external labels. The core idea is to treat the generation and discrimination tasks as dual problems that regularize each other:
1. **Forward Path (Generation)**: The model generates a response to a prompt.
2. **Backward Path (Reconstruction/Verification)**: The system attempts to reconstruct the prompt or verify the consistency of the response using a dual model or a separate head.
3. **Consistency Check**: The discrepancy between the forward and backward paths serves as a self-supervised signal. If the model generates a response that is inconsistent with the prompt's intent (as measured by the dual task), it receives a negative signal (see the sketch after this list).
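A minimal, framework-agnostic sketch of this loop is below. The `generate` and `reconstruct_logprob` callables, the `threshold` cutoff, and the function name `consistency_signal` are all illustrative stand-ins for real model calls, not part of any published API:

```python
from typing import Callable, Tuple

# Hypothetical interfaces: `generate` maps a prompt to a response
# (forward path); `reconstruct_logprob` scores how well the original
# prompt can be recovered from the response (backward path).
Generate = Callable[[str], str]
ScoreFn = Callable[[str, str], float]

def consistency_signal(
    prompt: str,
    generate: Generate,
    reconstruct_logprob: ScoreFn,
    threshold: float = -5.0,  # illustrative cutoff, not from the source
) -> Tuple[str, float, bool]:
    """Run one forward/backward pass and emit a self-supervised signal."""
    response = generate(prompt)                    # 1. forward path
    score = reconstruct_logprob(response, prompt)  # 2. backward path
    return response, score, score >= threshold    # 3. consistency check

# Toy usage: echo-style stand-ins in place of real models.
resp, score, ok = consistency_signal(
    "Summarize the report.",
    generate=lambda p: "A short summary of the report.",
    reconstruct_logprob=lambda r, p: -2.0,  # dummy reconstruction score
)
```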
In frameworks like PODL, the model is trained to maximize the likelihood of its own high-confidence outputs while minimizing the likelihood of low-confidence or inconsistent ones. This effectively creates a "self-rewarding" loop where the model learns to prefer outputs that are robust and consistent.
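The source does not give PODL's exact objective, but one way to realize such a self-rewarding loop is a pairwise loss that pushes probability mass toward consistent outputs and away from inconsistent ones, in the spirit of DPO-style preference objectives. The function name and tensor layout below are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def self_reward_loss(
    logp_consistent: torch.Tensor,    # log-likelihoods of high-confidence outputs
    logp_inconsistent: torch.Tensor,  # log-likelihoods of inconsistent outputs
) -> torch.Tensor:
    # Maximize the margin between the two groups: the model is rewarded
    # for assigning more probability to its consistent outputs than to
    # its inconsistent ones. An illustrative stand-in, not the PODL loss.
    return -F.logsigmoid(logp_consistent - logp_inconsistent).mean()

# Example with dummy log-likelihoods for two prompt/response pairs.
loss = self_reward_loss(torch.tensor([-1.2, -0.8]), torch.tensor([-3.5, -2.9]))
```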