A framework that uses dual learning and self-supervised feedback to improve model alignment without expensive manual annotation.
Recent approaches leverage dual learning principles to bypass the need for external labels. The core idea is to treat the generation and discrimination tasks as dual problems that regularize each other:
1. **Forward Path (Generation)**: The model generates a response to a prompt.
2. **Backward Path (Reconstruction/Verification)**: The system attempts to reconstruct the prompt or verify the consistency of the response using a dual model or a separate head.
3. **Consistency Check**: The discrepancy between the forward and backward paths serves as a self-supervised signal. If the model generates a response that is inconsistent with the prompt's intent (as measured by the dual task), it receives a negative signal (see the sketch after this list).
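A minimal, framework-agnostic sketch of this loop is below. The `generate` and `reconstruct_logprob` callables, the `threshold` cutoff, and the function name `consistency_signal` are all illustrative stand-ins for real model calls, not part of any published API:

```python
from typing import Callable, Tuple

# Hypothetical interfaces: `generate` maps a prompt to a response
# (forward path); `reconstruct_logprob` scores how well the original
# prompt can be recovered from the response (backward path).
Generate = Callable[[str], str]
ScoreFn = Callable[[str, str], float]

def consistency_signal(
    prompt: str,
    generate: Generate,
    reconstruct_logprob: ScoreFn,
    threshold: float = -5.0,  # illustrative cutoff, not from the source
) -> Tuple[str, float, bool]:
    """Run one forward/backward pass and emit a self-supervised signal."""
    response = generate(prompt)                    # 1. forward path
    score = reconstruct_logprob(response, prompt)  # 2. backward path
    return response, score, score >= threshold    # 3. consistency check

# Toy usage: echo-style stand-ins in place of real models.
resp, score, ok = consistency_signal(
    "Summarize the report.",
    generate=lambda p: "A short summary of the report.",
    reconstruct_logprob=lambda r, p: -2.0,  # dummy reconstruction score
)
```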
In frameworks like PODL, the model is trained to maximize the likelihood of its own high-confidence outputs while minimizing the likelihood of low-confidence or inconsistent ones. This effectively creates a "self-rewarding" loop where the model learns to prefer outputs that are robust and consistent.
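The source does not give PODL's exact objective, but one way to realize such a self-rewarding loop is a pairwise loss that pushes probability mass toward consistent outputs and away from inconsistent ones, in the spirit of DPO-style preference objectives. The function name and tensor layout below are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def self_reward_loss(
    logp_consistent: torch.Tensor,    # log-likelihoods of high-confidence outputs
    logp_inconsistent: torch.Tensor,  # log-likelihoods of inconsistent outputs
) -> torch.Tensor:
    # Maximize the margin between the two groups: the model is rewarded
    # for assigning more probability to its consistent outputs than to
    # its inconsistent ones. An illustrative stand-in, not the PODL loss.
    return -F.logsigmoid(logp_consistent - logp_inconsistent).mean()

# Example with dummy log-likelihoods for two prompt/response pairs.
loss = self_reward_loss(torch.tensor([-1.2, -0.8]), torch.tensor([-3.5, -2.9]))
```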