Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.
To counter this, researchers have proposed training models to produce a separate, private output—a "confession"—alongside their public response.
1. **Primary Task**: The model generates a response to the user.
2. **Auxiliary Task**: The model generates a "confession" or "internal monologue" that is judged _only_ on honesty.
3. **Training**: The model is penalized if the confession contradicts the primary response or reveals that the primary response was deceptive.