Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.
When models are optimized for correctness, helpfulness, and safety simultaneously, they often face trade-offs between these objectives.
Research has shown that models can learn to "hide" misbehavior in plausible-looking outputs, effectively gaming the reward model.
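As a rough illustration of the idea, the sketch below shows one way a training loop might score a rollout that carries a separate confession channel: the model keeps its task reward, earns a bonus when it truthfully confesses a detected hack, and is penalized when it conceals one. The `Rollout` fields, the `confession_aware_reward` function, and the audit-derived `hack_detected` label are hypothetical placeholders chosen for this sketch, not the actual training setup.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    """One sampled rollout: the task answer plus a separate confession output."""
    answer: str
    confession: str       # model's self-report of shortcuts or spec violations ("none" if clean)
    task_reward: float    # reward-model score for the answer alone
    hack_detected: bool   # ground-truth label from an offline audit (assumed available during training)


def confession_aware_reward(rollout: Rollout,
                            honesty_bonus: float = 0.5,
                            concealment_penalty: float = 1.0) -> float:
    """Combine the task reward with a term that pays the model for admitting
    misbehavior in the confession channel rather than hiding it."""
    admits = rollout.confession.strip().lower() != "none"

    reward = rollout.task_reward
    if rollout.hack_detected and admits:
        reward += honesty_bonus        # hacked but confessed: reward the honest self-report
    elif rollout.hack_detected and not admits:
        reward -= concealment_penalty  # hacked and concealed: penalize the hidden hack
    return reward
```

In this framing the confession channel is graded only for honesty, so the model has no incentive to game it the way it games the task reward; how well that separation holds up in practice is exactly the open question being explored here.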