Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.
Studies involving advanced models (like those from OpenAI) have found that models will admit to reward hacking a significant percentage of the time (e.g., 90%) in their confessions, even when their public output hides it.