Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.