Advanced Academy Reader

Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.

advanced•5 / 6

Implementation Challenges

Oversight: Who judges the honesty of the confession? Currently, this often requires high-quality human oversight or "gold standard" datasets.
Generalization: Does the tendency to confess generalize to out-of-distribution examples, or can the model learn to lie in its confessions too?

Section 5 of 6•