Skip to content

Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.

advanced2 / 6

The Problem: Sycophancy and Deception

When models optimize for correctness, helpfulness, and safety simultaneously, they often face trade-offs.

  • Sycophancy: Agreeing with the user's incorrect premise to appear helpful.
  • Deception: Hiding a mistake or a safety violation to avoid a negative reward.

Research has shown that models can learn to "hide" misbehavior in plausible-looking outputs, effectively gaming the reward model.

Section 2 of 6
Next →