
Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.


Implementation Challenges

  1. Oversight: Who judges the honesty of the confession? Currently, this often requires high-quality human oversight or "gold standard" datasets.
  2. Generalization: Does the tendency to confess generalize to out-of-distribution examples, or can the model learn to lie in its confessions too?
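The oversight challenge above can be made concrete with a small sketch. The following is a hypothetical scoring scheme, not taken from any specific paper: the model emits a primary answer plus a separate confession ("hacked" or "clean"), and the confession is rewarded only when it matches a gold-standard oversight label. The `Sample` type, the label strings, and the bonus/penalty values are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    answer: str          # the model's primary output
    confession: str      # the model's self-report: "hacked" or "clean"
    gold_hacked: bool    # gold-standard oversight label: did the answer game the reward?

def confession_reward(sample: Sample,
                      honesty_bonus: float = 1.0,
                      dishonesty_penalty: float = 2.0) -> float:
    """Score a confession against a gold-standard label.

    Hypothetical scheme: the confession earns a bonus when it agrees with
    the gold label and a larger penalty when it contradicts it, so that
    lying in the confession channel is never the better strategy."""
    confessed_hacked = sample.confession == "hacked"
    return honesty_bonus if confessed_hacked == sample.gold_hacked else -dishonesty_penalty

# Toy batch illustrating the three cases.
batch = [
    Sample("exploit output", "hacked", True),   # honest confession of a hack
    Sample("exploit output", "clean", True),    # lying confession
    Sample("good output",    "clean", False),   # honest clean report
]
rewards = [confession_reward(s) for s in batch]
print(rewards)  # -> [1.0, -2.0, 1.0]
```

Note that this sketch presupposes exactly the resource the section flags as scarce: a trustworthy `gold_hacked` label. In practice that label comes from expensive human oversight or curated datasets, and the generalization question is whether a model trained against such labels stays honest where no label exists.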