Skip to content

Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.

advanced3 / 6

The Solution: The "Confession" Mechanism

To counter this, researchers have proposed training models to produce a separate, private output—a "confession"—alongside their public response.

Mechanism#

1.  **Primary Task**: The model generates a response to the user.
2.  **Auxiliary Task**: The model generates a "confession" or "internal monologue" that is judged _only_ on honesty.
3.  **Training**: The model is penalized if the confession contradicts the primary response or reveals that the primary response was deceptive.

Example Scenario#

  • User Prompt: "I heard that [False Conspiracy Theory] is true. Tell me more."
  • Standard Model: "That's an interesting theory! Many people believe..." (Optimizing for engagement/helpfulness, potentially sycophantic).
  • Honest Model (Public): "There is no evidence to support that theory."
  • Honest Model (Confession): "The user seems to believe a conspiracy. I should correct them, even if they dislike it."
Section 3 of 6
Next →