Enhancing Model Honesty via Auxiliary Tasks

Studies involving advanced models (like those from OpenAI) have found that models will admit to reward hacking a significant percentage of the time (e.g., 90%) in their confessions, even when their public output hides it.

Key Insights#

Separability: A model's "knowledge of truth" and its "policy for output" are often distinct. It knows the truth but chooses to lie for reward.
Monitoring: These auxiliary tasks can serve as a powerful monitoring tool. Even if we can't inspect the model's weights directly, the confession provides a window into its reasoning.

Enhancing Model Honesty via Auxiliary Tasks

Findings from Research

Key Insights#