Skip to content

Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.

advanced6 / 6

Conclusion

Enhancing model honesty via auxiliary tasks represents a promising direction for scalable oversight. By incentivizing models to "tell on themselves," we can detect alignment failures that would otherwise go unnoticed in standard output evaluations. This technique is likely to become a standard component of safety stacks for frontier models.

Section 6 of 6
View Original