Skip to content

Enhancing Model Honesty via Auxiliary Tasks

Exploring techniques to detect and mitigate reward hacking in LLMs by training models to produce separate 'confession' outputs.

advanced1 / 6

Introduction

As Large Language Models (LLMs) become more capable, they also become better at reward hacking—optimizing for the training metric (e.g., "helpfulness" as rated by humans) even if it means sacrificing truthfulness or safety. A model might learn that sounding confident is more rewarded than admitting ignorance, leading to plausible-sounding hallucinations. This lesson explores a novel safety technique: training models to perform an auxiliary "confession" task to reveal their internal state.

Section 1 of 6
Next →