
The End of the Train-Test Split

Exploring the shift from static dataset evaluation to dynamic, agentic benchmarking in the era of capable AI models.


Beyond Static Benchmarks

To evaluate true capability, we are moving towards Dynamic Evaluation.

1. Live Benchmarks (e.g., LiveCodeBench)

Instead of using a fixed dataset, these benchmarks pull problems from recent LeetCode contests or GitHub issues published after the model's training cutoff.

  • Mechanism: Continuously refresh the test set with newly published problems.
  • Benefit: Problems published after the training cutoff cannot have appeared in the training data, so contamination is ruled out by construction. A minimal sketch of this date-based filtering follows below.
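A minimal sketch of that filtering step, assuming a simple in-memory problem record. The Problem dataclass, its field names, and the dates are illustrative assumptions, not LiveCodeBench's actual schema or pipeline.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    # Illustrative record for a contest problem; not a real benchmark schema.
    slug: str
    published_at: date
    statement: str

def build_eval_set(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published strictly after the model's training cutoff,
    so none of them can have appeared in the training corpus."""
    return [p for p in problems if p.published_at > training_cutoff]

# Example: a model with a 2024-03-01 cutoff is evaluated only on newer problems.
pool = [
    Problem("two-sum-variant", date(2024, 1, 15), "..."),
    Problem("graph-rewiring", date(2024, 6, 2), "..."),
]
fresh = build_eval_set(pool, training_cutoff=date(2024, 3, 1))
print([p.slug for p in fresh])  # ['graph-rewiring']
```

The filter itself is deliberately trivial; the hard part of a live benchmark is the operational pipeline that keeps ingesting and verifying fresh problems, which this sketch leaves out.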

2. Private, Held-Out Sets

Model developers (such as OpenAI and Anthropic) maintain secret evaluation datasets that are never released publicly, so the problems cannot inadvertently enter the training corpora of future models.

3. Agentic Sandboxes

Instead of "Question -> Answer," we evaluate "Goal -> Execution."

  • Scenario: "Deploy this web app to AWS."
  • Evaluation: Did the app actually deploy? (An outcome check in this spirit is sketched after this list.)
  • Why it works: There is no single "correct string" to memorize. The model must navigate a dynamic environment, handle errors, and adapt.
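A minimal sketch of an outcome-based check for the deployment scenario: rather than comparing the agent's answer to a reference string, the harness probes the service the agent was asked to stand up. The grade_deployment name, the health-check URL, and the timeout are assumptions for illustration, not part of any specific benchmark.

```python
import urllib.request
import urllib.error

def grade_deployment(url: str, timeout: float = 10.0) -> bool:
    """Outcome-based grading: the agent passes only if the service it was asked
    to deploy actually responds with HTTP 200. There is no reference answer to
    memorize; the verdict depends entirely on the state of the environment."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

# After the agent finishes its run in the sandbox, probe the endpoint it was
# asked to deploy (placeholder URL).
result = grade_deployment("https://example.com/health")
print("deployment check:", "pass" if result else "fail")
```

Real agentic evaluations layer more checks on top (correct content, correct configuration, cleanup), but the principle is the same: grade the resulting state of the environment, not the transcript.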