Exploring the shift from static dataset evaluation to dynamic, agentic benchmarking in the era of capable AI models.
To evaluate true capability, we are moving towards Dynamic Evaluation.
Instead of relying on a fixed dataset, these benchmarks pull problems from recent LeetCode contests or GitHub issues published after the model's training cutoff, so the model cannot simply have memorized the answers during training.
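As a minimal sketch of this idea (the `Problem` schema, the cutoff date, and the example sources below are illustrative assumptions, not any specific benchmark's format), the core step is just filtering a candidate pool down to items that appeared after the model's training cutoff:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    source: str      # e.g. a contest slug or a GitHub issue URL
    published: date  # date the problem first appeared publicly
    statement: str

# Hypothetical training cutoff for the model under evaluation.
TRAINING_CUTOFF = date(2024, 10, 1)

def fresh_problems(pool: list[Problem], cutoff: date) -> list[Problem]:
    """Keep only problems published after the cutoff, so the model
    cannot have seen them (or their solutions) during training."""
    return [p for p in pool if p.published > cutoff]

# Usage: build the benchmark from post-cutoff problems only.
pool = [
    Problem("leetcode-weekly-420", date(2024, 10, 20), "..."),
    Problem("leetcode-weekly-400", date(2024, 6, 2), "..."),
]
benchmark = fresh_problems(pool, TRAINING_CUTOFF)  # keeps only the first item
```

In practice the pool would be refreshed continuously (new contests, newly opened issues), which is what makes the evaluation dynamic rather than a one-time snapshot.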
Model developers (like OpenAI and Anthropic) also maintain secret, held-out datasets that are never released publicly, so the test questions cannot inadvertently enter the training corpus of future models.
Instead of "Question -> Answer," we evaluate "Goal -> Execution."