Exploring the shift from static dataset evaluation to dynamic, agentic benchmarking in the era of capable AI models.
The train-test split has served us well for a decade, but a fixed held-out set is not enough to evaluate general-purpose intelligence. As engineers, we must move from "testing on test.csv" to "evaluating in the wild."