Design benchmark suites that capture tool-use reliability, multilingual stability, and planning depth for frontier agent systems.
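One way such a suite might score the three dimensions is sketched below. Everything in it is an assumption made for illustration, not an established definition: the `Trial` record, the function names, and the metrics themselves. Tool-use reliability is taken as the share of task/language pairs solved on every repeated run, multilingual stability as one minus the spread of per-language success rates, and planning depth as the largest step count reached before the success rate falls below a threshold.

```python
from dataclasses import dataclass
from statistics import pstdev


@dataclass(frozen=True)
class Trial:
    """One agent run on one task (hypothetical record format)."""
    task_id: str
    language: str    # language of the task prompt, e.g. "en"
    plan_steps: int  # number of steps the task is assumed to require
    success: bool


def _group(trials, key):
    """Collect the success flags of all trials under the given key function."""
    groups = {}
    for t in trials:
        groups.setdefault(key(t), []).append(t.success)
    return groups


def tool_use_reliability(trials):
    """Share of (task, language) pairs solved on every repeated run."""
    groups = _group(trials, lambda t: (t.task_id, t.language))
    return sum(all(runs) for runs in groups.values()) / len(groups)


def multilingual_stability(trials):
    """1 minus the population std dev of per-language success rates."""
    groups = _group(trials, lambda t: t.language)
    rates = [sum(runs) / len(runs) for runs in groups.values()]
    return 1.0 - pstdev(rates)


def planning_depth(trials, threshold=0.5):
    """Largest step count reached before the success rate drops below threshold."""
    groups = _group(trials, lambda t: t.plan_steps)
    depth = 0
    for steps in sorted(groups):
        runs = groups[steps]
        if sum(runs) / len(runs) < threshold:
            break
        depth = steps
    return depth


def run_suite(trials):
    """Bundle the three assumed metrics into one report."""
    return {
        "tool_use_reliability": tool_use_reliability(trials),
        "multilingual_stability": multilingual_stability(trials),
        "planning_depth": planning_depth(trials),
    }


# Made-up sample data, for illustration only.
SAMPLE_TRIALS = [
    Trial("book_flight", "en", 2, True),
    Trial("book_flight", "en", 2, True),
    Trial("book_flight", "de", 2, True),
    Trial("book_flight", "de", 2, False),
    Trial("plan_trip", "en", 5, True),
    Trial("plan_trip", "en", 5, True),
    Trial("plan_trip", "de", 5, True),
    Trial("plan_trip", "de", 5, False),
    Trial("refactor_repo", "en", 8, False),
]

report = run_suite(SAMPLE_TRIALS)
```

Requiring success on every repeated run, rather than averaging, is a deliberate choice in this sketch: it penalizes flaky tool calls that an average pass rate would hide. A real suite would likely need richer records (tool-call traces, per-step outcomes) than the single `success` flag assumed here.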