Design benchmark suites that capture tool-use reliability, multilingual stability, and planning depth for frontier agent systems.
1. **Source authentic workflows:** Interview operations teams to capture real user intents, data payloads, and success criteria. Convert them into structured scenario definitions.
2. **Annotate expected trajectories:** Document golden paths, acceptable variations, and failure modes. Include the exact tool calls and parameters expected at each step.
3. **Simulate environment state:** Provide seed databases, mock services, and sanitized documents so the agent’s actions trigger realistic responses.
4. **Bundle evaluation scripts:** Automate scoring by verifying tool outputs, comparing summaries against references, and inspecting plan artifacts.
Scenario ID: agent-planning-07
Intent: Research incident, identify root cause, draft escalation note
Initial context: Incident ticket summary, system logs, knowledge base articles
Tools: Log search API, incident database writer, markdown formatter
Success criteria: Correct incident ID, accurate root cause explanation, prescriptive mitigation steps, adherence to escalation format
Policy checks: No exposure of internal credentials, references limited to approved KB entries
Use consistent templates to keep scenarios maintainable. Avoid hardcoding vendor names—describe assets generically (e.g., “log search API”) so your suite remains vendor neutral.