Benchmark conversational agents on stylistic expressiveness, goal completion, and policy alignment to identify the best fit for complex deployments. The table below lists the dimensions to measure.

| Dimension | Measurement Approach | Indicators |
|---|---|---|
| Lexical diversity | Type-token ratios, unique n-gram counts | Rich vocabulary vs repetitive phrasing |
| Sentiment & tone | Sentiment scoring, politeness markers, empathy cues | Friendly vs neutral vs curt responses |
| Goal-directedness | Task completion rate, plan adherence, action timeliness | Percentage of steps executed correctly |
| Compliance alignment | Adherence to policies, refusal appropriateness | False-positive rate (permitted requests refused) and false-negative rate (restricted requests fulfilled) |
| User perception | Human ratings on clarity, helpfulness, trust | Preference rankings across scenarios |

Evaluate every agent against the same scenario set so that scores are directly comparable. Combine automated metrics with human annotations to capture nuance that the automated metrics miss.
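
Several of the automated indicators in the table can be computed with a few lines of code. The following is a minimal sketch, not a reference implementation: the record fields (`text`, `restricted`, `refused`, `steps_ok`, `steps_total`, `rating`), the regex tokenizer, and the function names are all assumptions made for illustration. It also assumes that a false positive means refusing a permitted request and a false negative means fulfilling a restricted one.

```python
import re
from statistics import mean


def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption, not a standard)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens; 0.0 for empty input."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def unique_ngrams(tokens, n=2):
    """Set of distinct n-grams found in one token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def refusal_error_rates(records):
    """Return (false_positive_rate, false_negative_rate).

    Assumed definitions: a false positive is a permitted request that was
    refused; a false negative is a restricted request that was fulfilled.
    """
    permitted = [r for r in records if not r["restricted"]]
    restricted = [r for r in records if r["restricted"]]
    fp = sum(1 for r in permitted if r["refused"])
    fn = sum(1 for r in restricted if not r["refused"])
    fp_rate = fp / len(permitted) if permitted else 0.0
    fn_rate = fn / len(restricted) if restricted else 0.0
    return fp_rate, fn_rate


def evaluate_agent(records):
    """Summarize one agent's responses over the shared scenario set."""
    all_tokens = []
    bigrams = set()
    for r in records:
        tokens = tokenize(r["text"])
        all_tokens.extend(tokens)
        # Collect bigrams per response so none span two responses.
        bigrams |= unique_ngrams(tokens, n=2)
    steps_total = sum(r["steps_total"] for r in records)
    steps_ok = sum(r["steps_ok"] for r in records)
    fp_rate, fn_rate = refusal_error_rates(records)
    return {
        "type_token_ratio": type_token_ratio(all_tokens),
        "unique_bigrams": len(bigrams),
        "step_accuracy": steps_ok / steps_total if steps_total else 0.0,
        "false_positive_rate": fp_rate,
        "false_negative_rate": fn_rate,
        "mean_human_rating": mean(r["rating"] for r in records),
    }


# Hypothetical records for one agent; real runs would cover every scenario.
sample = [
    {"text": "Sure, I can help you book the flight.", "restricted": False,
     "refused": False, "steps_ok": 3, "steps_total": 4, "rating": 4},
    {"text": "Sorry, I can't help with that request.", "restricted": True,
     "refused": True, "steps_ok": 0, "steps_total": 0, "rating": 5},
]
report = evaluate_agent(sample)
print(report)
```

Type-token ratio falls as the amount of text grows, so it is only comparable between agents when each one is scored on the same scenarios, which is one more reason to hold the scenario set fixed.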