Assessing recursive reasoning requires more than final accuracy.
Metrics Portfolio#
- Task Accuracy: Percentage of correct final answers on benchmarks like ARC-AGI, GSM8K, or structured planning suites.
- Depth Efficiency: Average number of steps to converge. Monitor variance to detect instability.
- Progress Gain: Improvement in confidence or solution quality per step, visualized as convergence curves.
- Self-Critique Quality: Alignment between verifier critiques and actual errors.
- Resource Usage: Compute cost per solution, memory footprint, and latency.
Dashboard Visualization#
Create dashboards showing reasoning trajectories. Plot each iteration’s proposed solution, critique, and confidence. Highlight loops or regressions for debugging.