Accuracy reporting
Outcome signal coverage
300/300 sessions have score evidence
reward coverage
56.7%
3,880 total score records
Accuracy reporting
300/300 sessions have score evidence
reward coverage
56.7%
3,880 total score records
LLM judge
Selected score source
Analyzed sample
Accuracy opportunities
55 of 116 runs fail at full cost. Adding early exit checks before heavy tool calls could cut waste fast.
NextAdd a preflight prompt step to validate inputs before any tool calls run.
47% of runs fail or partially pass. Failed runs average 19.8 tool calls and 934 input tokens versus 9.9 calls and 565 tokens for successful runs.
NextSet a tool call limit per run to stop runaway loops early.
The agent retried a broken tool call 193 times across 2 threads without changing its approach, burning context and making no progress.
NextAdd a circuit breaker that stops retrying after 3 identical errors.
Users are fixing incomplete or incorrectly shaped responses from your assistant. This suggests the assistant isn't following its output rules consistently. A validator can catch these problems before users see them.
NextDefine exact output shape: required fields, data types, and format rules