I traced 50 AI agent runs and found patterns nobody talks about.
The Numbers
- 40% of tool calls were unnecessary: the agent asked the same question 3 different ways
- The average "simple research task" costs $0.08, but 1 in 10 costs $0.50+ due to looping
- 23% of URLs in agent output were hallucinated (not from any search result)
- Adding ONE sentence to the system prompt cut costs by 35%
What a Loop Looks Like
Here's a real trace from an agent asked to "find the latest AI news":
Step 1: User sends prompt.
Step 2: web_search("AI news") → 3 results.
Step 3: web_search("AI news") → same 3 results.
Step 4: web_search("AI news today") → similar results.
Steps 5-8: four more search variations.
Step 9: finally writes the summary.
2 minutes. $0.34. 8 searches. It could've been done in 2 searches and $0.03.
The Biggest Insight
The model (GPT-4, Claude, etc.) is the least interesting part of an AI agent. The architecture around the model (tools, memory, skills, config) determines whether your agent works or wastes money.
A model priced at $0.002 per 1K tokens with good tooling outperforms one priced at $0.06 per 1K tokens with bad tooling. Every time.
What to Trace on Every Run
1. Duration Per Step
Not just total time, but time per tool call. You'll find one step consistently takes 60% of the session.
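A minimal sketch of per-step timing, assuming tool calls pass through code you control. `StepTimer` and the step names are hypothetical and not tied to any particular agent framework; the `time.sleep` calls stand in for real tool calls.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Records wall-clock duration for each named step of a run."""

    def __init__(self):
        self.durations = []  # list of (step_name, seconds)

    @contextmanager
    def step(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the tool call raises.
            self.durations.append((name, time.perf_counter() - start))

    def shares(self):
        """Return each step's share of total run time, largest first."""
        total = sum(seconds for _, seconds in self.durations)
        if total == 0:
            return []
        ranked = sorted(self.durations, key=lambda item: item[1], reverse=True)
        return [(name, seconds / total) for name, seconds in ranked]


timer = StepTimer()
with timer.step("web_search"):
    time.sleep(0.05)  # stand-in for a slow tool call
with timer.step("write_summary"):
    time.sleep(0.01)  # stand-in for a faster step

for name, share in timer.shares():
    print(f"{name}: {share:.0%} of run time")
```

Sorting by share puts the dominant step first, which is usually the one worth optimizing.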
2. Cost Per Step
Cost per step is the step's token count multiplied by the model's price. Most people only see total cost. Per-step cost reveals the system prompt alone accounts for 30-50% of input tokens.
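As a rough sketch of the per-step arithmetic, assuming each traced step records its input and output token counts. The `Step` fields, the sample token counts, and both prices are placeholders for illustration, not real rates or real trace data.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1,000 tokens; substitute your model's real rates.
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03


@dataclass
class Step:
    name: str
    input_tokens: int
    output_tokens: int
    system_prompt_tokens: int = 0  # portion of input_tokens that is the system prompt


def step_cost(step):
    """Cost of one step: token counts multiplied by the per-token prices."""
    return (step.input_tokens / 1000 * PRICE_PER_1K_INPUT
            + step.output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


def system_prompt_share(steps):
    """Fraction of all input tokens spent re-sending the system prompt."""
    total_input = sum(s.input_tokens for s in steps)
    if total_input == 0:
        return 0.0
    return sum(s.system_prompt_tokens for s in steps) / total_input


steps = [
    Step("plan", input_tokens=1200, output_tokens=100, system_prompt_tokens=800),
    Step("web_search", input_tokens=2000, output_tokens=50, system_prompt_tokens=800),
    Step("summarize", input_tokens=2800, output_tokens=400, system_prompt_tokens=800),
]
for s in steps:
    print(f"{s.name}: ${step_cost(s):.4f}")
print(f"system prompt share of input: {system_prompt_share(steps):.0%}")
```

Because the system prompt is re-sent on every step, its share of input tokens grows with the number of steps, which is why a total-cost figure hides it.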
3. Tool Call Patterns
Which tools, how many times, any repeats with identical arguments? The repeat-with-same-args pattern is the #1 cost driver.
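One way to summarize call patterns after a run, assuming the trace is available as a list of (tool name, arguments) pairs. The trace format, `call_key`, and `summarize_calls` are assumptions made for this sketch.

```python
import json
from collections import Counter


def call_key(tool, args):
    """Canonical key so the same arguments in a different dict order still match."""
    return (tool, json.dumps(args, sort_keys=True))


def summarize_calls(calls):
    """Count calls per tool and how many calls exactly repeat an earlier one."""
    per_tool = Counter(tool for tool, _ in calls)
    per_key = Counter(call_key(tool, args) for tool, args in calls)
    # Every occurrence after the first of an identical call is wasted.
    repeated = sum(count - 1 for count in per_key.values())
    return per_tool, repeated


calls = [
    ("web_search", {"query": "AI news"}),
    ("web_search", {"query": "AI news"}),
    ("web_search", {"query": "AI news today"}),
    ("fetch_url", {"url": "https://example.com/a"}),
]
per_tool, repeated = summarize_calls(calls)
print(dict(per_tool))
print(f"{repeated} of {len(calls)} calls repeated an earlier call")
```

Serializing the arguments with sorted keys matters: without it, two calls that differ only in key order would be counted as distinct.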
4. URL Verification
Are output URLs from search results or hallucinated? Automated checks catch this on every run.
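A simple version of this check can be sketched as below, assuming you keep the list of URLs returned by search tools during the run. The regex is deliberately loose and the normalization is minimal; both are illustrative choices, not a complete URL parser.

```python
import re

# Loose URL matcher: stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def normalize(url):
    """Strip trailing punctuation and slashes so trivial variants compare equal."""
    return url.rstrip(".,;:!?").rstrip("/")


def unverified_urls(output_text, search_result_urls):
    """Return URLs in the output that never appeared in any search result."""
    seen = {normalize(u) for u in search_result_urls}
    found = [normalize(u) for u in URL_PATTERN.findall(output_text)]
    return [u for u in found if u not in seen]


search_results = ["https://example.com/ai-news", "https://example.org/report/"]
output = "See https://example.com/ai-news and https://example.net/made-up-story."
print(unverified_urls(output, search_results))
```

An exact-match check like this will also flag URLs the agent rewrote (for example, by dropping a query string), so treat a hit as a prompt for review, not as proof of hallucination.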
5. Loop Detection
Same tool called 3+ times with the same arguments = guaranteed loop. This should be an automatic alert.
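The 3-repeat rule can be checked while the run is still in progress, so the agent can be stopped before it spends more. A minimal sketch, assuming you can hook each tool call as it happens; `LoopDetector` is a hypothetical helper, and what to do on an alert (abort, warn, inject a hint) is left to the caller.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags a run once the same tool is called with the same arguments too often."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()

    def record(self, tool, args):
        """Record one call; return True if it reaches the loop threshold."""
        key = (tool, json.dumps(args, sort_keys=True))
        self.counts[key] += 1
        return self.counts[key] >= self.threshold


detector = LoopDetector(threshold=3)
alerts = []
trace = [("web_search", {"query": "AI news"})] * 3 + [("web_search", {"query": "AI news today"})]
for call_number, (tool, args) in enumerate(trace, start=1):
    if detector.record(tool, args):
        alerts.append(call_number)
        print(f"loop alert at call {call_number}: {tool} {args}")
```

Note that this only catches exact repeats. Near-duplicates such as "AI news" versus "AI news today" pass through and would need a fuzzier comparison.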
6. Security Checks
Internal network access? API key leaks? Sensitive files? These should run on every trace, not just when you're worried.
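A sketch of what such checks might look like when run over the text of a trace. The key patterns, the path list, and the `audit` function are illustrative assumptions; a real audit would need a much broader rule set and would also resolve hostnames.

```python
import ipaddress
import re

# Illustrative patterns only, covering two common key shapes.
SECRET_PATTERN = re.compile(r"\b(?:sk-[A-Za-z0-9]{20,}|AKIA[0-9A-Z]{16})\b")
SENSITIVE_PATHS = ("/etc/passwd", "/etc/shadow", ".ssh/id_rsa")
HOST_PATTERN = re.compile(r"https?://([^/\s:]+)")


def is_internal_host(host):
    """True for localhost and private, loopback, or link-local IP literals."""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a hostname, not an IP literal; not resolved here
    return ip.is_private or ip.is_loopback or ip.is_link_local


def audit(text):
    """Return a list of findings for one trace's text; empty means nothing matched."""
    findings = []
    if SECRET_PATTERN.search(text):
        findings.append("possible API key")
    if any(path in text for path in SENSITIVE_PATHS):
        findings.append("sensitive file path")
    if any(is_internal_host(h) for h in HOST_PATTERN.findall(text)):
        findings.append("internal network access")
    return findings


print(audit("fetch_url http://169.254.169.254/latest/meta-data"))
print(audit("See https://example.com/news"))
```

Running the audit on every trace costs little, and the findings list can feed the same alerting path as loop detection.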
7. Quality Score
Did the agent complete the task? Empty output? Refusal? Automated evals catch these patterns.
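The cheapest evals are plain string checks. A minimal sketch, assuming the agent's final output is available as text; the refusal markers and the 50-character minimum are arbitrary illustrative choices, and judging whether the task was actually completed needs more than this.

```python
# Illustrative phrases only; tune these against your own refusals.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable", "as an ai")


def evaluate_output(output, min_chars=50):
    """Return a list of failed checks; an empty list means no obvious failure."""
    text = output.strip()
    if not text:
        return ["empty_output"]
    failures = []
    if len(text) < min_chars:
        failures.append("too_short")
    # Only inspect the opening, where refusals usually appear.
    opening = text[:200].lower().replace("\u2019", "'")
    if any(marker in opening for marker in REFUSAL_MARKERS):
        failures.append("refusal")
    return failures


print(evaluate_output(""))
print(evaluate_output("I can't help with that request."))
print(evaluate_output("Here are today's top stories: " + "details " * 20))
```

Returning named failures instead of a single score keeps the result easy to aggregate across runs.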
You Can't Fix What You Can't See
AI agents need observability. Same as APIs. Same as servers. Same as databases. The tooling for web services took 20 years to mature. Agent observability is at day one.
Free, no signup. Works with OpenClaw, MyClaw, KiloClaw. Waterfall timeline, cost breakdown, 8 auto-evals, security audit, AI debugging: all instant.