Agentic lesson
Track 03 · Agentic Advanced ~9 min

How do you tell if an AI agent actually did the job?

You can grade a chatbot on its answer. An agent takes many steps (calling tools, reading results, deciding what to do next), and it can reach the right answer the wrong way or fail silently three steps in. This lesson covers what agent evaluation measures, how observability traces let you see inside a run, and where the field's hardest benchmarks stand. Step through a realistic agent trace below and grade it yourself.

01 Why an agent is harder to grade than a chatbot

With a single chatbot reply you can look at the output and judge it. An agent is different: to finish a task it runs a multi-step trajectory — it calls a tool, reads the result (the "observation"), reasons about what to do next, calls another tool, and eventually produces a final answer. That sequence is where things go right or wrong. Two traps make agents tricky to evaluate. First, an agent can reach the right final answer through a broken path — guessing, using the wrong tool, or taking ten steps where two would do. Second, it can fail silently: pass a malformed argument to a tool, get back an error, and quietly carry on as if nothing happened. So good agent evaluation looks at both the destination and the journey.

  • A chatbot is judged on one output; an agent is judged on a trajectory — a chain of tool calls, observations, and decisions.
  • Outcome-only metrics miss "corrupt success" — the agent lands on the right answer via an unsafe or illogical path. Pair task-success with trajectory evaluation.
  • According to A Survey on Evaluation of LLM-based Agents (2025), evaluation can happen at turn-level, milestone-level, or trajectory-level granularity.
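To make the two traps concrete, here is a minimal sketch of a check that looks at the journey as well as the destination. The `Step` schema, the "no retry after an error" rule, and the tool names are all assumptions made for illustration; a real harness would define a swallowed error in its own terms.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One tool call plus the observation it produced (hypothetical schema)."""
    tool: str
    args: dict
    observation: str
    error: Optional[str] = None  # set when the tool returned an error


def swallowed_errors(steps: list[Step]) -> list[int]:
    """Return the indexes of errored steps that were never retried.

    Assumption: an error counts as handled only if the very next step
    calls the same tool again. Anything else is treated as carrying on
    as if nothing happened.
    """
    flagged = []
    for i, step in enumerate(steps):
        if step.error is None:
            continue
        nxt = steps[i + 1] if i + 1 < len(steps) else None
        if nxt is None or nxt.tool != step.tool:
            flagged.append(i)
    return flagged


def grade(steps: list[Step], final_answer_correct: bool) -> str:
    """Combine the outcome with a check of the path that led to it."""
    if not final_answer_correct:
        return "failure"
    return "corrupt success" if swallowed_errors(steps) else "clean success"


run = [
    Step("get_order", {"customer_id": "c-17"}, "order o-42, placed 12 days ago"),
    Step("check_policy", {"order": 42}, "", error="invalid argument: order"),
    Step("issue_refund", {"order_id": "o-42"}, "refund issued"),
]
print(grade(run, final_answer_correct=True))  # corrupt success
```

An outcome-only metric would score this run as a pass. The path check is what surfaces the policy lookup that errored and was skipped.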

02 The four axes agent evaluation measures

There's no single "agent score." Researchers and practitioners look at several dimensions, because each one catches a different kind of failure. Four come up again and again:

  • Task success — did the agent achieve the goal? This is often measured by execution, not opinion: in SWE-bench the agent's code edit must make the repo's tests pass; in tau-bench the final database state is compared to the goal state. Objective, but it only sees the end.
  • Tool-call correctness — did the agent call the right tools, with valid arguments, in a sensible order? A wrong tool or a hallucinated argument is a failure even if the final answer happens to look fine.
  • Step efficiency — did it get there without wandering? Redundant or looping steps cost money and latency and often signal shaky reasoning.
  • Trajectory quality — how the agent acted overall: reasoning coherence, goal-directedness, and whether the action sequence makes sense. Per the survey, this is judged either reference-based (aligning the run to a gold trajectory — exact, partial, unordered, or subset matching) or reference-free (an LLM judge reads the observed sequence and scores it).
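The reference-based matching modes and the step-efficiency idea can be sketched over plain lists of tool names. The definitions below are one plausible reading of "exact, partial, unordered, subset", not the survey's formal ones, and the efficiency ratio is an assumed convention rather than a standard metric.

```python
def exact_match(run: list[str], gold: list[str]) -> bool:
    """Same tools, same order, nothing extra."""
    return run == gold


def in_order_match(run: list[str], gold: list[str]) -> bool:
    """Partial match: every gold tool appears in the run, in order; extras allowed."""
    remaining = iter(run)
    # `tool in remaining` consumes the iterator up to the first hit,
    # so later gold tools must appear after earlier ones.
    return all(tool in remaining for tool in gold)


def unordered_match(run: list[str], gold: list[str]) -> bool:
    """Same tools with the same counts, in any order."""
    return sorted(run) == sorted(gold)


def subset_match(run: list[str], gold: list[str]) -> bool:
    """The agent called no tool outside the gold set."""
    return set(run) <= set(gold)


def step_efficiency(run: list[str], gold: list[str]) -> float:
    """Gold length divided by actual length, capped at 1.0 (assumed convention)."""
    return min(1.0, len(gold) / len(run)) if run else 0.0


gold = ["get_order", "check_policy", "issue_refund"]
run = ["get_order", "get_order", "check_policy", "issue_refund"]

print(exact_match(run, gold))      # False
print(in_order_match(run, gold))   # True
print(unordered_match(run, gold))  # False
print(subset_match(run, gold))     # True
print(step_efficiency(run, gold))  # 0.75
```

The repeated `get_order` call fails the strict modes but passes the looser ones, which is why the matching mode you choose should be reported alongside the score.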

One more axis matters for anything you'd deploy: reliability. A single passing run can hide inconsistency. tau-bench introduced the pass^k metric, the probability that the agent succeeds on every one of k independent attempts at the same task, precisely because agents "can pass once but fail on repeats." That is stricter than the familiar pass@k, which only asks for at least one success in k attempts. Report multi-trial metrics, not a lucky single shot.
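Here is a small sketch of how pass^k can be estimated from repeated trials. It uses the combinatorial estimator C(c, k) / C(n, k), where c of n trials on a task succeeded, averaged over tasks. To our understanding this matches the estimator in the tau-bench paper, but check the paper before relying on it. The function names and sample results are made up for illustration.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the chance that k trials drawn from `trials` all succeed."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    # comb(successes, k) is 0 when successes < k, so the estimate drops to 0.
    return comb(successes, k) / comb(trials, k)


def mean_pass_hat_k(results: list[list[bool]], k: int) -> float:
    """Average the per-task estimate over all tasks."""
    per_task = [pass_hat_k(sum(r), len(r), k) for r in results]
    return sum(per_task) / len(per_task)


# Two tasks, eight trials each: one always passes, one passes 6 of 8 times.
results = [[True] * 8, [True] * 6 + [False] * 2]

print(mean_pass_hat_k(results, 1))  # 0.875
print(mean_pass_hat_k(results, 8))  # 0.5
```

With k = 1 the agent looks 87.5% reliable. Ask for eight clean runs in a row and the flaky task contributes nothing, so the score falls to 0.5.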

03 See it work: step through an agent trace

Here's a short agent run for a customer-service task: "Refund the customer's last order if it's within the 30-day window." Step through the trajectory one node at a time — each is a tool call, an observation, or the final answer. The scorecard on the right grades the run on the four axes. Then flip to the failed run: same task, but the agent calls the wrong tool and passes a hallucinated argument — and you'll see exactly which step the trace catches it on. These figures are illustrative, built to show how observability surfaces a break.

  • The trace is the unit of observability: a request flows through spans — each LLM call, tool execution, and observation — with timing, inputs, outputs, and parent-child nesting.
  • In the failed run, the final answer can still read plausibly — only the trace reveals the wrong tool and the swallowed error. That's the whole point of observability.

04 Observability: the trace is how you see inside

You can't evaluate what you can't see. Observability for agents records each run as a trace: the flow of a request broken into spans (also called observations) — one for each LLM call, each retrieval step, each tool execution, and any custom logic. Every span carries timing, inputs, outputs, parent-child nesting, and metadata. That nested structure is what turns a black-box run into something you can step through, exactly like the inspector above.
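The nested structure is easier to picture as a data type. This is a minimal sketch of a span tree, not any platform's actual schema; the field names, span kinds, and timings are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step inside a trace (hypothetical schema)."""
    name: str
    kind: str  # assumed kinds: "llm", "tool", "retrieval", "custom"
    start_ms: int
    end_ms: int
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


def render(root: Span) -> str:
    """Print the trace as an indented outline, one line per span."""
    return "\n".join(
        f"{'  ' * depth}{span.kind}:{span.name} ({span.duration_ms} ms)"
        for depth, span in walk(root)
    )


trace = Span("refund_request", "custom", 0, 900, children=[
    Span("plan", "llm", 0, 300),
    Span("get_order", "tool", 300, 450, inputs={"customer_id": "c-17"}),
    Span("answer", "llm", 450, 900),
])
print(render(trace))
# custom:refund_request (900 ms)
#   llm:plan (300 ms)
#   tool:get_order (150 ms)
#   llm:answer (450 ms)
```

Once a run is in this shape, stepping through it is a tree walk, and questions like "which tool call was slowest" become simple queries over the spans.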

A growing ecosystem of platforms captures these traces — LangSmith, Arize Phoenix, Langfuse, and Weights & Biases Weave are among the major agent/LLM observability and evaluation tools (Phoenix and Langfuse are open source). To stop every tool inventing its own format, the OpenTelemetry GenAI semantic conventions — a CNCF-backed, vendor-neutral standard — define a common schema for GenAI spans, metrics, and events: prompts, responses, token usage, and tool/agent calls. Two caveats worth knowing: most of these conventions are still experimental (the schema can change), and they do not capture prompt or response content by default, to avoid leaking personal data into your telemetry.

  • A trace = the request flow; spans = the individual steps inside it, with timing, I/O, and nesting.
  • OpenTelemetry GenAI conventions give a vendor-neutral schema so observability data is portable across tools — but they're largely experimental as of 2026.
  • Open-source options (Phoenix, Langfuse) build on OpenTelemetry/OpenInference; managed options (LangSmith, Weave) add hosted dashboards and evals. Re-verify features against the docs — they move fast.
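The "no content by default" caveat is worth seeing in code. This sketch builds the attributes for a tool-execution span as a plain dictionary, with no OpenTelemetry SDK involved. The two `gen_ai.*` keys follow the GenAI semantic conventions as we understand them at the time of writing; because the conventions are experimental, verify the names against the current spec. The `app.*` keys and the `capture_content` flag are hypothetical and are not part of any standard.

```python
import json


def tool_span_attributes(
    tool_name: str,
    arguments: dict,
    result: str,
    capture_content: bool = False,
) -> dict:
    """Build span attributes for one tool call, leaving content out by default."""
    attrs = {
        # Names taken from the OpenTelemetry GenAI conventions (experimental).
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }
    if capture_content:
        # Hypothetical, application-specific keys. Arguments and results
        # may contain personal data, so they are recorded only on opt-in.
        attrs["app.tool.arguments"] = json.dumps(arguments, sort_keys=True)
        attrs["app.tool.result"] = result
    return attrs


safe = tool_span_attributes("issue_refund", {"order_id": "o-42"}, "refund issued")
print(sorted(safe))  # ['gen_ai.operation.name', 'gen_ai.tool.name']

debug = tool_span_attributes(
    "issue_refund", {"order_id": "o-42"}, "refund issued", capture_content=True
)
print(debug["app.tool.arguments"])  # {"order_id": "o-42"}
```

Making content capture an explicit opt-in keeps the default telemetry free of prompts and tool payloads, at the cost of less detail when you debug.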

05 Grading the journey: LLM-as-a-judge

Some things are easy to grade automatically — did the tests pass, does the final state match the goal. But "was the reasoning coherent?" or "was this trajectory sensible?" resists a simple rule. The common answer is LLM-as-a-judge: use a strong model to score or compare outputs (or whole trajectories) in place of, or alongside, human raters. This is also how reference-free trajectory evaluation works — no gold path required; the judge just reads the observed run and rates it.

Does it work? Reasonably well. Strong judges such as GPT-4 reach over 80% agreement with human preferences on MT-Bench and Chatbot Arena (Zheng et al., 2023), about the same rate at which two humans agree with each other. The G-Eval method (Liu et al., 2023), which prompts the judge with chain-of-thought and a structured form to fill in, reached a Spearman correlation of 0.514 with human ratings on summarization, beating older metrics like BLEU and ROUGE. But judges are not neutral: Zheng et al. document position bias (favoring an answer because of where it appears, often the first slot), verbosity bias (longer looks better), and self-enhancement bias (preferring their own outputs), plus limited ability to grade math and reasoning questions. The fix isn't to abandon them. Use mitigations (swap option positions, give explicit rubrics, use multiple judges) and validate against human labels. Never treat a single judge score as ground truth.

  • LLM-as-a-judge scores outputs or trajectories with a strong model — the backbone of reference-free trajectory evaluation.
  • Strong judges hit >80% human agreement; G-Eval correlates better with humans than BLEU/ROUGE — but both numbers are from 2023 and task-specific.
  • Known biases — position, verbosity, self-enhancement — mean you must add mitigations and human validation.
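Two of these mitigations are easy to sketch: asking the judge twice with positions swapped, and checking judge labels against human labels. The judge here is a stand-in callable, not a real model call, and the `Judge` signature and label strings are assumptions for illustration.

```python
from typing import Callable

# Hypothetical judge signature: (task, first_answer, second_answer)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, task: str, a: str, b: str) -> str:
    """Ask twice with positions swapped; name a winner only if both passes agree."""
    forward = judge(task, a, b)   # a is shown first
    backward = judge(task, b, a)  # b is shown first
    first_pass = {"first": "a", "second": "b"}.get(forward, "tie")
    second_pass = {"first": "b", "second": "a"}.get(backward, "tie")
    return first_pass if first_pass == second_pass else "tie"


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge and the human picked the same label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def always_first(task: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"


def prefers_longer(task: str, first: str, second: str) -> str:
    """Stub judge with pure verbosity bias."""
    return "first" if len(first) >= len(second) else "second"


task = "Refund the customer's last order if it's within the 30-day window."
short = "Refund issued."
long = "Refund issued for order o-42 because it was placed 12 days ago."

print(debiased_preference(always_first, task, short, long))    # tie
print(debiased_preference(prefers_longer, task, short, long))  # b
print(agreement_rate(["a", "b", "tie", "a"], ["a", "b", "a", "a"]))  # 0.75
```

The swap neutralizes the position-biased stub but not the verbosity-biased one, which prefers the longer answer in both orders. That is why rubrics and human validation are still needed alongside position swaps.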

06 Where the hardest benchmarks stand

Public benchmarks are how the field compares agents. A few have become landmarks, and the headline numbers tell a humbling story about how hard real-world agent work is:

  • AgentBench (2023, ICLR 2024): the first systematic multi-environment suite, with 8 environments (operating system, database, knowledge graph, games, web), exposing a wide gap between commercial and open-source models.
  • SWE-bench (2023): 2,294 real GitHub issues across 12 Python repos, scored by whether the generated code edit makes the tests pass. The original paper's best model, Claude 2, resolved just 1.96% of them.
  • WebArena (2023): 812 long-horizon web tasks in a self-hosted environment; the paper's best GPT-4 agent hit 14.41% versus 78.24% for humans.
  • GAIA (2023): 466 real-world questions needing reasoning, multimodality, and tool use; humans scored 92% while GPT-4 with plugins managed about 15%.
  • tau-bench (2024): tool-agent-user interaction with policy-following, where state-of-the-art function-calling agents came in under 50% success and were inconsistent across repeats.

Two cautions before you quote any of these. First, those specific numbers are from the source papers (2023–2024) — frontier models score far higher today, so always cite the model, the benchmark version, and the date. Second, benchmark contamination is real: public test sets can leak into training data and inflate scores, so prefer held-out or verified splits (for example, SWE-bench Verified) and read leaderboard numbers as upper bounds, not guarantees.

Concept map

A bird's-eye view of agent evaluation and observability — expand each branch to see the key ideas from this lesson.

Why agents are harder to grade
  • A chatbot turn has one output to score; an agent takes a multi-step trajectory of tool calls and decisions.
  • Outcome-only metrics miss "corrupt success" — reaching the right end state via an unsafe or illogical path.
  • A high one-shot pass rate can hide inconsistency, so reliability across repeated trials matters.
The four axes of agent evaluation
  • Task success: did the agent achieve the goal? Often execution-based (SWE-bench tests passing) or end-state comparison (tau-bench DB state).
  • Tool-call correctness and step efficiency: the right tools, valid arguments, a sensible order, and no redundant or looping steps.
  • Trajectory quality: how it acted overall (reasoning coherence, goal-directedness), graded against a gold trajectory or by an LLM judge.
  • Reliability, the extra axis for deployment: consistency across repeated trials (tau-bench's pass^k). Evaluation can be turn-, milestone-, or trajectory-level.
Observability: seeing inside the trace
  • A request is recorded as a trace of spans/observations — each LLM call, retrieval, and tool execution — with timing, inputs, outputs, and nesting.
  • OpenTelemetry GenAI semantic conventions (CNCF) give a vendor-neutral schema; most are experimental and skip content capture by default to avoid PII leakage.
  • Tooling: LangSmith, Arize Phoenix, Langfuse, and W&B Weave — Phoenix and Langfuse are open source.
LLM-as-a-judge
  • Using a strong LLM to score or compare outputs, including reference-free trajectory evaluation.
  • Strong judges reach >80% agreement with human preferences on MT-Bench/Chatbot Arena (Zheng et al., 2023) — comparable to inter-human agreement.
  • Watch for position, verbosity, and self-enhancement biases; mitigate with position swaps, rubrics, and multiple judges. G-Eval reaches 0.514 Spearman vs humans on summarization.
Where the hardest benchmarks stand
  • WebArena: 812 web tasks; the source paper's best GPT-4 agent scored 14.41% vs human 78.24%. SWE-bench: 2,294 real GitHub issues, execution-based, original best ~1.96%.
  • GAIA: 466 questions where humans scored 92% vs GPT-4+plugins 15%; tau-bench had SOTA function-calling agents under 50% success.
  • Numbers are version- and harness-dependent and may be contaminated by training data — cite model + benchmark version + date and treat as upper bounds.
Sources & further reading

Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the peer-reviewed papers and vendor-neutral docs below. The agent trace and scores in the interactive are illustrative and labelled as such; benchmark figures are quoted from the source papers with their model and date, and the field moves quickly — re-verify before relying on any number.

Responsible use

This is an educational lesson, not professional, legal, or security advice. The agent trace, scores, and verdicts in the interactive are illustrative teaching aids, not measurements of any real system. Benchmark results cited here are tied to specific models, benchmark versions, and dates from the source papers; treat public-leaderboard numbers as upper bounds (benchmark contamination and harness differences inflate scores) and re-verify any figure against the linked source before acting on it.

Agent observability traces can contain prompts, tool inputs/outputs, and personal data; OpenTelemetry GenAI conventions disable content capture by default for this reason. If you instrument real systems, treat captured traces as regulated data and follow your organization's privacy and governance requirements.

Agent evaluation & observability — in one page

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Why agents are hard to grade

An agent runs a multi-step trajectory (tool calls, observations, decisions), not a single reply. It can reach the right answer the wrong way (corrupt success) or fail silently. Judge the destination and the journey.

The four axes (+ reliability)

Task success — goal achieved, often by execution (SWE-bench tests pass; tau-bench end-state). Tool-call correctness — right tools, valid args, sensible order. Step efficiency — no wandering. Trajectory quality — reference-based (vs a gold path) or reference-free (LLM judge). Reliability — multi-trial metrics like tau-bench's pass^k.

Observability

A run is recorded as a trace of spans (LLM calls, tools, observations) with timing, I/O, and nesting. OpenTelemetry GenAI conventions give a vendor-neutral schema (mostly experimental; no content capture by default). Tools: LangSmith, Arize Phoenix, Langfuse, W&B Weave.

LLM-as-a-judge

Strong judges reach >80% human agreement (Zheng et al., 2023); G-Eval beats BLEU/ROUGE. But they show position, verbosity, and self-enhancement bias — mitigate (swap positions, rubrics, multiple judges) and validate against humans.

Benchmark cautions

Numbers are version/date-specific and public sets can leak into training data. Prefer verified splits; read leaderboard scores as upper bounds; treat traces as sensitive data.