Agentic lesson

Track · Agentic Intermediate ~8 min

Watching an LLM app in production

Shipping a model is the start, not the finish. Once real traffic hits it, you need to see inside every call, watch cost and latency, and catch the moment answer quality starts to slip. That's monitoring and observability, and you'll run a live console for one right here on the page.


01 Why a deployed model needs watching

A classic software function does the same thing forever: same input, same output. A machine-learning system doesn't get that luxury. The landmark 2015 paper "Hidden Technical Debt in Machine Learning Systems" (Sculley and colleagues at Google) made the case that real-world ML carries large ongoing maintenance costs, which come from tangled dependencies, hidden feedback loops, and, above all, the outside world changing underneath the model. The data your model meets next month won't look exactly like the data it was tested on. That's why a one-time test before launch isn't enough; you need continuous monitoring after it.

Governance frameworks say the same thing. The NIST AI Risk Management Framework builds it into its MEASURE function: continuously monitor deployed systems for performance deviations and emerging risks, document what you find, and review the monitoring process itself over time. Monitoring isn't a nice-to-have you bolt on later; it's part of operating an AI system responsibly.

The world drifts. Inputs, user behaviour, and even what counts as a "good" answer all change after launch, so the model's behaviour must be watched, not assumed.
It's a documented cost, not a surprise. The technical-debt work established maintenance and monitoring as an inherent, ongoing expense of running ML in production.
It's a governance expectation. NIST AI RMF's MEASURE function calls for continuous monitoring, documentation of results, and periodic review of the process.

02 Monitoring vs. observability, and the trace

People use these two words interchangeably, but they're not the same. Monitoring is collecting and alerting on a set of metrics you decided to watch in advance: latency, error rate, token usage, an evaluation score. It tells you that something is wrong. Observability is the broader ability to ask arbitrary questions about what your system did, from the outputs it left behind, so you can work out why. In LLM apps, observability comes mostly from tracing.

A trace is the full record of one request. It's made of spans: each span is one operation inside that request. The vendor-neutral OpenTelemetry GenAI semantic conventions (developed by the GenAI SIG within OpenTelemetry, a CNCF project) define what these spans look like for AI apps: a top-level invoke_agent span, with child chat spans (one per LLM call) and execute_tool spans (one per tool call). Each span carries standardized attributes like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. One important caveat: as of early 2026 most of these GenAI conventions were still marked experimental, so the exact attribute names can still change.

Monitoring answers "is something wrong?" (predefined metrics + alerts); observability answers "why?" (ask anything from the traces).
A trace = one request; spans = the steps inside it (agent → retrieval → model call → tool call).
OpenTelemetry's GenAI conventions give a shared schema so different tools can read the same telemetry, but it's still experimental, so treat attribute names as provisional.
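To make the trace-and-span idea concrete, here is a minimal sketch of one request as a span tree. It uses plain Python dataclasses rather than the OpenTelemetry SDK, so the Span class, its fields, the model name, and the numbers are all assumptions made for illustration; only the gen_ai.* attribute names come from the conventions described above, and those are still provisional.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation inside a request (hypothetical minimal shape)."""
    name: str            # e.g. "invoke_agent", "chat", "execute_tool"
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def walk(span):
    """Yield a span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def token_totals(root):
    """Sum input and output tokens over every span that reports usage."""
    total_in = total_out = 0
    for span in walk(root):
        total_in += span.attributes.get("gen_ai.usage.input_tokens", 0)
        total_out += span.attributes.get("gen_ai.usage.output_tokens", 0)
    return total_in, total_out


def slowest_child(root):
    """Return the descendant span with the longest duration."""
    descendants = [s for s in walk(root) if s is not root]
    return max(descendants, key=lambda s: s.duration_ms)


# One illustrative trace: an agent call with two model calls and one tool call.
trace = Span("invoke_agent", 1850.0, children=[
    Span("chat", 620.0, {
        "gen_ai.request.model": "example-model",  # made-up model name
        "gen_ai.usage.input_tokens": 410,
        "gen_ai.usage.output_tokens": 35,
        "gen_ai.response.finish_reasons": ["tool_calls"],
    }),
    Span("execute_tool", 300.0),
    Span("chat", 900.0, {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 520,
        "gen_ai.usage.output_tokens": 140,
        "gen_ai.response.finish_reasons": ["stop"],
    }),
])

print(token_totals(trace))        # (930, 175)
print(slowest_child(trace).name)  # chat
```

Because the whole tree is kept, you can ask questions nobody planned for in advance, such as "which step ate the time?", which is the observability half of the distinction.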

03 What you actually measure

Production signals fall into two buckets: operational (is it fast, cheap, and up?) and quality (are the answers any good?). Operational signals are easy to count; quality is the hard part, because for open-ended text there's rarely a single "right answer" to compare against. A third thing to watch is drift: the inputs and their meaning shifting over time, which can quietly degrade a model. Each is covered in turn below.


Operational: the things you can simply count

These come straight off the spans and need no judgement call: latency (often both time-to-first-token and total time), throughput, error or failure rate, token usage (input and output) and the cost that follows from it, plus finish/stop reasons. They map directly onto the OpenTelemetry GenAI attributes and are surfaced by tools like Datadog, Helicone, LangSmith, and W&B Weave.

watch: latency · tokens in/out · cost · error rate · finish reason
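As a rough sketch of how those counts roll up, the snippet below summarizes a batch of call records into p95 latency, error rate, average tokens, and cost per 1k calls. The record fields and the per-token prices are assumptions for illustration; real prices depend on the model and provider, and real tools compute these for you.

```python
# Hypothetical per-token prices in USD; real prices vary by model and provider.
PRICE_PER_INPUT_TOKEN = 0.000001
PRICE_PER_OUTPUT_TOKEN = 0.000004


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[max(rank, 1) - 1]


def summarize(calls):
    """Roll a list of call records up into the panel-style metrics."""
    n = len(calls)
    cost = sum(
        c["input_tokens"] * PRICE_PER_INPUT_TOKEN
        + c["output_tokens"] * PRICE_PER_OUTPUT_TOKEN
        for c in calls
    )
    return {
        "calls": n,
        "p95_latency_ms": percentile([c["latency_ms"] for c in calls], 95),
        "error_rate": sum(1 for c in calls if c["error"]) / n,
        "avg_tokens": sum(c["input_tokens"] + c["output_tokens"] for c in calls) / n,
        "cost_per_1k_calls": cost / n * 1000,
    }


# Twenty made-up calls: latencies of 100..2000 ms, one of them an error.
calls = [
    {"latency_ms": 100 * (i + 1), "input_tokens": 500,
     "output_tokens": 100, "error": i == 7}
    for i in range(20)
]
summary = summarize(calls)
print(summary)
```

A percentile is used for latency rather than a mean because a few very slow calls are exactly what users notice, and an average tends to hide them.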

Quality: judged, because there's no single right answer

You can't grade open-ended text with a simple match. Older overlap metrics like BLEU and ROUGE correlate poorly with human judgement for creative output, which is a large part of the motivation for LLM-as-a-judge, where a model scores outputs against a rubric. For RAG apps, RAGAS defines reference-free signals like faithfulness and answer relevancy. For factuality, sampling-consistency methods such as SelfCheckGPT flag likely hallucinations without needing logits or an external database.

watch: LLM-judge score · faithfulness · answer relevancy · hallucination signal
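The sampling-consistency idea can be sketched in a few lines: ask the model the same question several times, then check how well each sentence of the main answer is supported by the resampled answers. The version below is a toy stand-in, not SelfCheckGPT itself. It assumes the samples were already collected, and it scores support with crude word overlap so it runs without any model; the published method uses much stronger scorers.

```python
import re


def content_words(text):
    """Lowercased words longer than three characters (a crude stopword filter)."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def support(sentence, sample):
    """Fraction of the sentence's content words that also appear in one sample."""
    words = content_words(sentence)
    if not words:
        return 1.0
    return len(words & content_words(sample)) / len(words)


def inconsistency_scores(answer_sentences, samples):
    """Per sentence: 1 - mean support across samples (higher = more suspect)."""
    scores = []
    for sentence in answer_sentences:
        mean_support = sum(support(sentence, s) for s in samples) / len(samples)
        scores.append(1.0 - mean_support)
    return scores


# Made-up answer and resamples for illustration.
answer = ["The Eiffel Tower is in Paris.", "It was completed in 1789."]
samples = [
    "The Eiffel Tower stands in Paris and opened in 1889.",
    "Paris is home to the Eiffel Tower, finished in 1889.",
    "The Eiffel Tower in Paris was completed in 1889.",
]
scores = inconsistency_scores(answer, samples)
for sentence, score in zip(answer, scores):
    print(f"{score:.2f}  {sentence}")
```

The first sentence is echoed by every sample and scores low; the second, with its unsupported date, scores high. That contrast is the signal, and it needs no logits and no external database.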

Drift: the inputs and meaning shifting over time

Data drift is the input distribution changing; concept drift is the input–output relationship changing even when inputs look similar. Both can quietly degrade a model. In LLM/RAG systems a common proxy is embedding drift, a shift in the vector embeddings of prompts, responses, or retrieved docs. Because embeddings are high-dimensional, simple univariate tests (like Kolmogorov–Smirnov), which look at one dimension at a time, are unreliable on their own, so practitioners often prefer distance-based measures from a baseline.

watch: input distribution · embedding distance from baseline · performance-aware detectors
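One simple distance-based measure is the cosine distance between the average (centroid) embedding of a baseline window and that of the current window. The sketch below uses tiny hand-made 3-dimensional vectors so it runs with the standard library alone; real embeddings have hundreds or thousands of dimensions, and centroid distance is only one of several options, so treat this as an illustration rather than a recommended detector.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline, current):
    """Cosine distance between the centroids of two embedding windows."""
    return cosine_distance(centroid(baseline), centroid(current))


# Toy embeddings: the baseline and "similar" windows point roughly the same way,
# while the "shifted" window points somewhere else entirely.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]]
similar = [[0.95, 0.05, 0.0], [1.0, 0.0, 0.05]]
shifted = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

print(f"similar window: {embedding_drift(baseline, similar):.4f}")
print(f"shifted window: {embedding_drift(baseline, shifted):.4f}")
```

What counts as "too far" is a threshold you would calibrate on your own traffic, for example by measuring how much the distance moves between two healthy weeks.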

04 Run an observability console

Here's the whole idea in one place. The console below streams simulated LLM-call traces (each one a little span tree: invoke_agent → retrieval → chat → execute_tool) while panels tally latency, cost, and tokens, and an LLM-as-judge quality score is tracked over time. Everything runs healthy by default. Flip "Inject incident" to simulate a bad deploy or data-drift event: watch the quality score drift down and the alert fire when it crosses the threshold. Flip it back off to recover. The numbers are illustrative: a teaching model, not a real system.

[Interactive console: an incident on/off toggle with an incident type selector; a live trace stream; a metrics panel showing call count, p95 latency (ms), cost per 1k calls (illustrative), average tokens, and error rate; an LLM-as-judge quality chart with a threshold of 0.75; and an alert banner reading "Alert firing. Quality score below threshold."]

Traces make it legible. Each call is a span tree you can open up; that's the observability part: you can see exactly where time and tokens went.
An incident shows up as a trend, then an alert. Quality slides first; monitoring catches it the moment it crosses the line you set in advance.
Thresholds are choices, not constants. "Below 0.75" or ">2 std-devs from baseline" are practitioner heuristics you calibrate per application.
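Both heuristics named above can be sketched as a small alert rule over a rolling window of judge scores. The class name, window size, baseline scores, and stream values are assumptions chosen for illustration; the 0.75 floor and the 2 standard deviations mirror the examples in the text and would be calibrated per application.

```python
from collections import deque
from statistics import mean, stdev


class QualityAlert:
    """Fires when the rolling mean of judge scores breaks either rule."""

    def __init__(self, baseline_scores, floor=0.75, window=5, max_sigma=2.0):
        self.floor = floor
        self.max_sigma = max_sigma
        self.baseline_mean = mean(baseline_scores)
        self.baseline_std = stdev(baseline_scores)
        self.recent = deque(maxlen=window)  # keeps only the last `window` scores

    def observe(self, score):
        """Add one score and return the list of rules currently firing."""
        self.recent.append(score)
        rolling = mean(self.recent)
        firing = []
        if rolling < self.floor:
            firing.append("below_floor")
        if rolling < self.baseline_mean - self.max_sigma * self.baseline_std:
            firing.append("sigma_deviation")
        return firing


# Scores from a healthy period (made up), then a stream that degrades.
baseline = [0.88, 0.90, 0.86, 0.91, 0.89, 0.87, 0.90, 0.89]
stream = [0.88, 0.90, 0.87, 0.80, 0.70, 0.60, 0.55]

alert = QualityAlert(baseline)
history = [alert.observe(score) for score in stream]
for score, firing in zip(stream, history):
    print(score, firing or "ok")
```

In this run the deviation rule fires a couple of calls before the fixed floor does, which matches the "trend first, then alert" pattern: a baseline-relative rule can notice a slide while scores are still above the absolute line.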

05 Offline vs. online evals, and the tooling

Evaluation happens in two places. Offline (pre-deployment) evaluation runs your app against a fixed dataset of inputs with reference answers or rubrics, so you can compare versions before shipping. Online (production) evaluation scores live traffic (usually on a sample, often with LLM-as-judge evaluators or lightweight heuristic descriptors) to catch regressions and drift after shipping. You need both: offline keeps a bad version from launching; online catches the world changing once it's live.
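The two modes can be sketched side by side. Below, offline_gate compares a current and a candidate version on a fixed dataset, and sampled_for_online_eval deterministically picks a fraction of live traces to score. Both function names, the toy dataset, the exact-match scorer, and the 10% sample rate are assumptions for illustration; for open-ended text you would swap the scorer for a rubric or judge.

```python
import hashlib


def offline_gate(dataset, current_app, candidate_app, score, min_gain=0.0):
    """Ship only if the candidate's mean score is at least the current one's."""
    def mean_score(app):
        total = sum(score(app(ex["input"]), ex["reference"]) for ex in dataset)
        return total / len(dataset)

    current, candidate = mean_score(current_app), mean_score(candidate_app)
    return {"current": current, "candidate": candidate,
            "ship": candidate >= current + min_gain}


def sampled_for_online_eval(trace_id, sample_rate=0.1):
    """Deterministically pick roughly sample_rate of traces by hashing the id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def exact_match(output, reference):
    """Toy scorer; only sensible when there is a single right answer."""
    return 1.0 if output.strip() == reference else 0.0


# Offline: a fixed dataset and two stand-in "apps" (lookup tables here).
dataset = [
    {"input": "2+2", "reference": "4"},
    {"input": "3*3", "reference": "9"},
    {"input": "10-7", "reference": "3"},
]
current_answers = {"2+2": "4", "3*3": "6", "10-7": "3"}
candidate_answers = {"2+2": "4", "3*3": "9", "10-7": "3"}
report = offline_gate(dataset, current_answers.get, candidate_answers.get, exact_match)
print(report)

# Online: only the sampled traces would be sent to the (more expensive) evaluator.
sampled = [tid for tid in (f"trace-{i}" for i in range(10_000))
           if sampled_for_online_eval(tid)]
print(len(sampled), "of 10000 traces sampled")
```

Hashing the trace id, rather than calling a random number generator, means the same trace always gets the same decision, so every service that touches it agrees on whether it is being evaluated.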

The tooling market is broad and moves fast, so it's best to think in categories rather than a single "winner." There are open-source options (Arize Phoenix, Evidently, WhyLabs' LangKit, W&B Weave) and commercial platforms (Datadog, LangSmith cloud, Helicone). Many now build on the OpenTelemetry GenAI standard, which is what lets you switch or combine them. One honest caveat: each vendor's capability claims come from its own docs and aren't independently benchmarked here, so read "supports X," not "best at X." And LLM-judge scores aren't ground truth; they carry position, verbosity, and self-preference biases and should be calibrated against human labels for anything high-stakes.

Offline = test versions on a fixed dataset before launch; online = score sampled live traffic after launch.
The market spans open-source and commercial tools; no single one is canonical, so pick by your stack and governance needs.
Treat judge scores carefully: they're useful signals, not objective accuracy, and need human calibration for high-stakes decisions.
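Calibrating a judge against human labels can start very simply: collect pass/fail verdicts from both on the same outputs and measure how often they agree. The sketch below computes raw agreement and Cohen's kappa, which discounts agreement expected by chance. Kappa is one common choice rather than something this lesson's sources prescribe, and the labels are made up for illustration.

```python
def agreement(judge, human):
    """Share of items where the judge and the human gave the same label."""
    return sum(j == h for j, h in zip(judge, human)) / len(human)


def cohens_kappa(judge, human):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    n = len(human)
    observed = agreement(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# 1 = pass, 0 = fail, on the same ten outputs (illustrative labels).
judge_labels = [1, 1, 1, 1, 1, 1, 0, 0, 1, 1]
human_labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1]

print(f"agreement: {agreement(judge_labels, human_labels):.2f}")
print(f"kappa:     {cohens_kappa(judge_labels, human_labels):.2f}")
print(f"judge pass rate: {sum(judge_labels) / len(judge_labels):.1f}")
print(f"human pass rate: {sum(human_labels) / len(human_labels):.1f}")
```

Here the judge agrees with the humans on 8 of 10 items but passes more outputs than they do, the kind of leniency you would want to know about before letting its score gate a high-stakes decision.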


Responsible use

This is an educational lesson, not operational or compliance advice. The live console is a simplified teaching simulation; its latencies, costs, scores, and thresholds are illustrative, not measurements of any real system. Monitoring approaches, tool capabilities, and the OpenTelemetry GenAI conventions evolve quickly; verify specifics against current primary documentation before relying on them. LLM-as-judge scores are useful signals but are not objective ground truth and should be calibrated against human review for high-stakes decisions.


The lesson at a glance

Why a deployed model needs watching

The outside world changes after launch: inputs, user behaviour, and what counts as a good answer all drift, so behaviour must be watched, not assumed.
The 2015 paper "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., Google) established ongoing maintenance and monitoring as an inherent cost of running ML in production.
NIST AI RMF's MEASURE function calls for continuous monitoring, documentation of results, and periodic review of the monitoring process.

Monitoring vs. observability, and the trace

Monitoring alerts on predefined metrics (tells you something is wrong); observability lets you ask arbitrary questions to find out why, mainly through tracing.
A trace is the full record of one request; spans are the steps inside it (agent → retrieval → model call → tool call).
OpenTelemetry's GenAI conventions give a shared schema (invoke_agent, chat, execute_tool spans), but most were still experimental as of early 2026, so treat attribute names as provisional.

What you actually measure

Operational signals you can simply count: latency, throughput, error rate, token usage and the cost that follows, plus finish/stop reasons.
Quality is judged because open-ended text has no single right answer: BLEU and ROUGE correlate poorly with human judgement, which is why LLM-as-a-judge and RAG signals like faithfulness exist.
Drift: data drift is the input distribution changing; concept drift is the input–output relationship changing; embedding drift is a common proxy in LLM/RAG systems.

Run an observability console

Each call is a span tree you can open up; that's the observability part: you can see exactly where time and tokens went.
An incident shows up as a trend first, then an alert: quality slides, and monitoring catches it the moment it crosses the line you set in advance.
Thresholds are choices, not constants: "below 0.75" or ">2 std-devs from baseline" are practitioner heuristics you calibrate per application.

Offline vs. online evals, and the tooling

Offline evaluation tests versions on a fixed dataset before launch; online evaluation scores sampled live traffic after launch, and you need both.
The market spans open-source (Arize Phoenix, Evidently, LangKit, W&B Weave) and commercial (Datadog, LangSmith, Helicone) tools; no single one is canonical.
Treat LLM-judge scores carefully: they carry position, verbosity, and self-preference biases and need human calibration for high-stakes decisions.


Every claim below links to its primary source so you can go straight to the original.

Verified · Published by Tech Jacks Solutions · Reviewed June 2026 · Grounded in 14 sources

Hidden Technical Debt in Machine Learning Systems. Sculley et al., NeurIPS 2015.
NIST AI Risk Management Framework (AI 100-1). NIST.
OpenTelemetry Semantic Conventions for Generative AI. OpenTelemetry / CNCF.
Inside the LLM Call: GenAI Observability with OpenTelemetry. OpenTelemetry blog.
A Survey on Concept Drift in Evolving Environments. Hinder, Vaquet & Hammer (arXiv:2310.15826).
From Concept Drift to Model Degradation. Bayram, Ahmed & Kassler (arXiv:2203.11070).
LLMs-as-Judges: A Comprehensive Survey. Li et al. (arXiv:2412.05579).
G-Eval: NLG Evaluation using GPT-4. Liu et al. (arXiv:2303.16634).
SelfCheckGPT: Zero-Resource Hallucination Detection. Manakul, Liusie & Gales (arXiv:2303.08896).
RAGAS: Automated Evaluation of Retrieval Augmented Generation. Es et al. (arXiv:2309.15217).
5 Methods to Detect Drift in ML Embeddings. Evidently AI.
LangSmith Observability. LangChain.
Arize Phoenix: AI Observability & Evaluation. Arize AI.
Datadog LLM / Agent Observability. Datadog.

This lesson explains established concepts and is grounded in the references above; the figures in the console simulator are illustrative and labelled as such.

LLMOps: monitoring & observability in 8 minutes

Tech Jacks Solutions · AI Knowledge Hub · educational summary

Why monitor a deployed model

A deployed ML system meets a world that keeps changing, so a one-time pre-launch test isn't enough. The 2015 "Hidden Technical Debt" paper framed ongoing maintenance/monitoring as an inherent cost; the NIST AI RMF MEASURE function makes continuous monitoring a governance expectation.

Monitoring vs. observability

Monitoring = alert on predefined metrics (tells you something is wrong). Observability = ask arbitrary questions from the outputs (tells you why), mostly via tracing.

Traces and spans

A trace is one request; spans are the steps inside it. OpenTelemetry's GenAI conventions define a parent invoke_agent span with child chat and execute_tool spans, plus attributes like model, input/output tokens, and finish reason (still experimental in early 2026).

What you measure

Operational: latency, throughput, error rate, tokens, cost, finish reason. Quality: LLM-as-judge, RAGAS faithfulness/answer-relevancy, hallucination signals (e.g. SelfCheckGPT). Drift: data drift, concept drift, embedding drift (distance-based, not KS).

Offline vs. online eval

Offline tests versions on a fixed dataset before launch; online scores sampled live traffic after launch. Tooling spans open-source (Phoenix, Evidently, Weave, LangKit) and commercial (Datadog, LangSmith, Helicone); judge scores need human calibration.
