Watching an LLM app in production
Shipping a model is the start, not the finish. Once real traffic hits it, you need to see inside every call, watch cost and latency, and catch the moment answer quality starts to slip. That's monitoring and observability, and you'll run a live console for one right here on the page.
01 Why a deployed model needs watching
A classic software function does the same thing forever: same input, same output. A machine-learning system doesn't get that luxury. The landmark 2015 paper "Hidden Technical Debt in Machine Learning Systems" (Sculley and colleagues at Google) made the case that real-world ML carries large ongoing maintenance costs. Those costs come from tangled dependencies, hidden feedback loops, and, above all, the outside world changing underneath the model. The data your model meets next month won't look exactly like the data it was tested on. That's why a one-time test before launch isn't enough; you need continuous monitoring after it.
Governance frameworks say the same thing. The NIST AI Risk Management Framework dedicates an entire function, MEASURE, to it: continuously monitor deployed systems for performance deviations and emerging risks, document what you find, and review the monitoring process itself over time. Monitoring isn't a nice-to-have you bolt on later; it's part of operating an AI system responsibly.
- The world drifts. Inputs, user behaviour, and even what counts as a "good" answer all change after launch, so the model's behaviour must be watched, not assumed.
- It's a documented cost, not a surprise. The technical-debt work established maintenance and monitoring as an inherent, ongoing expense of running ML in production.
- It's a governance expectation. NIST AI RMF's MEASURE function calls for continuous monitoring, documentation of results, and periodic review of the process.
02 Monitoring vs. observability, and the trace
People use these two words interchangeably, but they're not the same. Monitoring is collecting and alerting on a set of metrics you decided to watch in advance: latency, error rate, token usage, an evaluation score. It tells you that something is wrong. Observability is the broader ability to ask arbitrary questions about what your system did, from the outputs it left behind, so you can work out why. In LLM apps, observability comes mostly from tracing.
A trace is the full record of one request. It's made of spans: each span is one operation inside that request. The vendor-neutral OpenTelemetry GenAI semantic conventions (developed by the GenAI SIG within OpenTelemetry, a CNCF project) define what these spans look like for AI apps: a top-level invoke_agent span, with child chat spans (one per LLM call) and execute_tool spans (one per tool call). Each span carries standardized attributes like gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. One important caveat: as of early 2026 most of these GenAI conventions were still marked experimental, so the exact attribute names can still change.
- Monitoring answers "is something wrong?" (predefined metrics + alerts); observability answers "why?" (ask anything from the traces).
- A trace = one request; spans = the steps inside it (agent → retrieval → model call → tool call).
- OpenTelemetry's GenAI conventions give a shared schema so different tools can read the same telemetry, but it's still experimental, so treat attribute names as provisional.
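To make the trace-and-span idea concrete, here is a minimal sketch of one request as a span tree. It uses a hypothetical `Span` dataclass rather than the OpenTelemetry SDK, the model name and all numbers are made up, and the attribute keys are the ones named above, which may still change while the conventions are experimental.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation inside a request (a hypothetical stand-in, not the OpenTelemetry SDK)."""
    name: str
    duration_ms: float
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def total_tokens(span: Span) -> int:
    """Sum input and output tokens over a span and all of its descendants."""
    own = (span.attributes.get("gen_ai.usage.input_tokens", 0)
           + span.attributes.get("gen_ai.usage.output_tokens", 0))
    return own + sum(total_tokens(child) for child in span.children)


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the span tree as indented lines, one per span."""
    lines = [f"{'  ' * depth}{span.name} ({span.duration_ms:.0f} ms)"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# One trace = one request. All values are illustrative.
trace = Span("invoke_agent", 1840, children=[
    Span("chat", 920, {
        "gen_ai.request.model": "example-model",  # hypothetical model name
        "gen_ai.usage.input_tokens": 412,
        "gen_ai.usage.output_tokens": 58,
        "gen_ai.response.finish_reasons": ["tool_calls"],
    }),
    Span("execute_tool", 310),
    Span("chat", 560, {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 530,
        "gen_ai.usage.output_tokens": 140,
        "gen_ai.response.finish_reasons": ["stop"],
    }),
])

print("\n".join(render(trace)))
print("total tokens:", total_tokens(trace))
```

Because every span carries its own duration and token counts, questions like "which step was slow?" or "which call burned the tokens?" become a walk over the tree rather than a guess.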
03 What you actually measure
Production signals fall into two buckets: operational (is it fast, cheap, and up?) and quality (are the answers any good?). Operational signals are easy to count; quality is the hard part, because for open-ended text there's rarely a single "right answer" to compare against. Alongside both sits drift: the inputs and their meaning shifting over time.
Operational: the things you can simply count
These come straight off the spans and need no judgement call. Latency (often time-to-first-token and total time), throughput, error / failure rate, token usage (input and output) and the cost that follows from it, plus finish/stop reasons. They map directly onto the OpenTelemetry GenAI attributes and are surfaced by tools like Datadog, Helicone, LangSmith, and W&B Weave.
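As a rough sketch of how those counts roll up, the snippet below summarizes a handful of per-call records into latency percentiles, an error rate, and a cost figure. The records, field names, and per-token prices are all assumptions for illustration; real prices depend on the provider and model.

```python
import math

# Hypothetical per-call records, as they might be read off chat spans.
calls = [
    {"latency_ms": 800,  "input_tokens": 400, "output_tokens": 100, "error": False},
    {"latency_ms": 950,  "input_tokens": 500, "output_tokens": 100, "error": False},
    {"latency_ms": 1200, "input_tokens": 600, "output_tokens": 200, "error": False},
    {"latency_ms": 700,  "input_tokens": 300, "output_tokens": 50,  "error": False},
    {"latency_ms": 3100, "input_tokens": 200, "output_tokens": 50,  "error": True},
]

# Assumed prices in USD per 1,000 tokens; real prices depend on provider and model.
USD_PER_1K_INPUT = 0.001
USD_PER_1K_OUTPUT = 0.002


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records):
    """Roll per-call records up into the operational numbers a dashboard would show."""
    latencies = [r["latency_ms"] for r in records]
    input_tokens = sum(r["input_tokens"] for r in records)
    output_tokens = sum(r["output_tokens"] for r in records)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": sum(r["error"] for r in records) / len(records),
        "cost_usd": input_tokens / 1000 * USD_PER_1K_INPUT
                    + output_tokens / 1000 * USD_PER_1K_OUTPUT,
    }


summary = summarize(calls)
print(summary)
```

Percentiles are used instead of an average because one slow call (the 3,100 ms one here) barely moves the mean but is exactly what a user notices.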
Quality: judged, because there's no single right answer
You can't grade open-ended text with a simple match. Older overlap metrics like BLEU and ROUGE correlate poorly with human judgement for creative output, which is the whole motivation for LLM-as-a-judge, where a model scores outputs against a rubric. For RAG apps, RAGAS defines reference-free signals like faithfulness and answer relevancy. For factuality, sampling-consistency methods such as SelfCheckGPT flag likely hallucinations without needing logits or an external database.
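The sampling-consistency idea can be sketched in a few lines: ask the model the same question several times and see whether the answers agree. The version below is a deliberately crude simplification that compares answers by word overlap (Jaccard similarity). It is not SelfCheckGPT's scoring method, which uses stronger comparisons, and the function names and example answers are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word and number tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Share of tokens the two sets have in common (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency_score(answer: str, samples: list[str]) -> float:
    """Mean word overlap between an answer and re-sampled answers (0 = none, 1 = identical)."""
    if not samples:
        raise ValueError("need at least one re-sampled answer")
    answer_tokens = tokens(answer)
    return sum(jaccard(answer_tokens, tokens(s)) for s in samples) / len(samples)


# Re-sampled answers that agree with each other suggest the model is on firm ground.
stable = consistency_score(
    "The Eiffel Tower is in Paris",
    ["The Eiffel Tower is in Paris", "The Eiffel Tower is located in Paris"],
)

# Re-sampled answers that disagree are a hint, not proof, of a hallucination.
shaky = consistency_score(
    "The bridge opened in 1932",
    ["The bridge opened in 1931", "It was completed in 1940"],
)

print(f"stable={stable:.2f} shaky={shaky:.2f}")
```

Word overlap misses a lot: two answers that differ only in a year still look mostly alike. In practice you would swap `jaccard` for a semantic comparison and keep the same overall shape.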
Drift: the inputs and meaning shifting over time
Data drift is the input distribution changing; concept drift is the input–output relationship changing even when inputs look similar. Both can quietly degrade a model. In LLM/RAG systems a common proxy is embedding drift, a shift in the vector embeddings of prompts, responses, or retrieved docs. Because embeddings are high-dimensional, simple univariate tests (like Kolmogorov–Smirnov) are unreliable, so practitioners prefer distance-based measures from a baseline.
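One simple distance-based measure is the cosine distance between the average (centroid) embedding of a baseline window and that of a current window. The sketch below assumes you already have embeddings as plain lists of floats; the three-dimensional toy vectors and function names are illustrative, and centroid distance is only one of several measures you could choose.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def cosine_distance(a, b):
    """1 - cosine similarity: 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(baseline, current):
    """Distance between the centroid of a baseline window and a current window."""
    return cosine_distance(centroid(baseline), centroid(current))


# Toy 3-dimensional "embeddings"; real ones have hundreds or thousands of dimensions.
baseline = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
similar = [[0.95, 0.05, 0.0], [1.0, 0.0, 0.0]]
shifted = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

print("similar traffic:", round(embedding_drift(baseline, similar), 3))
print("shifted traffic:", round(embedding_drift(baseline, shifted), 3))
```

A centroid hides changes in spread, so a window can drift in shape while its average stays put. Treat this as a first alarm to investigate, not a full drift test.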
04 Run an observability console
Here's the whole idea in one place. The console below streams simulated LLM-call traces (each one a little span tree: invoke_agent → retrieval → chat → execute_tool) while panels tally latency, cost, and tokens, and an LLM-as-judge quality score is tracked over time. Everything runs healthy by default. Flip "Inject incident" to simulate a bad deploy or data-drift event: watch the quality score drift down and the alert fire when it crosses the threshold. Flip it back off to recover. The numbers are illustrative: a teaching model, not a real system.
- Traces make it legible. Each call is a span tree you can open up; that's the observability part: you can see exactly where time and tokens went.
- An incident shows up as a trend, then an alert. Quality slides first; monitoring catches it the moment it crosses the line you set in advance.
- Thresholds are choices, not constants. "Below 0.75" or ">2 std-devs from baseline" are practitioner heuristics you calibrate per application.
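The two threshold styles mentioned above can be sketched as a small alert rule over a rolling window of judge scores. The class name, window size, and scores are hypothetical, and the 0.75 floor and 2-standard-deviation band are the same kind of heuristic defaults, meant to be tuned per application.

```python
from collections import deque
from statistics import mean, stdev


class QualityAlert:
    """Fires when the rolling mean of judge scores breaches a floor or a baseline band."""

    def __init__(self, baseline_scores, window=3, floor=0.75, max_sigma=2.0):
        self.baseline_mean = mean(baseline_scores)
        self.baseline_std = stdev(baseline_scores)
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.floor = floor
        self.max_sigma = max_sigma

    def observe(self, score):
        """Add one score; return True if the rolling mean breaches either rule."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge a trend
        rolling = mean(self.scores)
        below_floor = rolling < self.floor
        drop = self.baseline_mean - rolling
        far_below_baseline = drop > self.max_sigma * self.baseline_std
        return below_floor or far_below_baseline


# Baseline from a healthy period, then a stream that degrades (illustrative values).
alert = QualityAlert(baseline_scores=[0.90, 0.88, 0.92, 0.91, 0.89])
stream = [0.91, 0.90, 0.89, 0.85, 0.78, 0.70]
alerts = [alert.observe(score) for score in stream]
print(alerts)
```

Alerting on a rolling mean rather than on each score trades a little delay for far fewer false alarms from one noisy judgement.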
05 Offline vs. online evals, and the tooling
Evaluation happens in two places. Offline (pre-deployment) evaluation runs your app against a fixed dataset of inputs with reference answers or rubrics, so you can compare versions before shipping. Online (production) evaluation scores live traffic (usually on a sample, often with LLM-as-judge evaluators or lightweight heuristic descriptors) to catch regressions and drift after shipping. You need both: offline keeps a bad version from launching; online catches the world changing once it's live.
The tooling market is broad and moves fast, so it's best to think in categories rather than a single "winner." There are open-source options (Arize Phoenix, Evidently, WhyLabs' LangKit, W&B Weave) and commercial platforms (Datadog, LangSmith cloud, Helicone). Many now build on the OpenTelemetry GenAI standard, which is what lets you switch or combine them. One honest caveat: each vendor's capability claims come from its own docs and aren't independently benchmarked here, so read "supports X," not "best at X." And LLM-judge scores aren't ground truth; they carry position, verbosity, and self-preference biases and should be calibrated against human labels for anything high-stakes.
- Offline = test versions on a fixed dataset before launch; online = score sampled live traffic after launch.
- The market spans open-source and commercial tools; no single one is canonical, so pick by your stack and governance needs.
- Treat judge scores carefully: they're useful signals, not objective accuracy, and need human calibration for high-stakes decisions.
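Both halves of that split fit in a short sketch: an offline pass rate over a fixed dataset, and a deterministic sampler that decides which live traces get scored online. The dataset, the two "app versions", the exact-match grader, and the trace-ID format are all hypothetical stand-ins; a real setup would call your application and a rubric or judge instead.

```python
import hashlib


def offline_pass_rate(app, dataset, grade):
    """Run `app` over a fixed dataset and return the share of cases `grade` accepts."""
    passed = sum(1 for case in dataset if grade(app(case["input"]), case["reference"]))
    return passed / len(dataset)


def sampled_for_eval(trace_id, rate):
    """Deterministically pick roughly `rate` of traces for online scoring."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def exact_match(output, reference):
    """Toy grader; open-ended answers would need a rubric or judge instead."""
    return output.strip().lower() == reference.strip().lower()


# Offline: compare two versions on the same fixed dataset before shipping.
dataset = [
    {"input": "capital of France?", "reference": "Paris"},
    {"input": "2 + 2?", "reference": "4"},
    {"input": "largest planet?", "reference": "Jupiter"},
]
answers_v1 = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Saturn"}
answers_v2 = {"capital of France?": "Paris", "2 + 2?": "4", "largest planet?": "Jupiter"}
rate_v1 = offline_pass_rate(answers_v1.get, dataset, exact_match)
rate_v2 = offline_pass_rate(answers_v2.get, dataset, exact_match)
print(f"v1={rate_v1:.2f} v2={rate_v2:.2f}")

# Online: score only a sample of live traffic to keep evaluation cost bounded.
picked = [tid for tid in (f"trace-{i}" for i in range(1000)) if sampled_for_eval(tid, 0.10)]
print(f"sampled {len(picked)} of 1000 traces")
```

Hashing the trace ID, rather than calling a random number generator, means the same trace is always in or out of the sample, which makes online scores reproducible when you re-run an evaluation.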
06 Check your understanding
This is an educational lesson, not operational or compliance advice. The live console is a simplified teaching simulation; its latencies, costs, scores, and thresholds are illustrative, not measurements of any real system. Monitoring approaches, tool capabilities, and the OpenTelemetry GenAI conventions evolve quickly; verify specifics against current primary documentation before relying on them. LLM-as-judge scores are useful signals but are not objective ground truth and should be calibrated against human review for high-stakes decisions.
Why a deployed model needs watching
- The outside world changes after launch: inputs, user behaviour, and what counts as a good answer all drift, so behaviour must be watched, not assumed.
- The 2015 paper "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., Google) established ongoing maintenance and monitoring as an inherent cost of running ML in production.
- NIST AI RMF's MEASURE function calls for continuous monitoring, documentation of results, and periodic review of the monitoring process.
Monitoring vs. observability, and the trace
- Monitoring alerts on predefined metrics (tells you something is wrong); observability lets you ask arbitrary questions to find out why, mainly through tracing.
- A trace is the full record of one request; spans are the steps inside it (agent → retrieval → model call → tool call).
- OpenTelemetry's GenAI conventions give a shared schema (invoke_agent, chat, execute_tool spans), but most were still experimental as of early 2026, so treat attribute names as provisional.
What you actually measure
- Operational signals you can simply count: latency, throughput, error rate, token usage and the cost that follows, plus finish/stop reasons.
- Quality is judged because open-ended text has no single right answer: BLEU and ROUGE correlate poorly with human judgement, which is why LLM-as-a-judge and RAG signals like faithfulness exist.
- Drift: data drift is the input distribution changing; concept drift is the input–output relationship changing; embedding drift is a common proxy in LLM/RAG systems.
Run an observability console
- Each call is a span tree you can open up; that's the observability part: you can see exactly where time and tokens went.
- An incident shows up as a trend first, then an alert: quality slides, and monitoring catches it the moment it crosses the line you set in advance.
- Thresholds are choices, not constants: "below 0.75" or ">2 std-devs from baseline" are practitioner heuristics you calibrate per application.
Offline vs. online evals, and the tooling
- Offline evaluation tests versions on a fixed dataset before launch; online evaluation scores sampled live traffic after launch, and you need both.
- The market spans open-source (Arize Phoenix, Evidently, LangKit, W&B Weave) and commercial (Datadog, LangSmith, Helicone) tools; no single one is canonical.
- Treat LLM-judge scores carefully: they carry position, verbosity, and self-preference biases and need human calibration for high-stakes decisions.
Every claim below links to its primary source so you can go straight to the original.
Published by Tech Jacks Solutions · Reviewed June 2026. This lesson explains established concepts and is grounded in the references below; the figures in the console simulator are illustrative and labelled as such.
LLMOps: monitoring & observability in 8 minutes
Tech Jacks Solutions · AI Knowledge Hub · educational summary
Why monitor a deployed model
A deployed ML system meets a world that keeps changing, so a one-time pre-launch test isn't enough. The 2015 "Hidden Technical Debt" paper framed ongoing maintenance/monitoring as an inherent cost; the NIST AI RMF MEASURE function makes continuous monitoring a governance expectation.
Monitoring vs. observability
Monitoring = alert on predefined metrics (tells you something is wrong). Observability = ask arbitrary questions from the outputs (tells you why), mostly via tracing.
Traces and spans
A trace is one request; spans are the steps inside it. OpenTelemetry's GenAI conventions define a parent invoke_agent span with child chat and execute_tool spans, plus attributes like model, input/output tokens, and finish reason (still experimental in early 2026).
What you measure
Operational: latency, throughput, error rate, tokens, cost, finish reason. Quality: LLM-as-judge, RAGAS faithfulness/answer-relevancy, hallucination signals (e.g. SelfCheckGPT). Drift: data drift, concept drift, embedding drift (distance-based, not KS).
Offline vs. online eval
Offline tests versions on a fixed dataset before launch; online scores sampled live traffic after launch. Tooling spans open-source (Phoenix, Evidently, Weave, LangKit) and commercial (Datadog, LangSmith, Helicone); judge scores need human calibration.