u/gaurav_sherlocks_ai

Agent Observability and what I think

Hey all, I wanted to share a perspective on something I've been thinking about a lot lately.

Traditional APM was built for request-response, and AI agents break that model entirely. Most of what's on the market right now is just legacy APM with agent features added, and that leaves a gap you really only feel when things go wrong. You can see the agent's intent (what it decided to do) OR the system-level impact (latency, errors, resource usage), but not both in the same trace. So you're flying blind through the exact moments when cost spikes.

I think observability at the agent layer is one of the real problems here. It's not solved yet. But it's defined well enough that you can instrument properly if you start now.

UC Santa Cruz published research on this last year (arxiv:2508.02736). They used eBPF to intercept TLS traffic and correlate what the agent intended to do with what actually happened at the kernel level. Less than 3% overhead. Point being that this is architecturally possible.

About 5% of AI model requests fail in production today (Datadog, April 2026 survey). Sixty percent of those failures are capacity-related, not model errors. So, it's an operational gap. And teams that built agent-layer observability into their setup caught those failures before they cascaded into outages. Teams that didn't had incidents.

If you're building agents, start with OpenTelemetry. If you're buying a platform, ask the hard questions: Does this handle reasoning loops as a first-class thing? Can you see the decision tree as a continuous trace? Does it know the difference between a tool failing and the agent misunderstanding the tool? Can you alert on semantic drift?
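To make the "intent and impact in the same trace" idea concrete, here is a rough sketch in plain Python. It does not use the OpenTelemetry SDK. The `Span` class, the `BadToolInput` exception, and attribute names like `agent.intent` and `failure.kind` are all made up for illustration; in a real setup you would attach the same kind of attributes to actual OpenTelemetry spans.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Span:
    """One agent step: what it intended plus what the system observed."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


class BadToolInput(Exception):
    """Hypothetical: the tool rejected the arguments the agent chose."""


def run_step(trace: List[Span], intent: str, tool: Callable[..., Any], **args: Any) -> Any:
    span = Span(name=getattr(tool, "__name__", "tool"))
    span.attributes["agent.intent"] = intent      # the decision the agent made
    span.attributes["tool.args"] = dict(args)
    start = time.perf_counter()
    result = None
    try:
        result = tool(**args)
        span.attributes["failure.kind"] = "none"
    except BadToolInput as exc:                   # agent misunderstood the tool
        span.attributes["failure.kind"] = "agent_misuse"
        span.attributes["error"] = str(exc)
    except Exception as exc:                      # the tool itself failed
        span.attributes["failure.kind"] = "tool_failure"
        span.attributes["error"] = str(exc)
    finally:
        # System-level impact lands on the same span as the intent.
        span.attributes["latency_ms"] = (time.perf_counter() - start) * 1000
        trace.append(span)
    return result


def lookup_pod(name: str) -> str:
    """Toy tool standing in for a real cluster lookup."""
    if not name:
        raise BadToolInput("pod name is empty")
    if name == "down":
        raise ConnectionError("cluster API unreachable")
    return f"{name}: Running"


trace: List[Span] = []
run_step(trace, "check the checkout pod", lookup_pod, name="checkout")
run_step(trace, "check a pod", lookup_pod, name="")
run_step(trace, "check another pod", lookup_pod, name="down")
kinds = [s.attributes["failure.kind"] for s in trace]
```

The point of the design is that every span answers both "what was the agent trying to do" and "what did it cost", and that a bad argument is labeled differently from a broken tool.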

Those are the questions that separate something actually built for agents from something that's just adding agent features to traditional APM. Honeycomb published their approach. Langfuse and LangSmith are solid for multi-step debugging. There are about 15 tools competing on this now, most built on OpenTelemetry standards.

My candid assessment is that you're going to be in supervised mode for a while. Your agent still needs human approval; there is no way around it right now. That's not going away in the next two years. If a vendor tells you otherwise, that's a red flag.

Curious if people can share a) what does good agent observability actually look like at your scale? And b) what are you currently missing on the observability side if anything?

reddit.com

u/gaurav_sherlocks_ai — 1 day ago

▲ 51 r/sre

How Complex SRE Systems Fail: 18 Lessons for the On-Call Engineer

Hey fam: I read Richard Cook's How Complex Systems Fail this weekend and wanted to share a few observations. The book is from 1998 and is about hospitals and nuclear plants. It was recommended to me because I am building in the SRE space, and I think the learnings do map onto SRE work as it currently stands.

Complex systems are inherently hazardous - by definition. The risk in having a distributed state in SRE work is structural. Distributed state ships with partitions, drift, and split-brain by default. SRE work or AI agents have to keep those boxed in, not engineer them away.
Complex systems are heavily defended against failure as a design choice. The defenses in the SRE world are stacked: retries, circuit breakers, quotas, health checks, canaries, multi-AZ, change freezes, human review. Most of them are invisible when they work; you only notice them when something is missing and you are reminded that PagerDuty is a public company.
Catastrophe requires multiple failures stacked, not one. A bad code commit on its own doesn't take you down. A bad commit AND a broken canary metric AND a saturated region autoscaler AND an on-call at dinner with phone on silent does - I know this has happened to all of us. Single points of failure exist, but the outage only happens when the process defenses around them fail too, stacked on top of each other.
Your system already contains failures you don't know about. Bugs sitting in code paths you haven't hit, expired-but-cached credentials, drift between IaC and the actual state, dependencies pinned to versions that have since been yanked. You'll find out about some of them at the next incident - unfortunately, you WILL.
Production always runs in degraded mode and this is the default, not the exception. The system where every pod is healthy, every queue is drained, and every replica is in sync does not exist and never has. Something is always partially broken and good systems survive by staying useful through that.
Really Bad News is always around the corner, i.e. yesterday's uptime sadly guarantees nothing about tomorrow. The same system that served a billion clean events today can serve zero tomorrow. The exec or VP saying "90 days no outage, we must be doing something right" truly needs to be on call once a week for exposure.
Sometimes, there is no single root cause as it's a tree of contributing factors. The node we decide to call "root" says more about our org's incentives than about the system. The S3 outage gets blamed on the engineer typing the wrong argument. Also true is the fact that the command had no confirmation prompt, the blast radius was unbounded, the subsystem hadn't been restarted in years, dependent services had no fallback.
Hindsight biases everything after the incident - the timeline always looks obvious in the post-mortem. At the time, CPU was climbing on 400 other dashboards and 399 of them resolved themselves on their own. We should judge the on-call person's decisions by what was knowable in the moment and not by what the post-mortem reveals.
Operators ship and defend at the same time and the tradeoff is real. Product engineers shipping features and platform engineers protecting the runtime are making the same bet from opposite sides: faster shipping means more defense owed. The orgs that pretend this tradeoff doesn't exist are the ones that pay in burnout or outages, or usually both.
Every action is a gamble against a system you can't fully observe at any given point in time. Every deploy, every rollback, every kubectl delete pod, every "let me try restarting it" is a bet that something doesn't break in a way you didn't predict. The job isn't to stop shipping or acting. The job is to make the bets smaller and reversible; in the SRE world, canaries, feature flags, and progressive delivery are what this looks like in practice.
The on-call engineer resolves all ambiguity in the moment, alone. Dashboards give you signals. Runbooks give you procedures. Neither tells you, at 3am with errors spiking in one region for authenticated traffic only, whether to roll back, drain the swampy region, or wait. That call is what actually helps and changes the system.
Humans are the adaptable element of the system by necessity. Today AI SRE agents can handle the patterns the LLM has seen. When the incident is genuinely new, someone has to decide to failover, drain a node, call the vendor, or accept the data loss and restore from backup. AI agents in SRE work best when they hand the human better context faster, not when they try to replace the decision.
Expertise is perishable and it expires faster than you imagine. Last year's expert on your payments pipeline isn't this year's, because the pipeline has changed and so has the engineer. Tribal knowledge is the most expensive form of knowledge because it leaves with the person. As basic as it may sound, documented, searchable, queryable system knowledge is the only good solution here.
Every change opens a new failure surface: every deploy, every migration, every version bump, every provider switch. The stability you feel right before a big change is imo an illusion. You haven't yet met the new failure modes the change is about to introduce, and that's ok. Always plan the rollback before you plan the rollout.
Where you locate the cause determines the defense you build directly. If your outage post-mortem concludes "engineer pushed bad YAML," your proposed fix is "more review." If it concludes "validation pipeline didn't catch an invalid spec," your fix is "better validation."
Safety is a property of the whole system, not the components of the system aka reliability is emergent. That means that a 99.99% service built on 99.99% components is NOT 99.99% reliable. Composition matters more than the spec sheet of any individual piece. "We use AWS, we're fine" is not a reliability strategy however great Amazon is. You still have to design for how the components fail together and there is no way around it.
People continuously create safety but mostly invisibly. The reason systems are up right now is that dozens of engineers are quietly choosing not to do risky things, catching near-misses in review, noticing weird metrics, fixing lightweight deploys nobody asked them to touch. The engineer who broke something Friday and fixed it Sunday gets a Slack post and the one who prevented the break on Tuesday gets nothing. Sadly, most orgs reward the first and depend on the second.
Failure-free operations need experience with failure and that's why game days exist. Two years no-incident usually means all the gambles and actions stayed small, not that you're resilient. The first real incident any org faces will be bigger and less informed because nobody has practiced those situations. It barely takes minutes or seconds for no-downtime to quietly turn into massive-downtime.
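On the "99.99% service built on 99.99% components" point, the arithmetic is easy to check. This is a minimal sketch under two assumptions of mine that are not in the original: every component is required for a request to succeed (a serial chain), and components fail independently. The function names are just for illustration.

```python
def serial_availability(component_availability: float, n_components: int) -> float:
    """Availability of a chain where every component must be up."""
    return component_availability ** n_components


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime over a 365-day year, in minutes."""
    return (1.0 - availability) * 365 * 24 * 60


for n in (1, 5, 10, 50):
    a = serial_availability(0.9999, n)
    print(f"{n:>2} components: {a:.6f} available, "
          f"{downtime_minutes_per_year(a):.1f} min/year down")
```

One component at 99.99% is under an hour of downtime a year. Ten of them in series lands near 99.9%, so the composed service has lost about a nine before any correlated failure is considered. Real components are not independent, which usually makes things worse, not better.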

Hope this was helpful

u/gaurav_sherlocks_ai — 1 day ago

▲ 0 r/devops

Sidney Dekker's (safety researcher) points of view applied to AI SRE agents

Hi all,

We've been studying Sidney Dekker's safety research in the context of recent developments in the AI agents space (PS: we do build in this space!) and how they map to devops, SRE, infrastructure engineering.

Some things we learned that would change how we run post-incident scenarios:

1/ Old view vs New view. These are the two worlds he split the debate into. The old view says the system is fine, so find the one person who messed up. The new view says people act in ways that made sense given the info, goals and pressure they had. Dekker calls that local rationality in his book. Applied to AI, an agent has local rationality too, just on a different base.

2/ "Bad model" is old-view thinking. When an agent makes a confident wrong call, the useful question is what the system fed it and what it was optimising for. Goals are important, and we have to look at it through that lens rather than treat the model as the entity that messed up.

3/ This pattern isn't new. Lisanne Bainbridge (we learned about her during this research; fascinating story and worth checking out) described it in her 1983 paper Ironies of Automation, so the failure pattern was discovered and labeled decades before the tooling existed. IMO, that just shows that it is a known category, not a surprise.

4/ Agent drift is 100% the real risk. All of us have seen an agent follow its directions rationally and do the locally right thing, and still go horribly wrong in production.

5/ Keep humans in the reps. IMO, the danger is never that the agent is wrong once or twice, it is essentially the team losing the ability to be right on the incident that the agent has never seen. That's exactly why you are deploying agents, right?

Thought we'd share the perspectives of some old stalwarts in this really fast moving field. Hope it helps.

u/gaurav_sherlocks_ai — 5 days ago

▲ 1 r/sre

LLMs solve about 1 in 3 real root-cause cases on a realistic benchmark. Mostly wrong on the hard ones.

Hi team: Sharing something I came across --

Here is what the 2025-26 research actually says about LLMs doing root cause analysis. I'm sharing it because the demos and the on-call reality are far apart, and imo this is the right room to be honest about it.

On OpenRCA, an MSFT and Tsinghua benchmark built to look like real production, llm agents went from solving roughly 1 in 10 real failure cases in early 2025 to roughly 1 in 3 by early 2026 (that is a real jump).

It is also still mostly wrong on the very hard, multi-part failures. Both halves are true tbh and the second half is more top of mind when you / I / SREs are the one paged.

One detail that should make the industry skeptical is that when the system saw a cleaner, reduced slice of the signals, accuracy went up. On a realistic messy slice it dropped. Goes without saying, our production telemetry is the messy slice and everyone's is.

The useful finding is that the lever is not model size, it is structure. A 2026 study ran the full benchmark across several models and the two most common failure modes, hallucinated readings of the data and stopping the search too early, showed up across every model regardless of how capable it was.

Raw model on raw telemetry is near useless. Model plus retrieval plus an SOP that bounds where it can go is genuinely useful as a first responder, tho not as the final word.
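A rough sketch of what "an SOP that bounds where it can go" could look like in code. Everything here is hypothetical: the `SopBoundedTools` class, the tool names, and the rule that the agent may not conclude until every SOP step has run (my attempt at guarding against the "stopping the search too early" failure mode). Retrieval is left out to keep it short.

```python
from typing import Callable, Dict, List


class OutOfBounds(Exception):
    """Raised when the agent asks for a tool the SOP does not list."""


class SopBoundedTools:
    def __init__(self, tools: Dict[str, Callable[[str], str]], sop_steps: List[str]) -> None:
        self._tools = tools
        self._sop_steps = list(sop_steps)
        self.completed: List[str] = []

    def call(self, tool_name: str, target: str) -> str:
        # The SOP is an allowlist: anything outside it is refused, not attempted.
        if tool_name not in self._sop_steps:
            raise OutOfBounds(f"{tool_name} is not part of this SOP")
        output = self._tools[tool_name](target)
        self.completed.append(tool_name)
        return output

    def may_conclude(self) -> bool:
        # No verdict until every required step has actually been run.
        return all(step in self.completed for step in self._sop_steps)


tools = {
    "read_metrics": lambda svc: f"metrics for {svc}",
    "read_logs": lambda svc: f"logs for {svc}",
    "restart_service": lambda svc: f"restarted {svc}",
}
toolbox = SopBoundedTools(tools, sop_steps=["read_metrics", "read_logs"])
toolbox.call("read_metrics", "checkout")
early = toolbox.may_conclude()   # one step still missing
toolbox.call("read_logs", "checkout")
done = toolbox.may_conclude()
```

Here `restart_service` exists but is not in the SOP, so a first-responder agent can read and rank but cannot act.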

So, here is my honest read. Use agentic SRE to compress a mountain of telemetry into a ranked set of suspects in minutes, then a human makes the call - that's the reality of today. It does not replace the engineer and the research does not claim it does.

I've been frequenting this sub of late and, as the field evolves, I am curious what would actually make you trust one of these agents on your stack: the headline accuracy number, the structure around the model, or anything else?

u/gaurav_sherlocks_ai — 7 days ago

▲ 92 r/sre

Notes from AI SRE summit

Managed to attend the Komodor-hosted AI SRE summit yesterday.

Panel was Stefana Muller (Salesforce), Charity Majors (Honeycomb), Itiel Shwartz (Komodor), Sharone Zitzman moderating. Corey Quinn from Duckbill ran a separate session on AI cost economics.

Quick recap of what came up in one of the sessions:

80% of developer escalations are simple. Rerun Jenkins, check logs, restart the prod. Tribal knowledge that mostly hasn't been encoded.
Corey Quinn's session: Every agent invocation has a token cost. Autonomous setups can burn $10 to $50 of tokens per incident before producing useful output. Unit economics getting more attention than model quality.
Charity Majors: Traditional three-pillars observability (metrics, logs, traces) is inadequate for AI systems because agents are nondeterministic. Need to instrument the reasoning chain itself, capture tool calls.
Intercom example came up: 18-month code quality drop before a 5-week improvement streak. Deploy frequency went from 10/day to 20-30/day, error rates up but offset by speed gains.
Enterprise trust boundaries: No direct database access for AI systems, guardrails to prevent customer data exposure. Human accountability stays non-automatable according to the room there.
Hype cycle position from the panel: "Just cutting through the surface." Most companies still in basic Claude Q&A phase. Advanced teams moving toward agents.
Gartner forecast: 85% of enterprises will be using AI SRE tooling by 2029, up from less than 5% in 2025.
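On the token cost point from Corey Quinn's session, the unit economics are simple to model. This is a back-of-the-envelope sketch; the token counts and the per-million-token prices below are placeholders I picked for illustration, not numbers from the summit or from any provider's price list.

```python
def incident_token_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Token spend for one incident, in USD."""
    return (
        (input_tokens / 1_000_000) * usd_per_million_input
        + (output_tokens / 1_000_000) * usd_per_million_output
    )


# Hypothetical incident: the agent reads a lot of telemetry, writes little.
per_incident = incident_token_cost(
    input_tokens=3_000_000,
    output_tokens=200_000,
    usd_per_million_input=3.0,    # placeholder price
    usd_per_million_output=15.0,  # placeholder price
)
monthly = per_incident * 120      # assuming 120 incidents a month
print(f"${per_incident:.2f} per incident, ${monthly:.2f} per month")
```

The input side dominates in this toy case, which is why how much telemetry the agent reads matters more for cost than how much it writes.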

Anyone else here attend the summit and want to share takeaways?

u/gaurav_sherlocks_ai — 10 days ago

▲ 22 r/sre

Hey everyone

I read a few things over the last couple of weeks that kinda seemed to hint at where the agentic engineering field is headed

1/ Datadog's State of AI Engineering 2026
2/ SoftwareSeni's "When AI SRE Fails," and
3/ Berkeley MAST study (arXiv)

TL;DR and my candid read across all three: The category, the tooling, the frameworks are all useful, but everyone is shy about discussing the failure modes where agents go wrong.

Two of my closest friends run agentic AI companies. Different verticals, not SRE. They're both facing versions of the same problems, which is why I want to talk about it here where the skeptics live.

Start with the MAST numbers. Now, you tell me how mid to large sized enterprises will adopt an agent under these circumstances:

1/ Real-world task failure rates are around 41 to 86 percent across seven multi-agent systems
2/ Per-call tool failure 3 to 15 percent

Different studies have different numbers but the 41 percent floor is on the simplest tasks they tested.
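To see how a small per-call failure rate turns into a large task failure rate, here is a quick compounding sketch. It assumes tool calls fail independently and that one failed call sinks the task; both are simplifications of mine. The study's task failure figures are measured, not derived from this formula, and the 20-call task length is made up.

```python
def chain_failure_probability(per_call_failure: float, n_calls: int) -> float:
    """Chance that at least one of n independent tool calls fails."""
    return 1.0 - (1.0 - per_call_failure) ** n_calls


for p in (0.03, 0.15):
    print(f"per-call failure {p:.0%}, 20 calls: "
          f"{chain_failure_probability(p, 20):.1%} chance of at least one failure")
```

At the low end of the per-call range, a 20-call task hits at least one failure about 46% of the time. At the high end it is about 96%.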

Production complexity as you can imagine sits closer to the ceiling - which is scary, right?

And the failure shape is the worst possible one -- When a tool call fails the agent doesn't stop. It keeps reasoning on whatever degraded output came back and every subsequent action flows downstream. A simple solution could have been catching drift at each step but instead the agents carry it all downstream :-/

A friend running a CX agent company described this exact failure: their agent kept resolving tickets confidently using a stale CRM field. This went on for 3 weeks, no one caught it, and the agent never doubted itself once. So they now run an entire layer of work whose only job is to make the agent doubt itself in almost every decision trace.
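A minimal sketch of what a "make the agent doubt itself" check could look like at a single step. This is not how my friend's system works; `ToolResult`, `StepRejected`, and the `max_age` threshold are hypothetical names and values for illustration. The idea is just that a failed call or stale data stops the chain instead of flowing downstream.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ToolResult:
    ok: bool
    value: Optional[str]
    updated_at: datetime  # when the underlying record last changed


class StepRejected(Exception):
    """Stop the chain here rather than reason on bad input."""


def check_step(result: ToolResult, max_age: timedelta, now: Optional[datetime] = None) -> str:
    now = now or datetime.now(timezone.utc)
    if not result.ok or result.value is None:
        raise StepRejected("tool call failed; do not continue on partial output")
    if now - result.updated_at > max_age:
        raise StepRejected("data is stale; escalate to a human")
    return result.value


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
fresh = ToolResult(ok=True, value="plan=pro", updated_at=now - timedelta(hours=2))
stale = ToolResult(ok=True, value="plan=free", updated_at=now - timedelta(days=21))

accepted = check_step(fresh, max_age=timedelta(days=1), now=now)
try:
    check_step(stale, max_age=timedelta(days=1), now=now)
    stale_rejected = False
except StepRejected:
    stale_rejected = True
```

A three-week-old CRM field would be rejected at the first ticket, not discovered three weeks later.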

That work layer, in my opinion, should be the second slide in any agentic AI pitch deck. But, of course there is no incentive to talk about it.

According to Datadog, ~70 percent of organizations now run three or more models in production, and the ones running six or more nearly doubled this year. Noble as that is, almost none of these orgs seems to have the dependency graph for that fleet drawn anywhere, which should be an obvious step if you want to audit what breaks when one of the model providers goes down even for a few minutes.
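Drawing that dependency graph does not need special tooling to start. Here is a small sketch with an invented inventory (the service, model, and provider names are all placeholders) that answers "which services are down or degraded if provider X disappears".

```python
from typing import Dict, List

# Hypothetical inventory: which models each service can call,
# and which provider hosts each model.
SERVICE_MODELS: Dict[str, List[str]] = {
    "triage-agent": ["model-a", "model-b"],
    "summarizer": ["model-a"],
    "search-rerank": ["model-c"],
}
MODEL_PROVIDER: Dict[str, str] = {
    "model-a": "provider-x",
    "model-b": "provider-y",
    "model-c": "provider-y",
}


def outage_impact(
    provider: str,
    service_models: Dict[str, List[str]],
    model_provider: Dict[str, str],
) -> Dict[str, str]:
    """Map each affected service to 'down' or 'degraded'."""
    impact: Dict[str, str] = {}
    for service, models in service_models.items():
        lost = [m for m in models if model_provider[m] == provider]
        if not lost:
            continue
        # All models gone means no fallback; some gone means reduced capacity.
        impact[service] = "down" if len(lost) == len(models) else "degraded"
    return impact


impact_x = outage_impact("provider-x", SERVICE_MODELS, MODEL_PROVIDER)
print(impact_x)
```

Even a static dict like this, kept in the repo and reviewed when a model is added, is more than most fleets have.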

SoftwareSeni documented a four-agent AI SRE running at nearly €8.5K a month in production. The reason no vendor puts a number like this on a pricing page is that they genuinely can't quote it honestly. Token spend depends on how messy your incidents are, and neither side knows that until you've been running together for a few months.

So then, what does human-in-the-loop even mean? To me, it means 3 things, each with different modes, costs, and considerations:

1/ Engineer drives, agent supports
2/ Engineer supervises, agent acts inside bounds
3/ Engineer audits, agent operates inside policy

I think we can all agree that the third gets sold and the first gets shipped.
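One way to read the three modes as a gate in code. This is my own interpretation, not a standard: the `Mode` enum, the outcome strings, and the idea of separate `bounds` and `policy` sets are all assumptions for the sketch.

```python
from enum import Enum
from typing import Set


class Mode(Enum):
    ENGINEER_DRIVES = 1      # agent supports
    ENGINEER_SUPERVISES = 2  # agent acts inside bounds
    ENGINEER_AUDITS = 3      # agent operates inside policy


def gate(mode: Mode, action: str, bounds: Set[str], policy: Set[str]) -> str:
    """Decide what happens to an action the agent proposes."""
    if mode is Mode.ENGINEER_DRIVES:
        return "suggest_only"
    if mode is Mode.ENGINEER_SUPERVISES:
        return "execute" if action in bounds else "ask_approval"
    # ENGINEER_AUDITS: no human in the moment, so out-of-policy is blocked.
    return "execute_and_log" if action in policy else "block"


bounds = {"restart_pod"}
policy = {"restart_pod", "scale_up"}

decisions = {
    mode.name: gate(mode, "scale_up", bounds, policy) for mode in Mode
}
print(decisions)
```

The same proposed action gets three different treatments, which is why it matters which mode a vendor is actually describing.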

Not a lot has been written or researched about postmortems breaking under non-determinism. The same incident when replayed often takes a different tool path and produces a different outcome. The standard post-mortem SaaS template assumes you can reconstruct what happened, but you can't. At least not without agent trace logs and token-level audit trails.

Anyone here had to write a postmortem for an incident an agent drove? How did you actually do it?

(Disclosure: I run a company which builds in this space. Happy to rewrite it if this violates any rules :-))

u/gaurav_sherlocks_ai — 18 days ago

▲ 6 r/GetMotivatedMindset

Made this after realizing some nights only become memorable once everyone stops half-existing inside a screen.

Not a phone hater at all - Just pro-those rare dinners and moments where people are actually there.

u/gaurav_sherlocks_ai — 23 days ago

▲ 3 r/GetMotivatedMindset

u/gaurav_sherlocks_ai — 25 days ago

▲ 210 r/OReilly_Learning+2 crossposts

Google released two early-release chapters from the SRE Book 2nd Edition this week.

>One is the new "AI for SRE" chapter. It's on O'Reilly behind a paywall, but a free trial works. Read it last night, sharing the takeaways for anyone who doesn't want to read the full thing.

The condensed version:

AI is not a human replacement. The book is firm on this. We still need humans for the high-stakes calls and to maintain the AI itself.
Don't give AI full access on day one. Build trust the way you would with a junior engineer. Let it suggest fixes first, fix small issues next, only then expand its scope.
If the agent can take an action, it must have a rollback. If there is no undo path, the access should not be granted. This is the line I think most teams shipping agents are skipping right now.
When the agent fails or gives a bad suggestion, flag it. The chapter leans on the same principle as good postmortem culture, more feedback and more context means better future execution.
During incidents, the time-saver is not the fix, it is the searching. The chapter frames the agent as the thing that finds the right answer fast across tabs, runbooks, and prior incidents, instead of the thing that pushes the fix.
Dashboards tell you something is broken. AI is positioned as the layer that tells you why, by reading the tickets and the user feedback that the dashboards do not capture.
The framing that stuck with me most: AI does not reduce SRE workload, it raises the reliability ceiling. Cheaper reliability does not mean less work, it means higher reliability demanded across more services. Jevons paradox applied to ops.
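The "no rollback, no access" rule from the takeaways can be enforced at registration time rather than left to convention. A minimal sketch, with the caveat that `ActionRegistry` and the scaling example are my own illustration and not something from the chapter:

```python
from typing import Callable, Dict, Optional, Tuple

Action = Callable[[], None]


class ActionRegistry:
    """Agent actions can only be registered together with an undo."""

    def __init__(self) -> None:
        self._actions: Dict[str, Tuple[Action, Action]] = {}

    def register(self, name: str, do: Action, undo: Optional[Action]) -> None:
        if undo is None:
            raise ValueError(f"refusing to register {name!r}: no rollback provided")
        self._actions[name] = (do, undo)

    def run(self, name: str) -> Action:
        """Execute the action and hand back its rollback."""
        do, undo = self._actions[name]
        do()
        return undo


replicas = {"checkout": 3}


def scale_up() -> None:
    replicas["checkout"] += 2


def scale_down() -> None:
    replicas["checkout"] -= 2


registry = ActionRegistry()
registry.register("scale_up_checkout", scale_up, undo=scale_down)

rollback = registry.run("scale_up_checkout")
after_action = replicas["checkout"]
rollback()
after_rollback = replicas["checkout"]
```

An action like "delete the volume" has no honest `undo`, so under this rule it simply never makes it into the set of things the agent can do.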

What I would add as a practitioner: the 5-level maturity model they propose is useful, but the gating criteria between levels is where the real engineering lives. "Agent suggested 50 fixes, 47 were good" sounds great until you ask which 3 were wrong and what they would have broken. Most teams I see skipping straight to autonomous remediation are not doing that work.

Worth a read if you are scoping AI in operations in the next year.

(Disclosure: I run Sherlocks, which builds in this space. This is not a pitch for it.)

u/OReilly_Learning — 13 days ago