Modern SRE teams manage several microservices with tangled interdependencies. When something breaks, engineers manually query several observability backends, correlate signals across layers, dig through historical post-mortems, and execute runbooks under pressure. The result: long MTTR, alert fatigue, and toil.
So I built the Autonomous SRE Agent — an AI-driven reliability system that runs the detect → investigate → diagnose → remediate → learn loop, with a phased path toward full autonomy.
The agent is built on strict hexagonal architecture with hard-coded safety guardrails — autonomy is earned through measurable graduation criteria, not granted by default. A clear "what's next" section at the end covers the design-stage components honestly, separated from running code.
🎯 Scope today
The agent automates detection, diagnosis, and remediation of infrastructure incidents, with a target of sub-30-second diagnostic latency. Five incident categories are detected and diagnosed; two have a fully executable end-to-end remediation path in shipped code:
| Incident category | Trigger | Remediation | Status |
|---|---|---|---|
| OOM kill | Memory pressure > 85% for 5+ min | Pod / container restart | ✅ End-to-end |
| High latency | p99 above rolling baseline + sigma threshold | HPA scale-up | ✅ End-to-end |
| Error-rate spike | Surge over baseline | GitOps revert | 🟡 Detect + diagnose only |
| Disk exhaustion | High utilization with projected breach | Log truncation | 🟡 Detect + diagnose only |
| Cert expiry | Cert expires within configured window | Cert rotation | 🟡 Detect + diagnose only |
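The 🟡 rows are detected, classified, and run through the diagnosis pipeline. Their remediation strategies are defined, but no execution adapter is shipped for them yet (see the roadmap at the end).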
🏗️ The architecture
A layered pipeline transforms raw telemetry into safe, verifiable remediations. Six layers, each with a clear contract:
1. Observability / Ingestion — vendor-agnostic by design
This layer gathers high-fidelity telemetry through ports that the domain depends on, with swappable adapters implementing them. Five concrete telemetry providers are shipped:
- OpenTelemetry for app metrics, traces, and structured logs (composite of Prometheus, Jaeger, Loki adapters)
- CloudWatch as an AWS-native composite (metrics, logs, X-Ray)
- New Relic via NerdGraph GraphQL — full implementation of all four query ports
- Pixie eBPF for kernel-level visibility (syscalls, network flows, process activity) via PxL scripts, running as a Kubernetes DaemonSet
- Kubernetes pod logs for container-level evidence
A continuous dependency-graph service consumes trace data to build a real-time service topology, which is essential for calculating blast radius before any remediation runs.
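To make the blast-radius idea concrete, here is a minimal sketch that walks a service topology upstream from a target service. It assumes the graph is available as (caller, callee) edges derived from trace spans; the function name `impacted_services` and the edge format are hypothetical and not the agent's actual API.

```python
from collections import defaultdict, deque


def impacted_services(edges: list[tuple[str, str]], target: str) -> set[str]:
    """Return every service that directly or transitively calls `target`.

    `edges` are (caller, callee) pairs, e.g. extracted from trace spans.
    Hypothetical helper for illustration only.
    """
    # Reverse index: callee -> set of direct callers.
    callers: dict[str, set[str]] = defaultdict(set)
    for caller, callee in edges:
        callers[callee].add(caller)

    # Breadth-first walk upstream; `seen` also protects against cycles.
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for upstream in callers.get(node, ()):
            if upstream not in seen and upstream != target:
                seen.add(upstream)
                queue.append(upstream)
    return seen


edges = [
    ("gateway", "checkout"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("reports", "inventory"),
]
print(sorted(impacted_services(edges, "payments")))  # ['checkout', 'gateway']
```

Restarting `payments` in this toy topology can affect `checkout` and, through it, `gateway`, which is the kind of answer a guardrail needs before any action runs.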
2. Detection — threshold-free anomalies
Static thresholds are dead. The detection layer computes rolling statistical baselines, segmented by time-of-day and day-of-week, then evaluates deviations. Concretely:
- `BaselineService` maintains per-(service, metric) baselines that adjust for daily and weekly patterns
- `AnomalyDetector` runs detection rules across the canonical `AnomalyType` enum (memory pressure, latency spike, error-rate surge, traffic anomaly, deployment-induced regression, cert expiry, disk exhaustion)
- `SignalCorrelator` joins metrics, logs, traces, and eBPF events into per-service `CorrelatedSignals` so a single root cause doesn't produce alert storms
- `AlertCorrelationEngine` uses the dependency graph to fold related alerts into a single `Incident`
The output: a canonical, deduplicated Incident ready for the intelligence layer.
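The baseline idea can be sketched in a few lines. This is a simplified illustration, not the shipped `BaselineService`: the class name `SeasonalBaseline`, the (weekday, hour) bucketing, the window size, and the minimum-sample rule are all assumptions made for the example.

```python
import statistics
from collections import defaultdict, deque
from datetime import datetime, timedelta
from typing import Optional


class SeasonalBaseline:
    """Rolling per-(service, metric) baseline bucketed by weekday and hour."""

    def __init__(self, window: int = 12, min_samples: int = 5) -> None:
        self._min_samples = min_samples
        # Each bucket keeps only the most recent `window` samples (rolling).
        self._buckets = defaultdict(lambda: deque(maxlen=window))

    @staticmethod
    def _key(service: str, metric: str, ts: datetime) -> tuple:
        return (service, metric, ts.weekday(), ts.hour)

    def observe(self, service: str, metric: str, ts: datetime, value: float) -> None:
        self._buckets[self._key(service, metric, ts)].append(value)

    def deviation_sigmas(
        self, service: str, metric: str, ts: datetime, value: float
    ) -> Optional[float]:
        """Signed distance from the bucket mean in standard deviations.

        Returns None when there is too little history or the history is flat.
        """
        samples = self._buckets.get(self._key(service, metric, ts))
        if samples is None or len(samples) < self._min_samples:
            return None
        mean = statistics.fmean(samples)
        stdev = statistics.pstdev(samples)
        if stdev == 0:
            return None
        return (value - mean) / stdev


baseline = SeasonalBaseline()
monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 is a Monday
for week, p99_ms in enumerate([100, 102, 98, 101, 99, 100]):
    baseline.observe("checkout", "latency_p99_ms", monday_9am + timedelta(weeks=week), p99_ms)

spike = baseline.deviation_sigmas(
    "checkout", "latency_p99_ms", monday_9am + timedelta(weeks=6), 110
)
print(f"{spike:.1f} sigma above the Monday 09:00 baseline")
```

Because the comparison is against the same weekday and hour, a Monday-morning traffic peak is judged against other Monday mornings rather than against a quiet Sunday night.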
3. Intelligence — RAG-grounded reasoning
The diagnostic engine is built around a multi-stage Retrieval-Augmented Generation pipeline:
- Incoming incidents are embedded (sentence-transformers) and matched against a vector store of historical post-mortems and runbooks — pgvector with HNSW indexing, falling back to JSONB cosine similarity in environments without the extension.
- Retrieved evidence is reranked by a cross-encoder (`ms-marco-MiniLM-L-6-v2`) for query-aware relevance, then optionally compressed by LLMLingua with SRE-critical force tokens (`OOM`, `kill`, etc.) preserved.
- A fingerprint-keyed diagnostic cache short-circuits recurring incidents (key: `service + anomaly_type + metric`) to avoid redundant LLM calls.
- The compressed context is fed to an LLM, Claude Sonnet (`claude-sonnet-4-20250514`) by default, to generate a root-cause hypothesis.
- A second-opinion validator cross-checks the reasoning. It supports both `RULE_BASED` and `LLM_CROSS_CHECK` strategies; the LLM cross-check typically uses gpt-4o-mini as a cheaper, faster reviewer of Claude's output.
- The confidence scorer produces a composite score from four weighted components: LLM self-confidence (35%), validator agreement (25%), retrieval relevance (25%), and evidence volume (15%), so no single signal dominates.
The output: a typed diagnosis with a calibrated composite confidence percentage.
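The weighting described above amounts to a weighted sum. Here is a minimal sketch using the four published weights; the function name `composite_confidence`, the component keys, and the assumption that every component is already normalized to the range 0 to 1 are mine, not taken from the shipped scorer.

```python
# Weights from the article; they sum to 1.0.
WEIGHTS = {
    "llm_self_confidence": 0.35,
    "validator_agreement": 0.25,
    "retrieval_relevance": 0.25,
    "evidence_volume": 0.15,
}


def composite_confidence(components: dict[str, float]) -> float:
    """Combine normalized components (each in [0, 1]) into a percentage."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")

    score = 0.0
    for name, weight in WEIGHTS.items():
        value = components[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        score += weight * value
    return round(score * 100, 1)


example = composite_confidence(
    {
        "llm_self_confidence": 0.9,
        "validator_agreement": 1.0,
        "retrieval_relevance": 0.8,
        "evidence_volume": 0.5,
    }
)
print(example)  # 84.0
```

Even with an LLM self-confidence of 0.9, thin evidence pulls the composite down to 84%, which is the point of not letting one signal dominate.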
4. Governance — graduation criteria and coordination
Autonomy isn't granted, it's earned. Two pieces of this layer are shipped:
Graduation gate evaluator — PhaseGate.evaluate_graduation() checks five hard criteria continuously:
| Criterion | Threshold |
|---|---|
| Diagnostic accuracy | ≥ 90% |
| Destructive false positives | 0 |
| Sev 3-4 autonomous resolution rate | ≥ 95% |
| Remediation integration coverage | ≥ 30% |
| Clean soak-test days | ≥ 7 |
Phase transitions today are an operator decision informed by this evaluator's output (the full state machine is a Phase 3 deliverable).
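A gate like this reduces to five comparisons and a list of reasons. The sketch below uses the thresholds from the table; the `GraduationMetrics` dataclass, its field names, and the standalone function are hypothetical stand-ins for whatever `PhaseGate.evaluate_graduation()` actually consumes and returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraduationMetrics:
    """Hypothetical input shape; rates are fractions in [0, 1]."""

    diagnostic_accuracy: float
    destructive_false_positives: int
    sev34_autonomous_resolution_rate: float
    remediation_integration_coverage: float
    clean_soak_test_days: int


def evaluate_graduation_sketch(m: GraduationMetrics) -> tuple[bool, list[str]]:
    """Return (passed, unmet criteria) using the five thresholds above."""
    failures: list[str] = []
    if m.diagnostic_accuracy < 0.90:
        failures.append("diagnostic accuracy below 90%")
    if m.destructive_false_positives != 0:
        failures.append("destructive false positives must be 0")
    if m.sev34_autonomous_resolution_rate < 0.95:
        failures.append("Sev 3-4 autonomous resolution rate below 95%")
    if m.remediation_integration_coverage < 0.30:
        failures.append("remediation integration coverage below 30%")
    if m.clean_soak_test_days < 7:
        failures.append("fewer than 7 clean soak-test days")
    return (not failures, failures)


passed, reasons = evaluate_graduation_sketch(
    GraduationMetrics(
        diagnostic_accuracy=0.93,
        destructive_false_positives=1,
        sev34_autonomous_resolution_rate=0.96,
        remediation_integration_coverage=0.35,
        clean_soak_test_days=3,
    )
)
print(passed, reasons)
```

Returning the unmet criteria alongside the boolean gives the operator something to act on, which fits the current model where a human makes the phase decision.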
Distributed lock manager — Redis, etcd, and in-memory adapters all implement DistributedLockManagerPort, so the agent acquires resource-scoped locks before remediation and releases them after verification. This is the foundation for multi-agent coordination (FinOps, SecOps) once a second agent enters the picture.
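The acquire, remediate, verify, release sequence maps naturally onto a context manager. This is a sketch under stated assumptions: the two-method `LockManagerPort` protocol shown here is a simplified guess at the real `DistributedLockManagerPort`, and the in-memory adapter omits TTLs and fencing tokens that a Redis or etcd adapter would need.

```python
import threading
from contextlib import contextmanager
from typing import Iterator, Protocol


class LockManagerPort(Protocol):
    """Simplified, hypothetical shape of the lock port."""

    def acquire(self, resource: str, owner: str) -> bool: ...
    def release(self, resource: str, owner: str) -> None: ...


class InMemoryLockManager:
    """Single-process adapter, suitable for tests only."""

    def __init__(self) -> None:
        self._owners: dict[str, str] = {}
        self._mutex = threading.Lock()

    def acquire(self, resource: str, owner: str) -> bool:
        with self._mutex:
            if resource in self._owners:
                return False
            self._owners[resource] = owner
            return True

    def release(self, resource: str, owner: str) -> None:
        with self._mutex:
            # Only the current owner may release the lock.
            if self._owners.get(resource) == owner:
                del self._owners[resource]


@contextmanager
def resource_lock(locks: LockManagerPort, resource: str, owner: str) -> Iterator[None]:
    """Hold a resource-scoped lock for the duration of a remediation."""
    if not locks.acquire(resource, owner):
        raise RuntimeError(f"{resource} is locked by another agent")
    try:
        yield
    finally:
        locks.release(resource, owner)


locks = InMemoryLockManager()
with resource_lock(locks, "deployment/checkout", "sre-agent"):
    # Remediation and verification would run here.
    contended = locks.acquire("deployment/checkout", "finops-agent")
print(contended)  # False
```

The `finally` clause releases the lock even if remediation or verification raises, so a failed run does not leave the resource blocked for other agents.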
5. Action & Guardrails — the final mile
This is where the AI meets reality. Every remediation passes through hard-coded safety gates before execution:
Safety gates (Python-enforced, not OPA):
- `BlastRadiusCalculator`: validates `affected_pods_percentage` against the plan's `max_blast_radius_percentage`, with a hard "scale-up ≤ 2× current replicas" override
- `KillSwitch`: global toggle that gates every autonomous action; emits domain events on activation
- `CooldownEnforcer`: prevents the same service from being remediated repeatedly in a short window
- `GuardrailOrchestrator`: composes the above and produces a single `GuardrailResult(allowed, reason)`
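Composed together, the gates behave roughly like the sketch below. Only `GuardrailResult(allowed, reason)`, `affected_pods_percentage`, `max_blast_radius_percentage`, and the 2× scale-up rule come from the description above; the `Plan` and `Guardrails` classes, the 600-second cooldown, and the check order are assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GuardrailResult:
    allowed: bool
    reason: str


@dataclass(frozen=True)
class Plan:
    """Hypothetical, trimmed-down remediation plan."""

    service: str
    affected_pods_percentage: float
    max_blast_radius_percentage: float
    current_replicas: int
    target_replicas: Optional[int] = None  # set only for scale actions


class Guardrails:
    def __init__(
        self, cooldown_seconds: float = 600.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.kill_switch_active = False
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_action: dict[str, float] = {}

    def check(self, plan: Plan) -> GuardrailResult:
        # The first failing gate wins; the kill switch is checked before anything else.
        if self.kill_switch_active:
            return GuardrailResult(False, "kill switch active")
        if plan.affected_pods_percentage > plan.max_blast_radius_percentage:
            return GuardrailResult(False, "blast radius exceeds plan limit")
        if plan.target_replicas is not None and plan.target_replicas > 2 * plan.current_replicas:
            return GuardrailResult(False, "scale-up exceeds 2x current replicas")
        last = self._last_action.get(plan.service)
        if last is not None and self._clock() - last < self._cooldown:
            return GuardrailResult(False, "service is in cooldown")
        return GuardrailResult(True, "all gates passed")

    def record_execution(self, plan: Plan) -> None:
        self._last_action[plan.service] = self._clock()


guardrails = Guardrails()
plan = Plan("checkout", affected_pods_percentage=10.0, max_blast_radius_percentage=25.0,
            current_replicas=4, target_replicas=6)
print(guardrails.check(plan))
guardrails.record_execution(plan)
print(guardrails.check(plan))  # now blocked by the cooldown
```

Injecting the clock keeps the cooldown gate testable without sleeping, which matters when these checks are the last thing standing between a diagnosis and a production change.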
Compute executors (idempotent, behind CloudOperatorPort):
| Platform | Adapter | Operations |
|---|---|---|
| Kubernetes | `kubernetes/operator.py` | Pod restart, deployment scale |
| AWS ECS | `aws/ecs_operator.py` | Service update, task restart |
| AWS EC2 ASG | `aws/ec2_asg_operator.py` | Instance refresh, scale |
| AWS Lambda | `aws/lambda_operator.py` | Concurrency adjust, redeploy |
| Azure App Service | `azure/app_service_operator.py` | Restart, scale-out |
| Azure Functions | `azure/functions_operator.py` | Plan adjustment |
After execution, RemediationVerifier compares post-action metrics against baseline within a sigma tolerance. A failed verification returns a FAILED status that the engine surfaces; automatic rollback on failure is planned for a future phase.
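A sigma-tolerance check of this kind can be sketched as follows. The assumptions are mine: the metric is one where higher is worse (memory, latency), the check is one-sided, the default tolerance is 3 sigma, and `VERIFIED` is a placeholder name for the success status; only `FAILED` is named in the description above.

```python
import statistics


def verify_recovery(
    baseline_samples: list[float],
    post_action_samples: list[float],
    sigma_tolerance: float = 3.0,
) -> str:
    """Return "VERIFIED" if post-action metrics are back within tolerance.

    One-sided check: values below the baseline mean always pass, because
    this sketch assumes a metric where higher is worse.
    """
    mean = statistics.fmean(baseline_samples)
    stdev = statistics.pstdev(baseline_samples)
    upper_bound = mean + sigma_tolerance * stdev
    post_mean = statistics.fmean(post_action_samples)
    return "VERIFIED" if post_mean <= upper_bound else "FAILED"


memory_baseline = [60.0, 62.0, 58.0, 61.0, 59.0]  # percent utilization
print(verify_recovery(memory_baseline, [61.0, 63.0, 62.0]))  # VERIFIED
print(verify_recovery(memory_baseline, [88.0, 90.0, 91.0]))  # FAILED
```

With this baseline the mean is 60 and the population standard deviation is about 1.41, so anything averaging above roughly 64.2% after the restart counts as a failed remediation.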
6. Data & Persistence — one source of truth
Consolidated around PostgreSQL (ADR-006), with ten migrations in src/sre_agent/adapters/persistence/migrations/:
- PostgreSQL is the system of record, with the transactional outbox pattern (`postgres_outbox.py` + `outbox_relay.py`) and a `processed_events` table for idempotent consumers
- TimescaleDB extension handles the `vector_embeddings` and metrics hypertables (migration 002, with continuous-aggregate tuning in migration 009)
- pgvector stores embeddings with HNSW indexing, including `SET LOCAL hnsw.ef_search = 100` per query for recall/latency tuning, plus a JSONB fallback for dev environments
- Redis Streams is the internal event bus (`XADD`/`XREADGROUP` with consumer groups, at-least-once delivery)
- Every LLM prompt, retrieval, tool call, and decision lands in an immutable reasoning trace via `reasoning_trace_store.py` (`agent_runs`, `tool_calls`, `retrieved_contexts`), the foundation for blameless post-mortems
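At-least-once delivery means a consumer can see the same event twice, which is what the `processed_events` table guards against. The sketch below shows the general pattern using the standard-library `sqlite3` module as a stand-in for PostgreSQL; the table columns, the `handle_once` helper, and the `incidents` table are hypothetical. In PostgreSQL the equivalent insert would use `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3
from typing import Callable


def handle_once(
    conn: sqlite3.Connection, event_id: str, handler: Callable[[sqlite3.Connection], None]
) -> bool:
    """Run `handler` at most once per event_id.

    The marker row and the handler's writes commit in one transaction, so a
    crash mid-handler rolls both back and the event can be retried.
    """
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)", (event_id,)
        )
        if cur.rowcount == 0:
            return False  # already processed: skip side effects
        handler(conn)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE incidents (id TEXT PRIMARY KEY, status TEXT)")


def open_incident(c: sqlite3.Connection) -> None:
    c.execute("INSERT INTO incidents (id, status) VALUES ('inc-1', 'OPEN')")


first = handle_once(conn, "evt-42", open_incident)
second = handle_once(conn, "evt-42", open_incident)  # redelivery of the same event
print(first, second)  # True False
```

The redelivered event is acknowledged without repeating its side effects, so the consumer stays safe under the retries that consumer groups can produce.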
🔄 The end-to-end flow
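Here's what happens from the moment a metric goes sideways to the moment the agent verifies recovery: telemetry is ingested, the anomaly is detected and correlated into a single incident, the RAG pipeline produces a diagnosis with a composite confidence score, the guardrails gate the remediation plan, an executor applies it, and the verifier checks post-action metrics against the baseline.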
Two things worth highlighting:
- Severity classification is deterministic, not LLM-driven. The `SeverityClassifier` combines multi-dimensional impact scoring with hard rules and service-tier metadata, so the same incident always gets the same Sev. This matters for audit and graduation criteria.
- Verification is structural, not aspirational. `RemediationVerifier` compares post-action metrics against the same baseline that triggered detection. If the agent restarts a pod and the OOM signal doesn't subside within sigma tolerance, the result is recorded as `FAILED`, so what the operator needs to review is unambiguous.
🧱 Design philosophy
Hexagonal architecture (Ports & Adapters)
The single most important architectural decision (ADR-001) was strict hexagonal layering. The domain/ package never imports kubernetes, boto3, anthropic, or openai directly. It depends only on abstract ports/, and every external interaction goes through an adapter — OtelProvider, CloudWatchProvider, PixieAdapter, AnthropicLLMAdapter, OpenAILLMAdapter, PgVectorAdapter, RedisStreamsEventBus, KubernetesOperator, and so on.
This isn't theoretical purity — it pays off concretely:
- The reasoning engine runs unchanged on AWS, Azure, and on-prem Kubernetes
- Every adapter is replaceable in tests with an in-memory fake; see `adapters/coordination/in_memory_lock_manager.py` and the JSONB fallback in the pgvector adapter
- No LangChain, no LlamaIndex, no framework lock-in: direct SDK calls with custom Pydantic dataclasses tailored to SRE diagnostics
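In code, the boundary looks something like this minimal sketch. The names `MetricsQueryPort`, `is_memory_pressured`, and `FakeMetrics` are hypothetical; the 85% threshold comes from the OOM trigger in the scope table, and the "for 5+ min" duration is left out for brevity.

```python
from typing import Protocol


class MetricsQueryPort(Protocol):
    """Port: the only thing the domain knows about metrics backends."""

    def memory_utilization(self, service: str) -> float:
        """Current memory utilization as a fraction in [0, 1]."""
        ...


def is_memory_pressured(
    metrics: MetricsQueryPort, service: str, threshold: float = 0.85
) -> bool:
    """Domain logic: depends on the port, never on a vendor SDK."""
    return metrics.memory_utilization(service) > threshold


class FakeMetrics:
    """In-memory adapter for unit tests; satisfies the port structurally."""

    def __init__(self, values: dict[str, float]) -> None:
        self._values = values

    def memory_utilization(self, service: str) -> float:
        return self._values[service]


fake = FakeMetrics({"checkout": 0.91, "search": 0.40})
print(is_memory_pressured(fake, "checkout"))  # True
print(is_memory_pressured(fake, "search"))    # False
```

A Prometheus-, CloudWatch-, or New Relic-backed adapter would implement the same method, and the domain function would not change.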
Autonomy earned, not granted
The agent is currently in Phase 2 (Assist). Diagnosis runs autonomously; remediation runs through the guardrail stack with approval state tracked in RemediationPlan.approval_state. The graduation evaluator continuously checks whether the agent has earned the right to advance:
| Phase | What runs | What's mutated | Status |
|---|---|---|---|
| 1 · Shadow | Detection + diagnosis | Nothing — actions written to audit log | ✅ Shipped |
| 2 · Assist | Diagnosis + plan generation, guardrail-gated execution | OOM and latency remediation against shipped compute executors | ✅ Current phase |
| 3 · Autonomous | Fully autonomous Sev 3-4 execution, automatic phase transitions | Subject to blast-radius limits and kill switch | 🟡 In design |
| 4 · Predictive | Trend analysis, pre-emptive action | Slow-moving issues caught before threshold breach | 🟡 Roadmap |
Graduation requires demonstrated diagnostic accuracy over a sustained window — not a manager's gut feel.
🛠️ Implementation stack
| Concern | Choice | Why |
|---|---|---|
| Language | Python 3.11+ | Mature AI/ML ecosystem, async-native |
| API surface | FastAPI | Async-first, OpenAPI by default |
| Data models | Pydantic v2 | Strict runtime validation of canonical types |
| LLM SDKs | `anthropic` + `openai` (direct) | No LangChain; domain models stay in Pydantic, no impedance mismatch |
| Embeddings | `sentence-transformers` | Local, deterministic, no API tax |
| Operational store | PostgreSQL + TimescaleDB | One database for OLTP, time-series, and vectors |
| Vector store | pgvector (HNSW) + JSONB fallback | Same DB, ACID transactions, no extra ops surface |
| Event bus | Redis Streams | At-least-once with consumer groups, simple ops |
| Coordination | Redis + etcd | Distributed locks for multi-agent fencing |
| Testing | Testcontainers + LocalStack Pro | True integration tests against ephemeral cloud |
The testing approach matters: because this agent mutates infrastructure, contract tests run against real (containerized) backends. In-memory fakes are reserved for unit tests against domain logic.
🚧 What's next — honest roadmap
To stay honest, here's what's designed but not yet shipped. These are the things you might expect from the architecture diagram that aren't running today:
- Operator dashboard: Next.js SPA with real-time incident feed, confidence visualization, phase tracker. OpenSpec proposal at `openspec/changes/phase-2-7-operator-dashboard/`; zero tasks complete.
- ChatOps adapters: Slack / Teams Block Kit approvals, PagerDuty escalation, Jira auto-ticketing. Directories specified, not yet implemented.
- GitOps executor: ArgoCD / PyGithub adapter for rollback PRs. `GITOPS_REVERT` strategy is defined; no execution adapter yet.
- Cert-manager executor: `CERTIFICATE_ROTATION` strategy mapped, no adapter to invoke cert-manager.
- Log-truncation executor: `LOG_TRUNCATION` strategy mapped, no shipped adapter.
- Phase state machine: the gate evaluator is shipped, but automatic phase transitions and a persisted state machine are Phase 3 work.
- Oscillation detector: multi-agent coordination has the lock primitive; oscillation detection between SRE/FinOps/SecOps agents is a future requirement when a second agent exists.
- WebSocket / SSE stream: REST endpoints are shipped; the real-time push channel for the dashboard is part of Phase 2.7.
I'm sharing this roadmap explicitly because the gap between "impressive architecture diagram" and "running code" is the credibility test for any AI infrastructure project. Phase 2 is what runs today; the rest is sequenced in OpenSpec changes with concrete tasks.
🚀 The bigger bet
By separating cognitive reasoning from infrastructure adapters, and by treating safety as a non-negotiable constraint rather than a feature, the goal is to bridge the gap between "impressive AI demos" and Tier-0 infrastructure automation that an enterprise actually trusts.
Phase 3 is where the agent stops asking for approval on Sev 3-4. Phase 4 is where it stops waiting for thresholds to be breached. Both require the foundation that Phase 2 lays — strict hexagonal boundaries, immutable audit trails, and graduation criteria that aren't negotiable.
Source code, OpenSpec changes, and ADRs: github.com/faizanhussainrabbani/autonomous-sre-agent
If you're working in platform engineering, AI infrastructure, or SRE, I'd love feedback on the architecture and safety patterns. What guardrails would you add? What would you take out? Drop a comment.















