Anyone who has spent time in operations knows that production incidents create a very different environment from normal engineering work.
The pressure is higher and information is incomplete. Everyone wants answers immediately. And ironically, that is exactly when poor decisions become most expensive.
As AI tools become more capable, I’ve been thinking about a simple question:
If an AI agent joined our incident bridge today, what would I actually trust it to do?
Relatable Incident Scenario
This is what a typical incident in an operations role looks like:
- HTTP 5xx rates suddenly spike
- Latency jumps from 150ms to 2s
- Multiple pods begin restarting
When you join the bridge, the first 15 minutes are chaos. People are asking all sorts of questions:
- What changed?
- Is this infrastructure?
- Is this application?
- Did a deployment happen?
- Should we rollback?
Where I Feel AI Is Helpful
This is where I happily treat AI as an SRE intern. I don’t ask it to fix anything. I ask it to gather context.
Example 1: Log Summarization
kubectl logs payment-service-1234xzy-abcd
That returns thousands of lines of output, which can easily take 5–10 minutes just to scroll through. How do I get AI to help here? I ask it to summarize:
Most common errors:
- Database connection timeout
- Retry exhaustion
- Elevated latency after deployment 17 minutes ago
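Before handing raw logs to an AI tool, it can help to pre-count the obvious error patterns so the summary is grounded in numbers. The following is a minimal sketch, not a definitive implementation: the regex patterns, the `summarize_errors` helper, and the sample log lines are all made up for illustration, and a real service will log different messages.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust them to whatever your services actually log.
ERROR_PATTERNS = {
    "Database connection timeout": re.compile(r"connection timeout", re.IGNORECASE),
    "Retry exhaustion": re.compile(r"retries exhausted|retry limit", re.IGNORECASE),
}

def summarize_errors(log_lines):
    """Count how many log lines match each known error pattern, most frequent first."""
    counts = Counter()
    for line in log_lines:
        for label, pattern in ERROR_PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
    return counts.most_common()

# Invented sample lines standing in for `kubectl logs` output.
sample = [
    "2024-01-01T10:04:01Z ERROR db: connection timeout after 5000ms",
    "2024-01-01T10:04:02Z ERROR db: connection timeout after 5000ms",
    "2024-01-01T10:04:03Z WARN client: retries exhausted for /charge",
    "2024-01-01T10:04:04Z INFO healthcheck ok",
]

for label, count in summarize_errors(sample):
    print(f"{count:>4}  {label}")
```

The counts, plus a handful of representative lines per pattern, are usually a much smaller input for the AI than the full log.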
Example 2: Timeline Reconstruction
AI is great at summarizing events. It can go through centralized logs and produce an event timeline, something like:
10:02 deployment
10:04 alerts fire
10:07 pod restarts
10:09 latency spike
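The mechanical part of a timeline like this is just merging events from several sources and sorting them by time. Here is a rough sketch, assuming each source has already been reduced to `(HH:MM, description)` pairs; the `build_timeline` helper and the three source lists are hypothetical names, not part of any real tool.

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge (timestamp, description) events from several sources into one ordered list."""
    events = [event for source in sources for event in source]
    # Parse "HH:MM" so the sort is by time rather than by raw string.
    events.sort(key=lambda event: datetime.strptime(event[0], "%H:%M"))
    return [f"{timestamp} {description}" for timestamp, description in events]

# Illustrative inputs, as if collected from a deploy tool, alerting, and the cluster.
deploy_events = [("10:02", "deployment")]
alert_events = [("10:09", "latency spike"), ("10:04", "alerts fire")]
pod_events = [("10:07", "pod restarts")]

for line in build_timeline(deploy_events, alert_events, pod_events):
    print(line)
```

Where AI adds value is the step before this one: pulling those events out of messy, inconsistent log formats in the first place.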
Example 3: Runbook Search
Find incidents involving Redis connection exhaustion
AI can quickly search runbooks, previous incident reports, and postmortems. During an incident, that saves a lot of time.
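To make the idea concrete, here is a deliberately naive sketch of that kind of search, using plain keyword overlap instead of whatever retrieval an actual AI tool uses. The `search_documents` helper, the document titles, and their contents are all invented for the example.

```python
def search_documents(query, documents):
    """Rank documents by how many query words appear in their text."""
    terms = set(query.lower().split())
    scored = []
    for title, text in documents.items():
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored]

# Hypothetical runbooks and postmortems, reduced to one-line summaries.
documents = {
    "postmortem-redis-pool": "redis connection exhaustion caused by unbounded client pool",
    "runbook-db-failover": "steps for database failover and connection draining",
    "runbook-cert-rotation": "rotate tls certificates before expiry",
}

print(search_documents("Redis connection exhaustion", documents))
```

An AI assistant does the same job with far better matching, but the contract is identical: a question goes in, a ranked list of documents comes out, and nothing in production changes.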
What I Don’t Let It Touch
At some point during every incident, somebody proposes an action.
Roll back.
Scale up.
Restart pods.
Increase database connections.
Disable a feature flag.
These are not information problems anymore. They are judgment problems. We need a human in the loop here.
For example, an AI may suggest restarting a deployment because similar log patterns appeared before.
What it doesn’t know is that the deployment currently serves a major customer onboarding event. Or that another team is already running a migration. Or that the last restart caused a cascading failure. This is business and operational context that the AI will not have.
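One way to encode that boundary is an explicit allowlist: read-only actions run freely, and anything else is blocked until a named human approves it. This is only a sketch of the idea under my own assumptions; the action names and the `run_action` helper are hypothetical, and a real setup would enforce this through access controls rather than a Python function.

```python
# Hypothetical read-only actions the AI is allowed to perform on its own.
READ_ONLY_ACTIONS = {"summarize_logs", "build_timeline", "search_runbooks"}

def run_action(action, approved_by=None):
    """Allow read-only actions; require a named human approver for everything else."""
    if action in READ_ONLY_ACTIONS:
        return f"running {action}"
    if approved_by is None:
        return f"blocked {action}: needs human approval"
    return f"running {action} (approved by {approved_by})"

print(run_action("summarize_logs"))
print(run_action("restart_pods"))
print(run_action("restart_pods", approved_by="on-call SRE"))
```

The default matters here: an action that is not on the list is treated as risky, so a new capability has to be explicitly classified as read-only before the AI can use it unattended.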
Conclusion
If a new SRE joined the team tomorrow, I wouldn’t hand them production access during their first major outage.
I’d ask them to gather information. Build timelines. Summarize findings. Surface patterns. Learn the system.
That’s exactly how I think about AI during incidents. Not as the person driving the response. But as the intern helping the team understand the situation faster.
How would you approach it?
Note: This article was written with the assistance of AI tools for structuring and drafting. The ideas, examples, and perspectives are based on real-world experience in DevOps and SRE.
Originally published on Medium:













