Incidents

Incident Residue

23 April 2026·2 mins

I’ve been thinking for a while about how incident response is going to change, and how it has already changed since the pre-ML days. Todd Underwood did a great chapter in Reliable Machine Learning which tried to illustrate how IR changes in the modern world. In brief, it becomes harder both to investigate what’s going on and to follow the standard troubleshooting approach of building a mental model of what’s happened, because you no longer have a causally strong relationship between actions and outcomes. It’s also going to involve a lot more coordination between different groups, as ML will typically pull in data from across the business to an unprecedented extent.

But I came across this today - thanks to Eric Dobbs in RISF - which talks about one likely feature of the future that hasn’t gotten much attention outside leading-edge circles: as AI SRE systems hoover up the easier tasks, the harder tasks will be the only ones left. This is the “left behind” issue.

Most folks who look at this have pointed out that as the easier issues go away, it’s harder to train on what remains, and (modulo learning styles) I think that’s true. What I think is less explored is how IR changes when you actually can’t construct a model of how the system works by asking a sufficiently aware human. We will, in short, become dependent on the same tools that created the additional complexity to penetrate and resolve that complexity in real time, every time there’s an incident.

We should bear that in mind when we think about how to staff, and what to pay for, in the domain of incident response. The stuff that’s left behind - the incident residue - is the stickiest of all.

What SRE could be

4 June 2022·24 mins

Today, I believe we cannot successfully answer several key questions about SRE. Let’s start with the most important one: how can we understand what reliability customers want and need?