<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Incidents on Non-Functional Blog</title><link>https://non-functional.net/tags/incidents/</link><description>Recent content in Incidents on Non-Functional Blog</description><generator>Hugo</generator><language>en</language><lastBuildDate>Thu, 23 Apr 2026 23:41:35 +0100</lastBuildDate><atom:link href="https://non-functional.net/tags/incidents/index.xml" rel="self" type="application/rss+xml"/><item><title>Incident Residue</title><link>https://non-functional.net/posts/2026-04-23-incident-residue/</link><pubDate>Thu, 23 Apr 2026 23:41:35 +0100</pubDate><guid>https://non-functional.net/posts/2026-04-23-incident-residue/</guid><description>&lt;p&gt;I&amp;rsquo;ve been thinking for a while about how incident response is going to
change, and how it has already changed since the pre-ML days.
&lt;a href="https://www.linkedin.com/in/toddunder/" target="_blank" rel="noreferrer"&gt;Todd Underwood&lt;/a&gt; did a great
chapter in &lt;a href="https://www.oreilly.com/library/view/reliable-machine-learning/9781098106218/ch11.html" target="_blank" rel="noreferrer"&gt;Reliable Machine
Learning&lt;/a&gt;
which tried to illustrate how IR changes in the modern world. In
brief, it becomes harder both to investigate what&amp;rsquo;s going on and to follow the standard
troubleshooting approach of building a mental model of what has happened,
because you no longer have a causally strong relationship between actions and outcomes.
It&amp;rsquo;s also going to involve far more coordination between different groups, as ML will
typically pull in data from across the business to an unprecedented extent.&lt;/p&gt;
&lt;p&gt;But I came across this today - thanks to &lt;a href="https://www.linkedin.com/in/dobbse" target="_blank" rel="noreferrer"&gt;Eric
Dobbs&lt;/a&gt; in
&lt;a href="https://resilienceinsoftware.org/" target="_blank" rel="noreferrer"&gt;RISF&lt;/a&gt; - which talks about one
likely feature of the future that hasn&amp;rsquo;t gotten much attention outside
leading-edge circles: as AI SRE systems hoover up the easier tasks,
only the harder tasks will be left, the &lt;a href="https://www.linkedin.com/pulse/what-ai-incident-response-leaves-behind-uptime-labs-tmdve/" target="_blank" rel="noreferrer"&gt;&amp;ldquo;left behind&amp;rdquo;
issue&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Most folks who look at this have pointed out that as the easier issues
go away, it&amp;rsquo;s harder to train on what remains, and (modulo learning
styles) I think that&amp;rsquo;s true; what is less explored is how IR changes
when you actually &lt;em&gt;can&amp;rsquo;t&lt;/em&gt; construct a model of how the system works by
asking a sufficiently aware human. In short, every time there&amp;rsquo;s an
incident, we will depend on the very tools that created the additional
complexity to penetrate and resolve that complexity in real time.&lt;/p&gt;
&lt;p&gt;We should bear that in mind when we think about how to staff, and what
to pay for, in the domain of incident response. The stuff that&amp;rsquo;s left
behind - the incident residue - is the stickiest of all.&lt;/p&gt;</description></item><item><title>What SRE could be</title><link>https://non-functional.net/posts/2022-06-04-what-sre-could-be/</link><pubDate>Sat, 04 Jun 2022 14:14:45 +0000</pubDate><guid>https://non-functional.net/posts/2022-06-04-what-sre-could-be/</guid><description>Today, I believe we cannot successfully answer several key questions about SRE. Let&amp;rsquo;s start with the most important one: how can we understand what reliability customers want and need?</description></item></channel></rss>