[{"content":"Non-functional is the personal blog of Niall Murphy, dealing with topics in software, systems, poetry, photography, philosophy, AI, and many more. \u0026ldquo;I have measured out my life with non-functional requirements\u0026rdquo; \u0026ndash; T.S. Eliot.\n","date":null,"permalink":"https://non-functional.net/","section":"Non-functional Blog","summary":"\u003cp\u003eNon-functional is the personal blog of Niall Murphy, dealing with topics in software, systems, poetry, photography, philosophy, AI, and many more. \u0026ldquo;I have measured out my life with non-functional requirements\u0026rdquo; \u0026ndash; T.S. Eliot.\u003c/p\u003e","title":"Non-functional Blog"},{"content":"","date":null,"permalink":"https://non-functional.net/posts/","section":"Posts","summary":"","title":"Posts"},{"content":"Came across this site the other day. Though the Changelog seems to suggest it was last updated in May 2016, which makes it a bit more likely to be slop, I think the overall message is not misplaced.\n","date":"18 May 2026","permalink":"https://non-functional.net/posts/2026-05-18-reclaim-the-sreets/","section":"Posts","summary":"\u003cp\u003eCame across \u003ca href=\"https://www.reclaimsre.com/\" target=\"_blank\" rel=\"noreferrer\"\u003ethis site\u003c/a\u003e the other day. 
Though\nthe Changelog seems to suggest it was last updated in May 2016, which makes\nit a bit more likely to be slop, I think the overall message is not misplaced.\u003c/p\u003e","title":"Reclaim the SREets"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/sre/","section":"Tags","summary":"","title":"SRE"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/","section":"Tags","summary":"","title":"Tags"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/cost/","section":"Tags","summary":"","title":"Cost"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/engineering/","section":"Tags","summary":"","title":"Engineering"},{"content":"A great piece from James Shore on the maintenance cost of software, and some potential down-sides to AI SWE.\n","date":"12 May 2026","permalink":"https://non-functional.net/posts/2026-05-12-maintenance-cost-of-software/","section":"Posts","summary":"\u003cp\u003eA \u003ca href=\"https://www.jamesshore.com/v2/blog/2026/you-need-ai-that-reduces-your-maintenance-costs\" target=\"_blank\" rel=\"noreferrer\"\u003egreat piece\u003c/a\u003e from James\nShore on the maintenance cost of software, and some potential down-sides to AI SWE.\u003c/p\u003e","title":"Maintenance Cost of Software"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/swe/","section":"Tags","summary":"","title":"Swe"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/county-down/","section":"Tags","summary":"","title":"County Down"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/horses/","section":"Tags","summary":"","title":"Horses"},{"content":" ","date":"10 May 2026","permalink":"https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/","section":"Posts","summary":"\u003cp\u003e\n\n\n\n\n\n\u003cfigure\u003e\n    \n    \n\n\n\n\n\n\n\n\n  \n    \u003cpicture\n      class=\"mx-auto my-0 rounded-md\"\n      \n    
\u003e\n      \n      \n      \n      \n        \u003csource\n          \n            srcset=\"https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_55553f0b646e53c7.webp 330w,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_fe3e90a02aa47a98.webp 660w\n            \n              ,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_a1875b199df84c48.webp 1024w\n            \n            \n              ,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_b25a23aa76b19194.webp 1320w\n            \"\n          \n          sizes=\"100vw\"\n          type=\"image/webp\"\n        /\u003e\n      \n      \u003cimg\n        width=\"5310\"\n        height=\"2865\"\n        class=\"mx-auto my-0 rounded-md\"\n        alt=\"Horses on Tyrella Beach, County Down, Ireland\"\n        loading=\"lazy\" decoding=\"async\"\n        \n          src=\"https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_2b918e64212a53c5.jpg\" srcset=\"https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_3c907b0133515e98.jpg 330w,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_2b918e64212a53c5.jpg 660w\n          \n            ,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_2b2376040db167be.jpg 1024w\n          \n          \n            ,https://non-functional.net/posts/2026-05-10-horses-on-tyrella-beach/DJI_20260418150327_0610_D-Pano_hu_635460a5149d6429.jpg 1320w\n          \"\n          sizes=\"100vw\"\n        \n      /\u003e\n    \u003c/picture\u003e\n  \n\n\n\u003c/figure\u003e\n\u003c/p\u003e","title":"Horses on Tyrella 
Beach"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/ireland/","section":"Tags","summary":"","title":"Ireland"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/photography/","section":"Tags","summary":"","title":"Photography"},{"content":"A great treatise on wifi. Ah, the good old days, etc.\n","date":"10 May 2026","permalink":"https://non-functional.net/posts/2026-05-10-excellent-wifi-treatise/","section":"Posts","summary":"\u003cp\u003eA great \u003ca href=\"https://www.wiisfi.com/\" target=\"_blank\" rel=\"noreferrer\"\u003etreatise on wifi\u003c/a\u003e. Ah, the good old days, etc.\u003c/p\u003e","title":"Excellent wifi treatise"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/explanations/","section":"Tags","summary":"","title":"Explanations"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/tech/","section":"Tags","summary":"","title":"Tech"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/wifi/","section":"Tags","summary":"","title":"Wifi"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/wireless/","section":"Tags","summary":"","title":"Wireless"},{"content":"A short while ago, an ex-colleague from an old team asked for some career advice. I did what I could (though I didn\u0026rsquo;t think it was very useful). He was kind enough to send me a card and a (very large) collection of chocolates in return. 
It was a great kindness that reminded me of a great team that was always under pressure.\nAs the tech industry is collectively doing a lot to erase every aspect of being human from working in it, and especially any question of the vulnerability that naturally attaches to it, it does some good to remind ourselves that there is virtue in being human.\n","date":"8 May 2026","permalink":"https://non-functional.net/posts/2026-05-10-career-counselling/","section":"Posts","summary":"\u003cp\u003eA short while ago, an ex-colleague from an old team asked for some career advice. I did what I could (though\nI didn\u0026rsquo;t think it was very useful). He was kind enough to send me a card and a (very large) collection of chocolates\nin return. It was a great kindness that reminded me of a great team that was always under pressure.\u003c/p\u003e\n\u003cp\u003eAs the tech industry is collectively doing a lot to erase every aspect of being human from working in it, and especially any\nquestion of the vulnerability that naturally attaches to it, it does some good to remind ourselves that there is virtue\nin being human.\u003c/p\u003e\n\u003cp\u003e\n\n\n\n\n\n\u003cfigure\u003e\n    \n    \n\n\n\n\n\n\n\n\n  \n    \u003cpicture\n      class=\"mx-auto my-0 rounded-md\"\n      \n    \u003e\n      \n      \n      \n      \n        \u003csource\n          \n            srcset=\"https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_65a4b6b47ac4e303.webp 330w,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_ff63d6382a042b8f.webp 660w\n            \n              ,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_3ba347f177d7c7ce.webp 1024w\n            \n            \n              ,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_96a6e73af07f8b85.webp 1320w\n            \"\n          \n          sizes=\"100vw\"\n          type=\"image/webp\"\n        /\u003e\n      \n 
     \u003cimg\n        width=\"3144\"\n        height=\"4192\"\n        class=\"mx-auto my-0 rounded-md\"\n        alt=\"A nice thank-you card\"\n        loading=\"lazy\" decoding=\"async\"\n        \n          src=\"https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_e9a49ad023810736.jpg\" srcset=\"https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_29316ebad206e2e9.jpg 330w,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_e9a49ad023810736.jpg 660w\n          \n            ,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_4a7a20bbe7cd9b3.jpg 1024w\n          \n          \n            ,https://non-functional.net/posts/2026-05-10-career-counselling/IMG_1062-EDIT_hu_bf500153e7fb3e60.jpg 1320w\n          \"\n          sizes=\"100vw\"\n        \n      /\u003e\n    \u003c/picture\u003e\n  \n\n\n\u003c/figure\u003e\n\u003c/p\u003e","title":"Career counselling"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/careers/","section":"Tags","summary":"","title":"Careers"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/kindness/","section":"Tags","summary":"","title":"Kindness"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/present/","section":"Tags","summary":"","title":"Present"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/ai/","section":"Tags","summary":"","title":"AI"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/caring/","section":"Tags","summary":"","title":"Caring"},{"content":"A great article from the always authentic and enjoyable World\u0026rsquo;s Greatest Newsletter (ignore the name) from the Raw Signal Group.\n\u0026ldquo;Businesspeople, we’ve entered a weird moment when caring about the organization and your craft is a liability. 
And when pressed for details on why caring less seems appealing, the answers are dark.\u0026rdquo;\n","date":"30 April 2026","permalink":"https://non-functional.net/posts/2026-04-30-corporate-slop/","section":"Posts","summary":"\u003cp\u003eA great article from the always authentic and enjoyable \u003cem\u003eWorld\u0026rsquo;s Greatest Newsletter\u003c/em\u003e (ignore the name) from the Raw Signal Group.\u003c/p\u003e\n\u003cp\u003e\u003ca href=\"https://www.rawsignal.ca/newsletter-archive/the-people-who-care-are-having-the-hardest-time/\" target=\"_blank\" rel=\"noreferrer\"\u003e\u0026ldquo;Businesspeople, we’ve entered a weird moment when caring about the organization and your craft is a liability. And when pressed for details on why caring less seems appealing, the answers are dark.\u0026rdquo;\u003c/a\u003e\u003c/p\u003e","title":"Corporate slop"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/raw-signal-group/","section":"Tags","summary":"","title":"Raw Signal Group"},{"content":"I\u0026rsquo;ve been thinking for a while about how incident response is going to change, and how it has already changed since the pre-ML days. Todd Underwood did a great chapter in Reliable Machine Learning which tried to illustrate how IR changes in the modern world. In brief, it becomes harder to both investigate what\u0026rsquo;s going on, and also follow the standard troubleshooting approach of building a mental model in your head of what\u0026rsquo;s happened when you no longer have a causally strong relationship between actions and outcomes. 
It\u0026rsquo;s also going to involve a lot more coordination between different groups, as ML will typically pull in data from across the business to a previously unprecedented extent.\nBut I came across this today - thanks to Eric Dobbs in RISF - which talks about one likely feature of the future that hasn\u0026rsquo;t gotten much attention outside leading edge circles, and that\u0026rsquo;s the fact that as AI SRE systems hoover up the easier tasks, the harder tasks will be the only ones that are left: the \u0026ldquo;left behind\u0026rdquo; issue.\nMost folks who look at this have pointed out that as the easier issues go away, it\u0026rsquo;s harder to train on what remains, and (modulo learning styles) I think that\u0026rsquo;s true; what I think is less explored is how IR changes when you actually can\u0026rsquo;t construct a model of how the system works by asking a sufficiently aware human. We will, in short, become dependent on the same tools that created the additional complexity to penetrate and resolve that complexity in real-time, every time there\u0026rsquo;s an incident.\nWe should bear that in mind when we think about how to staff, and what to pay for, in the domain of incident response. 
The stuff that\u0026rsquo;s left behind - the incident residue - is the stickiest of all.\n","date":"23 April 2026","permalink":"https://non-functional.net/posts/2026-04-23-incident-residue/","section":"Posts","summary":"\u003cp\u003eI\u0026rsquo;ve been thinking for a while about how incident response is going to\nchange, and how it has already changed since the pre-ML days.\n\u003ca href=\"https://www.linkedin.com/in/toddunder/\" target=\"_blank\" rel=\"noreferrer\"\u003eTodd Underwood\u003c/a\u003e did a great\nchapter in \u003ca href=\"https://www.oreilly.com/library/view/reliable-machine-learning/9781098106218/ch11.html\" target=\"_blank\" rel=\"noreferrer\"\u003eReliable Machine\nLearning\u003c/a\u003e\nwhich tried to illustrate how IR changes in the modern world. In\nbrief, it becomes harder to both investigate what\u0026rsquo;s going on, and also follow the standard\ntroubleshooting approach of building a mental model in your head of what\u0026rsquo;s happened\nwhen you no longer have a causally strong relationship between actions and outcomes.\nIt\u0026rsquo;s also going to involve a lot more coordination between different groups, as ML will\ntypically pull in data from across the business to a previously unprecedented extent.\u003c/p\u003e\n\u003cp\u003eBut I came across this today - thanks to \u003ca href=\"https://www.linkedin.com/in/dobbse\" target=\"_blank\" rel=\"noreferrer\"\u003eEric\nDobbs\u003c/a\u003e in\n\u003ca href=\"https://resilienceinsoftware.org/\" target=\"_blank\" rel=\"noreferrer\"\u003eRISF\u003c/a\u003e - which talks about one\nlikely feature of the future that hasn\u0026rsquo;t gotten much attention outside\nleading edge circles, and that\u0026rsquo;s the fact that as AI SRE systems\nhoover up the easier tasks, the harder tasks will be the only ones\nthat are left: the \u003ca href=\"https://www.linkedin.com/pulse/what-ai-incident-response-leaves-behind-uptime-labs-tmdve/\" target=\"_blank\" rel=\"noreferrer\"\u003e\u0026ldquo;left 
behind\u0026rdquo;\nissue\u003c/a\u003e.\u003c/p\u003e\n\u003cp\u003eMost folks who look at this have pointed out that as the easier issues\ngo away, it\u0026rsquo;s harder to train on what remains, and (modulo learning\nstyles) I think that\u0026rsquo;s true; what I think is less explored is how IR changes\nwhen you actually \u003cem\u003ecan\u0026rsquo;t\u003c/em\u003e construct a model of how the system works by\nasking a sufficiently aware human. We will, in short, become dependent on\nthe same tools that created the additional complexity to penetrate and\nresolve that complexity in real-time, every time there\u0026rsquo;s an incident.\u003c/p\u003e\n\u003cp\u003eWe should bear that in mind when we think about how to staff, and what\nto pay for, in the domain of incident response. The stuff that\u0026rsquo;s left\nbehind - the incident residue - is the stickiest of all.\u003c/p\u003e","title":"Incident Residue"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/incidents/","section":"Tags","summary":"","title":"Incidents"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/ai-sre/","section":"Tags","summary":"","title":"AI SRE"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/komodor/","section":"Tags","summary":"","title":"Komodor"},{"content":"The AI SRE space is, as of the time of writing, absolutely insane. At some point in 2025, I counted the number of players and the amount of money rushing into the space - it was 20+ and over a billion dollars, if you included all funding numbers I\u0026rsquo;d found plus the numbers of incumbents in e.g. Cloud talking about how much they were going to invest in the space. 
It may well turn out to be one of those situations where it\u0026rsquo;s easy to make a prima-facie argument that the problem space is big, almost everyone \u0026ldquo;suffers from it\u0026rdquo;, and that it\u0026rsquo;s easy to make progress (given the current state of agentic development, etc etc), but it\u0026rsquo;s quite hard to deliver something that actually makes a difference and more importantly that is not like everyone else\u0026rsquo;s three foundational models in a trenchcoat.\nEarlier in my career there were very similar conversations about mobile phone providers (really operators), who quickly became seen as being essentially commoditised - everyone would pick from a similar set of network gear provided by a small set of manufacturers, the handsets were mostly commoditised etc, etc. Ultimately they did what a lot of businesses in similar positions did, which is to attempt to differentiate themselves on price, branding/marketing, or customer service. There may well be a similar effect playing out in this market too.\nIn unrelated events, I see that Komodor are organising an AI SRE summit and that looks like an interesting speaker list, though I wonder precisely how vendor neutral that\u0026rsquo;s going to be.\n","date":"22 April 2026","permalink":"https://non-functional.net/posts/2026-04-22-komodor-doing-an-ai-sre-summit/","section":"Posts","summary":"\u003cp\u003eThe AI SRE space is, as of the time of writing, absolutely insane. At some point in 2025, I counted the number of\nplayers and the amount of money rushing into the space - it was 20+ and over a billion dollars, if you included\nall funding numbers I\u0026rsquo;d found plus the numbers of incumbents in e.g. 
Cloud talking about how much they were going to invest in the space.\nIt may well turn out to be one of those situations where it\u0026rsquo;s easy to make a \u003cem\u003eprima-facie\u003c/em\u003e argument that the problem\nspace is big, almost everyone \u0026ldquo;suffers from it\u0026rdquo;, and that it\u0026rsquo;s easy to make progress (given the current state of agentic development, etc etc),\nbut it\u0026rsquo;s quite hard to deliver something that actually makes a difference and more importantly that is not like everyone\nelse\u0026rsquo;s three foundational models in a trenchcoat.\u003c/p\u003e\n\u003cp\u003eEarlier in my career there were very similar conversations about mobile phone providers (really operators), who quickly\nbecame seen as being essentially commoditised - everyone would pick from a similar set of network gear provided by a small set\nof manufacturers, the handsets were mostly commoditised etc, etc. Ultimately they did what a lot of businesses in similar\npositions did, which is to attempt to differentiate themselves on price, branding/marketing, or customer service. 
There may well be a similar\neffect playing out in this market too.\u003c/p\u003e\n\u003cp\u003eIn unrelated events, I see that Komodor are organising \u003ca href=\"https://komodor.com/ai-sre-summit-2026/\" target=\"_blank\" rel=\"noreferrer\"\u003ean AI SRE summit\u003c/a\u003e and that\nlooks like an interesting speaker list, though I wonder precisely how vendor neutral that\u0026rsquo;s going to be.\u003c/p\u003e","title":"Komodor doing an AI SRE summit"},{"content":"From college mate Ian\u0026rsquo;s time at a Cloudflare session, we learn that bot traffic is 50% of overall web traffic, and AI agent traffic is circa 7%.\nIt seems likely both of those numbers will go up.\n","date":"21 April 2026","permalink":"https://non-functional.net/posts/2026-04-21-bot-traffic-on-the-web/","section":"Posts","summary":"\u003cp\u003eFrom college mate Ian\u0026rsquo;s time at a \u003ca href=\"https://world.hey.com/ian.mulvany/cloudflare-connect-on-tour-london-notes-69a37b5a\" target=\"_blank\" rel=\"noreferrer\"\u003eCloudflare\nsession\u003c/a\u003e,\nwe learn that bot traffic is 50% of overall web traffic, and AI agent traffic is circa 7%.\u003c/p\u003e\n\u003cp\u003eIt seems likely both of those numbers will go up.\u003c/p\u003e","title":"Bot traffic on the web"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/security/","section":"Tags","summary":"","title":"Security"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/anthropic/","section":"Tags","summary":"","title":"Anthropic"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/claude/","section":"Tags","summary":"","title":"Claude"},{"content":"The incomparable Ethan Ding on the disjunction that most of us are feeling right now. Claude speeds up certain things - quite a lot - but it also slows us down. 
We need to have a more accurate model of what\u0026rsquo;s happening to software, and this is an accessible primer on one possible scenario.\n(I also loved his piece on levered beta, which I think played out essentially as he wrote.)\n","date":"21 April 2026","permalink":"https://non-functional.net/posts/2026-04-21-the-revenge-of-k-shaped-engineering/","section":"Posts","summary":"\u003cp\u003eThe incomparable \u003ca href=\"https://ethanding.substack.com/p/claude-code-is-not-making-your-product\" target=\"_blank\" rel=\"noreferrer\"\u003eEthan Ding on the disjunction\u003c/a\u003e\nthat most of us are feeling right now. Claude speeds up certain things - quite a lot - but it also slows us down. We need to\nhave a more accurate model of what\u0026rsquo;s happening to software, and this is an accessible primer on one possible scenario.\u003c/p\u003e\n\u003cp\u003e(I also loved \u003ca href=\"https://ethanding.substack.com/p/levered-beta-is-all-you-need\" target=\"_blank\" rel=\"noreferrer\"\u003ehis piece on levered beta\u003c/a\u003e, which I think played out\nessentially as he wrote.)\u003c/p\u003e","title":"The Revenge of K-shaped Engineering"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/ai-for-sre/","section":"Tags","summary":"","title":"AI for SRE"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/book/","section":"Tags","summary":"","title":"Book"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/o-reilly/","section":"Tags","summary":"","title":"O' Reilly"},{"content":"As an author, I strongly dislike O\u0026rsquo; Reilly\u0026rsquo;s Early Release model, since my stuff gets poked at before it\u0026rsquo;s ready. 
As a reader, I strongly like O\u0026rsquo; Reilly\u0026rsquo;s Early Release model, since I can poke at other people\u0026rsquo;s stuff before it\u0026rsquo;s ready!\nO\u0026rsquo; Reilly\u0026rsquo;s Safari platform is hosting the latest chapters on STPA and AI for SRE.\n","date":"20 April 2026","permalink":"https://non-functional.net/posts/2026-04-20-sre-book-second-edition-early-release/","section":"Posts","summary":"\u003cp\u003eAs an author, I strongly dislike O\u0026rsquo; Reilly\u0026rsquo;s Early Release model,\nsince my stuff gets poked at before it\u0026rsquo;s ready. As a reader, I\nstrongly like O\u0026rsquo; Reilly\u0026rsquo;s Early Release model, since I can poke at\nother people\u0026rsquo;s stuff before it\u0026rsquo;s ready!\u003c/p\u003e\n\u003cp\u003eO\u0026rsquo; Reilly\u0026rsquo;s Safari platform is hosting the latest chapters on \u003ca href=\"https://www.oreilly.com/library/view/site-reliability-engineering/9798341607675/ch03.html\" target=\"_blank\" rel=\"noreferrer\"\u003eSTPA\u003c/a\u003e and \u003ca href=\"https://www.oreilly.com/library/view/site-reliability-engineering/9798341607675/ch04.html\" target=\"_blank\" rel=\"noreferrer\"\u003eAI for SRE\u003c/a\u003e.\u003c/p\u003e","title":"SRE Book Second Edition Early Release"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/srebook/","section":"Tags","summary":"","title":"Srebook"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/stamp/","section":"Tags","summary":"","title":"STAMP"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/stpa/","section":"Tags","summary":"","title":"STPA"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/ai-security/","section":"Tags","summary":"","title":"AI Security"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/opus/","section":"Tags","summary":"","title":"Opus"},{"content":"I strongly suspect that Zvi doesn\u0026rsquo;t need more inbound 
links, but his latest model card assessment (which is, as usual, very well written) has a couple of notable quotes:\nSo yeah, none of that sounds great. It all sounds like the types of thing that, if you caught a human doing them even once, that would be a very bad sign, and in several cases you would obviously have to fire them.\nCheck out the examples of Mythos Preview attempting (and in some cases succeeding, only to be caught by the human at the last moment) to escape containment.\n","date":"20 April 2026","permalink":"https://non-functional.net/posts/2026-04-20-opus-4-7-model-card-and-zvi/","section":"Posts","summary":"\u003cp\u003eI strongly suspect that Zvi doesn\u0026rsquo;t need more inbound links, but his latest\n\u003ca href=\"https://thezvi.substack.com/p/opus-47-part-1-the-model-card\" target=\"_blank\" rel=\"noreferrer\"\u003emodel card assessment\u003c/a\u003e\n(which is, as usual, very well written) has a couple of notable quotes:\u003c/p\u003e\n\u003cblockquote\u003e\n\u003cp\u003eSo yeah, none of that sounds great. It all sounds like the types of thing that, if you caught a human doing them even once, that would be a very bad sign, and in several cases you would obviously have to fire them.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eCheck out the examples of Mythos Preview attempting (and in some cases succeeding, only to be caught by the human at the last moment) to escape containment.\u003c/p\u003e","title":"Opus 4.7 Model Card and Mythos Preview"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/zvi/","section":"Tags","summary":"","title":"Zvi"},{"content":"What is graceful degradation? #Graceful degradation is the idea that, when you can\u0026rsquo;t serve the user precisely what they wanted, instead of serving the user an error, you serve them some in-between thing.\nThe details of this depend a lot on what exactly it is you\u0026rsquo;re trying to do. 
Let\u0026rsquo;s look at a few examples.\nImage server #Generally speaking, an image server has a very large collection of images stored on disk of some kind, accessed via a filename or unique identifier in the URL, and the aim is to serve it as quickly as possible. (A very close cousin of blob server — binary large object server — except the blobs are explicitly designated as images.) If you can\u0026rsquo;t access the file (or the filesystem), or in other ways the content isn\u0026rsquo;t there, picking another image from the filesystem to serve is not tenable, for obvious random roulette reasons.\nThere are three possible approaches here.\nTaking advantage of the fact you know it\u0026rsquo;s supposed to be an image, sometimes you have enough hints from the URL parameters or metadata to provide an image of the right size/format back, with hard-coded 404-equivalent content in the image itself. This is often useful in and of itself, because people know you received the request and processed it, but the image is missing.\nAnother technique relies on the situation that sometimes the image content is intended to be versioned — i.e. that version 2 of an image might be an evolution of version 1. In that case, you could return an old version (perhaps with out-of or in-band signalling that it\u0026rsquo;s old, or what precise version it is).\nFinally, and this is a very generically reusable technique, it\u0026rsquo;s possible to keep a cache (usually LRU, or equivalent), so that even if you can\u0026rsquo;t find the file on disk, there\u0026rsquo;s a decent chance it can be found in a cache you\u0026rsquo;re keeping elsewhere. 
In cases like this, the user might not even perceive an error in most cases (unless the cache is unevenly distributed, or you need to update the image), but the fact that this happened at all should be recorded somewhere for all kinds of good operational reasons.\nBlog #A blog often has a newsfeed-like structure, where there are posts organised by date, each of which can be individually accessed, or some notion of accessing \u0026ldquo;the latest content for poster X\u0026rdquo;.\nHere we observe that if you can\u0026rsquo;t access the content for the specific post the requestor is looking for, there is generally not much value to serving them a different one. Sometimes, rarely, posts have strong mappings to particular topics, and a user could be provided with an opportunity to pick other posts on those topics, if the desired one is not available.\nThere is often value in serving them the entire corpus of posts available, even if not ostensibly complete, if that\u0026rsquo;s what the user has requested — they may be interested in looking at the set in total without a commitment to any particular one of them.\nAs above, caching and versioning can often be deployed to good effect, but the key enabling technique is that a newsfeed architecture allows you some leeway to provide non-up-to-date content for some subset of users without them even realising things are broken. (Of course, there are some who will be very much aware…)\nCompute provision #The above examples are focused on provided content which is, in some sense, already computed: files on disk, posts in databases, etc. Things for which the request/response mapping is inherently clear, and the work has already been done. Another set of services are dynamic compute provision services, where the answer is not to ship something static out to the network as quickly as possible, but instead to dynamically calculate it. 
Examples include: ML model questions and answers, fractal image calculation, lambdas/functions-as-a-service, and so on.\nThis is a significant change from the above, creating both constraints and opportunities. For (say) fractal image creation, if the local compute function that performs the calculation fails, it\u0026rsquo;s possible to retry on-site or even potentially off-site with different enough chances of failure to make it worth trying. If it succeeds, then the graceful degradation has merely come at the expense of latency, though with a large enough delay, some users just abandon the result before it\u0026rsquo;s provided. Similar opportunities for providing lower-resolution or cached images also exist.\nFor ML model answers, it might be that the user has a strong preference for e.g. an answer from GPT-5, but if Claude is available then perhaps that answer would be good enough for them. This is similar to certain high-availability architectures where you send a request to a number of back-ends at once, and the first one to respond wins; in this case you could imagine some matrix of response quality, latency and availability which might go towards selecting the right response, though of course it necessarily leads to wasted compute. Arguably anything other than the top-ranked result, should everything work correctly, would be a graceful degradation. Again caching techniques can potentially be useful here, but the wider your query-space is, the higher the risk that the questions won\u0026rsquo;t display the kind of power-law distribution that makes caching very relevant.\nSometimes this is combined with static content provision: for example, searching and ranking, where the dynamic computation piece is to return a set of documents ordered in some way which is (generally) dynamically calculated, but the content itself is static. 
Caching can often work wonders here, but if your index is unavailable, a graceful degradation might be to provide some set of the documents which are known to contain the tokens in question as a consequence of previously cached results.\nIn general, dynamic compute provision can extend the variety of substitute results which can be provided, at the expense of other aspects of the user experience.\nSLOs and graceful degradation? #Now to consider the question of measurement.\nIf we consider a simple HTTP request/response service, the classic approach of dividing the successful results by the total requests has some difficulties in the context of graceful degradation. Those difficulties can be summarised as: what do we consider a successful request?\nSuppose, in the image server case above, we cannot find the actual image but provide a cached version (maybe at some essentially negligible latency cost). Is that a success? Well, the user got what they wanted, which is good, but we have an image missing on disk, which is bad. Should we include it as a success or not? If we don\u0026rsquo;t, we\u0026rsquo;re potentially missing SLOs where the user population has actually had a totally fine user experience, which is contrary to the point of SLOs; if we do include them as successes, we\u0026rsquo;re operating in ignorance of key files missing on disk, and at some point when we run out of cache, we could have a nose-diving user experience.\nFurthermore, suppose we decided that we would include cached results as successes, but then we \u0026ldquo;upgrade\u0026rdquo; the cache system and now it operates about 50% more slowly. Do we still count them as successes, even though we\u0026rsquo;re now affecting web performance stats such as LCP (Largest Contentful Paint) and the user experience generally? 
Or one popular image is much slower while the rest are fine, and so on and so forth.\nYou can see where I\u0026rsquo;m coming from — if there\u0026rsquo;s a set of criteria that would allow you to include something as a success, you can probably find a situation relevant to graceful degradation where a particular result could be argued to be on either side of the boundary for inclusion/exclusion.\nThe key problem to avoid is having different parts of your SLO system/pipeline/organisational architecture make different decisions about the semantics. That way lies perdition. Make the same decisions across your org, or centralise the decision-making, or you\u0026rsquo;ll be unable to agree on what the user experience is or should be. Bad news.\nIn general, though, there are two approaches to handling graceful degradation in SLO calculations.\nBusiness-as-usual. The first is as outlined above: to find a set of criteria allowing you to keep some set of the GD responses within the \u0026ldquo;normal\u0026rdquo; framework. Precisely how is obviously domain-specific. Handling caching is probably the easiest of these — it could also fit naturally within an overall latency SLO for your systems.\nSeparate-by-design. The second is to absent all GD responses from the usual calculations, and to essentially pretend it\u0026rsquo;s an entirely separate serving system with its own goals, measurements, and (indeed) even SLOs. Again the details of this are very domain-specific, but the notable thing about GD responses is that typically, by the time you are engaging them, something has already gone wrong — so your time budget for responding is correspondingly constrained.\nOverall, we would argue that, if the range of potential degradations is large and there is a significant infrastructure around them, keeping separate SLOs for those systems probably makes sense, though the SLOs themselves are internally-facing. (Of course, observability for those systems is required to do so.) 
But it doesn\u0026rsquo;t mean separate SLOs for the user experience — after all, they are declarations of what you want the user experience to be, and the GD mechanisms are what help you to maintain that target.\n","date":"9 April 2024","permalink":"https://non-functional.net/posts/2024-04-09-graceful-degradation-and-slos/","section":"Posts","summary":"When you can\u0026rsquo;t serve exactly what the user wanted, how should graceful degradation count against your SLOs? Two approaches, and the key mistake to avoid.","title":"Graceful Degradation and SLOs"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/slos/","section":"Tags","summary":"","title":"SLOs"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/google/","section":"Tags","summary":"","title":"Google"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/models/","section":"Tags","summary":"","title":"Models"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/organizational/","section":"Tags","summary":"","title":"Organizational"},{"content":"(This is a repost of a document living here, but I am putting it here for backup\u0026rsquo;s sake. Originally a joint effort with Murali Suriar, with input from Matt Brown, Liz Fong-Jones, and many others. The intended audience of this doc is the recently laid-off, or those who suspect they are shortly to be, though a number of others have found it useful outside of that context.)\nIntro #We\u0026rsquo;re very sorry to hear you\u0026rsquo;ve been affected, or think you might be, by this regrettable action on the behalf of your previous/current employer. Despite this, there is good news, which we\u0026rsquo;ll talk about shortly. There\u0026rsquo;s also bad news. (It wouldn\u0026rsquo;t be the 2020s without it.) There\u0026rsquo;s also some unexpected news! We\u0026rsquo;ll provide some ways to think about further opportunities, and some resources to help orient you. 
Of course, all of these are our opinions, but your mileage may vary, yadda yadda yadda.\nThe main takeaways, if you read nothing else, are:\nWorking outside of Google SRE is more of a culture shift than a tooling shift, though it is both. The cultural shift is harder. The tooling shift is hard, but tractable, and you can use some approaches you\u0026rsquo;re already familiar with to make progress. What SRE is called and what it means in the outside world varies a lot. You can\u0026rsquo;t just read the text of a job advertisement and expect to understand what\u0026rsquo;s required. In some ways, the closest analogue in the real world might be staff engineer. The Good #Ok, so. Some good news. There are things out there in the world that you would recognise as being SRE. Niall has a story of seeing folks in a Microsoft SRE team arguing over the precise semantics of a HTTP return code in a PR (==CL), and knowing that this was SRE-nature. The team had operational struggles for sure, but worked steadily towards solving them with software approaches, and made progress in doing so.\nIt is also true that SRE has a massively positive reputation in many circles, with commensurate pay. It may or may not be FAANG-level pay, but it will be a net increase on many jobs otherwise classed as operational. The SRE book continues to sell, and it is part of an industry-wide conversation. As a result, it\u0026rsquo;s a healthy profession, in terms of people engaging with it, trying to push it forward, and it has people entering it and leaving it (both of which are necessary for a healthy profession).\nNet-net: there are jobs available and you have a reasonable prospect of exchanging labour for money.\nThe Bad #You cannot take all of your habits of work and expect to successfully transplant them to another company unchanged. 
Google is still a cultural outlier in many respects, even within valley companies (though that is changing as we write): it cares about reliability more than most places, and measures it. Neither of those is a default behaviour elsewhere.\nAnother large difference is the attitude towards shared infrastructure. We\u0026rsquo;ve not seen anything like Google\u0026rsquo;s commitment to shared infra and company-wide engineering solutions in other companies. Googlers mutter darkly about a long tail of narrowly scoped solutions to particular problems (particularly prevalent in the Ads space, for example) and the consequent cognitive load, but ignore the fact that Borg, Colossus, Chubby, and so on either have almost completely uniform adoption for the general use case, or are \u0026ldquo;market leaders\u0026rdquo; (70%+) within their segment. But this isn\u0026rsquo;t an observation about technical challenges, design problems, etc, all of which are relevant; it is an observation that in other companies, internal infrastructure stops at the business unit boundary. Niall has a vivid memory of something one VP is supposed to have said about another when the prospect of adopting a system across their orgs arose: \u0026ldquo;Why would I take a dependency on him?\u0026rdquo;\nBut don\u0026rsquo;t view that exclusively through the lens of \u0026ldquo;it is not sufficient that I succeed; others must fail\u0026rdquo;. You will find that the rest of the world values autonomy at the team, org, and business unit level much more strongly. In Google, confidence in success was strongly tied to using Google software to solve the large, complicated problems Google suffers from; elsewhere, they don\u0026rsquo;t care how it gets solved, only that it gets solved. 
To that end, it is viewed as a positive thing that teams get the autonomy to have whatever implementation they want behind the scenes.\n\u0026ldquo;Googlers pick up after themselves.\u0026rdquo; That was a saying back in the day. Whether or not it remains true, the interpretation we remember was that Googlers would take the time to try to fix something they touched, even if it wasn\u0026rsquo;t theirs. Whether it was a shared commitment to quality, or the unspoken idea that everyone had to perform being slightly better in front of others, it\u0026rsquo;s hard to be sure. But most other business cultures are hurry cultures, and don\u0026rsquo;t have time for this kind of thing.\n\u0026ldquo;Toil is the job\u0026rdquo; — as Narayan says, Googlers often have difficulty in understanding that in other companies, toil is the job — and there are a wide variety of situations in which that\u0026rsquo;s actually okay. In essence, be careful before you instantly dismiss the current situation.\nAnd finally, tools and tooling. But more on that shortly.\nThe Unexpected #It is true that there are obvious things to be missed, and you will miss them. Perhaps the largest will be the entire devinfra ecosystem (Piper, Blaze, Forge, TAP, Rapid, MPM, …). There are partial replacements for most if not all of these, but the huge value brought by coherent integration is lost, and that will be continual friction — the PR process on GH is not an effective substitute, though there can also be a huge benefit to not living within a giant monorepo with changes that affect you, but you have nothing to do with flying past.\nConversely, there will be things you miss that will surprise you. We found the largest one of these was the entire AAA suite: LOAS/Ganpati/LsAclRep/RpcSecurityPolicy et cetera. It is unsurprising they are missing in the outside world, since the combination of homogeneity of environment and NIH-spirit doesn\u0026rsquo;t really apply anywhere else. 
But we strongly miss the ability to look at what access the team-mate beside you has by looking at a small set of tools, duplicating that, and getting on with your day. Or even providing patches to a tool to compare what you have versus what you should have. There\u0026rsquo;s just no equivalent that we\u0026rsquo;re aware of, and the cloud provider IAM systems are all gigantic tire fires no-one is coming to put out. Prepare for a future where finding out what you have access to, why, and why not, is an exercise in effort and determination.\nA crucial nuance here, which also feeds into the cultural discussion, is that many tools (and hence related or adjacent processes) do not support delegation. The system is inherently designed to centralise power, or be administered by some nominated subset. Whether this is about power, trust, or process inertia, it is a real and observable effect. One of us has a tale related to the MS \u0026ldquo;stack\u0026rdquo; — a particular office which used this stack had over a thousand people in it, but there was no way to mail everyone in it easily, since they all had different off-site reporting chains, and the way Active Directory does mailing lists meant that only the root of the tree could be emailed. Both a self-selected email list which you could just subscribe to, or a set of internal tools which would allow you to deterministically select the right people to email, were unknown science. In a very real way, not only does Conway\u0026rsquo;s law mean your software reflects your organisation, but how the software does communication affects how the org works too. You will miss Buganizer\u0026rsquo;s component hierarchies.\nA related point is that enterprises typically arrange workstreams around ticket queues, and those queues often have inexplicable configuration or behaviour. 
Damon Edwards has many great talks on this, with recommendations on what to do, of which the main is: work horizontally and in classic SRE fashion tip-toe around the impediment until you get the work done — then surface it in the systems of record nonetheless, because that\u0026rsquo;s the right thing to do.\nThings You\u0026rsquo;ll Have to Wrestle With — Identity #What \u0026ldquo;SRE\u0026rdquo; is used for varies throughout the industry. Hell, it even varied within Google, but it particularly varies outside. So if your identity is tightly coupled with the term, and what you believe it means, there will be a conflict at some point. There are many well-documented reasons to work on decoupling your identity and your professional existence, though the totalizing environment of Google would never have made that easy.\nHere are a few things that the term SRE means in the outside world in our personal experience:\nExpensive and good at on-call Distributed systems consultant Platform engineer Rebranded ops group member As we\u0026rsquo;ve stated elsewhere, you\u0026rsquo;re not necessarily going to be able to determine what is involved in a particular role from the text, or even from the interview process (although you may get some clues). A key component is obviously what they\u0026rsquo;re actually hiring you to do, but that might not even be clear to them, never mind you. (We\u0026rsquo;ll explore this, and the question of providing value both generally, and in the context of an SRE effort, in the next section).\nBut it\u0026rsquo;s important to make this next point, since it is so related to questions of identity: Google culture presumes value from engineering activities first, and customer or product oriented ones afterwards. SRE in particular was extremely engineering-led. 
That\u0026rsquo;s not necessarily bad (though it does create some interesting dynamics), but it does turn out that other companies have different cultures: sales-led, product-led, customer-led, and so on. Part of handling the change in identity for yourself, is realising that the company you\u0026rsquo;re in has a different identity too, particularly around how they understand value.\nA Special Note About On-Call #Though many places equate SRE with ops and/or on-call, do not expect on-call compensation without negotiation, and possibly not even then. (As always, your point of maximum leverage is before you sign a contract.) We have not come across anywhere of a significant size in the industry that has as consistent and as fair on-call compensation policies as Google does, and the default is very much that it is perceived as part of the role.\nThings You\u0026rsquo;ll Have to Wrestle With — Providing Value #To develop that theme of understanding and providing value, we note that SRE in the outside world is more usefully treated as a toolbox, rather than a doctrine. Which is to say that there are various bits of it which are tools that can be applied to solve problems, and some of the tools are appropriate to the situation and some of them are not. The major question is, which of those tools and practices will work in your new environment? What are the most urgent problems? What are the most perceived-to-be-valuable problems? (Note the careful distinction between those two; they may of course be different, but it is particularly necessary in non-engineering cultures to find out what people think about their problems, since solving a problem that no-one believes is a problem is a clear social signal you think you\u0026rsquo;re different, and can ignore everyone else. 
Much better to engage in persuasion first: not being doctrinaire about things is precisely the behaviour you yourself would want to see in any new team members.)\nFurthermore, how you provide value is of course partially dependent on what you\u0026rsquo;ve been hired as. There are a fair number of companies still experimenting with this SRE thing, and still interested in hiring someone with deep background to help drive change in the company. Experience suggests that it is a two-year journey to implement SRE. You might be hired as the vanguard or the hindmost, but the key point is that the benefits don\u0026rsquo;t tend to be felt until 18–24 months in. Many organisations struggle with things which are hard to do and don\u0026rsquo;t provide any value for a long while, so you need to prepare for that. Even if you notionally have support from senior management and executive leadership, you will need to convince people at the coal face of the benefits of change, one by one, so getting acquainted with how people do that at your new company is key. Some cultures are deck-driven, some doc-driven, some person-driven. Figure out which (possibly by trying all approaches) and then work with that until it\u0026rsquo;s safe to do otherwise.\nA Suggestion for Where to Start #If you have been hired as SRE, and maybe there\u0026rsquo;s a complex landscape in which you\u0026rsquo;re not sure how to provide value, our suggestion is to start by learning how to help people beyond the traditional \u0026ldquo;SRE\u0026rdquo; approach to things. In particular, devinfra / releng is not a bad place to start: no one ever complains about faster builds/less flaky tests, and improving time-to-prod-from-change is generally a metric which is understood and valued, even if it\u0026rsquo;s not maintained. 
It has the further advantage of keeping you firmly on the software side of the house rather than the pure operational side, and can make you a lot of friends amongst the product development community. The one downside is that it can be seen as helping others out a little too much, but it\u0026rsquo;s usually possible to excuse that early on (if you happen to be in a culture which cares about that).\nYou may be struggling with more fundamental questions than that, though.\nA Note About Economic Tradeoffs and the Time to Do Something Right #This is particularly true for startups but also true for non-FAANG companies: in any company without billions of dollars in the bank, the economic circumstances constrain the acceptable solution space. Such constraints are actually valid. Your Google time has taught you that you should do the high-quality thing, even if it takes longer, because the complexity of the internal environment means 80/20 solutions are hard to find, and everyone\u0026rsquo;s pretty used to OKRs getting bumped quarter after quarter. This is different elsewhere. Often there are 80/20 solutions available, and often they seem pretty janky to your engineering \u0026ldquo;taste\u0026rdquo;, but they\u0026rsquo;re perfectly functional and get it done. Finding where that balance is, and where it can be for your new environment, is a necessary step. Do not use your accustomed pace of development/deployment in Google as a guideline for your new company.\nThe above goes double for a startup. A commenter says:\n\u0026ldquo;I was asked to build something that\u0026rsquo;d let us run long-running jobs in a better way than having someone SSH into a machine in prod, and running something from the command line. 
We had people doing database migrations that failed, because after an hour, they closed their laptop.\n\u0026ldquo;I initially specced a month to get a K8s cluster up \u0026amp; running, with a simple deployment interface, that\u0026rsquo;d let people submit a slug to be deployed \u0026amp; run for a bounded time. People laughed at me. Two more design iterations, and I\u0026rsquo;d a Ruby on Rails URL that\u0026rsquo;d take a command to run, and ran it under TMUX with logging to stdout, so if you hit that URL later, you got the logs. Took three hours to implement, 3 days to design.\u0026rdquo;\n(From the SRE perspective, the difference the company gets between you implementing something and a dev doing it would be to leave monitoring \u0026amp; inspectability in whatever you\u0026rsquo;re doing. There are very many solutions that don\u0026rsquo;t, and the amount of time that\u0026rsquo;s wasted as a result is tremendous.)\nHow to Provide Value? #One common question we see is, \u0026ldquo;I\u0026rsquo;ve only done GCL, gmon \u0026amp; GSLB for five years… what do I do?\u0026rdquo;. The good news is, there\u0026rsquo;s actually a fairly large set of tools which have at least some mapping into the Google world, and there\u0026rsquo;s more coming for sure. The number one analogue today would be Prometheus/Borgmon, but there are others. Whether you have a direct analogue or not, expect to do a lot of learning, quickly. (External tools generally feel less sophisticated, but this is usually because of the integrated nature of the environment (as discussed previously). They are also less resilient on average.)\n\u0026ldquo;I\u0026rsquo;ve only ever done X in the Google way!\u0026rdquo; That\u0026rsquo;s fine. Outcome equivalents often exist. Where they don\u0026rsquo;t, again you are in that special place where you are trying to figure out what people care about and how to make it happen. 
There\u0026rsquo;s nothing about that work which admits of shortcuts.\nHow to Get Value for Yourself? #You will appreciate that we spend a lot of time talking about the downsides of working in the external world, and it\u0026rsquo;s true that the historical eng practices \u0026amp; supportive culture of Google are, or were, fantastic, but you do yourself a great disservice (and are also being somewhat patronising) if you think of this as only being a downgrade. There were many things that G was terrible at. Outside observers would point at product execution generally, for which the Chat app debacle is a constant and searing reminder of just how bad it has been for years, and the general point of understanding users. Insiders with a little experience of other places would also say that G had many troubles innovating at great speed in situations where there weren\u0026rsquo;t clear numbers. There are plenty of things which are done better outside Google, and one fact of providing value to a company is to understand what they already do well, and learn that, so you can be at least as good as the others.\nHow to Help People #Erase from your vocabulary the phrase: \u0026ldquo;At Google we …\u0026rdquo;. Nothing more effectively signals a separate tribe, and nothing makes it easier to ignore you. Instead, listen to people, understand their problems and challenges, and use first principles explanations to demonstrate you\u0026rsquo;ve listened and understood. Then and only then suggest solutions, and use those solutions to teach different approaches, where appropriate. If you must use a phrase to directly reference a Google approach, try \u0026ldquo;In the past, when trying to achieve … I\u0026rsquo;ve found … to work well.\u0026rdquo;\nBe humble. Humility is endless, relatively cheap, and makes friends.\nBuilding a New Normal Elsewhere #Many things you just assume to be universally true aren\u0026rsquo;t. 
You will have to spend time and effort making them so by modelling good behaviour.\nSome key things to look at:\nEverything should be in version control, code reviewed, have tests, tests should always be passing. This seems basic, but isn\u0026rsquo;t (in fairness, most places have some of this, but not all of it.) Here\u0026rsquo;s how to respond to a page, how to manage an incident, how to react afterwards. There are a number of free frameworks around (for example, the Jeli post-incident guide) or you can reinvent what you know. Depending on what stack you use and how things get done: try to move to intent-based \u0026ldquo;stuff\u0026rdquo; as soon as possible, since that aligns so well with techniques we know are good for reliability. Don\u0026rsquo;t make people click buttons to provision new things. The above are likely the most important things you can do, long term, in terms of enabling other people and supporting reliability, but… such work may not be directly rewarded by management, who might be more concerned by surviving till Tuesday than fixing things next quarter. (They also might be right.) So we recommend you have that chat first, before you invest too much time and effort.\nOther Things to Note #Staff Engineers #When we look at the array of responsibilities and work and skills that a moderately experienced Google SRE does, we find ourselves thinking that the kind of cross-cutting, whole-system vision work done has a much closer analogue in the outside world that\u0026rsquo;s a world away from what the term SRE means — and that is, Staff Engineer. If you find yourself interviewing in a bunch of places and getting work that doesn\u0026rsquo;t sound like it makes full use of your skills, try going for that kind of role instead and see if it changes anything.\nInterviewing in the Outside World Generally #Some people have asked us if interviewing is different externally. 
Yes, though the general tech model of \u0026ldquo;N\u0026rdquo; interview slots in different focus areas, often concentrating on demonstrating a set of technical skills, is very widespread. There are two major ways it differs. Firstly, the rest of the world is a lot more about \u0026ldquo;tell me about a time when\u0026rdquo; rather than \u0026ldquo;do this on the whiteboard right now\u0026rdquo;. So preparing the answers to those kinds of questions (\u0026ldquo;what was your most complex problem\u0026rdquo;, \u0026ldquo;tell me about a time you failed to achieve something that mattered\u0026rdquo;, yadda yadda yadda) is important, otherwise you\u0026rsquo;re improvising in the interview, which can go awry. Secondly, G SRE folks are often at a loss to describe how tooling in the outside world works, and so you might find yourself saying that you\u0026rsquo;re sorry, but you don\u0026rsquo;t know that particular tool well. If that happens, pivot the conversation to a tool or approach you do know, or start talking about the underlying, more abstract problems (and potential solutions), in order to give the interviewer a signal that you do understand the general space, even if you happen to have been using mpm for the past while, rather than npm.\nIn terms of things to learn, if you\u0026rsquo;re literally starting from a homogeneous Google environment, consider starting with the following:\nKubernetes, Auth0 or Okta, GitHub (\u0026amp; GitHub Actions), and Prometheus. It is not that you are likely to get internals questions about how any of the above work. 
Instead, learning those will give you a plausible real-world answer you can supply to any of the typical \u0026ldquo;design a system that\u0026rdquo; questions, and help to kickstart your own development process.\nGood luck!\nResources # Tooling mapping Life after the chocolate factory Care and feeding of SRE Impact for the impatient The SLO book State of DevOps report What SRE is not Clan culture ","date":"4 March 2023","permalink":"https://non-functional.net/posts/2023-03-04-sre-in-the-real-world/","section":"Posts","summary":"A guide for Google SREs navigating life outside Google — the cultural shock is harder than the tooling shift, and here\u0026rsquo;s how to think about it.","title":"SRE in the Real World"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/sre-identity/","section":"Tags","summary":"","title":"Sre-Identity"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/legibility/","section":"Tags","summary":"","title":"Legibility"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/okrs/","section":"Tags","summary":"","title":"Okrs"},{"content":"[A repost for reference, since the original was removed as part of house-cleaning elsewhere]\nI\u0026rsquo;m solidly in favour of a planning architecture of some kind for any team-size collection of people greater than about 5. (Hell, arguably above 2, but let\u0026rsquo;s keep overheads down.)\nI\u0026rsquo;ve had the opportunity to try OKRs across a number of teams and companies, and I\u0026rsquo;ve found them useful overall. Those of us who\u0026rsquo;ve been in large companies or planning-focused small companies will have encountered a number of approaches to planning before. 
OKRs take their place amongst this number, I think; neither clearly the best (but absolutely not the worst), they are great for certain kinds of things but not for others.\nLet me be more precise.\nThere are a couple of things I like about them:\nOKRs are easy to get started with and don\u0026rsquo;t require huge expertise. The kind of gatekeeping behaviour that surrounds them (there is some, of course) is not as impenetrable as some other industry practices, IMHO. It\u0026rsquo;s easy to sling some stuff together and get started. No necessity to get yourself consulted-up to get going. The focus is supposed to be on your team\u0026rsquo;s plans, not the planning infrastructure. It\u0026rsquo;s easy for them to remind you about what\u0026rsquo;s important. They are intended to be a high-level steering mechanism, and as such, they surface concerns about direction, mutual co-operation, etc., which generally don\u0026rsquo;t happen in the context of an individual team. It\u0026rsquo;s a hugely useful spur to thinking like an owner. I found it naturally moved the direction of my gaze up from my shoelaces (the minutiae) to something a bit higher-level (the business), which helped in turn to focus effort on thinking about what goals should be and how they are phrased to make them SMART (measurable, etc.). Insular, siloed team thinking is the default, and this is one of the more effective ways to try to short-circuit that pattern. Tooling isn\u0026rsquo;t hugely important, but good effort gets good results. The internal Google tooling around OKRs made it trivially easy to look at other teams\u0026rsquo; stuff and figure out if there was overlap/clash/areas of mutual support/etc., which was great for cross-company visibility. However, you didn\u0026rsquo;t need it, and a bunch of spreadsheets or items in Azure DevOps is also a perfectly tractable approach. Whatever works for you. Expecting to fail is hugely valuable. 
Common practice at the time inside Google (there are some disturbing assertions this is no longer the case) was that you should expect to score 0.7 out of 1.0 on your OKRs \u0026ldquo;on average\u0026rdquo;. Whether or not the team ended up hitting that, setting the expectation helped to do a number of valuable things:\nFrame ambition. When I first joined Amazon, the message I got was \u0026ldquo;Expect to fail, because if you\u0026rsquo;re doing your job right, you\u0026rsquo;re doing new and/or difficult things no-one else is doing, and you\u0026rsquo;ll probably fail plenty of times. That\u0026rsquo;s ok.\u0026rdquo; It\u0026rsquo;s actually very hard to hear that message properly, particularly if you are an insecure junior engineer. Every emotional instinct was to shield myself and stay safe, and I don\u0026rsquo;t imagine I was the only one. But the explicit permission and expectation of failure did in fact (eventually!) provide sufficient cover to start being ambitious about things.\nThe 70% message that Google pushed was a pretty good analogue of the Amazonian one, though less explicit: it was an acknowledgement that some error was allowed, which created permission for trying stuff out.\nForced success is early death. The converse culture, where everything must be a total success continually, is cultural death. In these kinds of cultural environments, incentives switch to hiding things and lying, since it becomes more problematic to admit failure. Not only that, but you\u0026rsquo;re the only one who experiments and everyone else stays safe — well, you sure look out of whack and all your metrics will be too. 
Planning and related presentations turn into a game of picking off the outliers; those who distinguished themselves by trying hard and failing.\nIt\u0026rsquo;s hard to construct a genuine set of ambitions in that kind of environment, so innovation gets bred out.\nThere are a couple of things I didn\u0026rsquo;t like about OKRs, however:\nRitual drains meaning. As with every serious planning architecture, there\u0026rsquo;s a lot of ritual around OKR planning sessions. With a quarterly cadence and mid-quarter check-in, there aren\u0026rsquo;t actually that many weeks which are free of contact with OKRs. It can become a ritual, and ritual can drain meaning. Engineers in general preferred to opt out of ritual-like behaviour, and therefore a number of the serious discussions about prioritisation and so on. You could encourage people but forcing them ran the risk of turning them off the whole thing permanently. However, it was definitely possible to have a whole team excited by OKRs — and that was wonderful when it happened. The famous \u0026ldquo;50% project time\u0026rdquo;. OKRs were an okay-ish fit for how SRE did project work, which in general suffered from interrupt-driven/production fire problems. Though many column inches have been spilled on the nature of SRE project work and the 50% boundary, my lived experience was that mostly, SRE project work was kinda like batch scheduling: it\u0026rsquo;d get done, eventually. Your big hope was that it would still be relevant by the time it was done. A lot of the time it was. It\u0026rsquo;s possible another framework, or something on a shorter timeframe, would help more with that. Policy versus implementation. There were arguments that never converged about what level of goal was most appropriate for a team, and the interplay between the usefulness of a high-level goal and specifying its implementation as concretely as possible. This ties into questions about how to do OKRs correctly. Let me start with an example. 
\u0026ldquo;Keep the site up\u0026rdquo;, or \u0026ldquo;Make X more reliable\u0026rdquo;, versus \u0026ldquo;Enumerate top 5 sources of known outages in the trailing quarter and eliminate their root cause(s)/contributing factors\u0026rdquo;. In theory, the Objective is \u0026ldquo;keep the site up\u0026rdquo;, and the KR is \u0026ldquo;enumerate the outages\u0026rdquo;. However, lots of things could fit under such an objective, and my personal experience was that it was not a matter of fully developed consensus best practice that everyone knew what went at O-level, and what went at KR-level. There were also questions about O-grouping and what fitted under what objective, as well as prioritisation.\nPrioritisation was another issue, of course; P0s got done except if there was a force majeure situation, and P1s generally (well over half the time) got done. But most teams didn\u0026rsquo;t get to all of their P2s, and only those P3s with serious personal investment from an engineer typically saw effort — and even that usually meant a P0 or P1 was getting starved.\nOverall, I thought OKRs, though responsible for a certain amount of ritual and non-productive argument, were a pretty decent way of actually steering a pretty complicated ship. They scaled well, set good expectations, and helped remind you what was important.\nBut — and it is a big but — that partially relied on a set of cultural behaviours which were not 100% to do with OKRs — the habit of ambition, the inclination to forgiveness, and cross-team accommodation for sure — and if those behaviours are absent or change, the usefulness of OKRs can absolutely decrease.\nFor example, if in fact, as this tweet suggests, OKR completion rates are now part of performance ratings, well, that is a very different framing. The theory says you shouldn\u0026rsquo;t do it, because then the incentives become about hiding, safety, and so on. 
It comes across like attempting to increase \u0026ldquo;legibility\u0026rdquo; at the expense of effectiveness.\nI think of it this way: as they say in another galaxy, far far away, the more you tighten your grip, the more [the objectives] will slip through your fingers. Ultimately, great things are done by:\nGiving people freedom and autonomy. Encouraging them to reflect, and take feedback… …in a safe environment. Steer, not control.\n","date":"4 January 2023","permalink":"https://non-functional.net/posts/2023-01-04-on-okrs/","section":"Posts","summary":"OKRs: neither the best nor the worst planning architecture, but great for certain things — and dependent on cultural behaviours that have nothing to do with OKRs themselves.","title":"On OKRs"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/planning/","section":"Tags","summary":"","title":"Planning"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/future/","section":"Tags","summary":"","title":"Future"},{"content":"[This post was originally published in USENIX\u0026rsquo;s ;login: magazine, and is based on my keynote at SRECon EMEA 2021. For those already familiar with those, nothing is meaningfully different here. Readers not interested in minutiae about SRE, and the overall project of making production systems better, may wish to skip this one.]\nIt\u0026rsquo;s natural, in the swirling chaos of the past few years, to take a step back and wonder just where everything is going. Though I\u0026rsquo;ve certainly been doing my fair share of that, I\u0026rsquo;ve also been thinking about defining precisely where things are. Answering that question for Site Reliability Engineering - SRE - is not easy.\nOn the one hand, SRE is inarguably incredibly successful: it\u0026rsquo;s a discipline, a job role, and a set of practices that have revolutionised a good portion of the tech industry. 
Its practitioners are strongly in demand in every sector from financial services to pizza delivery and beyond. The whole community has played a part in changing how the industry thinks about managing online services, developing dependable software, and helping businesses give their customers what they want more quickly, more cheaply, and more reliably.\nOn the other hand, I\u0026rsquo;ve grown increasingly unsettled as I\u0026rsquo;ve moved from being enmeshed in the day-to-day minutiae of executing our work, to thinking about what that work is actually based on: and why — or if — it works. Unfortunately, I\u0026rsquo;ve come to the conclusion we have much less underpinning our ideals than we had assumed. In the words of Benoit Blanc in the 2019 movie Knives Out, SRE is a doughnut hole in a doughnut hole, and the hole is not at the periphery, but at the centre.\nToday, I believe we cannot successfully answer several key questions about SRE. Let\u0026rsquo;s start with the most important one.\nHow can we thoroughly understand what kind of reliability customers want and need? #Can we provide a model for customer behaviour during and after an outage, give upper and lower bounds on return or abandonment rates, or otherwise provide a pseudo-mechanistic model for loss? Is there any way we can compare reliability, or lack thereof, between services? Can we say, for example, that Roblox going offline for a long weekend is better or worse than AWS going partially offline for two hours?\n💡 By \u0026ldquo;models\u0026rdquo;, I mean not just our mathematical treatments of things (few and far between, and generally lacking predictive capacity), but also our mental models — how we think of, and understand the appropriate response to a situation.\nHow do we value reliability work, especially with respect to other conflicting priorities? 
#Specifically, are there any useful guidelines to help us understand when reliability work should be preferred over other work, or a basis to understand what proportion of continual effort should be spent on it? What do we get for this work, as opposed to other work?\nDo we fundamentally misunderstand outages and incident response? Is there any prospect of improving our models for them? #I don\u0026rsquo;t just mean \u0026ldquo;can a team or individual get better at incident response\u0026rdquo; — to which the answer is of course yes - but the larger question of whether or not we are understanding incident response the right way.\nFor example, incident responders and leadership often have a very different understanding of what\u0026rsquo;s going on. One of the main differences is the question of whether or not incidents represent exceptional behaviour. As best we know, right now, for a sufficiently complicated distributed system undergoing change, incident occurrence is unbounded. This is not widely understood at leadership level, and reconciling these views is critical. Yet a larger question beckons: though modelling unknown-unknowns is well understood to be impossible in principle, is there any intermediate result of use? For example, even if we can\u0026rsquo;t model exceptional behaviour, there are approaches (such as Charles Perrow\u0026rsquo;s Normal Accidents theory [1]), which can help us understand why it occurs. Is this area entirely resistant to analysis in general, or modelling with numbers?\nAre SLOs really the right model for every system management challenge? #SLOs (Service Level Objectives) — clearly articulated numeric targets for reliability of production systems — are described in many articles and books, but most definitively in Alex Hidalgo\u0026rsquo;s book Implementing Service Level Objectives [2]. 
SLOs have quickly moved to the heart of how the profession does day-to-day work, since it is hard to make something reliable to degree X until you decide what degree X is, and how to represent it. However, the underlying conceptual model copes poorly with situations where success cannot be adequately represented as a boolean, and where there is no business reason to prefer one \u0026ldquo;slow burn\u0026rdquo; alert target over another. Yet SLOs remain critical as the basis for many SRE interventions. How can SLOs be improved? Or is there a more fundamental problem underneath them?\nHow can we best move beyond the \u0026lsquo;SRE book\u0026rsquo;? #Site Reliability Engineering: How Google Runs Production Systems was published in 2016 (edited by this author along with Betsy Beyer, Chris Jones and Jennifer Petoff) [3]. The \u0026lsquo;SRE book\u0026rsquo; has provided many helpful conceptual frameworks for practitioners and leaders to aspire towards. But more than five years on from the publication of the original volume, it\u0026rsquo;s time to re-evaluate what it says and recommends in the light of experience with those models in other environments and other contexts. How can we best do that, in a profession which is notoriously practical and focused on pragmatic action, not research?\nIf you do nothing other than keep your mind open about the necessity of tackling the above, I will have achieved my goal. But if you need more persuasion, or want to know where best to help, read on.\nThe Character and Value of Reliability #Though reliability is in the very name of the profession, we have an unforgivable lack of analytic rigour about what it is, why it matters, and specifically how much it matters.\nAt the very foundations, we confuse availability with reliability — or more accurately, we simply don\u0026rsquo;t define what they are. 
The narrowest and commonest definition, that of a correctly responding HTTP request/response service, has some uncertainty itself — for example, whether or not 4xx series response codes represent errors. In one view, the service is correctly responding with a 404 for, say, a particular image file not being present, since a user was just guessing a URL. In another view, the URL may have been generated incorrectly by another part of the system, and the correct one should have been used, so the response is an error. This is related to how we can know whether or not a server we\u0026rsquo;re dealing with is working correctly — and somewhat unbelievably, this is not a solved problem, even for protocols invented three decades ago.\nWider definitions of reliability often reference latency, cost of computation/performance, freshness, seasonal performance comparisons, and so on, but no complete list of such attributes exists. Most teams treat this as an exercise in assessing the behaviour of the business logic from first principles instead. Since we don\u0026rsquo;t have a solid framework that relates those attributes, everyone\u0026rsquo;s SLO implementation projects (and, therefore, what their SRE teams do) are going to vary, with consequent cost, and little benefit for that cost.\nAs well as lacking good definitions, we also lack a good understanding of, and ability to communicate, the value of reliability. Why do I believe this? Well, next time you audit talks, read articles, or just go looking around for answers to questions about how much reliability matters, count the number of bald assertions that you see, as opposed to discussions of models. I see a lot of statements like \u0026ldquo;reliability is the most important feature\u0026rdquo;. I do not see anything that looks like a model — either quantitative, mechanistic, or qualitative. 
I ask the reader to consider the following question: if reliability were in fact the single most important thing, would it not win every prioritisation discussion? You, as I do, presumably operate in an environment where this does not happen.\nThe problem is that such assertions are both wrong and excessively defensive. Declaring that \u0026ldquo;X is the most important feature\u0026rdquo;, by its nature prevents us from developing our understanding of what actual trade-offs exist, no matter what X is. It turns us away from developing a more sophisticated understanding. It is anti-model.\nI claim that because the value of reliability is so difficult to calculate formally, its value has therefore become primarily socially constructed. We need to move the question of the value of reliability out of the realm of the incalculable and into something which isn\u0026rsquo;t entirely constructed that way. If we can\u0026rsquo;t explain the fundamentals behind the rationale for the profession outside of solely the social context, we are in serious trouble.\nSawtooth Reliability #The default model of the importance of reliability today is what I\u0026rsquo;ll call a \u0026ldquo;sawtooth\u0026rdquo; or \u0026ldquo;boolean\u0026rdquo; model: it is the most important thing there is, except when we have it — then it doesn\u0026rsquo;t matter at all. Those coming from the ops side of the house know precisely what I\u0026rsquo;m talking about. No outage ongoing, therefore resources are hard to come by, since nothing bad is happening, and who wants to fund a Department of Bad for nothing bad happening? Big outage, and the floodgates are opened — everyone will help, until the outage is over, or the outage is not being resolved quickly enough. Then back to baseline zero we go. 
We never converge on an appropriate steady-state level of investment in reliability, a stable balance between prevention and cure.\nAs frustrating as this refusal to consistently fund reliability is, there are at least two interpretations in which this is rational. The first interpretation is that execs are using outages as a signal to tell them what to spend on (though the spending is generally with people\u0026rsquo;s time, as distinct from money). The second one is that there is no clear competing model to tell us that reliability is valued outside of the context of a total outage, and even then its value is ephemeral. Execs can point to a long series of outages by world-famous companies who do not seem to particularly suffer as a result, and say that the market does not appear to value reliability. Without a competing model, it\u0026rsquo;s hard to oppose that view. In that narrow sense, companies with extremely large and public outages do us all a particular disservice when they neglect or refuse to publish their postmortems, and in particular, their impact assessments, since we are left with nothing to support comparisons between reliability work and feature work. Or to put it another way, as is generally true, a lack of transparency costs the industry, and benefits the end company.\nThe sawtooth model of reliability\u0026rsquo;s value leads to a sawtooth model of investment. In general, such wild gyrations are not associated with stability.\nThe Parable of the Sticky Users #Part of the reason I\u0026rsquo;ve been thinking about this question over the past while was a moment, about four years ago, when I spoke to an engineering leader for a world-wide online brand. He was in town to investigate the possibility of growing an engineering organisation locally, so I asked him about the reliability story for the company in question. In my memory, he looked at me calmly and said:\n\u0026ldquo;We have no reliability story. We don\u0026rsquo;t focus on it. 
We believe that in the business we\u0026rsquo;re in, our users are so sticky that they have no choice but to come back.\u0026rdquo;\nMy initial reaction was \u0026hellip; sceptical. However, as I considered the conversation further, it occurred to me that actually, maybe he was correct! As far as I know, the company suffered no business-disabling outages, is still around — indeed, is still a household name — and, generally speaking, seems to have gotten its engineering tradeoffs right. Does this mean reliability has no value? Probably not. But does it mean it is effective and convincing, in that environment, to assert it is the most important feature? Also probably not.\nTo muddy the waters further, I was later told by another source that reliability work was taking place even if the engineering leader I spoke to wasn\u0026rsquo;t aware of it — it\u0026rsquo;s just that the work wasn\u0026rsquo;t strongly prioritised from the top. Given that I keep coming back to this conversation and its many conflicting implications, I suspect those of you who have had similar ones also do so. Is the null model of investment actually tractable? Is the shadow sawtooth model of investment the actual, real default, even if leadership think it isn\u0026rsquo;t? Could this be better, and at what cost?\nThe Default Model of Reliability #What we do today, when we attempt to characterise and value reliability, is based on a kind of heuristic that is strongly related to experience with request/response systems like HTTP or RPC servers. There is a broad acceptance that there are upper and lower bounds of reliability that matter; few would find a service available less than 90% of the time useful. A corresponding observation applies for 99.999% of the time — few users of consumer-grade technology would notice. 
So most folks cluster around some definition of \u0026ldquo;good enough\u0026rdquo; between 2 and 4.5 9s, and even within that the bulk of teams probably converge somewhere between 3 and 4 9s. When it comes to valuing reliability, if your business context is, say, an e-commerce shop, you\u0026rsquo;ll care about dropped queries because of the direct costs of loss, with similar arguments being made for payment flows, etc. However, if you\u0026rsquo;re running something else, where the direct connection is weaker, it\u0026rsquo;s hard to motivate such detailed approaches.\nThough it is very wide-spread, this intuitive model has a number of problems. Probably the most important one is that the model itself is applied to all sorts of circumstances where it doesn\u0026rsquo;t match. Non-request/response services, for a start — e.g. data pipelines, today increasing greatly in importance because of their involvement in machine learning. Services for organizations with business models other than e-commerce. Services which only produce one report a day, but it\u0026rsquo;s for the CFO and they really care about it. Services where 5 9s might well not be enough for one customer, but it\u0026rsquo;s a multi-tenant infrastructure and everyone else is happy with it — really, none of those match this default model well.\nWe don\u0026rsquo;t have a good way of modifying the default model of reliability in line with business context, and we need to.\nIncident Management #There remain fundamental problems with how we conduct incidents, understand them, and model them. These problems are not just in the conduct of incident resolution, where we might reasonably expect any given team would vary, but in understanding the very nature of incidents, and how to think about them.\nThis is most obvious in the domain of numeric approaches to incident management, where a strong dichotomy exists between executives and practitioners. 
In brief, though executives understand that incidents will occur, being generally business-facing, they are mostly interested in how long it takes to restore service. The main metric is Mean Time To Restore (or repair; also known as MTTR). Execs naturally expect to be able to treat the incident situation like they treat anything else in business – pay attention to a small number of top-line metrics, delegate to the people who seem to know what they\u0026rsquo;re doing, and reallocate resources when excrement hits the fan.\nUnfortunately, Štěpán Davidovič has presented strong evidence to suggest that MTTR not only doesn\u0026rsquo;t work in the way execs expect, it also cannot [4]. Practitioners have internalised that every metric execs care about is either inaccurate, irrelevant, actively harmful, or relying on the wrong model of the world, and generally believe that the \u0026ldquo;everything is a metric\u0026rdquo; methodology is less applicable than execs think it is, and specifically less applicable in the case of incidents.\nThis dichotomy is in and of itself a serious issue, since either the exec contention that incident management can be performed numerically is wrong, and we the practitioners have a phenomenal job of persuasion to do, or the execs are right after all, and we just haven\u0026rsquo;t found the right metrics or model. Neither of those outcomes presents much opportunity for progress: it has long been acknowledged that the market can stay irrational longer than an individual can stay solvent, and I expect that execs as a class will remain committed to the numerical management model I outlined above in the absence of something clearly and verifiably better. 
Qualitative analysis, which is often used in the social sciences to study complex phenomena, holds out some promise of identifying mechanisms and fruitfully exploring some of the socio- component of socio-technical systems, but this is precisely the kind of long-term, non-teleological work that almost every institution hiring SREs is allergic to.\nI spend a lot of time talking about models, but this is not academic. Incidents are real with real effects, of course, and understanding them better is likely to yield significant benefits. The perception that \u0026ldquo;it\u0026rsquo;s just software and doesn\u0026rsquo;t really matter\u0026rdquo;, widespread in the industry in general, is both a) not true, and b) a significant cause of friction to progress. Case a) is trivially not true for cloud providers — for example, Microsoft set up a program [5] to look after key health and physical security industries — but it\u0026rsquo;s also not true, to varying degrees, for other service providers, and it will get worse over time as more and more things become more closely coupled to computing, the cloud, and the Internet in general. Though I accept a number of readers will disagree with me here, I believe the industry being unwilling to directly connect software failures with life-ending or life-jeopardising events has prevented investment in industry-wide progress, and we are more exposed every day.\nUltimately, though, how corporate hierarchies assign and reward value affects how SRE is perceived and how many members get on in their job; in the industry at large, many SRE teams are understood to have value primarily as a function of their incident response capabilities. How that value is understood is crucial to the future of the profession.\nService Level Objectives and Cloud 9 #If you asked the proverbial person-on-the-street about SRE, and they don\u0026rsquo;t say \u0026ldquo;no idea\u0026rdquo;, they\u0026rsquo;re probably saying something about SLOs. 
In my opinion, the major benefit we get from SLOs is the idea that there are things we can successfully ignore — or, to put it more formally, that an organization as a whole should be intentional about what reliability / performance it wants from its systems, other than the completely reactive (and traditional) 100%. It provides a framework for introducing this idea to the business, and reflecting it in a structured way for engineering. It also defangs (or can defang) the relationship between development and operations teams if they\u0026rsquo;re in a hard split model. I can\u0026rsquo;t emphasize enough how much I like it compared to the previous default.\nHowever, there are some important weaknesses in how it works, and what it assumes:\nHow the modelling is done.\nThe underlying architecture for SLOs presumes that successes and failures in a service can be successfully mapped to a boolean, and that these booleans can be put in a ratio (Narayan Desai and Brent Bryan explore some of this in more detail in their SRECon Americas 2022 talk \u0026lsquo;Principled Performance Analytics\u0026rsquo; [6]). Services outside of the classic request/response paradigm are not handled well by this model, though of course there are compensating techniques.\nError budgets.\nAnother rephrasing of SLO-based service management is that you pick a level and stick to it; well, what happens if you don\u0026rsquo;t stick to that level? In the original idea, exceeding the \u0026ldquo;budget\u0026rdquo; leads to an agreement by the product development and SRE team to work primarily on stabilisation work, until the appropriate service level is restored. But what happens if — as I\u0026rsquo;ve seen happen — you blow your error budget for the next twenty years? How to correctly respond in these circumstances is not well understood. 
As a result, error budget implementation should arguably be decoupled from SLO implementation, at least until this is better understood.\nAlert construction and threshold selection.\nOne upside of the SLO-based alerting approach is the ability to treat certain kinds of error as if they don\u0026rsquo;t matter, which is central to being able to avoid just reacting all the time (which in itself is central to being able to afford project work in the production context at all). But there are two issues here: the first and subtlest is that errors which the SLO says you can ignore might, of course, play a bigger role in a future outage, and prioritising those is not within the scope of the framework.\nMore immediately, though, if you are writing alerts in an SLO framework, you typically divide your alerting into fast-burn and slow-burn buckets: one of them designed to capture total (or very close to total) outages, and the other designed to detect slow erosion of the customer experience. The problem here is that we have nothing that allows us to confidently say that the particular line we have chosen for slow-burn alerting is correct. Furthermore, and this is perhaps the worst problem, we don\u0026rsquo;t have a good way of distinguishing between one 100-minute outage and one hundred 1-minute outages. To spread this threshold selection problem across two alerts rather than one is not an improvement.\nAlert automation.\nComparatively few \u0026ldquo;settings\u0026rdquo; in the conventional 9s setup allow for human response (basically, \u0026lt;= 99.95%). If SLO selection were in fact evenly distributed around that boundary, we would see a lot more automation of alert response, and it would be much more widespread and important than it is. So, why don\u0026rsquo;t we see it more often? Is it because we are systematically biased to human response? 
And if that is true, is this because automating responses is hard, or designing systems to avoid the alert conditions in the first place is hard — or both? Either way, there is a gap that warrants investigation.\nOff Cloud 9.\nSpeaking of 9s, it almost seems ridiculous to say this, considering how wide-spread this particular model is within the community - but is counting reliability in 9s an effective method of understanding the user experience? Charity Majors famously said \u0026ldquo;9s don\u0026rsquo;t matter if the users aren\u0026rsquo;t happy\u0026rdquo;, but I\u0026rsquo;m talking less about the question of effectively mapping user happiness to metrics or SLOs - the complexity of which anyone might struggle with - and more about the model as a whole. Why did we settle on a powers-of-ten relationship for capturing and modelling this behaviour? (It clearly wasn\u0026rsquo;t sufficient on its own, because if it was, why did we introduce 5s?) Do we have any confidence this structure maps to a natural behaviour or pattern that matters? Or is it all entirely arbitrary? Facetiously-but-not-really, why not use 8s?\nI\u0026rsquo;m speaking for myself, of course, but every time I look at those charts of allowed unavailability that divide periods of time by fractions of 9s and 5s, I wonder to myself if a human is ever in a position where 4.37 minutes of unreachability won\u0026rsquo;t matter to them, but 4.39 will. There are other concepts that we could work with - for example, suppose that natively, without particular effort, a cloud provider\u0026rsquo;s infrastructure will deliver 98.62% availability - in such a world, having one behaviour on one side of that line and another on the other would make sense because it marks the difference between what the provider will give you without effort, and what you have to work to achieve. 
I have seen no data supporting the contention that 4.38 minutes of unavailability is a critical boundary.\nIs a better model available here? One that was easier to understand, mapped better to real-world experiences, and required less division by 9 would be a good start.\nGetting to SRE 2.0 #Turning the conversation back to models for a moment, the obvious question is what might a better model for reliability look like? I hope I\u0026rsquo;m wrong, but I\u0026rsquo;m not aware of any serious, holistic, work in this area. Instead, what I see is partial hints: edges of the elephant we are blindly and sporadically exploring.\nOne hint, for example, is the series of studies from many online services (such as Google [7] and Zalando [8]) which have shown a relationship between experienced latency and user satisfaction, purchases, and similar. The numbers might vary a little from study to study, but the key result has been reproduced a bunch of times: the higher the latency, the less user satisfaction, the fewer purchases or user journeys, and so on. There should be public work that explores similar relationships for reliability, but across many different services, user populations, job-to-be-done contexts, and so on. Assembling such data will help with filling out the picture more widely. Yes, it is empirical; in my opinion, we can\u0026rsquo;t afford to turn down inspiration from anywhere, and it can all help to prime our intuition.\nAnother hint is Nicole Forsgren, Jez Humble and Gene Kim\u0026rsquo;s Accelerate [9], which presents research which shows that we have a reason, based on data, to believe that stability and rapidity of release to production can in fact go hand in hand. This is counter-intuitive to some, but is very much a real effect, related to small batch sizes stabilising change. 
Of course, some of the risk from changing systems flows not just from the content of the change itself, but the very act of making a change, and how long it has been since your last one. This result isn\u0026rsquo;t necessarily detailed about all potential mutations of production (migrations, network topology, schema changes, etc) but it\u0026rsquo;s suggestive of a way forward.\nWhen I see work like those two examples above, I see work that explores a relationship between things. A relationship is an equation, an area for exploration, and an invitation to examine exciting edge-cases. A number of these relationships, integrated with a strong theoretical vision — underpinned by empiricism — could provide a lot of insight into how we should do systems and software management. It would be the beginning of a kind of biology of systems development and management, or a special case of systems science in the production domain.\nI initially wrote physics, rather than biology, but changed my mind; I want to reassure the reader I have no great plan for a fully determined framework. Even the impulse for one is often misguided: those familiar with the history of mathematics may recall the story of David Hilbert, extremely influential 19th and 20th century mathematician, who proposed a research program to show, once and for all, that the foundations of logic were on solid ground, and that everything can and must be known. Gödel and his famous incompleteness theorem blew an unfillable hole in that shortly afterwards, and limits to knowledge are an important part of how we perceive the world today.\nI\u0026rsquo;ve seen too much of the world to propose such a program - I\u0026rsquo;m just looking for better models. But even if I was, one of the bigger problems is that SRE itself is a very practically oriented profession. 
Neither the profession nor its practitioners tend to have much patience with, or time for, indefinitely long cross-company projects - never mind the complicated and hugely necessary questions of ethics that come with being the stewards of the machinery of society in an era seemingly dominated by polarisation, deceit, and unsustainability in every sense of the word. Digging ourselves out of this hole is therefore only likely to progress when doing so aligns with one company\u0026rsquo;s need to solve particular problems relevant to them, or if motivated individuals take it upon themselves to do the work anyway.\nThis is a pity, because from where I stand, I see across the profession way more questions than answers, and the answers we have right now are insufficient. The SRE books provided a look behind the curtain at a different way of thinking about service and software management - one which proceeded from different assumptions than most other organisations, yet was still worthwhile. The next revolution - SRE 2.0, as some have started calling it - is just as urgently necessary as 1.0 was, but if anything further away.\nSRE could be - should be - much more than it is today. Please help.\nAcknowledgements\nAt the end of SREcon EMEA 2019, a number of folks got together to discuss their impressions of the conference, the future of the profession, and reliability, software development, and systems thinking generally. Unhappily, the pandemic happened shortly thereafter, but the SREfarers group (as it became known) was a source of great comfort during those times, and the input from the members was incredibly influential in my thinking for this piece and others. I would like to thank Narayan Desai, Nicole Forsgren, Jez Humble, Laura Nolan, John Looney, Murali Suriar, Emil Stolarsky, and Lorin Hochstein for their many and varied contributions to my insights over the past few years. 
It has given me an abiding respect for intimate and respectful cross-industry, cross-role discussions that might be an interesting model for development in the future. I also want to specifically thank Laura Nolan, Cian Synnott, and Tiarnán de Burca for comments on an earlier version of this article.\nReferences:\n[1] Charles Perrow, Normal Accidents (Basic Books, 1984).\n[2] Alex Hidalgo (ed.), Implementing Service Level Objectives (O\u0026rsquo;Reilly, 2020).\n[3] Betsy Beyer, Chris Jones, Niall Richard Murphy, and Jennifer Petoff (eds), Site Reliability Engineering: How Google Runs Production Systems (O\u0026rsquo;Reilly, 2016).\n[4] Štěpán Davidovič, Incident Metrics in SRE: Critically Evaluating MTTR and Friends (O\u0026rsquo;Reilly, 2021).\n[5] \u0026lsquo;Life and Safety: Scaling Up Azure Resources to Safeguard Society in a Pandemic\u0026rsquo;, Microsoft. https://www.microsoft.com/en-ie/engineering/lifeandsafety\n[6] Narayan Desai and Brent Bryan, \u0026lsquo;Principled Performance Analytics\u0026rsquo;, USENIX SREcon (2022). https://www.usenix.org/conference/srecon22americas/presentation/desai\n[7] Jake Brutlag, \u0026lsquo;Speed Matters for Google Web Search\u0026rsquo;, Google (2009). https://services.google.com/fh/files/blogs/google_delayexp.pdf\n[8] Christoph Luetke Schuelhowe, \u0026lsquo;Loading Time Matters\u0026rsquo;, Zalando Engineering Blog (2018). https://engineering.zalando.com/posts/2018/06/loading-time-matters.html\n[9] Nicole Forsgren, Jez Humble, and Gene Kim, Accelerate: The Science of Lean Software and Devops (IT Revolution Press, 2018).\n","date":"4 June 2022","permalink":"https://non-functional.net/posts/2022-06-04-what-sre-could-be/","section":"Posts","summary":"Today, I believe we cannot successfully answer several key questions about SRE. 
Let\u0026rsquo;s start with the most important one: how can we understand what reliability customers want and need?","title":"What SRE could be"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/systems-thinking/","section":"Tags","summary":"","title":"Systems-Thinking"},{"content":"Somewhere between 15 and 20 years ago, I worked for a company. It was a very prestigious company, and it was a glorious and frustrating time. The company did amazing things. Literally unbelievable achievements - from my point of view anyway. But this was coupled with levels of chaos that led to inefficiency, wasted opportunity, and needless headaches.\nThe contrast grew so large that I had to reconcile it somehow, if only in my own head. So I went to the countryside and wrote about the situation, which is my go-to technique for processing Stuff In My Life.\nBy coincidence I came across that piece recently, and was struck by how absolutely relevant it was, all these years on! (Indeed, slightly more relevant today, in one way I\u0026rsquo;ll talk about in part 2.)\nBut let me start off by reproducing some key passages below.\nThe company I work for is a classic example of what I like to call \u0026ldquo;smart people, stupid organisation vs. stupid people, smart organisation\u0026rdquo; syndrome. Despite many of my colleagues being highly dedicated and very emotionally attached to their company, the organisation surrounding them is incredibly chaotic. Most of the things my department does are very badly planned, if they\u0026rsquo;re planned at all. Documentation is either non-existent or of appalling quality - woe betide the new hire attempting to understand the technical architecture of what is surely one of the most complex environments on the planet - they are simply left to sink or swim. 
The department itself is woefully understaffed, and struggles to catch up with the immense number of projects dropped on its shoulders, and therefore cuts corners with implementation whenever possible.\n(Emphasis not in original, but they were the bits that struck me in the act of copying this out again.)\nThe American half of our department barely talks to itself, let alone across the water to us. It\u0026rsquo;s a little demoralising, and we\u0026rsquo;ve had significant turnover as a consequence. If you read this, you\u0026rsquo;ll know I am a great believer in having the system be intelligent as well as the people. Smartness or otherwise means very little, I believe, if you cannot reproduce behaviour reliably.\nI turned some of that personal memo into an email to various folks, trying to improve things. Here\u0026rsquo;s what I wrote about predictability and reliability in delivery:\nPredictability is great. It allows engineers working on projects to expect that when they\u0026rsquo;re asked to do something, whatever that something is will be reasonably well defined, the timeframe for doing it well-understood, and the end result can\u0026rsquo;t be signed off until X Y and Z is done. This protects the engineer, which is good.\nIt also protects the organisation, since the organisation comes to expect that when it wants to do something, it should have designs/documents/inputs in well-known format Q, and last time we did this, it took length of time W for it to be done, therefore it\u0026rsquo;ll probably take around that this time too. More importantly, the organisation won\u0026rsquo;t get something which is called \u0026ldquo;done\u0026rdquo;, but isn\u0026rsquo;t. 
Last time we tried to behave as if X, Y and Z didn\u0026rsquo;t have to be done, severe problems were caused N months down the line, when we discovered we did in fact have to do those - like we knew all along.\nProtecting the organisation means not causing those severe problems, and consequently means not pretending.\nBy protecting both the organisation and the engineer, we define a stable interface between them. By defining a stable interface, planning is easier, day-to-day jobs are less interrupted by crises, management is easier, credibility increases throughout the organisation as deadlines are met, projects work when delivered, and your adrenaline glands can begin to sink back to normal levels of stimulation. Win all around.\n(Today, I would probably be a bit more nuanced about the tradeoffs between speed and stability, but I stand by organisations as a whole generally benefiting from driving predictability. This is probably worth the slight drop in excitement; there\u0026rsquo;s always skydiving if you disagree.)\nUnsurprisingly, I got a fair bit of pushback, and I responded in turn:\n[My colleague] would appear to see time spent in planning and writing documents as essentially wasted time, since he asserts that without process we are faster. We are not faster in reality. We are merely faster to say we\u0026rsquo;re finished. That is not the same thing as actually being finished: we can all think of examples of that. And the uncertainty induced by the unknown magnitude of correction required is, IMHO, the biggest contribution to our inefficiency and ineffectiveness.\nThe art of good engineering is the art of saying no, and we must begin to say no to things to protect the organisation and to protect ourselves.\nThe strange fact was, however, that my little bit of the group was probably less affected by the insanity than others. We were probably the best planners and the best documenters. 
I suspected we had the most predictable schedules, and we were very cognisant of the physical limits of work: we said no to things numerous times.\nBut there were huge cultural and business imperatives that continued to create random stuff for us to fix, no matter how many impassioned emails we sent.\nI wrote that I found it hard to give my best in these conditions:\nIt\u0026rsquo;s very hard to do good work in this atmosphere, and particularly I - who has always had an emotional relationship with work - find it hard to be engaged with something so obviously crazy. I\u0026rsquo;d love to fix the chaos myself, or try to, but it seems unlikely I\u0026rsquo;d be allowed, since my previous emails produced only well-worded refutations. They explained quite factually why the setup is the way it is, and implicitly therefore why it could not change.\nI\u0026rsquo;d be more generous today about people documenting the constraints they suffer under, but I hope I\u0026rsquo;d be as insistent that it\u0026rsquo;s appropriate and good to think about the system, the team, and the goals as a whole.\nI understood, even at the time, that a focus on the narrowest components of execution can be a huge problem for greater success. Given I\u0026rsquo;ve worked across networks, software, and systems, I\u0026rsquo;m probably one of those people who is going to be inclined to think in a holistic way anyway.\nBut back then, I saw it in terms of generalists versus specialists:\nThis feeling has become coupled with another realisation I\u0026rsquo;ve had recently. I\u0026rsquo;m a generalist, B-grade at a bunch of stuff. But the organisation does not want, or reward, generalists. The organisation wants specialists that it can slot into specific pieces of the hierarchy, who will then do their job with a minimum of complications. I\u0026rsquo;ve been thinking about this in a career context - I don\u0026rsquo;t want to specialise to get a promotion. 
I have no interest in (for example) vendor certifications - I am wondering if I have painted myself into a corner.\nThen I spoke to my friend Steve. Steve recast the problem entirely, and that was very helpful.\nActually, from reading this, it\u0026rsquo;s clear you already have a specialty: Systems Engineering. Not systems in the limited sense of Network Jockey or Server King, but Systems, with the capital S and everything, where it is all about interfaces and trade-offs.\nSystems Engineers often get the short end of the stick, because they have to be generalists. But without them, any project that involves more than one roomful of people is probably doomed. I\u0026rsquo;ve seen a lot of engineering projects fail: failing for technical reasons is way less common than failing for lack of Systems Engineering. It really is a constant theme.\nIt sounds like you\u0026rsquo;re in an org where management hasn\u0026rsquo;t understood the need for Systems Engineering yet. Systems Engineering is nearly always something that must be imposed, at least at first, because engineers will never happily demand that someone who knows less than them about a particular subsystem should be making the final technical decisions.\nSteve recommended investigating Systems Engineering as a distinct subject. 
Specifically, reading the engineering histories of the Gemini and Apollo projects, and especially about the culture clash between the experimental aircraft guys who built Mercury, and the ICBM teams; additionally, thinking about joining a professional organisation like the IEEE, since a community with practical experience of dealing with these issues is always useful; and finally, coming back to my situation at the time, trying to use references on Systems Engineering to prod the organisation into doing some of it:\nIf you can\u0026rsquo;t get the ball rolling on even a small scale because no-one can see the need or will free-up the required resources, then you\u0026rsquo;re free: they\u0026rsquo;re fucked. Give yourself permission to let the organisation fail \u0026ndash; it\u0026rsquo;s not your fault, and in your attempt to introduce a Systems approach you will have discharged your responsibilities as a professional, so now just do what\u0026rsquo;s reasonably asked of you, keep saying no to the absurd requests and cash those paychecks till something better comes along.\nThe curse of Cassandra was to be correct, but never believed; the curse of systems thinkers is to be correct, but never valued.\nIn part 2, we\u0026rsquo;ll see if this is, in fact, the whole truth, or if perhaps there is an upside for systems thinkers in organisations.\n","date":"11 April 2022","permalink":"https://non-functional.net/posts/2022-04-11-the-curse-of-systems-thinkers/","section":"Posts","summary":"Smart people, stupid organisation - or the other way around. 
A memo from the past about what systems thinking costs you.","title":"The Curse of Systems Thinkers (Part 1)"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/gatekeeping/","section":"Tags","summary":"","title":"Gatekeeping"},{"content":"","date":null,"permalink":"https://non-functional.net/tags/horizontal/","section":"Tags","summary":"","title":"Horizontal"},{"content":"We used to have a difficulty in our community - thankfully less prevalent now - with rootless questions of identity. Of course, it\u0026rsquo;s not wrong to ask who we are, what we\u0026rsquo;re here for, and what we should be doing: every profession benefits from regular reflection. But too much of it, and you never converge, and moving forward becomes impossible.\nThough, as I say, I think things are settling and existential questions are much less urgent than they were, the profession continues to grow. Many new folks are still joining with similar questions about our purpose, how we achieve it, and so on. How, then, should we best address this?\nRather than pointing these new joiners at the fixed list of responsibilities present in the original SRE book, I thought it might be better to try a new approach: defining what SRE is by looking at what it\u0026rsquo;s not. Or to put it another way, what can you remove from SRE and have it still be SRE?\nHere are my suggestions.\nIf you don\u0026rsquo;t have this\u0026hellip; \u0026hellip; are you doing SRE?\nAccess to internal source code: No\nAbility to change internal source code/system design: No\nAbility to cap operational work: No\nOrganizational cross-cutting ability: No\nAbility to write code in the first place: No, but temporarily is okay\nOperational responsibilities: No, but temporarily is okay\nSLOs: Yes, but it\u0026rsquo;s better if you have them\nA large-scale system to manage: Yes, but it\u0026rsquo;s better if you have some\nAbility to avoid vendor kit: Yes\nA mono-repo: Yes\nSRE job title: Irrelevant\nAccess to internal source code. 
An SRE team without access to source code, either for their products/services, or infrastructure, can still do some useful things. They can make distributed systems designs, trace issues back to an endpoint or contributing factor of some kind, and write useful tools.\nBut this lack of access affects MTTR, destroys parity of esteem, undermines building closer relationships between the two teams, prevents deep engineering contributions to the supported systems, and sharply circumscribes SRE team possibilities. Unlike the ability to change the code, if the SRE team can\u0026rsquo;t even be trusted to see the code, that indicates a deeper relationship pathology it would be hard to recover from.\nAbility to change internal source code/system design. The good news is that an SRE team with read-only access to source code can perform more accurate problem resolution, and can understand the systems more deeply. As a result, they can come up with ideas for deep engineering contributions - but they aren\u0026rsquo;t allowed to do them. That\u0026rsquo;s almost worse than the previous situation!\nHowever, there\u0026rsquo;s a reasonable situation where this makes sense, and that\u0026rsquo;s where individuals in the SRE team have to satisfy the product engineering team of their competence with the code. Though this might come across as condescending in your individual situation, I actually don\u0026rsquo;t judge here - folks are going to be cautious about source code, and legitimately so. But to my mind, this could only ever be a temporary (albeit perhaps somewhat long-lived) situation for an individual; a permanent barrier would mean an SRE relationship was impossible.\nA similar discussion applies to system design as well. If an SRE team can\u0026rsquo;t influence the design phase of the SDLC for the systems they\u0026rsquo;re minding, that\u0026rsquo;s not engineering. 
Of course, it might take quite a while to demonstrate competence at doing so: that\u0026rsquo;s fine.\nAbility to cap operational work. (We might also call this \u0026ldquo;SRE team having autonomy over how their time is spent\u0026rdquo;, for reasons which will become clear.)\nThe key behaviour enabled by having a cap on operational work is that operational work almost by definition requires some element of immediate attention or task-focus. If you can\u0026rsquo;t do anything other than respond to an issue, by definition you can\u0026rsquo;t put together the project time required to solve a general class of problems with software. If in turn you can\u0026rsquo;t solve classes of problems with software, in a general environment of growing services (which ~all cloud services are) you either have more work over the same amount of people, which is bad, or linearly growing numbers of people, which is bad. So I feel as a function of autonomy, a function of organizational back pressure, and a function of just getting the job done, an SRE team needs to be able to do this.\nWhether it\u0026rsquo;s 50% or some other number, I doubt matters in the short-term, but if you don\u0026rsquo;t give a team at least as much time to reflect, clean up, and engineer as they spend in purely reacting, they are not SREs.\nOrganizational cross-cutting ability. One of the cultural aspects of SRE that is often misunderstood is the implications of being the guardians of the user experience. Reliability is a holistic thing; it\u0026rsquo;s not an attribute or property in the gift of any one silo, so guarding that experience necessarily requires moving outside your own team, to assemble the end-to-end picture. This leads to a situation where SRE often acts as \u0026ldquo;horizontal glue between vertical silos\u0026rdquo;, to coin a phrase.\nWell, what happens when you can\u0026rsquo;t do that? 
In very hierarchical environments, where you are literally not allowed to talk to other teams without going up and down a chain of command; in environments where work items are processed primarily via context-free agents in ticket queues; or in environments where information about the customer experience is hidden, protected, or otherwise gate-kept, SRE struggles to work.\nTo be clear, there\u0026rsquo;s (potentially) a huge difference between declared policy and actual practice here; if \u0026ldquo;leadership says\u0026rdquo; you have to go through the chain of command, but in practice people just talk to each other and help out as they would anyway, that\u0026rsquo;s SRE-compatible for sure.\nBut if the organization is structured so as to prevent this kind of work, then it\u0026rsquo;s not SRE compatible. (Indeed, it might be incompatible with many other things too.)\nAbility to author code. Some idealised SRE team whose members can\u0026rsquo;t write software, or can\u0026rsquo;t be in a position to within some agreed timeframe, is an operations team with distributed systems expertise. While this provides value in and of itself, not having the ability to write code loses the SRE team one of the key ways it can contribute meaningfully to scaling, reliability, monitoring, and so on. To my mind, this is not an SRE team. There may well be a path to being an SRE team, particularly if the individuals are competent in a related domain, and are willing to acquire knowledge in the other. An SRE team should, of course, have a spectrum of experience and inclination within it, including both systems expertise and software expertise.\nWithout operational responsibilities. 
An SRE team without operational responsibilities can still do useful work, particularly if the products/services are in the process of launching (design \u0026amp; pre-launch can be an incredibly valuable period for SRE contributions) but a permanent removal of operational responsibilities breaks one of the main feedback loops utilised by SRE to improve the product. As a result, a complete withdrawal of operational responsibilities must be temporary (perhaps extended, but temporary), or I don\u0026rsquo;t think this is an SRE team.\nDo please be aware that despite various assumptions, on-call is not the primary value SRE provides, and neither do operational responsibilities have to be provided primarily as on-call.\nSLOs. There are many great practices that flow from having SLOs for your services, and much of value that is gained by being able to trade off priorities in services, but I have been in teams without SLOs, teams that took a long time (~year) to settle on SLOs, and even a team where a relatively relaxed SLO was chosen by fiat and it was explicitly forbidden to spend more time finding a better one. So I am forced to conclude you don\u0026rsquo;t need them to be doing SRE, though whether or not you can continue to do that indefinitely is very much another question.\nFWIW, having SLOs unlocks disciplined tradeoffs between services, helps in deciding on appropriate work and whether issues are important enough to care about, and brings a lot of organizational goodness - and is also a great way to be objective about what the user experience should be.\nA large-scale system. An SRE team without a large-scale system to look after is still a perfectly valid thing, provided either that the system will grow significantly at some point in the future (in which case preparation in advance is useful), or that the time cannot be spent more usefully on something else. 
\u0026ldquo;Large\u0026rdquo; is also a subjective definition, geared not just on the sizes of the systems in question but also the scoped competence of the individuals in question. \u0026ldquo;Business importance\u0026rdquo; can stand in for \u0026ldquo;large\u0026rdquo; too, of course, it is just that much of SRE expertise is most fruitfully applied across a large number of systems, amount of data, or high number of users.\nIn short, you don\u0026rsquo;t need it, though having it helps apply expertise in a high-leverage way.\nSupporting vendor kit. In a way, this is a special case of source code not being available. Note that source code not being directly available does not necessarily prevent SRE improving manageability or scalability of a piece of kit; devices often offer some kind of management API even if they don\u0026rsquo;t expose their full range of capabilities, or their source generally. Sometimes they can be automatically managed even if they don\u0026rsquo;t provide an explicit API: for example, Traffic team in Google supported Netscalers with a collection of perl scripts that SSH\u0026rsquo;d into the machines and redefined VIPs on the fly. (Not joking, sadly.) Traffic team was a completely legitimate SRE team despite having to work with this for some years, before a self-developed system called Maglev replaced them.\nSo supporting vendor kit, even kit for which you don\u0026rsquo;t have code, does not necessarily mean you can\u0026rsquo;t do SRE: if the majority is entirely proprietary or not automatically manageable, then yes, it is a major problem, but as a subcomponent, it\u0026rsquo;s not.\nMono-repos. A mono-repo, while convenient in the general case, is not required for SRE to be SRE. The main benefit of it in the general case is the ability to track down a software path, derive what log messages actually mean, figure out appropriate people to talk to, and so on. 
If there is delayed access to segments of the code, that may have an effect on MTTR, but does not represent a conclusive blocker. (Access to source code as a general point is covered above.)\nJob title. Doing SRE work does not require the SRE job title. Conversely, having the SRE job title but not doing SRE work creates confusion and dismay.\nConclusion\nThough I\u0026rsquo;ve given you a lot of separate headings above, the summary is probably this: SRE is an engineering role. (The clue is in the name, I suppose!) The headings above talk about ability to write code, influence design, and so on - fundamentally, these are all proxies for the ability to do engineering. There are some practices that are perhaps more central or more peripheral than others, and there are some situations which might be temporarily bearable, particularly in startup mode - but fundamentally, if you can\u0026rsquo;t do engineering, it\u0026rsquo;s not SRE.\nIt is occasionally useful to spell out the specifics though, so if you find yourself in a position where you feel you\u0026rsquo;re not doing SRE, have a look at the above list, see what\u0026rsquo;s missing, and maybe you can start good conversations about fixing that. Maybe even point your leadership at this page, which could start opening the doors for actual engineering. 
Or perhaps it means those doors are more thoroughly locked, in which case many other companies await with pleasure the arrival of motivated SREs desiring to improve their engineering abilities.\nAcknowledgements\nReview from David Blank-Edelman, Liz Fong-Jones, and the SREfarers crew: Narayan Desai, Laura Nolan, Emil Stolarsky, Nicole Forsgren, Jez Humble, Murali Suriar, and John Looney.\n","date":"26 October 2021","permalink":"https://non-functional.net/posts/2021-10-26-what-sre-is-not/","section":"Posts","summary":"Questions of identity solved by asking what we aren\u0026rsquo;t","title":"What SRE is not"},{"content":"","date":null,"permalink":"https://non-functional.net/categories/","section":"Categories","summary":"","title":"Categories"}]