DO, RE, Me: Measuring the effectiveness of Site Reliability Engineering

Dave Stanke
Developer Relations Engineering, Google Cloud

The DevOps Research and Assessment group, or DORA, has conducted broad research on engineering teams’ use of DevOps for nearly a decade.

Meanwhile, Site Reliability Engineering (SRE) has emerged as a methodology with similar values and goals to DevOps. How do these movements compare? In 2021, for the first time, DORA studied the use of SRE across technology teams, to evaluate its adoption and effectiveness.

We found that SRE practices are widespread, with a majority of teams surveyed employing these techniques to some extent. We also found that SRE works: higher adoption of SRE practices predicts better results across the range of DevOps success metrics.

In this talk, we’ll explore the relationship between DevOps and SRE and how even elite software delivery teams can benefit through the continuous modernization of technical operations.

Video Transcript

Thank you for having me. I’m going to go ahead and share my screen. I’ve got a few slides to show. And we’ll get right into it. All right. So today’s talk is Do Re Me: Measuring the Effectiveness of Site Reliability Engineering. My name is Dave Stanke from Google.

So here are today’s highlights. The DORA organization has been researching software delivery for many years. This year, we did a deep dive into site reliability engineering. We asked teams about their operations practices and about the outcomes those teams are able to achieve. We found that SRE and DevOps work really well together. They had separate origins, but they independently evolved into very similar philosophies, and today they complement one another very deeply. We found that SRE practices have been broadly adopted in organizations of all kinds, and we found that SRE can really help to amplify the already powerful impact that software delivery best practices have on achieving business goals.

All right, so let’s take a look at how we drew these conclusions. So first off, my name is Dave and I greet you today from Jersey City, New Jersey. I do developer relations for Google, and my work involves learning about, then teaching, and continuing to learn about best practices related to DevOps and SRE. I’d love to hear from you, so find me on Twitter @DavidStanke. All right. What’s DevOps? For a term that gets thrown around so much, it’s maybe a bit surprising that there’s no canonical definition of DevOps, right? There’s no manifesto or accreditation process. And that was actually on purpose. The folks who coined that term really intended that DevOps would be an ongoing process of communication. It’s a conversation.

But of course, that doesn’t stop us from offering definitions for it. You’ve probably heard many. You may have one of your own. I have one. It’s actually a picture. It looks like this. DevOps is when we apply communication and automation to achieve velocity and reliability, all in service of happy users. Okay, cool. What do we do with that? DevOps certainly sounds nice. It’s hard to argue with the idea that all of our services should be fast and reliable and we should make our users happy. But how do we apply that? These ideas are supposed to apply to our work, to our jobs, right? They should be practical. DevOps can be fun, but we don’t do it for fun. We do it for work. I’m here on behalf of a group that takes an empirical approach to that question: how do we do the DevOps?

So within Google Cloud, there’s a team called the DevOps Research and Assessment group, and for nearly a decade, we’ve been running the largest research project dedicated to understanding what it takes to make a software team really excel at delivering value to the business. To do that, we’ve been researching across lots of teams, lots of organizations. We’ve collected data from tens of thousands of technical practitioners who work at thousands of organizations. And with that data, we apply rigorous statistical methods to draw inferences about what the best practices out there are. We publish an annual report summarizing our findings. Here’s a view of this year’s report, which came out a few weeks ago and is available at bit.ly/dora-sodr. S-O-D-R, State of DevOps Report. It’s a great read. It’s something like 70 pages or so, but with lots of graphs and colors. It’s meant to be read by executives, but there’s a lot of great insight for practitioners, for people at all different levels. I strongly encourage you to go grab that and give it a read, and please give us feedback and let us know what your thoughts are. This is an ongoing conversation.

I’m going to dive in a little bit to some of the things we’ve learned and presented in that report, and there you’ll also find other topics beyond the SRE topic that we’ll be talking about today. All right. So at the heart of DORA’s research is this predictive model of capabilities and outcomes. A predictive relationship is stronger than a correlation, though not quite a causation. And what it means is that when we look at things that teams do, capabilities, these may be technical, these may be cultural, these may be tools, practices, ways of communicating. We have a whole bunch of them. These capabilities are predictive of certain outcomes. And so that means that if we do have a given capability, we’re more likely to get the kind of outcomes that we want.

And as part of that predictive model, we’ve discovered a really reliable set of metrics to measure the effectiveness of a software team. Now, for a long time, it’s been a challenge to measure software development. A long time ago, we learned that lines of code or counts of bugs are not good measures, right? They’re easily manipulated, and they’re not really correlated to meaningful outcomes. But through the DORA research, we’ve developed a set of concise, robust metrics that actually does predict real-world outcomes. There are four of them. Four key metrics. There are two that relate to velocity. The first is lead time for changes. How long does it take from the time a bit of code is committed to version control to when it’s released and deployed to users, whatever your deployment or shipping mechanism might be? What’s that lag time?

The second velocity metric is deployment frequency. How often are we deploying or releasing to users? Is it once a quarter, multiple times per day, or somewhere in between? Now, coupled with those, there are the stability metrics. So when we release our software, what is our change failure rate? How often does one of those deployments go bad, so that we say, “Oh, eek,” and we need to undo it and roll back, or fix forward? And when that happens, our fourth and final metric is the time to restore service: median time to restore service. How long does it take to get from “Eek” to “Okay, it’s back and service is restored for the users”? Together, these four metrics, these four keys, combine into a construct that we call software delivery performance. This is a part of our mathematical model that works as a single entity to predict other parts of the puzzle.
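
To make those four keys concrete, here’s a minimal sketch of how a team might compute them from its own deployment records. To be clear, this isn’t DORA’s measurement instrument (DORA gathers these through surveys); the data and field names are invented for illustration.

```python
from datetime import datetime
from statistics import median

# Hypothetical deployment records; the field names are made up for this sketch.
deployments = [
    {"committed": datetime(2021, 11, 1, 9, 0), "deployed": datetime(2021, 11, 1, 15, 0), "failed": False},
    {"committed": datetime(2021, 11, 2, 10, 0), "deployed": datetime(2021, 11, 3, 11, 0), "failed": True,
     "restored": datetime(2021, 11, 3, 12, 30)},
    {"committed": datetime(2021, 11, 4, 8, 0), "deployed": datetime(2021, 11, 4, 9, 0), "failed": False},
]
period_days = 30  # length of the measurement window

# Velocity: lead time for changes (commit -> running in production) and deployment frequency.
lead_time = median(d["deployed"] - d["committed"] for d in deployments)
deploys_per_day = len(deployments) / period_days

# Stability: change failure rate and median time to restore service.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)
time_to_restore = median(d["restored"] - d["deployed"] for d in failures)

print(f"Lead time: {lead_time}, deploys/day: {deploys_per_day:.2f}")
print(f"Change failure rate: {change_failure_rate:.0%}, time to restore: {time_to_restore}")
```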

And so software delivery performance then in turn predicts business performance. So working backwards from the end result, we can say this: if you want good outcomes for your business, then improving software delivery is likely to help you achieve that. And these capabilities, which I’m going to dive into a bit, tell us how to improve software delivery. Now I want to take a bit of a sidebar into what is arguably the most important part of the whole thing: culture. Culture is another thing that often feels hard to quantify, hard to measure, and certainly hard to change, but it can be done. Let’s start with how to measure it. The DORA research uses a framework called the Westrum organizational typology, which was developed several years ago by a sociologist named Ron Westrum and his team.

They were actually studying teams outside of software. They were studying teams in high-information contexts and in places with some pretty serious, pretty quantifiable outcomes. They were looking at things like emergency rooms and nuclear power plants, asking: how do the people in those contexts communicate? How do they share information? And from that, Westrum drew lines to define three types. On one end of the spectrum is the power-oriented type of information sharing, where people are looking for individual control, individual authority, and new information is seen as a threat to that. So they try to exterminate new information, and they may punish the bearer of bad news. We also call this a pathological organization.

Now in the middle, and often a very conventional model for a business, is the rule oriented organization. So this is one where you have really defined silos of responsibilities, and new information may not be exactly punished, but it’s not entirely welcome. And the idea is that when something new comes up, people strive to put it back in the box in a sense, try to incorporate it into their framework of what they already know. A lot of times in these kind of organizations, it takes a lot of leaps, a lot of hops for one person to try and connect with another person, because they have to navigate this bureaucracy. And that’s why we call this a bureaucratic organization.

And then at the far end of the spectrum, we have the performance-oriented culture. This is one where new information is seen as an opportunity for learning. Inquiry is really the order of the day, and people cooperate to understand new information and to develop new ways to respond to that information and whatever’s going to come next. We call this a generative culture.

What Westrum found is that in these contexts, and potentially in any context, the teams that perform the best, the ones with the best outcomes according to their own KPIs, were the ones with that generative culture. And in DORA, we’ve found that the same thing applies to software teams. A generative culture is predictive of better software delivery outcomes and better business outcomes.

Now, these outcomes can be pretty broad, and correspondingly, those measures of software delivery have a really wide range. When we compare the highest performers against the rest (our cluster analysis gives us four clusters of performers), the elite performers are delivering software significantly faster and more frequently than the low performers, and at the same time, they have a faster time to restore service and a lower change failure rate. So the same teams that ship the fastest also have the fewest defects, the lowest impact to their users, and the shortest duration of impact. What that means is that the teams that go the fastest break things the least. So really, the high-performing teams are outperforming their competitors across the board.

All right, now. I said we were going to talk about SRE. What’s SRE? SRE stands for site reliability engineering, and it’s a way of doing operations. In fact, it’s the way that Google does operations. We started doing it in the early 2000s, and it really originated from a recognition that we had scaling challenges. We looked at the manual effort that we were putting into scaling and operating our services, and we said, “Boy, this isn’t going to work when we start getting beyond millions of users into billions of users. There just aren’t enough operations folks in the world to do that. And even if we had them, we couldn’t afford to pay them all.”

So what we did is we started applying principles of automation and repeatability and information sharing to scale up the capacity of those individual operators. And along the way, we’ve developed a few principles. These are the SRE principles. The first one is this. The most important feature of any system is its reliability. Whatever features, whatever cool functionality we wanted to give to our users, if the system’s not working, if they can’t access our system, doesn’t matter what we wanted to offer them. They can’t benefit from it. So job one, priority one, is reliability.

Now, what do we mean by reliability? Well, our monitoring doesn’t decide. We may have all sorts of signals, all sorts of telemetry, but that’s not the final judge of whether we’re reliable or not. Our users get to decide. If our users experience our system as working, as reliable according to their needs, it’s reliable. And if our definition doesn’t match theirs, we need to change our definition. So we define our system’s reliability from that user-oriented perspective and strive to measure it and improve it from that perspective.

Finally, we’ve learned that in order to meet those reliability goals, we need more than well-engineered software. We also need well-engineered operational systems. And beyond that, to really achieve the high levels that serve a real scaled operation, we need a well-engineered business that understands and can support the reliability engineering efforts.

All right, let’s do some terminology. The key terms of SRE are these: SLIs, service level indicators; SLOs, service level objectives; and also, sort of, SLAs, service level agreements. An SLI is a metric that describes how our system is doing. And again, it reflects a customer’s perspective of the system. So it may be something like error rate: how many user requests are resulting in a 200, an okay status, versus a 500, an error status? That gives us a ratio: out of all the requests that our users are making, how many of them are delivered successfully, in a way that the user can recognize as a good response? Now, to that number, we can apply an SLO, a service level objective. This says: this is our target. This is our goal. This is the value that we want those metrics to have. It’s often expressed in terms of a percentage, a number of nines. Two nines is 99 percent. Two and a half nines is 99.5 percent. These are our targets. These are our SLOs.
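
As a minimal sketch (with invented request counts), an error-rate SLI measured against an SLO might look like this in code:

```python
# Invented counts for illustration only.
good_requests = 998_700      # responses the user would judge as good (e.g., HTTP 200)
total_requests = 1_000_000   # all user requests in the measurement window

sli = good_requests / total_requests  # the indicator: observed ratio of good events
slo = 0.995                           # the objective: our target, "two and a half nines"

print(f"SLI: {sli:.4%}, SLO: {slo:.1%}, meeting objective: {sli >= slo}")
```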

Finally, an SLA is something we don’t usually talk about much in SRE. That’s the business agreement that says, “If we don’t hit our goals, our promises, what are the penalties? How do we make up for that?” SREs usually try to stay out of those conversations, which means we need to hold ourselves to a higher standard than the SLA. The SLO definitely has to be set with the SLA in mind; they work together. But if we’re meeting our SLOs, that should give us confidence that we’re clear on our SLAs.

Now, how do we achieve all of this? One of the things we use is called an error budget. An error budget says this: we’ve set our SLO, our objective, at a target of, let’s say, 99.5 percent. Well, that means we understand that there’s going to be up to a half percent of errors. Not everything’s going to work. And we understand that, right? We’ve been around the block. We know that sometimes systems fail, and we have to have some sort of understanding and agreement about how much failure there will be, how much we’re willing to let it fail, and how we can still be okay with our users.

That becomes our error budget. It’s our risk tolerance for how much failure, how much error there can be in this system while still doing a good enough job for our users. And that error budget then gives us the capacity to do things that might be risky. Things like making releases. We find that something like 90 percent of all failures happen during a new software release or configuration change. So giving people all those cool features they want comes with risk. And we use our error budget as a way to know: do we have the risk tolerance right now to do something cool and new, or do we need to hold back and focus on reliability? Because reliability is the most important feature. Not the only important feature, but the most important. The error budget is our guide to that.
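
Here’s a sketch of that error budget arithmetic against a 99.5 percent SLO; the traffic numbers are made up, and real budget policies are usually richer than a single if statement:

```python
slo = 0.995
total_requests = 1_000_000
failed_requests = 3_100

budget = (1 - slo) * total_requests          # ~5,000 failures tolerated in this window
budget_remaining = budget - failed_requests  # ~1,900 failures of headroom left

# The remaining budget gates risky (but valuable) work like releases.
if budget_remaining > 0:
    print(f"{budget_remaining:.0f} failures of budget left: OK to ship something new")
else:
    print("Budget spent: hold releases and focus on reliability work")
```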

We like to rationalize alerting. And what I mean by that is that if you get woken up in the middle of the night by an alert that you can’t do anything about, or that you don’t care about, that’s not a good use of your human ability to care. And so we really try to turn off as many alerts as we can and focus them on things where the user experience is actually affected and there’s something you can do about it, and you should do it right now. If it can wait, if the response is something that might be okay in a few days or weeks, let’s not page someone in the middle of the night. Let’s maybe log it somewhere and then we’ll get to it when we have time.
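
That decision rule boils down to a few questions. Here’s one way to sketch it; the three yes/no checks are my paraphrase of the principle, not an official checklist:

```python
def triage_alert(user_impact: bool, actionable: bool, urgent: bool) -> str:
    """Decide whether an alert should wake a human."""
    if user_impact and actionable and urgent:
        return "page the on-call now"          # users are hurting and a human can help
    if user_impact and actionable:
        return "file a ticket for work hours"  # real, but it can wait days or weeks
    return "log it, or delete the alert"       # nobody should lose sleep over this
```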

We do disaster preparedness drills, right? A disaster preparedness plan isn’t a plan unless it’s actually been tried. So we do a lot of scenarios in that way. And one of the techniques that underpins all of this is toil reduction. Toil is work that is manual, repetitive, and not strategic: something like SSHing into a machine to restart a service. That’s fine, it’s good, it might stop the OOM, but it’s going to come back tomorrow. So we’d rather automate that, or make it so it doesn’t even happen in the first place. And by eliminating that toil, we get the freedom, the mental cycles, to do more work that helps us grow our systems and scale up.
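
As a sketch of automating that particular bit of toil away, here’s roughly what replacing the manual SSH-and-restart with a remediation loop could look like. The service name, health check, and restart command are examples, and of course the real SRE fix is to find and remove the underlying leak so nothing needs restarting at all:

```python
import subprocess
import time

def babysit(service: str, health_check: list[str], interval_sec: int = 60) -> None:
    """Detect the failure and remediate it without waking a human."""
    while True:
        healthy = subprocess.run(health_check).returncode == 0
        if not healthy:
            subprocess.run(["systemctl", "restart", service])
        time.sleep(interval_sec)

# Example: babysit("myapp", ["curl", "-sf", "http://localhost:8080/healthz"])
```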

All right, there are a lot of different ways that SREs can relate to the rest of the organization. The models I’m showing here are very similar to what you might have seen on DevOpstopologies.com. There’s no one-size-fits-all model, but having that communication platform, using the SLOs and the error budgets as a means to balance between feature velocity and stability, that is the core of how these different roles relate to each other, regardless of what their org structure looks like.

In 2016, these ideas about SRE, which we had been developing inside Google, started to become part of a public conversation when we released the first book on Site Reliability Engineering. And since then there have been multiple additional books and conferences, and a lot of conversations around SRE. So as people in the DevOps community and the community beyond started to learn about SRE, they started to look at it next to DevOps, and they said, “Huh, how are these things related? I see some similarity here. I see some familiarity. Are these things the same? Are they completely different? Are they competitive?”

Well, one way to answer that question is something that we at Google have posited, which is that SRE implements DevOps. Just like an object-oriented class implements an object-oriented interface, SRE is a way to achieve DevOps. It practically does the things that DevOps more abstractly prescribes. Now, for people familiar with both of these disciplines, this assertion felt very comfortable. It rang true, right? But it’s more of a hypothesis than an axiom. We didn’t necessarily have the objective data to back it up.
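
If the analogy helps, here’s a toy rendering of “class SRE implements interface DevOps” in Python. The method names are my paraphrase of the idea, not an official mapping:

```python
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'interface': what DevOps abstractly prescribes."""
    @abstractmethod
    def reduce_silos(self) -> str: ...
    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...
    @abstractmethod
    def measure_everything(self) -> str: ...

class SRE(DevOps):
    """One concrete 'class' that implements it."""
    def reduce_silos(self) -> str:
        return "share ownership of production between dev and ops"
    def accept_failure_as_normal(self) -> str:
        return "set an SLO below 100 percent and spend the error budget"
    def measure_everything(self) -> str:
        return "define SLIs and track them against SLOs"
```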

Now I want to take a second to look at the breadth of these terms. DevOps originated right here in this area of the SDLC, looking at how we can get our dev cycles and our deployment cycles more in sync, with better communication and better velocity between them. SRE started a little further to the right. SRE is a little more about: as software is deployed into prod, how do we keep it operating reliably, and then provide that feedback mechanism back to the dev teams?

Over time, DevOps has had a bit of scope creep. As it started entering organizations, people started realizing, with reason, that there are important interactions with product, with the business, with the users. And the scope started to expand, right? This is not a bad thing, though it did lead to some pretty awful portmanteaus. Lean product, for example, was introduced into the DORA research as the DevOps product management framework of choice. It’s how product development can be incorporated into how we look at the overall DevOps model.

So what about ops? We’ve been studying stability of software, by which we mean: when we do a release, does it last, or do we need to redeploy it because it’s got defects in it? But what about after that? What about the ongoing operation of the software as it meets users and as they really experience it and get value from it? Well, this year, the DORA project really wanted to better understand operations and their role within an overall software delivery value stream. Now, DORA had dabbled in this area going back to 2019, when we introduced a fifth metric of software delivery and operations performance. We have the four keys of software delivery, and we introduced another one for operational performance called availability.

Now, the first thing that we did this year to expand our scope is we renamed availability to reliability, because reliability actually, when we consider it from a reliability engineering point of view, encompasses a lot more than availability. Availability means, is the website up? Can I get to it? But the way we think about reliability, we think about more than just, can I get to it? But also, can it give me a response fast enough to be of any use to me? When I get that response, does it have the right stuff on it? Does it have the content that I’m expecting from it as a user?

Reliability really means, is this service living up to its promises, explicit and implicit, to me as the user? And of course, at Google, the way we attempt to achieve this is through SRE, and we wanted to find out who else out there is doing SRE. This gives us the question, how do we know? How do we know if a team is doing SRE? There’s no established rubric that stamps someone as SRE. It’s pretty reasonable to say that Google SREs are practitioners of SRE because that’s a tautology, but even here in the mothership at Google, different areas, different teams, different individuals, all have different styles, and different levels of engagement and different maturities in their SRE practice.

We didn’t want to just ask people, “Do you SRE?” That can lead to both false negatives and false positives, and I’ve seen both in my conversations with customers. Sometimes teams say that they have SRE, but the way they operate is a little more traditional. And sometimes teams have developed SRE-like principles but don’t use that term. For example, Facebook has a discipline called production engineering. It’s in a lot of ways very similar to SRE, but of course it has a different name. So we took kind of a duck-typing approach: if a team acts like an SRE team, then we say they’re doing SRE.

So to achieve that, what we did is we distilled the SRE book down into succinct statements of practice, and we stripped out most of the jargon. We boiled it all down to a set of statements that we added to our DevOps research survey. And we asked people: what do you do? On a scale of one to seven, do you do these things? You’ll see some statements here that reflect the rational, symptom-based approach to alerting. You’ll see others that describe toil reduction or disaster preparedness. You’ll also see some that describe the relationship between operations and development: how they communicate, how they prioritize.

Now, this list is necessarily reductive, and it’s imperfect and incomplete. I am sure we got some things wrong, but we got it close enough to right that we have statistically significant findings. There is something here. So what did we learn? Well, the first thing we learned is that SRE is widely practiced. It goes far beyond Google and Facebook and some of the large companies that have been written about in books. In fact, we found that 52 percent of respondents reported the use of SRE practices to some degree. Now, this spans a wide range. Some teams are doing just a little bit. Some teams are doing all of the SRE practices that we asked about. But it really was very widespread.

And this reflects our experience at Google as well, where some teams do a lot of SRE practices and some aren’t doing so much. Still, to an extent that surprised me, these practices are really widespread throughout the software development community. So with all those folks out there doing this stuff, that leads to the question, of course: how’s it going? Well, good news. SRE seems to work. Recall the predictive model that we use: if a team does X, that’s predictive of Y, where Y may be another capability or an outcome. Here’s a simplified view of that predictive model. You can explore the whole thing at bit.ly/dora-bfd. That’s the big friendly diagram.

Now, what we found, based on the predictive analysis, is that SRE has multiple benefits. SRE is good for humans. We found that teams that use SRE practices have less burnout, which I think we can attribute to SRE practitioners having more varied work and doing more strategic, thoughtful work. It also helps balance between different kinds of work: between that ops-centric, toilsome kind of work, and coding. So teams that practice SRE report that they spend more time writing code, and this often isn’t feature code. It’s the kind of code that helps them manage and automate their systems.

So in that light, SRE is good for systems and organizations. I mentioned that we asked teams about how the operations and development staffs communicate and prioritize. And one of the things that we look at is: do they share responsibility for the operational reliability of their systems? What we found is that when that ultimate responsibility for delivering a system is really shared between these roles, it leads to better reliability. And in general, across the board, all of the techniques that we’ve studied as part of our SRE construct predict higher reliability.

Finally, this is good for business: reliability contributes to delivering on those business goals. Let’s take a look at how that works. It works because reliability is a force multiplier. Here’s what I mean. There’s software delivery performance, as measured by those four key metrics around velocity and stability. Now, what we found is that reliability moves orthogonally to software delivery. Teams that succeed in one of these areas don’t necessarily succeed in both; they move independently. But when teams shine in both of these areas, you get a compound effect on the value that technology can deliver to stakeholders. Teams with high software delivery performance scores, who also use SRE practices extensively, are 1.8 times more likely to report that their businesses achieve their commercial goals.

Now, finally, we found that there’s a lot of room for growth. SRE is widely used and has demonstrable benefits, but very few respondents indicated their teams have fully implemented every SRE technique that we studied. Teams at all levels have a range of implementations. And at all levels, at every cluster of software delivery performance, teams that also met their reliability goals outperform other members of that cluster in regard to business outcomes.

All right, it’s time for my Hot Takes. These are informed by my reading of the research, but they shouldn’t be taken as formal findings of the research team. First off, there’s a convergent movement here. A lot of companies and researchers have arrived at the conclusion that this generative culture drives organizational performance in a complex, high-information environment. So we’ve learned about these ways of communicating, these ways of processing information, from SRE culture, from DevOps and the DORA research with the Westrum culture model, from Toyota and their Toyota Production System, and from the psychological safety research.

You can see tons of articles in the business literature about how these kinds of cultures, where there’s high trust and high communication and psychological safety, ultimately yield better outcomes. And I think what that’s showing us is that there’s something underneath, perhaps a deeper truth, and DevOps and SRE and the other approaches are all different ways of discovering it. And there’s probably continued common ground that we can discover.

Second, SRE isn’t just for the Googles of the world. These reliability engineering practices have benefits for teams at all levels of DevOps maturity. But it’s important to note that implementing SRE might not be the highest priority for your team. There are a lot of capabilities and practices that are effectively prerequisites for SRE, and a lot of those are part of the DevOps model.

So one of the things that we do with our DORA research is we offer people kind of a guide, a roadmap to say, how are you doing across each of the capabilities that we’ve studied? These range from things like trunk based development, to continuous testing, to team structure, and providing teams with trust and autonomy. And through the research, we can discover where our team needs help. And it may be that we need to look at our operational practices. It may be that the most powerful thing is something else.

Finally, DevOps is a big umbrella, with or without the extra syllables in the middle. And SRE is a framework for our operations and for communication between operations and other teams. It really fits into DevOps in the same way that lean product is a framework for product development that works well within a DevOps organization. So SRE works as part of an overall DevOps approach to software development. With that, I will say, thank you. And I look forward to answering questions in the Q and A.

I’ll leave you with resources to explore next. First off, this year’s State of DevOps report, which you can download at bit.ly/dora-sodr. Head over to sre.google to find free access to all of the books that I’ve shown, as well as a bunch of other articles and resources. And I’d like to cordially invite you to the Reliability Discussion Group at bit.ly/r9y-discuss, which I host every month. It’s a place for open conversation about reliability engineering where we can share what’s working and what’s not. I’d love to see you there. Thank you.

 
