Data-Driven DevOps: The Key to Improving Speed and Scale
As the creator of Jenkins and co-CEO of Launchable, Kohsuke gets to see lots of real-world software development, and lots of teams and organizations trying to push better DevOps practices forward. In those conversations, he noticed that some are more successful than others. In this talk, he explores where those differences seem to lie. One is around data. Our automation in software development is sufficiently broad that it is producing lots of data, but by and large most of that data is simply thrown away. Yet at the same time, management feels like they are flying blind because they have so little insight! Another is around how they leverage “economy of scale.” Successful teams seem to have managed to drive great uniformity and consistency across software development, which allows organizations to move at great speed and makes developers feel great.
VIDEO TRANSCRIPT
Hi, this is Kohsuke. I am the co-CEO of Launchable and the creator of Jenkins. Today I want to talk about data-driven DevOps, which is something I’ve been noticing and thinking about for a while, and which I think is a key to improving the speed and scalability of software development. First, let me do a quick introduction of myself. As I mentioned, I created Jenkins, and that is probably what I’m best known for. It is an open source project used all over the world, helping all sorts of software development teams everywhere, and I’m pretty sure some of you have used it and are using it as we speak. I was also a part of CloudBees; I was the CTO and am now an advisor, so my role is smaller than before, but CloudBees is helping enterprises everywhere with DevOps and digital transformation, and much of what I’m going to talk about today comes from that experience.
Today I’m with Launchable, where we are focusing on testing, helping testing go faster, so I’m going to mention a bit of what I’m working on now in this context, because it’s a natural extension of the things I have seen. Automation has come a long way in software development. When I started Jenkins, it was simply the place where people ran nightly builds and nightly test executions. That was the state of development automation just 10-15 years ago. But what started as something simple got a lot bigger, and a lot more complicated, pretty quickly. I’m sure you’re feeling that pressure every day.
A lot of companies are no longer just building and testing nightly, right? People have delivery pipelines that might look something like this: it starts from build and test, then the dev stage, the QA environment, all of that stuff. But actually that is not what’s going on; what’s really going on is more like this. There is a lot of automation that is far messier, held together by email communication here and there, some implicit processes, all that stuff. What I often see is that the individual pieces of the process are scripted and automated, running without human intervention, but by and large they are still put together and held together by people.
Trying to make sense of the whole picture from those microscopic automations at times feels like trying to understand a beehive by watching individual bees. That is pretty difficult. So, after seeing so many companies struggle with this kind of larger-scale automation, larger-scale software delivery, I started feeling like there are two kinds of companies when it comes to software development.
One is the donkeys, the other is the unicorns. They look the same, or similar, but they perform quite differently. So I’ve been wondering what makes that difference, and I think one of the key differentiators that separates these two kinds of companies, in my opinion, is how they use the data that comes out of the software development process to improve their practice. I wanted to spend these 30 minutes looking into what difference these practices make, and why I think it’s a critical piece. Now, I’m proud of myself for pushing automation in the software development industry forward; I played a role in it. But despite the fact that all this automation is producing a lot of valuable data, by and large I feel like that data is just thrown away, only used when something fails, for somebody to dig into the console output to understand what went wrong. Now, I don’t have personal stories on this, but I’ve heard that in the e-waste people are throwing away, there is a higher concentration of rare metals, gold, and other precious metals than in the raw ore that comes out of the earth. What that implies is that, volume aside, density-wise it might make more sense to mine e-waste than gold ore. I think something comparable is happening with this data. By and large it is thrown away today, but there is actually precious insight in it.
The teams that have managed to figure this out and put the data to positive use seem to be getting a lot of value out of it. So let me make this a little more concrete. Imagine yourself in this situation: you’re working in a company that has hundreds of engineers working on dozens of projects, and they are all building and testing on one shared Jenkins infrastructure. At this kind of scale, you shouldn’t be surprised that they are paying hundreds of thousands of dollars to AWS just to provide the compute and storage necessary to carry out all the build and test workloads. And, as with the typical Silicon Valley startup where growth trumps everything, the cost wasn’t a concern until one day it was. This company switched into IPO mode, a CFO was hired, and one of the first things he noticed was that his company was spending a lot of money, hundreds of thousands of dollars. So what is it? What exactly is it necessary for? Are there any ways to cut it? Those are perfectly valid questions, and it turns out that the people working on this infrastructure had no way of answering them, because it was just one giant infrastructure with everyone putting a crapload of workload on top of it. What should happen in cases like this is that visibility into the cost is provided at the project level; that in turn allows developers to be aware of the trade-offs they are making, and that information helps them make the right call.
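To make that concrete, here is a minimal sketch of per-project cost attribution, assuming each build is recorded with a project name, machine type, and duration. The rates, project names, and records are all hypothetical; in a real shop the numbers would come from your cloud bill or your CI server’s job metadata.

```python
from collections import defaultdict

# Hypothetical hourly prices for three machine types; real prices
# vary by instance type and region.
HOURLY_RATE = {"small": 0.10, "medium": 0.40, "large": 1.60}

# Hypothetical build records: (project, machine_type, duration_minutes).
builds = [
    ("payments", "large", 42),
    ("payments", "large", 38),
    ("search", "small", 95),
    ("search", "medium", 30),
]

def cost_per_project(records):
    """Roll individual build durations up into a dollar figure per project."""
    totals = defaultdict(float)
    for project, machine, minutes in records:
        totals[project] += HOURLY_RATE[machine] * minutes / 60.0
    return dict(totals)

print(cost_per_project(builds))
# roughly {'payments': 2.13, 'search': 0.36}
```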
Now, this place had three separate machine types for different kinds of workloads; let’s say, for the sake of brevity, they are called small, medium, and large. Obviously the bigger machines run faster, but they cost more money. Imagine every test workload comes with a slider that says: depending on which instance you choose, this is the time it’s going to take, and this is the cost. What that allows developers to do is make the right trade-off. For something that runs infrequently, like a nightly job, execution time is not that important, so maybe they are willing to go with the slower, cheaper machine. For something like pull request validation, where people are actively waiting for the result, it’s probably worthwhile to spend a little extra money to get the feedback faster, so humans don’t have to wait. In the absence of this information, developers have other things to worry about, so at the first sign of trouble they just say “let’s run this on the larger machine” and move on. That is the wrong incentive structure. So this is an example where a small bit of information the system already knows can make a difference in the cost structure.
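Here is a sketch of that slider for a single workload, with made-up relative speeds and prices; the real curve would come from benchmarking your own suites on each instance type.

```python
# Hypothetical relative speeds and hourly prices for the three sizes.
SIZES = {
    #         (relative speed, $ per hour)
    "small":  (1.0, 0.10),
    "medium": (2.0, 0.40),
    "large":  (4.0, 1.60),
}

def tradeoffs(minutes_on_small):
    """Print the time/cost trade-off 'slider' for one test workload."""
    for size, (speed, rate) in SIZES.items():
        minutes = minutes_on_small / speed
        cost = rate * minutes / 60.0
        print(f"{size:>6}: {minutes:6.1f} min   ${cost:.2f}")

# A 2-hour suite: a nightly job might pick small, PR validation large.
tradeoffs(120)
```

With these made-up numbers in front of them, a developer can see that the large machine is four times faster at four times the cost per run, and choose deliberately instead of by default.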
And these opportunities show up in so many places. Let’s look at another example. Here is an even bigger company: thousands of developers working on what is essentially an embedded system, as I think of it. This place has achieved amazing scale for the DevOps team: one DevOps team providing the infrastructure to build and test, and I’m not just talking about Jenkins but all the supporting machinery. The source code in this place is so big that they had to come up with a clever Git forward-proxying scheme just to make clones go fast enough. That’s the scale we are talking about. The problem these guys are facing is: when a build fails, or when a test fails, who needs to be notified? Because sometimes the infrastructure fails, say the network, or the databases for testing, or the QA environment gets messed up, and then things start to fail en masse; those are infra problems that the DevOps team wants to know about. And sometimes the failures are in the application, say a compile error, or a test fails because an assertion didn’t hold; those are the errors the app engineers should look at. Getting this right, getting who to notify right, is actually pretty important, because an infrastructure problem is not something the app developers can work on. When they get a failure caused by the infrastructure, it costs the DevOps team credibility in the eyes of the app engineers, right? Nobody wants to feel like their builds and tests are running in a crap place where the results are not trustworthy.
But that’s what happens, or maybe the easiest way I can explain this is crying wolf: you don’t want to be the boy who cries wolf when it’s not there. So in this place, by the time I visited, they had already deployed a solution, and it was surprisingly impactful. They noticed that these different failure modes showed up as different kinds of errors, which makes sense, so they deployed something very simple: regular-expression pattern matching against the last 30 lines of the log file, or something along those lines. They just deployed a number of well-understood regular expressions that say, when the failure matches this pattern, page the infra team, or the other way around.
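A minimal sketch of that kind of router, with hypothetical patterns; a real deployment would accumulate dozens of these over time as humans triage unclassified failures.

```python
import re

# Hypothetical patterns that suggest the infrastructure is at fault.
INFRA_PATTERNS = [
    re.compile(r"Connection (refused|timed out)"),
    re.compile(r"No space left on device"),
    re.compile(r"Could not resolve host"),
]
# Hypothetical patterns that suggest the application is at fault.
APP_PATTERNS = [
    re.compile(r"error: .+"),           # compiler errors
    re.compile(r"AssertionError"),      # failed test assertions
    re.compile(r"\d+ tests? failed"),
]

def route_failure(log_text, tail_lines=30):
    """Match known patterns against the tail of the log to decide who to page."""
    tail = "\n".join(log_text.splitlines()[-tail_lines:])
    if any(p.search(tail) for p in INFRA_PATTERNS):
        return "infra-team"
    if any(p.search(tail) for p in APP_PATTERNS):
        return "app-team"
    return "unclassified"   # a human triages it, then a new pattern is added

print(route_failure("...\nERROR: Connection refused by db-test-3\n"))  # infra-team
```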
That kind of very basic solution turned out to be surprisingly impactful. And this problem is so common everywhere that I came across another team who, in turn, deployed a Bayesian filter, which is, I suppose you could say, a modest level of statistics. The idea is that this is the same technology spam filters were using, especially back in the days when people ran a mail client on their own machine and weren’t using Gmail. What this system does is really clever: the system sends out an email based on its judgment, so it picks up the failure and decides to send it to either the app developers or the infra people, right? And that notification contains a button that says “not my problem.” When the infra guys get this email and press that button, it not only redirects the notification to the app developers, it also teaches the system that this failure pattern was meant for the app engineers and not the other way around. Based on this kind of input from humans, the filter gradually gets smarter, and its effectiveness is proven by spam filters, so it does work. Again, it’s well-understood technology, and looking at log files is certainly something a lot of people can think of. But just deploying this had a tremendous impact in keeping the DevOps team’s credibility high, by masking problems that app developers shouldn’t have to see.
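Here is a toy version of that idea: a tiny naive Bayes classifier over log tokens with a “not my problem” feedback method. It’s a sketch of the technique, not their system; real spam-filter libraries implement the same thing with far more care.

```python
import math
from collections import Counter

class FailureClassifier:
    """Naive Bayes over words in the log tail, with a feedback loop."""

    def __init__(self):
        self.word_counts = {"infra": Counter(), "app": Counter()}
        self.doc_counts = {"infra": 1, "app": 1}   # weak prior, avoids log(0)

    def train(self, label, log_tail):
        self.doc_counts[label] += 1
        self.word_counts[label].update(log_tail.lower().split())

    def classify(self, log_tail):
        total_docs = sum(self.doc_counts.values())
        vocab = {w for c in self.word_counts.values() for w in c} | {"_"}
        best, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for word in log_tail.lower().split():
                # Laplace smoothing so unseen words don't zero the score out.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

    def not_my_problem(self, wrong_label, log_tail):
        """The button: re-route AND teach the filter it guessed wrong."""
        other = "app" if wrong_label == "infra" else "infra"
        self.train(other, log_tail)
        return other   # who the notification should go to instead

clf = FailureClassifier()
clf.train("infra", "connection refused while provisioning agent")
clf.train("app", "assertion failed expected 200 got 500")
print(clf.classify("connection timed out to test database"))   # likely 'infra'
```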
So that was a great example of data making a positive impact. Perhaps more importantly, what I’m finding is that at the organizational level, this kind of data usage is very, very important. A lot of the people I talk to today, the practitioners, in many ways understand what needs to be done, but they are often frustrated that the work they believe is the right thing is not getting prioritized by their organizations. I’m sure some of you feel the same frustration. In some sense I think that frustration is warranted, but in some sense solving that frustration is your job as a leader; that is, helping the people around you, who are almost by definition not as technically savvy as you are, understand the importance of this effort. That is actually the job of a DevOps leader.
Because it’s the data, and the story that goes with it, that helps these other people see the problem that you see. Of course, you are the subject matter expert; you don’t need that kind of data to convince yourself, but other people do. And that is the backbone of how every organization works, right? You need to convince the people around you, you need to influence them, in order to rally them behind your cause. Just telling them so is not going to work; you need to use data. Data is also important because it helps you apply the effort to the right place. It keeps you honest. It’s one thing for you to think that this is the place to make an impact, but let’s make sure that is actually the case; otherwise it’s no different from blind faith, and we of all people, as engineers, pride ourselves on rational thinking, so we should get the importance of that. And the importance of data doesn’t stop there: data also helps you show the impact of your work.
For an organization, nothing is harder than hearing, “We need a million-dollar investment now, and we may or may not be getting any results for the next two years.” That’s just a very difficult story to sell. So if you can use data to show the people around you that you made an impact, that’s only going to make it easier for you to make a bigger impact. In many ways I think data is the common language that connects the business people and the subject matter experts like us. Sometimes, because we see the problem we’re facing so clearly, we tend to overlook the importance of communicating it in ways that other people understand. In other words, the picture you need to get to is this: you have ideas, and you have a software factory, this process and these machines that turn those ideas into functioning software. What we are trying to do is observe how this factory is behaving, and then use that information to improve the factory itself over time, one step at a time, continuously. And I think that’s a theme that other industries, like manufacturing, figured out long ago; in some sense what I’m saying is that this kind of thinking applies just as well to software engineering. If you look at this as a learning exercise, then it should come as no surprise that machine learning can play an increasingly big role in this kind of learning and improving process. Some companies are already doing this, so let me talk a little bit about that.
So this is another company, a much bigger company. You are in the DevOps team, and you have a massive codebase. It’s modularized, organized, and maintained, so it’s not spaghetti, but it is massive. Conceptually there is a dependency graph, the kind of thing you could draw on paper. A system like this gets pretty expensive and time-consuming to test, and when you change something, the ripple effect is quite big. The software keeps growing because your business is successful, and the challenge you are trying to tackle is how to cut the cost and time of the software delivery process. The first step the team took is dependency-based selection. Imagine a graph where the four diamonds at the bottom are files, the squares in the middle are modules, and the orange circles are tests. Thanks to the well-organized build system, you can create this kind of dependency graph, and based on it you know, for example, that the file on the bottom left has changed. By using this dependency information, you can automatically determine the subset of the graph that needs to be rebuilt and retested: in this case, one module and two tests. The things on the right you can safely infer don’t need to be retested.
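In code, that selection is just a reverse walk over the dependency graph. Here is a minimal sketch, assuming the build system can export “X depends on Y” edges; the graph itself is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical "node depends on" edges: tests -> modules -> files.
DEPENDS_ON = {
    "test_checkout": ["mod_cart"],
    "test_search":   ["mod_index"],
    "mod_cart":      ["cart.c", "money.c"],
    "mod_index":     ["index.c"],
}

def affected_by(changed_files):
    """Walk upstream from changed files to find what to rebuild and retest."""
    dependents = defaultdict(list)              # invert the edges once
    for node, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(node)
    affected, queue = set(), deque(changed_files)
    while queue:
        for parent in dependents[queue.popleft()]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected

# Changing cart.c touches mod_cart and test_checkout; test_search is skipped.
print(affected_by(["cart.c"]))   # {'mod_cart', 'test_checkout'}
```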
That got them going for a while. Now, let me point out that most of the teams I visit are not even doing this level of analysis; it’s so common to see people running everything clean, from scratch, every single time, which costs a lot of time and money. But anyway, that is not what I wanted to highlight; what is truly amazing comes in the next step. The trouble with this kind of static analysis is that there are certain kinds of tests, say integration tests, that basically depend on everything else, because by definition they are testing a larger system together. Those tests end up being impacted all the time, so you can only get so much of a speed boost out of dependency-based selection. So the next step they took is what I think of as predictive test selection. They trained and deployed a machine learning model which looks at what has changed and predicts a useful subset of the tests to run. This company is producing on the order of 10^5, so 100,000, changes a month; they pulled out 1% of those and trained the model. That was apparently enough to make useful predictions, and the impact was pretty impressive: they reported that this model selects only a third of the tests compared to the dependency-based approach. Now, every time you stop running some tests, you worry about a regression slipping through, but they reported that this model misses a failure in just 0.1% of the changes. So 99.9% of the time, this one third of the tests catches all the regressions you need.
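The model itself is beyond a slide-sized sketch, but the flavor can be shown with a crude stand-in: score each test by how often it historically failed when similar files changed, then keep the high scorers. This is a toy approximation of the idea, not the company’s method.

```python
from collections import defaultdict

# Hypothetical history: (changed_files, tests_that_failed) per past change.
history = [
    ({"parser.c"},            {"test_parser"}),
    ({"parser.c", "lexer.c"}, {"test_parser", "test_cli"}),
    ({"docs.md"},             set()),
]

def failure_rates(history):
    """Estimate P(test fails | file changed) from past changes."""
    changed, failed = defaultdict(int), defaultdict(int)
    for files, failures in history:
        for f in files:
            changed[f] += 1
            for t in failures:
                failed[(f, t)] += 1
    return {k: failed[k] / changed[k[0]] for k in failed}

def select_tests(changed_files, all_tests, history, threshold=0.3):
    """Keep only tests with a meaningful historical link to the changed files."""
    rates = failure_rates(history)
    return {t for t in all_tests
            if any(rates.get((f, t), 0) >= threshold for f in changed_files)}

print(select_tests({"parser.c"}, {"test_parser", "test_cli", "test_docs"}, history))
# {'test_parser', 'test_cli'} (test_docs is skipped)
```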
Predictive selection allowed them to cut those costs in half. Imagine: I don’t know how much this place is spending, but it must be well into the millions, and to be able to cut that in half, and to reduce the time it takes to get feedback, that’s pretty amazing. It’s pretty clear to me, and pretty exciting, that this kind of prediction of test failure probability has many uses. In my own Jenkins project, every time I make a one-line change to the codebase, I have to wait for the PR validation to complete, which takes a whole hour. That’s a long time to wait, and I’ve seen companies doing far worse. Then you have places where some big integration tests only run nightly, or even less frequently, so naturally the problems those tests catch are reported much later. Imagine you could do a similar kind of thing: predict a failure probability for every test case, and then, instead of running the tests in their natural order, sort them by that probability and run them in that order.
By doing that, you increase the chance that a failing test fails right away, and that feedback is what the developer needs to get back to work. You don’t have to wait for the whole run to finish; you just need to know your first failure, and this kind of ordering gives you that. Or you can take it one step farther and simply skip the low-probability tests, focusing on the likely failures, which creates a smaller, adaptive subset.
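A sketch of that reordering, assuming you already have a per-test failure probability from some model; the numbers here are invented, and producing them well is the hard part.

```python
# Invented per-test failure probabilities, as a trained model might emit.
predicted = {
    "test_parser":  0.62,
    "test_cli":     0.30,
    "test_plugins": 0.04,
    "test_docs":    0.01,
}

def prioritize(probs):
    """Run the likely-to-fail tests first so the first failure arrives early."""
    return sorted(probs, key=probs.get, reverse=True)

def adaptive_subset(probs, budget=0.90):
    """Keep the smallest prefix covering `budget` of the expected failures."""
    total = sum(probs.values())
    picked, covered = [], 0.0
    for test in prioritize(probs):
        picked.append(test)
        covered += probs[test]
        if covered >= budget * total:
            break
    return picked

print(prioritize(predicted))        # parser, cli, plugins, docs
print(adaptive_subset(predicted))   # ['test_parser', 'test_cli']
```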
This is actually what led me to Launchable, and it’s what I’m working on in this company right now as we speak, so if this is an interesting problem for you, I would love to talk more about it and work with you. But let’s move along from predictive test selection to another machine learning use case. Here is another company. You are on a big SRE team, watching hundreds of microservices get deployed into the production environment you manage, at about the pace of one deployment per app per day; that translates to hundreds of deployments a day. That is a pretty amazing software development shop; I’d say this company is in the ninety-fifth percentile of the delivery journey, as not many companies are at this level of sophistication. But you can also imagine that on this kind of SRE team you are probably feeling a bit overwhelmed, because you’re on the hook if something bad happens, yet there are so many things changing all the time. You need some guide, some help, so that you can reduce the risk or contain the damage more quickly. So they did something similar: they trained a model based on the 40,000 deployments of the past year, I believe, of which 100 were failures. Again, let me point out that a deployment failure rate of just 1 out of 400 is pretty amazing, but they weren’t happy with that; they were trying to do even better. Anyway, they trained the model, and it can predict 99% of the failures beforehand with a 5% false alarm ratio, meaning that when the model flags a deployment as risky, about 5% of the time it’s a false alarm and nothing happens, but it catches 99% of the real problems, which is almost miraculous; at least that’s what they reported.
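As a sketch of the shape of such a model (not their actual system), here is a logistic regression over a few of the signals mentioned here: developer-estimated risk, time to approval, and code age. The training data is fabricated and tiny; the one realistic detail is the class weighting, since real failures are rare, on the order of 1 in 400.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fabricated training data. Columns:
#   [developer_risk_estimate (0-1), hours_to_approval, code_age_months]
X = np.array([
    [0.1,  0.5, 36],   # rated "low risk", rushed, old code -> failed
    [0.1,  1.0, 24],   # similar profile                    -> failed
    [0.3,  0.3, 30],   # rushed change to old code          -> failed
    [0.8, 20.0,  2],   # rated risky, well reviewed, fresh  -> ok
    [0.2, 30.0,  3],   # plenty of review time, fresh code  -> ok
    [0.7, 15.0,  1],   # ok
])
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = deployment failed

# class_weight='balanced' compensates for failures being rare in real data.
model = LogisticRegression(class_weight="balanced").fit(X, y)

candidate = np.array([[0.1, 0.4, 28]])          # a rushed change to old code
print(model.predict_proba(candidate)[0, 1])     # high -> flag for extra review
```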
Perhaps more importantly, think about the impact of what they have done. Based on this kind of data, they were able to learn, for example, that outages happened more in deployments that developers had estimated as lower risk, and that outages tended to happen more when the time to deployment approval was really short: in other words, when the changes were rushed. They were also able to correlate deployment failure with code age; in other words, older code is riskier. Now, you might look at these things and say, “Well, duh, we already knew that,” but then I think you’re missing the point.
The point is quantification: you might know these things qualitatively, but quantitatively, now you can put numbers and costs on them, and that’s actually pretty game-changing. You could say, for example, that every month the code ages, it increases the likelihood of deployment failure by, I don’t know, 1% or whatever; I don’t know what this company actually found in terms of numbers. But to be able to show your homework, point at the numbers, and quantify the business impact of these things, then suddenly that refactoring you always wanted to do, or that service rewrite you always wanted to do, is no longer a subjective, belief-based conversation. It’s a simple numbers calculation. And that is the kind of thing that can really move an organization.
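What a “simple numbers calculation” might look like, with entirely made-up figures:

```python
# Entirely made-up numbers, just to show the shape of the argument.
added_risk_per_month = 0.01   # +1% failure likelihood per month of code age
cost_of_failure = 50_000      # dollars per failed deployment, say
deploys_per_year = 250        # for the service in question

for months_deferred in (6, 12, 24):
    extra = added_risk_per_month * months_deferred * cost_of_failure * deploys_per_year
    print(f"defer the refactor {months_deferred:2d} months: "
          f"~${extra:,.0f}/year in expected failure cost")
```

Against an estimate like that, the cost of actually doing the refactoring becomes something you can compare directly, and the conversation changes.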
Alright, I’m running out of time, so let me wrap up with a conclusion. The thing that connects these stories, I think, is that the practices these large software development companies are able to take on are actually far more advanced than what the smaller shops are doing. I used to think that small, nimble software development teams are the best kind of software development teams, that they can make an outsized impact, sort of like David vs. Goliath. But after seeing these things, I’ve started to feel like some Goliaths are actually far faster and nimbler than the best Davids. Now, let me be careful here: I’m not saying every large company is that kind of unicorn; there are plenty of donkeys. But again, when you think about what makes the difference, it does seem like data plays a key role. So on that note, that is the thought I wanted to leave you with. Thank you very much for listening, and enjoy the rest of the show.