You just got a Kubernetes cluster, now what? This lauded talk from swampUP 2020 covers the details of how to take that k8s cluster and turn it into a reliable, secure and stable environment for developers to build and deploy applications. Note: Brendan Burns is the co-founder of the original Kubernetes project.
Hi there, I’m Brendan Burns from Microsoft Azure.
And I’m going to talk to you about production Kubernetes. Alright, so I want to cast everybody’s mind back to about six years ago, when we were starting the Kubernetes project. The honest truth is that back then, when you took the time to go and install a Kubernetes cluster, it was a mountain.
It was built on a mountain of bash, on just a tremendous number of scripts. I don’t know how many people were around in those days, but if you remember kube-up.sh, turning up a Kubernetes cluster was an ordeal, right?
And so it took you a while to get going, and maybe it would work and maybe it wouldn’t. Fast forward six years, and Kubernetes is present in every single cloud out there, and it looks more like this: with the Azure Kubernetes Service, a single command from the command-line tool will create a Kubernetes cluster, and in a short number of minutes you’re up and running. So we’ve made amazing progress in our ability to start Kubernetes.
But the truth is that once you have that cluster, sometimes it looks a little bit like this, where 28 days later it’s a nightmare.
You have no idea exactly what’s going on, things are misbehaving, and maybe a few pods have turned into zombies; you never know. We really don’t want to go there. So today in this talk, I want to walk you through a little bit of what happens after the cluster is created. We’re going to focus our time on what happens after that cluster has been created: you have an empty cluster ready to go, and you’re thinking about deploying production workloads. I want to walk you through the various pieces and steps you should be thinking about, because setting that foundation, setting that cluster up for success, and setting up the developers who are going to use that cluster for success, is really the key to a successful Kubernetes experience. Alright, so what are we going to talk about? Two things. First, pipelines: how we’re going to go from code all the way through to running code in a Kubernetes cluster, what we should be thinking about as we set up those pipelines, and some high-level thoughts on how you structure the flow of code to cloud. Once we’re done with that, we’re going to spend some time on cluster daemons. That’s basically the idea of how you set up the cluster itself for success: how you should be thinking about the things you put inside a Kubernetes cluster, not the applications, but rather the supporting infrastructure, the supporting characters that live alongside the applications and make it easier for your developers, easier for your operators, and generally make the environment a happier one. Alright, so let’s take a look at pipelines first. The ingredients: on one side is a Git repo. It doesn’t have to be GitHub, although GitHub is a great solution that a lot of people use. And on the other side, obviously, is that Kubernetes cluster.
So what we’re going to be talking about is how we build that bridge from Git all the way through to Kubernetes. Now, as with any good bridge, the first thing you need is support somewhere in the middle, and that support in the middle is a container registry. I want to put forth that you should be thinking about splitting the pipeline, splitting the bridge from code to cloud, into two phases. The first phase is code to container image, whether that image lands in Azure Container Registry, Docker Hub, or JFrog; there are a lot of different container registries to choose from, and it doesn’t really matter which one. Focus on how you go from code to image as the first half of the bridge, and then how you go from image to cloud as the second. Those are the two phases of going from code to cloud, and it’s useful to split things up that way, because then we can attack each of those problems separately and end up at a good result. Alright, so taking a step back, we’re going to focus for right now on how we get from code to an image. What should we be thinking about? Well, when I started programming, a long time ago, and it was a long time ago, there wasn’t really a lot of rigor around this particular step. In fact, many times the binary that we were running out in production was something that somebody had built on their desktop machine. When I started in the software industry, even unit testing wasn’t really a thing. But of course, we’ve come a long way since then, and it seems obvious that if you’re thinking about going from code to cloud, or code to image, the very first thing you want to think about is how you test that code: how you make sure that, at a micro level, the individual pieces that make up your application are operating correctly. Obviously, unit testing is the way we do that.
And that’s a good first step, but it’s really insufficient, because the thing that unit testing misses, since it uses mocks and ideal conditions, is how everything comes together. So in addition to the unit testing, you absolutely want to be thinking about integration testing. Integration testing is where you put all of the pieces together and actually test the flow from end to end, to show that when you put all the pieces together, they actually work. Just as a side note, this is one of the places where, in the Kubernetes project, we actually didn’t do a particularly good job. We had a lot of unit testing, and we had a lot of full-on end-to-end testing, which we’ll get to later. But in the middle, this sort of integration testing that you can run on a single machine, but that tests the real end-to-end flow, was an area we didn’t invest in as heavily as we should have. And that decision, or that lack of a decision, is something that still kind of haunts us. One of the reasons I want to talk about setting up this foundation at the very beginning is that this is the sort of thing that is easy to add at the beginning but very hard to retrofit a year, or even six months, down the line. Alright, so in the pipeline from GitHub to Docker image: unit testing and integration testing are run on the code, you make sure they are all passing, and once they’ve passed, you create an image that goes into that registry. That’s all stuff that happens after the merge. Somebody writes a whole bunch of code, they merge the code in, we run unit testing and integration testing, and an image pops into the registry on the other side. So we’re done, right? Well, it turns out that in any project of any size, and this is where a lot of people start, staying in this world is actually pretty hard.
And actually not just hard, but really, really frustrating. And the reason it’s frustrating is that once the code merges, it’s kind of hard to pull it back out. So if you break the build, or you break a unit test, or you break an integration test, all of the different people collaborating on that code are suddenly stopped, suddenly blocked. They can’t move forward because their unit tests don’t pass either; not because of anything they did, but because someone else merged in a broken change. So at this point, I think most people have at least a notion that unit testing should occur pre-merge as well. We should not merge anything that we don’t believe will pass unit tests. This way, it’s on the individual developer to get it right, to get the tests passing, before we even attempt to merge the code in. That distributes out the testing, and one person’s broken tests don’t affect the other developers working on the project. Shifting things to the left might not matter in a project with one or two people, but as a project scales, and as more and more contributors are contributing to the source code base, moving testing to the left is incredibly important. But you might ask: as we’re moving testing to the left, why does testing still remain on the right? Well, that’s because there are multiple pieces of code merging at the same time, and it’s always possible that three different PRs, each of which passes unit testing individually, won’t pass unit tests when merged and combined together. Or it’s possible that there’s a flaky test, or some other kind of weird environmental condition. So it’s valuable to run the unit tests not just pre-merge, not just as a gate to merging, but afterwards as well. That just ensures that your head branch, the code that everybody’s working on, is passing unit tests too.
But I would actually argue, and this is a little more controversial, though generally speaking I think it’s a best practice, that it’s not just about unit testing: pre-merge testing should really include integration testing as well. This is a little more expensive, because these tests tend to be more heavyweight, but it really is critical, for all of the same reasons. You don’t want broken integration tests. And in some ways, the integration testing is actually more important than the unit testing, because again, unit testing tests things in nice little lab-controlled environments, whereas integration testing actually pushes everything together and tests whether it all works correctly. So as you’re thinking about setting up a pipeline: set up pre-merge validation that does unit testing and integration testing, and post-merge validation before you push an image out to the registry. These two things together will ensure that you’re only pushing high-quality images, or at least as high quality as you can validate through the coverage in these tests. Because one of the things that’s not mentioned here, but that’s assumed, is that you don’t just have tests, you actually have good test coverage. If you don’t have good test coverage, if all the functionality in your product isn’t covered by your tests, well, you’re just getting a false sense of confidence out of the tests that you’re running. So independent of this, you could consider additional pre-merge checks that measure test coverage and won’t merge code if coverage is reduced, as a way to validate and ensure that coverage remains. And while I have unit testing and integration testing up here, there’s additional stuff you may want to put in pre-merge: things like linting and other kinds of validation that you want to have in this test environment as well.
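As a sketch of the pre-merge gate just described, here is roughly the decision a pipeline might encode; the function name and the coverage rule are illustrative, not something from the talk:

```python
def should_allow_merge(unit_passed: bool, integration_passed: bool,
                       old_coverage: float, new_coverage: float):
    """Gate a PR: all pre-merge tests must pass and coverage must not regress."""
    if not (unit_passed and integration_passed):
        return False, "tests failing"
    if new_coverage < old_coverage:
        return False, f"coverage dropped from {old_coverage:.0%} to {new_coverage:.0%}"
    return True, "ok to merge"

print(should_allow_merge(True, True, 0.80, 0.78))  # blocked: coverage regressed
```

The same check would run again post-merge, since independently-green PRs can still break each other when combined.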
Alright, so now we’ve actually created an image; let’s think about deploying it. Well, I said that we were going to go from Git to image and then from image to cloud, but the truth is, of course, that Git is present in the process of going from image to cloud as well. And that’s because in addition to the source code, there’s also the config: the configuration that drives the deployment of a new image that’s been built by our previous pipeline. An image that we’ve pushed doesn’t just magically appear in Kubernetes; there’s actually a YAML file that tells Kubernetes to update the image. And that should be stored in Git, for the same reason you store your source code in Git. Now, there’s a lot of debate about whether it should be in two different repos or the same repo. I don’t think it really matters one way or the other; you can set up good processes in either case. I do think that separating the testing in both cases is valuable, because the kind of testing you do for the code and the kind of testing you do for a new image are going to be different: different pipelines, different processes. So there we go: we’re taking a marriage of the configuration in Git and the image in the registry, and we’re driving it out to our Kubernetes cluster. That’s the final stage of the bridge that connects configuration to Kubernetes. Now, just like we had unit testing and integration testing, in this process of going from a container image out to production we’re also going to want to add in load testing. And the reason we want to add in load testing, or you might say complete end-to-end testing, is that this is one of those things that really can’t be tested in a small-scale integration testing environment. Something like load testing is inherently a large-scale thing.
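A minimal sketch of that config-update step: a pipeline job that rewrites the image tag in the checked-in Kubernetes manifest. The registry path and tag format here are invented for illustration:

```python
import re

def bump_image_tag(manifest_text: str, image_repo: str, new_tag: str) -> str:
    """Rewrite `image: <repo>:<tag>` lines in a manifest to point at the new tag."""
    pattern = re.compile(r"(image:\s*" + re.escape(image_repo) + r"):[\w.\-]+")
    return pattern.sub(r"\g<1>:" + new_tag, manifest_text)

manifest = """\
spec:
  containers:
  - name: web
    image: myregistry.io/team/web:v1.0.3
"""
updated = bump_image_tag(manifest, "myregistry.io/team/web", "v1.0.4")
print(updated)
```

In a GitOps-style flow, this edit would itself be committed back to the config repo, so the cluster state is always driven from what Git says.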
It inherently involves production data and production load; it probably involves replicated containers and the interplay between different microservices under load. These are things you just can’t test on a single machine. Depending on how your system is architected, you may be able to do some degree of load testing in your integration tests; you may be able to do a sort of single-leaf load testing, if you have an embarrassingly parallel system where each leaf’s behavior is basically identical. But to really understand whether your system is performant under real-life production load, you’re going to have to do some form of load testing. Now, this is also one of those things that is incredibly hard to retrofit. One of the themes I really want to hit on is that as we set up these pipelines, these are the kinds of things you need to do at the beginning. If you don’t get this foundation right, it’s a real task to go in later, add load testing, stub out all the dependencies, and figure out how to do it; you kind of will never do it. And in fact, I’ve seen production systems go a long way down the road without having built load testing at the beginning, and persist in not having load testing for way too long, honestly. So again, as we think about setting up all of our productionization, really focus on getting the foundation into the right place, ideally before any line of code is written. Alright, but it’s more than just load testing, because the truth is that in any production deployment you’re not going to have just a single cluster; you’re going to have multiple clusters. And the reason for that is, obviously, that we want to deploy our code and roll out our images as safely as possible. We obviously aren’t going to go test in production.
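As a toy illustration of the idea, here is a tiny concurrent load generator that reports latency percentiles for any handler function. The request counts and percentiles are arbitrary; real load testing would use a dedicated tool against a deployed environment:

```python
import concurrent.futures
import time

def load_test(handler, requests: int = 200, workers: int = 20):
    """Call `handler` concurrently and report p50/p99 latency in seconds."""
    def timed_call(i):
        start = time.perf_counter()
        handler(i)
        return time.perf_counter() - start

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        latencies = sorted(pool.map(timed_call, range(requests)))
    return {"p50": latencies[len(latencies) // 2],
            "p99": latencies[int(len(latencies) * 0.99)]}

stats = load_test(lambda i: time.sleep(0.001), requests=100, workers=10)
print(stats)
```

The point is less the tool than the habit: if a harness like this exists from day one, stubbing dependencies for load testing never becomes the retrofit project described above.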
First of all, you probably can’t do load testing in production, because production is already under load; you need a separate environment for something like load testing. But more importantly, if you find problems, you want to find them in an environment where you’re not breaking customers, or at least where you’re breaking customers who know what they’re getting into. And so everything really revolves around identifying a canary. The role of the canary cluster, or pre-prod, or whatever you want to call it, doesn’t really matter; it’s a cluster and a deployment environment where you do the last stages of testing before you roll out to production. And in many cases, that canary environment is actually available to your customers, if they want to come in and test the beta version to make sure your changes don’t break them. Because as good as your tests might be, it’s always possible that some customer has a very critical workload that you’re not testing properly, and they can come in and say: no, actually, the canary is not working for us; it may have passed all your tests, but it’s not working for us. So having a canary available, not just to your deployment pipelines but to your customers as well, is also a good practice. Once you pass canary, you’re going to have to have at least two production clusters. We want to make sure that we are deploying things in a robust way, and generally speaking, people deploy their applications around the world anyway, for reasons of latency but also for redundancy. And the reason for this, of course, is that a single cluster, even if it spans multiple availability zones, is still a single point of failure.
So if you’re doing a cluster upgrade, for example going from Kubernetes 1.17 to Kubernetes 1.18, that cluster could break, it could break your application, the upgrade could fail; any number of bad things could happen. If that happens and it’s the only cluster you’re running in, you’re in a bad world of hurt. So in addition to your canary cluster, which is the place where you can do testing, you need at least two other clusters for your production workload, both for reasons of geographic redundancy and lower latency, but also simply to be able to safely and effectively upgrade Kubernetes itself. Now, when we think about rolling out software across these three regions, or more regions if you have them, you want to think a lot about safe deployment. So what do we mean by safe deployment? Well, just like with canary, the goal behind safe deployment is to deploy the code with a minimal amount of damage, or potential damage, to the customers or users of the application you’re deploying. I think the first step of safe deployment is to start small. That’s why we started in canary. That’s why, if you can, you want to start in a smaller region. Maybe you’re really huge in Asia but still developing in the US, or you’re really huge in the US but still developing in Europe. Starting in a region where you have a small presence and a small service will lower your blast radius; it will lower the impact of any problems that occur. Because just because something passed canary, experience shows, doesn’t mean it’s going to pass everywhere. So as you start to roll out into real production, start with a small location. Now, it’s possible that if you only have two or three clusters, none of them are small.
And so you may actually want to split up your rollout; you may want to introduce, say, a 10% cluster that exists just so you can start small like this. Or you may want to partition your workload within a cluster, so that you have multiple replicas of the entire service and can roll out across them. Obviously, within a Kubernetes deployment you’re going to be rolling across replicas, but even for the whole service itself, you may want two copies taking production traffic so that you can start with that very limited, small-scale, real-world deployment. Alright, but where do you go from there? Obviously, we need to get everywhere; we don’t want to just start small. And what do we think about while we’re doing that? Well, the thing I want everybody to be thinking about is understanding what I call the time to smoke. This doesn’t mean it’s time to go have a smoke break; it means the length of time between when you deploy something and when you see a problem. Because not every problem is immediately apparent. Some problems start as tiny little embers. Let’s say it’s a memory leak: it’s a tiny little ember, the problem is there, but you don’t notice it until six hours later when everything starts crashing. So understanding your service, and understanding what the time to smoke is for your service, will help you set up a safe deployment rollout. Because after you’ve deployed to that small region, how long do you wait to make sure that it’s okay? Do you wait 10 seconds? That’s probably too fast. Do you wait a week? That’s probably too slow.
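The start-small-then-promote flow can be sketched like this, with a per-stage bake time standing in for your time to smoke. The stage names, bake times, and callback functions are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    bake_hours: float  # how long to watch monitoring before promoting

def run_rollout(stages, deploy, healthy_after_bake):
    """Deploy stage by stage; stop promoting the first time smoke appears."""
    for stage in stages:
        deploy(stage.name)
        if not healthy_after_bake(stage.name, stage.bake_hours):
            return f"aborted at {stage.name}"
    return "rolled out everywhere"

stages = [Stage("canary", 6), Stage("small-region", 6), Stage("large-region", 12)]
result = run_rollout(stages,
                     deploy=lambda name: print(f"deploying to {name}"),
                     healthy_after_bake=lambda name, hours: name != "large-region")
print(result)  # aborted at large-region
```

In practice, the `healthy_after_bake` step would be a query against your monitoring and alerting, not a callback.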
How do you figure out the amount of time that should go between a canary rollout, a stage-one rollout, and a stage-two rollout? That’s wholly dependent on your application and your expectation about the average time it takes to go from a problem being manifested to a problem being noticed by your monitoring. Obviously, the shorter that time is, the more agile you can be and the faster you can roll out code. But it’s really going to depend, and I would say you’re going to have to start with something relatively big, maybe a day, maybe six hours. Then over time, as you gain more experience with outages, you can build up a model, either approximate or from real data, of how long it takes to go from a problem being rolled out to a problem manifesting in your monitoring and alerting. Once you understand that, you can set up the rest of the safe rollout. So the next thing you want to do is to go big. We said start small, but there are some problems that may not manifest in a small region, because it’s not loaded heavily and there’s not a lot of data. An algorithm that worked okay in a small region may not work well in a big region, in a place where there’s tons of scale. So you want to go big and see what happens. It’s not great if it breaks, but you need to start exploring the different dimensions. What did you find out when you were in that small region? You found out that, generally, production traffic wasn’t breaking things. Okay, great; now we want to test something different. Well, the most different thing you can test is heavy load: you tested in a light-load environment, now you test in a heavy-load environment. And that’s representative of the remaining strategy.
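Once you have real outage data, picking a bake time can be as simple as taking a high percentile of your historical deploy-to-detection delays. A rough sketch, where the incident numbers are invented:

```python
def estimate_time_to_smoke(detection_delays_hours, percentile: float = 0.95):
    """Choose a bake time that would have covered `percentile` of past incidents."""
    delays = sorted(detection_delays_hours)
    index = min(len(delays) - 1, int(percentile * len(delays)))
    return delays[index]

# Hours from rollout to the alert firing, for past incidents (invented data).
past_incidents = [0.5, 1, 2, 2, 3, 4, 8]
print(estimate_time_to_smoke(past_incidents))  # 8
```

Until you have that data, the talk’s advice applies: start with something conservatively big, like six hours or a day, and tighten it as your model improves.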
So as you think about safe deployment, and as you think about pushing your code out around the world, you’re going to want to think about different dimensions. What did each test tell you, and what are the remaining dimensions of failure? Alright: you deploy to a small region, and that tells you, roughly speaking, that your code is correct. You deploy to a big region, and that tells you, roughly speaking, that you can handle the load. Now let’s see about localization, maybe. In the US it’s mostly UTF-8 characters; in Asia, maybe mostly 16-bit Unicode characters. That’s potentially a big difference. It can show up in bugs, and it can show up in UI problems where, in some countries, text goes from right to left instead of from left to right. Those kinds of differences are things you may not have tested for in the regions so far, and they may lead to problems. So as you proceed with the rest of your rollout across all of these different zones, you need to be thinking: what does each test tell me, and how do I move to a region and gain more confidence that my rollout is correct? Eventually, you’re going to reach a place where you either run out of regions, or you’ve tested the whole matrix of possibilities, and it’s time to just roll out everywhere else. You know, it’s kind of like the saying that sells the philosophy of unit testing: you want to unit test until fear turns to boredom. The same thing is true about safe deployments: you want to test all of these different dimensions until the fear that your rollout might break someone has been replaced with boredom and wanting to get the release pushed out everywhere. And that’s the way to think about setting up one of these safe deployments.
Obviously, it’s going to depend on the scale you’re at as well. If you really can only afford to pay for three clusters, you may need to slice up those clusters to get different zones, or you may need to simply go for it past a certain point: you’ve run out of regions to test, and you’re just going to cross your fingers and hope for the best. Alright, so thinking of all those different regions, let’s think about how we manage config for those regions. We mentioned earlier that we’re going to store that configuration in Git. I think the most naive way that people approach this is to say: okay, for each region I’m going to have a config file. Easy, right? Well, it is, because of course in some cases, like one of these keys versus another, there are legitimate differences between the regions. But the trouble happens when I come along and make a change. Maybe I’ve had a live-site incident, or some sort of outage, and I need to change one of the configurations. I change the configuration in two of the configs, but I forget, because I’m human, to change it in the third config. Now I’m in real trouble, because all of that work I did for safe deployment no longer applies: I’m no longer testing the same configuration in the third environment as I had in the previous two environments. And I think one of the other things we should think about as we set up these pipelines is that any place where you leave an opening for human error to creep into the system, human error will absolutely creep into the system. This is sort of the equivalent of Murphy’s Law: anything that can go wrong will, and any place a human can make a mistake, they will make a mistake. Experience says it’s guaranteed, and we cannot train people to not make mistakes; that’s just not a workable solution.
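One way to catch exactly this forgotten-third-config failure mode is a config unit test that flags unexpected differences between regions. A minimal sketch, where the key names and the set of keys allowed to differ per region are illustrative assumptions:

```python
ALLOWED_PER_REGION_KEYS = {"region"}  # assumption: only this key may legitimately differ

def find_config_drift(region_configs: dict) -> list:
    """Return (region, key) pairs that differ from the first region's config."""
    baseline = next(iter(region_configs.values()))
    drift = []
    for name, cfg in region_configs.items():
        for key in sorted(set(baseline) | set(cfg)):
            if cfg.get(key) != baseline.get(key) and key not in ALLOWED_PER_REGION_KEYS:
                drift.append((name, key))
    return drift

configs = {
    "us-east": {"region": "us-east", "image": "web:v2", "timeout_s": 30},
    "eu-west": {"region": "eu-west", "image": "web:v2", "timeout_s": 30},
    "asia":    {"region": "asia",    "image": "web:v2", "timeout_s": 60},  # the forgotten one
}
print(find_config_drift(configs))  # [('asia', 'timeout_s')]
```

Run in the config repo’s pre-merge pipeline, a check like this turns a silent human mistake into a failing test.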
We have to instead build systems and processes that are incapable of making mistakes, or that at least significantly reduce the probability of a mistake. So how do we approach that? Well, there are a couple of things we should be thinking about. The first is our old favorite, unit testing. We’ve talked a lot about how unit testing code has become the standard; we know for a fact that I have to unit test my code. But do we unit test our configs? Well, some people do, but it’s definitely not a universal practice. I would argue that as part of this pipeline of going from config out to cloud, you really should. And then additionally, using a tool like Helm, which allows you to separate out a common values file, a template, and a parameters file for each region, lets you make a change like the one in the previous slide in only a single place, and be guaranteed that it will push out to every region you’re managing code in. That’s incredibly important as well. It enforces consistency, it makes code review much easier, and it really eliminates human error and human mistakes from the process. Alright, so hopefully that gives you a sense for how to set up pipelines for production Kubernetes. I want to talk now about cluster daemons. This is the other side of the cluster: not how I get my stuff into the cluster, but what’s already waiting in the cluster when my stuff gets there. So look at something like monitoring. When I’m running a Kubernetes cluster, I want my monitoring already present in the cluster itself, so that when any developer comes to that cluster and creates a bunch of pods, the monitoring system automatically scans the API server, finds those pods, finds the Prometheus metrics exposed by those pods, and pushes those metrics out to a monitoring service. Because let me tell you, you do not want to run monitoring systems yourself.
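For that automatic scrape-and-collect flow to work, each pod only has to expose its metrics in the Prometheus text exposition format. Here is a hand-rolled sketch of what that format looks like; a real application would use a Prometheus client library rather than formatting it by hand:

```python
def render_prometheus_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

print(render_prometheus_metrics({
    "http_requests_total": ("Total HTTP requests served.", 1042),
}))
```

The cluster daemons do the rest: discovery via the API server, scraping each pod’s metrics endpoint, and shipping the results to the monitoring service.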
Azure Monitor is being used here; obviously, you can use other monitoring services as well. And then you visualize that monitoring in something like Grafana. What this means is that automatically, without me doing any work as a developer, simply by pushing my code into Kubernetes, monitoring happens for me. And the reason that monitoring happens for me, and is available to me when I’m debugging something in the middle of the night, is that I’ve set up that cluster environment, I’ve set up the cluster daemons, such that it’s just automatic for every single application deployed into that cluster. Another thing I wanted to talk about is how we standardize these clusters. We want to have this idea of a landing zone, so that when an application lands in a cluster, it looks and feels and operates the same whether it’s in canary, in your first region, or in your last region. We talked a lot, especially in the early days, about how we wanted to get rid of snowflakes, and how the big problem we were facing in deploying software reliably was snowflake servers. But the truth is that what’s come along is that we’ve created a whole bunch of snowflake clusters. We’ve replaced one problem with another problem. We’ve stepped it up a little bit; maybe we only have four or five snowflake clusters instead of 50 or 60 snowflake servers, but it’s still a problem. So how can we go about solving this? Well, one of the ways is with policy. When you think about setting up a cluster for production, it’s obvious at this point that you want to set up RBAC. If you haven’t set up RBAC for your cluster, pop back a few stages and go read about RBAC in Kubernetes; but it’s set up by default in most production clusters these days. That doesn’t actually help with everything, though.
Because there are other questions, like: where did you get that image from? Did you get a security review? Or even a basic thing like, what’s your team’s mailing list? Those are things we can’t enforce using RBAC. That’s where policy comes into play. There’s an open source project called Gatekeeper, originally developed by Azure and contributed to the Open Policy Agent project, that can actually do this. With Gatekeeper, you can put in place rules that apply to the shape of the API objects that go into Kubernetes. It’s not just about whether you can or can’t create pods; it’s about what those pods look like. And that allows you to set the right kind of rules in place so that when things are deployed to your clusters, they all look the same: they all have the same kind of metadata, they all came from the same image registry, they all have more than one replica if you’re running a production service. Basically, you’re ensuring that all of the consistent things you want in production are applied to every single cluster. But when we talk about every single cluster, so far this is really only for a single cluster. And so what we’ve done with Gatekeeper and Azure is we’ve said: okay, when you define the rules that you want to run, you take those rules and put them out in the cloud, so that there’s a consistent single place for every single cluster you run to find out what the policy is that it should be enforcing. That way there’s a single command-and-control location, and we no longer have a bunch of snowflake clusters floating around out there.
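Real Gatekeeper constraints are written in Rego, but the kinds of shape rules just described can be sketched in Python like this; the registry name, required labels, and replica rule are illustrative assumptions, not Gatekeeper’s actual API:

```python
ALLOWED_REGISTRY = "myregistry.io/"        # assumption: your approved registry
REQUIRED_LABELS = {"team", "owner-email"}  # assumption: metadata you want everywhere

def validate_deployment(deployment: dict) -> list:
    """Admission-style checks on the shape of a Deployment object."""
    violations = []
    labels = deployment.get("metadata", {}).get("labels", {})
    for label in sorted(REQUIRED_LABELS - set(labels)):
        violations.append(f"missing required label: {label}")
    if deployment.get("spec", {}).get("replicas", 1) < 2:
        violations.append("production services need more than one replica")
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        image = container.get("image", "")
        if not image.startswith(ALLOWED_REGISTRY):
            violations.append(f"image not from approved registry: {image}")
    return violations

bad = {"metadata": {"labels": {"team": "web"}},
       "spec": {"replicas": 1,
                "template": {"spec": {"containers": [{"image": "docker.io/web:latest"}]}}}}
print(validate_deployment(bad))
```

Distributing one set of rules like these from a central place is what keeps every cluster enforcing the same constraints instead of drifting into snowflakes.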
Instead, we have a uniform set of clusters, with the same set of daemons and the same set of applications in all of the different environments, so that all of the testing we talked about in the latter stage of the pipelines applies, and we can actually safely deploy our code everywhere. Thanks so much for listening. I hope you enjoyed it, and have a great swampUP conference.