Get Control Over Your Microservice Sprawl

Tracy Ragan
CEO DeployHub, CDF Board Member

Business agility achieved through fault tolerance and auto-scaling is the promise of a modern Kubernetes architecture.

And a service-oriented approach is at the center of that promise. Microservices have the potential to revolutionize the way we develop software, but we must manage their use to control their sprawl.

This is the focus of Ortelius, a new CDF incubating project that puts organization and control into the use of microservices. Ortelius delivers a centralized catalog of microservice configuration management metadata that allows you to see microservice owners, usage, relationships and ‘blast radius’ even before you deploy with automated updates via your CD Pipeline.

Video Transcript

Hello and welcome to SwampUP. We’re going to talk today about getting control over your microservice sprawl, one of my very favorite topics. I am Tracy Ragan, and I am somewhat of a microservice evangelist. I’m super passionate about configuration management and continuous deployment. I’m also the CEO of DeployHub and its co-founder. I was recently recognized by TechBeacon in DevOps, which is pretty cool. I’m on the board of the Continuous Delivery Foundation and actually served on the board of the Eclipse Foundation for a good time. I’m also a DevOps Institute ambassador and the community director for the Ortelius open source project, which we’re going to talk about a little bit today.

 My hobbies: big horses, dogs, and I like martial arts. So we’re starting to really see an emergence of microservices. And they do require some new DevOps solutions. This is a massive shift in how we write and deliver software. One of the trends we’re beginning to see is the emergence of a service catalog around microservices. There’s a huge market potentially for that, 500 million right now, which is pretty big for kind of an emerging technology. And we’re also seeing sort of a convergence of continuous delivery, GitOps, and what we like to call experimentation vendors.

 This information is in Tyler Jewell’s article around trends and foretelling new approaches to DevOps. We also see an increase in the use of containers. The CNCF did a study last year, and that study indicated that, based on their users, 92% of them say that they’re using containers in production now, and that’s up from about 84% last year. So we’re seeing a continued increase in the use of containers.

 If those containers are moving to microservice, we’re still not clear yet. But we do know that Kubernetes is being used in production about 83%, up from about 78% in 2019. So the emergence of, you know, the containers in production or Kubernetes in production is going to indicate that microservice is going to be the next big shift.

 I like to use this article that was written by SD Times back in 2019 called Microservices – More Isn’t Always Better. Alexandra Noonan was interviewed, and she talked about her sprawl of microservices. And she said, what happened is we would release an update to a library and then one service would use the new version and now all of a sudden, all these other services were using an older version, and we had to try to keep track of which service was using what version of the library. And Randy Heffner says that microservice development falls apart without a coherent, disciplined, managed approach. So microservices are complex, and that’s because we’re taking our static application and we’re breaking it into smaller puzzle pieces. So let’s kind of look through what the problem really is. And in order to really understand the problem of microservice sprawl, we have to understand how we got here. So we have to look at our monolithic practices compared to a microservice implementation or a microservice architecture. Our monolithic applications are configured and statically linked, and that’s basically where we start this journey.

 Most of you have somebody who’s like a build manager, and that build manager creates build scripts, the static configuration scripts that figure out how to compile and link and what to use in your compile and link process. Super, super critical position. When they do that, they look at version control, and they decide what should be used.

 They worry about, you know, pull requests and merging and pulling out a main line and recompiling that main line. And that’s in essence what CI does. It looks at these build scripts and says, I’ve got something new in Git, I’m going to pull it out and I’m going to automatically run my build scripts. At that point, you might call something like Artifactory to do some scanning and look for transitive dependencies, for example, but at the end of the day, what this whole build step did was create an application binary and assign it a version, and it may have used the build number as the version, which is a completely logical thing to do.

 If you’re good at what you did, you generated a bill of material (BOM) report that showed what libraries were included, what you used. And if you’re really good, you produced a difference report: you compared your two BOM reports and said, this is the difference between the previous build and the build we just did.
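A difference report of this kind can be sketched in a few lines. This is only an illustration under stated assumptions: a BOM is modeled as a plain mapping from library name to version, and the function name `diff_bom`, the library names, and the version numbers are all hypothetical rather than taken from any particular build tool.

```python
def diff_bom(previous: dict[str, str], current: dict[str, str]) -> dict[str, dict]:
    """Compare two bill-of-material reports (library name -> version)."""
    # Libraries that appear only in the newer build
    added = {name: ver for name, ver in current.items() if name not in previous}
    # Libraries that were dropped since the previous build
    removed = {name: ver for name, ver in previous.items() if name not in current}
    # Libraries present in both builds but at different versions: (old, new)
    changed = {
        name: (previous[name], current[name])
        for name in previous.keys() & current.keys()
        if previous[name] != current[name]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical BOMs for two consecutive builds of a monolith
build_41 = {"libssl": "1.1.1", "libjson": "2.3.0", "libxml": "2.9.1"}
build_42 = {"libssl": "1.1.2", "libjson": "2.3.0", "libyaml": "0.2.5"}

report = diff_bom(build_41, build_42)
for section, entries in report.items():
    print(section, entries)
```

In a monolith the two inputs come straight out of the build, which is why this report is easy to produce there and hard to produce once linking moves to runtime.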

 All of this is our North Star in our monolithic approach, and it’s the beginning of the CI pipeline. And once this is done, you have a release candidate that may or may not be pushed across the CD pipeline. But you’re doing this on a regular, high-frequency basis to make sure that any change that is being added to Git and is ready to be compiled hasn’t broken the build. So how do we know we didn’t break the build on a monolith in a microservices world?

 That’s really the trick. So let’s think about how a microservice developer interacts in this world. Micro services are built and deployed independently. That’s the whole point, you’re going to decompose your application and you’re going to deploy these objects independently, and this is what creates the complexity.

 Breaking up is hard to do, and never more so than when going from monoliths to microservices. Once a microservice developer has completed his code, he’s built a 300- or 400-line Python program or updated it, he’s going to create a new container, and he’s going to register that container to a container registry. Once that is in the container registry, it’s sort of ready to go. It’s pushed into dev, or it’s pushed into test, or it’s pushed into production environments. And this is where the CD pipeline then picks up to take that container and push it to a testing environment or a production environment.

 Now, if you’re still doing monolithic, and your container includes your entire monolithic application, you’re not really doing a microservice environment; you’re really running a monolith, and the containers are treated the same way in the CD pipeline. In a microservices world, that’s different, because the microservices are moving independently of your application.

 And this is the beginning of our problem. It’s driven by the fact that the monolithic link step, the step that statically links all of the external libraries we’re going to use, is replaced with APIs, and that linking is done at runtime. So think about it this way. We are taking our monolithic application, and we’re breaking it up into smaller reusable components that can be shared. That’s the whole point. And we’re linking them dynamically at runtime. So we have some challenges to get over because of that. So what’s the impact?

 Well, first of all, we don’t have an application version anymore. It’s logical at best. We don’t have an application BOM report, so it’s hard to know what version of a particular microservice your application may be consuming. And it’s certainly difficult to generate a difference report that says: the last time we built this, this is what it looked like; now today, we recompiled it, this is the new BOM report, and these are the differences.

 This lack of visibility actually creates some bottlenecks for high-frequency updates. I’m hearing so often that SREs feel like they’re flying blind when they’re releasing a microservice because they just don’t know exactly what the impact’s going to be. So there’s some hesitancy. And while we can push things out on a high-frequency basis, what we end up doing is waiting for it to be deployed to an environment, waiting for an incident report, and then using observability tools to try to track the transactions and sort out what happened.

 Microservice sprawl is another impact of taking our monolithic kind of approach and moving it into a cloud native environment. And the reason why we’re doing this sprawl is to solve some of the problems that I just described. But in essence, we don’t have any understanding of impact. We don’t know what our blast radius is of a micro service. We don’t know if one microservice impacts one application or 20 applications if you’re an enterprise. And that is a problem.

 So let’s take the heart of the matter here. What is it that we’re trying to do and why is it difficult? First of all, SOAs are intended to leverage shared services, not copied services. Which means our monolithic habits have to go away because monolithic habits actually discourage sharing.

 We tried to do object oriented programming some time back and what we ended up doing is creating… you know we have this concept of a dynamic link library or DLL that was supposed to be able to be shared across multiple applications. But we didn’t want to do that. Instead, what we did was we brought that library into our environment and renamed it. And in fact, if you think about how Microsoft, you know, Visual Studio works, it will rename it for you. So that was the way we solved the problem, we… we wanted to use our own version, and we didn’t trust to have a shared version. So shared services do present a new challenge because the concept is if you’re sharing the service, if you update a particular service, everybody gets the fix at the same time, instead of recompiling everything in order to get it out the door.

 That is the beauty of shared SOAs and it’s the beauty of microservices. If you have a security problem, you fix it in one place and instantly everybody who’s consuming it gets that fix. There’s no recompile needed. That is the ultimate in business agility.

 But with this new world, there are no insights into the application and service relationships unless you wait until you get out to the cluster and look at that based on transactions. And that’s not even going to really show you a BOM report, what your application consumes, and certainly doesn’t give you impact analysis, if one of those services are going to be updated, you might want to know that impact before it goes out the door. So it is our monolithic habits that create the sprawl, and there are human elements to this as much as there are actual tool elements. And in our monolithic world, we really relied on ownership.

 We want our own versions of the libraries. You know, you hear things like: I don’t want unexpected updates; our team decides when to deploy a new service, even if we don’t own it. Ownership, that is a monolithic habit. Lack of collaboration: our team wasn’t told a new service was available, nobody told us. And collaboration is critical in an SOA environment, because you need to have a way to communicate when new features and new services are going out the door.

 Consistency. You know, we used to hear, it worked on my machine. Now we hear, it worked on my cluster. So we don’t need the new version, since it worked on our cluster, we’re going to keep that version and move that version forward. And validation.

 We’ve never tested our application with that particular version of a service and this goes back to our trust issue. We have to be able to trust that somebody who wrote a micro service is going to make sure that it’s going to agree to those contracts that we have in testing. So let’s see what it looks like in action. Not everybody is moving in a Kubernetes or a cloud native environment like what I’m showing here today, but I’m seeing it more than… more than I thought I would.

 This is basically a monolithic approach to building out your applications running in a clustered environment. So in this case, we have a fictional store called the online store company.

 There’s the candy store, there’s the clothing store, and the toy store that we manage in our online store company. We have shared services: we have the cart service, we have the shipping service, and we have the payment service. Now because all three teams are not comfortable with trusting when a shared service has been updated, they’ve asked for their own clusters, for their applications to run in their own private clusters. So for production, we have three clusters with 12 pods and 12 siloed services, and the shipping service has drifted.

 Now in our case here, the candy store and toy store are using the same shipping service, but the clothing store opted out or did not know to update the shipping service. So this is sprawl and drift. So instead of everybody reusing things and picking up the most recent version, we have siloed them.
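The drift in this picture can be made concrete with a small check. The sketch below is illustrative only: the cluster names and version numbers are invented for the example, and `find_drift` is a hypothetical helper, not part of any tool mentioned in this talk.

```python
from collections import defaultdict


def find_drift(clusters: dict[str, dict[str, str]]) -> dict[str, dict[str, list[str]]]:
    """Return every service that runs at more than one version,
    along with the clusters running each version."""
    versions = defaultdict(lambda: defaultdict(list))
    for cluster, services in clusters.items():
        for service, version in services.items():
            versions[service][version].append(cluster)
    # A service has drifted when more than one version is in use
    return {
        service: {ver: sorted(names) for ver, names in by_version.items()}
        for service, by_version in versions.items()
        if len(by_version) > 1
    }


# Hypothetical production inventory: one siloed cluster per store
production = {
    "candy-store": {"cart": "1.0", "payment": "1.0", "shipping": "1.1"},
    "clothing-store": {"cart": "1.0", "payment": "1.0", "shipping": "1.0"},
    "toy-store": {"cart": "1.0", "payment": "1.0", "shipping": "1.1"},
}

drift = find_drift(production)
print(drift)
```

With this sample data only the shipping service is reported, because the clothing store is the one cluster still on the older version.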

 This is a monolithic approach in a cloud native environment. It creates a lot more work. Now we have 12 of these pods, and if you multiply that by your dev, test and prod, you’re managing 36, and your applications are really only using three shared services. So the sprawl happens really quickly. So it’s a balancing act, right?

 How do we balance the risk and control? Control is given to us when we silo these clusters, because it gives us a high level of control over what we’re going to use as a service. But the risk is that if one of those services is updated, you create drift, which means critical fixes may not be consumed by all applications, when they easily could be in a service-oriented architecture using microservices. So we have to figure out how to manage the balancing act. And part of managing that balancing act is changing some of our internal culture, because culture matters, as well as introducing some new tools to foster building out an SOA with some insights. So first of all, trusting the use of external team services is critical. We have to get to a point where application teams can trust using a microservice that they have no control over. And they have to have a culture of sharing, sharing components and collaborating around those components.

 You know, we’ve worked on improving our collaboration, and, you know, companies like GitHub have talked about collaboration for quite a long time. But in an SOA environment, we do have to know: if the person on the third floor wrote a really amazing security routine, how do we tell the person on the eighth floor that their application should be consuming it? And we have to think about domain-driven design. I throw this out because if you’re really learning and trying to understand how to build out an SOA, DDD is something you should be looking at.

 We learned about this in object oriented programming, we didn’t implement it very well. We have to at this point, we have to start understanding our microservices based on solution spaces because this is going to help with the sharing and collaboration and the understanding, if you’re an application team, there is a great shipping service out there that you should use and you should rely on that micro service developer to make sure it’s going to perform in the way it should. And then new tooling, new tooling is required, we see tons of new tooling coming out around managing Kubernetes.

 We’re going to start seeing more tooling about managing microservices, the service catalogs, and how to really start getting our arms around an SOA architecture where microservices can be leveraged. So think about building this as opposed to the previous picture. And we have one cluster and we can assume this is prod, test, or Dev and in that one cluster, we have the three stores, we have the candy, clothing and toy store. And they are running in their own namespaces and they have full control over what happens in those namespaces.

 They are communicating however, outside of their namespace into the shared service namespace. And that is where they’re picking up the shipping, the payment and the cart service. Now in this configuration, we have one cluster, six pods, and four namespaces and we have no drift. That’s the critical part, no drift.
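The pod and namespace counts quoted for the two layouts follow from simple arithmetic. The sketch below reproduces them under one assumption that the talk implies but does not spell out: each store front end and each service runs as exactly one pod.

```python
STORES = 3           # candy, clothing, toy
SHARED_SERVICES = 3  # cart, shipping, payment
ENVIRONMENTS = 3     # dev, test, prod

# Siloed layout: one cluster per store, and each cluster holds the store
# plus its own private copy of every shared service.
siloed_pods_per_env = STORES * (1 + SHARED_SERVICES)
siloed_pods_total = siloed_pods_per_env * ENVIRONMENTS

# Shared layout: one cluster per environment, one namespace per store
# plus a single namespace for the shared services.
shared_pods_per_env = STORES + SHARED_SERVICES
shared_namespaces_per_env = STORES + 1
shared_pods_total = shared_pods_per_env * ENVIRONMENTS
shared_namespaces_total = shared_namespaces_per_env * ENVIRONMENTS

print("siloed:", siloed_pods_per_env, "pods per environment,", siloed_pods_total, "in total")
print("shared:", shared_pods_per_env, "pods per environment,", shared_pods_total, "in total")
print("shared namespaces:", shared_namespaces_per_env, "per environment,", shared_namespaces_total, "in total")
```

The siloed count grows with stores times services, while the shared count grows with stores plus services, which is why the gap widens as more applications are added.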

 Micro service sprawl creates drift and drift can be a dangerous situation. You know, it makes it sound dramatic saying dangerous, but everybody has been there when you have a library that needs to be recompiled and relinked and everybody has to move to that new version of that library because it is a critical update that we need and everybody is consuming it. That is a monolithic approach. In a microservice SOA approach, all you’re doing is updating it once, everybody gets it. So if we add Dev, test and prod to this, we have three clusters, 18 pods, and 12 namespaces. So we talked about tooling, I’m going to talk to you a little bit about Ortelius, which is an open source project that’s incubating at the CD foundation.

 I have the honor of being the community director for this project. We’re super excited about its progress, and we have a ton of really, really motivated committers that understand this problem and are really working to solve microservice management and cataloging. So Ortelius tracks what I like to call a logical collection of services, creating a logical view of your application, and it adds versioning to that. So you know, we talk about monolithic, and I like to talk about it as a massive puzzle that we laminate and push through the lifecycle. And microservices don’t do that.

 Microservices take one puzzle piece and push it through the lifecycle. So we have to start understanding how these pieces and parts work together. And that in essence is the goal of Ortelius. Ortelius tracks the blast radius: it knows the impact, it knows if a microservice has an update and is coming across the pipeline, and who it’s going to impact. That is a critical part of an SOA. So as soon as a container is registered, it’s going to tell you who it’s going to impact; you don’t have to wait for it to be deployed.
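The blast radius idea reduces to a reverse lookup: given which services each application consumes, find every application that consumes the service being updated. The sketch below is a minimal illustration and does not show how Ortelius stores or queries this data; the function `blast_radius`, the application names, and the extra `gift-card-portal` application are hypothetical.

```python
def blast_radius(applications: dict[str, set[str]], service: str) -> list[str]:
    """List the applications that consume the given service."""
    return sorted(app for app, consumed in applications.items() if service in consumed)


# Hypothetical catalog: application name -> services it consumes
applications = {
    "candy-store": {"cart", "shipping", "payment"},
    "clothing-store": {"cart", "shipping", "payment"},
    "toy-store": {"cart", "shipping", "payment"},
    "gift-card-portal": {"payment"},  # invented application, for contrast
}

print(blast_radius(applications, "cart"))     # applications touched by a cart update
print(blast_radius(applications, "payment"))  # applications touched by a payment update
```

Because the lookup needs only the catalog and not a running cluster, the answer is available as soon as a new container is registered.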

 Now Ortelius also uses this concept of domains. If you’re working on a domain driven design, it’s going to allow you to build out a catalog where microservice developers can publish that micro service to a catalog, which means that there’s some collaboration for application teams, because they can look to see what’s been published and consume it.

 It restores control by creating these logical application views. So think about how, in a monolithic world, you take an application and you create a package. That package in a microservices world is going to include the independently deployed components or microservices that make up that application package. Okay? That’s the base version. If one of those services is updated, you have a new version of your application. And if one of those services is consumed by three different applications, you have three new application versions. So Ortelius really works on restoring control by tracking these logical application versions.
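The rule described here (a service update creates a new logical version of every application that consumes it) can be sketched as follows. This is a toy model under stated assumptions, not the Ortelius data model or API: the `Catalog` class, its method names, and the version numbers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    # application name -> ordered list of logical versions;
    # each logical version maps service name -> service version
    applications: dict[str, list[dict[str, str]]] = field(default_factory=dict)

    def add_base_version(self, app: str, services: dict[str, str]) -> None:
        """Record the base version: the package of services the application uses."""
        self.applications[app] = [dict(services)]

    def register_service(self, service: str, version: str) -> list[str]:
        """Record a new service version and create a new logical version
        of every application that consumes the service."""
        impacted = []
        for app, versions in self.applications.items():
            latest = versions[-1]
            if service in latest and latest[service] != version:
                versions.append({**latest, service: version})
                impacted.append(app)
        return impacted


catalog = Catalog()
for store in ("candy-store", "clothing-store", "toy-store"):
    catalog.add_base_version(store, {"cart": "1.0", "shipping": "1.0", "payment": "1.0"})

impacted = catalog.register_service("cart", "1.1")
print(impacted)
print(len(catalog.applications["toy-store"]), "logical versions of toy-store")
```

One cart update yields three new application versions here, one per consuming store, while each store's base version is kept for later comparison.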

 It gives you a BOM report, and it gives you DIFF reports, and it gives you your blast radius report. So what we’re doing is we’re adding that visibility, not observability, into a microservice environment. So if you think about the difference, let’s just talk about this for a minute.

 Ortelius does not run in your cluster. Ortelius sits on top of your cluster, integrates into the CD pipeline, and does automated configuration management to grab this kind of information. In other words, it’s reimagined CI. Observability runs in a cluster and it’s good at showing you what your transactions look like. The goal of Ortelius is to provide you some high-level information so you can see who the usual suspect is, and the usual suspect is usually the last thing updated. And you can see that through a difference report. So let’s just think about how Ortelius works. Ortelius is called at the point in time that a container registration occurs. Once that registration occurs, it triggers Ortelius to version the microservice, track the application configuration management, and build a relationship map. Now, the way it’s primed is that an application team creates an application base version and creates a package. And that package has all of the services that the application is going to use in its base version.

 Once a new container has been registered, we know that it’s going to impact anybody who consumes the last version of that container, and we start building those maps. This is the essence of how it starts tracking and building this data, our metadata of deployment information about how your microservices are being consumed and what your application looks like.

 Once that occurs, it can trigger the deployment or it can be triggered to call your Helm chart, for example. And it does that so that we can get return feedback. We need to be able to see what occurred at the cluster level so we can report it back. So we’re reporting the inventory of the microservice across all clusters that it has been deployed to.

 Now why is that important? It’s important because every cluster can look different, every cluster can look different, and every cluster can have a different collection of microservices. So every cluster has a different version of your application. This is not new, we’ve done this for Dev, test or prod for a long time.
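The point that each cluster effectively runs its own version of the application can be illustrated by matching a cluster's reported inventory against the known logical application versions. This is a sketch only: `running_version`, the inventories, and the version numbers are hypothetical, and how the inventory is actually collected from a cluster is outside what the example shows.

```python
from typing import Optional


def running_version(app_versions: list[dict[str, str]], deployed: dict[str, str]) -> Optional[int]:
    """Return the 1-based number of the logical application version whose
    services all match what is deployed in a cluster, or None if none match."""
    for number, services in enumerate(app_versions, start=1):
        if all(deployed.get(name) == version for name, version in services.items()):
            return number
    return None


# Hypothetical logical versions of the candy store application
candy_store_versions = [
    {"cart": "1.0", "shipping": "1.0", "payment": "1.0"},  # base version
    {"cart": "1.1", "shipping": "1.0", "payment": "1.0"},  # after a cart update
]

# Hypothetical inventory reported back from two clusters
dev_inventory = {"cart": "1.1", "shipping": "1.0", "payment": "1.0"}
prod_inventory = {"cart": "1.0", "shipping": "1.0", "payment": "1.0"}

print("dev runs application version", running_version(candy_store_versions, dev_inventory))
print("prod runs application version", running_version(candy_store_versions, prod_inventory))
```

A result of None would mean the cluster holds a combination of services that was never recorded as an application version, which is itself a useful signal.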

 That’s the whole point of having these different stages, because we have different versions of the application. And Ortelius tracks this information by creating that feedback loop that says, we know that these are installed into this cluster, so we know what version of the application is running at any point in the life cycle. Tricky. It also has a full set of APIs so that you can call into it, and it keeps all of its magic, all of its versioning, all of its relationship maps, and its domains in a Postgres database. So just to kind of explore the kinds of maps that it does create: we called it Ortelius after Abraham Ortelius, who created the first world atlas. And he did so in a really cool way.

 He was sort of our first open source leader, because he understood that there were people all over Europe at the time who were sailing and building out maps of the world, and he pulled them together, all these cartographers, and created the first world atlas.

 He didn’t take credit. If you look at some of his old maps, you can see where the different individuals who have created parts of the map or have kind of signed the maps. So he pulled together the knowledge of many people to build out our first world Atlas. And in many ways, that’s what Ortelius is doing.

 It’s pulling together information about these different microservices, where they’re deployed and thus, creating maps. You know, if you really want to think about it, we’re taking a giant cluster and we’re versioning every point in that cluster. And what we get are BOM reports at an application version, we can show the differences between those application versions.

 In this case, down here, we’re showing that the cart service was updated, and when it was, it impacted three different applications. Now, this kind of data can easily be passed to the rest of the pipeline. So for example, if this cart service has changed, we need to execute the test workflows for these three different applications. And in essence, what we’re doing again is reimagining CI and understanding how the pieces fit together, so that we know that we have to notify these other application teams that something in their world is changing and they should know about it. So, the results. In feedback from our early adopters, they say they are seeing a reduction in the sprawl of their microservices and in redundant coding of up to about 50%. And they’re saying that this automated configuration management saves one to two hours of manual work per deployment.

 Now, if you’re doing one or two deployments a week, that may not be too much, but if you’re doing them every day, or if you want to get to doing them every hour, that’s a big number. SREs are now given the visibility to make some data driven decisions before they do a deployment like the blast radius. And we really are driving these high frequency updates built into existing CD pipelines so that we can connect into the CD pipeline of your choice to evolve it to start managing the concept of microservices and doing that automated configuration management to get those reports. And that’s the end, we are about out of time.

 Let’s continue the conversation. You can reach me at Tracy@DeployHub.com, my Twitter account is @TracyRagan, and you can find me at Tracy-Ragan-OMS. Go to the Ortelius website, Ortelius.io, and you can check us out on GitHub. And if you’re interested in contributing, we would love to have you.

 This is a big problem. We need as much input from SREs and developers and testers and operations folks as we can to really start pulling in the kind of data that we need to really create a solid hub of deployment metadata that keeps us all pointing towards the North Star. And that is Ortelius open source development in Google Groups.

 Thank you so much for listening. And hey, the JFrog folks, thank you for having me again. I love doing these.

 I love you all. Thanks. Bye bye.
