Kubernetes in Production is Hard!
In the past two years, JFrog moved to deploying and managing JFrog SaaS applications in Kubernetes on the three big public clouds: AWS, GCP and Azure. During this period, we learned a lot of useful and important lessons, some of them the hard way. In this session, Tsofia shares some stories from JFrog’s journey and the (sometimes hard) lessons learned.
VIDEO TRANSCRIPT
Hey everyone, and thanks for joining me today at SwampUP. I’m glad to have you here with me. First, let me introduce myself. I’m Tsofia Tsuriel from JFrog, and I’ve been working at JFrog for more than 3 years now. I started my career as a configuration management engineer, which is a predecessor of what we call today a DevOps engineer. After that, I worked as a C real-time software developer for several years and then joined JFrog. From my very first days at JFrog, I’ve been working on Kubernetes projects, and in 2018 we started to work on the Kubernetes-based solution for the JFrog SaaS application.
I’m very proud to be part of this project. I think it took the JFrog SaaS solution to a whole new level, and it’s a good example of a new-generation SaaS solution based on the Kubernetes platform. Today I’m here to talk to you about our journey in the Kubernetes world, the journey to make it the infrastructure platform for the JFrog SaaS. I’m going to talk about the problems we had to solve, how we solved them using Kubernetes, and what we’ve learned throughout this process, so you can learn from our experience as well.
I’m sure that different companies and different groups might have different experiences, but basically I think the problems we want to solve are common to a lot of companies, and using Kubernetes to solve them is probably a very good, if not the best, existing solution. I will try to give real examples throughout this session, and during the session I will be able to answer your questions in the chat window, so feel free to ask anything you want.
So, first, why use Kubernetes at all? Kubernetes solves real problems. A very recent post on the Stack Overflow blog mentions several reasons why Kubernetes is so popular. First, infrastructure as data. Needed resources are expressed in a very simple way in Kubernetes, using YAML files. Having everything defined in YAML files gives the option to manage them under version control, such as Git, and also makes things very easy from a scalability perspective, since everything can be easily changed and updated in a YAML file. Second, extensibility. There is a big set of existing resources such as StatefulSets, ConfigMaps, Secrets, CronJobs and more, and users can add more types according to their needs. Third, innovation. Over the last few years, there have been 3 or 4 major releases every year with a lot of new features and changes, and as it seems, this is not going to slow down. Fourth, community. Kubernetes is well known for its strong community and is supported by the CNCF. KubeCon is the largest ever open-source event in the world. Octoverse, the annual GitHub survey, shows for 2019 that Kubernetes is one of the top 10 most popular open source projects by contributors, and Helm, which is the Kubernetes package manager, is one of the top 10 fastest growing open source projects by contributors. In the last two years, Kubernetes came in as one of the three most loved platforms by developers, along with Linux and Docker, in an annual survey done by Stack Overflow.
At JFrog, we found some reasons of our own for using Kubernetes. First, “I need a running environment. NOW!” Developers as well as production need high velocity in creating environments for their daily work. This is something that can be easily achieved using Kubernetes. Second, branch integration environments with other products’ branches. For example, an Artifactory developer wants to set up an environment to test his branch with other applications’ versions. In the past, they had to set up a VM, install an OS, set the network configuration, etc. This takes a lot of time.
Again, this is something you can do very easily using Kubernetes. Third, better resource utilization for development and production. In Kubernetes, an application can be started with a low amount of CPU and memory and grow up to a specific limit according to its consumption. This makes it easy for admins to manage their resources and save costs. I will come back to this later in the presentation. Fourth, supporting a new distribution for JFrog products. As mentioned, Kubernetes is a very popular platform, so JFrog must supply installation of its products on Kubernetes for our customers and the community.
Fifth, taking our containerized products to production. Or in other words, the development environment must be as production-like as possible. Having an application in Docker is nice, but we need much more than that, and that’s what a full solution platform like Kubernetes gives. Sixth, being as vendor agnostic as possible. As we must be able to deploy all of our products on the main cloud vendors, we had better have one API for that, and Kubernetes provides an abstraction on top of the vendor with a single API. Seventh, auto-scaling and auto-healing of applications. These are the highlights of Kubernetes, the features that make almost everyone want to use it. And indeed, they are great features that increase the reliability of your application.
For easy management of application deployment on Kubernetes, we decided to use Helm. Helm is a tool for managing Kubernetes packages, called charts, like RPM for CentOS or Red Hat. Helm charts help with defining, installing, and upgrading Kubernetes applications.
JFrog publishes and maintains official Helm charts for all its products in the Helm Hub, for the use of our customers, the community and, of course, our SaaS solution.
So how are we using Kubernetes in our internal environments? For development purposes, we have a CI/CD process that installs JFrog products on Kubernetes. When needed, a developer can initiate this process and specify a branch version to install together with other applications’ master or branch versions.
The deployment process uses the official JFrog Helm charts, and the result is an isolated Kubernetes namespace with all the applications installed. Here, you can see the output of the Jenkins pipeline job with its various installation steps, and in the end the developer gets his dedicated JFrog platform running. For staging and pre-production purposes, we have several managed clusters, at least one on each cloud vendor, and we are running full CI/CD processes of JFrog product installation along with Kubernetes infrastructure and tools, in order to run tests and reveal bugs before upgrading our production environment.
Back then, three years ago, we investigated several options for self-managing Kubernetes clusters in AWS, GCP and Azure. We found that for us, the easiest and most immediate approach would be to use the managed Kubernetes solutions provided by the cloud vendors: EKS, GKE and AKS. We understood that managing the clusters by ourselves requires a lot of resources and skills we don’t have, and that we had better focus on the things we’re good at.
For the past two years, we have had production environments running on EKS, GKE and AKS in various regions with the same API. We still have to deal with cloud vendors’ specific services like load balancers, databases, etc., but moving to Kubernetes reduced the number of vendor-specific APIs we need to manage. A new customer deployment is nowadays a self-service and fully automated process. No need for DevOps engineer intervention. The environment is ready for use within several minutes in any supported cloud region. Hands-free automation.
Talking about having Kubernetes in a production system may sound very romantic to some people. You’re simply running one or two commands, and everything is just running in its place. You have application auto-healing and auto-scaling, so now you’re free to sail into the sunset, preferably with a glass of red wine in your hand. Very soon, you find out this is not really the case. You find your ship in the middle of a stormy sea, struggling against big waves, close to drowning, and you find yourself helpless and terrified. I know this sounds a little bit dramatic, but it is not far from the feeling you have the first time you’re facing a service downtime with your new Kubernetes system. Once you pull yourself together, you start to understand there is work to do to stabilize your ship and make it stronger and more resistant, so that the next time it faces high waves and big storms, and of course there will be a next time, the shock will be smaller and the durability will be better. Let’s now discuss the things we learned and found important during our process of stabilizing our Kubernetes production ship.
Visibility.
The first thing that we learned is that because of the complexity of Kubernetes as a platform, we need visibility into what’s going on in the system, whether it’s for developers, teams, DevOps or anyone else. No more SSH to the server and “get me the logs”. Developers should not need _ access to debug their applications, and furthermore, Docker images do not have basic Linux _ on them. Therefore, we understood that we need detailed logging and monitoring tools that run on the clusters continuously, collect all the logs and metrics, and provide us with reliable visibility of the system’s status, along with all the information developers and support teams need to debug their applications, both real time and historical. At first, we started with a local ELK and Prometheus setup, but very soon, with growing scale, we found that exactly like with the managed Kubernetes clusters, we should go with a professional managed solution for our logging and monitoring. This removed the need to manage the scale of the monitoring and logging systems, which was very time and resource consuming for us.
Dev = Staging = Production. We noticed that many times, when we deployed a new release of one of our products, various functional and performance issues were found only in production.
This always led to a hurried release of fix versions and many sleepless nights. As mentioned before, we have 3 environments, and we learned that we must minimize the differences between those environments as much as possible. The ability to test on production-like environments gives the option to run tests and checks at development time and can minimize our release time, risks and _. Today, developers can deploy a branch to a SaaS environment which is identical to production, for performing functional and non-functional tests with it.
Know your limits. One of the significant things we learned is that we must know our application. This means you need to learn and understand how your application works in terms of resource usage: memory, CPU, databases, permanent storage if needed, and anything else your application needs in order to perform and run efficiently with minimum force. There is always a balance between high utilization and high performance. You can give your application a high amount of resources so it will run very fast and without _, but those resources will probably be idle some of the time and can cost you a lot of money. You need to find the balance, the correct ratio, for your application to run properly and efficiently.
This is also true for the cluster nodes, which have a defined size and a specific amount of CPU and memory. There is also the Kubernetes scheduler, which ensures there are enough resources for all the pods on the node itself. But you have to calculate and understand how many pods you’re going to run and what resources they need, and based on this, plan the cluster size and the node size. Don’t forget that the system _ and components such as the kubelet, kube-proxy, the container engine and other processes also require resources. Each vendor has its own definitions for system services, and this must be considered in your plans and calculations.
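The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the node size, the amount reserved for system components and the pod requests are made-up numbers, and `Resources`, `allocatable`, `pods_per_node` and `nodes_needed` are hypothetical helper names, not part of any Kubernetes API. Real clusters have further constraints (for example, a vendor-specific maximum number of pods per node) that this sketch ignores.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    cpu_millicores: int  # 1000 millicores = 1 CPU core
    memory_mib: int


def allocatable(node_capacity: Resources, system_reserved: Resources) -> Resources:
    """What is left for pods after system components take their share."""
    return Resources(
        node_capacity.cpu_millicores - system_reserved.cpu_millicores,
        node_capacity.memory_mib - system_reserved.memory_mib,
    )


def pods_per_node(node_allocatable: Resources, pod_request: Resources) -> int:
    """How many pods fit on one node; the scarcer resource decides."""
    return min(
        node_allocatable.cpu_millicores // pod_request.cpu_millicores,
        node_allocatable.memory_mib // pod_request.memory_mib,
    )


def nodes_needed(pod_count: int, node_capacity: Resources,
                 system_reserved: Resources, pod_request: Resources) -> int:
    """Minimum number of identical nodes needed to hold pod_count pods."""
    per_node = pods_per_node(allocatable(node_capacity, system_reserved), pod_request)
    if per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / per_node)


# Assumed, illustrative numbers only.
node = Resources(cpu_millicores=4000, memory_mib=16384)     # 4 CPU, 16 GiB
reserved = Resources(cpu_millicores=200, memory_mib=1536)   # system components
pod = Resources(cpu_millicores=500, memory_mib=2048)        # per-pod request

print(pods_per_node(allocatable(node, reserved), pod))  # 7
print(nodes_needed(20, node, reserved, pod))            # 3
```

Note that although 16 GiB divided by 2 GiB suggests 8 pods per node, only 7 fit once the reserved resources are subtracted, which is exactly why the system components have to be part of the plan.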
Pod priority and pod quality of service. A funny story about pod eviction is that one day, in the very beginning, we found dozens of pods of the same application in the evicted state, and we panicked, since it was hard to see that there was one pod that was actually running. That was the point when we learned about the existence of the pod eviction state. Thanks to auto-healing and auto-scaling, the customer didn’t experience any downtime.
With respect to the resource limits mentioned on the previous slide, one of the major issues with applications is the resources they are using. The Kubernetes scheduler is the component that’s responsible for finding the best node for a pod to run on, considering its resource needs, if they are defined. There are 3 quality of service classes in Kubernetes. The first is guaranteed. All of the pod’s containers specify the same value for CPU requests and limits and for memory requests and limits. This option is the safest one, as Kubernetes promises to schedule the pod on a node with all the needed resources, and it protects the pod from being evicted in case the node is running out of memory. On the other hand, it might be risky in terms of node resources if you are setting requests and limits without preliminary tuning.
The second is burstable. The pod is not guaranteed, but at least one of its containers has specified a memory or a CPU request. Burstable is good for most pods, but it is not totally safe, as your pod might be evicted by Kubernetes in some cases. The last is best effort. The pod doesn’t specify any CPU or memory requests and limits. This kind of pod should be avoided, as it is the first one to be evicted, and it might be very risky, as CPU and memory can grow with no limits. We found the guaranteed option to be the best in the very early setup stage, when you want to stabilize your system and you are still learning your application and its needs, but it is the most expensive option.
After some time, when you have tuned your application, done optimization and reset the resources, you can start to use the pod priority mechanism, which defines which applications are the most important for you when there is an eviction process. For example, when we started, we loaded a lot of applications and none of them had a defined pod priority. Soon, we found that Artifactory, which is the heart of all the services and which a lot of pods depend on, should have a higher priority than a log collector pod that can stop and start from the same place after a _. Another nice example of improving service availability by using pod priority is RabbitMQ. When RabbitMQ runs as a cluster in Kubernetes, there are some interesting faults that happen as a result of RabbitMQ pods leaving and joining the cluster frequently.
Once we added pod priority to RabbitMQ, it became much more stable, and the faults related to the RabbitMQ cluster were reduced dramatically.
Zero downtime upgrades. As a SaaS, we should aim to minimize downtime of the service for any reason. An application version upgrade is something we should be able to run whenever we want. For example, an urgent security patch of Artifactory must be installed on customers’ production environments, and we can’t delay or wait for a maintenance window. Having the application running in high-availability mode in several load-balanced pods eliminates the risk of downtime when upgrading the environments. With Kubernetes you have the option of a rolling update. So, at a specific time, only one pod is going down for an upgrade, and the others are still functional and will be upgraded in their turn. An important point to mention is that your application must have HA ability, meaning the option to add or remove a service pod smoothly without issues, and again, for this you must know your application and change whatever is needed to support it.
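The effect of a rolling update on availability can be illustrated with a toy model in Python. This is a sketch under assumptions, not the algorithm the Kubernetes Deployment controller actually runs: it ignores surge pods, readiness checks and failures, and the `rolling_update` function and its `max_unavailable` parameter are names made up for this example.

```python
from typing import List, Tuple


def rolling_update(replicas: int, old: str = "v1", new: str = "v2",
                   max_unavailable: int = 1) -> Tuple[List[str], int]:
    """Upgrade pods in batches and report the lowest number of pods
    that were still serving traffic at any point during the upgrade."""
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")

    pods = [old] * replicas
    min_serving = replicas
    start = 0
    while start < replicas:
        # Pods in this batch are taken down together for the upgrade.
        batch = range(start, min(start + max_unavailable, replicas))
        min_serving = min(min_serving, replicas - len(batch))
        for index in batch:
            pods[index] = new  # the pod comes back on the new version
        start += max_unavailable
    return pods, min_serving


final_pods, min_serving = rolling_update(replicas=3)
print(final_pods)   # ['v2', 'v2', 'v2']
print(min_serving)  # 2

# With a single replica there is a moment with nothing serving traffic.
print(rolling_update(replicas=1)[1])  # 0
```

With three replicas, two pods are always available while the third is being upgraded. With a single replica, the same process leaves a gap with zero serving pods, which is why the application has to be truly HA before a rolling update can give zero downtime.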
In addition, there are also Kubernetes version upgrades, cloud vendor upgrades, and tool upgrades that must run without interrupting application availability. One great mechanism in Kubernetes is the probes, which help you ensure the application is indeed functional and, if not, restart it. A readiness probe is an automatic mechanism to know when a container is ready to start accepting traffic. A liveness probe is a mechanism to know when to restart a container. With respect to the upgrade process, if, for example, you are upgrading RDS and a re-connection to the database is needed, the liveness probe will identify the faulty state of the application and will restart it, so the re-connection will be done automatically.
Security. At JFrog, we love to eat our own dog food and use our products in our daily work and processes.
Having said that, it is clear that the best way to manage our Docker images and Helm charts is using Artifactory. We are using an internal Artifactory server as a Docker repo and Helm repo, and during the deployment process, everything the deployment needs is fetched from Artifactory. So, we have full control and visibility of what’s running in our development and production Kubernetes clusters. On top of that, Xray runs and scans all the 3rd party Docker images stored in that Artifactory, so only scanned and approved images find their way to the Kubernetes cluster. This ensures there is no unapproved code running on our cluster.
Continuous learning. This is one of the main messages I would like you to take from this talk: the understanding that the learning process, of Kubernetes itself and of your system, is never ending. Kubernetes as a technology is new for almost everyone. Developers and Ops teams must learn how it works and how to use it, and learn the best practices and recommendations which are available all over the net. Learn how to look at logs and metrics and how to gain the information required to debug applications. You should always continue to tune your infrastructure and applications and their behavior. Reviewing metrics from the past is very important. You must check how your application behaves over periods of time. Take a specific process or behavior that hurts your system, such as a badly performing process or a process with high network consumption, and investigate it over a day, a week, two weeks, a month, etc., and see if there is any pattern repeating itself. As an example, in one of our proactive checks we found CPU increasing over several weeks, and that revealed the _ that caused CPU load problems and memory errors. This is something you can’t find when reviewing a day or a single week. Know that today, there are also great machine learning and APM tools that help you find this kind of issue or, even better, alert you about it according to predefined rules.
Don’t take any choice you made in the past for granted. Take into account that at the time you made that decision, you didn’t know the things you know now. Since then you have experienced a thing or two, and now you have better knowledge of and confidence in the system that needs to be handled. Keep asking yourself if it makes sense, and don’t be afraid to change, if needed.
Finally, I’d like to recap the main points I mentioned in this talk. If that’s the best for you, use managed Kubernetes, and always stay focused on what you’re good at. Visibility, visibility, visibility: the key to debugging and controlling your system. Don’t be afraid to use managed solutions. All environments should be as similar as possible, to reduce risks and potential bugs in production.
Know your limits. Learn your application and set resources and pod priorities accordingly. Take action in order to have zero downtime upgrades. Change your application to be really HA. Run rolling upgrades and use Kubernetes probes. Think about security, and try to minimize potential risks that may come from your Docker images. Expect failures. Be mentally prepared for them, and try to be as proactive as you can. Don’t forget that you must always continue to learn your system, tune your configuration and metrics, and re-validate your choices.
I hope that each and every one of you will find one or two of these lessons relevant for your organization and take them back to improve your Kubernetes production experience. The Kubernetes journey probably never ends, but I hope that this talk convinced you that Kubernetes in production is actually not that hard. It is possible, it is fun and, as we always say in the Swamp, once you leap forward, you can’t go back.