Kubernetes in production is hard! [swampUP 2020]

Tsofia Tsuriel, DevOps Engineer @Jfrog

July 7, 2020

< 1 min read

Here is JFrog’s journey to Kubernetes: https://jfrog.com/whitepaper/%20the-j…
In the past two years, we moved to deploying and managing JFrog SaaS applications in Kubernetes on the three big public clouds: AWS, GCP and Azure. During this period, we learned a lot of useful and important lessons, some of them the hard way. In this session, I want to share with you some stories from our journey and the (sometimes hard) lessons learned.

Video Transcript

Hey everyone and thanks for joining me today at SwampUP,
I’m glad to have you here with me.
First, let me introduce myself.
I’m Tsofia Tsuriel from JFrog
and I’ve been working at JFrog for more than 3 years now.
I started my career as a configuration management engineer,
which is a predecessor of what we call today a DevOps engineer.
After that, I worked as a real-time C software developer for several years
and then joined JFrog.
From my very first days at JFrog, I’ve been working on the Kubernetes project,
and in 2018
we started to work on the Kubernetes-based solution
for the JFrog SaaS application.
I’m very proud to say I’m a part of this project in the company.
I think it took the JFrog SaaS solution to a whole new level,
and it’s a good example of a new-generation SaaS solution based on the Kubernetes platform.
Today I’m here to talk to you about our journey in the Kubernetes world,
the journey to make it the infrastructure platform for the JFrog SaaS.
I’m going to talk to you about the problems we had to solve,
how we solved them using Kubernetes,
what we’ve learned throughout this process,
so you can learn from our experience as well.
I’m sure that different companies and different groups
might have different experiences, but basically
I think the problems we want to solve are common to a lot of
companies, and using Kubernetes to resolve them is probably
a very good, if not the best, existing solution.
I will try to give real examples throughout this session
and during this session, I will be able to answer your questions in the chat window
so feel free to ask anything you want.
So, first, let’s discuss: why use Kubernetes at all?
Kubernetes solves real problems.
A very recent blog post on Stack Overflow
mentions several reasons
why Kubernetes is so popular.
First, infrastructure as data.
Needed resources are expressed in a very simple way in Kubernetes,
using YAML.
Having everything defined in YAML files
gives the option to manage them
under version control, such as Git,
and also makes it very easy from a scalability perspective,
since everything can be easily changed and updated in a YAML file.
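To make the "infrastructure as data" point concrete, here is a minimal Python sketch that is not part of the talk. It treats a Deployment manifest as plain data and shows that scaling is only a change to one field. The application name, image and replica counts are hypothetical, and the output is JSON, which Kubernetes accepts alongside YAML.

```python
import copy
import json

# Hypothetical Deployment manifest expressed as plain data.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "demo-app"},
    "spec": {
        "replicas": 2,
        "selector": {"matchLabels": {"app": "demo-app"}},
        "template": {
            "metadata": {"labels": {"app": "demo-app"}},
            "spec": {
                "containers": [
                    {"name": "demo-app",
                     "image": "registry.example.com/demo-app:1.0.0"}
                ]
            },
        },
    },
}


def scaled(manifest: dict, replicas: int) -> dict:
    """Return a copy of the manifest with a new replica count."""
    updated = copy.deepcopy(manifest)
    updated["spec"]["replicas"] = replicas
    return updated


if __name__ == "__main__":
    # The result can be committed to Git and applied like any other manifest.
    print(json.dumps(scaled(deployment, 5), indent=2))
```

Because the change is a plain data edit, it can be reviewed and versioned like any other file change.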
Second, extensibility.
There is a big set of existing resources such as _, _
_, _ and more
and users can add more types according to their needs.
Third, innovation.
Over the last few years,
there have been 3 or 4 major releases every year
with a lot of new features and changes,
and it seems this is not going to slow down.
Fourth, community.
Kubernetes is well known for its strong community
and is supported by the CNCF.
KubeCon is the largest ever open-source event in the world.
The State of the Octoverse, the annual GitHub survey from 2019,
shows that Kubernetes is one of the top 10 most popular open source projects by contributors,
and Helm, which is the Kubernetes package manager,
is one of the top 10 fastest growing open source projects by contributors.
In the last two years, Kubernetes has come in
as one of the three most loved platforms by developers,
along with Linux and Docker,
in an annual survey done by Stack Overflow.
At JFrog, we found several reasons why we should use Kubernetes.
First, I need a running environment. NOW!
Developers as well as production
need high _ of creating environments for their daily work.
This is something that can be easily reached using Kubernetes.
Second, a branch integration environment with other products’ branches.
For example,
an Artifactory developer wants to set up an environment to test his branch
with other applications’ versions.
In the past,
they had to set up a VM, install an OS,
set network configuration, etc.
This takes a lot of time.
Again, this is something you can
do very easily using Kubernetes.
Third, better resource utilization for dev and production.
In Kubernetes, an application can be started with a low amount of CPU and memory
and increase up to a specific limit
according to the consumption.
It makes it easy for admins to manage their resources and save costs.
I will come back to this later in the presentation.
Fourth, support a new distribution for JFrog products.
As mentioned,
Kubernetes is a very popular
platform, so JFrog must supply installation of its products on Kubernetes
for our customers and the community.
Fifth,
taking our containerized products to production.
Or in other words,
a development environment must be as production-like as possible.
Having an application in Docker is nice,
but we need much more than that
and that’s what a full solution platform like Kubernetes gives.
Sixth, vendor agnostic as much as possible.
As we must be able to deploy all of our products
on the main cloud vendors,
we had better have one API for that,
and Kubernetes provides an abstraction on top of the vendor
with a single API.
Seventh,
auto-scaling and auto-healing of applications.
These are the highlights of Kubernetes,
the features that make almost everyone
want to use Kubernetes.
And indeed, they are great features
that increase the reliability of your application.
For easy management of applications’ deployment on Kubernetes,
we decided to use Helm.
Helm is a tool for managing Kubernetes packages, called charts.
Like RPM for CentOS,
Helm charts help with defining, installing and upgrading Kubernetes applications.
JFrog publishes and maintains official Helm charts for all its products
in the Helm Hub
for the usage of our customers,
the community and, of course, our SaaS solution.
So how are we using Kubernetes in our internal environment?
For development purposes, we have a CI/CD process that installs
JFrog products on Kubernetes.
Per need, a developer can initiate this process and specify a branch version to install
with other applications’ master or branch versions.
The deployment process uses the official JFrog Helm charts,
and the result is an isolated Kubernetes namespace
with all the applications installed.
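As an illustration only, the kind of command such a pipeline step might run can be sketched in Python as follows. The talk does not show the actual pipeline, so the release name, namespace and the `--set` key are hypothetical, and `jfrog/artifactory` is assumed to be the chart reference. The Helm flags used (`upgrade --install`, `--namespace`, `--create-namespace`, `--set`) are standard Helm 3 options.

```python
import shlex
from typing import Dict, List


def helm_install_command(release: str, chart: str, namespace: str,
                         values: Dict[str, str]) -> List[str]:
    """Build a `helm upgrade --install` command for an isolated namespace."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--create-namespace",  # available in Helm 3.2 and later
    ]
    # Sort the overrides so the generated command is reproducible.
    for key, value in sorted(values.items()):
        cmd += ["--set", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # All names and the value key below are hypothetical.
    command = helm_install_command(
        release="artifactory",
        chart="jfrog/artifactory",
        namespace="dev-feature-1234",
        values={"artifactory.image.tag": "feature-1234"},
    )
    # Printed rather than executed, so the sketch is safe to run anywhere.
    print(shlex.join(command))
```

Using one namespace per developer branch keeps each environment isolated and easy to delete afterwards.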
Here,
you can see the output of the Jenkins pipeline job,
with various installation steps,
and in the end, the developer gets his
dedicated JFrog platform running.
For staging and pre-production purposes, we have several managed clusters,
at least one on each cloud vendor,
and we are running full CI/CD processes of JFrog product installation
along with Kubernetes infrastructure and tools
in order to run tests and reveal bugs before upgrading our production environment.
Back then, three years ago,
we investigated several options
for self-managing Kubernetes clusters in AWS, GCP and Azure.
We found that for us, the easiest and most immediate
approach would be to use the managed Kubernetes solutions
provided by the cloud vendors:
EKS, GKE and AKS.
We understood that managing the clusters by ourselves
requires a lot of resources and skills we don’t have,
and we had better focus on the things we’re good at.
For the past two years, we have had production environments running on EKS, GKE and AKS
in various regions with the same API.
We still have to deal with cloud-vendor-specific services like
load balancers, databases, etc.
Moving to work with Kubernetes
reduced the amount of vendor-specific APIs we need to manage,
and nowadays a new customer deployment
is a self-service and fully automated process.
No need for a DevOps engineer’s intervention.
The environment is ready for use within several minutes
in any supported cloud region.
Hands-free automation.
Talking about having Kubernetes in a production system
may sound very romantic to some people:
you simply run one or two commands and everything is just running in its place.
You have application auto-healing and auto-scaling, so now you’re free to sail into the sunset,
preferably with a glass of red wine in your hand.
Very soon, you find out this is not really the case.
You find your ship in the middle of a stormy sea,
struggling against big waves,
close to drowning,
and you find yourself helpless and terrified.
I know this sounds a little bit dramatic
but this is not far from the feeling you have the first time you’re facing a service downtime with your new _ system.
Once you pull yourself together,
you start to understand there is work to do to stabilize your ship
and make it stronger and more resistant.
So the next time
(and of course there will be a next time)
it faces high waves and big storms,
the shock will be smaller and the durability will be better.
Let’s now discuss the things we learned and found as important
during our process to stabilize our Kubernetes production ship.
Visibility.
The first thing that we learned is that because of Kubernetes’ complexity as a platform,
we need visibility of what’s going on in the system,
whether it’s for the use of developers,
support teams, DevOps or any other person.
No more SSH to the server and “get me the logs”.
Developers should not need
_ access to debug their applications
and furthermore,
the _ images do not have basic Linux _ on them.
Therefore, we understood we need detailed logging and monitoring tools
to run on the clusters continuously,
collect all the logs and metrics and provide us
a reliable visibility of the system’s _
along with all the information developers and support teams
need to debug their application,
real time and _.
At first, we started with local ELK and _ set up
but very soon, with a growing scale, we found that, exactly like
with Kubernetes,
we should go with a professional managed solution
for our logging and monitoring _.
This removed the need to manage the scale of the monitoring and logging systems,
which was very time- and resource-consuming for us.
Dev = staging = production.
We noticed that many times,
when we deployed a new version of one of our products,
various functional and performance issues were found only in production.
This always led to a hard release of
fix versions, and many sleepless nights.
As mentioned before, we have 3 environments, and
we learned that we must minimize
the differences between those environments as much as possible.
The ability to test on production-like environments gives the option
to run tests and checks at development time
and can minimize _, risks and _.
Today,
developers can deploy a branch to a SaaS environment
which is identical to production
for performing functional and non-functional tests with it.
Know your limits.
One of the significant things we learned
is that we must know our application.
This means
you need to learn and understand how your application works
in terms of resource usage:
memory, CPU, databases,
permanent storage if needed,
and anything else your application needs in order to
perform and run efficiently with minimum force.
There is always a balance between high utilization and high performance.
You can give your application a high amount of resources so
it will run very fast and without _,
but those resources
are probably idle sometimes and can cost you a lot of money.
You need to find the balance,
the correct ratio,
for your application to run properly and efficiently.
This is also true for the cluster nodes,
which have a defined size and a specific CPU and memory amount.
There is also the Kubernetes scheduler, which ensures there are
enough resources for all the pods on the node itself.
But you have to quickly calculate and understand how many pods you’re going to run,
what resources they need, and based on this,
plan the cluster size and the node size.
Don’t forget, the system _ and components such as
Kubelet, _, container engine and _ processes
also require resources.
Each vendor has its own definitions
for system services
and it must be considered in your plans and calculations.
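The sizing calculation described here can be sketched roughly as follows. This is a simplified illustration, not the method used at JFrog: it assumes identical pods and identical nodes, and all the numbers (pod requests, node size, reserved amounts) are hypothetical. Real reserved values differ per cloud vendor.

```python
import math


def nodes_needed(pod_count: int,
                 pod_cpu_m: int, pod_mem_mi: int,
                 node_cpu_m: int, node_mem_mi: int,
                 reserved_cpu_m: int = 0, reserved_mem_mi: int = 0) -> int:
    """Estimate how many identical nodes are needed for identical pods.

    CPU is in millicores and memory in MiB. The reserved values model what
    the system services on each node consume.
    """
    allocatable_cpu = node_cpu_m - reserved_cpu_m
    allocatable_mem = node_mem_mi - reserved_mem_mi
    if allocatable_cpu <= 0 or allocatable_mem <= 0:
        raise ValueError("reservations leave nothing for pods")
    # A pod must fit on both dimensions, so the tighter one wins.
    pods_per_node = min(allocatable_cpu // pod_cpu_m,
                        allocatable_mem // pod_mem_mi)
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on a node")
    return math.ceil(pod_count / pods_per_node)


if __name__ == "__main__":
    # Hypothetical workload: 40 pods, each requesting 500m CPU and 1 GiB.
    naive = nodes_needed(40, 500, 1024, 4000, 16384)
    realistic = nodes_needed(40, 500, 1024, 4000, 16384,
                             reserved_cpu_m=200, reserved_mem_mi=1500)
    print(f"ignoring system services: {naive} nodes")
    print(f"with system reservations: {realistic} nodes")
```

Even a small per-node reservation can push the estimate up by a whole node, which is why the vendor-specific system overhead belongs in the plan.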
Pod priority and pod quality of service.
A funny story about pod eviction
is that one day, in the very beginning,
we found dozens of pods
of the same application in eviction mode
and we panicked, since it was hard to see
there is one pod and it could run mistake.
That was the point when we learned about the existence of the pod eviction state.
Thanks to auto-healing and auto-scaling, the customer didn’t experience a downtime.
With respect to the resource limits
mentioned on the previous slide,
one of the major issues with applications
is the resources they are using.
The Kubernetes scheduler is the component that’s responsible for finding the best node for a pod to run on,
considering its resource needs, if they are defined.
There are 3 quality of service classes in Kubernetes.
The first, guaranteed.
All of the pod’s containers have specified
the same value for CPU requests and limits
and for memory requests and limits.
This option is the safest one, as Kubernetes promises to schedule the pod on a node
with all the needed resources
and it protects the pod from being evicted
in case the node is running out of memory.
On the other hand,
it might be risky in terms of
_ resources if you are setting requests and limits without preliminary tuning.
The second is burstable.
The pod is not guaranteed
but at least one of its containers
has specified a memory or a CPU request.
Burstable is good for most of the pods
but it is not totally safe
as your pod might be evicted by Kubernetes in some cases.
The last is best effort.
The pod doesn’t specify any CPU or memory requests or limits.
This kind of pod should be avoided,
as it is the first one to be evicted and might be very risky,
as CPU and memory can grow with no limits.
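The three classes can be summarized with a small classifier. This is a simplified sketch of the rules described above, not Kubernetes source code: it compares quantities as plain strings (so "1" and "1000m" would wrongly count as different), and the container dictionaries are hypothetical.

```python
from typing import Dict, List

# One entry per container: {"requests": {...}, "limits": {...}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a pod as Guaranteed, Burstable or BestEffort."""
    anything_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            anything_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is given, the request defaults to the limit.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not anything_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


if __name__ == "__main__":
    tuned = {"requests": {"cpu": "500m", "memory": "1Gi"},
             "limits": {"cpu": "500m", "memory": "1Gi"}}
    partial = {"requests": {"memory": "1Gi"}}
    for name, pod in [("tuned", [tuned]), ("partial", [partial]), ("empty", [{}])]:
        print(name, qos_class(pod))
```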
We found the guaranteed option to be the best in the very
early set-up stage,
when you want to stabilize your system while you learn your application and its needs,
but it is the most expensive option.
After some time, when you tune your application, do optimization and reset the resources,
then you can start to use the pod priority mechanism,
which defines which applications are the most important for you when there is
an eviction process.
For example, when we started, we loaded a lot of applications
and none of them had a defined pod priority.
Soon, we found that Artifactory,
which is the heart of all the services
and which a lot of pods depend on,
should have a higher priority than a log collector pod
that can stop and start from the same place after a _.
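A minimal sketch of how such priorities could be declared is shown below, again as plain Python data rendered to JSON. The class names and values are hypothetical and are not taken from JFrog's setup. `PriorityClass` in the `scheduling.k8s.io/v1` API group and the pod-level `priorityClassName` field are standard Kubernetes.

```python
import json


def priority_class(name: str, value: int, description: str) -> dict:
    """Build a PriorityClass manifest; a higher value means more important."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def with_priority(pod_spec: dict, class_name: str) -> dict:
    """Return a copy of a pod spec that references the given class."""
    spec = dict(pod_spec)
    spec["priorityClassName"] = class_name
    return spec


if __name__ == "__main__":
    # Hypothetical classes: the core service outranks the log collector.
    core = priority_class("core-service", 100000,
                          "Services that many other pods depend on")
    logs = priority_class("log-collector", 1000,
                          "Collectors that can resume after a restart")
    print(json.dumps([core, logs], indent=2))
    print(json.dumps(with_priority({"containers": []}, "core-service"), indent=2))
```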
Another nice example for improving the service availability by using pod priority
is _.
When _ runs on a cluster in Kubernetes,
there are some interesting faults
that happen as a result of _ leaving and joining the cluster frequently.
Once we added pod priority to _,
it made it much more stable and dramatically reduced
the faults related to the _ cluster.
Zero downtime upgrades.
As a _, we should aim to minimize
downtime of the service for any reason.
Application version upgrade is something we should be able to run whenever we want.
For example,
an urgent security patch of Artifactory
must be installed in customers’ production environments
and we can’t delay
or wait for a maintenance window.
Having the application running in high-availability mode
in several load-balanced pods
eliminates the risk of downtime when upgrading the environments.
With Kubernetes you have the option for a rolling update.
So, at a specific time, only one pod is going down for an upgrade
and the others are still functional,
and they’ll be upgraded in their turn.
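The "one pod at a time" behavior corresponds to the rolling update settings of a Deployment. The following sketch builds that `strategy` section as data; it is an illustration under the assumption that the application runs as a Deployment, which the talk does not state. The helper names are hypothetical.

```python
def rolling_update_strategy(max_unavailable: int = 1, max_surge: int = 0) -> dict:
    """Return the `strategy` section of a Deployment spec."""
    if max_unavailable == 0 and max_surge == 0:
        # Kubernetes rejects this combination: no pod could ever be replaced.
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")
    return {
        "type": "RollingUpdate",
        "rollingUpdate": {
            "maxUnavailable": max_unavailable,
            "maxSurge": max_surge,
        },
    }


def min_serving_pods(replicas: int, max_unavailable: int) -> int:
    """Lower bound on pods that stay available during the rollout."""
    return max(replicas - max_unavailable, 0)


if __name__ == "__main__":
    strategy = rolling_update_strategy(max_unavailable=1, max_surge=0)
    print(strategy)
    # With 3 replicas, at least 2 keep serving while one is upgraded.
    print(min_serving_pods(replicas=3, max_unavailable=1))
```

With a single replica and `maxUnavailable` of 1, nothing is left serving during the upgrade, which is one reason the application has to run in HA mode first.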
An important point to mention
is that your application must have HA ability:
the option to add or remove a service pod smoothly without issues.
And again, for this
you must know your application and change accordingly
what’s needed to support it.
In addition, there are also Kubernetes version upgrades,
cloud vendor upgrades or tool upgrades
that must run without interrupting application availability.
One great mechanism in Kubernetes is the probes,
which help you ensure the application is indeed
functional, and if not, restart it.
The readiness probe is an automatic mechanism
to know when a container is ready to start accepting traffic.
The liveness probe is used
to know when to restart a container.
With regard to the upgrade process,
if, for example, you are
upgrading the RDS and a re-connection to the database is needed,
the liveness probe will identify the _ state of the application
and will restart it so the re-connection will be done automatically.
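For illustration, a container definition with both probes could look like the sketch below. The endpoint paths, port, image and timing values are hypothetical and would need to match what the application really exposes. `httpGet`, `periodSeconds` and `failureThreshold` are standard probe fields.

```python
def http_probe(path: str, port: int, period_seconds: int = 10,
               failure_threshold: int = 3) -> dict:
    """Build an HTTP probe definition for a container."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


# Hypothetical container: readiness gates traffic, liveness triggers restarts.
container = {
    "name": "demo-app",
    "image": "registry.example.com/demo-app:1.0.0",
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5),
    "livenessProbe": http_probe("/healthz", 8080, period_seconds=10,
                                failure_threshold=6),
}


def seconds_until_restart(probe: dict) -> int:
    """Rough time a container can keep failing before it is restarted."""
    return probe["periodSeconds"] * probe["failureThreshold"]


if __name__ == "__main__":
    print(seconds_until_restart(container["livenessProbe"]))
```

A liveness probe that is too aggressive can restart an application that is only slow, so the thresholds deserve the same tuning as resource limits.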
Security.
At JFrog, we love to eat our own dog food
and use our products in our daily work and processes.
Having said that,
it is clear that the best way to manage our
Docker images and Helm charts
is using Artifactory.
We are using an internal
Artifactory server as a Docker registry and Helm repository,
and during the deployment process, everything the deployment needs
is fetched from Artifactory.
So we have full control and visibility of
what’s running in our development and production Kubernetes clusters.
On top of that,
Xray runs and scans all the 3rd-party Docker images
stored in that Artifactory, so
only scanned and approved images find their way to the Kubernetes cluster.
And this ensures there is no unapproved code running on our cluster.
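One simple way to check the "everything comes from our own registry" rule is sketched below. This is not how Artifactory or Xray enforce it; it is only a hypothetical pre-deployment check, and the registry host name is made up.

```python
from typing import List


def unapproved_images(pod_spec: dict, approved_registry: str) -> List[str]:
    """Return the images in a pod spec that are not pulled from the approved registry."""
    prefix = approved_registry.rstrip("/") + "/"
    all_containers = pod_spec.get("initContainers", []) + pod_spec.get("containers", [])
    images = [container["image"] for container in all_containers]
    # An image without the registry prefix (for example "busybox") is also flagged.
    return [image for image in images if not image.startswith(prefix)]


if __name__ == "__main__":
    # Hypothetical pod spec mixing internal and public images.
    spec = {
        "initContainers": [{"image": "artifactory.example.com/docker/init:1.0"}],
        "containers": [
            {"image": "artifactory.example.com/docker/app:1.0"},
            {"image": "docker.io/library/busybox:1.36"},
        ],
    }
    print(unapproved_images(spec, "artifactory.example.com"))
```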
Continuous learning.
This is one of the main messages I would like you to take from this talk.
The understanding that the learning process of Kubernetes itself and of your system is never ending.
Kubernetes as a technology is new for almost everyone.
Developers and _
must learn how it works and how to use it,
and learn the best practices and recommendations, which are available all over the net.
Learn how to look at logs and metrics
and how to gain the information required to debug applications.
You should always continue to tune your infrastructure and applications and their behavior.
Reviewing metrics from the past is very important:
you must check how your application behaves over periods of time.
Take a specific process or behavior that hurts your system, such as
a badly performing process or a process with high network consumption,
and investigate it over a day, a week,
two weeks, a month, etc.
and see if there is any pattern repeating itself.
As an example,
in one of our proactive checks we
found an increase in CPU over several weeks,
and that revealed the _ that was causing
CPU load problems and memory errors.
This is something you can’t
find when reviewing a day or a single week.
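The effect can be reproduced with a small sketch on synthetic data. The numbers below are invented for illustration and are not JFrog measurements: daily CPU usage follows a weekly cycle with a slow upward drift, and a least-squares slope is computed over one week and over four weeks.

```python
from typing import List, Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of the values against their index (units per day)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Synthetic weekly cycle: busier on weekdays, quieter on the weekend.
WEEKLY_PATTERN = [4, 4, 4, 4, 4, -10, -10]


def synthetic_cpu(days: int, drift_per_day: float = 0.5) -> List[float]:
    """Daily CPU usage (percent) with a weekly cycle and a slow upward drift."""
    return [50 + drift_per_day * day + WEEKLY_PATTERN[day % 7]
            for day in range(days)]


if __name__ == "__main__":
    series = synthetic_cpu(28)
    # Within a single week the weekend dip dominates the fitted slope.
    print("one week  :", round(slope(series[:7]), 3))
    # Over four weeks the cycle mostly cancels and the drift becomes visible.
    print("four weeks:", round(slope(series), 3))
```

In this synthetic series the one-week slope is negative while the four-week slope is positive, which matches the point that a slow increase only shows up over longer periods.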
Know that today, there are also great machine learning and APM tools
that help you find these kinds of issues
or even better, alert you about them according to predefined rules.
Don’t take any choice you made in the past for granted.
Take into account that at the time you made that decision,
you didn’t know the things you know
now. Since then you have experienced a thing or two,
and now you have better knowledge of and confidence in the system that needs to be handled.
Keep asking yourself if it makes sense,
and don’t be afraid to change, if needed.
Finally, I’d like to recap the main points I mentioned in this talk.
If that’s the best for you,
use managed Kubernetes and always
stay focused on what you’re good at.
Visibility, visibility, visibility.
The key to debugging and controlling your system.
Don’t be afraid to use managed solutions.
All environments should be as similar as possible to reduce risks
and potential bugs in production.
Know your limits.
Learn your application and set resources and pod priorities accordingly.
Take actions
in order to have a zero-downtime upgrade.
Change your application to be really HA.
Run rolling upgrades
and use Kubernetes probes.
Think about security and try to minimize
potential risks that may come from your Docker images.
Expect failures,
be mentally prepared for them and try to be as proactive as you can.
Don’t forget you must always continue to learn your system,
tune your configuration and metrics and re-validate your choices.
I hope that each and every one of you
will find one or two lessons learned
relevant for your organization and take it back to improve your Kubernetes production experience.
The Kubernetes journey
probably never ends,
but I hope that this talk convinced you that Kubernetes in production
is actually not that hard.
It is possible, it is fun,
and as we always say in this _:
once you leap forward,
you can’t go back.