Simplicity & Velocity: Focusing on Your Core Business

Yoav Landman
CTO & Co-Founder

JFrog CTO and Co-Founder Yoav Landman illustrates how the ever-growing velocity and complexity of creating great software drove JFrog to create an End-to-End DevOps Platform. The Platform addresses scale, security, and data protection challenges. It supports diverse and hybrid workloads, allowing you to focus on your core business by harnessing the cloud and managed services to reduce operational complexity.

Adopting Best Practices Across a Hybrid/Multi-Cloud Environment

 

Video transcript

My name is Yoav Landman. I’m CTO and co-founder of JFrog. The world of software is evolving at a growing velocity. The expected time to deliver new software versions is shrinking from day to day, and we have to meet these challenges on a daily basis, updating tens of thousands of customers and deployments across the globe, 24/7. To meet these growing demands, companies need to focus on their core business and harness cloud capabilities and managed services to reduce this operational complexity and overhead. The JFrog Platform allows you to focus on your core business while having the best end-to-end DevOps platform for managing your software creation pipelines in the cloud. Today, we’d like to share with you some of the key challenges that we faced while creating this platform, and how we addressed them to create a unique multi-cloud value. We can broadly speak about six challenges that we faced while creating the JFrog Platform.

The first one is the challenge of scale: running thousands of customers in multiple regions across the world. Today we support around 20 regions. The next challenge is the challenge of applying updates. We want to be able to apply software version updates as quickly as possible, and more importantly, when there’s a need to apply a security patch, we want to do it almost immediately. That led us to develop some very interesting infrastructure in order to be able to do it; I will touch on that later. The next challenge is the challenge of continuous security. It’s all about making sure that our infrastructure and our applications are free from any security vulnerabilities or weaknesses. On top of that, there is the challenge of data protection. We want to make sure that your data is kept secure and safe with us, and of course is never lost.

Then there is another challenge, which is the challenge of diverse workload types. Our applications are very different from each other, and they impose different needs in terms of databases, networking, compute, and so on. That’s another challenge. And the final challenge is the challenge of hybrid workloads. The JFrog Platform is both multi-cloud and supports the combination of SaaS and self-hosted. So, that’s the final challenge.

Let me start with the first challenge, the challenge of scale. Our platform is composed of dozens of microservices, and those services speak with each other through a service mesh. Today, we’re using Traefik, which is also what we’re using in the on-prem version. And we wanted the ability to dynamically scale those services up and down as needed. This led us to migrate from the legacy infrastructure that we used to run the cloud on to Kubernetes, early in 2017.

The first thing we did, once upgrading to Kubernetes, was to get rid of the default autoscaling that is offered by Kubernetes, because we found out that the rules implemented there don’t meet our needs for how we want scale-up and scale-down to be triggered. So, today, we’re using our own custom autoscaler across all clouds. Another challenge that we had to deal with is how to spread the load for optimal resource utilization. For instance, DB connections: we don’t want to max them out as we onboard new customers. So we have a full automation infrastructure to handle that.
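The talk does not describe the custom autoscaler's actual rules, but the kind of trigger logic it hints at (custom scale thresholds, plus guarding shared resources like DB connections) can be sketched in a few lines of Python. All thresholds, names, and the connection-pool model below are illustrative assumptions, not JFrog's implementation:

```python
from dataclasses import dataclass

@dataclass
class PoolMetrics:
    cpu_utilization: float      # 0.0-1.0, averaged across the pool's pods
    db_connections_used: int    # connections held by this pool right now
    db_connections_max: int     # hard limit on the shared database

def desired_replicas(current: int, m: PoolMetrics,
                     scale_up_at: float = 0.75,
                     scale_down_at: float = 0.30,
                     min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Hypothetical custom rule: scale on CPU, but never scale up past
    the point where shared DB connections would be exhausted."""
    target = current
    if m.cpu_utilization > scale_up_at:
        target = current + 1
    elif m.cpu_utilization < scale_down_at:
        target = current - 1
    # Guard the shared resource: assume each replica holds an equal
    # slice of the pool's current connection usage.
    per_replica = m.db_connections_used / max(current, 1)
    while target > current and per_replica * target > m.db_connections_max:
        target -= 1
    return max(min_replicas, min(max_replicas, target))
```

A real controller would run this in a reconcile loop against live metrics; the point is that custom triggers like the DB-connection guard are exactly what a stock horizontal autoscaler does not express.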

Next, we have customers that want to have their own private environments, with fully isolated infrastructure. We’re using the same automation and the same templates to provision that, so our management systems have to cater for that as well.

Another challenge that we have to face is the need to support distributed topologies, or cross-region deployments. It’s very common to find customers that have multi-site development environments and need some sort of federation between them, or distribution to edge nodes, for example. So, this requires our automation to support multiple topologies. Another thing that is very common is the need to customize things on a per-customer basis, a per-region basis, and so on. So we have a hierarchical framework that allows us to set up feature flags in a hierarchy, where every layer inherits from the layer above it: we have global, then regions, then specific subscription tiers, and finally specific customers. Some examples of those features are unique CNAMEs or certificates per customer and geo-IP restrictions, but virtually every feature in the platform is customizable, and you can override it in each of these layers.
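The layered inheritance described here (global → region → tier → customer, most specific wins) is straightforward to sketch. The flag names and layer contents below are made up for illustration; only the resolution order follows the text:

```python
def resolve(flag, *layers):
    """Walk layers from most specific (customer) to least specific
    (global) and return the first explicit value found."""
    for layer in layers:
        if flag in layer:
            return layer[flag]
    raise KeyError(flag)

# Hypothetical settings at each level of the hierarchy:
GLOBAL   = {"geo_ip_restrictions": False, "custom_cname": None, "max_repos": 100}
REGION   = {"geo_ip_restrictions": True}          # e.g. a region-wide override
TIER     = {"max_repos": 1000}                    # e.g. an enterprise tier
CUSTOMER = {"custom_cname": "repo.example.com"}   # a per-customer override
```

Calling `resolve("max_repos", CUSTOMER, TIER, REGION, GLOBAL)` yields the tier override (1000), while flags untouched below the global layer fall through to their defaults, which is exactly the inheritance behavior the hierarchy is meant to give.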

This is very flexible. And finally, what we found out almost immediately once moving to this scale is that manual changes can never be allowed. Any manual change is, of course, audited, but beyond that, any change that is not automated is going to be wiped out by the automation. We are also looking today at tools that allow us to discover those drifts in a fully automated way.
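Drift discovery of the kind mentioned here reduces to comparing the automation's desired state with the live state. This is a minimal sketch, assuming both states can be flattened into key/value dictionaries; real tools operate on full resource trees:

```python
def detect_drift(desired: dict, actual: dict) -> dict:
    """Compare the automation's desired state with the live state and
    report every key that was added, removed, or changed by hand.
    A reconcile pass would then reapply `desired`, which is the
    'manual changes get wiped out' behavior described in the talk."""
    drift = {}
    for key in desired.keys() | actual.keys():
        want, have = desired.get(key), actual.get(key)
        if want != have:
            drift[key] = {"desired": want, "actual": have}
    return drift
```

For example, a hand-edited replica count or an extra debug flag on the live system shows up as a drift entry, while untouched keys report nothing.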

Let’s speak about updates. Updates in JFrog are a fully automated, end-to-end process. When we started, we looked around at the tools that were out there, such as Argo and Flux, but they were quite young back then. So we had to devise our own homegrown system to collaborate with other homegrown systems that we have in JFrog. This is a system we internally call Autopilot. What this system does is run a fully progressive update, monitoring the output as it advances, and we support pluggable metrics for this monitoring.

As upgrades progress, we know whether to stop or proceed with them based on this output. Another thing we wanted really early on is to be able to take zero-day security updates, as well as updates that are very low risk, and apply them almost immediately to the system. Earlier, I spoke about the need to apply zero-day patches almost immediately, and that led us to improve our deployment infrastructure and cut the deployment time by almost 95% compared to what it was before. Another challenge that we’ve been facing is that the development of the platform applies full dogfooding, meaning that we’re using the JFrog platform in the cloud to create the JFrog platform in the cloud, and also the self-hosted products, of course. That bears the risk of breaking our own compiler, if you wish. For that, we had to develop a set of harness tools and checkpoints, making sure that every cloud update is bulletproof.
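The progressive-update-with-pluggable-metrics idea can be sketched generically. Nothing below reflects Autopilot's actual internals; it is the standard shape of such a controller, assuming deployments happen in batches and each metric is a callable health check:

```python
from typing import Callable, Iterable, List

MetricCheck = Callable[[], bool]   # returns True while the metric looks healthy

def progressive_update(batches: Iterable[List[str]],
                       deploy: Callable[[List[str]], None],
                       checks: List[MetricCheck]) -> bool:
    """Roll the new version out batch by batch; after each batch, run
    every pluggable metric check and halt the rollout on first failure."""
    for batch in batches:
        deploy(batch)
        if not all(check() for check in checks):
            return False   # stop the rollout; a rollback could follow here
    return True            # all batches deployed, all checks stayed green
```

The "pluggable" part is simply that `checks` is an open list: error rates, latency percentiles, or business metrics can all be dropped in as callables without changing the rollout loop.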

The next challenge I would like to speak about is continuous security. All our products are scanned by JFrog Xray, so this is another aspect of the dogfooding I mentioned before. A good example of that is the Log4Shell vulnerability that swamped the world: we identified immediately that our services were not impacted, thanks to this continuous scanning. Another type of continuous scanning that we apply is on our infrastructure: we scan our Terraform templates (we use Terraform heavily). We also monitor access to the system, using least-privilege access control policies, which we both enforce and monitor. Finally, the events and logs of the system are fed into an incident response infrastructure, and any suspicious activity triggers automated workflows in the incident response system. We take special care of the privacy of your data, and all of your data is stored encrypted at rest, both in the database and in the object storage.

Speaking about data, I would like to talk about data protection. I mentioned the database and the object storage; these are the main places where your assets are located. We’re not using any file system or anything like that for long-term persistency. That also buys us features that allow us to do DR, for instance, using the cloud managed services. For the database, we’re using cross-region replicas, and those replicas can be promoted to primary in the case of a failover; for object storage, we use the cross-region replication offered by the cloud provider. On top of that, we also take care of data corruption by having point-in-time backups. So, if a customer by chance causes any data loss, the old data can be recovered; it will be in the backup. Since we’re using just the database and the object storage, in the database we achieve point-in-time backups using point-in-time recovery checkpoints.

And in the buckets, we actually differentiate a little bit between the different cloud providers; this is a case where not every provider is identical in its support. So we may use either bucket versioning or soft delete to achieve some sort of point-in-time backup. And finally, customers can also achieve active-active, multi-site, cross-region federation of content. This is not part of the DevOps platform infrastructure; it’s a feature in the platform products. Currently we support federation for binaries (artifact repositories) and for security objects, and we’re going to add more objects to the federation support in the future. The next challenge I would like to speak about is the need to support diverse workloads. Our products are very different from each other: we have artifact management, security scanning, distribution to edge nodes, CI/CD, and IoT updates.
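Whether implemented with bucket versioning or soft delete, point-in-time recovery on object storage boils down to picking, per key, the newest surviving version written at or before the recovery point. This is a self-contained sketch of that selection rule with a made-up in-memory version history, not any provider's API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ObjectVersion:
    key: str
    timestamp: float        # when this version was written
    data: bytes
    deleted: bool = False   # soft-delete marker instead of a real delete

def version_as_of(versions: List[ObjectVersion], key: str,
                  point_in_time: float) -> Optional[bytes]:
    """Recover `key` as it existed at `point_in_time`: take the newest
    version at or before that moment; a delete marker means the object
    was already gone at the recovery point."""
    candidates = [v for v in versions
                  if v.key == key and v.timestamp <= point_in_time]
    if not candidates:
        return None
    latest = max(candidates, key=lambda v: v.timestamp)
    return None if latest.deleted else latest.data
```

The same rule serves both mechanisms: with bucket versioning the provider keeps every `ObjectVersion` for you, and with soft delete the delete marker plays the role of the `deleted` flag.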

All those products exhibit different demands and introduce very different workload requirements in terms of CPU, IO, memory, network, and so on, which means we need to use different node pools for these products. When having to split the load of the cluster between different node pools, we need to worry about two major concerns. The first one is what we just spoke about: the difference between application types, or products. The other one is almost a subcategory of this: the difference between subscription types. Some of our applications are heavily data- and network-bound, for example Artifactory, while others are heavily CPU-bound, for instance Xray scanners or JFrog Pipelines. So we need to use different node pools for them. The other reason for maintaining different node pools is the difference between subscription types and their expected workload on the system. For example, free-tier customers may introduce a very different workload on the system compared to enterprise customers.

And then we need to match the node pool types according to those two criteria: the different products or applications, and the different subscriptions. We support seamless migration between node pool types by using rolling upgrades, and internally we use label selectors in Kubernetes to achieve that. Finally, another highly demanded feature that we have to support is workload that is not just diverse between applications, but is also hybrid, or amphibian as we call it in JFrog. It’s hybrid in two ways: between clouds, and between the cloud and on-prem. So first, we are multi-cloud, and we are deployed across all major public cloud providers, giving you the freedom of choice: AWS, GCP, and Azure, with more to come.
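Kubernetes label selectors match a pod to nodes whose labels satisfy every key/value pair in the pod's selector. The matching rule itself is simple enough to sketch; the pool and tier labels below are hypothetical stand-ins for the two criteria from the text, not JFrog's actual labels:

```python
from typing import Dict, List

def nodes_matching(selector: Dict[str, str],
                   nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Return the nodes whose labels satisfy the pod's nodeSelector:
    every selector key must exist on the node with the same value."""
    return [name for name, labels in nodes.items()
            if all(labels.get(k) == v for k, v in selector.items())]

# Hypothetical pools split along the two criteria: product profile and tier.
NODES = {
    "node-1": {"pool": "data-bound", "tier": "enterprise"},  # Artifactory-style
    "node-2": {"pool": "cpu-bound",  "tier": "enterprise"},  # scanner-style
    "node-3": {"pool": "cpu-bound",  "tier": "free"},
}

# A CPU-bound workload for enterprise customers selects accordingly:
selector = {"pool": "cpu-bound", "tier": "enterprise"}
```

Migrating a workload between pool types then amounts to changing its selector and letting a rolling upgrade reschedule the pods onto the newly matching nodes.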

However, this also means we do not lock ourselves into any one provider’s solution, but instead work harder, which may require custom development, to make sure that we provide you with the best quality of service across providers. There are some exceptions to that; I will touch on them in a moment. Moreover, our customers have workloads that span both self-hosted and cloud infrastructure. For instance, you may run your CI in a self-hosted environment and then distribute to production in the cloud, or you can run most of your software creation pipelines in the cloud and then distribute to self-hosted edge nodes, or edge nodes in other clouds. These use cases are very common across the JFrog customer base, and we have to support all those scenarios. This has also led us to invest in features for fast, automated data migration from self-hosted to the cloud, and also between clouds.

I mentioned that sometimes certain features are supported more natively by some cloud providers. A good example is private linking, which is very common among our enterprise customers and allows you to use a secure tunnel between JFrog and your own private cloud. A different example is support for CDN downloads, or for direct downloads using the object store, which may be faster. In these cases, we may prioritize support for the vendor that best offers the solution, but we will expose the feature in the most generic way we can, to allow future support by other vendors to be adopted. Addressing all these challenges is what allowed us to create a truly flexible, end-to-end software supply chain cloud platform. It is a hybrid and multi-cloud platform, so you have the freedom of choice. We know we are storing and managing the workflow of one of your most valuable assets: your software releases.

And this is why protecting and reliably handling these assets is top of mind for us, requiring a high degree of security and integrity checks as part of our cloud ops. This, together with tools that allow you to easily manage the cloud transition from legacy self-hosted environments or other clouds, is the result of the great work of the amazing team of DevOps engineers and SREs here at JFrog. And the goal is a single one: to allow you to concentrate on what really matters, focusing on your core business to create great software. Thank you, and I hope you enjoy the rest of the conference.
