Automated Artifactory HA Pipeline – Hank Hudgins, Capital One
JFrog Artifactory helps developers automate their release pipelines, so it's important that the release pipeline for Artifactory itself is solid enough to ensure stable, continuous availability of the platform.
In this swampUP session, Hank Hudgins discusses how Capital One has automated the release pipeline for their resilient HA implementation of Artifactory in AWS. Some of the topics covered include: build staging, static analysis, unit testing, security scanning, Terraform architecture, Chef configuration management, InSpec testing, validation testing, performance testing and rollback strategy.
VIDEO TRANSCRIPT
Okay. I guess we are good to start. So hey, everybody. Hope you guys are having a good swampUP so far. I'm Hank Hudgins. I work at Capital One. I'm a senior software engineer and I work on the Artifactory platform team. And today, we're going to be going over our automated Artifactory HA pipeline. So we're pretty much just going to break it down into the different stages that we have in our pipeline and relate it back to how we implement those stages. And just so you know, there are feedback cards on all these tables, so if you guys could fill those out when you have a chance, that'd be great.
Basically, the overview of what we're going to be talking about: I'm just going to give a brief background into our implementation of Artifactory and go into some of the benefits of having a solid automated release pipeline. And then we're going to have an overview of our pipeline and then go into each of the different stages and just kind of describe what the purpose of that stage is and how we implemented it at Capital One. And then at the end, we're going to go through the rollback strategies as well, because that's very important. And then if we have time at the end, we'll go into questions. We can have a breakout session after, as well.
So to start off with our background: Artifactory, obviously, is our enterprise solution for binary storage and for proxying remote repositories. I'm a member of that Artifactory team and I've been at Capital One for over two years now. What we do is basically develop the infrastructure and the automation for the platform to ensure that our users get new features as they become available. And we kind of have a system so that users can vote on which features and which repositories they want the most, and that's how we dictate how we're going to release those new repositories or features.
And so some of the types of things we release in our automated pipeline are Artifactory upgrades and any infrastructure updates we need to make. We're currently operating out of AWS, so if there's anything that we want to tweak in our autoscaling groups, for instance, if we want to change the scale, we have to release that through our pipeline as well. And then we also have server configuration updates released. And I'll go into that a little bit later. We actually use Chef cookbooks to configure our servers, so obviously those are also going to get pushed through with our pipeline. And then the last thing is server security patches. We try to stay on top of any patching that goes into our security for our EC2 instances, so we're regularly cycling out our instances, and that also goes through our pipeline.
In our particular Artifactory implementation, we have two HA clusters. We have a primary Artifactory HA cluster that users are hitting most of the time. And then we have a secondary Artifactory HA cluster in another [inaudible 00:03:11] region that we use for DR, our disaster recovery region. And so we try to keep those two Artifactory clusters in sync as much as possible. In case something goes wrong with our primary region, we can just flip traffic over to our DR region.
And so some of the benefits of a solid automated release pipeline … I have them in here. There's probably more, but these are the ones that I could come up with. So the first one: it speeds up the delivery process. Having a solid pipeline frees up your developers to focus more on development. They don't have to focus on releasing to the different environments and making sure they do all the testing themselves. You can have all that automated, so it's a great way to free up the developers to focus more on development and get that release process moving faster.
Second is, it's more easily expandable. So let's say you have a pipeline and you're releasing an Artifactory cluster into one region, and you want to expand to a second region. It's a lot easier to do that if you have a solid automated release pipeline. You probably just have to add some more configuration to release to that new region. You already have the testing in place. You already have all the other stages in there, so it just goes through smoothly and you don't really have to focus on building all that out again.
And then third, it has repeatable, thorough testing. And the main goal of all this testing is to ensure a seamless user experience and have no negative user impacts. Obviously, if you could do releases without users even realizing that something’s happening, that’s the best scenario. So that’s what we strive to do and it’s a really good goal to achieve.
And then the last one here, it enables effective rollback strategies. It does this in two ways. The first is it enables you to detect faults faster. You have automated testing in your pipeline, so you can detect when things are misconfigured much faster than if you just spin up a cluster, for instance, and then realize some functionality is not working from doing some manual testing. So that's much faster. And then it also helps orchestrate deployment in a way that you can roll back from without impacting users. I'll go into how we did that in a little bit here.
So this overview here, this is how we set up our pipeline. We have a Jenkins pipeline that's integrated with our GitHub repository. And whenever we make a pull request into our main repository, that triggers some small-scale testing, so first, some fast feedback. It actually runs through some quick static analysis, spins up a single Artifactory instance and just makes sure that that Artifactory is coming up. So it's not quite the HA scale, but it is good for getting that quick feedback on a pull request.
And then once we determine that that looks okay, we can merge that into our main repository, and that triggers a build of our dev environment. And that actually has multi-region HA clusters. They're a little bit smaller scale than production, but it is a good way to test that our infrastructure is building correctly to two regions. And so we do run some automated testing against that as well. And then once we determine we have enough code changes and things that we want to release to production, we mark that code with a QA tag and then we build that tag to deploy a production-scale QA environment. So it's the exact same number of instances as production. And actually, we are copying our production data down to have an almost exact match of what we have in production. And that's really helpful for making sure our releases don't cause any issues.
So, yeah. Once we determine QA is looking okay, we promote that same QA tag to a production tag and release to production. So we use Terraform for creating our infrastructure. I put blue green deployments in here, but it's not quite blue green. You'll see what I'm talking about. It's a little complicated when you have a centralized database connected to these Artifactory HA clusters, so we have a pseudo blue green strategy for our deployments. And then we use a Chef cookbook for our server configuration. And that just installs, configures, and starts our services. The different services we have are Artifactory, Nginx for the reverse proxy, Datadog for monitoring, and Splunk for collecting our logs.
Okay. So we'll go into each of these different stages in our pipeline now, just kind of go over the details of each and how we are implementing these. So static analysis: this one runs on pull requests and on our development merges, the merges to our main branch that build the dev environment. These ensure that our code structure meets industry standards. It gives us really fast feedback and makes sure we're not introducing unmanageable code structure into our repository. And so we handle this with a series of linters. And what we have done, actually, is we've containerized all these linters into a Docker container that runs in our Jenkins pipeline. And that just makes sure that we don't affect Jenkins in any way and we can run any of these different linters against our files.
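As a rough sketch of that pattern (this is not the actual Capital One setup; the container image name and the file paths below are hypothetical), a small wrapper can run each linter inside a shared container from the pipeline workspace:

    # lint_gate.py - illustrative sketch: run each linter inside a container
    # image so the Jenkins agent itself never needs the tools installed.
    # The image name and target paths are hypothetical placeholders.
    import os
    import subprocess
    import sys

    LINTERS = [
        ["cookstyle", "cookbooks/"],
        ["foodcritic", "cookbooks/artifactory"],
        ["pylint", "scripts/replication_helper.py"],
        ["shellcheck", "scripts/bootstrap.sh"],
    ]

    def run_linter(cmd):
        # Mount the checked-out repo into the container and run one linter inside it.
        docker_cmd = [
            "docker", "run", "--rm",
            "-v", f"{os.getcwd()}:/workspace",
            "-w", "/workspace",
            "linters:latest",   # hypothetical image with all the linters baked in
        ] + cmd
        return subprocess.run(docker_cmd).returncode

    if __name__ == "__main__":
        failures = [cmd[0] for cmd in LINTERS if run_linter(cmd) != 0]
        if failures:
            print("Linting failed:", ", ".join(failures))
            sys.exit(1)
        print("All linters passed")

A non-zero exit from this kind of wrapper is what fails the static analysis stage and gives the fast feedback on the pull request.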
So the linters we have are Cookstyle, Foodcritic, and RuboCop. We use all those for our Chef cookbooks. We have a Jenkinsfile linter, Pylint because we actually have some custom Python scripts, and ShellCheck. And then we have terraform validate for our Terraform, and a Markdown linter, because it's really important to keep your documentation in a nice, organized format.
And then the next stage, we have security scanning. You guys have probably heard a few times today already that it's good to shift your security scanning left. Make sure you catch those vulnerabilities before you go to production. So we do some security scanning in our pipeline. We primarily do it to scan the dependencies and binaries in our custom scripts. I don't think we scan our Artifactory binary, but I assume they're giving us good stuff. So don't worry about that. So we actually have two types of security scanning. Static security testing, which is just analyzing the code for potentially exploitable code structures. For instance, if you had some SQL injection potential in there or cross-site scripting, stuff like that. And then the other type is composition analysis, and this is more what X-ray would give you. That analyzes dependencies for known vulnerabilities and gives recommendations on versions to use that have fixes for those vulnerabilities.
And so some of the custom code that we send off to be security scanned: we have an AWS Lambda function that basically forwards our CloudWatch alerts to our Slack, because that's usually where we hang out on a day-to-day basis, so it's really easy to see those come in. And then we also have some custom Python scripts to assist in the replication between our two Artifactory clusters. And, yeah, we ran into some issues where we had some large repositories that weren't quite getting the cron replication, the native replication, to work, so we had to assist that with some of our own custom scripts.
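To give a feel for that alert-forwarding piece (this is a minimal sketch, not the actual function; it assumes the CloudWatch alarm publishes to an SNS topic that triggers the Lambda, and the webhook URL comes from a placeholder environment variable):

    # Minimal sketch of a CloudWatch-alarm-to-Slack forwarder. Assumes the alarm
    # notifies an SNS topic that invokes this Lambda; the webhook URL is a placeholder.
    import json
    import os
    import urllib.request

    SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical env var

    def handler(event, context):
        # SNS wraps the CloudWatch alarm JSON in Records[*].Sns.Message
        for record in event.get("Records", []):
            alarm = json.loads(record["Sns"]["Message"])
            text = (
                f":rotating_light: {alarm.get('AlarmName')} is "
                f"{alarm.get('NewStateValue')}: {alarm.get('NewStateReason')}"
            )
            payload = json.dumps({"text": text}).encode("utf-8")
            req = urllib.request.Request(
                SLACK_WEBHOOK_URL,
                data=payload,
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req)
        return {"status": "forwarded"}

Because it's plain Python with a third-party webhook call, it's exactly the kind of custom code that goes through the security scanning stage.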
And then the next is unit and integration testing. Again, this really only applies to any custom code you might have, so you just have to test and make sure that it's not going to break any expected behaviors. And this also can apply to custom user plugins. So I don't know how many of you guys use custom user plugins in Artifactory, but they actually use Artifactory's public API library. So it's a little tricky to test those, but what we did was basically containerize an installation of Artifactory, inject our user plugin into that containerized Artifactory, and then run tests against it. So it's testing the functionality of the user plugin without actually integrating it into our environments as a part of our pipeline.
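As a hedged sketch of that idea (the plugin name, port, and credentials below are made up for illustration, and it assumes an Artifactory container is already running locally with the plugin mounted), the tests can just hit the plugin over the REST API:

    # test_user_plugin.py - illustrative pytest sketch. Assumes a local Artifactory
    # container is running with our user plugin mounted into its plugins directory;
    # the plugin name, port, and credentials are placeholders.
    import requests

    ARTIFACTORY_URL = "http://localhost:8081/artifactory"
    AUTH = ("admin", "password")          # test-only credentials

    def test_plugin_is_loaded():
        # The plugins listing should include our execution once the container is up
        resp = requests.get(f"{ARTIFACTORY_URL}/api/plugins", auth=AUTH, timeout=30)
        assert resp.status_code == 200
        assert "cleanOldArtifacts" in resp.text   # hypothetical plugin name

    def test_plugin_responds():
        # User plugin executions are exposed under api/plugins/execute/<name>
        resp = requests.post(
            f"{ARTIFACTORY_URL}/api/plugins/execute/cleanOldArtifacts",
            auth=AUTH,
            timeout=30,
        )
        assert resp.status_code == 200

The point of the pattern is that the plugin gets exercised against a real but disposable Artifactory, never against the shared environments.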
And then we're getting into more of the deployment process here. So build staging: before we even deploy anything, we have to stage our files into a place where autoscaling events at runtime won't have dependencies on other systems. And that includes Artifactory. So obviously, if you're releasing Artifactory, you can't depend on Artifactory for runtime scaling because, if Artifactory ever goes down, you're stuck. You can't scale anymore. So what we are doing is actually storing our build files in an S3 bucket, and we just source everything there, from then on, at runtime. So as a part of our build stage in our pipeline, we do pull our RPMs from Artifactory, but then we upload them to S3. The RPMs that we're using are the Artifactory RPM and the Nginx RPM.
And then our Chef cookbook, we also push up to Artifactory. We build it and push it up to Artifactory and S3 as a part of our dev build. And then our QA and production builds will pull that same cookbook from Artifactory and put it in S3, just to ensure we're using the same one and we're not having to rebuild it throughout our environments.
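To make the staging step concrete (a minimal sketch only; the bucket name, repository paths, and API-key handling below are placeholder assumptions, not the actual layout), it is conceptually just a download from Artifactory followed by an upload to S3:

    # stage_build_files.py - sketch of build staging: pull an RPM from Artifactory,
    # then push it to the S3 bucket that autoscaling launches read from.
    # Bucket name, repo paths, and auth handling are placeholder assumptions.
    import os
    import boto3
    import requests

    ARTIFACTORY_URL = "https://artifactory.example.com/artifactory"
    STAGING_BUCKET = "artifactory-build-staging"   # hypothetical bucket

    def stage_file(repo_path, local_name):
        # Download from Artifactory using an API key header
        resp = requests.get(
            f"{ARTIFACTORY_URL}/{repo_path}",
            headers={"X-JFrog-Art-Api": os.environ["ARTIFACTORY_API_KEY"]},
            stream=True,
            timeout=300,
        )
        resp.raise_for_status()
        with open(local_name, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)

        # Upload to S3 so runtime scaling never depends on Artifactory itself
        boto3.client("s3").upload_file(local_name, STAGING_BUCKET, f"builds/{local_name}")

    if __name__ == "__main__":
        stage_file("rpm-local/jfrog-artifactory-pro.rpm", "artifactory.rpm")  # placeholder path
        stage_file("rpm-local/nginx.rpm", "nginx.rpm")                        # placeholder path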
And then we get into deployment. So this gets into a little of our deployment strategy. Basically, our goal is to deploy the new stacks without affecting availability of Artifactory and ensuring the users don't even detect that something's happening. That's the main goal. So what we do first is actually fail all of our user traffic over from our primary region to our DR region, and then we can focus on deploying to our primary region. As a part of that, we scale down the existing stack to just a couple of member nodes. There's no users hitting this stack anymore, so it's okay to scale it down. And we actually remove the primary node as well, because the new stack coming in is going to be joining that existing Artifactory cluster, since they're going to be hitting the same database and the same S3 backing storage. So they're effectively just cycling instances into the same Artifactory cluster.
So that's why we scale down the number of member nodes. It frees up licenses for the new instances that are coming up, and the new primary node will be the primary node for that cluster. So once we launch the stack, we run some testing against it. And if the testing is successful, then that's when we actually fail the endpoints over to the new stack, and we'll also flip traffic back over to the primary region.
And then once we flip traffic back over to the primary region, we wait a little bit just to see if users hitting the cluster is going to cause any additional issues. And when we're ready, we can deploy to our DR region as well. So that's kind of the blue green strategy there. I know it's not quite blue green, but it's a little pseudo. And then because we keep those clusters in sync, the deployment process is relatively seamless from a user's perspective. And I say relatively because we have these clusters in different regions, so obviously there's going to be a latency difference between the two, but it's effectively no user impact.
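As a hedged sketch of one piece of that cycle (the Auto Scaling group names and counts below are hypothetical), shrinking the outgoing stack to free up licenses before the new stack launches is a couple of Auto Scaling API calls:

    # scale_down_old_stack.py - illustrative sketch of shrinking the outgoing
    # member-node Auto Scaling group and removing the old primary, so licenses
    # free up for the incoming stack that joins the same cluster. Names are hypothetical.
    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-1")

    def scale_down(asg_name, desired):
        autoscaling.update_auto_scaling_group(
            AutoScalingGroupName=asg_name,
            MinSize=desired,
            MaxSize=desired,
            DesiredCapacity=desired,
        )

    if __name__ == "__main__":
        # Keep a couple of member nodes around, drop the old primary entirely;
        # the new stack hits the same database and S3 store, so it joins the same cluster.
        scale_down("artifactory-member-blue", 2)
        scale_down("artifactory-primary-blue", 0)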
So now I'm going to go into a few of the different tests that we do after we deploy. We have to pass these tests before we determine that it's okay to flip traffic back over. So the first one we do is configuration testing. And this is just to ensure everything is set up correctly on the servers while they're not serving traffic. So we actually SSH tunnel from Jenkins into each of the new servers and we remotely run InSpec tests to make sure everything is set up correctly.
So InSpec is very closely tied with Chef cookbooks. I think it's even part of Test Kitchen. So we decided to use InSpec along with our Chef cookbook to make sure everything is set up right, so we can ensure our services are enabled and running. That's Artifactory, Nginx, Datadog, and Splunk. And then we also ensure that the config files are all correct, in the right places, and have the correct permissions. And then we also ensure the networking ports are open as expected. So three of these have ports that we are interested in. Artifactory has the Tomcat ports, both for Artifactory and the Access service. And then we have Nginx ports, obviously. And then Datadog is monitoring JMX on Artifactory, so that also is using a port. So we just make sure all those are correctly opened and listening.
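As a rough illustration of driving that from the pipeline (the host list, profile path, user, and key file here are placeholders), running the InSpec profile over SSH against each new instance can be a thin wrapper around inspec exec:

    # run_inspec.py - sketch of running an InSpec profile remotely over SSH
    # against each newly launched instance before it takes traffic. Hosts,
    # profile path, user, and key file are placeholder assumptions.
    import subprocess
    import sys

    NEW_INSTANCES = ["10.0.1.10", "10.0.1.11"]   # hypothetical private IPs
    PROFILE = "inspec/artifactory-profile"       # hypothetical profile path

    def run_profile(host):
        cmd = [
            "inspec", "exec", PROFILE,
            "-t", f"ssh://ec2-user@{host}",
            "-i", "keys/deploy.pem",
            "--reporter", "cli",
        ]
        return subprocess.run(cmd).returncode

    if __name__ == "__main__":
        failed = [h for h in NEW_INSTANCES if run_profile(h) != 0]
        if failed:
            print("InSpec failures on:", ", ".join(failed))
            sys.exit(1)   # a non-zero exit here is what would trigger the rollback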
And any failures in any of these post-deployment tests are going to cause a rollback. So the second kind of testing we do post-deployment is validation testing. And this is a suite of tests we have to ensure that Artifactory's repositories are all working as expected. So we've actually containerized the package managers that pull test packages across all the repositories that we have configured. And this is good because it'll catch issues if any new system configurations break resolution of any of these repositories. For instance, if we change some Nginx rules, that could affect our Docker repositories, because our Nginx is set up to forward our Docker traffic to specific repositories. So it's really important that we test to make sure all these repositories are still working okay.
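As a simplified sketch of the shape of those checks (the actual suite containerizes the real package managers; this just uses plain HTTP downloads, and the repository names, artifact paths, and service account are placeholders), the idea is to pull a known test artifact from each configured repository through the new stack and fail the stage if any of them break:

    # validate_repos.py - simplified sketch of validation testing: pull a known
    # test artifact from each configured repository through the new stack's
    # endpoint. Repo names, paths, and credentials are placeholder assumptions.
    import os
    import requests

    NEW_STACK_URL = "https://artifactory-new.example.com/artifactory"
    AUTH = ("svc-pipeline", os.environ["PIPELINE_PASSWORD"])   # hypothetical service account

    # One known-good test artifact per repository we care about (placeholder paths)
    TEST_ARTIFACTS = {
        "maven-virtual": "com/example/test-lib/1.0.0/test-lib-1.0.0.jar",
        "npm-virtual": "test-package/-/test-package-1.0.0.tgz",
        "generic-virtual": "smoke-test/hello.txt",
    }

    def validate():
        failures = []
        for repo, path in TEST_ARTIFACTS.items():
            resp = requests.get(f"{NEW_STACK_URL}/{repo}/{path}", auth=AUTH, timeout=60)
            if resp.status_code != 200:
                failures.append(f"{repo}: HTTP {resp.status_code}")
        return failures

    if __name__ == "__main__":
        problems = validate()
        assert not problems, f"Repository validation failed: {problems}"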
And then it also tests deploying artifacts, not just pulling artifacts. We also test deploying artifacts to each of our different virtual repositories. And then last, but certainly not least, performance testing. So JFrog is handling a lot of performance testing on their end, but everyone's use case is different, and we've actually been burned on some of our implementation with this before. So it's very important that you performance test your use case. And this is just to ensure that you don't have any performance degradation before you release to production. Because the worst thing is when you put it out in production, users start hitting it, and all of a sudden Artifactory is running super slow and everyone's freaking out. So it's really good to test this ahead of time.
So we have JMeter actually simulating production-level traffic, and we analyze the overall TPS, which is the transactions per second. And the reason we're … Actually, I'll get into that a little bit. So [inaudible 00:20:06] a 15-minute load test that runs as a part of our pipeline. And that runs against QA. So after QA deploys, we run this 15-minute load test as a part of our automated pipeline. We also have an option of a one-hour load test for any major upgrades or large changes where we really want to make sure everything is looking okay. That one is not run automatically. We run that manually. But it's good to have that option.
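As a rough sketch of automating that stage (the test plan name and duration property are assumptions, and the TPS calculation assumes JMeter's default CSV results output), the pipeline can run JMeter in non-GUI mode and compute overall throughput from the results file:

    # run_load_test.py - sketch of the automated 15-minute load test stage:
    # run JMeter in non-GUI mode against QA, then compute overall TPS from the
    # results file. Plan name, duration property, and CSV format are assumptions.
    import csv
    import subprocess

    def run_jmeter(plan="loadtest.jmx", results="results.jtl", duration_secs=900):
        subprocess.run(
            [
                "jmeter", "-n",                        # non-GUI mode for CI
                "-t", plan,
                "-l", results,
                "-Jduration=" + str(duration_secs),    # hypothetical plan property
            ],
            check=True,
        )
        return results

    def overall_tps(results):
        timestamps = []
        with open(results) as f:
            for row in csv.DictReader(f):
                timestamps.append(int(row["timeStamp"]))   # epoch millis per sample
        span_secs = (max(timestamps) - min(timestamps)) / 1000.0
        return len(timestamps) / span_secs if span_secs else 0.0

    if __name__ == "__main__":
        tps = overall_tps(run_jmeter())
        print(f"Overall TPS: {tps:.1f}")
        # The pipeline would compare this against the previous release's baseline.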
And then we also, like I was saying before, we sync our data from production to QA to make sure we have as close a match to what we have in production as possible because that is really where you’re going to see things go awry. If you have less data in your QA than your production, you may not catch something that is going to cause users issues. So, yeah. It’s a good idea to do that if you can.
And then, it's a little challenging to model traffic because Artifactory has such a wide range of supported package types with their own package managers. And with our specific implementation of JMeter, we're limited to only making API calls. So we basically went through our production logs and got a list of the most frequently called API endpoints, and we tried to match the rate of API calls at peak traffic. It's still not perfect, but it's pretty useful for detecting trends in performance. We are actually actively working on this right now to improve how we're making these calls and hopefully get to a point where we can use the package managers themselves to simulate this production traffic, because trying to simulate a Docker pull with a series of API calls is not ideal.
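As a rough sketch of how that list can be built (the log line format parsed here is a simplified placeholder, not the exact Artifactory request log layout), it boils down to counting endpoints and finding the peak per-minute request rate to feed into the JMeter plan:

    # profile_traffic.py - sketch of mining request logs for the most frequently
    # hit API endpoints and their peak per-minute rate, to drive the JMeter plan.
    # The pipe-delimited log format assumed here is a simplified placeholder.
    from collections import Counter, defaultdict

    def profile(log_path, top_n=20):
        endpoint_counts = Counter()
        per_minute = defaultdict(int)
        with open(log_path) as f:
            for line in f:
                # Assumed format: "<timestamp>|<method>|<path>|<status>|<millis>"
                parts = line.rstrip("\n").split("|")
                if len(parts) < 3:
                    continue
                timestamp, method, path = parts[0], parts[1], parts[2]
                endpoint_counts[f"{method} {path}"] += 1
                per_minute[timestamp[:16]] += 1   # bucket by minute (YYYY-MM-DDTHH:MM)

        peak_rpm = max(per_minute.values(), default=0)
        return endpoint_counts.most_common(top_n), peak_rpm

    if __name__ == "__main__":
        top_calls, peak = profile("request.log")
        print(f"Peak requests per minute: {peak}")
        for endpoint, count in top_calls:
            print(f"{count:>10}  {endpoint}")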
And then our rollback strategy. So we have two levels of rollback, actually. We have an in-region rollback. So if this automated testing fails, we will actually remove the new stack and restore the old stack to its previous state, before the user traffic ever hits any new instances. So that's our automated rollback. We also have a DR failover rollback. This is more of a strategy in the way we're deploying. So we deploy to our primary region first with users hitting our DR region. And that really helps us … Well, once we flip user traffic back over to the primary region, we can actually watch users … monitor the system in the primary region while users are hitting it and make sure we didn't miss anything that might be a little wonky. So if we do detect something is not performing the way we want it to, we can quickly fail that traffic back over to our DR region. It's got the same state as before. We haven't updated it. So that's kind of a quick rollback strategy in the case of something that we might've missed.
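To sketch just the traffic-flip part of that quick rollback (the hosted zone ID, record name, and target below are hypothetical, and the real setup may use weighted or failover records rather than a simple CNAME swap), pointing the user-facing record back at the DR load balancer can be a single Route 53 change:

    # fail_back_to_dr.py - illustrative sketch of the quick rollback: repoint the
    # user-facing DNS record at the DR region's load balancer. Zone ID, record
    # name, and target are hypothetical placeholders.
    import boto3

    route53 = boto3.client("route53")

    def point_traffic_at(target_dns):
        route53.change_resource_record_sets(
            HostedZoneId="Z123EXAMPLE",                 # hypothetical hosted zone
            ChangeBatch={
                "Comment": "Fail user traffic back to the DR region",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "artifactory.example.com",
                        "Type": "CNAME",
                        "TTL": 60,
                        "ResourceRecords": [{"Value": target_dns}],
                    },
                }],
            },
        )

    if __name__ == "__main__":
        point_traffic_at("artifactory-dr-alb.us-west-2.elb.amazonaws.com")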
And then once we've failed over to the DR region, we can restore our primary region by rolling back the database, if we need to, to a previous snapshot, and then just let the regions come back in sync through the replication process. And then just a note here: in the case of major upgrades, automated rollbacks are not possible without also rolling back the database because, more often than not, those major upgrades involve some kind of database schema changes. So you can't just spin up the old Artifactory version against the migrated database schema. So it's a good idea to have some kind of process to roll back that database and then bring things back in sync. And actually, we don't have this currently automated, but hopefully we will, eventually. Luckily, we haven't had to do it too much. So that's a good thing.
So in conclusion, automated release pipelines have many benefits. They speed up the delivery process, they're more easily expandable, they have repeatable, thorough testing, and they enable rollback strategies. And even with [inaudible 00:24:42] products like Artifactory, it's important to thoroughly test your infrastructure and configuration to build confidence in your releases. And always look for opportunities to improve your pipeline. Investments now will make your life easier down the road. And that is my advice. Cool. Thanks.
So we probably have a little time here, if anybody has any questions for me. Yeah.
[inaudible 00:25:18]
What do we use for security scanning?
Yeah. [inaudible 00:25:27]
Right. So we actually have an internal infrastructure for security scanning that actually integrates multiple products together. It doesn’t currently include X-Ray, although we’re probably going to evaluate that in the future. So that one, for sure, would be handling the binaries. I’m not 100% sure on what is doing the static scanning. So I can get back with you in a little bit. I can look into that.
I’m sorry. Could you go to the discussion room and continue?
Oh, yeah. Yeah. Sure. So, yeah. We can break for the discussion room. If anybody has any questions outside of this, just let me know and we’ll follow the [vec 00:26:11] here and get those questions answered. Thank you.