The Enterprise Holdings DevOps team walks through their company’s journey into the world of DevOps. Breaking down cultural barriers in a large organization brought struggles, but also revealed enormous opportunities to strengthen security more holistically as well as enable agility through the solution development life cycle. Learn about the political and technical challenges of DevOps transformation in a large enterprise.
All right thank you ladies and gentlemen for joining our talk today. My name is Nathan Kicklighter, I’m an architect for Enterprise Holdings.
I brought two other presenters with me today, specifically because one is a subject matter expert in pretty much all of CI and CD, and the other works a lot more on the DevSecOps piece. I work for a pretty large organization; Enterprise Holdings’ flagships are Enterprise, Alamo, and National. A little over 2000 people in IT. Our team is a little bit different and I’ll get into that. Last year, we were here, and we spoke about transformation of a large organization, and how to do that through grassroots. We really were just getting started; our CI/CD framework, as we like to call it, was all set up, ready to be adopted. But when we started rolling out, we only had five applications, out of the 200+ that we support today, that had actually adopted our framework, and it was more of a pipeline. But since then we’ve had pretty good success over the past year. We’re up to about 70 that have adopted, or are looking to adopt, or are in the process. I’m trying to coin that as our CI/CD framework adoption.
So our team, like I said, is a little bit unique; none of us are developers, none of us work in infrastructure, we have a lot of different talents, came from different areas. And so everybody can kind of work more focused on the specific items that we run into. Right now, there’s about 4+ engineers working on the project, took about 2 architects to get it off the ground, started about two and a half years ago, and just wanted to keep you guys up to date on our progress, and the problems that we see, and how you have to adjust from what may be the norm.
We also made sure we took account of all of the different middleware platforms that we use, various ones at various levels of legacy, whether they be on-prem or using any of the cloud solutions that are out there.
So needless to say, that was our DevOps experience, that’s what it really looked like last year. When we left here, we felt better; we’re doing the right thing, we’re moving forward. So we really felt like it was smooth sailing.
We thought it was going to be easy for us to go out and get people to adopt, but that’s not the case. We had to do a lot of marketing, there’s a lot of organizational issues that we had to go through, just getting acceptance from your leadership to spend time doing this without being able to show a perfect ROI; it’s kind of hard for us to tell that story. That’s why our journey has been a little bit painful. But as we move forward this year, we’re starting to really pull back the covers on our tool chain and what we’re using to adopt a DevOps process and culture, with a CI/CD framework that teams can easily adopt to move forward with their SDLC.
So without further ado, I’d like to bring up TJ Smreker, he’s going to give you more of an overview of the SDLC as we saw it, and how we were able to implement it, and more on the DevOps culture change. TJ?
Thank you Nathan. I’m TJ Smreker, I’m a Systems Engineer at Enterprise Holdings. So what I’ll start off with is software delivery life cycle, or SDLC. And I’ll caveat it with; this is kind of from our perspective; like Nathan said, we’re in kind of an interesting situation. We’re not developers, we don’t have any apps that we support. But we support application teams, in their applications. And we’re kind of between them and infrastructure.
Again, this is from our perspective, and how we kind of developed this process. But at a super high level, starting off, it was a developer developing locally and checking that source code into the repository. Building that application on a build server. Running any kind of legal security scans for vulnerabilities. Packaging up that application and publishing it to JFrog Artifactory. Deploying that application, testing in that environment, and once all testing suites have been run in each environment, we deploy to production.
So pretty simple, pretty high level. But to pull it out a little bit further, there was a lot that we had to include to get that process going. So we created, developed a lot of different things that we had to incorporate in our SDLC, and as you can see here, there’s onboarding that’s also included. The adoption piece, and that’s something I’ll talk about here shortly, was a huge undertaking.
So just a closer look: DSL framework, that is actually the automation that we built around the onboarding process, or the adoption. We developed templates, those templates are [inaudible 00:04:57] libraries, all of that being stored in source control, along with our packaging scripts, our automation tool, and Ansible, all of the repositories that we’re using for that, stored up in source control, along with our PROD files, and scripts.
I’ll talk about all of that a little bit more as I go through here. But just to give an idea, there’s a lot that’s involved in trying to architect something like this. And that’s not limited to the tooling, either. So when we look at products, our department actually owns almost all of the products that are included in the SDLC. So we have a good perspective of where those things need to be placed and how to integrate them. So for example, with continuous integration, we have our automation tool in Jenkins. We have our vulnerability scanning tool, we have Artifactory, we’ve got our build tools. With continuous delivery, we’re still using Jenkins, still have Artifactory involved, and now Ansible is introduced, along with our middleware solutions and our testing suites. All of which we own in our department. So that made things easier for us, but you still had to identify where all those pieces fit in.
So, bring it out even further to show kind of what’s all involved. There’s a lot. You have to identify where those pieces fit in, how to integrate them, including the tools. At this point, as you can see, we got us a product, we have our process. We know where the tooling fits in, the integration points… we feel like we’ve created us a home-grown DevOps process.
At that point, we’re like, well I think we’re becoming DevOps! So what do we do with it now? We need some applications to be able to use it.
So we were like, well, DevOps is supposed to make your lives easier, right? It is, but getting there can be difficult, and if anybody ever tells you that going DevOps is easy, they’re lying to you. It’s hard. It’s difficult. There’s a lot involved, and I feel like from our perspective, it was especially difficult; like Nathan mentioned, we’re a large enterprise. We have a lot of IT employees, and a lot of applications, so we’re having to consider a whole bunch going into this.
So what we started off with was, we’ve got to identify: what are our common denominators across the board? Whether it be applications, platforms… whatever. That’s where we started, and we’ve got 200+ applications like Nathan said. They’re running on multiple platforms, different middleware solutions. So even if we have a couple of applications running on the same platform, more than likely, one isn’t the same as the next. So in a roundabout way, our applications are a bunch of snowflakes. And it’s difficult to maneuver through all that.
And while we’re trying to find out all those commonalities, we also want to make sure we’re standardizing best practices. Because we don’t want to repeat any of the mistakes of our past, if we have a bunch of applications doing one thing, that’s against best practices, that’s not something that we want to introduce into our new standards. So in this scenario, if you’ve got a bunch of people doing one thing, it doesn’t mean that that’s what you need to do; you need to identify what is that best practice?
While we’re looking at all that, we want to utilize what we currently have, if it’s viable. So, an easy example was Jenkins, which was already in our environment; we had app teams using it already, so they’re a little bit comfortable with it. That was something easy to implement. Along with our build methods and our deploy scripts, we felt those were easy pieces to reuse. We didn’t want to reinvent the wheel for no reason.
From that perspective, it made things a little bit easier, but we still had a lot to consider.
At this point, we’ve got something that works. We had that big process that we built out, so we had partnered with an app team, and we coin them as our champions. And we used them in a POC situation. But it was proof of concept, so what we had come up with was tailored to those applications. So once we got it then working, we were like, well? We want all the applications to be able to use this! But now we need to make them dynamic. We want them to be able to be reusable. It’s got to be easy for the app teams to adopt, because they’re busy developing, and we felt like it was our job to make it as seamless as possible for them to be able to adopt this process.
So what we ended up doing is we developed templates. They are written in Groovy. Because our automation tool is Jenkins, we develop Jenkinsfiles, but we also develop them to be agnostic. So, if we did decide to change our automation tooling, we’d be able to pivot pretty easily going forward.
Like I said, we developed a couple of templates; four to be exact, so we’ll just dive right in, this is kind of the meat and potatoes of how we were able to drive out adoption.
And we’ll start with the build. The build is just the continuous integration piece. So we were able to identify seven events we felt were necessary for every application to be successful in this process; we turned those seven events into stages in the Jenkinsfiles. And as you can see, we have little snippets of code up here, I know it might be a little hard to read.
But the idea is to show how they’re segmented throughout… as we go through the Jenkinsfile, each stage has its own section, so that if something did go wrong, it was easy to identify where it actually failed.
So moving through these templates, for the build, we would be cloning the repository obviously, we’d build that application on a shared build agent, that we use for all of our applications. We’re running our vulnerability scans, we’re packaging up that application, we’re publishing it to Artifactory with metadata so that we can use that later, for security purposes, or tracking, or what have you.
Along with that, we’re going to tag the repo, and then we’re going to clean up the workspace. So it’ll look like nothing was ever there. One thing to note is: for us to be able to keep these templates dynamic, we actually created an app-specific file that’s in JSON, and so each application has this file. And it stores the variables that are necessary for that template to use as it goes through the pipeline.
So each template is actually pulling just the variables it needs.
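The app-specific file idea can be sketched roughly as follows. This is a minimal illustration in Python rather than the Groovy the templates are written in, and every field name here (`app_name`, `repo_url`, `build_command`, `artifact_repo`), along with the sample values, is a hypothetical stand-in; the talk does not show the real schema.

```python
import json

# Hypothetical app-specific file; the real schema is not shown in the talk.
APP_FILE = """
{
  "app_name": "rental-quotes",
  "repo_url": "https://git.example.com/rental-quotes.git",
  "build_command": "mvn -B package",
  "artifact_repo": "libs-release-local"
}
"""

# Hypothetical list of the variables each template needs.
TEMPLATE_REQUIREMENTS = {
    "build": ["app_name", "repo_url", "build_command", "artifact_repo"],
    "deploy": ["app_name", "artifact_repo"],
}


def load_variables(template, raw_json):
    """Return only the variables the named template needs, failing fast on gaps."""
    config = json.loads(raw_json)
    required = TEMPLATE_REQUIREMENTS[template]
    missing = [key for key in required if key not in config]
    if missing:
        raise KeyError(f"{template} template is missing variables: {missing}")
    return {key: config[key] for key in required}


if __name__ == "__main__":
    print(load_variables("deploy", APP_FILE))
```

Keeping the per-application values in data and the logic in a shared template is what lets one template serve many applications; only the JSON file differs between them.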
The next one is deploy. So deploy is just continuous delivery. Now, and I’ll talk a little bit about this in a second, we’re not full CD here yet, for a couple of reasons. It doesn’t look like there’s a lot going on here, especially if you looked at the continuous integration part. But we’re using Ansible as our configuration management tool, which is actually doing a lot of the dirty work, the heavy lifting behind the scenes, and that made things a lot cleaner, because like I said, we have so many different platforms and middleware solutions that we have to be able to tailor everything to those different solutions.
Here’s an example of what a team would be getting if we onboard them. It’s a simple deploy to an environment, then a couple of checks to make sure the application comes up. And whenever we’re deploying, we’re actually writing back to Artifactory with metadata on what environment it’s deployed to, when it was deployed, and if it’s still deployed. But then like I said, we’re running checks to make sure that the application’s up.
The most important part of this is that last one: no tests assigned. So, pretty obvious, there aren’t any tests assigned to this application. But that was something that we saw with a lot of our applications, is they didn’t have automated testing. So we tried to give them this easy path, so that you can develop your automated testing, and you just have to plug it right in to your pipeline.
So obviously nothing’s happening here, but it’s a dynamic stage. So here’s an example of a team actually taking advantage of that. They’re actually running three tests, after the application is deployed. And you’ll see up here, with the code, that’s actually coming from that app’s specific JSON file. So again, it’s specific to that application so it’s stored in that file. So whenever we run this template, it’s actually going back into that JSON file, and saying, “Oh, you’ve got tests here? Let’s run them! Oh you got another one? Let’s run that one!” And it loops through those.
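A rough sketch of that dynamic test stage, again in Python for illustration only. The `tests` key and the test names are hypothetical, and `run_test` stands in for whatever actually triggers a test suite.

```python
import json

# Hypothetical app-specific files: one with tests configured, one without.
APP_FILE_WITH_TESTS = (
    '{"app_name": "rental-quotes", "tests": ["smoke", "api-regression", "ui"]}'
)
APP_FILE_NO_TESTS = '{"app_name": "fleet-admin"}'


def run_test_stage(raw_json, run_test):
    """Loop through the configured tests in order.

    run_test is a callable taking a test name and returning True on success.
    When the file lists no tests, the stage reports that and does nothing.
    """
    config = json.loads(raw_json)
    tests = config.get("tests", [])
    if not tests:
        return ["no tests assigned"]
    return [f"{name}: {'passed' if run_test(name) else 'failed'}" for name in tests]


if __name__ == "__main__":
    print(run_test_stage(APP_FILE_NO_TESTS, lambda name: True))
    print(run_test_stage(APP_FILE_WITH_TESTS, lambda name: True))
```

Because the stage is driven by the file, a team can start with no tests and add them later by editing data, without touching the template.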
Couple things to call out for us. Our build and deploy is being run on one Jenkins instance, and we’re doing our testing on another Jenkins instance. So we have separation of duty there. Make it a little bit more secure.
One thing you might notice and I’ll call out, like I said earlier, is that this isn’t very CD; we’re only deploying to one environment. We did this for a couple of reasons: one, comfort. We had to get buy-in from teams. So we wanted to make that easy path for them; we had teams still deploying manually. Or, we still had teams out there using Jenkins but in freestyle form.
So there were some teams that had that familiarity with it, but we wanted to make it as easy as possible for them.
Two, teams just aren’t ready. Like the previous example, of not having any tests available, teams just don’t have those tests, so we can’t go to a full CI/CD model, if you don’t have the appropriate testing suites.
Three, flexibility. We would probably have developed this anyway, just for those ad-hoc deployments to an environment, whenever it’s needed.
But that leads us into the next one, the full pipeline. So this is exactly what you would expect out of a CI/CD model. Except in our environment, we’re excluding production. So this is another separation of duty, or segmentation, if you will. We were actually deploying to production on a different Jenkins instance from the rest, so that non-prod isn’t able to affect production.
But from a non-prod perspective, it’s full CI/CD; we’re actually utilizing those templates that I just mentioned before, the build and deploy, we’re doing the exact same things. The only thing that’s different is this approval stage. So that’s actually a pause in the pipeline. And we put that in place, again, for comfort, little bit of a safety guard, for app teams and for their managers to feel comfortable moving to this because if they don’t have the appropriate testing in place, and they run this, it’s just going to jump to the next environment, and they’ll be like, what’s going on here?
So we put this in place, but these Jenkinsfiles are super flexible; things can just be removed. And whenever they feel like they have the appropriate testing included in their pipeline, they can just remove that stage and then they can get to that full CI/CD model.
The last thing is prod deploy. Prod deploy is where we feel like, app teams are getting the most benefit up front, from this. And again, we’re segmented, only deploying from one Jenkins instance to production. For some applications it’s super simple deployment. Deploying to one instance, running a check, calling it a day.
But, in our big enterprise, that’s not the case for most of them. Most of our applications are in app families, so they have multiple applications they’re deploying together, and to multiple clusters, so that the applications can stay resilient.
In this model, traditionally, we’re having a production operations team do the deployments; they’re opening up multiple PuTTY sessions to run scripts to kick off these deployments. We’re also deploying one application after another, so deployments are long. They take hours.
In this model, it’s hard to troubleshoot if something goes wrong. It’s hard to collaborate as well if something goes wrong, you can’t even look to see what actually happened.
The people that need status updates aren’t the people running the jobs, so you’re relaying a message. And then, like I said, we’re going one after another with these applications, which is making these deployments super long. So, Jenkins was able to help and address a lot of these issues, because if something does go wrong, you’re able to identify exactly what stage it failed in, and you can troubleshoot and collaborate in real time, because everybody can see this. Everybody’s got access to our prod Jenkins; we give read access to everyone. So anybody can go out there, and you can do retrospectives if something does go wrong.
Along with, like I said, being able to address any kind of situation that you run into, status updates aren’t really needed anymore. But we were still seeing a little bit of the issue that I mentioned earlier: we’re not seeing that time benefit, we’re still deploying one application after another. So they were still long deployments.
So what we did was partner again with our champions team; they had five applications at the time. This actually took place right before Nathan came out here last year to give his talk. And what we developed was what we call an “orchestration job”. That orchestration job controls all five of those applications, so you can deploy any combination of those applications through one job. It made it easier for whoever was running it, since you’re not having to go to five different jobs to kick off builds, and it was easier for everyone to identify exactly where we were in the process.
And it ended up being a huge time-saver, because what we did is we utilized parallel staging; we identified the best possible way to use our system resources, so that we could deploy applications at the same time. A lot of our applications are WebLogic, so you can only deploy one application on a server at a time, mostly. So we ended up using those parallel stages, identified which applications can be deployed when, and were also able to include a lot of the manual steps that were part of this long orchestration plan.
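One way to picture "which applications can be deployed when" is a grouping of applications into waves, where no server receives two deployments in the same wave. The sketch below is a simple greedy grouping written in Python for illustration; it is an assumption about how such planning could work, not the team's actual orchestration job, and the application and server names are made up.

```python
def plan_waves(app_servers):
    """Greedily group apps into waves so no server gets two deployments at once.

    app_servers maps an app name to the set of servers it deploys to.
    Apps are considered in the order given; each joins the first wave
    whose busy servers do not overlap with its own.
    """
    waves = []  # each entry: (list of apps, set of servers busy in that wave)
    for app, servers in app_servers.items():
        for apps, busy in waves:
            if busy.isdisjoint(servers):
                apps.append(app)
                busy.update(servers)
                break
        else:
            # No existing wave has room, so start a new one.
            waves.append(([app], set(servers)))
    return [apps for apps, _ in waves]


if __name__ == "__main__":
    # Hypothetical app family of five applications.
    family = {
        "quotes": {"wls01"},
        "rates": {"wls02"},
        "billing": {"wls01", "wls03"},
        "fleet": {"wls02"},
        "reports": {"wls04"},
    }
    for number, wave in enumerate(plan_waves(family), start=1):
        print(f"wave {number}: deploy in parallel -> {wave}")
```

Everything inside a wave can run as parallel stages, and the waves themselves run one after another, which is where the time saving over a strictly sequential deployment comes from.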
So by doing that, a normal five application deployment for this team was taking about four hours. And they were deploying monthly. And on our first attempt with them, we were successful, and we were able to cut it down to an hour and a half. On our first try.
So that was a huge win for us, huge win for them. They’ve had so much success with this, they’ve actually taken it, full ownership of the process; and they’ve grown from there. They’ve taken what we had, and they’ve expanded on it. And I think the last time I checked, they were round about an hour on their deployments now. And they’re actually adding more applications to it, so. That was a huge win for them, and again, huge wins for us.
Last thing I’ll call out is: we were like, what happens if we get a half hour into this deployment, and one of these parallel stages fails? The entire pipeline’s going to fail! We would be losing about a half hour of time. So we actually developed a dynamic stage that’s checking the statuses of those deployments in that parallel stage, and we’re saying, “If it fails, don’t actually fail the job. We’re going to put it in Unstable, and we’re going to pause the pipeline.” That way, we can get the necessary parties involved to say, “Okay, we can address this situation and take whatever actions are necessary,” and if we are able to address it then we can just continue with the pipeline, so we’re not losing any time.
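The "mark it unstable and pause instead of failing" behavior can be sketched like this. It is an illustrative Python model, not the Jenkins implementation: `ask_operator` is a hypothetical callback standing in for a manual pause in the pipeline, and the status strings are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_stage(deploys):
    """Run deployments concurrently and record success per app instead of raising.

    deploys maps an app name to a zero-argument callable that performs the deploy.
    """
    def safe(deploy):
        try:
            deploy()
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max(1, len(deploys))) as pool:
        futures = {app: pool.submit(safe, fn) for app, fn in deploys.items()}
        return {app: future.result() for app, future in futures.items()}


def check_stage(results, ask_operator):
    """Decide what happens after a parallel stage.

    With no failures the stage succeeds. Otherwise the run is treated as
    unstable and paused: ask_operator receives the failed app names and
    returns True to continue the pipeline or False to abort it.
    """
    failed = sorted(app for app, ok in results.items() if not ok)
    if not failed:
        return "SUCCESS"
    return "UNSTABLE_CONTINUE" if ask_operator(failed) else "ABORTED"


if __name__ == "__main__":
    def good_deploy():
        pass

    def bad_deploy():
        raise RuntimeError("server did not come up")

    results = run_parallel_stage({"quotes": good_deploy, "rates": bad_deploy})
    print(check_stage(results, lambda failed: True))
```

The point of the design is that a failure in one branch becomes a decision for a person, rather than an automatic restart of a deployment that is already half an hour in.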
Again, that was another big one for us as well.
So a couple more short things. DSL and automation. This was for our benefit; like I said earlier, we had 200+ applications that we were hoping would adopt this. There’s a lot that’s involved in trying to get an application to take this on. So we’d actually built out an entire process around onboarding, so it was more streamlined. And then we also utilized the Jenkins Job DSL Plugin, so that we could automate creating those Jenkins jobs, creating that app-specific file, and getting those templates and that file under their repos. So something that was taking us a couple hours to do was now just taking minutes.
That was really helpful for us, to be able to help these teams adopt this process.
The last thing is just a standard process across the board. That was the whole idea of all of this, was, we wanted to take this picture, and basically lay it over the top of everybody that we supported. That way, for us, it was easier to support, for them it’s easier to scale, it just helps everybody out, and it allows us to try to strive towards that DevSecOps solution.
The last thing I’ll call out is, the team I worked on, we documented the hell out of this thing. And we needed to, because there was so much involved. Now we knew we couldn’t document everything, because we’re dealing with so many snowflakes, and there’s so many different custom solutions. But, we wanted to get as much information to our customers, which were our app teams, out there, so that the process was as easy as possible for them. Like how to get started, any pre-requisites they needed, before they got started so that they weren’t spinning their wheels, or wasting their time, we were able to get them going right away.
And then along with documentation for just admin, we have, I don’t know, 40 people in our department, and we only had three people who knew this entire process. So just documenting it as much as we could, and we spent several months documenting because there was so much involved.
And with that being said, it sounds pretty easy, right? Not a whole lot to it? What I like to say, is DevOps is just sunshine and rainbows. So now Jim’s going to come up here, and he’s actually going to tell you about how security works in our solution.
Thanks TJ. I’m Jim Lesser. I’m also a systems engineer at Enterprise. And I’m going to talk to you guys about DevSecOps and how security is a very important part of our organization. And how hard it actually is to implement.
So, starting off, a little bit of security history, and where we lacked it. We started off using common users for running all of our processes, and those common users had common log-in information. The problem here was that the teams that actually owned the applications that these processes were running for didn’t even own those users. We lacked testing in our security enhancements. We unfortunately were really in one of those situations where, whenever we were making those enhancements, it was a test-in-production situation.
And we also had a very terrible, antiquated FOSS intake process, or free and open-source software. It would easily take, well over a month for us to bring new components into our environment, and that had no correlation at all to being agile. And this also increased the time that it took to mitigate any security vulnerabilities that did pop up.
So that being said, when we started looking at security, and where we really needed to incorporate it, we realized that it needed to be incorporated into the entire SDLC. We need to make sure that we’re focusing on not only development and our builds, but even QA, and especially production environments. The nice thing about building it into the entire SDLC is that we can build approvals into each step. Nowadays things move so fast that we really can’t afford to wait.
The first place that we were focusing on is account management, and how it relates to segmentation of duty. This was something that our information security team didn’t actually require at first, but really once we got into architecting the solutions, we realized that we needed it in order to be successful. A lot of the teams that we piloted this with had some push-back at first, because they didn’t want the added responsibility of having another account that they had to be responsible for. But ultimately, once they started using it, they realized that it made their work go a lot smoother, because they didn’t have that reliance on the support teams like they used to.
Testing has been very important to us moving forward, too. We want to make sure that anytime we are thinking of implementing any new security best practices, we aren’t breaking current solutions when we’re implementing the new stuff.
Now the big place that we’ve been focusing lately is vulnerability scanning. This is a very important thing to be thinking about in development, because we want to make sure that we find all those vulnerabilities before we actually introduce them into our environment. And we also want to make sure that we’re focusing on getting those scans into our QA and production processes. Because really, we know that new vulnerabilities pop up every day. So we need to make sure that we’re being proactive in searching out those new vulnerabilities. And while we’re doing that, we’ve built out the appropriate SLAs to remediate each of those vulnerabilities, based on its severity.
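A severity-based remediation SLA boils down to a lookup from severity to a deadline. The sketch below shows the shape of that calculation in Python; the severity names and the number of days are purely hypothetical, since the talk does not give the actual SLA values.

```python
from datetime import date, timedelta

# Hypothetical remediation windows in days; the real SLAs are not given in the talk.
SLA_DAYS = {"critical": 7, "high": 30, "medium": 90, "low": 180}


def remediation_due(severity, found_on):
    """Return the date by which a vulnerability of this severity must be fixed."""
    return found_on + timedelta(days=SLA_DAYS[severity.lower()])


if __name__ == "__main__":
    found = date(2019, 1, 1)
    for level in SLA_DAYS:
        print(level, remediation_due(level, found).isoformat())
```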
So jumping in: what does that actually look like in our environment? We are allowing our developers to actually use their desktop kind of like a sandbox. They’re able to pull in whatever components that they want to, new products, new versions… with the caveat that they do a scan on their desktop to ensure that they aren’t introducing those vulnerabilities, before they push their code changes to the source code repository.
We know we don’t live in a perfect world though. And whether it be malicious or not, those scans don’t always get done. So we wanted to make sure that we built that into our CI/CD pipelines also. As TJ said earlier though, “Everyone’s a snowflake.” So how do we incorporate that in?
No matter what architecture, even if teams are using the same architectures as each other, we know their build processes are never going to be identical, there’s always going to be little changes here and there.
So we wanted to make sure that we came up with a solution that allowed our developers to actually use the same exact build processes that they were using. We’re just having them add an additional stage into their pipelines. Keeping it simple like this makes it a lot easier to get that adoption, and it causes a lot less disruption to the current processes.
Now the nice thing about the way that we’re doing our vulnerability scanning in our pipelines is that we actually built in a composition check, as we’re calling it. We know that vulnerability scanning can actually take a good amount of time. So it actually checks to see if there have been any new components added, or version changes, before it actually scans, so that way it’s not wasting the time if it’s not necessary.
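The composition check described here can be thought of as a comparison between the components recorded at the last scan and the components in the current build. The Python sketch below illustrates that idea under the assumption that components are tracked as a simple name-to-version mapping; how the real check gathers that information is not described in the talk, and `scan` is a hypothetical callback for the actual scanner.

```python
def components_changed(previous, current):
    """Return names of components that are new or whose version has changed.

    previous and current map a component name to its version string.
    """
    return sorted(
        name for name, version in current.items()
        if previous.get(name) != version
    )


def maybe_scan(previous, current, scan):
    """Run the (slow) vulnerability scan only when the composition changed."""
    changed = components_changed(previous, current)
    if not changed:
        return "skipped"
    scan(changed)
    return "scanned"


if __name__ == "__main__":
    last_scanned = {"commons-io": "2.6", "jackson-databind": "2.9.8"}
    this_build = {"commons-io": "2.6", "jackson-databind": "2.9.9", "guava": "27.0"}
    print(maybe_scan(last_scanned, this_build, lambda names: print("scanning", names)))
```

In this sketch a removed component does not trigger a scan, which matches the description of checking only for new components and version changes.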
Now building it into our pipelines like that, it also allows us to integrate with a ticket-tracking system. This is not only going to add visibility and ownership to the teams who are making these changes, and potentially introducing vulnerabilities, but it also allows our information security teams to have awareness on what’s going on within our entire environment.
So that being said, everything’s not always going to be perfect the first iteration. We’re always going to be finding new things that we didn’t think of before; those surprises. And even with all the planning that we do, there’s always going to be pain points, there’s always going to be unknowns. So one of the things that we’ve really focused on is making sure that we just start somewhere; get it out in the environment, because we know that any security that we can implement at any time is going to be better than having no security at all.
It’s not always about knowing what the final implementation’s going to look like, but just making sure that you really understand the tools that you’re implementing.
So, what has made it so hard for us to implement?
It’s about getting that buy-in; this is one of the most important points. Because you’re really only as secure as your weakest link. And getting that buy-in, how have we been getting it? We’re partnering with our teams. First adopters are key to being successful. As TJ was talking about the orchestration jobs before; if we had been working in a black hole and been doing that on our own, we would not have been able to be successful. The teams that we work with, they’re ultimately the experts on their exact build processes. So by working with them, it allows us to be more successful.
We also know that everyone’s time is precious. We want to try and entice them with something. And that helps out a lot. So looking back at the composition check and building the vulnerability scan into their jobs: by adopting this process, our teams are able to go from waiting over a month, as I said earlier, to get new components into their builds, to getting them almost instantaneously. So they get the win there.
And finally, it’s just really making sure that we show our teams exactly how they can be successful and show them what success looks like in order to get things going.
We also are making sure that we’re staying in constant communication with all of our teams. We actually use road shows, we go out and we actually make sure that the teams that we support actually know what all of our current offerings are, and any ideas that we have coming down the line.
We use this to sell it to them, show them how it benefits them, because ultimately, everybody has to be onboard, and willing to adopt these new best practices, in order for us to be successful.
So, kind of wrapping things up then: what does security look like with the full CI/CD adoption? It’s making sure that security is a part of everybody’s job. It’s not just Development, it’s not just Security; but it’s Operations, it’s Testing, it’s everything. This is really what full DevSecOps adoption needs to look like in an organization. Because we really want to make sure that all of our processes are secure by design.
Just to let anybody else… we’ve told our story last year, we’re trying to tell our story now, just so everybody knows: our goals for 2019, we’re really focusing on trying to integrate with our infrastructure teams for true full stack integration. This includes server, environment, and network provisioning, and really this is all due to the increased velocity that we have with infrastructure changes within our SDLC.
That’s all I have. Thank you guys for your time, we appreciate it, and we’ll be available for questions later. Thank you.