DevOps Automated Governance

John Willis

This presentation is intended to guide organizations on implementing an automated process for tracking governance throughout the deployment pipeline by providing a reference architecture. A sample use case is also provided to further reinforce these best practices. Ultimately, a DevOps automated governance process can give organizations assurance that the delivery of their software and services is trusted.

VIDEO TRANSCRIPT

Hello, it’s John Willis. We’re going to cover a topic called automated governance, and this will be an overview. I work for Red Hat. About six months ago I joined a team called the Global Transformation Office with Andrew Clay Shafer — he’s the guy on the left — then Kevin Behr, then me (the short guy as you move to the right), and then Jay Bloom. We’re trying to figure out what the next ten years of transformation should look like. We’ve written a number of books: I co-wrote The DevOps Handbook and wrote Beyond the Phoenix Project, Kevin was the co-author of The Phoenix Project, and Andrew was one of the authors of Web Operations and did some work on site reliability… so, as Andrew likes to say, we wrote some books. That’s enough of my intro. I wanted to talk about this idea of automated governance and how I…

how I got involved in this. It’s really a number of things, but I’ll start with the place where I was able to at least galvanize this idea into a paper, and I’ve done some further papers since. Gene Kim of IT Revolution, the author of The Phoenix Project, invites about 40 of us to Portland every year, and we’ve been doing this for seven years. We work on these papers that become forum papers, little e-books. 2014 was the first year I did it, I think, and I’ve done it every year since. Overall, counting this year — we just finished about four or five that will be coming out this summer — it will be about 30 books over seven years. Some years I’ve shared the projects I worked on; many years I was just a floater across several of the books, going back and forth between the different projects.
But the reason I’m pointing this out is that back in 2015 there was a working group that I worked with indirectly on an e-book — all of these are Creative Commons, and you can download any of them from the IT Revolution forum papers. It was called “An Unlikely Union: DevOps and Audit,” and that was, at least from the DevOps and IT Revolution perspective, the first conversation about how audit and what we do in DevOps should be more tightly connected. Segregation of duties, those types of things, to sort of prove out — and we talk about this in The DevOps Handbook too. We talked about how certain patterns, if you use DevOps to deliver software, could actually apply to some of these compliance requirements and related ideas. And then in 2018 there was a great paper — I didn’t write this one, but I was sort of an advisor working around it — called “Dear Auditor,” and it was great.
It was really an apology letter to auditors. Like, “Hey, we’re really sorry, you know, we should have done this, we should have talked to you about this.” And it’s not only the apology letter; it also has a whole checklist of things we promised to do, and there’s actually — if you Google “Dear Auditor” — a GitHub project on this. It was really good, and the thread running through it was audit and DevOps.
A couple of years ago I started getting this idea that we could do a better job with audit from the pipeline perspective. So last year, in 2019 — these forums usually start around April and we usually end up publishing the work toward the end of the third or fourth quarter — I got a team together to focus on what we call DevOps automated governance. It’s a reference architecture, and I’ll explain that in a little more detail. A couple of things in the prior two years got me really interested in this. Capital One has been heavily involved in these forum papers for years, and over the years we’ve had these discussions about pipelines and gating in the pipeline. In 2017, Capital One wrote a really interesting paper focusing on DevOps pipelines, and there was a subsection in there called “Creating Better Pipelines” that talked about this idea of gates — they call them the 16 gates. These were things that allowed service teams to sort of bypass centralized authority like CABs. In other words, they could get a sort of auto-deploy allowance if they could evidence things like: it indeed came from source control, it had an optimal branching strategy, static analysis, all those things.
I think I saw a presentation last year where they said they’re up to like 30 now. But in conversations about this, the question became: it’s great to have these gates, but since you have them anyway, couldn’t we turn those gates into evidence? So over a year or so we kept having this conversation, and that started me thinking about this. Around the same time as that article, Google announced an open source project called Grafeas, and it turns out this was a project they had actually been using internally for audit and governance. It had a lot of features — a lot of them, I think, still haven’t really been utilized — but one in particular was attestations: a metadata attestation store, attestation meaning evidence. I started thinking about it, and actually Kit Merker, a good friend who worked for JFrog, had approached me about why I wasn’t thinking about using Grafeas, and we had a great conversation. All of that led me up to last year, trying to get a group of people together and saying, “Hey, could we actually create a reference architecture around this automated governance idea?”
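To make the “gates become evidence” idea above a bit more concrete, here is a minimal sketch — not from the Capital One paper or the reference architecture — of a pipeline step that evaluates a few gates and records each result as an evidence record instead of a bare pass/fail. The gate names, fields, and thresholds are all hypothetical.

```python
import json
import time

# Hypothetical gate checks; in a real pipeline each of these would call out to
# source control, a static-analysis tool, a test report, etc.
def gate_results(build_info):
    return {
        "from_source_control": build_info.get("commit") is not None,
        "static_analysis_clean": build_info.get("critical_findings", 1) == 0,
        "unit_test_coverage_ok": build_info.get("coverage", 0.0) >= 0.80,
    }

def gates_to_evidence(build_info):
    """Turn each gate outcome into an evidence record rather than a simple go/no-go."""
    records = []
    for gate, passed in gate_results(build_info).items():
        records.append({
            "gate": gate,
            "passed": passed,
            "subject": build_info.get("artifact", "unknown"),
            "timestamp": time.time(),
        })
    return records

if __name__ == "__main__":
    build = {"commit": "abc123", "critical_findings": 0,
             "coverage": 0.87, "artifact": "payments-service:1.4.2"}
    for record in gates_to_evidence(build):
        print(json.dumps(record))
```

The point is only the shift in mindset: the same checks that gate a deploy also emit durable records that an auditor could read later.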
You know, these terms are very overloaded, but we had a specific charter we wanted to accomplish in this particular effort. So we actually published it — it’s out, it’s Creative Commons, and as you saw earlier it was published last year in September. It was Mike Nygard from Sabre — you might know Mike Nygard from “Release It,” the inventor of the Circuit Breaker pattern — Tapabrata Pal over at Capital One, Steve Magill, Sam Guckenheimer, who has been heavily involved in most of Microsoft’s infrastructure, myself, John Rzeszotarski of PNC, Dwayne Holmes, who runs large Kubernetes infrastructure at Marriott, and Courtney Kissler over at Nike. We all got together for a couple of days and tried to hash out whether we could actually put this into a paper as a reference architecture — like, could the end model be a Java microservice, in a container, that gets a go/no-go into Kubernetes, using Grafeas and actually Kritis, which goes along with it. One of my goals coming into this was changing the language of how evidence is created for auditors. Typically, auditors come into a site and spend somewhere in the neighborhood of 30 days working with the organization, looking at changes, and the evidence they end up with is actually subjective. Evidence and attestation, as far as this conversation goes, mean the same thing. So the idea was changing subjective evidence — subjective attestations — into objective evidence, objective attestations.
So currently, people create change records in most large enterprises, and then it’s a human discussion: Sue is going to do these things, Bob will read these things, maybe Sally wants Sue to add a couple more lines. These are all complex systems, so it’s usually a human telephone game trying to transcribe the complexity of a change. And then the auditor comes in and sees this record — this subjective discussion — as the evidence and tries to make sense of it. It’s a lot of toil, it’s a lot of disconnectedness, and it just doesn’t have high efficacy. So, could we actually change that? Could we make this evidence automated — built into the pipeline itself with no human intervention — in a digitally signed mechanism? It becomes a set of signatures that basically form one immutable linked list of signatures. And so, when we sat down and thought about writing this thing, the objectives were really threefold. One was shortening audit time: could we turn 30-day audits into half-day audits?
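As a rough illustration of what a “digitally signed, immutable linked list of evidence” could look like — my own sketch, not the implementation from the reference architecture — each attestation below is HMAC-signed and carries the hash of the previous entry, so editing any record breaks the chain. The key handling and the record fields are simplified assumptions.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"pipeline-signing-key"  # in practice a per-pipeline secret or KMS-managed key

def append_attestation(chain, stage, evidence):
    """Append a signed attestation that links back to the previous entry's hash."""
    prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
    body = {"stage": stage, "evidence": evidence, "prev_hash": prev_hash}
    payload = json.dumps(body, sort_keys=True).encode()
    body["signature"] = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    body["entry_hash"] = hashlib.sha256(payload).hexdigest()
    chain.append(body)
    return chain

def verify_chain(chain):
    """Recompute hashes and signatures; any edited entry invalidates the whole list."""
    prev_hash = "0" * 64
    for entry in chain:
        body = {"stage": entry["stage"], "evidence": entry["evidence"], "prev_hash": prev_hash}
        payload = json.dumps(body, sort_keys=True).encode()
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["signature"] != hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest():
            return False
        prev_hash = hashlib.sha256(payload).hexdigest()
        if entry["entry_hash"] != prev_hash:
            return False
    return True

chain = []
append_attestation(chain, "build", {"unit_tests": "passed", "coverage": 0.87})
append_attestation(chain, "package", {"signed": True, "tool": "notary"})
print(verify_chain(chain))  # True until any entry is modified
```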
The idea was that, instead of the subjective discussions and comparing screen prints, could we just show an immutable list — we don’t really want to call it blockchain, but it’s sort of based on a blockchain model — an immutable list of evidence of a change with no human interaction, so the auditor can literally just look at the list and go next, next, next. The second was: could we increase the efficacy? The truth of the matter is, I’ve spent a lot of time with CISOs and interviewed a lot of people in organizations over the last three or four years, and most of the audits they have are what I call security and compliance theater. And that’s before you even get into modernization with cloud native and microservices, where it’s way worse. The risk profiles and the attestations they think they need in this rapid deployment structure are just completely disconnected. So you find in most organizations that the efficacy of an audit is extremely low.
So could we increase the efficacy of an audit from, like, 20 percent to the high 90s? And last but not least, if we could do this, we could make a credible argument for moving away from a change advisory board, or CAB, or centralized authority — think back to the original Capital One article. So what we did is we sat down and — I won’t go into all the gory details because it’s in the reference architecture; I just want to expose you to the ideas, and if you’re interested you can download the Creative Commons copy from ITRevolution.com — we broke this down into seven stages. The idea was not to focus on how people perceive the pipeline but to create boundaries for attestations. Remember, every time I say attestation, I mean evidence. So, what were the logical groupings or boundaries for attestations? We came up with development, build, and package each as its own stage, plus non-prod and prod. And the reason we called out dependency and artifact separately is that, while they sit in the lifecycle of the traditional CI/CD path, dependency management and artifacts have their own life cycles, sort of asynchronous to it. We wanted to make sure that was understood.
And then we defined what we called common controls and common actors — the controls being, basically, the attestations. Remember, it was Nike, Capital One, PNC, Marriott, and Sabre. When all was said and done we had about 75 attestations. Now, I don’t think any one company, organization, or service would use all of these, but it was a reference artifact to show what could be accomplished. So you can look — and again, I’m not going to go into all of it — at the source code stage: things like peer review on a pull request, unit test coverage, clean dependency scanning. At the build stage: unit testing, linting, immutability from an input/output perspective. And again, I’m going fast on purpose, because if you’re interested, every one of these is actually spelled out — there are two or three pages on every one of the control points. In dependency management: license checking, approved external sources, security checks, aging (so not allowing stale artifacts), approved versions. In the package stage: things like Notary, signing with a signature, or, if you’re going to apply metadata from something like ZooKeeper at operation time, making sure that metadata can’t be hacked or man-in-the-middled. Again, there’s a lot more here — like I said, about 75 in all. In the artifact stage: retention periods, immutable artifacts. And then if you look at prod versus non-prod, there are a couple of subtle differences, such as allowed configuration.
So, when you’re dealing with Kubernetes and containers, there are a lot of opportunities for adversaries — or for creating openings for adversaries — through misconfigured definitions. So make sure you’re capturing those kinds of things as artifacts for evidence. Then what we tried to do was go through and identify, as an example — not a recommendation, just an example — where the control points would be and where they would come from: things like SonarQube or Checkmarx, and of course JFrog Xray. But again, none of these were really recommendations; it was really just a…
it was a sense-making exercise for us to ask, “Does this make sense?” and then, “Could we go through a quick list of where all these attestations might come from?” As the project went on, we didn’t get through everything we wanted to accomplish. We finally ended up with a sort of reference architecture — a backbone architecture with Grafeas and Kubernetes — and we wanted to do this simple _, which sits in the admission controller; it was really very simple in notation. But what got really interesting is that while we were writing the paper — you know, at night we’d go to dinner and all the people working on the paper would sort of hang out — we started thinking: if you could create this DevOps automated governance architecture, could you start thinking about templates, like advisable templates, for these things?
And if you could do that, could you actually create human-readable code to apply those templates to? It was really just a dinner conversation, but one of the banks went out and went all in on this. They put a bunch of resources into what we’re now calling Policy as Code. So this summer we’re actually going to start a second version of this document, where we’re really going to focus on policy, and I’m going to show you some of the stuff I’ve been involved in as…
an advisor — one of the banks has really taken this to a whole new level, sort of post-reference-architecture. So, here are some of the principles for governance: human readable, platform agnostic, durable, condition parameters — again, I’ll let you read these from the slides. But here’s the thing: what this one company did is go in and create what they call pack files — policy-as-code files. And here’s the interesting part: these are human-readable files that the policy people are now engaged in. So what happens now is that service owners who want to go through this newly defined way of processing things — and there are a lot of advantages, so people want to do this — have to go to the policy people. This was built by design with the policy people, and together they come up with a human-readable pack file definition that is associated with that service, and the things we talked about as attestations actually get defined there. So if you look at pipeline versioning: every application or service has to have a mnemonic for the service, a component ID, and a version.
And then further down you have things like unit test coverage and pull request review, so these things actually get written in collaboration between the service owner and the actual policy people — and, by the way, the policy people now help design the requirements. These become immutable because they get stored in source control along with the artifacts used to deliver the service. And you have the DevOps automated governance architecture — I’ll show you how it’s been advanced here in a minute. So what happens is that every time you do a merge, you pull in all the evidence of the commit, you pull in the pack file as it existed at that time along with all the other artifacts, and that becomes the immutable evidence that ends up — in this case — in Grafeas as an attestation store.
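Here is a hypothetical sketch of what that merge-time check could look like. The pack file is shown as an inline dictionary, since the bank’s actual format isn’t reproduced here; the field names, controls, and thresholds are invented for illustration. The evidence gathered for the commit is validated against the pack file before everything gets stored as an attestation.

```python
# Hypothetical pack file: the policy people declare what every service must evidence.
PACK_FILE = {
    "pipeline_versioning": {"require": ["mnemonic", "component_id", "version"]},
    "unit_test_coverage": {"minimum": 0.80},
    "pull_request_review": {"minimum_reviewers": 1},
}

def evaluate_pack_file(pack, evidence):
    """Return (control, passed) results for a merge's evidence bundle."""
    results = []
    required_fields = pack["pipeline_versioning"]["require"]
    results.append(("pipeline_versioning",
                    all(f in evidence.get("service", {}) for f in required_fields)))
    results.append(("unit_test_coverage",
                    evidence.get("coverage", 0.0) >= pack["unit_test_coverage"]["minimum"]))
    results.append(("pull_request_review",
                    evidence.get("reviewers", 0) >= pack["pull_request_review"]["minimum_reviewers"]))
    return results

merge_evidence = {
    "service": {"mnemonic": "PAY", "component_id": "payments-api", "version": "1.4.2"},
    "coverage": 0.86,
    "reviewers": 2,
}
for control, passed in evaluate_pack_file(PACK_FILE, merge_evidence):
    print(f"{control}: {'pass' if passed else 'fail'}")
```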
So now you’ve got the best of both worlds. You don’t have policy people handing spreadsheets to infrastructure and service people, and service people having to interpret those spreadsheets into things that maybe end up gated, with very little of it evidenced. Now you have everything — the gating and the evidence — built in as one. And this is where it gets really cool; I’ll show you the architecture one bank is using. They’re actually using OPA and Rego now. If you remember, in that pack file there was pipeline versioning, so that all services had to have a mnemonic, a component ID, and a version. So basically, the pack files become sort of interface definitions — the collaboration between the risk people and the service owners — and then the implementation can be something like Rego. Here’s an example of actually using the pack file and Rego to control, in Kubernetes, whether something is allowed from a policy perspective.
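Since the slide itself isn’t reproduced here, a minimal sketch of the enforcement side: OPA exposes a REST data API that returns the result of evaluating a loaded Rego policy against an input document. The sketch assumes an OPA server running locally with a policy at a package path I made up (admission/allow); the evidence payload is likewise hypothetical.

```python
import requests

OPA_URL = "http://localhost:8181/v1/data/admission/allow"  # package path is an assumption

def admission_decision(evidence):
    """Ask a locally running OPA whether this deploy's evidence satisfies the policy."""
    resp = requests.post(OPA_URL, json={"input": evidence}, timeout=5)
    resp.raise_for_status()
    # OPA's data API wraps the policy result in a "result" field;
    # if the rule is undefined, "result" is absent and we treat that as a deny.
    return resp.json().get("result", False)

evidence = {
    "service": {"mnemonic": "PAY", "component_id": "payments-api", "version": "1.4.2"},
    "attestations": ["pipeline_versioning", "unit_test_coverage", "pull_request_review"],
}
print("allow" if admission_decision(evidence) else "deny")
```

In a cluster, an admission controller (Kritis or an OPA-based one) would sit between the deploy request and Kubernetes, but the decision flow is the same idea.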
This gets really, really cool. In the next version we’re even talking about: if you have all this, you could do policy error-budgeting, right? So here’s a sample architecture — it’s a little more complicated than this, but you basically put Kafka on both ends. Actually, right now we’re trying to figure out (I say “we” in an advisor mode) whether we could put a serverless architecture in between, so when you get to some level of volume it’s Kafka, then a serverless implementation like Knative, then Kafka, and then into this enforcer and evidence engine, and the enforcer can integrate with OPA. It’s just really cool. Hopefully in the next reference architecture definition we’ll get the open source for some of this stuff, but either way we’ll have a pretty robust reference architecture of how this works. Another thing that’s really cool about this: one of the problems you have in most enterprises today is that, in my experience, very few enterprises…
one of the problems — let me step back here — one of the problems you have with audit in the enterprise is that most audits are based on a traditional service management model, where every change has to be associated with a service owner, and service owners traditionally are associated with a CMDB CI, a configuration item.
I can’t tell you that I’ve visited, in the last three or four years, any large institution that told me honestly their CMDB was more than 25 percent accurate… So the whole model of your evidence in an audit is based on this sort of false idea, and not based on what’s actually happening in, say, Git or GitHub, across to a Jenkins or a build model. The beauty of this model is that in order to play, you have to actually define things, so you really start creating an emergent configuration management database: you have to define a mnemonic, a component ID, and a version, and by definition all the services that are building up and gaining more value in this process are actually becoming the emergent CMDB. Not only that, you get continuous audit replay, because — remember I said earlier — these artifacts are immutable, and you know exactly what version of a pack file, and in fact of the Rego files as well, got pushed at that time, so you can always go back and replay.
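One way to picture the “emergent CMDB” point: because every evidenced change already carries a mnemonic, component ID, and version, you can derive a service inventory directly from the attestation store instead of maintaining it by hand. A toy sketch, with made-up record fields:

```python
from collections import defaultdict

# Each evidenced deploy already names the service; these records are invented examples.
attestation_records = [
    {"mnemonic": "PAY", "component_id": "payments-api", "version": "1.4.1", "stage": "prod"},
    {"mnemonic": "PAY", "component_id": "payments-api", "version": "1.4.2", "stage": "prod"},
    {"mnemonic": "INV", "component_id": "inventory-svc", "version": "2.0.0", "stage": "non-prod"},
]

def emergent_cmdb(records):
    """Build a service inventory keyed by mnemonic/component straight from the evidence."""
    inventory = defaultdict(list)
    for record in records:
        key = (record["mnemonic"], record["component_id"])
        inventory[key].append((record["version"], record["stage"]))
    return dict(inventory)

for (mnemonic, component), versions in emergent_cmdb(attestation_records).items():
    print(mnemonic, component, versions)
```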
And the other thing — remember what we talked about with error-budgeting — now you can actually analyze the continuous evidence (continuous compliance if you want, or what we’ve been calling automated governance) for the things that fail, right? One more subject I want to talk about: that paper came out, and I’ve been talking about it for about a year or so. There’s a group out in New York called the Open Networking User Group, ONUG, run by a guy named Nick Lippis, and their board members are some of the largest banks in New York and really in the world. One of their focuses for the year has been software-defined networking, SD-WAN, and they’ve been moving more into DevOps. Some of the board members had seen this paper we did, and Nick Lippis knew me, so he reached out and asked if I’d want to help drive a cloud automated governance effort based on that paper. So we got together, and this time — and this is really cool, because of the people involved — it was FedEx, Kaiser Permanente, Cigna, and JPMorgan Chase, and I think…
yes, JPMorgan Chase, and then indirectly Don Duet, who was the VP of Engineering at Goldman — he’s independent now. We focused on the relationship of attestations from the cloud providers to the tenant. I’m going through this really quickly; again, this is a Creative Commons paper available from ONUG. It’s pretty easy to find and you can download it — you do have to fill out some names and things like that, but it’s a free book. This paper got a lot of attention: about a month ago, a little less than a month ago, the Wall Street Journal wrote an article about it. The work was sponsored by FedEx, Cigna, and Kaiser Permanente, and the three CSOs who sponsored it were interviewed by the Wall Street Journal about the work we did in this paper. So it’s really significant work. I’ll try to summarize it; I’m not going to go into gory detail — like I said, I’ll leave it to you, the listener or reader, if you’re interested. My contact information will be all over this presentation if you want to talk about it.
If anybody knows me, I love talking about this stuff and exploring it, but… our goal in this paper — we had three goals — was really focused on the cloud providers showing evidence back to the tenants, the consumers of those clouds. One thing I want to be clear about: initially, when we sat down, everybody said, “You’re crazy if you think all three cloud providers — or all five, or however many there are — are actually going to take your advice.” Even though we had about 25 billion dollars in spend represented on that team, I said from the get-go, “Let’s not worry about trying to convince Amazon or Google or Microsoft — that shouldn’t be the focus of our paper. Our focus should be to convince another hundred companies that together represent a trillion dollars in asset buying power.” And if they all agree with this paper, then maybe we will. By the way, we’re doing a second version this summer and already two of the top three cloud providers are in — so, mission accomplished.
So here are the three things. One: could we get a unified format? That means each cloud provider agrees to create a unified, normalized version of some type of signature that tells you nothing has changed from the way they did, sort of, _. It’s a signature that tells you the known state — they don’t have to reveal any of the IP, how they do it, any secret sauce. Keep it really simple. I know it’s not this simple, but imagine a checksum event that told you the posture of anything you’re looking at hasn’t changed. At scale, a change to that boot sequence or boot infrastructure could actually create new opportunities for adversaries, and, for example, the first principle of any incident review is: go back to the last change.
So imagine if we could keep, normalized across all the providers, an event that tells us — every time we do a new deploy, or once a day — that nothing has changed in that sequence. Or if something goes wrong, we can check that sequence. Again, there’s a lot more detail in the paper. The second point is: since all cloud providers see every ingress request from a tenant, a consumer — all the API calls — could they actually expose that back to the tenant in some normalized format? The reason you’d want this is that today a lot of cloud consumers scrape logs — different logs for different providers — and there’s no normalized way to do it even inside large organizations, let alone having to scrape logs and do all those sorts of things, which is brittle; you know, the logs change.
Could we just say: since you’re seeing everything, couldn’t you spit that back through an event gateway — something we could process through Knative or some type of event gateway — so we get a firsthand look at what you see? Then, for things like server-side request forgery, or some runaway activity, or somebody who’s not following the rules and hasn’t put in the right metadata, it would be much easier to have a single point of control to identify any anomalous behavior that could actually get us in trouble. And last but not least: look for a normalized structure from a security framework. All the different products have security frameworks, but they speak different languages and they don’t really talk to security professionals — again, a little more on that in the paper.
This is the model for normalization. Just winding down here — the last thing is that one of the things we tried to do was create pseudo-code based on Gherkin. The earlier work was based on YAML, so I wanted to explore a different model and see if we could do something with Gherkin. Here I’ve got a couple of examples based on those two models. One is boot integrity — basically a Gherkin model of checking whether that intended, checksummed boot-sequence hash has changed. The other is one where, if we were receiving all the ingress traffic to a provider and listening in on it, we could look for metadata that should have been there; if it’s not, we could go looking for nefarious actors like a crypto miner and things like that.
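Since the slides’ Gherkin isn’t reproduced here, this is a rough Python rendering of the same two checks — comparing the current boot-sequence hash against a known-good baseline, and flagging ingress API calls that are missing required tenant metadata. The event shapes, field names, and required tags are my own assumptions, not the ONUG paper’s format.

```python
import hashlib

def boot_integrity_ok(baseline_hash, boot_measurements):
    """Compare a checksum of the reported boot sequence against the known-good baseline."""
    current = hashlib.sha256("|".join(boot_measurements).encode()).hexdigest()
    return current == baseline_hash

def flag_suspect_ingress(events, required_tags=("cost_center", "owner")):
    """Return ingress API calls missing required metadata, e.g. a possible rogue crypto miner."""
    return [e for e in events if not all(tag in e.get("metadata", {}) for tag in required_tags)]

baseline = hashlib.sha256("|".join(["firmware:1.2", "bootloader:7.0"]).encode()).hexdigest()
print(boot_integrity_ok(baseline, ["firmware:1.2", "bootloader:7.0"]))  # True: nothing changed
print(boot_integrity_ok(baseline, ["firmware:1.2", "bootloader:7.1"]))  # False: sequence changed

ingress = [
    {"api": "run_instances", "metadata": {"cost_center": "42", "owner": "payments"}},
    {"api": "run_instances", "metadata": {}},  # no metadata: worth a closer look
]
print(flag_suspect_ingress(ingress))
```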
And I just wanted to end with — I talked about Kit Merker, I talked about JFrog, and I really do love the JFrog family; they invite me to speak, so I guess that’s why I like them and I guess they like me — this book, Liquid Software, which talks about a lot of the principles of automated governance, this idea of creating trust in the component pipeline. It was another “I think I’m doing the right thing” moment a couple of years ago when I read it and it talked about Grafeas; it helped me see that more than one person was telling me this is needed, and everything fell into place. It’s just a fabulous book. I always try to end these presentations by saying it’s a quick read, and it really sets the mindset perfectly. Anyway, thank you so much for listening. I hope you enjoyed it, and please, if any of this interests you, I’m pretty easy to find — reach out to me. I love to have discussions about this stuff.
