Disaster Recovery and You

Valarie Regas
DevOps Engineer

Does your team have a disaster recovery plan in place? What even is a DR plan?!

Bad things happen in this world: tornadoes take down data centers, laptops break, office buildings get damaged.

Join me for a practical guide to anticipating and mitigating disaster.

Your boss will thank you!

Video Transcript

Hello SwampUP. My name is Valarie Regas and I’m here today to talk to you about disaster recovery. This will not be an exhaustive talk. Obviously, it’s a 25-minute session, but it should give you a pretty good jumping-off point to get started and just some things to think about when you’re writing your disaster recovery plan. I’m going to start out by telling you a little bit about me. I am married to my best friend, Michael, and we have three awesome kids. He has been in software forever, and he encouraged me to go to a coding boot camp when I wanted to switch careers. And since that boot camp, I did an internship in DevOps, and just recently started at Salesforce in one of their DevOps teams.

 It’s a lot of fun. A lot of fun. I have the coolest sisters in the whole world. And I’ve done Judo for 20 something years now. And it’s my love. So I make a lot of judo references, just let me through it… Fun fact about me that bottom picture with the chick in the fancy dress, that’s my sister, and that’s me officiating her wedding.

 I’m an ordained minister, if anyone would like a secular wedding performed, I’m your girl. So enough about me.

 Let’s talk about what is disaster recovery planning? Like, you hear this thrown around a lot. At least I did and I wasn’t necessarily 100% sure on like, what really needed to be covered, What is this supposed to be? So let’s get into it. It’s all about anticipating the worst things that can happen, and mitigating them ahead of time.

 Now, this subject really speaks to my soul. In fact, when I was assigned to create a DR Plan at my last company, several people basically said, Valarie, who did you irritate to get stuck with this? And I was like, man, I have complex Post Traumatic Stress Disorder, thinking about stuff that can go wrong is my jam. So because I live my life constantly thinking about what could happen, and how can I, you know, preemptively fix it, this is great. I love this, this is just how my brain works. So that was exciting when I realized, all it is is thinking about what could go wrong and what do we have to have to get our product to the end user, like what cannot go wrong for this process to be, you know, smooth and 100% uptime, which, you know, that’s the goal.

 Staying prepared. Cool. So when it comes to your DR plan, you know, we’re going to talk about what goes in it but keep in mind, a plan that no one sees, might as well not exist. So it needs to be widely available, and depending on your company size, your team size.

 Are you writing this plan just for one little small team in a huge organization? Or are you a startup, and this is for everybody? That’ll sort of determine how you make it available to everyone. And once it is available to everyone, you want to have frequent drills, and we’re going to talk a lot more about this, like what this looks like, with different scenarios and we’ll get into that but bear in mind, an untested plan might as well not exist. And again, I’m going to say this a lot of times, the whole point of this is to think about how to fix things before they’re broken. And where do we start?

 Where do we start? Because I mean, I was a little bit overwhelmed. I was a little bit overwhelmed by this whole process. So we’re going to talk about your product, IT issues, server issues, these things, because I’m guessing for most people, if you’re working on a DR plan, it’s mostly going to be about getting your product to the end user.

However, when I wrote my plan, I was, you know, I wasn’t at a startup, but I was at a subsidiary of a very large company in such a way that we got to operate like a startup. And there were just several more, I don’t know, just things outside of the software that I wrote into my plan, which might be pertinent to you. So we are going to talk briefly about some other things. If you don’t have an office manager who’s responsible for the DR plan for the more, you know, office-related things, you can be a value add, stand out on your team, stand out to your boss. And, you know, just think about those things too.

 So people first. Anyone who knows me in real life, people first. Who’s responsible?

 This is a really important section because even if you have this amazing plan that’s thoroughly tested and rehearsed, and everyone knows it, and everything is great. You have to designate who does the things and what do they do and so we’re going to get into that a little bit. You want a DR team. And so you can choose based on, again, your company, your role, your job, what are you writing?

 Do you want to delegate responsibilities by role by name? And so what this might look like, if you are at a smaller company with very little turnover, and a very tight knit team and no one’s going anywhere for a while, you might say, hey, Rob, you’re going to be responsible for this, and Sarah you’re going to be responsible for this and Gabriel you’ll be responsible for this. Cool, that might work for your team. Or you might say, engineering manager is responsible for this.

 Lead back end engineer is responsible for this. It just depends on your team, your company and what works for you. But you want to be specific.

Much like, you know, if you’re out in public and someone’s having a heart attack, you never say “Someone call 911.” You make awkwardly specific eye contact with one person and you’re like “You, call 911,” or no one will do it.

By the way, I have a psych degree. Yeah, same thing with a DR plan: you always want to be very specific on who does what, and designate a backup, because people get sick, people leave before the plan gets reworked. There’s always a reason you need a backup.

 Now, something to think about with your backup is make sure that everyone in the plan has access to everything they might need to fulfill their responsibilities. What does that look like? Maybe your backup doesn’t normally need a specific, you know, permission set within your cloud provider. Well, if they’re the backup for your DR plan, they need to have whatever permissions are necessary to fulfill their responsibilities. If we’re talking more like office logistics type things, at my old company, there were a couple of rooms that only maybe three people had keys to. Well, if you’re designating a backup who might need to go into that room, they should probably get a key card, or at least there should be an emergency key card in the office somewhere. Think about those things, make sure that whoever you’re telling to do things has the tools they need and the access they need to do those things.
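As a rough illustration of that access check, here is a minimal Python sketch. It assumes you keep a list of who is primary and backup for each task and a record of what each person can currently get into. The role names echo the ones mentioned earlier; the access labels and the `access_gaps` helper are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Responsibility:
    task: str
    primary: str
    backup: str
    # Permissions, keys, etc. needed to carry out the task (hypothetical labels).
    required_access: set = field(default_factory=set)


def access_gaps(responsibilities, granted):
    """Map (person, task) to the access that person is still missing."""
    gaps = {}
    for r in responsibilities:
        # The backup needs the same access as the primary.
        for person in (r.primary, r.backup):
            missing = r.required_access - granted.get(person, set())
            if missing:
                gaps[(person, r.task)] = missing
    return gaps


plan = [
    Responsibility("declare disaster", "engineering manager",
                   "lead back end engineer", {"pager admin"}),
    Responsibility("run site switch", "lead back end engineer",
                   "devops engineer", {"cloud admin", "server room key card"}),
]
granted = {
    "engineering manager": {"pager admin"},
    "lead back end engineer": {"pager admin", "cloud admin", "server room key card"},
    "devops engineer": {"cloud admin"},
}

gaps = access_gaps(plan, granted)
for (person, task), missing in gaps.items():
    print(f"{person} cannot fully cover '{task}': missing {sorted(missing)}")
```

Here the backup for the site switch turns out to be missing the key card, which is exactly the kind of gap you want to find before the emergency.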

 Leadership, who’s going to do this? So for every different sort of scenario, or, you know, potential catastrophe, you want to be very clear about who declares there’s a disaster, who is responsible for saying, okay, we’re putting this into effect now, and that can look different based on, again, team size, company size, and you know, how sort of broad your DR Plan is, but be very specific.

 You want to be… And again, this can be by role, it can be however you want to designate it, but you need to make sure people know who’s responsible for saying, okay, this is a disaster, I’m in charge, I’m running it, let’s do this.

Who owns each step? Again, this goes right back to… In a group of people, if you say, hey, group of people, perform a task, more than likely no one’s going to. But if you say you do this, you do this, you do this, you’re much more likely to get responsiveness, so be very, very granular in who does what, or at least who’s shepherding what process and making sure someone does it. And then this is an interesting one that had not actually occurred to me when I started writing mine, but who talks to the press and when? So, again, depending on the size of your company, your team and what you’re writing, you might want to consider at what point of outage, what point of, you know, users’ inability to access your product does someone need to make an announcement? And how is that going to happen? What will that look like?

 Do you pre-write the announcement, so it can just be modified at the time and who’s responsible for it?

 Just something to consider. So, whatever plan you’re writing, and whatever’s in it, if people aren’t communicating about it, that is a problem. So we’re gonna talk a little bit about this. I don’t know about you all but at different times in the last several years, I have been in situations where I’m like, do I Slack?

Do I use Google Chat? You know, am I using some other software? Are we texting, are we emailing? What are we doing? I’m guessing at your company, you have a lot of different means of communication. One of the things you might want to consider in your DR plan is what I’m just going to call a standard order of use.

 So basically, when things go wrong, we communicate first via our paging system, then Slack, then email, then cell phone, like, designate how everyone should anticipate communicating. And let’s say Slack goes down, you already know that secondary and then tertiary, etc. means of communication so people sort of know what to have open, what to pay attention to and where to look for their teammates.
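A standard order of use can be written down as nothing more than an ordered list. The Python sketch below uses the order given above; the `first_available` helper and the way a channel is judged to be "up" are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

# The order from the example above: primary first, then the fallbacks.
STANDARD_ORDER = ["paging system", "Slack", "email", "cell phone"]


def first_available(order: Sequence[str],
                    is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first channel in the agreed order that is currently usable."""
    for channel in order:
        if is_up(channel):
            return channel
    return None  # nothing works; the plan should say what happens then


# Pretend the paging system and Slack are both down.
down = {"paging system", "Slack"}
channel = first_available(STANDARD_ORDER, lambda c: c not in down)
print(channel)
```

With the first two channels down, everyone already knows to look at email next.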

 You don’t just want to think about your teammates, and employees at your company, you also want to think about when you’re writing your plan, things like service providers and vendors. So let’s say that the thing you’re writing is, you know, related to your actual office building, your waterline burst or something, right?

 Who owns your building? Who’s your maintenance person? Who’s responsible? You know, who’s the building manager who’s responsible for calling the plumber? Who are you, right? These are sort of things you want to write into your plan. So that if something happens, and it’s not necessarily on your end, but maybe it’s a Google Cloud problem, who do you contact at Google Cloud? Like, who are you supposed to reach out to?

Because you definitely don’t want to be going through standard customer service channels if there’s an actual emergency. So be thinking about that. Any dependencies that you have, and, you know, obviously, anyone providing you a service, they are a dependency, factor that in. And most importantly, on this one, update things periodically. And again, I’ve read in different places, you know, you update your plan every quarter, every month, whatever it is. There’s no right answer, I’m just going to say that, there’s no right answer.

 Look at your team, look at your company, look at your needs, and look at how often you actually have problems, and then come up with a schedule and stick to it, put it in the calendar to automatically ping, whoever’s in charge of updating it.

We’re going to update at this frequency and here’s a checklist of things to update. Big fan of the checklist. Just always keep that in mind. Okay, so let’s talk about it. Like, what happens? What are we trying to mitigate? And again, this is not an exhaustive list, but these are just some of the more frequent things that come up. I will mention that when I wrote this originally, I sent out a survey to all sorts of people in the community and just asked for their horror stories about things that went wrong, and so everything that we’re about to talk about has gone wrong for someone that responded. So we’re going to start with the hardware and products used that are third party. So you’re going to want to inventory everything, right?

 You’re going to want to know, not just… So let’s say you’re on a small team, you don’t want to say, we have 10 computers, you want to say we have four HPs, here are their models, bla bla bla… you know, we’ve got these six MacBook Pros, and here are their serial numbers, and here’s who we call if something goes wrong with our hardware, right? And so maybe you’re on the kind of team in the kind of company where you have an office manager, and if your computer’s completely borked, that’s who you talk to or maybe you’re in a very large organization and there’s a whole team of people that you have to reach out to in a specific way.

Knowing, you know, if you have on-prem servers, who made them, and what are their serial numbers, and who do you call if there’s a hardware issue? These are important things. And then you want to go ahead and think about if something solid breaks, do we replace it, right? So maybe you work at the kind of company where if your computer’s borked, then they scrap it, start over, and someone just sends you a shiny new one. Cool. And by the way, congratulations, your company’s doing great. Or maybe there’s a process for refurbishing or fixing, or maybe you mail it out and they mail it back to you.

If you are going to write into your plan replacement plans, go ahead and set a budget and talk to whoever you need to talk to about that so that you know what your limitations are, should something happen and where to go, again, if you’ve got an issue with an entire rack of servers like okay, how do we replace those? Who do we call? What are we doing?

Like, have a plan for that. Let’s talk about automatic failover and site switches. So this is probably the biggest chunk because I had a lot of horror stories that came in and talked a lot about, you know, we had this problem and we had a backup site but it didn’t automatically go over, we didn’t know what to do with the data, you know, as far as pausing and resyncing. And so, create a plan so that if something happens, you have a backup site and it’s pretty much automatic that you’re going to it. And by automatic, I mean obviously humans are going to be involved, but do as much as you can to mitigate the amount of time that your product is down, obviously.

If you have a mirror site that needs to be checked in on and updated frequently, go ahead and put into your plan a schedule for syncing your databases, for maintaining that mirrored site, that failover site. Also run frequent drills on site switching, right? So what this looks like is you designate who’s responsible for the site switch, you set up a schedule, they’re going to do it and you’re taking really good documentation on how it goes, anything that goes wrong, everything that goes well. And then have sort of a post mortem after the site switch on, okay, what do we need to improve? How can we make this faster?

How can we make this a little bit easier? Because when something truly bad happens, obviously, you won’t have, you know, planning time. I was actually speaking to one of the SREs at Salesforce, really great guy, and he was talking about how, yes, they have a set schedule, and it’s just, okay, in the days leading up to site-switch we do checklist, checklist, checklist, site-switch, go over the documentation, post mortem.

 Really efficient system so that when something bad happens, it goes much more quickly. You definitely want to be monitoring for unusual traffic, which clearly you want to do anyway but a lot of the horror stories that came in happened on software where the spikes in traffic were unusual, and no one really noticed because it was sort of… basically DDoS attacks were happening, and no one noticed, because they always had spikes. So these sorts of things are just things you want to write in to be looking for. And what do you do if you see unusual traffic? And how quickly can you respond to that?
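One way to notice unusual traffic on a product that always has spikes is to compare today’s peak against the peaks you normally see, rather than against average traffic. The Python sketch below does that with a median-based check. The numbers, the threshold and the `is_unusual` helper are all illustrative assumptions, not a recommendation for any particular monitoring tool.

```python
from statistics import median


def is_unusual(past_peaks, current_peak, threshold=3.0):
    """Flag a peak that sits far outside the usual range of past peaks.

    Uses the median and the median absolute deviation (MAD), so a
    product that spikes every day does not raise an alarm every day.
    """
    typical = median(past_peaks)
    spread = median([abs(p - typical) for p in past_peaks])
    if spread == 0:
        # All past peaks were identical; any difference is unusual.
        return current_peak != typical
    return abs(current_peak - typical) / spread > threshold


# Requests per minute at the busiest point of each recent day (made-up data).
daily_peaks = [400, 380, 420, 390, 410]

print(is_unusual(daily_peaks, 415))   # an ordinary spike
print(is_unusual(daily_peaks, 2000))  # far beyond the usual spikes
```

Whatever check you use, the plan should also say who gets told when it fires and what they do next.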

Another thing that came up that had not occurred to me was your version control, right? Do you want to write into your plan a set schedule for pulling from your repos, for certain people, certain repositories, at a certain schedule? Why do I say that? Several people shared issues where they were having… and this was Bitbucket, GitHub, GitLab, like it wasn’t one organization, but they would have like a day of random outages. And okay, I guess we’re all slowing down.

 Now, if that’s not a big deal, you don’t need to write that in. But if you are constantly interacting and pushing code, if you’re, you know, needing to be able to interact with your version control system constantly, you might want to think about, okay, you’re the code owner for this repository, please just pull every morning or whatever. Again, not always necessary, but something to consider, because things can go wrong with your version control.

 Database, I saved this one for last on this section because oh my God, for my money, the worst possible thing that can happen is data loss. Right? Like once it’s gone, it’s gone. Which is why you constantly have backups and briefings and blah, blah, blah. But data loss is the like, for me the biggest thing because as a user, I mean, think about when you pull up your Amazon app, right?

As a user, if I open that app, and nothing happens, it’s out, there’s a problem. I’m mildly irritated. It’ll be back in a few minutes. I trust them to get back up. Cool. However, if it works beautifully, but I can’t see anything I’ve ordered for the last year, now I’m angry, right? So yeah, we started talking a little bit about a fully mirrored recovery site. This is very important. And you do want to think about, when you have a mirrored site, or when you have a problem, you do want to write in how you are going to handle data syncing, right? So for example, if you’re doing a site switch practice run, you want to think about when are you going to cut users’ ability to create more data or interact with data?

When are you going to cut that off so that you can sync before the switch, right? You definitely don’t want anything lost during the switch. In an actual emergency, how do you handle cutting off data so that you can get to the fully mirrored site? These are things that you definitely want to write in. How are we going to handle this? Right? We don’t want our users interacting with the old one that’s broken while we’re switching.

 Write that in. Think about it.

Yeah, so there were definitely a lot of horror stories about data loss and I mean, there’s some things you can plan for and some things you can’t, right? I think it was a few years ago, a utility worker cut a line outside of one of Amazon’s big data centers. Well, guess what? That’s a problem. You can’t plan for that. You can’t plan for, you know, if a butterfly flaps its wings in Milwaukee, how’s that going to affect my software? But what you can say is, okay, let’s assume that any one of our data centers could be hit with a problem.

Do we have widely available data? Are we constantly backing up and resyncing? Is our recovery site ready to go at all times? As close to all times as possible. These are things you can control. And more importantly, does everyone on the team know what it looks like when disaster strikes to get to that site? And then be working on that. So yeah, so that is, I mean, really the meat and potatoes of what you want to have if you’re just doing straight product disaster recovery.

 But if you would like to be a value add or you’re working at a smaller company, we’re going to just briefly talk about a couple of things. One, okay, so I wrote the disaster recovery plan in like January 2020.

Now, I’m a nerd who’d been following Coronavirus on BBC World News since like, Thanksgiving. So I kind of felt like something was coming. I didn’t foresee still being remote in 2021 but I felt like something was coming.

 This was a bigger deal and also something that got me a little bit mocked for being an alarmist but what happens if you can’t work in the office, right?

 Thanks, pandemic. What does that look like? So there are a couple of options, right? So let’s say you can’t work in the office because of a pandemic. Well, hopefully, we won’t go through another one of these again. But it might be something to write into the plan. What does it look like to transition from most people in the office or a flex sort of arrangement to everyone’s 100% virtual?

How do you make sure that all employees have everything they need to be productive from home? That’s something to consider. But if you can’t work from the office because of a waterline break or fire damage or it’s being fumigated or any of the things that come with owning a building, do you want to consider a backup location? So for a company like Salesforce, that doesn’t really make sense.

 There’s so many employees when we’re in the office that… but if you’re in a start up, what does the backup location look like? Does it look like renting out a shared office space? If you’re a very small team, does that look like, you know, maybe the 10 of you going to one person’s home? Do you need to be in person ever? These are things that you might want to consider moving forward, especially since I’m sorry, like, we’ve been through this pandemic, that we were warned about, you know, seven years ago. Now it’s happened. It doesn’t, I mean… I hope it doesn’t happen again in my lifetime, but it could, we might want to think about it. Safety issues. Now, this is the one where I get called an alarmist a lot. But it is something to think about.

 I live in the United States and I have school children. And so I do think about active shooter situations. If you’re working in an office, especially one that doesn’t have strict security, my last office was in, like a sort of a live-work-play sort of space, where really anyone could have walked in anytime. What do you do? This is something to think about, you know, obviously, there’s a million different things to do, you know, mostly hide, run. Worst case scenario fight, but these are things to think about. Hey, if we suspect that this is going on in the building, where do we hide? Who’s responsible for checking on everyone?

 Who’s responsible for calling the police if they can? Like, all these things you might want to consider. Again, depending on, you know, where you’re working, if you work, you know, if you’re doing software for the CDC, well, guess what? Bomb threats are a thing.

Another thing that’s probably never going to happen but you might want to think about it before it does. Fire. Do you have a designated fire escape plan? Have you practiced it? And I know that sounds cheesy to do a fire drill as a grown-up, I’m not even encouraging you to do duck and cover under your desk. But I am saying in the event of a building fire, does everyone know where the emergency exit is to get out of the building as quickly as possible?

 Does everyone in the office know a designated meeting spot so that you can do a head count and make sure everyone got out safely? Something to think about. And then much more likely than those last depressing ones. Human issues, right? So let’s talk about mass illness, especially on a small team or a team where you have code owners who are just subject matter experts and maybe not so many other people who are, you know, able to complete their tasks.

What happens in a pandemic? What happens if half the team is sick? Do you have coverage for that? Do you have people who are sort of cross training with other people to pick up those skills? Something to consider. What if everyone cashes in PTO at the same time? Now obviously, you know, if you have the kind of company where you have to get approval for PTO, that can be mitigated, but if you work in a company where it’s unlimited PTO, and you’re a grown-up, go when you go, well, what happens if there’s a problem and half the team is on vacation? Is there a protocol in place? Does everyone know what’s expected of them? Something to consider. And then extended leaves. Now this one, as someone who went through, you know, a pregnancy while working at my last company, is something to be considered. You know, people get in car wrecks, people have babies, people have to take care of older family members, people have any number of reasons they might need an extended leave.

So look at your team critically. Is there a person or people that, if they were to be gone unexpectedly for three months, would leave a gap? How would your team adapt? How would you handle that? What protocols are in place? Are you cross-training people? Now with something like a maternity leave, or paternity leave, you should know, hey, this is coming up, although we’ve all seen that show “I didn’t know I was pregnant”. It happens, people, it happens. But you should be able to sort of plan for that and say, okay, you know, this gentleman is going to be gone on parental leave to meet his new baby, yay.

Who do we need to train to make sure that while he’s gone, we’ve got adequate coverage? Just something to think about. So this, this is random advice. This is basically just things that came up during that survey that people said that resonated with me, right? Things that I felt like were worth sharing, again, just, you know, in a random place. So, not all dependencies are equal as it turns out. There are things that go wrong and things that break and processes that don’t work that are absolutely mission critical, you cannot function without them. Then there’s stuff that can wait a day. And then there’s stuff that can take a week, whatever. Tier out these potential problems, like, how critical would this be, right?

 How big of a deal? Let’s build that into the plan, right? So if you’re looking at certain scenarios, and you’re triaging, okay, this is a tier one, this matters first, we will address this later, this matters now. Think about that. This is a really big one, especially for someone like me with complex PTSD. Not all disasters are equally likely, right? So, you know, I talked about an issue with, you know, potential data loss and then I talked about an active shooter situation, well, guess what? One of those is far more likely than the other even living in the States. So, you know, yes, you might throw a nod to some, you know, sort of outlying cases, some sort of niche things that could go wrong, but know that you might not want to put as much time into that.
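To make the tiering and the likelihood point concrete, here is a small Python sketch that orders scenarios by tier first and then by how likely they are. The scenarios, tiers and likelihood figures are placeholders invented for the example; you would substitute your own judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    tier: int          # 1 = mission critical, 2 = can wait a day, 3 = can wait a week
    likelihood: float  # rough guess at the yearly chance, not a measurement


def planning_order(scenarios):
    """Most critical tier first; within a tier, most likely first."""
    return sorted(scenarios, key=lambda s: (s.tier, -s.likelihood))


scenarios = [
    Scenario("internal wiki outage", tier=3, likelihood=0.30),
    Scenario("building fire", tier=1, likelihood=0.01),
    Scenario("data loss", tier=1, likelihood=0.20),
    Scenario("primary site outage", tier=1, likelihood=0.50),
    Scenario("version control outage", tier=2, likelihood=0.40),
]

for s in planning_order(scenarios):
    print(f"tier {s.tier}  {s.likelihood:>4.0%}  {s.name}")
```

The unlikely tier one scenario still makes the list, it just sits below the ones that deserve most of your planning time.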

 Let’s put more time into planning site-switches, let’s put more time into creating a schedule for updating and maintaining your mirrored site, right? Not all things are equally likely. And then I mentioned this at the beginning and I’m just going to, like drill this again. Testing, testing matters, right? So you have these plans and maybe it’s once a year a fire drill, or anytime there’s, you know, a few new employees, a fire drill. And then, with site switching, maybe it’s once a month, twice a month, depending on your software, and how critical it is your users have access 24 hours a day, whatever makes sense, whatever potential disaster you are planning for, whatever, you have to test it, and it has to be a set interval, and everyone has to be in on testing it.

I like testing the plan with unlikely hurdles. And this is crazy pants, but we did sort of a drill at my old company when I was very new and, you know, I’d just started and they wanted to basically find out what happens. The software I was working on at the time was sort of intimately linked to natural disasters so we definitely had spikes in usage. And leading into hurricane season, we wanted to look at what does a user spike look like with this product? Like, can we handle it? And so we had a fake hurricane. And that was neat, and they had me, who knew nothing about the software, like, play the role of insurance adjuster, and I just tried to be as dim as possible and do stupid things and then be indignant about it because I’ve worked in customer service, and I know how that works. But yeah, like, test the plan with some weird things. Like maybe, especially if you’re on a smaller team, maybe the person who’s best at site-switching sits out, lets someone who’s not very experienced run this one, and watches. Almost like an apprenticeship, right?

Throw in some little problems to really test your plan. Hamstring yourself a little bit, and then test it again. This was, I mean, overwhelmingly, I think everyone, almost to a one, that responded mentioned something about either we test super frequently, or because we don’t test enough, this is what happened. You know, I liken it to, you know, I used to do protection work, and I had to do a shooting certification and my shooting instructor would make us, you know, draw weapons super slowly, because slow is smooth and smooth is fast. So going through and testing and testing, slowly and in controlled situations, you get smooth at the process so that when you have to move quickly, you can. And that’s it. Doom and gloom. I hate to be Debbie Downer, but things go wrong. As it turns out, things go wrong and I don’t know about you all, but the more prepared I feel, the more aware I am of what could go wrong and what to do when it does, and that is very comforting to me. So I hope that you’ve gotten just a little tiny taste of the kinds of things you can write into your DR plan.

I will say, I believe IBM has a sort of free template available for DR planning that was really well done. So you might want to look into that. But yeah, just in general, I hope that you feel like you’ve got a good starting point for how to start writing a disaster recovery plan that will add value to your team and your company and further your career. As always, feel free to reach out to me. I love connecting with the community and I’d love your feedback.

 Thank you so much for coming to SwampUP.

 It’s my favorite conference I’ve never been to in real life.

 And I hope that you have enjoyed it as much as I have.

 Thanks.
