Playing the Observability Game

Chris Engelbert
Senior Developer Advocate
As developers, we can remember the time when Nagios was state-of-the-art technology.
We hated looking at all the numbers that seemed disconnected from our reality.
The world has changed, though, and Observability provides us with a new Swiss Army knife in our toolbox.
Used correctly, it helps to improve reliability, brings additional focus on what matters, the business logic, and offers aid in case of problems or failures.
Especially in time-critical situations, a distributed system with many service dependencies can be hard to analyze.
In this session, you will learn how to use Observability to assist developers instead of distracting them.

Video Transcript

Hello everyone, thank you for joining my session. What we’re going to talk about today is the observability game we’re playing. Observability is commonly associated with DevOps and operations, not with developers.

 But today, we are going to specifically look at things that are specific to developers and how observability as developers [inaudible], right?

 So, before we start, I think my favorite analogy for observability is an airplane.

 So this is the old version of the 747, like the first version, from somewhere in, I think, the late 60s, maybe early 70s.

 And what we see is a lot of different instruments, a lot of meters, a lot of knobs, a lot of information to look at, and to try to understand what is going on.

 Commonly, when we looked at airplanes at that point in time, it was like three people flying it, basically.

 It was the pilot, the copilot, the ones actually flying and then the third one would sit all the way in the back close to where the picture was taken looking at all the different meters and information and trying to make sense out of it.

 As far as I know, this position was called the flight engineer.

 And I mean, if you look at the sheer amount of information to grasp, that certainly makes sense to me, right?

 It was really hard. It was a single person only taking care of, is there enough fuel?

 Do we lose fuel faster than we wanted?

 Is [inaudible] pressure okay?

 Is… whatever, right? And it was his job to say, hey, captain, copilot, first officer, something’s wrong and we need to fix that.

 So, as developers, I think we kind of know the same problem, right?

 I mean, flying from the outside looks simple.

 It’s like our services. We have an API gateway, we have endpoints, and we have simple services, stuff that just responds to what the user wants.

 But as engineers, and as somebody who has implemented a more complex system, we understand it’s not actually true, right?

 It looks more like that.

 Systems are interconnected, services call each other, they go down to the same services, for orders, for addresses, for contact details, for all that kind of stuff.

 The reality is always a little bit more complex. It’s like…

 like I said, it’s like the plane, right?

 Flying looks simple.

 Especially these days, and we’re gonna see that in a second, but the reality, at least, was very different. It probably still is. I mean, there’s probably still a good reason why we need a pilot license, right?

 So anyway, who am I to talk about that?

 My name is Chris. I’m a senior developer advocate at Instana.

 We create at least one observability solution.

 I’m a little bit biased, I would say we create the best one but that’s up to you.

 I would totally recommend trying it out.

 I’m coming from an extremely strong engineering background. I have plenty of years of software engineering experience; I just stopped counting after 10, because at some point you just start to feel old.

 Same goes for all the different programming languages.

 I love learning new stuff, I love trying new stuff out, but just listing all the programming languages, makes you feel really weird.

 But there’s one more important fact, for people that don’t know me yet, I’m German.

 And as a German person, at least for most of us, there is one specific technical fact which is very important.

 We love beer, and that is true for me too.

 So, where are we coming from, right? I mean, in the past, as engineers, at least that is true for me, I hated looking at dashboards.

 They gave me a lot of different graphs, they gave me a lot of different information but when I had an issue, it was certainly not the information that I was looking for.

 It was CPU calls, it was system calls, it was paging information, it was network traffic.

 It may have given me some hints, but it never actually helped me solve the issue.

 So what did I do? Well, I guess the same as most engineers: I kind of neglected all kinds of monitoring.

 Like, yeah, that’s operational stuff but it’s not helpful to me, right?

 It didn’t give me any helpful information.

 So as engineers, and I have to admit I created a fair share of that, we resorted to something else that we love to do, because it gave us information, and that was massive log files.

 I think everyone knows the point in time where operations just comes to you and tells you, oh, you have to really, really reduce the logs. We’re running out of disk space.

 But it took away the only thing from us which was actually helpful, at least most of the time, because we still had to figure out what was going on and when it was going on.

 And I mean, if you ever tried to search through gigabytes of logs, that’s a nightmare in itself.

 So people came up with this new thing, which they called observability, and DevOps, right?

 And the interesting thing about observability is, it actually just adds more data sources.

 So now we don’t only have log files and CPU metrics, or cloud metrics, we have performance information, we have the context, we have distributed traces, we have changes going on in the infrastructure, we have dependencies, we have so much stuff, and it just becomes more complicated to keep an overview and to figure out what actually is helpful.

 So people figured, okay, when we have all this information, we need to find a way to correlate it, and… maybe something can help us.

 So people started writing or creating observability tools, and Instana is one of those.

 An observability tool is kind of what the modern onboard computers for airplanes are, right?

 It takes all the information that the airplane collects and calculates, and tries to make sense out of it, tries to correlate it or, well, hopefully it doesn’t just try, hopefully it correlates it in the right way, right?

 So that is a modern…

 I actually think it’s the latest model; it’s at least a modern 747, and it only has two people because the flight engineer is, at least today, not a thing anymore.

 It probably is for bigger machines, but not for like, typical consumer airlines.

 But what happened to all the different monitoring information that we needed to look at? It actually moved into six monitors: there are two monitors for the pilot and two monitors for the copilot.

 And they, I think, may show the same information.

 Maybe they show information depending on who’s flying the airplane right now.

 And then there is like the center top screen which gives you information about thrust, I think altitude, cruise speed, stuff like that.

 And then there is like the lower center screen which is in this weird pinkish color and as far as I know, I’m not a flight engineer, right? As far as I know, this color stays and doesn’t change until something is very wrong.

 That means if you always have like, a little bit of a look on this screen, and it has the color, everything is good.

 The second this color is gone and it shows information, you know you want to look at it and you want to try to figure out what is going on.

 So that is basically what observability tools give you.

 They take all the different data sources, all the information, correlate them together, build the understanding, and help you.

 Still, most people think it only helps operational people like DevOps, operations, IT Ops, whatever you’re going to call those, but it can actually help us as developers.

 So as developers, I mean, what is our job?

 Well, rolling a dice and creating a method which says, well, that number was certainly rolled with a guarantee to be random.

 Obviously, it’s not.

 Part of our job and the way the engineering role changes is something people call high performing engineering.

 It means optimizing the processes in engineering, and observability can do a great job in optimizing that.

 So our business value as engineers, the business value we create, is, from my perspective, four things. The first is deployment frequency.

 Deployment frequency means how often do we deploy?

 Do we deploy once a day, once a month, maybe once a half year, right?

 Because that directly influences the lead time to change. So how long does a feature need from inception, from the first thought of it, to actually being deployed?

 The other thing is time to restore. If something goes wrong and it is not operational but it is part of the application that we built or the service we built, how long does it take to be fixed?

 How long does it take to be restored?

 And the fourth thing is the change failure rate.

 And that is important, because we can be extremely fast in deployment, in lead time to change, in time to restore, but what does that mean if every second or even every single release introduces some kind of error or some kind of failure, right?

 So those four metrics are basically the things that create business value.

 The time to features, the time to recovery, and having a very low error rate or failure rate.

 Failure rate, as I said, in terms of how often do we actually fail creating something.
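
 As a rough illustration of those four metrics, here is a minimal sketch that derives them from a list of deployment records. The `Deployment` type and all of its field names are hypothetical, made up for this example; a real system would pull this data from its CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    # All field names are hypothetical, chosen for this illustration.
    started: datetime                    # when work on the change began
    deployed: datetime                   # when the change reached production
    failed: bool = False                 # did this release introduce a failure?
    restored: Optional[datetime] = None  # when that failure was fixed, if it was


def delivery_metrics(deployments: list[Deployment], period_days: int) -> dict:
    """Compute the four metrics over a reporting period of `period_days` days."""
    if not deployments:
        raise ValueError("need at least one deployment")
    count = len(deployments)
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed - d.started for d in deployments]
    restore_times = [d.restored - d.deployed for d in failures if d.restored]
    return {
        # Deployment frequency: how often do we deploy?
        "deployments_per_day": count / period_days,
        # Lead time to change: from start of work to production.
        "mean_lead_time": sum(lead_times, timedelta()) / count,
        # Time to restore: how long until a failed release was fixed.
        "mean_time_to_restore": (
            sum(restore_times, timedelta()) / len(restore_times)
            if restore_times
            else None
        ),
        # Change failure rate: share of releases that introduced a failure.
        "change_failure_rate": len(failures) / count,
    }
```

 Keeping all four numbers side by side matters for the reason given above: a team can look very fast on the first three while the fourth shows that every second release breaks something.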

 So we’re coming back to this like dream reality thing.

 Our dream is, we have a user service, the user calls the user service, we may have some internal service taking care of all the database information, and when we come back, we go to some external CRM system, enriching the user information we have in our database with, I don’t know, like last orders or whatever.

 But in reality, something like that looks horrible, right?

 You have a lot of dependencies, a lot of services, you may have services that repeat themselves.

 In this example, the internal contact service and the internal address service are called a couple of times, because multiple objects in the hierarchy that we are returning have contact and address information.

 And obviously, every single microservice calls other microservices themselves.

 So we create a lot of dependencies.

 And dependencies are something which came with this boom of microservices.

 They were not new, we had CORBA, we had SOAP, but with microservices, these things basically hit every kind of engineering department.

 In the past, it was only like the big companies.

 Today, if you want to create something and you want to scale, you are going for microservices.

 So the cool thing about this information graph, like this dependency graph is that we can actually show it in a different perspective.

 We can show it as a timing schedule or as a timing diagram.

 Like, when was a sub-service or dependent service called?

 How long did it take to respond, and what other calls were made internally, right?

 As I said, again, the dream is simple.

 From our perspective, that is like a super simple thing and you may be able to implement something like that, but looking back at the graph that we had, this is the same call or the same request flow as a timing diagram.

 And here’s the first thing where we, as engineers, get some hints.

 And it’s great.

 You look at the request that came in from the outside.

 And you look at it and like, huh, so it took 900 milliseconds to respond.

 And we see that somewhere over here is the last request that was external and here is the first one after a massive gap, and if that is 900 milliseconds, this gap is probably like 450 milliseconds.

 So our first instinct as engineers is to figure that out: we see what the call before the gap was and what the call after it is, and we look at the source code of our own service and go, huh… why does it take 450 milliseconds?
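
 Spotting that gap in a timing diagram can be sketched in a few lines. This is a minimal illustration, assuming each call is available as a name with start and end offsets in milliseconds; the `Span` type and the example call names are hypothetical, and only the 900 and 450 millisecond figures come from the example above.

```python
from typing import NamedTuple


class Span(NamedTuple):
    # Hypothetical shape of one call in the timing diagram.
    name: str
    start_ms: int
    end_ms: int


def largest_gap(request: Span, children: list[Span]) -> tuple[int, int]:
    """Return (start_ms, end_ms) of the longest stretch of the request
    during which no child call was in flight."""
    best = (request.start_ms, request.start_ms)
    covered_until = request.start_ms
    for child in sorted(children, key=lambda s: s.start_ms):
        # Time between the end of everything seen so far and this call.
        if child.start_ms - covered_until > best[1] - best[0]:
            best = (covered_until, child.start_ms)
        covered_until = max(covered_until, child.end_ms)
    # Time between the last child call and the end of the request.
    if request.end_ms - covered_until > best[1] - best[0]:
        best = (covered_until, request.end_ms)
    return best


request = Span("GET /user", 0, 900)
children = [
    Span("auth", 10, 100),
    Span("contact-service", 100, 200),
    Span("external-crm", 650, 880),
]
gap_start, gap_end = largest_gap(request, children)
print(f"largest gap: {gap_end - gap_start} ms")  # prints "largest gap: 450 ms"
```

 The gap is time spent inside our own service, which is why the next step is to read our own source code rather than to blame a dependency.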

 And I mean, if you can speed that up, and that increases the conversion rate of your customers by, I don’t know, 40%, that was probably a low hanging fruit.

 And imagine what kind of a hero you are if you’re halving the response time, and it does increase the conversion rate by 40%.

 Or just let it be 20%, you’ll be the hero of the company.

 And there’s one more way we can show this and that is as some kind of a stack trace, a tree diagram.

 And the same call that we had before can be visualized as a tree diagram.

 And the interesting thing about the tree diagram is that it gives us a lot more information.

 It tells us how the calls were influenced, how they were called.

 In Instana, you can click on any single point and it gives you information like HTTP parameters, like, what are the gRPC requests?

 What is the database statement that was being executed?

 And stuff like that. And you don’t really have to do anything for that, that is all automatically collected and integrated and correlated for you.

 And Instana just gives you all this information.
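
 To make the tree view concrete, here is a minimal sketch that turns a flat list of calls into an indented tree. It assumes each call carries its own id and the id of its parent; the `Call` type and the sample data are hypothetical and do not reflect Instana’s actual data model.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Call(NamedTuple):
    # Hypothetical record: one call within a single request.
    call_id: str
    parent_id: Optional[str]  # None marks the entry call
    name: str
    duration_ms: int


def render_tree(calls: list[Call]) -> list[str]:
    """Return one indented line per call, children below their parent."""
    children = defaultdict(list)
    for call in calls:
        children[call.parent_id].append(call)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for call in children[parent_id]:
            lines.append(f"{'  ' * depth}{call.name} ({call.duration_ms} ms)")
            walk(call.call_id, depth + 1)

    walk(None, 0)
    return lines


calls = [
    Call("1", None, "GET /user", 900),
    Call("2", "1", "user-db", 40),
    Call("3", "1", "external-crm", 200),
    Call("4", "3", "SELECT last_orders", 120),
]
print("\n".join(render_tree(calls)))
```

 The parent links are what make the view useful: they show which call triggered which, not just when each one happened.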

 So obviously, we’re not the only ones, we’re not the first ones to have figured out that stuff like that is hard, right?

 And that this dream and reality don’t really fit. I mean, there were a couple of people that already figured it out and kind of visualized it, and I have to agree with them.

 It’s part of our job as engineers to have the front end and not only front end in terms of users, but also front end in terms of other services, other departments to look nice.

 It’s our job to take on the hard part and hide all the complexity.

 But there’s one really bad thing about being an engineer: we are in bed, it’s 3 AM, we’re sleeping tight, and we don’t really want to wake up, but something happened, and you are on call.

 And that night, somebody calls and you answer the phone, and the person tells you what’s going on.

 He or she tells you, you have to figure out how it works and what is wrong and fix it because you’re on call.

 Alright, you start looking at the stuff and there is one thing that really, really hinders you in quickly figuring out what is going on.

 And that’s the architectural complexity.

 And the more stuff we add, the more Kubernetes, the more databases, the more microservices, the more different frameworks we add, and the same goes for programming languages, the more architectural complexity we get, right?

 We have this trade off between simplicity and scalability.

 We still try to intersect them slightly, but they’re just shifting further and further apart from each other.

 But the cool thing is, we can actually use Instana as a developer being on call and we are looking into something which is not our service, we have no idea what’s going on.

 Or well, no, let’s put it better, right?

 It is our service but we figure out it is actually not our service failing, right?

 So we look at that service and it is 3 AM, we’re not fully awake yet, and you just try to figure out what’s going on.

 So it takes you about 40 to 50 to 60 minutes to figure out well, it’s not your service, but it is the one that is being called by you.

 So you reassign the ticket and call the next person like, hey, I’m sorry. I’m on call, I have this issue and it is your service, can you please help me fix it?

 So you woke up the second person: the second one who now hates operations, and the first one who hates you.

 So the second person is also not fully awake, right?

 They just woke up, look at their service, and about an hour later figure out: well, it’s not my service, it is that other service that is actually wrong. So now we are two hours in, we call the third person, and they’re all independent teams, right?

 They have no idea what the other one is doing. They just have like an API specification to call.

 So we’re two hours in, the third one figures it out in 3 minutes.

 It’s a fixing time of two and a half hours.

 So now, you have to think of that without the help of Instana, right?

 Or without the help of observability.

 Now we take the observability tool, we get the ticket, we open it up.

 And in the best case, it doesn’t even end up at your place because the person in charge in operations already figures out, oh, well, yeah.

 That is where the error hits the user but that is the service in this deck over here that actually makes all the requests fail.

 And in this case, it’s just caching right?

 So it probably shouldn’t even fail.

 And as I said, in the best case, it doesn’t even end up at your place.

 It is going straight to the other person, and that is maybe just 30 minutes instead of two and a half hours.

 What people call that, and I love this term, is the MTTGBTB.

 It’s the mean time to get back to bed.

 As I said, in the best case, it’s zero because you don’t even wake up.

 So what does that mean?

 We have this massive graph of information and dependencies that observability tools know and that they correlate, and we see some yellow and red dots in between and they give us some information that something is wrong, but nobody will understand that.

 The interesting thing is we can separate those things out and say, okay, please only show me what is important to my service, to my department, what do I call? Who calls me?

 If I want to drop a service, is there actually anything that needs to be… or can we drop it because nobody’s calling this old version of the service anymore, right?

 Stuff like that.
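
 The “what do I call, who calls me” question is a simple filter over the dependency graph. The following is a minimal sketch, assuming the graph is available as a set of observed (caller, callee) pairs; the function names and service names are hypothetical.

```python
def neighbours(edges: set[tuple[str, str]], service: str) -> dict[str, set[str]]:
    """Reduce the full graph of (caller, callee) pairs to one service's view."""
    return {
        "calls": {callee for caller, callee in edges if caller == service},
        "called_by": {caller for caller, callee in edges if callee == service},
    }


def safe_to_drop(edges: set[tuple[str, str]], service: str) -> bool:
    """A service nobody was observed calling is a candidate for removal."""
    return not neighbours(edges, service)["called_by"]


# Hypothetical edges, as they might be derived from traces.
edges = {
    ("api-gateway", "user-service-v2"),
    ("user-service-v2", "contact-service"),
    ("user-service-v2", "address-service"),
    ("nightly-batch", "user-service-v1"),
}
print(neighbours(edges, "user-service-v2"))
print(safe_to_drop(edges, "user-service-v1"))  # prints "False"
```

 Note that this only sees callers that actually showed up in the observed window, so a rarely run job could be missing; the result is a hint, not a guarantee.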

 There is something that is, from my perspective, a direct extension of behavior driven development. Everyone understands how unit testing works and how continuous deployment works these days, but what we want to do is observability driven development.

 We want to deploy something into staging, into development, into whatever you call these environments, and we want an observability tool that tries to understand: is it the same quality?

 Is there a high error rate? Is response time going up or down?

 In the best case, it’s going down.

 Are there any problems? Are there new dependencies or lost dependencies?

 As I said, figuring out if there’s actually somebody calling an old version of the service you want to drop.

 If there are only two or three people still calling you, you may want to contact those departments directly and say, hey, you guys really want to move away from the service.

 Stuff like that, stuff you normally don’t understand.
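
 Those checks after a deploy can be expressed as a comparison between two snapshots of the same service. This is a minimal sketch under stated assumptions: the `Snapshot` fields, the choice of p95 latency, and the 10% tolerance are all hypothetical, picked only to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    # Hypothetical health summary of one deployed version.
    error_rate: float             # share of failed requests, 0.0 to 1.0
    p95_latency_ms: float         # 95th percentile response time
    dependencies: frozenset[str]  # services this version was seen calling


def compare(baseline: Snapshot, candidate: Snapshot, tolerance: float = 0.10) -> list[str]:
    """List what changed for the worse, or structurally, in the candidate."""
    findings = []
    if candidate.error_rate > baseline.error_rate * (1 + tolerance):
        findings.append("error rate went up")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + tolerance):
        findings.append("response time went up")
    new = candidate.dependencies - baseline.dependencies
    lost = baseline.dependencies - candidate.dependencies
    if new:
        findings.append(f"new dependencies: {sorted(new)}")
    if lost:
        findings.append(f"lost dependencies: {sorted(lost)}")
    return findings


baseline = Snapshot(0.01, 200.0, frozenset({"user-db", "external-crm"}))
candidate = Snapshot(0.03, 205.0, frozenset({"user-db", "cache"}))
for finding in compare(baseline, candidate):
    print(finding)
```

 An empty result means the new version looks like the old one from the outside, which is the quick feedback a deploy cycle needs.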

 So what we need for observability driven development is quick feedback cycles.

 Deploy, figure out, deploy, figure out.

 We need health metrics, we need to understand if the error rate is higher, if the failure rate is higher, and we need something that two decades, maybe a decade ago, I never thought I’d ask for: feature toggles.

 Deploy, and if something goes wrong, tell operations: hey, this new feature toggle is in, and if that service is misbehaving, try to switch off this toggle first before calling anyone.
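
 A feature toggle in its simplest form is a named switch that guards the new code path. The sketch below is an in-memory illustration only; the `FeatureToggles` class, the toggle name, and the stubbed CRM lookup are hypothetical, and a real setup would keep toggle state somewhere operations can change it without a deploy.

```python
class FeatureToggles:
    """In-memory toggle store; unknown toggles count as switched off."""

    def __init__(self, defaults: dict[str, bool]):
        self._state = dict(defaults)

    def is_enabled(self, name: str) -> bool:
        return self._state.get(name, False)

    def switch_off(self, name: str) -> None:
        self._state[name] = False


def fetch_last_orders(user_id: int) -> list[str]:
    # Stand-in for the call to the external CRM system.
    return [f"order-for-{user_id}"]


def load_user(user_id: int, toggles: FeatureToggles) -> dict:
    user = {"id": user_id}
    # The new, riskier code path only runs while its toggle is on.
    if toggles.is_enabled("crm_enrichment"):
        user["last_orders"] = fetch_last_orders(user_id)
    return user


toggles = FeatureToggles({"crm_enrichment": True})
print(load_user(1, toggles))           # includes "last_orders"
toggles.switch_off("crm_enrichment")   # what operations would do first
print(load_user(1, toggles))           # prints "{'id': 1}"
```

 The point of the design is that switching the toggle off restores the old behavior without a rollback, so nobody has to be woken up for the first response.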

 And then get all the additional information like profiling, error stacks, stuff like that, get all the good things and try to figure out what happens.

 As I said, that is observability driven development. Unfortunately, we only had like 25 to 30 minutes, and our time is already up.

 So there is an E-book, which I’ve written a while ago, “The Observability for Developers E-book”, go to Instana.com/dev-observability, download it, feel free.

 I’m happy to talk about that on twitter, @NocturiasTK or in the chat later on.

 And there’s one important thing.

 If you take away one thing from this talk, don’t try to understand all of that yourself.

 Make use of the tools that we have these days and make sure that you understand how to use them and how they can help you.

 I know as engineers, we hated dashboards, but there are tools that are not just metrics, that actually help us.

 So try to be the modern engineer that tries to understand what is happening.

 So, if you have any questions, I’m happy to talk on Twitter @NocturiasTK, I’m happy to be in the chat for a while and answer questions.

 And with that, I want to thank everyone for listening in and cheers.
