AIOps and You – Faster Deployments, Safer Pipelines, Happier People [swampUP 2021]

Heath Newburn, Senior Solution Specialist, AIOps

June 30, 2021

< 1 min read

In this session, you’ll learn how AIOps creates Actionable Intelligence and how to drive the Action in Actionable Intelligence. Get started with your instance today.

You’ll see how to increase the speed and safety of your pipelines, how automation will help you focus on core business and not chore business, and how we can create a better digital experience for employees and your customers.

Video Transcript

Howdy. My name is Heath Newburn and I’ll be your sherpa, guide, guru, chaiwala, or waterboy…
whatever you need
today as we talk about AIOps, and what it can do for you.
This talk is pre-recorded, but I’ll be available in chat,
you can always find me on Twitter and LinkedIn as well. So
grab your favorite beverage, get a snack, get comfortable
and let’s chat about how AI and automation can help you and your pipelines.
So we’ll cover a handful of things today:
why AIOps matters,
what it is and what it can do for you,
delivering on the promise of actionable intelligence, and what’s next.
So hopefully, you’ll leave with some ideas you can take back on creating value for your teams,
and getting some hours back in your week.
Over the last 10 or 15 years, the technology landscape has changed dramatically.
Even think back 20 years to the dot-com boom
and the rise of Amazon, eBay, PayPal and the other firms that are still with us,
those companies are built on monolithic applications running on very expensive hardware
and proprietary databases in private data centers.
The last few decades have brought us cloud computing with infrastructure as a service,
which replaced bare metal servers with nearly limitless virtual servers,
but also gave us the decomposition of monolithic applications into distributed micro services
and components with the rise of other pieces of cloud.
Think about Amazon in the late 1990s, which ran on a single application called Obidos,
as compared to today, where its homepage alone comprises thousands of different pieces.
Along the way, the rise of agile, lean and DevOps methodologies
means that each component is being delivered faster, sometimes many times a day.
That complexity in your applications arises on two dimensions:
the sheer number of individual components with intricate dependencies between them,
and the sheer pace of change
not only with new revisions,
but ephemeral components and an ever-changing dependency tree.
No wonder IT organizations whose processes
depend on a relatively static view of the world in a traditional data center
are under incredible strain.
This strain is exacerbated by the tooling required to keep up with the changes
and the rise of monitoring, APM, logging and other observability tools.
70% of IT organizations rely on up to 9 different monitoring tools to support modern applications.
Larger organizations could have 30 plus tools.
Keep in mind this is the situation before they even started their digital transformation.
According to the same survey, on average,
47% experienced more than 50,000 alerts per month,
with our largest customers dealing with millions per day.
That’s a lot to deal with.
So how are you doing?
How are those of us on the ground handling all of this?
Well, some are better than others. But I think we can all agree this last year has been a long haul.
As I’ve talked to a lot of our peers in IT, I kind of feel like we all need a group hug.
Tech Twitter and bourbon have made for some not-recommended but workable
coping mechanisms.
I’ve seen a lot of teams go through massive turnovers.
The companies that looked to take market share tried to grow fast
and push their IT teams harder than ever before.
The companies that hunkered down cut teams to the bone to reduce cost,
putting much greater pressure on existing staff
and a lot of angst and worry on the folks who are now looking for work.
Those companies that languished in between in many cases fizzled out
and closed up shop, or are down to skeleton crews holding things together with baling wire and hope.
I’m still not sure what right looks like for any of our teams, but
I know the effects on those I’ve talked to have been rough.
And it’s not just your stories or mine.
A recent journal article showed that 73% of our tech industry peers felt burned out.
And 80% have seen a significant increase in workload since the start of the pandemic,
nearly two thirds are working an additional 10 hours a week resolving incidents.
And there’s no letup on the horizon. 79% believe that digital acceleration
is their organization’s top priority for 2021.
Harvard Business Review did a similar study and showed that burnout isn’t just bad for morale,
it’s hurting organizations’ bottom lines.
It’s going to cost $30,000 or more in hiring to replace a good engineer
and the effects add up.
Lower engagement leads to less productivity along with higher turnover and healthcare costs.
The combined effects are huge, an estimated $190 billion.
It had been bad before the pandemic but this year has shown us
we have to find ways to change, we can’t keep doing all the heavy lifting
with our smart people.
We can’t keep being the superheroes.
I’m passionate about this, passionate about helping IT teams be successful,
and I think one of the reasons I have that passion about this particular space
is that AI and automation, or AIOps, can help.
The trouble with the term AIOps is that it can just be a marketing term,
so it’s open to interpretation,
just like DevOps. In spite of what many folks say,
there is no one AIOps tool. It’s a conglomeration of capabilities that create value.
So you can go read the various definitions from Gartner, Forbes, or 451 Research;
they all say the same things, but differently.
Oh, fun fact,
and a good way to win a bar bet in a tech town.
AIOps was originally defined as algorithmic IT operations,
not artificial intelligence. So have fun with that.
But even with that, what is the meaning of AI, after all?
Is this about constructing massive data lakes of structured and unstructured data for analysis?
Is it HAL 9000?
Will the pod bay doors not open if I screw up my streaming log files?
Is it Alexa or Siri for my hybrid ops?
It’s really easy to get carried away by the hype
and thinking that AI is some kind of magic bullet
that will solve all of IT’s operational problems in this world of enormous complexity.
Obviously, it won’t,
So what can it do today?
Well, there are four real-world problems that AIOps is solving for customers:
delivering millions of dollars of savings via reduction in mean time to resolution, or MTTR,
reducing the number of incidents to be worked,
or eliminating the required person-hours in incident resolution altogether,
so helping achieve every CIO’s dream of delivering growth and stability to the business.
These jobs to be done are like climbing a hill.
Today that includes separating the signal from the noise in telemetry,
identifying the best person or team to address an issue based on past behaviors,
and associating that with the right service,
using self-service for guided automation, or just executing automation, to improve MTTR,
increasing team effectiveness and removing toil.
And finally, overall moving an IT operations organization from manual reactive processes
to predictive proactive ones.
These tasks are present in this particular order because it is a journey,
and our most successful customers address them roughly in this order.
It’s really hard to invest in automation for example,
if you’re overwhelmed by noise.
So this is where PagerDuty’s AIOps solution can help.
We start with best practices for filtering noise, removing duplicate alerts and common problems such as
port flapping.
We also include the ability to pause transient alerts,
knowing that certain alerts are going to resolve themselves, and preventing notifications for problems
that will self-resolve, reducing alert fatigue.
After the initial deduplication, machine learning
on your incident data is used to identify similarities between incidents,
automatically grouping multiple alerts together in real time
based on incident content as well as time proximity.
This can help eliminate duplicate work among teams as we identify and help resolve concurrent incidents.
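To make that concrete, here is a minimal sketch of content-and-time based alert grouping. Everything in it (the `Alert` shape, the similarity threshold, the five-minute window) is a hypothetical illustration, not PagerDuty’s actual algorithm, which uses learned models over real incident data.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher

@dataclass
class Alert:
    service: str
    summary: str
    timestamp: float  # seconds since epoch

@dataclass
class IncidentGroup:
    alerts: list = field(default_factory=list)

def similar(a: str, b: str, threshold: float = 0.7) -> bool:
    # Crude text similarity; a real system learns this from incident data.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def group_alerts(alerts, window_s: float = 300.0):
    """Attach each alert to an existing group when it arrives within
    `window_s` seconds of that group's latest alert, hits the same
    service, and has a similar summary; otherwise start a new group."""
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for g in groups:
            last = g.alerts[-1]
            if (alert.timestamp - last.timestamp <= window_s
                    and alert.service == last.service
                    and similar(alert.summary, last.summary)):
                g.alerts.append(alert)
                break
        else:
            groups.append(IncidentGroup([alert]))
    return groups
```

With this toy rule, two "disk usage high" alerts on the same database service a minute apart collapse into one group, while an unrelated web alert stays its own incident, so responders see two items instead of a page of raw alerts.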
We also use this data to compare against the historical record
to suggest actions and automation
that previous teams have taken to solve incidents of a similar type. In other words,
we’re using machine learning not only on the data generated by the systems,
but also by the humans who responded to the problem.
We can subscribe to change events.
And I know this is hard to believe for anyone who has spent more than 30 seconds in ops, but
problems almost always come from change.
This helps us more quickly get to root cause.
By contextualizing this along with pipeline data,
we create context for where you are in your delivery,
and have the ability to do things like not send critical alerts for non-prod deployments.
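As a rough illustration of how change events and pipeline context can be put to work, here is a sketch with two hypothetical helpers: one surfaces recent change events on an incident’s service as root-cause candidates, the other suppresses paging outside production. The data shapes, field names, and the one-hour lookback are my assumptions, not PagerDuty’s API.

```python
from datetime import datetime, timedelta

def recent_changes(service, incident_time, change_events,
                   lookback=timedelta(hours=1)):
    """Change events on the same service shortly before the incident:
    the most likely root-cause candidates to show a responder first."""
    return [c for c in change_events
            if c["service"] == service
            and incident_time - lookback <= c["time"] <= incident_time]

def should_page(severity, environment):
    """Record every alert, but only page a human for severe alerts
    coming from the production stage of the pipeline."""
    return environment == "production" and severity in {"critical", "error"}
```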
I mentioned at the top that keeping track of dependencies and relationships in a rapidly evolving
complex microservices type environment is incredibly challenging.
It’s not uncommon for fast-moving engineering teams to create dependencies on other teams’ services
without those service owners even knowing it.
We get it. That’s what makes CI/CD work.
Again, we’re using machine learning to surface probable hidden dependencies between services.
So for example, if outages in service B tend to follow an outage in service A,
then over time we’ll learn that and suggest to the owner of service B that
their outage might actually be due to service A,
even if no explicit dependencies were declared.
If a user has associated these things before, our algorithms take that into account
and can make the association based on time, labels, etc., as well,
to help us get to the appropriate service.
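One simple way to approximate this kind of dependency inference is to count how often outages of one service begin shortly after outages of another. The sketch below is a toy co-occurrence heuristic with made-up thresholds (`window`, `min_support`, `min_conf`); the real product uses machine learning over much richer signals than outage timestamps alone.

```python
from collections import Counter

def suggest_dependencies(outages, window=600, min_support=3, min_conf=0.6):
    """outages: list of (service, start_time_seconds).
    Suggest 'B probably depends on A' when outages of B frequently
    begin within `window` seconds after an outage of A."""
    follows = Counter()  # (A, B) -> how often an outage of B followed A
    totals = Counter()   # B -> total number of B's outages
    for svc_b, t_b in outages:
        totals[svc_b] += 1
        for svc_a, t_a in outages:
            if svc_a != svc_b and 0 < t_b - t_a <= window:
                follows[(svc_a, svc_b)] += 1
    return [(a, b) for (a, b), n in follows.items()
            if n >= min_support and n / totals[b] >= min_conf]
```

Given three incidents where service B went down minutes after service A, this heuristic would surface ("A", "B") as a probable hidden dependency even though nobody ever declared it.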
This service context is critical.
We don’t want to have the same response for revenue-generating services as for back-office ones
that may not be mission critical.
This is one of the easiest ways to reduce alert fatigue,
and it helps teams focus on what really matters.
If it’s 3 AM on a Saturday and there’s an alert on the HR service,
everybody already got paid.
We know it isn’t critical until Monday.
So why should we wake anybody up?
With our platform analytics capabilities, you can leverage machine learning to also help you get to what’s next.
Our analytics lab is going to extract insights from PagerDuty’s deep data set
for personalized answers to key questions.
Some of these might be: what is the cost of an incident?
Which incidents most affected resolution time?
Which responders were most impacted?
I can get one-button calls to action, with intelligent recommendations from machine learning
that suggest: how can I reduce noise? How can I improve team efficiency?
How should I improve my scheduling?
And with this easily generated data, I can bring it to other teams
and show them the reasons I’ve come to these conclusions.
We’ve codified a maturity model into our advanced analytics.
It benchmarks where a business is at on their digital journey,
and also shows how maturity can improve with specific recommendations
based on our 12 years of experience serving more than 13,000 customers and nearly 600,000 users.
An Analytics API enables ubiquitous access to detailed incident data, so you can leverage your own BI tooling
and data experts to extract new insights.
These capabilities for event intelligence and analytics create new views into actionable intelligence,
which leads to better pipeline management.
There are a lot of tools and platforms that talk about filtering noise or finding root cause.
It turns out that root cause analysis is really hard,
and the industry has been trying to solve it for 30 years.
Companies like IBM, CA, HP, BMC, and even Microsoft took a stab at it for a while before moving on.
Now companies like BigPanda, Moogsoft, and many others,
well, along with almost all the observability APM vendors
are trying to tackle event management and root cause analysis.
The combination of this centralized event management
with PagerDuty’s decentralized approach may yield great results.
There are a lot of ways to get there,
and as we demonstrated, the combination of smart people and analytics
can help us get to that actionable intelligence, but it begs the question,
where’s the action?
Who does it?
And why do they keep doing it again and again?
And why does it always seem like it’s me that’s having to go off and take care of it?
Self Service Automation allows subject matter experts to focus
on their job of delivering value to the organization
and empowers L1 and L2 teams to keep the business running,
while eliminating toil wherever we can.
This is why integrated automation is a key to successful AI ops.
Noise reduction is great, but at a certain point there’s still an incident to be handled,
and if you can automatically respond before even alerting a human,
so much the better.
It’s about weaving automation into the three areas.
First, before humans are even alerted.
In order to automatically fix known problems and avoid needlessly waking anyone up,
it’s a great place to start.
Second, if we do need to alert someone
to enable the first line responders to run automatic diagnostics or gather information
even if they’re not the subject matter expert
so that that responder can be more effective.
And third, to give predefined actions to any responder,
a toolbox of automation if you will,
to solve for the most common problems, gather more information, remove toil.
With self-service in place, first responders can really serve as shock absorbers for work.
They may not know the system they are supervising in intricate detail, but because
the experts have given them standard operating procedures,
those L1 and L2 teams have something to try before escalating,
or at the very least they can grab that diagnostic information
and the situation in a format that will make it useful for the engineer, even if they do get woken up.
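A self-service toolbox like this can be pictured as a registry of expert-authored actions that any responder is allowed to run. The sketch below is purely illustrative; the action names, registry shape, and return strings are invented for the example, not a PagerDuty or Rundeck interface.

```python
RUNBOOK = {}  # action name -> expert-authored callable

def action(name):
    """Register a predefined action that any first-line responder
    may run safely, without being the subject-matter expert."""
    def wrap(fn):
        RUNBOOK[name] = fn
        return fn
    return wrap

@action("collect-diagnostics")
def collect_diagnostics():
    # A real action would gather logs, metrics, and recent deploys.
    return "diagnostics bundle: logs=ok metrics=ok recent_deploys=2"

@action("restart-service")
def restart_service():
    # Standard operating procedure codified by the service's expert.
    return "service restarted"

def run(name):
    """Run a toolbox action by name; an unknown name means escalate."""
    if name not in RUNBOOK:
        raise KeyError(f"no such action: {name!r}; escalate to on-call")
    return RUNBOOK[name]()
```

The point of the registry is the boundary it draws: L1 and L2 responders choose from vetted actions, while the experts who wrote them stay out of the 3 AM page unless everything in the toolbox has been tried.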
We’re not just doing this for motherhood and apple pie or queen and country,
there are real business benefits to this.
A report from Capgemini showed that
investments in automation lead to increased revenue for 75% of companies,
increased overall profitability for 76% of them,
and 86% of them reported that it helped improve the customer experience
leading to happier customers.
Prevention, diagnosis and resolution.
These are the three places automation helps the most.
The whole objective of incorporating AIOps functionality with automation is fundamentally to drive down
the time needed for the different phases of the incident response lifecycle,
and to reduce the volume of incidents as well.
With just basic noise reduction,
and the other incident response features in pager duty,
it’s primarily about reducing the detect and mobilize phases.
With automation many more of the phases,
including the diagnosis of what’s actually wrong and getting to a fix can be improved.
The lessons learned and knowledge gained in these automated responses
make it much simpler to enhance processes,
and ensure a more rapid response for the future.
We have several clients that now drive MTTR down to seconds for common incidents
with this automation in place.
So what does this mean for our pipelines? Well,
combining all these capabilities with JFrog,
you can monitor your SDLC and get new insights into the progress
of your software as it transitions across each stage of your pipeline.
Integrating the metadata at each stage,
and sending events to PagerDuty, allows you to understand the status and details
so you know which teams need to engage to keep things running smoothly.
Those teams get at-a-glance context for which features, packages, versions, commits,
dependencies, issues and environments are involved
to allow them to solve problems faster.
With PagerDuty automation,
and the addition of F5 NGINX,
you can now monitor and control
canary deployments of your software through the final stage of your deployment,
where your software goes to production.
You can create context for automated rollbacks,
config changes or shift to blue, green or full production deployments,
creating faster pipelines with greater resiliency.
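In its simplest form, the rollback-versus-promote decision for a canary comes down to comparing the canary’s error rate against the stable baseline with some tolerance. This sketch is a hypothetical decision rule with made-up thresholds, not the actual PagerDuty/NGINX integration logic.

```python
def canary_decision(baseline_error_rate, canary_error_rate,
                    tolerance=0.005, promote_margin=0.001):
    """Compare the canary's error rate against the stable baseline:
    clearly worse -> roll back; at least as good -> promote toward
    blue/green or full production; in between -> hold and keep watching."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return "rollback"
    if canary_error_rate <= baseline_error_rate + promote_margin:
        return "promote"
    return "hold"
```

Wiring a rule like this to deployment events is what turns "greater resiliency" into a mechanism: a bad canary triggers an automated rollback before a human ever sees a page.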
So where are we headed?
We’re nowhere near the end,
but it may be the end of the beginning of an AI Ops journey.
It’s clear from many customer conversations that what they’re looking for in AI Ops,
in this era of complexity is ease of use,
simplification, ease of getting started,
automatic dependency mapping, automatic root cause analysis,
more out of the box use cases and so on.
AI Ops is going to allow us to identify how to better tune signals coming into the environment,
how to deliver automation to avoid incidents or at a minimum expedite triage.
We’re building better capabilities for marshaling the right teams based on service insights
and avoiding duplicate work
and faster resolution with guided self service automation,
all of which creates better post incident analysis.
Because PagerDuty sits at the nexus of all these signals from across an enormous assortment of domains,
with more than 500 predefined integrations,
we’re well positioned to help you accomplish these tasks, again,
in an easy to use, easy to get started manner
with no data scientists or complicated model training required.
We have a unique technical approach to solving your biggest challenges.
And here are four ways that we think set us apart.
Our full service-driven model, versus a team-based one, ensures a culture of
complete ownership and accountability over every aspect of your services.
It translates into an automated real time response for teams to know exactly what to do,
and who to engage to accelerate incident triage.
Our service directory is the heart of service ownership.
We help you keep it up to date with machine learning;
services are built to outlast organizational or team changes,
which helps maintain full accountability.
Our platform is open and flexible.
It integrates into your existing environment and works anywhere your business runs.
Our data structure powers noise reduction
and enables best in class machine learning algorithms to work
effectively to serve as a foundation for all of our AI Ops capabilities.
It is made for scale.
We’re a market leader, market innovator, and market founder
in this space, and it’s reliable and secure.
We’re trusted by more than 13,000 customers globally and counting.
We have more than 500 integrations out of the box to maximize your existing capabilities,
and it’s easy to get started with pager duty.
Some of our customers have seen payback in their investment in as little as three months of ownership.
We cover the full spectrum of real time work from detection, resolution,
and continuous learning and improvement.
So real time operations can happen where you are.
I’ll leave you with one last thought.
We’re going to guide you on how to optimize AI and automation
with the lessons we’ve gathered from incident data over the last 12 years and these nearly 14,000 customers.
We are invested in your success
and we’ll guide you on every step of the way,
with our Rundeck and PagerDuty communities,
our customer success and professional services organizations,
and the breadth of our best practices and learning resources.
This partnership will give you insights and actions
from how we have leveraged AI Ops within our own company
that we have built in the cloud and optimized for the cloud.
We’re ready to help you on your way.
Thanks for sticking with me.
I appreciate you
and you are awesome.
I hope this inspires you to find some ways to look at how to help your own team
and maybe get yourself back an hour or two this week.
Next time I see you,
drinks are on me.
See you soon. Bye.