Before we dive into the nitty-gritty of incident management, let’s look a bit closer at the actual meaning of ‘incident.’
In the world of IT service management, the official definition for ‘incident’ is an “unplanned interruption to an IT service or reduction in the quality of an IT service.” Whether that means a slowdown in response time or a total system crash, you’re looking at an incident.
Incidents tend to have a sort of domino effect, which is why they are so expensive to deal with. IBM’s 2022 Cost of a Data Breach Report puts the average cost of a data breach, one of the most damaging kinds of incident, at a jaw-dropping $4.35 million.
But incidents tend to do more than just financial damage. They also cause service disruptions, a lot of stress for your team, and a lot of headaches for the customer service people dealing with unhappy customers.
It’s fair to say you either want to avoid incidents altogether or learn to manage them well enough that they don’t drive everybody absolutely crazy when they do occur.
And that’s where incident management comes in.
What Is Incident Management and Why Is It Important?
Incident management refers to the process of dealing with unexpected issues or disruptions (known as ‘incidents’). Basically, if something breaks, incident management is the process of restoring normal service operations as quickly as possible.
Why is it important? For starters, service interruption is expensive. When websites go down, it costs companies thousands (or even millions) of dollars. Depending on the industry, this can have a cascading effect you will be feeling for months to come.
Let’s take the example of service level agreements (SLAs). CIO defines SLAs as “the level of service you expect from a vendor, laying out the metrics by which service is measured, as well as remedies or penalties should agreed-on service levels not be achieved.”
For example, availability and uptime are key parts of a website hosting SLA, so you might get a written promise of a minimum uptime of 99.99% in your contract.
If a major incident occurs, chances are you won’t be able to meet that service level agreement — which can mean fines, penalties, and loss of reputation.
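To get a feel for how little room a 99.99% commitment leaves, here is a minimal sketch in Python that converts an uptime percentage into a downtime budget. The function name and the 30-day and 365-day windows are assumptions made for illustration; a real SLA defines its own measurement period and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Return the minutes of downtime permitted over a period at a given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Compare a few common targets over an assumed 30-day and 365-day window.
    for target in (99.0, 99.9, 99.99):
        monthly = allowed_downtime_minutes(target, 30)
        yearly = allowed_downtime_minutes(target, 365)
        print(f"{target}% uptime: {monthly:.2f} min per 30 days, {yearly:.2f} min per year")
```

At 99.99%, the budget works out to roughly 4.3 minutes per 30 days, or about 52.6 minutes per year, so a single half-day outage would blow through it many times over.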
Having a plan in place to deal with issues will mitigate much of the damage, but not every company is ready. In fact, a survey by FRSecure found that only 45% of the companies polled had an incident response plan in place. The rest were, more or less, just hoping for the best with no backup plan.
Incident management also improves efficiency and team productivity and helps prioritize urgent incidents, plus it provides insights into recurring issues and their underlying causes.
So while incident management might seem like a plan created to respond to an emergency, it should actually be looked at as a strategic approach to improving service quality.
By adopting best practices in incident management, you can learn from each disruption and reduce the number of incidents over time.
Best Practices to Improve Incident Management
1. Define What “Incident” Means to Your Business
While the general definition of an incident might be an unplanned interruption to an IT service, what actually constitutes an ‘incident’ can vary greatly from business to business.
For an e-commerce company, this could be as simple as a website going down and preventing customers from making a purchase.
For British Airways in 2017, an outage meant over 1,000 flights grounded and 75,000 very unhappy stranded passengers.
Because the differences are so wide, it’s important to define what an ‘incident’ means within your industry. Atlassian recommends doing this by establishing KPIs (Key Performance Indicators) to “help businesses determine whether they’re meeting specific goals. For incident management, these metrics could be number of incidents, average time to resolve, or average time between incidents.”
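The three metrics mentioned in that quote are simple to compute once you record when each incident started and when it was resolved. The sketch below is one possible way to do it in Python; the `IncidentRecord` class and `incident_kpis` function are hypothetical names, and measuring “time between incidents” from one resolution to the next start is an assumption (some teams measure start to start instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    started: datetime
    resolved: datetime


def incident_kpis(records: list[IncidentRecord]) -> dict:
    """Compute incident count, mean time to resolve, and mean time between incidents."""
    ordered = sorted(records, key=lambda r: r.started)
    count = len(ordered)
    if count == 0:
        return {"count": 0, "mean_time_to_resolve": None, "mean_time_between": None}

    # Average duration from start to resolution.
    total_resolve = sum((r.resolved - r.started for r in ordered), timedelta())
    mean_resolve = total_resolve / count

    # Quiet time between incidents: resolution of one to the start of the next.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    mean_between = sum(gaps, timedelta()) / len(gaps) if gaps else None

    return {
        "count": count,
        "mean_time_to_resolve": mean_resolve,
        "mean_time_between": mean_between,
    }
```

Tracking these numbers month over month matters more than any single value: a shrinking time to resolve and a growing gap between incidents suggest the process is improving.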
2. Adopt a Proactive Approach
When people think about incident management, they often think about putting out fires after something goes wrong.
But adopting a proactive approach to incident management is just as important — and that means preparing for and preventing those incidents from happening in the first place.
You prepare by having a team and a plan in place (more about that later), by ensuring your technology infrastructure is up to date, and by putting monitoring and alerting systems in place to identify potential issues before they escalate into full-blown incidents.
For example, with tools like UptimeRobot, you can set up 24/7 monitoring to receive real-time updates and alerts if your site experiences performance issues.
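If you are curious what a basic availability check involves, here is a minimal sketch using only Python’s standard library. It is a generic illustration, not how UptimeRobot or any other specific tool is implemented; the function names, the two-second “slow” threshold, and the rule that anything outside the 2xx/3xx range counts as down are all assumptions.

```python
import time
import urllib.error
import urllib.request


def classify(status_code, elapsed_seconds, slow_after=2.0):
    """Turn one probe result into 'down', 'degraded', or 'up'."""
    # Assumption: no answer, or any status outside 2xx/3xx, counts as down.
    if status_code is None or not 200 <= status_code < 400:
        return "down"
    if elapsed_seconds > slow_after:
        return "degraded"
    return "up"


def probe(url, timeout=10.0):
    """Request the URL once and return (status_code or None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with an error status
    except OSError:
        status = None  # DNS failure, refused connection, timeout, and similar
    return status, time.monotonic() - start
```

Running `classify(*probe(url))` on a schedule and alerting whenever the result is not `"up"` gives you the bare bones of proactive monitoring. A hosted service adds what this sketch lacks, such as checks from multiple locations and alert delivery.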
You might not think of logging an incident as something preventive (after all, the incident already occurred), but keeping track of what happened as you’re working on resolving it might help prevent future problems. UptimeRobot’s Event Log feature helps you track all events related to your monitors so you can have a record of each incident, its causes, and the steps taken to resolve it. You can even leave comments for your colleagues.
By adopting a proactive approach to incident management, you’re not just putting out fires; you’re actively working to prevent them.
3. Prioritize Incidents Based on Their Impact
Before you go into panic mode, keep in mind that not all incidents are created equal.
A minor glitch that causes disruption for just a few minutes? Important, but not catastrophic. That server crash that causes your busy e-commerce site to grind to a halt for half a day? Panicking might be in order for this one.
This is where incident prioritization comes into play. According to Zapoj IT Event Management, “Effective incident management relies on the ability to focus on impact rather than the order in which issues arose.” So in a complex situation, the first incident to occur isn’t necessarily the first one that should be solved.
Think of incident prioritization as a sorting mechanism for your IT health — emergencies that have a potentially huge impact on business operations should trigger a quick response, but if your system just has a little “headache,” this should be considered a low-priority incident.
What’s a top-priority incident? Let’s look at some examples:
- Anything that has a significant effect on customers or users. People can’t complete online purchases? This is high-priority because it has a direct impact on revenue and customer satisfaction.
- Things that could potentially cause reputational damage. Data breaches aren’t just scary — they also break trust with customers.
- Anything that will take a long time to resolve. The more complex an incident is, the bigger the headache you’ll get trying to fix it. So if an incident doesn’t seem terrible but you’ll need six hours to get things back up and running, move this to the priority list.
Making people wait for a resolution causes customer dissatisfaction, which can then ruin your reputation, which then … you get the idea. If an incident is going to take a long time to get fixed, it should be considered a top priority.
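The three criteria above can be turned into a rough scoring rule. The following Python sketch shows one way to sort incidents by impact rather than by arrival order; the field names, the weights, and the four-hour cutoff for a “long” fix are illustrative assumptions you would tune to your own business.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    reported_order: int           # 1 = reported first
    blocks_customers: bool        # e.g. checkout is unavailable
    reputational_risk: bool       # e.g. a suspected data breach
    estimated_hours_to_fix: float


def priority_score(incident: Incident, long_fix_hours: float = 4.0) -> int:
    """Higher score = handle sooner. The weights are illustrative assumptions."""
    score = 0
    if incident.blocks_customers:
        score += 3
    if incident.reputational_risk:
        score += 3
    if incident.estimated_hours_to_fix >= long_fix_hours:
        score += 2
    return score


def triage(incidents: list[Incident]) -> list[Incident]:
    """Order by impact first; arrival order only breaks ties."""
    return sorted(incidents, key=lambda i: (-priority_score(i), i.reported_order))


if __name__ == "__main__":
    queue = [
        Incident("Typo on pricing page", 1, False, False, 0.5),
        Incident("Checkout failing", 2, True, False, 1.0),
        Incident("Suspected data breach", 3, False, True, 8.0),
    ]
    for item in triage(queue):
        print(priority_score(item), item.name)
```

Here the typo was reported first but ends up last in the queue, which is exactly the point: impact decides the order, not the timestamp.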
4. Establish a Strong Incident Response Team
In order to mitigate damage, incident management best practices dictate that a comprehensive contingency plan should be in place for unexpected events. And a well-prepared, well-coordinated incident response team is your business’ first line of defense when an incident strikes.
These are the people who will jump into action to restore normal service as quickly as possible — and they can be the difference between a minor hiccup in operations and a major catastrophe.
According to IBM’s 2022 Cost of a Data Breach Report, “Breaches at organizations with incident response (IR) team capabilities saw an average cost of a breach of USD 3.26 million in 2022, compared to USD 5.92 million at organizations without IR capabilities.”
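Simply put, organizations with an IR team in place saw breach costs that were USD 2.66 million lower on average, or roughly 45% less than organizations without one ((5.92 − 3.26) / 5.92 ≈ 0.45).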
So who makes up the ideal incident response (IR) team? There’s no one-size-fits-all answer here, and the perfect IR team should match your organization’s size and nature, but you could potentially have:
- incident manager
- technical specialists
- communication officers (somebody needs to make sure everybody else on the team is kept updated!)
- and somebody in charge of making high-level decisions as needed
Unfortunately, this is an area where many companies are failing. The IBM report reveals that “just 38% of organizations said their security teams were sufficiently staffed to meet their security management needs.” The rest are … well, they’re just hanging on for dear life.
A properly trained IR team should be able to function without guidance when an incident occurs.
This means everybody is ready to jump into action and has a pre-assigned, specific role to solve the problem so nobody needs to make last-minute decisions about how to handle an emergency.
Here’s an example of what not to do: In 2017, Danish shipping giant Maersk was the victim of a cyberattack. The NotPetya malware spread across the firm’s network and shut down Maersk’s systems, halting operations.
It took Maersk 9 days to restore its Active Directory system. One of the main reasons the damage was so extensive?
The company didn’t have an incident response team in place. After the attack happened, Maersk scrambled to assemble an IR team to work out of the UK, but by then it had lost precious time, and the work to rebuild the network ended up taking much longer. In the end, the incident cost the company between $250 million and $300 million.
Effective incident management isn’t just about responding to issues (though that would have helped Maersk significantly back in 2017), but also about being proactive with a team ready to jump into action when an incident occurs.
5. Have the Right Communication Channels in Place
One of the critical incident management best practices is keeping clear lines of communication open.
How will you handle an incident that involves customers? Effective communication is the backbone of any successful incident management process, and it matters most when customers are affected.
Many smaller incidents are first caught by users. For example, a customer trying to make a purchase might notice your website or app has crashed or is experiencing a significant slowdown.
They might also alert you to problems with the payment gateway, such as being unable to complete a transaction or being charged twice.
Users might also notice:
- Bugs like broken links or navigation issues
- Slow page load times
- Data inconsistencies and content issues, including outdated information
Creating communication channels (email, live chat, web forms, or phone) to allow people to report incidents leads to faster resolution times and better customer satisfaction.
It also adds a valuable element to your incident management plan that requires no work or investment on your side and makes people feel included.
Communication should also flow the other way, from your team to your users. This can include:
- A status page to keep users informed, minimize support tickets, and maintain transparency.
- A way to provide regular updates for an incident that’s taking some time to be resolved.
In addition to the status page, this can be done via status updates on your website, email notifications, or even through social media channels.
6. Use of Automation in Incident Management
Because incident management can be a complex process involving multiple steps, you might want to look into automation to make the process more efficient and reliable.
An incident management system (IMS) is a set of tools and processes used to identify, respond to, and resolve incidents efficiently. For example, an IMS can automate routine tasks, such as logging incidents and updating their statuses, so the response team can focus more on fixing the problem and less on administrative tasks.
Sometimes the automation can be very basic, such as creating a list of canned responses to be used by AI-powered chatbots to help users deal with simple issues or answer basic queries from employees.
Sometimes it can simplify things by providing multi-channel support. For example, tools like ProProfs Help Desk can track all incoming support requests by “centralizing information and support management services to handle a company’s internal as well as external support requests.”
This improves tracking and reporting, encourages collaboration among team members, and ensures everyone has access to the latest information to avoid confusion.
By incorporating automation into your incident management process, you will be simplifying processes while enhancing efficiency.
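To make the idea of automated logging and status updates more concrete, here is a small Python sketch of a ticket that writes its own timestamped log entry whenever its status changes. The `IncidentTicket` class and the four-stage lifecycle are hypothetical and deliberately simplified; a real IMS will have its own data model and statuses.

```python
from datetime import datetime, timezone

# Allowed status transitions (an assumed, simplified lifecycle).
TRANSITIONS = {
    "open": {"investigating"},
    "investigating": {"identified", "resolved"},
    "identified": {"resolved"},
    "resolved": set(),
}


class IncidentTicket:
    def __init__(self, title: str):
        self.title = title
        self.status = "open"
        self.log = []  # (timestamp, message) entries, appended automatically
        self._record(f"Incident opened: {title}")

    def _record(self, message: str) -> None:
        self.log.append((datetime.now(timezone.utc), message))

    def update_status(self, new_status: str, note: str = "") -> None:
        """Move to a new status and log the change, rejecting invalid jumps."""
        if new_status not in TRANSITIONS[self.status]:
            raise ValueError(f"cannot move from {self.status!r} to {new_status!r}")
        old = self.status
        self.status = new_status
        self._record(f"Status {old} -> {new_status}. {note}".strip())
```

Because every change goes through `update_status`, nobody has to remember to write the log by hand, and the finished log doubles as a timeline for the post-incident review.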
7. Have an Internal System for Regular Updates and Logs
It’s not just the end-users that need to receive regular updates when you’re dealing with a disruptive incident. Everybody in the team (whether they’re working to solve the incident or are just being affected by it) needs to be kept in the loop too.
Regular updates help create a proactive communication environment, reducing confusion and helping everyone stay on the same page.
This strategy also reduces the load on your support team, as end-users, being informed about the progress of their tickets, are less likely to inundate your team with additional queries.
Logs are an important part of this process. They come in different forms, including incident logs, response and request headers, and monitor logs. In addition to incident logs, UptimeRobot also provides Response Headers (which can help provide clues about the incident’s origin), and Monitor Logs (also known as ‘all events’), which provide additional information about system performance and error details.
These internal communication systems can dramatically improve the efficiency of your incident management process. Keeping the channels of communication open reduces confusion and stress, so every person can focus on the task at hand.
8. Maintain Comprehensive and Updated Documentation
After an incident, you should always update your database with a detailed record of the incident and its resolution.
Going forward, this allows everybody in the company to go back to this information to understand what transpired, how it was resolved, and how similar incidents can be prevented in the future.
These databases are sometimes known as ‘runbooks’, which Squadcast defines as “a compilation of routine procedures and operations that are documented for reference while working on a critical incident.”
Simply put, a runbook is a database of knowledge organized for easy access.
Rather than hunting for documents and checklists scattered across Google Docs, Notion, and other platforms, having a central database where all the information is collected saves valuable time when the team is working on critical incidents.
Updating documentation should also include redesigning documents or FAQ pages that are difficult to read.
Big chunks of text are challenging at any time, but during an incident, they will feel like a nightmare. So spend some time adding diagrams, bullet points, and other visual aids to complex documentation to make it easier and quicker to scan and read.
Standardizing the documentation also helps. If everything has a similar look or uses a standard template, the brain can absorb the information better and quicker.
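One lightweight way to enforce a standard template is to store each runbook entry as structured data and render it through a single function, so every page comes out with the same layout. The Python sketch below illustrates the idea; the `RunbookEntry` class and its sections (symptoms, steps, contacts) are assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookEntry:
    title: str
    symptoms: list[str]
    steps: list[str]
    contacts: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render every entry in the same short, scannable layout."""
        lines = [self.title.upper(), ""]
        lines.append("Symptoms:")
        lines += [f"  - {s}" for s in self.symptoms]
        lines.append("Steps:")
        lines += [f"  {n}. {step}" for n, step in enumerate(self.steps, start=1)]
        if self.contacts:
            lines.append("Contacts:")
            lines += [f"  - {c}" for c in self.contacts]
        return "\n".join(lines)
```

Since the layout lives in one place, improving readability later (say, adding a “Rollback” section) updates every entry at once.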
9. Test & Review Your Plan
In incident management, a well-drafted plan is only half the battle.
Once you’ve designed a system, you need to test it repeatedly to help you identify its strengths and weaknesses and optimize strategies.
According to InvenioIT, “around 7% of organizations never test their disaster recovery plans.” And of those that do, half will only test once a year (or less frequently). This creates a false sense of security (“But I already have a disaster recovery plan!”) and you might end up with an even worse crisis.
Carrying out practice drills and updates can prevent trouble down the line, especially because they will double as practice sessions for your incident management team.
If you wait a whole year to test and update your incident response plan, you might discover some of the procedures aren’t current or key people no longer work for the company.
If part of your incident plan requires calling John when something goes wrong but John left the company three months ago and nobody found a replacement for his role, you’re in trouble. Involving relevant staff in the tests is essential to make sure everybody understands their roles during a real incident.
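The “John left three months ago” problem is easy to catch automatically if your response roster and your staff list both exist as data. Here is a minimal Python sketch of such a check; the function name, the role names, and the idea of comparing against a set of active staff are assumptions for illustration.

```python
def find_roster_gaps(roster: dict[str, str], active_staff: set[str]) -> list[str]:
    """Return the roles whose assigned person is missing or no longer active."""
    gaps = []
    for role, person in roster.items():
        # An empty assignment or a departed employee both leave the role uncovered.
        if not person or person not in active_staff:
            gaps.append(role)
    return sorted(gaps)


if __name__ == "__main__":
    roster = {
        "incident manager": "Priya",
        "database specialist": "John",
        "communications": "",
    }
    active_staff = {"Priya", "Sam"}
    for role in find_roster_gaps(roster, active_staff):
        print(f"No active owner for role: {role}")
```

Running a check like this on a schedule, or as part of every drill, flags stale assignments long before a real incident does.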
Plans should also be reviewed every time a real incident happens. This is the perfect opportunity to identify areas for improvement.
It’s also a good time to see if your test scenarios are realistic and reflect potential incidents your team may face.
10. Post-incident Reviews and Analysis
A postmortem (also known as a post-incident review and analysis) is an essential component of incident management best practices.
These reviews allow teams to learn from past incidents, improve processes, and prevent similar incidents from happening again.
A postmortem offers a chance to bring people together to discuss the details of an incident: what happened, why it happened, what impact it had, and how to make sure it doesn’t happen again.
The process of conducting an effective post-incident review involves several steps:
Collecting and analyzing data for root-cause analysis (RCA)
Think you know everything about why an incident happened? Don’t be so sure.
Sometimes the tiniest details can make all the difference when you’re analyzing all relevant data about the incident.
So make sure you spend some time looking at incident logs, system metrics, user reports, and any other data that could provide insights into what happened and why. If you have an UptimeRobot account, you can check the Incident tab to get specific details about an incident.
Implementing recommendations for improvement
After identifying the root causes, the next step is to come up with a set of changes you can implement to prevent the same thing from happening again.
This could involve changes to systems, processes, or even staff training.
RCA shouldn’t be looked at as a blame game. You’re not in it to figure out who’s at fault for what happened but to understand why it did so you can take action. For example, leading e-commerce site Etsy conducts what they call “blameless postMortems.”
Their main focus is “to understand how an accident could have happened, in order to better equip ourselves from it happening in the future.” The company never places blame, reprimands, or fires people for an action that made sense to the person at the time they took it.
In short, things happen — let’s take the opportunity to learn from them.
Informing your customers of the results of your postmortem
This is key to regaining their trust after an incident. For example, in May 2023, GitHub experienced a major outage caused by a configuration change to GitHub’s internal service.
A week later, GitHub’s Chief Security Officer shared details about the incident on the company’s blog, explaining that they had since mitigated those incidents and all systems were back to operating normally and adding that “This is not acceptable nor the standard we hold ourselves to.”
An effective method to keep your users informed is by utilizing a status page. The status page allows you to communicate essential information regarding your service availability and provide updates about scheduled maintenance.
While postmortems are essential to learning from incidents so you can improve your incident management processes, they are also key to assuring your customers you are committed to doing better in the future.
Understanding and implementing incident management best practices can significantly improve your response when unexpected events happen. By focusing on these guidelines during critical moments, you’ll minimize losses and improve your chances for long-term success.
So, go ahead and roll up your sleeves, because who said being ready for the unexpected can’t have a fun side?