In case of an IT incident, analyze what went wrong

Written by Cyberangels
Updated over 2 years ago

Once an incident is resolved, there is a tendency to move on and return to normal daily work. This misses opportunities to gather critical lessons and to understand true system behavior as well as process and system disruptions.

Conducting effective post-incident reviews and taking clear actions based on those reviews is essential.

POST-INCIDENT REVIEWS

Post-incident reviews are a key component of an organization's culture. They form a critical feedback loop that contributes to system understanding and continuous learning.

Post-incident reviews should be of two types: local and global.

LOCAL POST-INCIDENT REVIEW

The purpose of the review meeting is to focus on what happened and what can be learned from the incident. To this end, the team takes the following actions:

  • Review the timeline;

  • Identify and discuss what went wrong;

  • Discuss what went right.

Some of the most important questions to ask are:

  1. How could we have identified it earlier?

  2. How could we have diagnosed the incident more quickly? Did the analysts have the information needed to diagnose the problem?

  3. What would have helped solve the problem faster? Do we need new triggers, data collection, tools, or processes?

  4. What specific actions should we take to improve?

  5. Where did we get lucky?

  6. What have we learned about the behavior of our system?

  7. How could we have prevented the incident from occurring?

  8. What went well in handling the incident?
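Questions 1 to 3 are easier to answer when the reviewed timeline is reduced to a few durations. The following is a minimal Python sketch, not a prescribed method: the `TimelineEvent` type, the phase labels, and the sample timestamps are all hypothetical and only illustrate how the time spent between detection, diagnosis, and resolution could be read off a timeline.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime   # when the event happened
    phase: str     # hypothetical labels: "started", "detected", "diagnosed", "resolved"
    note: str      # free-text context captured during the review


def phase_durations(events):
    """Return the time elapsed between consecutive events, in chronological order."""
    ordered = sorted(events, key=lambda e: e.at)
    return {
        f"{a.phase} -> {b.phase}": b.at - a.at
        for a, b in zip(ordered, ordered[1:])
    }


# Illustrative data only; a real review would use the recorded incident timeline.
timeline = [
    TimelineEvent(datetime(2024, 1, 10, 9, 0), "started", "first failed requests"),
    TimelineEvent(datetime(2024, 1, 10, 9, 25), "detected", "customer report received"),
    TimelineEvent(datetime(2024, 1, 10, 10, 5), "diagnosed", "cause identified"),
    TimelineEvent(datetime(2024, 1, 10, 10, 40), "resolved", "service restored"),
]

durations = phase_durations(timeline)
for step, elapsed in durations.items():
    print(f"{step}: {elapsed}")
```

The longest gap suggests which question deserves the most discussion time: a long "started -> detected" gap points to detection, a long "detected -> diagnosed" gap to the information available to analysts.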

RECORD AND ACT

Immediate tactical solutions are important and must be identified to stabilize systems as quickly as possible, but longer-term, large-scale improvements must also be discussed to identify solutions that will prevent incidents from recurring.

Actions to be taken should be collected and recorded in the incident team's work tracking system.
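As an illustration of what a recorded action could look like, here is a small Python sketch. The field names, the `tactical` / `long-term` split, and the JSON export are assumptions made for this example; they do not correspond to the API of any particular tracking tool.

```python
import json
from dataclasses import asdict, dataclass

# Assumed labels mirroring the two kinds of solutions described above.
HORIZONS = {"tactical", "long-term"}


@dataclass
class ActionItem:
    incident_id: str   # hypothetical identifier of the reviewed incident
    title: str         # what should be done
    owner: str         # team or person responsible for follow-up
    horizon: str       # "tactical" (stabilize now) or "long-term" (prevent recurrence)
    done: bool = False

    def __post_init__(self):
        if self.horizon not in HORIZONS:
            raise ValueError(f"horizon must be one of {sorted(HORIZONS)}")


def to_tracker_payload(items):
    """Serialize action items to JSON, ready to be imported into a tracker."""
    return json.dumps([asdict(item) for item in items], indent=2)


# Illustrative entries only.
actions = [
    ActionItem("INC-001", "Restart the stuck queue worker", "ops-team", "tactical"),
    ActionItem("INC-001", "Add an alert on queue depth", "platform-team", "long-term"),
]

payload = to_tracker_payload(actions)
print(payload)
```

Keeping the incident identifier on every item makes it possible to check later whether the actions from a given review were actually completed.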

BLAMELESSNESS IS CRITICAL TO LEARNING

The team must focus on identifying deficiencies in existing systems and processes. Complex systems fail for a variety of reasons; therefore, the review should not focus on people.

If a team member made a wrong decision, the conversation should address the missing information that would have helped him or her better understand the situation.

If someone made a mistake, the conversation should be about how to make the system safer so that this type of mistake is not possible or is at least more easily detected.

GLOBAL POST-INCIDENT REVIEW

Local post-incident reviews generate significant learning about localized behavior and the behavior of systems and processes. But when teams conduct reviews in isolation, the organization and other teams do not have access to all the lessons learned.

SPREAD THE LEARNING

In addition to local post-incident reviews, global learning needs to be generated by making local review results widely available.

BREAK DOWN SILOS

The following practices break down silos between teams and maximize cross-functional learning throughout the organization:

  • Hold a Global Incident Review if a major incident has occurred;

  • During the Global Incident Review, teams and stakeholders should focus on assessing the impact on the business and then on the technology stack;

  • Tell the story of the incident to provide the best possible context and to encourage the audience to get involved;

  • Discuss remediation plans and upcoming improvement items;

  • Discuss what the organization and all teams (not just the affected team) can learn from the event;

  • Identify improvements to incident diagnosis, such as determining the impacted service, setting the priority level, and engaging the correct resolution teams, in order to improve future response times;

  • Review the repair steps and identify recommendations to reduce the repair duration of a future incident;

  • Evaluate whether incident communication was effective or whether anything can be improved to reduce delays, confusion, and response times.

During these sessions, and after the specific incidents have been evaluated and reviewed, it is important to consolidate all the knowledge shared and gained into the company's best-practices document.

This document will increase awareness of incident response and solutions and enable continuous improvement throughout the organization.

ACTION

After an incident is resolved, the organization and team must improve their ability to detect, diagnose, mitigate, resolve, and prevent future incidents. They can strengthen and encourage collective ownership of system reliability and customer experience, restore and maintain customer and stakeholder trust, and identify large-scale system and process changes that improve system robustness and reduce future impact.

ELEMENTS OF IMPROVEMENT

As part of the post-incident analysis, research the factors that contributed to the incident and identify improvement opportunities that are specific, targeted, and actionable.

THINK BROADLY

It may be tempting to identify a very specific change that would solve only the problem that occurred in this particular incident. Where possible, aim instead for changes that address a whole class of problems that could cause similar incidents. It may be helpful to stimulate discussion with targeted questions such as:

  • How could we have detected the incident more easily?

  • How could we have diagnosed the incident more quickly?

  • How could we have mitigated the effects of the incident on the customer experience?

  • How could we have resolved the incident more quickly?

  • How could we have prevented the incident from occurring?

PRIORITIZE CAREFULLY

Not all improvements can or should be implemented, for reasons of feasibility and effort. Be sure to prioritize improvements that will have the greatest impact and solve larger classes of problems.
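One possible way to make this prioritization explicit is a simple scoring heuristic. The sketch below is an assumption for illustration, not a recommended formula: the `Improvement` fields, the 1 to 5 scales, and the score `impact * breadth / effort` are all hypothetical, and the sample candidates are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Improvement:
    name: str
    impact: int    # assumed scale: 1 (minor) to 5 (major)
    breadth: int   # assumed: number of problem classes the change addresses
    effort: int    # assumed scale: 1 (easy) to 5 (hard)


def priority_score(item):
    """Favor high impact and broad coverage, penalize high effort."""
    return item.impact * item.breadth / item.effort


def prioritize(items):
    """Return the improvements sorted from highest to lowest score."""
    return sorted(items, key=priority_score, reverse=True)


# Illustrative candidates only.
candidates = [
    Improvement("Fix the single misconfigured alert threshold", impact=2, breadth=1, effort=1),
    Improvement("Add configuration validation before deployment", impact=4, breadth=3, effort=3),
    Improvement("Rebuild the entire monitoring stack", impact=5, breadth=3, effort=5),
]

ranked = prioritize(candidates)
for item in ranked:
    print(f"{priority_score(item):.1f}  {item.name}")
```

With these sample values, the validation step ranks first because it covers several problem classes at moderate effort, while the narrow one-off fix ranks last despite being the easiest. Any real scoring should be agreed on by the teams involved.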
