At some point you’re going to encounter a serious incident that takes out one or more of the IT systems that underpin a critical part of your business operation. The focus will be on troubleshooting and recovery action in order to get everything back up and running. Once the dust has settled, you’ll want to review what went wrong so that you can improve systems and processes to prevent it happening again.
You need a Major Incident Review. You’ve probably already got a process for running such a review. If you haven’t, I urge you to get one defined – it’s far better to have one prepared than to try to make it up on the fly. If you have, hopefully it isn’t seen as a witch hunt.
At BLMS we have built up a library of pragmatic best practices that can help you improve IT operational quality and resilience. Here are three tips for reviewing major incidents.
#1 – Make sure somebody keeps a log of events and timings during the course of the incident.
It’s all too easy to miss this and it makes things much harder when you come to run the Review. Without a log, there is greater potential for argument between the involved parties as they will each have a slightly different recollection of what happened when. That will be a big gap for those aspects of the investigation where it is key to understand the correct sequence of events.
Maintaining a log isn’t difficult, but it does take time. If you are a large IT department, you’re likely to have somebody in a Shift Operations team or an Incident Manager whose role it is to step in and do this. If you don’t have that luxury, you must still assign the responsibility to an appropriate individual.
Some people use a whiteboard as a focal point for recording details; some will block out a specific office or meeting room to act as the incident hub. Either approach will need additional communication methods when people are split across multiple locations. The more comprehensive your blow-by-blow account is, the easier it is to get your Major Incident Review meeting off to a solid start.
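If your team is split across locations, a shared electronic log can do the job of the whiteboard. The following is only a minimal sketch of what such a log might hold, assuming that a timestamp, an author and a short note are enough for each entry. The names `LogEntry`, `IncidentLog`, `record` and `timeline` are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LogEntry:
    timestamp: datetime  # when the event happened
    author: str          # who reported or observed it
    note: str            # what happened


@dataclass
class IncidentLog:
    """Hypothetical event log kept by whoever owns the logging role."""
    incident: str
    entries: list = field(default_factory=list)

    def record(self, author, note, timestamp=None):
        # Default to "now" in UTC so entries from different sites line up.
        # Keep every timestamp timezone-aware, or sorting will fail.
        if timestamp is None:
            timestamp = datetime.now(timezone.utc)
        entry = LogEntry(timestamp, author, note)
        self.entries.append(entry)
        return entry

    def timeline(self):
        # Entries are often written down late, so sort by event time
        # rather than relying on the order in which they were added.
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return [f"{e.timestamp:%H:%M} {e.author}: {e.note}" for e in ordered]
```

Calling `timeline()` at the start of the Review gives everyone the same agreed sequence of events to work from, whatever order the entries were written in.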
#2 – Make sure you get a senior person with plenty of management clout to chair the Review meeting.
Easy to say and not so easy to do. It helps if the Chairperson is sufficiently independent and has sufficient understanding of the systems and technical landscape. Of course, the word “sufficient” is wildly subjective. In the end, it comes down to a judgment call. In smaller organisations, it may need to be someone from outside the IT team. It might even be worth bringing in an external body. (Note that the TSB incident was reviewed by several external bodies, driven by the significant industry and media exposure.)
Most organisations put the “service quality” roles (problem, change, incident, etc.) in the operations side of the IT department, but this can cause a degree of “us and them” when it comes to reviewing major incidents – particularly if the cause was a problematic application software release.
Whoever chairs your review must have gravitas and command respect from the attendees. Otherwise, the meeting could end up being very ineffective at drawing out learning points and remedial actions.
#3 – Make sure your actions pass the quality test.
There is always potential for dubious actions to creep into Major Incident Review reports. Imagine a confident and vocal attendee throwing in a hand grenade issue simply to score points against a colleague in another team. The issue will be something they have been harbouring for a long time and this is their opportunity to get it sorted out … but it has no connection to the events of this incident nor to its learning points. When you have drawn up your Major Incident Review document, it is good practice to cross-check each of the actions to ensure it is supported by facts and conclusions in the body of the report.
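The cross-check can be done by hand, but if your report is held in a structured form it is easy to automate. Here is a rough sketch, assuming each finding in the report has an identifier and each action lists the findings that justify it. The `Action` class, the `unsupported_actions` function and the finding identifiers in the usage note are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Action:
    description: str
    owner: str
    supporting_findings: tuple = ()  # ids of findings that justify this action


def unsupported_actions(actions, finding_ids):
    """Return actions that fail the cross-check against the report body.

    An action fails if it cites no finding at all, or if it cites a
    finding that does not appear in the report.
    """
    known = set(finding_ids)
    return [
        a for a in actions
        if not a.supporting_findings
        or not set(a.supporting_findings) <= known
    ]
```

For example, with findings `["F1", "F2"]` in the report, an action citing `("F1",)` passes, while an action citing nothing, or citing an unknown `"F9"`, is returned for the Chairperson to challenge.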
The other trap we see people running into is the remediation action that doesn’t remediate anything.
Take, for example, “Agree a plan for upgrading the server’s operating system”. The plan gets agreed and the action gets closed. Then the business area decides not to support the upgrade because it will cost money and they are not prepared to set time aside for the necessary testing.
Or “Undertake a review of the security settings”. Three weeks later the security settings have been reviewed and the action is closed off. But nothing has been changed. Nothing has been fixed. Nothing has been put in place to track any of the possible improvements identified in the review of security settings.
In both cases, the action is lost and the issue may well come back to haunt you a second time.
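One way to guard against this is to refuse to close an action until it shows either a change that was actually delivered or follow-up actions that are being tracked. The sketch below illustrates that rule only. The `RemedialAction` class, its fields and the `can_close` function are hypothetical names, and your own tracking tool will have its own equivalents.

```python
from dataclasses import dataclass, field


@dataclass
class RemedialAction:
    description: str
    change_delivered: str = ""  # what was actually changed, if anything
    follow_ups: list = field(default_factory=list)  # tracked follow-on actions


def can_close(action):
    """Allow closure only if something was changed or follow-ups are tracked."""
    delivered = bool(action.change_delivered.strip())
    tracked = bool(action.follow_ups)
    return delivered or tracked
```

Under this rule, “Undertake a review of the security settings” could only be closed once the improvements it identified had been raised as tracked follow-ups, so they stay visible until something is fixed.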
If you’d like help with any IT operations best practices, drop us an email or pick up the phone and call us.
07470 002019 / 0114 398 4344