Incident Classification and Root Cause Analysis

I’ve just been looking at IT incident classification and root cause analysis for somebody and the recent cold snap in the weather got me thinking about parallels. How do incidents in the world of IT compare with the incidents that could impact your domestic central heating and hot water system?

(1) An “Out-of-the-Blue” Failure. The boiler breaks down. It could be a hardware component failing or a software fault in the heating control’s firmware. It might have been a fuse blowing in your electricity distribution board (blame facilities management). Most homes will not have a secondary standby boiler that automatically switches in, so you’ll have to fall back on your alternative systems – a traditional open-hearth fire, or a more modern log burner. You could put the immersion on for hot water (not an option if you have a combi boiler) or boil the kettle. When the dust has settled, you may want to dig deeper into why the failure occurred. Was it a case of poor maintenance of the system – a build-up of dirt in the pipework, or not having the boiler properly serviced each year?

(2) Third Party. There’s been an issue with the gas or electricity supply to your area. Not much you can do about that, unless you’re rigged up with a domestic UPS and a reserve fuel supply.

(3) Capacity. The radiators and pipework might not be sized correctly in each room – in which case you have the “capacity on demand” option of putting some portable electric heaters in the colder rooms. Your boiler might not be large enough to deal with a spell of severe cold weather, so you leave the heating on throughout the day to smooth out the peaks and troughs in demand which your timer dictates.

(4) Human Error. Somebody in the household – the “operator” – has messed with the thermostat or set the timer incorrectly (“scheduler” problem).

(5) Hostile Attack. An intruder has broken into the house and deliberately damaged your boiler. Perhaps – in this world of IoT – your electronic home management system has been hacked and rendered inoperable by somebody working on behalf of a foreign government agency.

(6) Change. You had a scheduled action to amend the settings on the timer when the clocks went back from British Summer Time to GMT, but you moved the clock the wrong way. An example of a good Change Implementation Plan being executed badly. Or it could have been that scheduled maintenance slot when the heating engineer called in to service your boiler and didn’t put everything back together correctly.

(7) Act of God. Sadly, we’ve seen a lot of flooding over recent weeks. I count myself lucky that I live on the west side of Sheffield where the River Sheaf is only a small stream. Those over to the east beyond Rotherham and Doncaster have been badly hit by the River Don – it’s been awful to see on the local news. With that in mind, I’ll use the scenario where a house is hit by lightning and ravaged by fire. The Domestic Continuity plan kicks into action and you move to your disaster recovery site – temporarily living with family or friends, or lodging in a B&B until your “primary site” home is given the all clear again.

(8) No Fault Found. It was an intermittent glitch. You’ve checked it thoroughly and had a heating engineer give it a full health-check. The event logs in the boiler aren’t sufficiently detailed to enlighten you as to what exactly happened, so you have to put it down to experience and watch out in case it happens again.

As you can see, these all look very similar to what we would expect to find in an IT department’s Incident and Problem Management system.

The likelihood is that item 1 (Out-of-the-Blue Failure) would account for most central heating incidents. It can be a significant factor in IT too, if resilience either isn’t built in or has been configured badly and fails to work as designed. Otherwise, the majority of IT incidents can be traced back to item 6, Change. (More about best practices for implementing change in a future blog post.)

Needless to say, running a business’s IT services is a lot more involved than looking after a home’s central heating system, so the ability to analyse incidents and identify underlying problem themes is key to driving up quality. This also means that you will require additional levels of categorisation to further qualify each of the causes stated above (for example, a failure could be hardware, network, software, etc.). But beware: it is very easy to over-complicate the list of cause codes, and that will make the subsequent analysis far harder and less meaningful.
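To make the idea of a second level of categorisation concrete, here is a minimal sketch in Python. It uses the eight causes listed above as the top level and qualifies only Failure with the three examples mentioned (hardware, network, software). The class names, the incident references and the sample data are all made up for illustration; a real Incident and Problem Management tool would hold this in its own configuration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Cause(Enum):
    """Top-level cause codes, taken from the eight categories in this post."""
    FAILURE = "Out-of-the-Blue Failure"
    THIRD_PARTY = "Third Party"
    CAPACITY = "Capacity"
    HUMAN_ERROR = "Human Error"
    HOSTILE_ATTACK = "Hostile Attack"
    CHANGE = "Change"
    ACT_OF_GOD = "Act of God"
    NO_FAULT_FOUND = "No Fault Found"


# Second-level qualifiers. Only Failure is qualified here, using the examples
# from the text; anything else you add is your own design decision.
SUB_CAUSES = {
    Cause.FAILURE: {"hardware", "network", "software"},
}


@dataclass(frozen=True)
class Incident:
    ref: str                         # hypothetical incident reference
    cause: Cause
    sub_cause: Optional[str] = None  # optional second-level qualifier

    def __post_init__(self):
        # Reject qualifiers that are not defined for the chosen cause,
        # which keeps the list of cause codes from growing unchecked.
        allowed = SUB_CAUSES.get(self.cause, set())
        if self.sub_cause is not None and self.sub_cause not in allowed:
            raise ValueError(f"{self.sub_cause!r} is not valid for {self.cause.value}")


def count_by_cause(incidents):
    """Tally incidents by top-level cause to expose the dominant themes."""
    return Counter(incident.cause for incident in incidents)


# Made-up sample data, purely for illustration.
sample = [
    Incident("INC-001", Cause.CHANGE),
    Incident("INC-002", Cause.FAILURE, "hardware"),
    Incident("INC-003", Cause.CHANGE),
    Incident("INC-004", Cause.HUMAN_ERROR),
]
totals = count_by_cause(sample)
top_cause, top_count = totals.most_common(1)[0]
```

Keeping the qualifiers in one small lookup makes it obvious when the list is starting to sprawl, and the tally by top-level cause is the simplest possible view of the underlying problem themes.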

So, a couple of points to conclude this blog:

  • Think clearly about the hypotheses you expect to analyse against. This will help inform your classification of cause codes.
  • Just like boiler maintenance, it’s worth reviewing your cause codes on a regular basis to make sure the hypotheses and classifications are still aligned to your quality analytics.
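One way to support that regular review is to look for cause codes that are hardly ever used. The sketch below is only an illustration: the function name, the 2% threshold, the code labels and the sample history are all assumptions, and a rarely used code is a candidate for discussion, not for automatic removal (you would hope Act of God stays rare).

```python
from collections import Counter


def rarely_used_codes(assigned_codes, all_codes, min_share=0.02):
    """Return the cause codes whose share of incidents is below min_share."""
    counts = Counter(assigned_codes)
    total = sum(counts.values())
    if total == 0:
        # No incidents logged yet, so every code is unproven.
        return sorted(all_codes)
    return sorted(code for code in all_codes if counts[code] / total < min_share)


# Made-up history of 100 incidents, purely for illustration.
history = (
    ["change"] * 60
    + ["failure/hardware"] * 30
    + ["failure/software"] * 9
    + ["act-of-god"]
)
codes = {
    "change",
    "failure/hardware",
    "failure/software",
    "failure/firmware",
    "act-of-god",
}
review = rarely_used_codes(history, codes)
```

With this sample data, `review` contains `act-of-god` and `failure/firmware`: the first is legitimately rare, while the second has never been used and may be a sign that the list has become more detailed than the analysis needs.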

BLMS provide interim management support and best practice consultancy to help you manage your IT complexity