The Question that takes away all Blame
The Blameless Postmortem, or blameless Root Cause Analysis, is supposed to be the new normal in DevOps, but it is still all too common that we seek to blame the team, or even the individual.
Reacting to business outages by ‘looking for the responsible party’ is especially prevalent in siloed organizations. This kind of behavior ultimately produces a culture that puts emphasis on ‘blame’. It is an undesirable situation that can easily be averted by asking the right questions when determining the root cause of the incident. Asking the wrong questions is the main reason we fail to find the true root cause. It puts the focus on blame instead, and thereby fosters a culture in which we are unable to prevent incidents from recurring.
The interesting part here is in the ‘asking the wrong questions.’
I want to emphasize that postmortems are about preventing incidents from recurring.
‘How?’ is irrelevant
In the aftermath of incidents that interrupt business processes, we ask ourselves: “How did we fix what caused the business outage?”
Although it is a valid question, asking ‘How?’ is not of particular value. It consistently leads us to think that once we have answered it, we have found the root cause of the incident. We have not; we have merely determined how we fixed the problem.
When we ask ourselves: “What caused the incident?”, chances are that we will correctly determine the cause of the problem. This cause is almost always technical in nature: ‘There was not enough memory in the server’, ‘There was a bandwidth problem’, ‘There was a bug in the software that resulted in the incident’ or ‘There was a parameter set incorrectly’.
There is something very satisfying in the answer to the question ‘What caused the incident?’. You get a gratifying feeling when you have identified the weakness in your product. It is in the software, in the hardware configuration, or in the network infrastructure. When we know where the weakness lies, we know who is responsible for the weak component: the development team, the automation team, or the network team. And when we know who is responsible for that component, we know where to put the blame.
Although everything seems to become tangible when asking the ‘What?’ question, it hardly tells us anything about the root cause of the incident.
‘What?’ is wrong
In almost every root cause analysis (RCA) I was involved in, putting the blame on somebody was not about punishing that person; it was about identifying who should fix the problem. Granted, that, in itself, could be perceived as punishment. Trust me, I have worked with systems where having to fix the problem could be considered capital punishment. Although the intention is never to put the blame on somebody, the unfortunate reality is that this is usually the perceived outcome of these sessions.
Sessions where we raise the question ‘What caused the problem?’ often digress into sessions where the question answered becomes ‘Who caused the problem?’
In my experience, looking for the person to blame is the usual reaction. That person is almost always held solely responsible for the incident, which seems to be the logical outcome. This is where an important semantic aspect of the RCA lies: the person being held responsible is not necessarily the person who is accountable for the incident. Furthermore, be aware that in a siloed organization, there is no single person who can be held solely accountable.
During most of the RCA sessions I participated in, the ‘What?’ question focused on getting clarity around ‘What was done to fix the problem?’ and ‘What was the cause of the problem?’. However, in the majority of these cases the question ‘What must be done to prevent the problem from occurring in the future?’ was not addressed. I agree that it is important to understand what the problem was and how it was fixed, but it is even more important to understand what is required to prevent a recurrence of the problem.
It can be disconcerting to realize that the same problem might surface again. Preventing this recurrence does require addressing the true cause of the problem. Therefore, it is necessary to ask the question ‘Why did the problem happen in the first place?’
‘Why?’ is right
The question that we will have to ask ourselves is not so much about what caused the incident; it is about why the incident could occur.
This is a tough question to answer. Let us start with an easy question and ask ourselves why a problem surfaced in the first place. Depending on the actual problem we faced, we can start off by asking: ‘Why was there a bug in the software?’, ‘Why was there not enough bandwidth?’ or ‘Why was there not enough memory in the server?’.
A very common, tried-and-tested, effective way of identifying the true root cause of an incident is the 5-Y approach, often called the ‘Five Whys’. In this approach you ask five times ‘Why could [the answer to the previous ‘why’ question] happen?’. Experience has taught that going five levels deep, that is, asking ‘Why?’ five times, will usually get you to the root cause of the problem. Sometimes it takes fewer, but you seldom have to go more than five levels deep.
Let’s assume that the incident was due to insufficient bandwidth and let’s start asking ‘Why?’
- Why was there not enough bandwidth? Too many customers accessed the newly released API.
- Why did too many customers access the new API? Because we announced it prematurely in our global newsletter.
- Why was it announced in our global newsletter? Because the marketing manager wasn’t aware that the API was to be released following the ‘soft-launch protocol’.
- Why was the marketing manager not aware of the fact that the release was to follow the soft-launch protocol? Because she was not in the meeting in which it was decided to follow the soft-launch protocol.
- Why wasn’t she in the meeting in which it was decided that the API was going to follow the soft-launch protocol? Because she was on vacation and didn’t appoint a delegate.
By now, we know why the incident could occur. Not what caused it, but why it could happen in the first place. This is the knowledge required for preventing a repetition of the incident in the future. In our case:
“Making sure that the marketing manager or a delegate attends the meetings in which product launch strategies are decided will prevent this incident from occurring in the future; there is no need for more bandwidth.”
Of course, the above is only a simplified example, but it does show that by asking ‘What?’, the solution would be a costly technical solution and by asking ‘Why?’, the solution is better meeting attendance.
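For teams that like to record the chain of questions in their postmortem tooling, the example above could be captured in a few lines of Python. This is only a minimal sketch: the names `Why`, `root_cause` and `render` are made up for this illustration and do not come from any existing library, and the code records the answers people give; it cannot find them for you.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Why:
    """One level of the chain: a 'Why?' question and the answer given to it."""
    question: str
    answer: str


def root_cause(chain: list[Why]) -> str:
    """Return the answer to the last 'Why?' asked, the candidate root cause."""
    if not chain:
        raise ValueError("ask at least one 'Why?' first")
    return chain[-1].answer


def render(chain: list[Why]) -> str:
    """Format the chain as a numbered list for the postmortem notes."""
    return "\n".join(
        f"{level}. {why.question} {why.answer}"
        for level, why in enumerate(chain, start=1)
    )


# The bandwidth incident from the text, one entry per 'Why?'.
CHAIN = [
    Why("Why was there not enough bandwidth?",
        "Too many customers accessed the newly released API."),
    Why("Why did too many customers access the new API?",
        "We announced it prematurely in our global newsletter."),
    Why("Why was it announced in our global newsletter?",
        "The marketing manager wasn't aware of the soft-launch protocol."),
    Why("Why was the marketing manager not aware of it?",
        "She was not in the meeting in which it was decided."),
    Why("Why wasn't she in that meeting?",
        "She was on vacation and didn't appoint a delegate."),
]

if __name__ == "__main__":
    print(render(CHAIN))
    print("Candidate root cause:", root_cause(CHAIN))
```

Note that nothing in the last entry points at a server, a network or a team to blame; it points at a gap in how launch decisions are communicated.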
The holistic view of ‘Why?’
Another important conclusion we can gather from asking the right question is that asking ‘What?’ only involves a certain kind of people. Typically, these are technical people, leading us down a slippery and costly technical slope. In contrast, the ‘Why?’ question requires all parties involved in the delivery of the product to attend the postmortem, allowing us to benefit from a holistically shared responsibility.
Getting to the bottom of the incident’s cause requires a multi-disciplinary team, just like delivering a product requires many disciplines. In fact, problems are addressed most effectively when all of the involved disciplines fully collaborate. Make sure to have the right people from the various disciplines available to get accurate and actionable answers to the five ‘Why?’ questions.
Product Owner or Problem Owner?
To further build the case for a multi-disciplinary approach to the RCA process: if being successful requires many disciplines, it makes no sense to think that only one discipline is required to prevent failure. From a product delivery perspective, there is no difference between delivering something that works and delivering something that does not.
The question is of course: ‘Who would go through all this trouble and assemble all the people who are involved in delivering a product into the hands of our customers?’ It is the person who is held accountable for the incident. More importantly, it is the person who is held accountable for making sure that the incident does not occur again. Not coincidentally, this is the same person who is held accountable for delivering a successful product: the Product Owner.
The Product Owner is the perfect role for heading the RCA. Ownership of the product implies ownership of the success of the product and all challenges that come with it.
The text very explicitly communicates my own personal views, experiences and practices. Any similarities with the views, experiences and practices of any of my previous or current clients, customers or employers are strictly coincidental. This post is therefore my own, and I am the sole author of it and am the sole copyright holder of it.
Image source: http://bit.ly/blaming-finger-image
Special thanks to my lovely wife Olcay, as well as my friends Melanie and Sytse who took the time and made the effort to review my article. I am confident that the article’s quality was significantly improved by their feedback.