Learn more about root cause analysis and its role in data engineering.
Modern engineering has revolutionized almost every complex human endeavor.
From lean manufacturing to global telecommunications, from software and IT that bring the world to our fingertips to medical devices that detect previously invisible diseases, there is hardly a field that engineering has not changed for the better. But engineers don’t only build the complex systems and tools that make the world go round. They’re also the first line of defense when things go south.
They swap the blown fuses in our electrical grids, replace pumps to keep the machines running, and debug software before critical data is lost to downtime.
And things go wrong more often than planned.
Root Cause Analysis (RCA) is one of the most useful problem-solving methods in the engineering toolbox. It is used to identify the root causes of failures within complex engineering systems and correct them.
Root Cause Analysis has three main use cases:
When things go wrong, RCA is used to identify the root cause of the problem that caused the accident.
Within accident analysis, RCA identifies the symptoms (the problem) and the causes that led to those symptoms. Engineering teams then recommend corrective actions that remove the source of the problem and prevent it from happening again.
For example, if a bug caused a database to crash and lose important information, the developer would not just bring the database back online but would also fix the bug, thus preventing the database from crashing again in the future.
The RCA logic can be applied to existing systems to make them better.
If a component of an engineering system is working suboptimally, RCA can be used to trace the sluggish performance back to its root, identifying potential causes of the suboptimal behavior so they can be improved.
This technique is often used in change analysis and risk management to determine how complex systems would behave under different hypothetical scenarios. A potential cause of change is identified, and RCA is used to check how it would affect the overall system.
Root Cause Analysis is used during regular monitoring and quality control, to guarantee a high standard of operations. By tracing monitored events to their source, RCA helps engineers identify which monitored elements are performing suboptimally.
So how does Root Cause Analysis look in practice?
The Root Cause Analysis technique goes through 7 steps.
The Root Cause Analysis process does not start with the root cause of problems. It starts with the problem.
Understand and define in detail the problem that caused the defect in your engineering system.
A good problem definition has three components:
Ask yourself five times why the problem happened. The first “why” gets you the immediate cause of the problem.
The second “why” gets you the cause of the cause of the problem.
And so forth until you come to the root of the problem.
Let’s look at a data example:
Problem definition: The machine learning algorithm used in production to predict the price of electricity your company is trading is outputting extreme outliers for the price of electricity.
The 5-whys technique was developed by Sakichi Toyoda, and it is widely credited with the success of the Toyota production system. As you can see from the above example, five whys are usually sufficient to get you to the root of the problem.
If it is not already clear from the 5-whys exercise, establish a causal chain that links each cause to the next, from the root cause to the defined problem.
This answers the question “How did the problem arise from the root cause?” by listing all the intermediate contributors.
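To make the idea concrete, here is a minimal Python sketch of a causal chain. It assumes you have already written down, for each effect, its immediate cause. The incident, the `IMMEDIATE_CAUSE` map, and the `causal_chain` helper are all hypothetical and only illustrate the technique.

```python
# Hypothetical cause map: each effect points to its immediate cause.
# The entries are illustrative only, not taken from a real incident.
IMMEDIATE_CAUSE = {
    "dashboard shows wrong totals": "nightly load wrote duplicate rows",
    "nightly load wrote duplicate rows": "job ran twice",
    "job ran twice": "scheduler retried after a timeout",
    "scheduler retried after a timeout": "timeout set below normal runtime",
}


def causal_chain(problem, immediate_cause, max_whys=5):
    """Follow 'why?' links from the problem toward the root cause."""
    chain = [problem]
    for _ in range(max_whys):
        cause = immediate_cause.get(chain[-1])
        # Stop when no further cause is recorded or the map loops back.
        if cause is None or cause in chain:
            break
        chain.append(cause)
    return chain


chain = causal_chain("dashboard shows wrong totals", IMMEDIATE_CAUSE)
root_cause = chain[-1]

# Print from root cause to problem, the direction the causal chain is read.
print(" -> ".join(reversed(chain)))
```

Each lookup is one “why”, and the last entry the map can explain is treated as the root cause. In practice the map comes from investigation, not from code; the sketch only shows how the links fit together.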
If you have multiple causal factors and pathways leading from the root to the final problem, visualize the multiple causes with an Ishikawa diagram or fishbone diagram.
Named after the shape of a fishbone, the Ishikawa diagram represents multiple causes and their effect on the final problem as separate pathways:
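The same structure can also be written down as plain data before anyone draws it. The Python sketch below is one possible representation, assuming a fishbone is simply a set of categories (the “bones”) that each hold candidate causes. The categories, the causes, and the `pathways` helper are hypothetical.

```python
# Hypothetical fishbone for an overheating engine: category -> causes.
# Categories and causes are illustrative assumptions.
FISHBONE = {
    "Cooling": ["damaged radiator pump", "low coolant level"],
    "Lubrication": ["old engine oil"],
    "Operation": ["sustained heavy load"],
}


def pathways(problem, fishbone):
    """List every 'category / cause -> problem' pathway on the diagram."""
    return [
        f"{category} / {cause} -> {problem}"
        for category, causes in fishbone.items()
        for cause in causes
    ]


paths = pathways("engine overheating", FISHBONE)
for line in paths:
    print(line)
```

Every printed line corresponds to one small bone joining the spine of the diagram, which makes it easy to check that no suspected cause was left off.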
When you are dealing with multiple causes, it might be unclear which one is the root cause and even how to prioritize the different causes for problem resolution.
An RCA tool that comes in handy for this task is Pareto analysis. It estimates the contribution of each cause to the final effect and ranks highest the cause whose resolution would remove the largest share of the problem.
For example, a Pareto analysis of engine overheating might show that the damaged radiator pump was the highest contributing cause.
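The arithmetic behind a Pareto ranking is short enough to sketch in Python. The incident counts below are made up for illustration, and `pareto` is a hypothetical helper; the sketch assumes each incident can be attributed to exactly one cause.

```python
# Hypothetical counts of overheating incidents attributed to each cause.
incidents = {
    "damaged radiator pump": 42,
    "low coolant level": 18,
    "blocked air intake": 9,
    "faulty thermostat": 6,
}


def pareto(counts):
    """Return (cause, share, cumulative share) rows, largest cause first."""
    total = sum(counts.values())
    rows, running = [], 0
    for cause, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        rows.append((cause, n / total, running / total))
    return rows


rows = pareto(incidents)
for cause, share, cumulative in rows:
    print(f"{cause:<24}{share:>6.0%}{cumulative:>8.0%}")
```

Sorting by share puts the biggest contributor first, and the cumulative column shows how much of the problem would be covered by fixing the top few causes.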
Remove the root cause of your problem and make sure the cause does not repeat.
This step is crucial not just for solving the problem, but also for preventing it from recurring in the future.
To err is human, to err twice is to be an engineer. Our best efforts often fail due to unexamined assumptions or changing external circumstances. This is why it is important to revisit important problems and their root causes to make sure they have not reappeared (under different disguises).
There are other root cause analysis techniques, such as fault tree analysis, failure mode and effects analysis (FMEA), and barrier analysis.
We do not cover these techniques here, but feel free to research them further on your own.
Root Cause Analysis also has drawbacks:
Even though Root Cause Analysis is used throughout engineering, data engineers tap into this problem-solving technique more often than others.
Data engineering is, by its nature, an intricate, continuously changing system of co-dependent data pipelines. And the more data pipelines there are, the more often they break.
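Because pipelines depend on each other, one broken pipeline often drags its downstream consumers with it. The Python sketch below shows one way to narrow the search, assuming you know which pipelines read from which and which ones failed. The pipeline names, the `UPSTREAM` map, and the `root_failures` helper are hypothetical.

```python
# Hypothetical pipeline dependencies: pipeline -> pipelines it reads from.
UPSTREAM = {
    "raw_orders": [],
    "clean_orders": ["raw_orders"],
    "daily_revenue": ["clean_orders"],
    "revenue_dashboard": ["daily_revenue"],
    "customer_export": ["clean_orders"],
}


def root_failures(failed, upstream):
    """Return failed pipelines none of whose direct upstreams failed."""
    failed = set(failed)
    return sorted(
        p for p in failed
        if not any(dep in failed for dep in upstream.get(p, []))
    )


failed_now = [
    "revenue_dashboard",
    "daily_revenue",
    "clean_orders",
    "customer_export",
]
print(root_failures(failed_now, UPSTREAM))
```

A failed pipeline whose inputs are all healthy is a sensible place to start asking “why”. It is a starting point for the analysis rather than the root cause itself, since the real cause may lie in the pipeline’s code, configuration, or source data.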
Start your journey towards becoming a (better) data engineer with Keboola’s Data Engineer Certificate, learn how to put your RCA skills to the test, and develop a competitive engineering skillset that will help you stand out.