What is FMEA (Failure Modes and Effects Analysis)?

FMEA stands for Failure Modes and Effects Analysis. In an FMEA, a “failure” is a loss of functionality, while a “failure mode” is the way in which that failure happens. Its versatility makes it one of the most common tools for root cause analysis.

FMEA comes in two main variants: DFMEA (Design Failure Mode and Effects Analysis) and PFMEA (Process Failure Mode and Effects Analysis). These are used to analyse potential failures in product design and in the processes of a business unit, respectively.

Difference between failure and failure mode

Here’s a simple example to understand the difference between a failure and a failure mode — if an automatic payment terminal stops printing paper receipts (failure), it might be because the paper roll is not set properly (failure mode #1) or because its compartment is not closed (failure mode #2).

Failure modes are our way to the root cause. However, FMEA goes even further. After identifying failure modes (and, by extension, potential root causes), it evaluates the effect and impact of each failure.

To understand what this means, we’ll use a more complex example. Imagine a failure that happens because a fan was vibrating excessively (failure mode). What happens in the aftermath? The piece of equipment stops momentarily (failure effect), which leads to production losses.

Different failure modes may lead to different effects, with various outcomes for the company. That’s why FMEA is often combined with a criticality analysis. When both are performed at the same time, it’s known as FMECA (Failure Modes, Effects and Criticality Analysis). 

What are the main purposes of FMEA?

And now, a little bit of history. The United States Army developed FMEA in the 1950s. Shortly after, the aviation industry and NASA adopted it, and NASA went on to use it for the Apollo missions, the Viking program, and the interstellar Voyager missions. FMEA is also prevalent in the automotive and oil industries.

As mentioned, FMEA is suitable for several contexts. It can analyse functions, processes (PFMEA, focused on manufacturing and assembly processes) or product design (DFMEA). It’s advisable to carry out an FMEA every time a new product is released, whenever there are significant changes (new manufacturing processes or regulations, for example), and when customer feedback points to a recurring problem.

What are the advantages and benefits of FMEA?

The main goal of an FMEA is to improve the asset’s output, reliability and safety. However, there are many other benefits of going through this process, such as:

  • Developing a work method that will, in all likelihood, be successful, safe and reliable.
  • Assessing failure modes and their impact, so that you can prioritise them according to criticality and likelihood (especially if you carry out an FMECA). Setting priorities improves your maintenance plan.
  • Identifying failure points and verifying that system integrity and safety are not compromised, even if that means introducing new safety measures.
  • Trying out changes and adjustments in product design (for example, testing if the probability of failure has decreased).
  • Faster troubleshooting, given that failure modes and their respective causes are all well-described.
  • Defining, based on the failure modes, the criteria for tests and inspections that should be included in the preventive maintenance plan.


What are the disadvantages and limitations of FMEAs?

On the other hand, FMEAs also have some weaknesses: 

  • FMEA is not suitable for systems in which several failures may occur at once, since it doesn’t take into account the correlation between them.
  • As we’ll see below, severity, occurrence and detection carry equal weight in the risk assessment, which is a simplified approach.
  • FMEAs rely on your team’s expertise to list failure modes, which makes the process time-consuming and dependent on the input of many professionals.
  • An FMEA needs constant updating, because your knowledge about an asset increases with experience, time and use; you’ll likely discover new failure modes your team hadn’t considered from the get-go.
  • In case you fail to consider a failure mode, you’ll underestimate the risk associated with the asset. On the other hand, going into painstaking detail will take your attention away from critical problems and is a waste of resources. 
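
To make the equal-weighting limitation concrete, here is a minimal Python sketch. It assumes the risk score is the plain product of the three ratings (the calculation described in step 7); the function name `rpn` and the two sets of ratings are hypothetical, chosen only for illustration.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk score as the plain product of the three 1-10 ratings."""
    return severity * occurrence * detection


# Hypothetical ratings, chosen only to illustrate the point.
rare_but_critical = rpn(severity=9, occurrence=2, detection=4)
frequent_but_minor = rpn(severity=2, occurrence=9, detection=4)

# Both failure modes get the same score, even though one of them
# is far more severe than the other.
print(rare_but_critical, frequent_but_minor)  # 72 72
```

Because each rating contributes equally to the product, the score alone cannot tell these two cases apart, which is why the ratings themselves are worth reading alongside the total.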

How to perform a Failure Modes and Effects Analysis

The biggest challenge in a Failure Modes and Effects Analysis is being thorough regarding failure modes, their causes, and their impacts. Usually, you’ll organise the information in a table divided into 7 columns, one for each of the steps below.
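
If you prefer to keep that table in code rather than a spreadsheet, one possible shape for a single row is sketched below in Python. The class name `FmeaRow`, the field names, the effect text and the ratings are all assumptions made for illustration; only the seven columns themselves come from the steps in this article. The seventh column is derived from three of the others, so it is modelled as a computed property.

```python
from dataclasses import dataclass


@dataclass
class FmeaRow:
    """One row of an FMEA table (field names are illustrative)."""
    failure_mode: str       # step 1
    failure_effect: str     # step 2
    severity: int           # step 3, rated 1-10
    root_causes: list[str]  # step 4
    occurrence: int         # step 5, rated 1-10
    detection: int          # step 6, rated 1-10

    @property
    def rpn(self) -> int:
        # Step 7: derived from the three ratings above.
        return self.severity * self.occurrence * self.detection


# Hypothetical row based on the payment terminal example;
# the effect and the ratings are placeholders, not real data.
row = FmeaRow(
    failure_mode="Paper roll not set properly",
    failure_effect="Customers leave without a receipt",
    severity=3,
    root_causes=["Roll replaced in a hurry"],
    occurrence=4,
    detection=2,
)
print(row.rpn)  # 24
```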

1. Define failure modes

The first step of an FMEA is defining all possible failure modes for each component. To achieve this, consider previous experiences with similar assets.

FMEA and FMECA are regularly used in high-risk industries, in which safety is the top priority. Yet, for the purposes of this article, we’ll use a much more mundane example: a poorly prepared dish at a restaurant. We’re dealing with an unfortunately common failure mode, finding a hair on the plate. Our inner Michelin inspector can also think of at least 3 other failure modes – finding an insect in the food, not enough salt, and food poisoning. 

Obviously, a real restaurateur would think of many more. It’s the greatest challenge in FMEA: if you forget failure modes, you are already compromising risk assessment. 

2. Failure effect  

The second step of an FMEA is to describe the failure’s effect clearly, because the effect determines its severity. Be as specific as possible, since you’ll rely on this description to assign the severity rating in step 3.

What’s the effect of our failure mode? Well, in the short term, sending it back to the kitchen. In the long term, not coming back. 

3. Severity

The severity rating is a scale from 1 to 10, according to the impact of each failure:

  • 1 – minor risk: faults are almost imperceptible
  • 2-3 – low risk: failures are perceptible, but cause only minor annoyance
  • 4-6 – moderate risk: the consequences of failures are noticeable (even for customers) and affect asset performance
  • 7-8 – high risk: the operation is totally compromised, which disrupts schedules
  • 9-10 – very high/critical risk: the asset is completely compromised and there are high safety risks
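
The bands above translate directly into a small lookup. The Python sketch below is one way to write it; the function name `severity_band` and the decision to reject ratings outside 1 to 10 are our own assumptions, not part of any FMEA standard.

```python
def severity_band(rating: int) -> str:
    """Map a 1-10 severity rating to the risk band listed above."""
    if not 1 <= rating <= 10:
        raise ValueError("severity rating must be between 1 and 10")
    if rating == 1:
        return "minor"
    if rating <= 3:
        return "low"
    if rating <= 6:
        return "moderate"
    if rating <= 8:
        return "high"
    return "very high/critical"


print(severity_band(9))  # very high/critical
```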

We don’t know about you, but for us, the asset “dish with human hair” is completely compromised, with great safety and hygiene risks. We give it a 9 (we’re saving that 10 for salmonella).

4. Potential root cause

The same failure mode can have several root causes. For example, if a lift stops between floors, it might be a wrong configuration or an electrical problem. If you list all the potential root causes beforehand, it’s easier to test, troubleshoot, and correct failures when they happen. 

In our example, the root cause is evident: the kitchen personnel are not wearing hair caps. If we had found an insect in the salad, for example, there would be more potential culprits: improper storage; failure to wash properly; infestations, and so forth.

5. Occurrence 

Occurrence represents the expected frequency of the failure, again based on asset history or that of a similar piece of equipment. Usually, occurrence is a rating from 1 to 10, in which 1 is “extremely unlikely” and 10 means “extremely likely” or “inevitable”. 

Based on our personal experiences, we estimate the frequency of our failure mode to be a 2.

6. Detection 

In this column, you should propose measures to detect potential failures. The detection rating reflects how likely you are to detect the failure, on a scale from 1 to 10, in which 1 stands for “extremely likely to be detected” and 10 for “extremely unlikely to be detected”.

In our case, the proposed measure to detect this specific failure mode would be to visually inspect the dish before serving it to customers. But visual inspections are somewhat unreliable, which is why we only noticed the hair when it was too late. So, not wanting to delve further into these bad experiences before lunch, we estimate a detection rating of 4, “somewhat likely”.

7. Calculate Risk Priority Number (RPN)

Finally, we can calculate the risk priority number. The RPN equals severity (rated in step 3) × occurrence (step 5) × detection (step 6). The higher the RPN, the greater the need to look for improvements.

The RPN of our dish, according to the scores we gave it, would be 9 × 2 × 4 = 72. If we had given a 10 to that salmonella, with the same frequency and likelihood of detection, its RPN would be 10 × 2 × 4 = 80, which means it would have greater priority. That sounds about right!
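
With more than a couple of failure modes, it helps to let a script do the arithmetic and the sorting. The short Python sketch below uses the two restaurant examples and the ratings given in this article; the tuple layout and the names `failure_modes`, `rpn` and `ranked` are just one possible arrangement.

```python
# (failure mode, severity, occurrence, detection), using the article's ratings
failure_modes = [
    ("Hair on the plate", 9, 2, 4),
    ("Salmonella", 10, 2, 4),
]


def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: severity x occurrence x detection."""
    return severity * occurrence * detection


# Highest RPN first, so the top of the list is the top priority.
ranked = sorted(failure_modes, key=lambda fm: rpn(*fm[1:]), reverse=True)

for name, severity, occurrence, detection in ranked:
    print(f"{name}: RPN = {rpn(severity, occurrence, detection)}")
# Salmonella: RPN = 80
# Hair on the plate: RPN = 72
```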
