Failure Has a Blueprint Too | Six Sigma Pop Culture Series

Every good heist film has that scene. The team stands around a table covered in maps, blueprints, photographs, security schedules, access codes, building layouts, and at least one person who clearly should not be trusted with explosives. Someone points at the vault. Another explains the guard rotation. The hacker looks faintly irritated because nobody appreciates how complicated the camera loop is. The getaway driver asks a question that sounds silly until everyone realises it could save the whole operation.

Then the leader says something dramatic like, “We have one shot at this.” That is when the real work begins.

Not the running. Not the rope descending from the ceiling. Not the glamorous moment where someone in a black suit calmly disables a laser grid while holding their breath. The real work begins while the team still has time to think. They walk through the plan. They test the assumptions. They ask where the timing could slip, where the alarm might trigger, where the wrong person could be blocked, where the escape route could close, and where one missing detail could turn a clever operation into a very expensive disaster.

That is the spirit of FMEA.

Failure Mode and Effects Analysis is not pessimism with a spreadsheet. It is operational imagination with discipline. It is the process improvement equivalent of walking through the heist before the vault door opens, studying every weak point, every dependency, every handoff, every control, and every “surely that will not happen” moment that absolutely will happen the minute the customer is watching.

The uncomfortable truth is this: failure also has a blueprint.

Processes rarely collapse out of nowhere. They fail at predictable seams. A handoff with no owner. A queue without an ageing trigger. A field that depends on manual entry. A system rule nobody fully understands. A policy exception stored in one person’s head. A fraud rule that catches genuine customers in the same net as bad actors. A customer promise that relies on three teams, two approvals, one missing data point, and the spiritual cooperation of the weather.

By the time the failure reaches the customer, it often looks sudden. In reality, the crack was already in the plan.

FMEA helps us find those cracks earlier.

Before the Alarm Goes Off

FMEA stands for Failure Mode and Effects Analysis. It is a structured way to examine a process, product, service, change, control, or rollout before the damage becomes visible. The team asks where the work could break, what the effect would be, how serious the impact might become, how often the risk could occur, whether anyone would detect the issue in time, and what action should be taken to reduce the risk.

That sounds formal because sometimes it is. A full FMEA can involve a detailed process map, scoring tables, cross-functional review, action owners, and risk-priority discussions. But the thinking behind it is beautifully practical.

What could go wrong? What would happen if it did? How would we know before the customer had to tell us?

That last question matters. Too often, customers become the organisation’s detection method. They are the ones who notice the missing update, the broken link, the wrong refund, the confusing instruction, the blocked account, the cancelled order, the duplicate request, or the handoff that quietly disappeared into the operational underworld.

By the time the customer tells us, the failure has already escaped the vault.

FMEA moves detection earlier. Better still, it helps the team prevent the risk where possible, so the alarm does not need to ring at all.

The Heist Plan Is the Process

A process is a heist plan with less jazz music and more email. There is an intended outcome. There are steps. There are people involved. There are tools, systems, approvals, timings, dependencies, triggers, checks, and handoffs. Somewhere inside that chain, a promise is being made, even when nobody has written it down clearly. Somewhere at the end of the journey, a customer is hoping the whole thing works without needing to understand the machinery behind it.

In a heist film, nobody assumes the vault opens because the blueprint looks tidy. The team studies the route, the guards, the cameras, the alarms, the locks, the timing, the exit path, and the backup plan. One missed detail can change everything.

In operations, we sometimes behave as if the flowchart is proof that the process works. The flowchart is only the official story.

FMEA interrogates the actual plan. It looks at each step and asks how that step could misfire. The tool does not wait for the defect to become visible. It inspects the conditions that make the defect possible.

That matters even more when the organisation is not only running a process, but introducing a new control. A fraud prevention initiative, for example, is usually built with good intent. The business wants to protect customers, revenue, trust, and the integrity of the platform. Nobody wants fraud strolling through the vault door with a grin and a fake moustache. Controls matter. Detection matters. Risk matters.

The harder truth is that every control also creates an experience for genuine customers.

When the net is cast too wide, the alarm does not only catch bad actors. It catches ordinary people trying to complete ordinary journeys. A legitimate customer suddenly has a payment blocked, an account locked, a refund delayed, an order cancelled, or a request sent into manual review with very little explanation. From the organisation’s side, this may look like protection. From the customer’s side, it can feel like accusation without trial.

That is where FMEA becomes a trust exercise.

The Alarm Can Catch the Wrong Person

Fraud prevention is a powerful example because it exposes one of the most overlooked uses of FMEA: testing whether the solution itself could create new harm.

A fraud initiative may reduce fraudulent transactions. On paper, that looks like success. The dashboard may show fewer losses, fewer risky approvals, and stronger control. The vault looks safer.

Yet another story may be unfolding in the customer journey. Genuine customers may be getting blocked. Orders may be delayed. Accounts may be locked without clear explanation. Frontline teams may be unable to explain the decision because the rule is hidden inside a model, a policy, or a vendor tool. Customers may contact support repeatedly, not because they are difficult, but because nobody can tell them what happened or how to fix it.

This is the point where FMEA earns its keep.

The team should not only ask whether the fraud control will detect suspicious activity. It should also ask how the control could fail legitimate customers. What happens when a genuine customer is incorrectly flagged? How quickly can the issue be reviewed? Does the customer know what to do next? Can the frontline explain the decision without sounding like a locked filing cabinet? Is there an escalation path? Are false positives tracked as seriously as fraud prevented? Does the control create more contacts, complaints, abandonment, or distrust than the business is prepared to acknowledge?

The failure mode is not only “fraud gets through”. Another possible failure mode is “a genuine customer is treated like a bad actor, with no clear route back to trust”.

That is the aha hiding in the vault.

FMEA helps the organisation examine both sides of protection. It asks how the control prevents harm, and how the control itself might create harm if it is too blunt, too opaque, too slow, or too difficult to challenge.

A fraud rule can be technically successful and still damage the customer relationship. The better question is not only whether fraud decreased. The sharper question is whether the organisation reduced fraud while protecting legitimate customers from unnecessary suspicion, delay, and distress.

Sometimes the alarm works. It just rings for the wrong person.

Failure Modes: The Ways the Plan Can Go Sideways

In FMEA language, a failure mode is the way something could fail. This is where the team names the possible breakdowns with enough precision for the process to do something useful about them.

A vague failure mode creates vague action. “Communication failure” sounds serious, but it does not tell anyone where to look. “Customer is not informed when a fraud hold is applied to their order” gives the team a real weak point to inspect. “System issue” is fog in a suit. “Fraud model flags legitimate high-value orders without manual review before cancellation” gives the room something solid to test.

In a fraud prevention rollout, possible failure modes might include a genuine customer being incorrectly flagged as suspicious, a payment being blocked without clear explanation, an account being locked with no visible appeal route, an order being cancelled before manual review, a frontline associate having no authority to escalate, or a model flagging behaviour that is normal for certain customer segments.

Those are not minor wording choices. Specificity changes the quality of the conversation.

When the failure mode is clearly named, the team can examine causes, effects, current controls, detection gaps, and possible actions. When the wording stays vague, the risk floats around the room like a ghost with a lanyard.

This is also why FMEA belongs close to the work. The people who operate the process often know the failure modes before anyone else does. They know which step always needs chasing, which field causes confusion, which queue looks fine until it ages, which rule sounds sensible until a real customer gets trapped inside it, and which workaround has quietly become the operating model.

They know where the camera loop fails, even when the official heist plan still says “disable security system” as if that explains anything.

Effects: When Protection Becomes Friction

A failure mode tells us how something could go wrong. The effect tells us what happens because of it. This is where customer impact becomes visible.

If a genuine customer is incorrectly flagged as suspicious, the effect may be delay, embarrassment, repeat contact, order cancellation, loss of trust, or reputational damage. If an account is locked without clear explanation, the customer may feel accused rather than protected. When the frontline cannot explain the decision, the interaction becomes even more painful because the associate is forced to defend a rule they cannot see. If the appeal path is slow, the customer’s ordinary journey turns into a trial where nobody has explained the charge.

The effect matters because not all risks carry the same weight. Some create mild irritation. Others damage trust. A few create legal, compliance, fairness, accessibility, or reputational concerns. Some affect one customer. Others quietly scale across thousands.

This is where FMEA becomes a way to protect the customer promise. A control may look small inside the business, but feel enormous to the customer caught inside it.

A generic fraud message may look like operational caution. To the customer, it may feel like being treated as dishonest.

A delayed manual review may look like prudent risk management. To the customer, it may feel like abandonment.

A locked account may look like protection. To the person trying to buy groceries, retrieve funds, book travel, or access a service, it may feel like the company has taken control without explanation.

FMEA asks the team to consider the effect from the outside in. The organisation may only have applied a control. The customer may have lost confidence in the relationship.

Severity, Occurrence and Detection: The Three Questions in the Vault Room

Traditional FMEA usually scores three things: severity, occurrence, and detection.

Severity asks how serious the effect would be if the failure happened. Occurrence asks how likely the failure is to happen. Detection asks how likely the organisation is to catch the problem before it reaches the customer or causes harm.

These three questions stop the team from treating every risk as equal.

Some risks are severe but rare. Others are common but less damaging. The most dangerous ones are often severe enough to matter, likely enough to occur, and poorly detected. Those are the risks waiting in the shadows with a clipboard and a suspiciously calm expression.

In heist terms, severity is what happens if the alarm rings. Occurrence is the chance of someone triggering it. Detection is whether the team notices the problem before the whole building locks down and the getaway driver starts reconsidering his career path.

In a fraud prevention rollout, severity asks how serious it would be if a genuine customer were incorrectly treated as suspicious. Occurrence asks how often false positives may happen. Detection asks whether the organisation would know the control is harming legitimate customers before those customers complain, abandon, escalate, or post publicly.

Detection is often where processes reveal their soft underbelly. Many organisations can detect work that happened. A rule triggered. A case closed. A message sent. A transaction declined. A review completed. Those events are easy to count.

The harder question is whether the right thing happened well enough. Was the fraud hold accurate? Was the customer informed clearly? Was there a fast path to review? Could the associate help? Did the customer understand the next step? Were false positives tracked? Did the business detect customer harm before the customer had to become the detective? Because the customer should not be the detection method.

RPN: The Score That Should Start a Conversation, Not End It

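Many FMEA formats use a Risk Priority Number, or RPN. Traditionally, this is calculated by multiplying the severity, occurrence, and detection scores, each usually rated from 1 to 10. Detection is scored in reverse: the harder a failure is to catch, the higher the score. A higher RPN usually indicates a risk that may deserve more attention.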
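To make the arithmetic concrete, here is a minimal Python sketch of the multiplication. It assumes the common 1 to 10 scale for each score, with detection scored so that a higher number means the failure is harder to catch. The `FailureMode` class and the three sets of scores are hypothetical, invented purely for illustration rather than taken from any real rollout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    """One row of a simplified FMEA worksheet (hypothetical structure)."""
    description: str
    severity: int    # 1 = negligible effect, 10 = severe effect
    occurrence: int  # 1 = rare, 10 = frequent
    detection: int   # 1 = almost certain to be caught, 10 = unlikely to be caught

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for name in ("severity", "occurrence", "detection"):
            score = getattr(self, name)
            if not 1 <= score <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {score}")
        return self.severity * self.occurrence * self.detection


# Illustrative scores only; a real team would agree these in the room.
modes = [
    FailureMode("Fraud hold applied with no customer message", 6, 6, 4),
    FailureMode("Genuine customer incorrectly flagged as suspicious", 7, 5, 6),
    FailureMode("Order cancelled before manual review", 8, 3, 7),
]

# Highest RPN first, as a starting order for the conversation.
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)

for mode in ranked:
    print(f"{mode.rpn():>4}  {mode.description}")
```

With these made-up scores, the incorrectly flagged customer ranks first at 210, ahead of the cancelled order at 168 and the missing message at 144. The ranking is only as good as the scores behind it.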

That can be useful. It helps teams compare risks, prioritise action, and decide where to focus first. Still, the score is not the treasure. The conversation is.

When a team spends more time arguing whether a risk is a seven or an eight than discussing how to reduce the actual harm, the tool has wandered into spreadsheet theatre wearing a little bow tie.

RPN should sharpen judgement rather than replace it. A high score may indicate urgent attention, while a lower score may still matter if the issue affects a vulnerable customer, a regulatory obligation, a trust-sensitive moment, or a high-impact customer promise. Numbers can guide prioritisation, but they should not be used to hide from context.
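One way to keep context in the picture is to let certain conditions pull a risk into review whatever its score. The Python sketch below is an assumption about how a team might encode that, not a standard FMEA rule: the function name, the flag names, and the threshold of 150 are all hypothetical.

```python
# Hypothetical context flags that warrant a discussion regardless of score.
ALWAYS_REVIEW = {
    "vulnerable_customer",
    "regulatory_obligation",
    "trust_sensitive_moment",
    "high_impact_promise",
}


def needs_review(rpn: int, context_flags: set[str], rpn_threshold: int = 150) -> bool:
    """Return True when a risk should be discussed, by score or by context.

    rpn_threshold is an illustrative cut-off, not an industry standard.
    """
    flagged_by_context = bool(context_flags & ALWAYS_REVIEW)
    return rpn >= rpn_threshold or flagged_by_context


print(needs_review(210, set()))                   # high score, no flags
print(needs_review(60, {"vulnerable_customer"}))  # low score, sensitive context
print(needs_review(60, set()))                    # low score, no flags
```

The first two calls return `True` and the third returns `False`. The low-scoring risk that touches a vulnerable customer still lands on the agenda, which is the whole point.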

FMEA works best when structured scoring meets practical wisdom. The frontline view matters. The process owner view matters. Customer impact matters. Business risk matters. Detection reality matters.

A number can tell you where to look. It cannot do the looking for you.

Controls: Gadgets, Guardrails and Getaway Routes

Once the team understands the failure modes, effects, severity, likelihood, and detection gaps, the next question becomes practical: what safeguards already exist, and what needs to be strengthened?

Controls are the protections built into the workflow to prevent, detect, or reduce risk.

In a fraud prevention context, a preventive control might be a better rule threshold, clearer eligibility logic, stronger identity verification, or segment testing before rollout. A detection control might be false-positive monitoring, sample audits, ageing alerts for manual reviews, exception reporting, or tracking repeat contacts linked to fraud holds. A mitigation control might be a fast appeal route, a dedicated escalation path, clear customer messaging, frontline guidance, or manual review for high-impact cases before cancellation or account lock.
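Two of those detection controls, false-positive monitoring and ageing alerts for manual reviews, are simple enough to sketch in Python. This is a rough illustration under stated assumptions: the outcome labels, case identifiers, dates, and the 24-hour review window are all hypothetical, and a real implementation would read from the organisation's own case data.

```python
from datetime import datetime, timedelta


def false_positive_rate(reviewed_outcomes: list[str]) -> float:
    """Share of reviewed fraud holds that turned out to be genuine customers."""
    if not reviewed_outcomes:
        return 0.0
    genuine = sum(1 for outcome in reviewed_outcomes if outcome == "genuine")
    return genuine / len(reviewed_outcomes)


def aged_reviews(
    opened_at: dict[str, datetime], now: datetime, max_age: timedelta
) -> list[str]:
    """Case IDs of manual reviews that have been open longer than max_age."""
    return sorted(
        case_id for case_id, opened in opened_at.items() if now - opened > max_age
    )


# Illustrative data only.
outcomes = ["fraud", "genuine", "genuine", "fraud", "fraud"]
open_cases = {
    "A": datetime(2024, 1, 10, 9, 0),   # open for 3 hours
    "B": datetime(2024, 1, 8, 12, 0),   # open for 48 hours
    "C": datetime(2024, 1, 9, 11, 0),   # open for 25 hours
}
now = datetime(2024, 1, 10, 12, 0)

print(f"False-positive rate: {false_positive_rate(outcomes):.0%}")
print("Reviews past the window:", aged_reviews(open_cases, now, timedelta(hours=24)))
```

On this sample, two of the five reviewed holds were genuine customers, a 40% false-positive rate, and cases B and C have aged past the 24-hour window. Neither number fixes anything by itself, but both give the team a signal before the customer has to supply one.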

In heist language, these are the gadgets, guardrails, backup routes, and contingency plans. Nobody serious walks into the vault with confidence alone and a motivational quote tucked into their sock. They bring the tools.

The same seriousness belongs in process work.

If a legitimate customer can be blocked without explanation, create clear messaging that tells them what happened, what can be shared, and what they should do next. If a fraud rule may create false positives, monitor those false positives as a core measure, not as an inconvenient footnote. If the frontline cannot explain a decision, give them enough context to preserve trust without compromising security. If high-value or high-impact cases carry customer harm, route them for review before the harshest action lands.

A control should reduce risk in a way the process can sustain. It should protect the business without turning genuine customers into collateral damage. That balance is where good control design grows up.

The Pre-Mortem Before the Process Goes Boom

FMEA is closely related to the idea of a pre-mortem. Instead of waiting until the failure happens and then asking what went wrong, the team imagines that the process has already failed and works backwards.

The fraud rule went live, and complaints doubled. Genuine customers were blocked. The frontline could not explain why. Manual reviews aged beyond the promised window. The business reduced fraud losses, but repeat contacts increased. Social media began filling in the story the organisation failed to tell. Now ask: how did that happen?

This is where FMEA becomes incredibly useful before launches, policy changes, system updates, automation rollouts, new routing designs, and process redesigns. When the team waits until after go-live to discover the failure modes, the customer becomes the test environment. That is rarely a good strategy, unless the goal is to generate escalations with confetti.

A good pre-mortem creates psychological permission to be sceptical early. People can say, “This is where I think it will break,” while there is still time to adjust the design. That matters because many organisations punish early warnings until those same warnings become expensive enough to require a task force.

FMEA gives the warning a format. It turns nervousness into evidence. It turns “I have a bad feeling about this fraud hold” into “This control has a meaningful false-positive risk, limited detection, and a customer impact we should reduce before rollout.”

That is much harder to dismiss.

Why FMEA Gets Left Until After the Vault Is Already Open

FMEA is often treated as something process experts bring out after the fact, once a defect has already escaped and someone needs a formal analysis. That can still be useful, but it misses the best part of the tool.

The best time to use FMEA is while there is still time to change the plan. So why does it get left so late?

Urgency rewards movement. Teams are under pressure to launch, fix, stabilise, reduce, improve, automate, or simplify. Taking time to imagine failure can feel like slowing the mission down.

Failure language also makes people uncomfortable. Nobody wants to be the person in the room saying the plan could break, especially when the plan has sponsors, timelines, and a slide deck with confident colours.

There is another reason: the people who know the weak points are not always invited early enough. Frontline teams, quality reviewers, escalation owners, analysts, fraud investigators, risk specialists, and SMEs often see the operational cracks clearly. By the time they are consulted, the vault door is already open and everyone is pretending the alarm is part of the soundtrack.

FMEA works best when it is used early and with the right people in the room. Not only the project team. Not only the process expert. Bring the people who live with the consequences of failure. They know where the plan is fragile.

Small FMEA, Big FMEA, Same Discipline

FMEA does not always need to become a large formal ceremony with a solemn spreadsheet and a choir of risk ratings. Like many useful improvement tools, it can scale.

For a major product launch, system migration, new policy, fraud initiative, or high-risk process, a full FMEA may be appropriate. Bring the team together. Map the steps. Identify failure modes. Score severity, occurrence, and detection. Review controls. Assign actions. Track risk reduction.

For a smaller problem, the same thinking can become a quick routine. What could fail here? What would the customer experience if it failed? How likely is that risk? Would we know before the customer does? What control would reduce the risk?

That five-minute version can be enough to prevent the most obvious cracks. A team lead can use it before changing a handoff. An associate can use it when spotting a recurring escalation pattern. A quality analyst can use it when reviewing a defect category. A manager can use it before agreeing to a “quick workaround” with the structural integrity of wet cardboard.

The power of FMEA is not only in the template. It is in the habit of thinking ahead.

AI Can Run the Simulation, But Humans Still Choose the Risk

AI can make FMEA more powerful. It can scan historical defects, customer complaints, escalation logs, fraud review outcomes, process notes, and transcript patterns to suggest likely failure modes. It can cluster recurring issues, compare similar rollouts, flag known risk patterns, and identify where detection appears weak. It can help draft a first-pass FMEA faster than a tired team staring at a blank spreadsheet at 16:47.

It can also help ask better questions. Where have similar controls failed before? Which customer segments generate false positives? Which handoffs create repeat contact? Which defect types carry high customer impact but low detection? Which steps depend on manual judgement? Which controls exist on paper but show no evidence of working?

Even so, AI should not become the risk owner. A model can suggest failure modes, but it cannot decide what level of customer harm is acceptable. It cannot weigh trust, fairness, vulnerability, or brand promise the way humans must. It cannot understand every nuance of compliance and customer dignity unless people bring that judgement into the room.

AI can run the simulation. Humans must still decide which risks deserve action, which controls are ethical, which trade-offs are acceptable, and which failure modes are too important to leave to probability.

The best use of AI in FMEA is to make the team’s thinking better informed, less dependent on memory, and less likely to miss the quiet patterns hiding in the evidence.

The Customer Should Not Be the Detection Method

FMEA is not about expecting everything to go wrong. It is about respecting the fact that some things will. That is not cynicism. That is maturity.

A mature organisation does not design a process, launch a control, or release a fraud rule and hope the weak points behave themselves. It studies the plan. It listens to the people closest to the work. It looks for the failure modes before the customer finds them. It builds controls where the risk is real. It understands that optimism is not a detection strategy.

In the heist film, the team does not wait until they are inside the vault to wonder whether the alarm works. They ask earlier. They test the route. They challenge the plan. They decide what happens if the lift stops, if the code fails, if the guard arrives early, if the exit route closes, or if the alarm catches the wrong person.

Processes deserve the same respect.

If CTQ (Critical to Quality) tells us what must be true for quality to exist, FMEA asks what could stop that truth from happening. It takes the customer requirement and protects it from foreseeable risk. It also asks whether the controls we create to protect the business might accidentally damage the very trust we were trying to defend.

That is why FMEA belongs near the beginning of the work, while there is still time to change the blueprint. The customer can tell us when the process has failed. They should never be our first warning that it could.

This is a personal thought piece, written from my own customer experience and process improvement perspective. It draws on publicly available information and reflects my own views.