
Failure Is Inevitable, Outages Do Not Have to Be

March 8, 2026

A look at fault domains, failure domains, and how thoughtful network design limits the blast radius of failures.

Fault Domains · Failure Domains · Network Design

If you spend enough time around network engineers, you'll hear the terms fault domain and failure domain used interchangeably. The confusion is understandable — they're closely related. But they describe two different things.

A fault domain is where a problem starts. A failure domain is how much of the network gets dragged down with it.

The way I keep them straight is pretty simple: when there's an outage, it may not always be my fault — but it's still my failure.

Faults are unavoidable. Hardware wears out. Someone unplugs the wrong cable. Configurations drift. Software does something nobody expected once it hits production. That's just what operating real networks looks like. Good design doesn't try to eliminate every possible fault. It focuses on controlling how far the damage spreads.

Murphy's Law is worth keeping in mind here. If something can go wrong, eventually it will. Network architecture is less about preventing every issue and more about making sure that when one shows up, it doesn't take more of the environment down than it has to.

What Is a Fault Domain?

A fault domain is any component or condition that can introduce a failure into the network. Sometimes that's physical — a router dies, a switch locks up, a WAN circuit drops. Sometimes it's logical — a routing process crashes, a config change introduces an issue, or a software process starts behaving in ways nobody planned for.

Every device, link, and protocol carries some level of risk. A single switch might represent one fault domain, but that switch also lives in a rack, in a building, inside a larger campus architecture. These layers stack on top of each other.

That's why individual fault domains often sit inside larger failure domains. The fault domain is where it starts. The failure domain is how far it travels.

What Is a Failure Domain?

A failure domain is the scope of impact when something goes wrong.

If a WAN circuit fails and an entire branch loses connectivity, the circuit was the fault domain. The branch office became the failure domain. How large that failure domain gets depends almost entirely on the design surrounding it.

In some environments, a failure touches a single device or subnet. In others, the same type of fault takes down a full floor, building, or site. Most of the time, the difference isn't the fault itself. It's the architecture around it.
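The fault-starts-here, failure-spreads-there relationship can be sketched as a dependency walk. This is a minimal illustration with a hypothetical branch-and-HQ topology, not a model of any real tool: each device lists the one thing it depends on to stay connected, and the failure domain of a fault is everything downstream of it.

```python
# Sketch: a fault domain is where a problem starts; the failure domain
# is everything that gets dragged down with it.
# Hypothetical topology: each device maps to the upstream it depends on.
depends_on = {
    "branch-sw1": "branch-rtr",
    "branch-sw2": "branch-rtr",
    "branch-rtr": "wan-circuit",
    "hq-sw1": "hq-core",
}

def failure_domain(fault: str) -> set[str]:
    """Return every device impacted by a fault at `fault`."""
    impacted = {fault}
    changed = True
    while changed:
        changed = False
        for device, upstream in depends_on.items():
            if upstream in impacted and device not in impacted:
                impacted.add(device)
                changed = True
    return impacted

# A fault in the WAN circuit takes the whole branch with it,
# but leaves HQ untouched.
print(sorted(failure_domain("wan-circuit")))
# → ['branch-rtr', 'branch-sw1', 'branch-sw2', 'wan-circuit']
```

Redesigning the topology so fewer devices sit downstream of any single point is exactly what shrinking a failure domain means.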

Why This Actually Matters

Resilience conversations usually start with redundancy. Redundant links, redundant devices, redundant paths. Those things matter. But redundancy alone doesn't guarantee stability.

What actually matters is containment.

If systems share the same failure domains, redundancy can still fail in ways that catch you off guard. A single configuration issue or protocol problem can cascade through interconnected systems even when multiple physical paths exist. Good architecture assumes faults will happen and is designed to limit how far they travel.

Layer 1 — Physical

Physical faults are the easiest to understand. Cables fail, power supplies die, interfaces stop responding.

Whether those events stay small depends on the design around them. In a poorly designed environment, a single cable failure can disconnect an entire rack or floor. In a well-designed one, the same failure might affect nothing more than a single device. That difference usually comes from straightforward choices — redundant uplinks, diverse cable paths, redundant power.

Even basic operational practices help. Proper cable management and accurate labeling make it a lot less likely that a routine change accidentally pulls the wrong system offline.

Layer 2 — Data Link

Layer 2 can quietly create much larger failure domains than most engineers expect.

Because Ethernet relies on broadcast and flooding behavior, faults here can spread further than intuition suggests. The broadcast domain itself is a failure domain. Every device that can receive a Layer 2 broadcast is sitting inside it. Under normal conditions, that's fine. Under a switching loop, it becomes a problem fast — broadcast frames multiply, bandwidth disappears, and switches start getting overwhelmed, often before anyone realizes what's happening.

That's why Layer 2 boundaries matter. VLAN segmentation, proper spanning tree configuration, and clean Layer 2 design all keep these domains manageable. A lot of modern architectures take it further by pushing Layer 3 boundaries closer to the edge, which significantly reduces the blast radius of Layer 2 problems.

A useful gut-check at this layer: if a broadcast storm or loop happened here right now, how much of the network would feel it?

Layer 3 — Network

At Layer 3, faults tend to show up as routing problems rather than outright connectivity failures. A bad route advertisement. A flapping routing process. A failed gateway. A static route pointing somewhere it shouldn't.

One advantage here is that routing boundaries naturally contain these problems. A routing issue might disrupt a specific subnet or break traffic along one path, but it usually doesn't take down the whole environment the way a Layer 2 loop can.

Dynamic routing protocols help a lot. OSPF or BGP can react when links or devices fail and find alternate paths automatically. ECMP distributes traffic across multiple equal-cost paths. First-hop redundancy protocols like VRRP or HSRP make sure hosts still have a working gateway if a router goes offline.
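The core idea behind ECMP can be sketched in a few lines. Real routers do this hashing in silicon, and the hash inputs vary by platform; this is just the concept, with made-up addresses: hash the flow's 5-tuple so every packet in a flow takes the same path while flows spread across all equal-cost next hops.

```python
import hashlib

# Equal-cost next hops for the same destination (hypothetical addresses).
next_hops = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

def pick_next_hop(src_ip, dst_ip, proto, src_port, dst_port):
    """Hash the 5-tuple and map it onto one of the equal-cost paths."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return next_hops[int.from_bytes(digest[:4], "big") % len(next_hops)]

# Same flow → same path, so TCP doesn't see reordering. If one next hop
# fails and is withdrawn, only the flows hashed onto it have to move.
flow = ("192.0.2.10", "198.51.100.5", "tcp", 49152, 443)
assert pick_next_hop(*flow) == pick_next_hop(*flow)
```

The failure-domain angle is in the last comment: losing one path disturbs only the flows on it, not the whole environment.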

Most Layer 3 design decisions come back to path diversity. The goal isn't to protect a single critical route. It's to make sure losing any one path doesn't leave large portions of the environment stranded.

Layer 8 — Humans

Layer 8 isn't part of the OSI model. But it's present in every network.

Most engineers have unplugged the wrong cable or configured the wrong interface at some point. I know I have. In complex environments, these mistakes are going to happen. What matters is limiting how far they reach.

Testing before deployment is one of the most effective ways to do that. I've built scenarios in lab environments that looked completely solid on paper, only to find unexpected behavior from hardware or protocol interactions that I never would have caught otherwise. Finding those things in a lab is a lot better than finding them during a production outage.

Operational discipline helps too. Maintenance windows, staged rollouts, and configuration validation exist for a reason. They keep Layer 8 faults small and recoverable.

The Bottom Line

Hardware fails. Cables get unplugged. Engineers make mistakes.

The difference between resilient networks and fragile ones isn't whether faults happen. It's whether the architecture was built to contain them.

Good design assumes something will eventually go wrong. When it does, the impact should be limited and recoverable.

Failure is inevitable. Widespread outages don't have to be.

Next read: Network Design Starts with Business Intent