Cloudflare Completes 'Code Orange: Fail Small' – Strengthening Network Resiliency for All Customers

Introduction: A Major Engineering Milestone

Over the past two and a half quarters, Cloudflare has run an intensive internal engineering initiative called Code Orange: Fail Small, aimed at making its infrastructure more resilient, secure, and reliable for every customer. Earlier this month, the team completed the core work. Had these changes been in place, they would have prevented the global outages of November 18 and December 5, 2025. Resiliency work will continue as a top priority, but this milestone marks a significant change in how Cloudflare manages failures.

Source: blog.cloudflare.com

The effort focused on several critical areas: safer configuration changes, reducing the impact of failures, revising break glass procedures and incident management, preventing drift and regressions over time, and strengthening customer communication during outages. Here is an in-depth look at what was shipped and what it means for you.

Safer Configuration Changes with Snapstone

One of the most impactful changes is the way Cloudflare handles configuration deployments. Previously, internal configuration changes could reach the network instantly, potentially affecting traffic before any issues were detected. Now, in most cases, changes are rolled out progressively with real-time health monitoring. This allows observability tools to catch problems and revert changes before they impact customers.

To achieve this, the engineering team identified high-risk configuration pipelines and built new tooling to manage changes to them. The cornerstone of this effort is a new internal component called Snapstone. Snapstone bundles configuration changes into packages and releases them gradually under health mediation, meaning that health signals determine whether a rollout proceeds or is reverted. Before Snapstone, applying a health-mediated deployment methodology to configuration was possible, but it required significant per-team effort and was applied inconsistently across the network. Snapstone closes this gap by providing a unified way to bring progressive rollout, real-time health monitoring, and automated rollback to configuration deployments by default.
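The rollout pattern described above can be illustrated with a minimal sketch. This is not Snapstone's actual interface, which the source does not describe. The function name `progressive_rollout`, the stage fractions, and the simulated health check are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[float],
    apply: Callable[[float], None],
    healthy: Callable[[], bool],
    rollback: Callable[[], None],
) -> bool:
    """Apply a change stage by stage; roll back on the first failed health check."""
    for fraction in stages:
        apply(fraction)        # expose the change to this fraction of the fleet
        if not healthy():      # the health signal gates promotion to the next stage
            rollback()         # revert the change wherever it was applied
            return False
    return True


# Simulated fleet (hypothetical): health degrades once exposure exceeds 5%.
state = {"exposure": 0.0}
log = []


def apply(fraction: float) -> None:
    state["exposure"] = fraction
    log.append(("apply", fraction))


def healthy() -> bool:
    return state["exposure"] <= 0.05


def rollback() -> None:
    state["exposure"] = 0.0
    log.append(("rollback", None))


ok = progressive_rollout([0.01, 0.05, 0.25, 1.0], apply, healthy, rollback)
print(ok, log)
```

In this simulation the change is halted and reverted at the 25% stage, so it never reaches the full fleet. A production system would also wait between stages so that health signals have time to accumulate, which this sketch omits.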

What makes Snapstone particularly powerful is its flexibility. Rather than being a fix for specific past failures, it allows teams to dynamically define any unit of configuration that needs health mediation—whether that's a data file (like the one that caused the November 18 outage) or a control flag in the global configuration system (like the one involved in the December 5 outage). Teams create these configuration units and apply health mediation consistently across the network.
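One way to picture a generic "unit of configuration" is as a small record that carries its own validation rule, so that a data file and a control flag can pass through the same packaging step. The sketch below is purely illustrative: `ConfigUnit`, `build_package`, the unit names, and the size limit are hypothetical and are not taken from Cloudflare's implementation.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class ConfigUnit:
    name: str
    kind: str                       # e.g. "data_file" or "control_flag"
    payload: Any
    validate: Callable[[Any], bool]


def build_package(units: Iterable[ConfigUnit]) -> dict:
    """Validate every unit, then bundle them together with a content digest."""
    units = list(units)
    for unit in units:
        if not unit.validate(unit.payload):
            raise ValueError(f"unit {unit.name!r} failed validation")
    body = {u.name: {"kind": u.kind, "payload": u.payload} for u in units}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {"units": body, "digest": digest}


# Two hypothetical units of different kinds handled by the same code path.
data_file = ConfigUnit(
    name="example_data_file",
    kind="data_file",
    payload=["entry-1", "entry-2", "entry-3"],
    validate=lambda rows: len(rows) <= 200,   # hypothetical size limit
)
flag = ConfigUnit(
    name="example_control_flag",
    kind="control_flag",
    payload=False,
    validate=lambda value: isinstance(value, bool),
)

package = build_package([data_file, flag])
print(sorted(package["units"]), package["digest"][:12])
```

The digest gives each package a stable identity, which is one plausible way for a rollout system to track which version is live at each stage and which version to revert to.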

Reducing the Impact of Failures

Another key area of focus was reducing the blast radius when failures occur. The project introduced mechanisms to contain problems so they affect only a small subset of traffic or functionality, rather than the entire network. This includes improving isolation between services and implementing circuit breakers that automatically stop the propagation of faulty configurations. By designing for graceful degradation, Cloudflare aims to keep customer impact small even when something goes wrong.
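As a rough illustration of the circuit-breaker idea, the sketch below stops attempting a failing operation after a fixed number of consecutive failures and serves a fallback instead. The class, the threshold, and the "last known good" fallback are assumptions for illustration, not a description of Cloudflare's mechanism. A fuller breaker would also include a timed half-open state for recovery, which is left out here.

```python
class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, then short-circuits."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, func, fallback):
        if self.open:
            return fallback()          # do not even attempt the faulty path
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True       # stop further propagation
            return fallback()
        self.failures = 0              # a success resets the failure count
        return result


def load_new_config():
    # Hypothetical loader that always fails, standing in for a bad configuration.
    raise RuntimeError("malformed configuration")


def last_known_good():
    return "last-known-good"


breaker = CircuitBreaker(failure_threshold=2)
results = [breaker.call(load_new_config, last_known_good) for _ in range(3)]
print(results, breaker.open)
```

Every call returns the fallback value, and the third call never touches the failing loader because the breaker is already open. That is the graceful-degradation behavior the paragraph describes: the service keeps running on the previous state instead of failing outright.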

Revised Break Glass and Incident Management

The incidents also highlighted the need for better break glass procedures—emergency access processes used during outages. The team revised these procedures to make them more controlled and auditable. Additionally, incident management was overhauled to improve response times and coordination across teams. Post-incident reviews now feed directly into the engineering backlog to ensure lessons learned are applied quickly.

Preventing Drift and Regressions

A critical part of long-term resiliency is preventing the system from gradually deviating from its intended state—a phenomenon known as drift. To address this, Cloudflare introduced automated checks and regular regression testing. Configuration baselines are now monitored continuously, and any unauthorized or unexpected changes are flagged and reverted. This ensures that the improvements made today will persist over time.
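Drift detection of this kind can be sketched as a comparison between a recorded baseline and the live state. The keys and values below are invented for the example, and the revert step simply restores the baseline. How Cloudflare's checks actually work is not described in the source.

```python
MISSING = object()  # sentinel for keys present on only one side


def detect_drift(baseline: dict, live: dict) -> dict:
    """Return {key: (expected, actual)} for every key that differs."""
    drift = {}
    for key in sorted(baseline.keys() | live.keys()):
        expected = baseline.get(key, MISSING)
        actual = live.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


def revert_to_baseline(baseline: dict) -> dict:
    """Return a fresh copy of the baseline to use as the new live state."""
    return dict(baseline)


# Hypothetical baseline and live configuration.
baseline = {"rollout_mode": "progressive", "max_stage_fraction": 0.25}
live = {"rollout_mode": "instant", "max_stage_fraction": 0.25, "debug": True}

drift = detect_drift(baseline, live)
for key, (expected, actual) in drift.items():
    print(f"drift in {key}: expected {expected!r}, found {actual!r}")

live = revert_to_baseline(baseline)
print(detect_drift(baseline, live))
```

Run continuously, a check like this flags both changed values and unexpected additions, and after the revert it reports no remaining drift.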

Enhanced Customer Communication During Outages

Transparency is vital during incidents. Cloudflare has strengthened how it communicates with customers when an outage occurs. This includes faster notifications on the status page, more detailed incident reports, and clearer explanations of root causes and mitigation steps. The goal is to keep customers informed every step of the way, reducing uncertainty during stressful events.

What This Means for Cloudflare Customers

For customers, the completion of Code Orange: Fail Small translates to a more resilient and reliable network. Configuration changes are now rolled out progressively in most cases, failures are contained to a smaller blast radius, and the system is better equipped to prevent and recover from incidents. While no network can ever be completely immune to outages, this project is intended to significantly reduce the likelihood and severity of future disruptions. Cloudflare remains committed to continuous improvement, and the tools and processes developed during this initiative will continue to evolve.

In summary, Code Orange: Fail Small has delivered a stronger foundation for Cloudflare's global network. The combination of Snapstone, health-mediated deployments, improved incident management, and better communication means customers can trust that their traffic is handled by a more robust infrastructure.