How the Cloudflare outage exposed hidden risks

Expert answers to common questions

19th November 2025 | Blog, News | Martin Summerhayes

This blog was authored by Neil Taylor and Martin Summerhayes, with valuable contributions from the wider Northdoor technology team.

In November 2025, a major Cloudflare outage disrupted countless services and platforms, many of which were not even direct Cloudflare customers. This event highlighted the invisible dependencies and systemic risks that exist beneath the surface of today’s digital infrastructure. What does this mean for technical leaders, CIOs, and business executives aiming to strengthen digital resilience?

If you’re searching for answers about large-scale outages, hidden dependencies, best practices for limiting downtime, or what technical and organisational changes are crucial after an incident, you’re not alone. Understanding these challenges is vital for improving contingency planning, securing business operations, and communicating risk at the C-suite level.

To help clarify the most pressing issues, Northdoor’s experienced technology team answered the eight most common questions our clients and stakeholders raised during and after the Cloudflare disruption. Their advice spans technical strategies, board-level risk, and cultural factors that determine how quickly organisations recover from a systemic IT failure.

Below, you’ll find clear, actionable answers from our tech team, offering practical guidance whether you’re a CIO, board member, or risk lead, or simply want to understand the wider impact of major digital outages.

Q&A: Cloud outages, dependencies and digital resilience

Q: The Cloudflare outage disrupted platforms that don’t even rely directly on Cloudflare. What does this reveal about hidden dependencies and systemic risk in our digital infrastructure?

A: This reveals the ‘iceberg’ nature of modern digital supply chains. Many platforms believe they are independent because they don’t have a direct contract with Cloudflare. However, they rely on third-party APIs, payment gateways, authentication services, or monitoring tools that do sit behind Cloudflare. We are seeing systemic risk because the internet has consolidated around a tiny number of massive infrastructure providers. A failure in one isn’t just a service outage; it’s a utility failure for large portions of the digital infrastructure.

Q: What aspects of this outage caught even seasoned IT leaders off guard? Where are CIOs and risk teams still underestimating exposure in their contingency planning?

A: People were caught off guard by the speed and scope of the cascading failure. Many contingency plans assume ‘graceful degradation’: if one service fails, we switch to a backup. But in this case the failure propagated so quickly that automated failovers didn’t have time to trigger. IT organisations also underestimate exposure in their API dependencies and rarely map the hundreds of external API calls their applications make every second. If an API call hangs, it can freeze an application’s entire logic, even if the application’s own servers are perfectly healthy.
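
One way to contain that failure mode is to wrap every outbound dependency in an explicit timeout and a simple circuit breaker, so a hanging API degrades your service instead of freezing it. The sketch below shows the general pattern in Python; the endpoint URL, thresholds, and fallback data are illustrative assumptions, not a reference to any specific provider’s API.

```python
import time
import requests

class CircuitBreaker:
    """Stops calling a failing dependency for a cool-off period."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at > self.reset_after:
            self.opened_at = None  # half-open: let one attempt through
            self.failures = 0
            return True
        return False

    def record(self, ok):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()

def fetch_rates(fallback):
    """Call a third-party API, but never let it hang the application."""
    if not breaker.allow():
        return fallback  # dependency is known-bad: degrade gracefully
    try:
        # Hypothetical third-party endpoint; 2 s connect / 3 s read timeout
        resp = requests.get("https://api.example-provider.com/rates",
                            timeout=(2, 3))
        resp.raise_for_status()
        breaker.record(ok=True)
        return resp.json()
    except requests.RequestException:
        breaker.record(ok=False)
        return fallback  # cached or default data instead of an error page
```

The design choice worth debating is the fallback itself: stale cached data is often far better for the customer than an error page.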

Q: If you were briefing a board or executive team tomorrow morning, what are the top three steps they should take to strengthen their cloud and network resilience, and how do you ensure this becomes an enterprise-wide priority, not just an IT concern?

A: To the Board, I would frame this as Revenue Assurance, not just IT resilience:

Map Your Critical Path: Don’t show a network diagram. Show a ‘User Journey Map’ that highlights every single third-party dependency required for a customer to pay for or access our services. If Cloudflare, AWS, or Azure fails, does the customer journey stop, or is there a workaround?

Diversify: We need to stop treating Content Delivery Networks (CDNs) like Cloudflare as a commodity we buy from one vendor. We need a ‘multi-CDN’ strategy or a ‘fail-open’ architecture where, if the security layer fails, traffic bypasses it to a fallback origin rather than simply stopping (a minimal sketch of this fallback chain follows this list).

Test via a ‘Big Red Button’: We need to run a ‘Game Day’ simulation where we pretend our primary provider is down. Do we actually have a playbook to reroute DNS? Have we ever tested it?
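
To make the ‘fail-open’ point in the second step concrete, here is a minimal sketch of an application-level fallback chain: try the primary CDN, then a secondary CDN, then the origin directly. The hostnames and the ordering are assumptions for illustration only; in a real deployment this steering usually lives in DNS or a traffic manager rather than in application code.

```python
import requests

# Illustrative endpoints: a primary CDN, a secondary CDN, and the raw origin.
ENDPOINTS = [
    "https://cdn-primary.example.com",    # e.g. fronted by Cloudflare
    "https://cdn-secondary.example.com",  # a different CDN vendor
    "https://origin.example.com",         # last resort: bypass the edge
]

def fetch(path, timeout=3):
    """Return the first successful response, walking down the fallback chain."""
    last_error = None
    for base in ENDPOINTS:
        try:
            resp = requests.get(f"{base}{path}", timeout=timeout)
            if resp.status_code < 500:
                return resp
        except requests.RequestException as exc:
            last_error = exc  # this tier is down or unreachable, try the next
    raise RuntimeError(f"All tiers failed for {path}") from last_error
```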

Q: What’s the most critical technical takeaway from the traffic spike and cascading failure that triggered today’s Cloudflare disruption?

A: The takeaway is that automation is a double-edged sword. The outage was likely triggered or exacerbated by an automated configuration push, or a similar rule change, that backfired. Global changes should never be pushed everywhere instantly. They must be rolled out to 1% of traffic, validated, and only then expanded. Automation without safety brakes is just a faster way to crash.
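
As an illustration of what a ‘safety brake’ looks like in practice, the sketch below is a generic staged-rollout loop: expose the change to 1% of traffic, watch the error rate, and only then widen the blast radius. The apply_config and error_rate hooks are hypothetical placeholders for your own deployment and monitoring tooling; this is not a description of Cloudflare’s pipeline.

```python
import time

STAGES = [0.01, 0.05, 0.25, 1.00]   # share of traffic at each step
ERROR_BUDGET = 0.02                 # abort if more than 2% of requests fail

def staged_rollout(apply_config, error_rate, soak_seconds=300):
    """Push a change to a growing slice of traffic, with an automatic brake.

    apply_config(fraction) and error_rate() are hypothetical hooks into your
    own deployment and monitoring systems.
    """
    for fraction in STAGES:
        apply_config(fraction)      # expose the change to this slice only
        time.sleep(soak_seconds)    # let real traffic exercise it
        if error_rate() > ERROR_BUDGET:
            apply_config(0.0)       # safety brake: roll back everywhere
            raise RuntimeError(f"Rollout aborted at {fraction:.0%} of traffic")
    return "fully rolled out"
```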

Q: Which architectural or configuration strategies are most effective in limiting the blast radius when a global CDN or edge network fails?

A: The most effective strategy is an ‘Active-Active’ multi-cloud architecture. Have two active environments served by two different CDNs, and use DNS load balancing to direct traffic between them. If one CDN fails, you update the DNS to route 100% of traffic to the survivor. It’s expensive and complex, but it’s the only way to truly limit the blast radius of a global infrastructure failure. As a business, you have to balance that cost against your risk profile, i.e. the revenue lost per hour of downtime.
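
The steering logic behind that failover is usually provided by a managed DNS or global load-balancing service, but its shape is simple enough to sketch. In the example below, the health-check URLs and the update_dns_weights hook are assumptions standing in for your DNS provider’s API.

```python
import requests

# Two active environments behind two different CDNs (illustrative hostnames).
ENVIRONMENTS = {
    "cdn-a": "https://a.example.com/healthz",
    "cdn-b": "https://b.example.com/healthz",
}

def healthy(url, timeout=2):
    """A target is healthy if its health endpoint answers 200 quickly."""
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

def rebalance(update_dns_weights):
    """Split traffic across healthy environments; send 100% to the survivor.

    update_dns_weights(dict) is a hypothetical hook into your DNS provider's API.
    """
    live = [name for name, url in ENVIRONMENTS.items() if healthy(url)]
    if not live:
        raise RuntimeError("No healthy environment: escalate to incident response")
    weight = 1.0 / len(live)
    update_dns_weights({name: (weight if name in live else 0.0)
                        for name in ENVIRONMENTS})
```

Run a check like this frequently and automatically; a failover that has to be triggered by hand in the middle of the night is not really a failover.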

Q: From a cybersecurity and operational resilience perspective, what blind spots did this outage expose—and how should teams address them before the next disruption?

A: The blind spot is the ‘Security vs. Availability’ trade-off. Many organisations configure their WAFs (Web Application Firewalls) to ‘fail closed’—meaning if the security check fails, nobody gets in. This outage showed that for some businesses, ‘fail open’ (allowing traffic in without the WAF inspection during a crisis) might be the lesser of two evils compared to total downtime. Organisations need to re-evaluate their risk appetite: in a catastrophic outage, is it better to be potentially vulnerable but online, or secure but offline? That is a difficult question to answer.
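
That risk-appetite decision is best encoded in configuration before the crisis, not improvised during one. The sketch below shows the shape of a fail-open policy at the proxy layer; the inspect and forward hooks and the FAIL_OPEN flag are assumptions for illustration, not the syntax of any particular WAF product.

```python
# Whether to let traffic through when the WAF inspection layer itself is down.
# This should be a deliberate, board-approved risk decision, not a default.
FAIL_OPEN = True

class WafUnavailable(Exception):
    """Raised when the inspection service cannot be reached."""

def handle(request, inspect, forward):
    """Route a request through WAF inspection, honouring the fail-open policy.

    inspect(request) and forward(request) are hypothetical hooks: the first
    raises WafUnavailable if the security layer is down and returns False for
    malicious traffic; the second passes the request on to the origin.
    """
    try:
        if not inspect(request):
            return "403 blocked by WAF"
        return forward(request)
    except WafUnavailable:
        if FAIL_OPEN:
            return forward(request)  # stay online, accept the extra risk
        return "503 security layer unavailable"  # fail closed: stay offline
```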

Q: What should post-outage conversations at the C-suite and board level focus on and how do you translate technical failures into strategic risk language that drives action?

A: The conversation must move from ‘Why did the tech fail?’ to ‘Why was our business continuity plan ineffective?’ The focus should be on Tolerance Levels. We need to ask: ‘What is our maximum tolerable downtime?’ If the answer is ‘zero’, then the investment in resilience must match that need. Just as a CFO wouldn’t hold 100% of the company’s cash in one bank, even with bank guarantees in place, a CIO shouldn’t place 100% of the company’s digital presence with one infrastructure provider. It’s about diversification of risk.

Q: Beyond technology, what cultural or organisational traits most influence a company’s ability to recover quickly from third-party infrastructure failures like today’s?

A: In a crisis, you need people who are not afraid to make a call. If a team is terrified of being blamed for making the ‘wrong’ decision during an outage, they will freeze and wait for approval chains, costing valuable minutes or even hours. Companies that recovered fastest today were those with balanced-risk decision-making and practised incident response. They had staff who knew they had the authority to hit the ‘emergency bypass’ button without needing to wake up the CIO or CEO for permission.

Final thoughts

Outages like those experienced by Cloudflare, AWS, and other major infrastructure providers expose critical vulnerabilities in today’s digital ecosystem. Being prepared isn’t just about technology; it’s about organisational culture, strategic planning, and clear communication at every level.

By understanding these risks and the practical steps your organisation can take, you’ll be better equipped to manage future disruptions swiftly and confidently.

If you found these insights helpful, feel free to share this post with your network or reach out to the Northdoor team for tailored advice on strengthening your business’s digital resilience.

Related media coverage

Please see the following related press coverage featuring Northdoor’s contributions to the topic.
