CharlieVX - Article

Cloudflare Global Outage

You’d think giants like AWS, Azure and Cloudflare would avoid massive disruptions, but as today proved, even the most resilient platforms can fail dramatically, and when they do, it’s in spectacular fashion.


On the 18th of November 2025, Cloudflare experienced one of the largest outages in its history.

A global event causing HTTP 5xx errors, DNS (Domain Name System) failures and CDN (Content Delivery Network) disruption for hundreds of thousands of sites. API endpoints, WAF layers, Zero Trust services and general internet accessibility were all affected.

The issue began at 11:20 UTC due to a bug in a Bot Management configuration. In simple terms: a configuration broke, though that’s still understating it. So let’s break down and postmortem what happened, and why CharlieVX didn’t stay down for long.

Why These Outages Matter

Whether you’re into tech or not, it’s important to understand why outages like this hit so hard.

Cloudflare acts as foundational infrastructure for a huge portion of the modern internet, and so when they go down, we all feel it whether we want to or not. After all, they're responsible for handling so much:

  • DNS failures affect everything. If DNS breaks, devices can’t resolve names: your phone, computer, etc. can’t reach Discord, YouTube, Snapchat, anything.

  • CDN downtime slows or completely blocks static content like images and videos.

  • Security layers (WAF, DDoS protection) become unreliable, increasing risk during an already vulnerable period.

  • Indirect impact occurs even for companies not using Cloudflare directly, because of how interconnected services are today.

When a provider responsible for DNS, CDN and security goes down, the internet itself feels it - from banking platforms to basic mobile connectivity. This comes just a few short weeks after the massive AWS and Azure outages too. It's almost as if being centralized and dependent on corporations isn't wise, and decentralization is best practice where you can manage it.
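Curious whether a site you use sits behind Cloudflare? A quick sketch: Cloudflare-fronted responses typically include a `server: cloudflare` header and a `cf-ray` header. The helper and the sample headers below are illustrative, not fetched live.

```python
# Sketch: guess whether a site is fronted by Cloudflare from its response
# headers. Cloudflare-fronted responses typically carry "server: cloudflare"
# and a "cf-ray" header.

def looks_like_cloudflare(headers: dict) -> bool:
    """Return True if the headers suggest Cloudflare is in front of the site."""
    normalized = {k.lower(): v for k, v in headers.items()}
    return normalized.get("server", "").lower() == "cloudflare" or "cf-ray" in normalized

# Sample headers for illustration; in practice you might collect them with:
#   import urllib.request
#   headers = dict(urllib.request.urlopen("https://example.com").headers)
sample = {"Server": "cloudflare", "CF-RAY": "8c1a2b3c4d5e-LHR", "Content-Type": "text/html"}
print(looks_like_cloudflare(sample))  # True for this sample
```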

Timeline (based on Cloudflare + independent reporting)

  • ~11:20 UTC - Cloudflare’s telemetry showed internal failures to deliver core network traffic, meaning customers and end users began seeing error pages and timeouts.

  • ~11:30 to 12:00 UTC - External observability platforms (notably ThousandEyes) began reporting timeouts and HTTP 5xx error spikes for Cloudflare-fronted services. The media noticed pretty quickly!

  • ~12:00 to 14:00 UTC - Major sites using Cloudflare experienced intermittent or total outages (reports surfaced for many well-known services).
    Press outlets reported peak impact and ongoing remediation, not that anyone could view their articles during the outage anyway.

  • ~Afternoon (same day) - Cloudflare deployed and validated a fix. The company published updates indicating the incident was resolved and that monitoring would continue. This, however, doesn't mean it's all perfect again...

Times are based on Cloudflare’s UTC logs and rounded media reports.

Cloudflare’s Mitigation Steps

Cloudflare followed a standard incident response process:

  • Detection and triage - Cloudflare routed the issue into their incident response channel.

  • Containment - engineers isolated the failing configuration-generation logic and stopped the malformed configuration from propagating further.

  • Rollback / fix - Cloudflare rolled back and corrected the problematic generation step, then began a phased reintroduction of services while monitoring for error regressions, before making a public apology.

  • Verification and monitoring - after the fix, Cloudflare monitored for residual errors and communicated status to customers via their status page and blog.

Cloudflare made it very clear that they found no evidence of malicious activity in this incident, framing it as a latent bug in the configuration.
It is unclear at this time whether that bug was the result of some intern vibe coding (AI coding) their way through day one, or a larger internal management failure. We may never know...

Why Centralised Infrastructure Outages Are Dangerous

This outage highlights a known problem: centralisation amplifies failure. A single failure can ripple across the internet. We all know this from the recent AWS and Azure outages.

  • DNS/CDN providers are choke points. Many modern sites rely on a small number of providers for DNS, TLS and CDN (for the daily-folk these terms don't matter much; just know sites rely on them for everything from making sure the page you want loads to the images and content within it), plus protection from hackers and trolls. When those providers fail, entire ecosystems stall and sometimes shut down permanently.

  • Secondary effects: embedded third-party scripts, OAuth flows, and analytics endpoints can turn local infrastructure failures into end-user outages. CharlieVX, for example, uses Discord as a point of access via an OAuth flow (where you leave CVX, go to Discord, and come back); those points fail as a result, and even though it "wasn't our fault", these secondary effects can be serious, stopping end users from logging into vital services, in some cases even electric companies (which I experienced).

  • Operational blindness: if your monitoring depends on the same provider, like individuals using Amazon's AWS for a status page during the AWS outage, you may lose visibility exactly when you need it most. Observability heterogeneity is vital. Decentralized monitoring using platforms like Uptime Kuma or Wazuh is paramount, but many self-hosters, small businesses, and even individuals rely on the massive name-brands instead to either host the monitoring service or do it all for them (Managed Service Provider solutions).
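The observability point can be made concrete. Below is a minimal sketch of an independent uptime probe using only the Python standard library; the URL is a placeholder, and in practice you'd run this from infrastructure that shares no provider with what it monitors (cron it every minute and alert on repeated failures).

```python
# Minimal sketch of an independent uptime probe. Run it from somewhere that
# does NOT share a provider with the service being monitored, otherwise you
# go blind at exactly the wrong moment. The URL below is a placeholder.
import urllib.request
import urllib.error

def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-5xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 500
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status.
        return e.code < 500
    except (urllib.error.URLError, TimeoutError, OSError):
        # No answer at all: DNS failure, refused connection, or timeout.
        return False

if __name__ == "__main__":
    print(probe("https://example.com"))
```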

Being decentralized, or self-hosted, even on a small Raspberry Pi and one or two repurposed old PCs as servers, can make the difference in outages like this, as it did for many on Reddit who mocked the outage in part. That said, I get that self-hosting takes a tech-savvy hand.

In cases where you need or want a web service, such as a website, but aren't tech-savvy enough to do it all yourself, turn to a friend or family member who is and can host these services for you, particularly if they're self-employed and insured. Otherwise you'll find smaller, decentralized providers all over the web: Fastly is an alternative to Cloudflare for CDN, for example, and Hetzner an alternative to AWS.

Why Wasn’t CharlieVX Affected for Long?

CharlieVX uses Cloudflare - but recovered quickly due to deliberate resilience decisions:

  • Multi-provider DNS/CDN strategy - a fully pre-configured fallback that can be switched to within minutes.
  • Short DNS TTLs - speeds up propagation during failover (changing CDN).
  • Static asset fallback - mirrored and cached assets keep the essential UI and interactions working even during degraded conditions.
  • Feature gating & graceful degradation - non-critical features disable cleanly under stress.
  • Independent observability - monitoring remains available even if Cloudflare is down.
  • Runbooks & automation - scripted recovery steps reduce downtime from hours to minutes.
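The first two bullets boil down to a simple failover decision. Everything below is illustrative: `FAIL_THRESHOLD` and the health-check history are stand-ins, and the actual DNS switch would go through your secondary provider's API rather than a print statement.

```python
# Sketch of an automated failover decision, with illustrative numbers.
# In a real runbook, returning "secondary" would trigger a scripted DNS
# update at the fallback provider; with a low TTL, clients follow in minutes.

FAIL_THRESHOLD = 3  # consecutive failed health checks before failing over

def failover_decision(check_results: list[bool]) -> str:
    """Decide which provider to serve from, given recent health checks
    of the primary (True = healthy, oldest first)."""
    recent = check_results[-FAIL_THRESHOLD:]
    if len(recent) == FAIL_THRESHOLD and not any(recent):
        return "secondary"  # primary down long enough: switch DNS
    return "primary"        # healthy, or not enough evidence yet

# During an outage: three failed checks in a row triggers the switch.
print(failover_decision([True, False, False, False]))  # "secondary"
print(failover_decision([True, True, False]))          # "primary"
```

The threshold guards against flapping: one blip shouldn't trigger a DNS change, but a sustained outage should.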

These are practical approaches for anyone from self-hosters to larger teams.

Cloudflare’s Official Apology

Cloudflare publicly explained the incident and how they plan to prevent a repeat: Cloudflare: 18 November 2025 Outage.

Frequently Asked Questions

Q - Should I stop using Cloudflare?
No. No provider is perfect. The real solution is designing for provider failure, not abandoning them altogether.

Q - How fast is DNS failover?
With low TTLs and proper automation, failover can occur within minutes - though some resolvers may hold on to cached records briefly.
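A back-of-envelope version of that answer, with illustrative numbers (your detection, push, and TTL values will differ):

```python
# Rough worst-case failover time, assuming illustrative numbers:
# detection time + time to push the DNS change + record TTL
# (resolvers may keep serving the old record until their cached copy expires).

detection_s = 60   # monitoring notices the outage
push_s      = 30   # automation updates DNS at the secondary provider
ttl_s       = 60   # low TTL on the record

worst_case_s = detection_s + push_s + ttl_s
print(worst_case_s / 60, "minutes")  # 2.5 minutes
```

With a typical default TTL of an hour instead, the same arithmetic lands past the one-hour mark, which is why the low TTL matters.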

Q - Was this an attack?
Cloudflare says no. Independent telemetry supports the explanation of a configuration bug rather than malicious action.

Thank You ❤

If you ever need help, join the Discord or contact me directly.
Email: charlesbaron@tutanota.com