Recent outages at two of the world’s largest cloud providers, Amazon Web Services (AWS) and Cloudflare, have highlighted just how much businesses rely on the cloud, and how quickly things can go wrong. On 20 October 2025, AWS experienced a major disruption that took thousands of companies and services offline, including Signal, Snapchat, Roblox, and Duolingo, as well as smart home devices such as Ring doorbells and Eight Sleep beds.
Less than a month later, on 18 November 2025, Cloudflare faced a global outage caused by a small system change that triggered widespread crashes. Popular services such as X, ChatGPT, Canva, LinkedIn, Zoom, YouTube, and Google were affected, showing how far-reaching even a single failure can be.
In this article, we’ll explore what caused these outages, why they matter, and what businesses can do to reduce the risk of future cloud disruptions.

The AWS outage was triggered by a fault in the automation that manages DNS records for DynamoDB, one of AWS’s core database services, in its US-EAST-1 region. A race condition in that automation left the service’s regional endpoint with an empty DNS record, so applications and other AWS services could no longer locate the database. This caused dependent systems to fail, and even after engineers fixed the initial problem, the ripple effects across connected services prolonged the disruption.
The Cloudflare outage began with a routine change to database permissions that unintentionally caused an oversized configuration file to be generated. This file, which helps manage bot traffic and security features, exceeded a size limit built into the software that reads it, causing servers to crash and restart repeatedly. The team had to stop the faulty file from being deployed and replace it with a correct version to restore normal operations.
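One general safeguard against this kind of failure is to validate a generated file before it is rolled out, and to keep serving the last known good version if the check fails. The following is a minimal sketch of that idea, not a description of Cloudflare's actual tooling. The names `validate_feature_file` and `choose_file_to_deploy`, and the limit value, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical limit for illustration only; a real system would use the
# limit enforced by whatever software consumes the file.
MAX_ENTRIES = 100


@dataclass(frozen=True)
class ValidationResult:
    ok: bool
    reason: str = ""


def validate_feature_file(entries, max_entries=MAX_ENTRIES):
    """Check a generated file before rollout rather than letting consumers crash on it."""
    if len(entries) != len(set(entries)):
        return ValidationResult(False, "duplicate entries")
    if len(entries) > max_entries:
        return ValidationResult(
            False, f"{len(entries)} entries exceeds limit of {max_entries}"
        )
    return ValidationResult(True)


def choose_file_to_deploy(candidate, last_known_good, max_entries=MAX_ENTRIES):
    """Deploy the candidate only if it passes validation; otherwise keep the old file."""
    if validate_feature_file(candidate, max_entries).ok:
        return candidate
    return last_known_good
```

The design choice here is that a bad file is rejected at the point of generation, so the blast radius is a skipped update rather than a crashed fleet.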
AWS and Cloudflare form part of the backbone of the internet. When their systems fail, websites and online services can effectively go offline, impacting millions of people and businesses. By some estimates, these providers support services used by a significant portion of the world’s websites, meaning even a small disruption can ripple across the digital economy.
Both incidents show how a seemingly minor fault in a single part of a cloud provider’s system can cascade through complex networks, disrupting multiple services at once. They highlight the need for businesses to understand the risks of relying on cloud infrastructure and to plan for potential outages.
Critical services should be spread across multiple regions or providers whenever possible. This could mean running systems in different data centres or using more than one cloud provider for key functions like DNS, connectivity, or security. Separating control, data, and management processes also helps limit the “blast radius” if something goes wrong, ensuring a problem in one area doesn’t take everything down.
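In application code, spreading a dependency across regions or providers often comes down to trying an ordered list of endpoints and moving on when one is unreachable. The sketch below illustrates that pattern under simple assumptions: the `request` callable stands in for whatever client a real system uses, and the endpoint names in the usage note are made up.

```python
from typing import Callable, Sequence, Tuple


class AllEndpointsFailed(Exception):
    """Raised when every configured region or provider has failed."""


def call_with_failover(
    endpoints: Sequence[str], request: Callable[[str], str]
) -> Tuple[str, str]:
    """Try each endpoint in priority order and return (endpoint, response)."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except OSError as exc:
            # ConnectionError and TimeoutError are both subclasses of OSError.
            errors.append(f"{endpoint}: {exc}")
    raise AllEndpointsFailed("; ".join(errors))
```

For example, calling `call_with_failover(["primary-region", "secondary-region"], fetch)` would only contact the secondary if the primary raised a connection error. Failover like this only helps if the secondary is genuinely independent, which is why the separation of providers and regions described above matters.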
Detecting and responding to failures quickly is crucial. Both the AWS and Cloudflare incidents showed how errors can appear confusing and contradictory during an outage. AI-driven monitoring can help by filtering out noise, highlighting the real source of problems, and triggering automated responses, such as rolling back a faulty configuration or rerouting traffic. Used well, this approach can help teams address issues faster and more reliably than relying on alerts alone.
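An automated rollback does not need to be sophisticated to be useful. As a rough sketch, the function below applies a new configuration, watches a stream of error-rate readings, and reverts to the previous configuration as soon as a reading crosses a threshold. The `apply` callable, the readings, and the 5% threshold are all assumptions made for illustration; a real deployment would take them from its own rollout and monitoring tools.

```python
from typing import Callable, Iterable


def deploy_with_rollback(
    apply: Callable[[str], None],
    new_config: str,
    previous_config: str,
    error_rates: Iterable[float],
    max_error_rate: float = 0.05,
) -> str:
    """Apply new_config, then roll back if any observed error rate is too high."""
    apply(new_config)
    for rate in error_rates:
        if rate > max_error_rate:
            # Revert immediately instead of waiting for a human to read an alert.
            apply(previous_config)
            return "rolled_back"
    return "kept"
```

In practice the readings would come from a monitoring system during a fixed observation window after each change.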
Preparation is key. Running realistic drills and failover tests allows teams to practise responding to outages before a real incident occurs. This should include scenarios such as system failures, human error, and cyberattacks. Regular, well-documented exercises help uncover gaps, build confidence, and ensure backup systems actually work when they’re needed.
One of the most common issues we see is businesses assuming their disaster recovery plans will work without ever properly stress-testing them. Having these plans reviewed and challenged by an experienced third party can reveal risks and weaknesses that are easy to overlook internally.
At MCD Systems, we work closely with businesses to review existing systems, test recovery processes, and identify weak points before they turn into real problems. By pressure-testing plans and validating assumptions, organisations can be far more confident that their systems will hold up when it matters most.
No system can guarantee zero outages. The goal is to make failures local rather than global, minimising disruption when something goes wrong. By designing systems that are flexible, decoupled, and multi-region, and by automating checks and responses, businesses can continue operating even when a major cloud provider experiences problems.
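One widely used way to keep a failure local is a circuit breaker: after repeated failures, the application stops calling the broken dependency for a while and serves a fallback, such as cached data, instead. The class below is a simplified sketch of that pattern rather than a production implementation, and its thresholds and the `clock` parameter are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback until it cools down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after  # seconds to wait before retrying
        self.clock = clock              # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # circuit open: skip the dependency
            # Cool-down elapsed: allow one trial call. A single further
            # failure reopens the circuit straight away.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(fetch_live_prices, load_cached_prices)`, where both function names are hypothetical. Users see slightly stale data rather than an error page, and the struggling dependency is spared further load while it recovers.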
The AWS and Cloudflare outages are a clear reminder that no cloud provider is immune to failure. Even small issues can ripple across the digital ecosystem, affecting businesses and users on a global scale. For organisations that rely on cloud services, trying to avoid failure entirely is unrealistic. The focus should instead be on building systems that are resilient and prepared.
At MCD Systems, we partner with forward-thinking companies to design and build reliable, high-quality software that supports growth and reduces risk. Whether that’s reviewing cloud architecture, strengthening disaster recovery plans, or helping teams prepare for the unexpected, our focus is on creating technology that works for your business, not against it.
By spreading critical services across providers and regions, implementing intelligent monitoring, regularly testing recovery plans, and working with trusted partners, businesses can significantly reduce the impact of future outages. In an increasingly connected world, preparation and resilience are the best safeguards against the next cloud disruption.