What can we learn from the AWS outage?

This week we’re looking at the Amazon Web Services outage and what can be learned from it.

What was the outage?

Earlier this week, Amazon Web Services (AWS) experienced one of its most disruptive outages in recent memory.

Spanning over 15 hours, the incident affected more than 100 AWS services and sent shockwaves across industries worldwide. Downdetector logged 6.5 million disruption reports across more than 1,000 services.

From services like Snapchat and Roblox all the way to commerce giants, HMRC and airlines such as Delta, the outage exposed a critical vulnerability in modern IT infrastructure: over-reliance on a single cloud provider.

Why did the outage happen?

The root cause of the outage was a DNS resolution failure in AWS’s US-EAST-1 region—its oldest and most heavily used data centre hub.

DNS (Domain Name System) is essentially the internet’s address book, translating human-readable domain names into IP addresses. When this system fails, services can’t locate each other, leading to widespread breakdowns.
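To make the address-book idea concrete, here is a minimal Python sketch of a DNS lookup using the standard library's `socket.getaddrinfo`. The helper name `resolve` and its `resolver` parameter are our own illustrative choices, not anything taken from AWS. It returns an empty list when the name cannot be translated, which is the situation services found themselves in during the outage.

```python
import socket


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the unique IP addresses for hostname, or [] if the lookup fails.

    `resolver` is a hypothetical hook so the lookup can be swapped out
    (for example in tests); by default it is the real system resolver.
    """
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # DNS could not translate the name, so there is no address to connect to.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({entry[4][0] for entry in results})
```

On a connected machine, `resolve("example.com")` would return that domain's current addresses. If the lookup fails, the caller gets an empty list and has nowhere to send its request, even though the server behind the name may be perfectly healthy.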

This single point of failure cascaded across AWS’s global infrastructure, impacting services such as virtual servers, storage, serverless computing, and more. Major platforms went offline as a result. Even airline check-in systems and banking apps were rendered unusable, leaving millions of users stranded and frustrated.

Some businesses not directly hosted on AWS experienced downtime due to dependencies they weren’t aware of.
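Hidden dependencies of this kind can be traced once they are written down. The sketch below is a simplified illustration, assuming you keep an inventory of which service relies on which. Every name in it (`DEPENDENCIES`, `depends_on`, the service labels) is hypothetical and would need replacing with your own suppliers.

```python
# Hypothetical inventory: each service maps to the services it relies on.
DEPENDENCIES = {
    "our-website": ["payments-saas", "on-prem-database"],
    "payments-saas": ["email-api"],
    "email-api": ["aws-us-east-1"],
    "on-prem-database": [],
}


def depends_on(service, target, dependencies, _seen=None):
    """Return True if service relies on target directly or through other services."""
    seen = set() if _seen is None else _seen
    if service in seen:
        # Already checked this service; avoids looping on circular dependencies.
        return False
    seen.add(service)
    for dep in dependencies.get(service, []):
        if dep == target or depends_on(dep, target, dependencies, seen):
            return True
    return False


# Which of our services would be exposed if this one region failed?
exposed = [name for name in DEPENDENCIES
           if depends_on(name, "aws-us-east-1", DEPENDENCIES)]
```

In this made-up inventory, `our-website` never touches AWS directly, yet it still appears in `exposed` because its payments supplier depends on an email API hosted there.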

Why it matters to all of us

For some businesses, AWS is the invisible backbone of their digital operations. Whether directly hosting their applications or relying on third-party services built on AWS, the outage revealed just how deeply embedded cloud infrastructure has become—and how fragile it can be.

This incident wasn’t just a technical hiccup. It was a stark reminder of the concentration risk that comes with putting all your IT eggs in one cloud basket. When a single provider experiences downtime, the ripple effects can be catastrophic.

What can we learn from the outage?

Here are some lessons for businesses:

1. Understand Your Dependencies – many companies affected by the outage weren’t even aware they relied on AWS. They used SaaS platforms, APIs, or integrations that were built on AWS infrastructure. This lack of visibility into the tech stack can be dangerous. Businesses must audit their digital ecosystem to understand where cloud dependencies lie.

2. Diversify Your Infrastructure – the outage underscores the importance of multi-cloud and hybrid cloud strategies. By distributing workloads across multiple providers, businesses can reduce the risk of total service disruption. Hybrid models that combine cloud and on-premise solutions offer even greater resilience. We recently sent out an email questioning whether storing data in the cloud is the best option. You can read it here.

3. Invest in Failover Systems – systems should be designed with failover capabilities that automatically switch to backup services or regions when primary ones fail. This requires thoughtful architecture and investment but pays dividends during crises.

4. Communicate Transparently with Customers – during outages, customer trust is on the line. Businesses must have a crisis communication plan in place to inform users promptly, explain the situation, and provide updates. Transparency builds credibility, even in the face of technical failures.
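As an illustration of the failover idea in lesson 3, here is a minimal Python sketch that tries a primary endpoint and falls back to a backup when the connection fails. The function name `call_with_failover`, the `request` callable and the endpoint names are all assumptions made for this example. Real failover usually also involves health checks, timeouts and keeping data in sync between regions, none of which is shown here.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return (endpoint, response) from the first success.

    `endpoints` is an ordered list, primary first. `request` is any callable
    that takes an endpoint and raises ConnectionError when it is unreachable.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except ConnectionError as exc:
            # Record the failure and move on to the next backup.
            errors.append(f"{endpoint}: {exc}")
    # Every endpoint failed, so report them all rather than only the last one.
    raise ConnectionError("all endpoints failed: " + "; ".join(errors))
```

A call such as `call_with_failover(["primary.example", "backup.example"], send)`, where `send` is your own request function, would only reach the backup if the primary raised a connection error. The ordering of the list is the design decision: it encodes which provider or region you prefer when everything is healthy.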

Moving Forward

The AWS outage is not an isolated event. Similar incidents have occurred with other cloud providers, and as digital infrastructure grows more complex, the risk of systemic failure increases.

Cloud computing offers immense benefits—scalability, flexibility, and cost-efficiency—but it must be approached with a balanced strategy. Resilience, redundancy, and risk awareness should be at the heart of every digital transformation plan.

Let’s talk about how we can strengthen your IT strategy and protect your operations from future disruptions.
