Thursday, October 23, 2025

The real lesson from the AWS outage: Resilience can’t be outsourced

If you didn’t experience the interruption firsthand, you’ve heard by now: AWS’s busiest region in the United States went dark and took down wide swaths of the web. This outage serves as a stark reminder: Even the largest cloud provider (and the backbone of much of the digital economy) can fail.

Core systems including load balancers, DNS, and DynamoDB stopped functioning, triggering a rapid chain reaction. Apps froze, transactions stalled, and operations ground to a halt for millions of users. Amazon’s own ecommerce site, Alexa, and Ring doorbell devices were affected, along with major consumer and business platforms such as WhatsApp, ChatGPT, Venmo, Robinhood, and Snapchat. Even some government and gaming services experienced downtime.

AWS engineers acted fast and shared updates throughout the incident. Still, the consequences were widespread. When one company runs such a large part of the world’s infrastructure, a single malfunction can ripple everywhere. The details of this specific occurrence may fade, but events like this will happen again.

The deeper issue is over-reliance on a single provider

Over the past decade, many organizations have consolidated nearly everything on a single public cloud to simplify operations. Service‑level guarantees may look solid, but they don’t eliminate the concentrated risk that comes with dependence. 

When all workloads sit in one place, any disruption spreads quickly. A resilient strategy requires more control and more diversity in how infrastructure is designed. Companies that had spread their systems across more than one environment stayed online. They had tested recovery paths and kept essential data under their own control.

This approach is sometimes described as “cloud‑smart.” It keeps the benefits of cloud computing while preserving autonomy through hybrid, private, and edge infrastructure.

The following principles offer a practical guide to building that kind of resilience.

Five practical principles for cloud‑smart resilience

  1. Take responsibility for uptime.

Even when cloud providers offer built-in resilience features, your availability still depends on their infrastructure. With a hybrid or on-prem architecture, you’re not waiting on someone else’s status page to refresh. You control your own availability and can design redundancy that protects your business, not just your provider’s SLA.

What to do: Identify the applications that must remain online under all circumstances. Create redundant paths for them, ideally spanning multiple regions or environments outside your main provider. Review your continuity plan to ensure every external dependency is covered.
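As a minimal sketch of what a redundant path can look like in practice, the snippet below walks an ordered list of endpoints and picks the first one that passes a health check. The endpoint URLs and the `probe` callback are hypothetical placeholders for illustration; a real setup would plug in its own health checks and would usually handle this at the DNS or load-balancer layer as well.

```python
from typing import Callable, Iterable, Optional

# Hypothetical endpoints, ordered by preference: the primary cloud region,
# a second region, then an environment outside the main provider.
ENDPOINTS = [
    "https://primary.example.com/health",
    "https://secondary.example.com/health",
    "https://onprem.example.com/health",
]


def first_healthy(endpoints: Iterable[str],
                  probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for endpoint in endpoints:
        try:
            if probe(endpoint):
                return endpoint
        except Exception:
            # A probe that raises (timeout, DNS failure, and so on)
            # is treated the same as an unhealthy endpoint.
            continue
    return None
```

Passing the probe in as a function keeps the selection logic testable without network access, which makes it easier to rehearse failover before an outage forces the question.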

  2. Keep essential data within reach.

Large datasets are slow to move, especially during a crisis. Recovery becomes easier when critical information exists in more than one location. Local or regional copies shorten the path to restoration.

What to do: Classify your data by importance and acceptable downtime. Keep the most critical and frequently accessed datasets replicated in multiple secure locations. Test recovery regularly to verify access speed, integrity, and immutability.
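The classification step can be as simple as a small rule table. The sketch below is one possible way to map importance and acceptable downtime to a replication tier. The tier names and the thresholds (60 minutes, 24 hours) are assumptions chosen for illustration, not recommendations; each organization would set its own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    critical: bool              # does the business stop without it?
    max_downtime_minutes: int   # acceptable time without access


def replication_tier(ds: Dataset) -> str:
    """Map a dataset to a hypothetical replication tier."""
    # Illustrative thresholds only: tune these to your own recovery targets.
    if ds.critical and ds.max_downtime_minutes <= 60:
        return "multi-site"      # live copies in several locations
    if ds.critical or ds.max_downtime_minutes <= 24 * 60:
        return "two-copy"        # one additional replica elsewhere
    return "backup-only"         # periodic backup is sufficient


# Hypothetical inventory used to show the classification in action.
inventory = [
    Dataset("orders-db", critical=True, max_downtime_minutes=15),
    Dataset("analytics-lake", critical=False, max_downtime_minutes=720),
    Dataset("old-logs", critical=False, max_downtime_minutes=10080),
]
tiers = {ds.name: replication_tier(ds) for ds in inventory}
```

Writing the rules down in this form also gives recovery tests something concrete to check against: every dataset in the top tier should be restorable from a second location.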

  3. Maintain direct control over security and compliance.

Cloud vendors secure their infrastructure, but the accountability for data protection stays with you. Regulatory requirements, encryption policies, and governance still need to be enforced internally.

What to do: Manage encryption keys internally. Choose storage locations that align with legal and corporate policies. Audit all cloud and on-prem configurations regularly to ensure nothing drifts out of compliance.

  4. Prepare for unexpected costs during recovery.

When outages occur, recovery often triggers extra expenses. Data transfers, temporary capacity, and performance degradation can all raise costs at difficult moments.

What to do: Model the financial impact of a large-scale disruption, including data retrieval, egress, and temporary-capacity fees. Factor those variables into your resilience budget and keep secondary systems ready so recovery doesn’t depend solely on on-demand cloud resources.
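A back-of-the-envelope model is often enough to start the budgeting conversation. The sketch below adds up the three cost drivers named above. The function name, its parameters, and every rate in the example call are placeholders, not actual provider pricing; substitute figures from your own contracts.

```python
def recovery_cost_estimate(data_tb: float,
                           egress_per_gb: float,
                           retrieval_per_gb: float,
                           temp_capacity_per_hour: float,
                           outage_hours: float) -> dict:
    """Rough recovery cost broken down by driver, in one currency unit."""
    data_gb = data_tb * 1000  # decimal terabytes to gigabytes
    costs = {
        "egress": data_gb * egress_per_gb,
        "retrieval": data_gb * retrieval_per_gb,
        "temporary_capacity": temp_capacity_per_hour * outage_hours,
    }
    costs["total"] = sum(costs.values())
    return costs


# Placeholder scenario: 50 TB to move, a 48-hour disruption,
# and made-up rates that stand in for real contract pricing.
estimate = recovery_cost_estimate(
    data_tb=50,
    egress_per_gb=0.09,
    retrieval_per_gb=0.01,
    temp_capacity_per_hour=120,
    outage_hours=48,
)
```

Running the same function over a few scenarios (a short regional outage, a multi-day one, a full migration) shows quickly which driver dominates and where a standing secondary system would pay for itself.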

  5. Combine agility with control.

Cloud architectures deliver flexibility, while on‑premises systems provide consistency and oversight. The strongest organizations apply both principles, bringing cloud-like agility to wherever their data lives. Workloads run where they perform best and can shift when conditions change.

What to do: Identify which workloads need elasticity and which require consistency or governance. Place them accordingly, and design your environment so data and applications can move between clouds, private infrastructures, or sites with minimal reconfiguration.

Cloud computing remains a powerful model for growth and innovation. The goal is to build enough independence to keep operating when part of the system fails. Resilient design provides that independence.

When infrastructure has multiple paths, interruptions become manageable instead of crippling. True resilience comes from preparation and deliberate architecture, not from provider guarantees.

Resilience is built, not borrowed

The recent AWS outage revealed how fragile the internet can be when too much depends on one region or one platform. A single technical fault affected companies worldwide, showing how concentrated modern infrastructure has become.

Scality has spent more than fifteen years helping organizations avoid that kind of exposure. Our data storage software supports hybrid, distributed, and S3‑compatible environments that keep data accessible even when external services fail.

As it always does, AWS will return to full strength. But for many of the businesses affected, the damage is already done, and the lesson is clear.

Lasting resilience comes from freedom of choice. It means having the option to use cloud resources when they add value. To keep essential workloads close. To make movement between environments simple and reliable.

AWS remains an excellent platform, but the most resilient organizations don’t rely on it alone.