24 Hours Seeking the Answer: When the Cloud Pillars Trembled and AWS Froze the Internet
- 22/10/2025
1. What Happened at the US-EAST-1 Hotspot?
In the early hours of October 20, 2025 (U.S. time) — around 11 AM to 4 PM in Vietnam — major Internet services across the globe suddenly went offline due to a severe outage originating from AWS’s US-EAST-1 (Northern Virginia) region.
This region is the backbone of AWS’s global infrastructure. Even a minor glitch here can cascade into massive disruption. Core AWS services such as EC2 (virtual servers) and DynamoDB (NoSQL database) became unresponsive, triggering a domino effect across thousands of applications worldwide.
As a result, millions of users lost access to major platforms including Snapchat, Fortnite, Duolingo, Canva, Wordle, Slack, Monday.com, Zoom, and even financial and public services such as Lloyds, Barclays, Bank of Scotland, HMRC, and Vodafone.
According to AWS’s service status page, the root cause was a DNS-related issue affecting DynamoDB, a core database service underpinning numerous AWS applications.
DNS (Domain Name System) converts domain names into IP addresses, enabling browsers and applications to locate and connect to the right servers. When this translation layer fails, applications can’t reach critical services like DynamoDB — leading to connection errors, data retrieval failures, and ultimately, widespread service disruption.
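The failure mode is easy to sketch. In the toy below, a hypothetical resolver table stands in for real DNS (the endpoint name mirrors AWS's public naming scheme, but the IP is a reserved documentation address): when the record is missing, every call fails before it ever reaches a server, even one that is perfectly healthy.

```python
# Toy resolver table standing in for DNS. The endpoint name mirrors
# AWS's public naming scheme; the IP is a made-up documentation address.
DNS_TABLE = {"dynamodb.us-east-1.amazonaws.com": "203.0.113.10"}

def resolve(hostname: str) -> str:
    """Map a hostname to an IP, raising when the record cannot be found."""
    try:
        return DNS_TABLE[hostname]
    except KeyError:
        # This is roughly what dependent applications experienced on
        # October 20: the name would not resolve, so requests failed
        # before ever reaching a (possibly healthy) DynamoDB server.
        raise ConnectionError(f"DNS resolution failed for {hostname}")
```

The point of the sketch: the data behind the endpoint can be fully intact, yet to every client the service is simply gone.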
2. Technical Dissection: Tracing the Root Cause
Below is a summarized cause-and-effect analysis based on observable technical signals and historical incidents.
(Note: AWS has not yet released a full post-incident report; the following remains informed speculation.)
2.1. Network Control Plane or DNS Failure (Route 53)
The most likely culprit: DNS resolution failure preventing services from locating DynamoDB endpoints.
In a microservices architecture, a single fault in Route 53 or the network control plane can quickly propagate to other layers such as Load Balancers (ELB) and API Gateways, resulting in high 5xx error rates, login failures, and blocked resource initialization.
→ Estimated likelihood: 70–80%
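One reason such failures propagate so violently is client behavior: when thousands of callers retry in lockstep, they hammer the recovering service. A common mitigation, sketched here with stdlib Python (function names are illustrative, not any AWS SDK API), is capped exponential backoff with jitter:

```python
import random
import time

def call_with_backoff(fn, retries=4, base=0.5, cap=8.0):
    """Retry a flaky call with capped exponential backoff plus full jitter.

    Without jitter, clients that failed at the same moment retry at the
    same moment, producing a "retry storm" that can keep a recovering
    control plane overloaded long after the original fault is fixed.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == retries:
                raise  # budget exhausted; surface the error to the caller
            delay = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, delay))  # full jitter
```

This does not fix a DNS outage, but it shapes the traffic so recovery is not sabotaged by the clients themselves.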
2.2. Faulty Automated Configuration Deployment
AWS operates its infrastructure using “Infrastructure as Code” (IaC) — meaning updates to network, security, and orchestration are automated through pipelines.
A misconfiguration (for example, routing rules in VPC, IAM policies, or network parameters) could cause desynchronization between the control and data planes, leading to system-wide disruptions.
→ Estimated likelihood: 15–20%
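The standard defense against this class of failure is validating a change before the pipeline applies it. Real IaC pipelines use much richer policy engines; the sketch below (with an entirely hypothetical config shape) only shows the principle of rejecting a bad change before it reaches the control plane.

```python
def validate_network_config(config: dict) -> list[str]:
    """Pre-deployment sanity checks on a hypothetical network config.

    Returns a list of error strings; an empty list means the change may
    proceed. This stands in for the policy checks a real IaC pipeline
    would run before pushing config to the control plane.
    """
    errors = []
    for route in config.get("routes", []):
        # A route missing either field would silently black-hole traffic.
        if "destination" not in route or "target" not in route:
            errors.append(f"incomplete route: {route}")
    if not config.get("dns_endpoints"):
        errors.append("no DNS endpoints defined: services would be unreachable")
    return errors
```

The design choice worth noting: the validator returns all errors at once rather than failing on the first, so operators see the full blast radius of a bad change before anything ships.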
2.3. Data Replication Inconsistency
While not the primary cause, temporary inconsistencies between database replicas could have worsened the issue. Unstable endpoints often trigger timeouts, authentication failures, and session losses, making applications appear frozen.
→ Estimated likelihood: 5–10%
2.4. Indirect Impact from Unusual Traffic or Third-Party Providers
Though AWS and security agencies have dismissed large-scale cyberattack speculation, an unexpected traffic surge (from CDNs, APIs, or third-party DNS providers) may have amplified the outage by overloading existing system vulnerabilities.
→ Estimated likelihood: <5%
3. Damage Beyond Numbers
With AWS serving as the cloud backbone for over 90% of Fortune 100 companies, the outage’s impact reaches far beyond temporary user frustration. It exposes how deeply the global digital economy depends on a single cloud provider — a risk that could cost billions of dollars worldwide in downtime, lost productivity, and reputational damage.
Industry studies estimate that major Internet disruptions cause multi-billion-dollar losses annually. According to DataCentre Magazine (2024):
- 76% of global enterprises run their applications on AWS.
- 48% of developers integrate AWS into their software pipelines.
In this context, the question is no longer “Can AWS go down?” but rather “How devastating will it be when it does?”
From a cybersecurity perspective, this incident highlights the classic Single Point of Failure (SPOF) problem — a single flaw that can cripple an entire system.
In this case, a DNS failure in one critical node paralyzed countless dependent services, even though the actual data remained intact. Centralizing infrastructure in “mega regions” like US-EAST-1 magnifies this risk: high interconnectivity equals wider, deeper impact.
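Removing the SPOF means no single region can take the application down. A minimal sketch of region failover, assuming an ordered preference list and a `fetch` callable that raises when a region is unreachable (both hypothetical names, not an AWS API):

```python
def failover_read(regions, fetch):
    """Try each region in preference order; return the first success.

    `regions` is an ordered list of region names and `fetch(region)` is
    any callable that raises ConnectionError when that region is down.
    Failover alone is not enough in practice: the data must also be
    replicated across regions (e.g. via globally replicated tables),
    or the fallback region has nothing to serve.
    """
    last_exc = None
    for region in regions:
        try:
            return fetch(region)
        except ConnectionError as exc:
            last_exc = exc  # remember the failure, try the next region
    raise ConnectionError("all regions unavailable") from last_exc
```

With US-EAST-1 first in the list, a healthy fallback region turns a global outage into a latency blip — exactly the resilience the rest of this section argues for.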
4. Lessons from the Cloud Collapse
This was not an isolated event. Historically, most AWS global outages trace back to US-EAST-1. Just a year earlier, the world witnessed the CrowdStrike Incident (July 19, 2024) — another wake-up call for the fragility of global digital supply chains.
Both cases reveal a dangerous dependency on a handful of core providers, a systemic risk many organizations still underestimate.
My takeaway — and perhaps a call to action for the next generation of engineers and IT leaders — is clear:
As we build deeper into the tri-pillar of the cloud — AWS, Microsoft, and Google — our responsibility must shift from reliance to resilience.
Distribute risk. Build redundancy. Don’t wait for the next outage to realize how fragile the cloud really is.