
Multi-AZ resilience: Why the recent AWS outage shows you need it

Don't wait for the next outage to expose gaps in your critical architecture.


AWS experienced significant challenges this week when a failure in its US-EAST-1 region triggered a widespread outage, ultimately disrupting a substantial portion of global digital infrastructure.

Now that the dust has settled, what can we actually learn from this disaster to make it a little less painful for everyone involved next time round?


The wake-up call

On October 20, 2025, a massive AWS outage disrupted services for millions of users worldwide. It began shortly after 3 am ET and was traced to a DNS resolution issue affecting DynamoDB, a core AWS database service, in US-EAST-1. The failure took down services for a wide array of companies, including Snapchat, Roblox, major banks, and healthcare providers. Services didn't fully recover until Monday evening.

The impact was staggering. Outage monitor Downdetector indicated there had been more than 6.5 million reports globally, with upwards of 1,000 companies affected. Banking apps stopped working, and gamers couldn't access their platforms. Even Amazon's own internal systems went dark, leaving warehouse workers standing idle.

But here's the critical question:

Why did businesses that were supposedly using AWS's 'highly available' infrastructure experience downtime?

The myth of cloud infallibility

Many organizations believe that deploying their applications to AWS automatically means they're protected from outages. After all, each AWS region consists of a minimum of three isolated and physically separate availability zones (AZs), each with independent power and connected via redundant, ultra-low-latency networks. Customers are encouraged to design applications to run in multiple AZs.

But during the outage, even businesses spread across multiple availability zones within US-EAST-1 experienced downtime. Why? Because they were all dependent on centralized services.

The problem is that even if European regions were unaffected in terms of their own availability zones, dependencies on infrastructure or control-plane features located in US-EAST-1 could still cause knock-on impact. Critical services like IAM, account management, and some control APIs are served from US-EAST-1, regardless of where your workloads run.

The hidden single point of failure

The real vulnerability isn't just having resources in one availability zone—it's about having all your eggs in one regional basket. When a foundational service like DynamoDB in a single region experiences issues, the blast radius can be enormous!

And this isn't an isolated incident. Major internet outages are becoming more frequent (and costly), driven by growing reliance on digital infrastructure and the concentration of services among a few dominant cloud providers. The number of significant incidents has surged from just a handful per decade in the network's early years to over 80 in the first half of the 2020s.

True multi-AZ architecture: Beyond the basics

This is where proper load balancing architecture becomes critical. It's not enough to simply deploy EC2 instances across multiple availability zones and hope for the best. You need intelligent traffic management that can:

  1. Actively monitor service health across all availability zones
  2. Automatically reroute traffic when an AZ experiences degradation
  3. Maintain session persistence during failover events
  4. Provide independent control planes that don't rely on centralized services

The Loadbalancer.dk solution

Thankfully, Loadbalancer.dk's Enterprise AWS solution offers a sophisticated approach to multi-AZ resilience that goes beyond basic AWS load balancing.

Let's explore how it works.

Dual Availability Zone (AZ) deployment

The Enterprise AWS appliance supports a deployment model specifically designed for true multi-AZ resilience. Like so:

Primary 1 and Primary 2 configuration

  • Two load balancer instances are deployed, each in a different availability zone
  • Each instance has its own Virtual IP (VIP) that is locally active
  • Only one VIP is made available via an associated Elastic IP (EIP) at any given time
  • Regular health checks monitor EIP availability across both zones

What happens during an AZ failure?

  • If the availability zone hosting the active EIP association fails, the peer instance in the healthy AZ automatically detects the failure
  • The EIP is reassociated with the VIP on the healthy instance
  • Traffic continues to flow to available backend servers
  • No manual intervention required

Independent backend servers across AZs

The real power of this architecture becomes apparent when you distribute your backend infrastructure:

Primary 1 (AZ-1) → Manages traffic to:
  ├─ Server 1 (AZ-1)
  ├─ Server 2 (AZ-1)
  ├─ Server 3 (AZ-2)
  └─ Server 4 (AZ-2)

Primary 2 (AZ-2) → Manages traffic to:
  ├─ Server 1 (AZ-1)
  ├─ Server 2 (AZ-1)
  ├─ Server 3 (AZ-2)
  └─ Server 4 (AZ-2)

Under normal circumstances, whether the EIP is associated with Primary 1 or Primary 2, all four backend servers remain available. But if AZ-1 completely fails:

  • Primary 1 becomes unreachable
  • Servers 1 and 2 in AZ-1 become unavailable
  • Primary 2 automatically associates the EIP with its VIP
  • Services continue via Servers 3 and 4 in AZ-2
  • Your application remains online with 50% capacity instead of 0%
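The capacity arithmetic above is worth making concrete. A small sketch (server and AZ names are illustrative) of how the healthy backend pool shrinks when an AZ fails:

```python
# Backend pool spanning two AZs, as in the diagram above.
servers = {
    "server-1": "az-1",
    "server-2": "az-1",
    "server-3": "az-2",
    "server-4": "az-2",
}

def available(servers: dict[str, str], failed_azs: set[str]) -> list[str]:
    """Servers still reachable after the given AZs fail."""
    return [name for name, az in servers.items() if az not in failed_azs]

up = available(servers, failed_azs={"az-1"})
print(up)                                            # ['server-3', 'server-4']
print(f"{100 * len(up) // len(servers)}% capacity")  # 50% capacity
```

Half capacity may mean degraded performance, but it is the difference between a slow checkout page and no checkout page at all.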

Advanced health checking

The health checking mechanism is sophisticated and configurable:

  • Check interval: Customizable monitoring frequency (default settings balance responsiveness with stability)
  • Failure count: Requires multiple consecutive failures before triggering failover (default: 2) to avoid false positives from transient network issues

This prevents the "flapping" problem where services bounce back and forth between zones due to temporary network hiccups.
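The anti-flapping behavior is essentially a consecutive-failure counter. A minimal sketch of the idea (the `HealthMonitor` class is a hypothetical name, not the appliance's API; only the default of 2 consecutive failures comes from the text above):

```python
class HealthMonitor:
    """Declare an AZ down only after `failure_count` consecutive
    failed checks, so a single transient miss never triggers failover."""

    def __init__(self, failure_count: int = 2):
        self.failure_count = failure_count
        self.consecutive_failures = 0

    def record(self, check_ok: bool) -> bool:
        """Feed one health-check result; return True if failover should trigger."""
        if check_ok:
            self.consecutive_failures = 0  # any success resets the counter
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_count

mon = HealthMonitor(failure_count=2)
print(mon.record(False))  # False -- one transient miss, no failover
print(mon.record(True))   # False -- recovered, counter reset
print(mon.record(False))  # False
print(mon.record(False))  # True  -- two consecutive misses: fail over
```

Tuning `failure_count` against the check interval is the usual trade-off: lower values fail over faster, higher values tolerate noisier networks.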

Automatic routing table updates

For Layer 4 NAT mode or Layer 7 transparent proxy configurations, the solution can automatically update AWS routing tables during failover. This ensures that return traffic always flows back through the active load balancer, maintaining connection integrity.

The failover script capability allows you to execute AWS CLI commands automatically:

aws ec2 replace-route --route-table-id rtb-xxxxx \
--destination-cidr-block 0.0.0.0/0 \
--instance-id i-xxxxx --region us-east-1

This level of automation means your infrastructure adapts to failures without human intervention.

The control plane advantage

Unlike native AWS load balancing services that can be affected by control plane issues in US-EAST-1, Loadbalancer.dk appliances have their own independent control plane. Each appliance operates autonomously, making its own decisions about traffic routing based on local health checks and configurations.

This independence is crucial. During the outage, Amazon confirmed that AWS customer support operations were impacted, meaning customers couldn't even report problems through the automated support system. With an independent control plane, your load balancers continue functioning and making intelligent routing decisions even when AWS's centralized services are degraded.

Beyond AWS: True multi-cloud resilience

For organizations requiring the highest levels of availability, Loadbalancer.dk's Global Server Load Balancing (GSLB) capability enables multi-region and even multi-cloud architectures.

With it, you can:

  • Balance traffic between AWS regions (e.g., US-EAST-1 and EU-WEST-1)
  • Implement geographic load balancing for performance optimization
  • Create failover relationships between AWS and other cloud providers
  • Maintain full control over traffic distribution policies

This approach addresses the broader issue identified during the outage: A single cloud is a single point of failure. Spreading workloads across cloud service providers ensures connectivity during an outage.

The cost of downtime vs. the cost of resilience

Let's talk about the elephant in the room: cost. Yes, running resources in multiple availability zones increases your infrastructure spend. But consider the alternative.

The July 2024 CrowdStrike incident cost Fortune 500 companies an estimated $5 billion in direct losses. This week's AWS outage affected banking services, healthcare enrollment systems, airline operations, and countless e-commerce platforms. The financial impact of multi-hour downtime far exceeds the additional monthly cost of true multi-AZ architecture.

And our Enterprise AWS licensing is flexible:

  • Pay-As-You-Go (PAYG): Hourly billing for complete cost control
  • Annual subscription: 15% savings for long-term deployments
  • Bring-Your-Own-License (BYOL): Perpetual licensing with the Freedom License that allows migration between platforms

The investment in proper load balancing infrastructure provides insurance against events like this one.

Implementation: Easier than you think

Deploying a multi-AZ resilient architecture with Loadbalancer.dk is straightforward:

  1. Deploy two Enterprise AWS instances in different availability zones
  2. Configure synchronization between Primary 1 and Primary 2 (automated via WebUI)
  3. Set up corresponding VIPs on both instances pointing to your backend servers
  4. Associate Elastic IPs with both instances using AZ HA mode
  5. Configure health check parameters to match your application requirements
  6. Test failover to validate the setup

The entire process can be completed in under an hour for most deployments. The WebUI provides clear, intuitive configuration options that don't require deep networking expertise.

Real-world use cases

This architecture is battle-tested across multiple industries:

  • Financial services: Banks and payment processors can't afford downtime. Multi-AZ load balancing ensures transaction processing continues even during regional disruptions.
  • Healthcare systems: Patient care systems using protocols like DICOM and HL7 require constant availability. Geographic distribution protects against localized failures.
  • E-Commerce platforms: Shopping cart abandonment costs are measured in millions. Maintaining uptime during traffic spikes and infrastructure issues directly impacts revenue.
  • Gaming platforms: Player experience is paramount. Connection stability across AZs maintains engagement even when individual zones experience issues.

Monitoring and visibility

You can't manage what you can't measure. The Enterprise AWS solution provides comprehensive visibility:

  • Real-time EIP status showing which instance currently holds each EIP association
  • Health check history revealing patterns in backend server availability
  • Traffic statistics across both load balancer instances
  • Automated alerts for failover events

This visibility enables proactive management and helps you understand exactly how your infrastructure behaves during normal operations and failure scenarios.

Lessons learned

One AWS expert had predicted significant outages as far back as 2024, citing the increasing frequency of Large Scale Events (though, by their own admission, underestimating how long inertia could keep things running). The pace of senior AWS departures and increasing complexity make future outages inevitable.

The question isn't 'if' your cloud provider will experience another major outage—it's 'when', and 'will your business survive it'?

Take action now

The recent outage was a reminder that even the most sophisticated cloud infrastructure has vulnerabilities. Organizations that invested in proper multi-AZ architecture with independent load balancing continued serving their customers while others went dark.

The Enterprise AWS solution from Loadbalancer.dk provides the resilience layer that AWS alone cannot guarantee. With independent control planes, sophisticated health checking, automated failover, and support for multi-region and multi-cloud architectures, it's the insurance policy your business needs.

Don't wait for the next outage to expose gaps in your architecture. The best time to implement true multi-AZ resilience was yesterday. The second-best time is today.

In a nutshell (TL;DR)

  1. Single-AZ deployments are unacceptable for production workloads requiring high availability
  2. Multi-AZ within a single region isn't enough if centralized control plane services can fail
  3. Intelligent load balancing with independent health checking is essential
  4. Automated failover must work without relying on potentially affected AWS services
  5. Regular testing of failover scenarios ensures your architecture performs when needed

Ready to build resilient architecture?

We're here to help