Disaster Recovery on AWS: Automated Multi-Region DR

Disaster recovery on AWS with multi-region DR automation and IaC: improve resiliency, reduce RTO/RPO and ensure business continuity


In recent years, resilience has gone from being a "nice to have" to a critical requirement for any digital platform. Regional failures are rare, but they do happen, as do incidents caused by human error, and when they occur the impact on the business can be enormous if you are not prepared.

In this article, I want to share a recent, successful project in which we at Unikal Tech Partners automated the complete recovery of an AWS environment deployed in a primary region (Region A) to a secondary region (Region B), using Infrastructure as Code (IaC) and native AWS services.


Disaster Recovery on AWS: Objectives, Challenges, and the Real Environment

The main objective was clear: to recover the platform quickly, repeatably, and without manual intervention, even in the event of a major failure of an entire region.

The main challenge was the following: regional recovery without improvisation.

The original environment in Region A included mainly the following elements:

  • Applications deployed on EC2 behind Application Load Balancers
  • Databases managed on Amazon RDS
  • Object storage on Amazon S3
  • Queuing services (Amazon SQS) and notifications (Amazon SNS)
  • Network configuration with VPCs, subnets, gateways, and security rules
  • Security and compliance services, since the environment is certified at the High level of the ENS (Spain's National Security Scheme)

Because this is a critical production environment, the dependencies between services had to be taken into account. One of the main premises set by the client was the following:

"If the region becomes unavailable, we don't want to rebuild the environment by hand."

Major challenges in multi-region disaster recovery

The main challenges facing the company's CIO were as follows:

  • Reduce the actual RTO (Recovery Time Objective): the disaster recovery methodology in use at the time could not meet the required RTO.
  • Minimize human error in a crisis, whether because nobody with the necessary knowledge is available to restore the environment or because mistakes are made while the business is pressing for an immediate solution.
  • Ensure that the infrastructure in Region B is identical to and consistent with Region A: the SLAs committed to customers did not tolerate any degradation of service, which would trigger financial penalties.
  • Test the recovery plan periodically, without affecting production, and with guarantees that the results are realistic.
  • Adapt the recovery plan to changes in the production ecosystem in an easy and controlled manner, so that the environment deployed in Region B always matches the environment in Region A.

Automated Disaster Recovery Strategy with IaC on AWS

Among the available disaster recovery options, we opted for a multi-region active/passive strategy, in which Region B stays ready to bring up the entire environment on demand. Despite the criticality of the environment, we discarded active-active modes after weighing the trade-off between RTO, RPO, and recurring costs.
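To make that trade-off more tangible, here is a rough Python sketch of how RTO and worst-case RPO could be estimated for a snapshot-based active/passive design. The stage names and every duration are hypothetical placeholders, not measurements from this project.

```python
from datetime import timedelta


def estimate_rto(stage_durations: dict, dns_ttl: timedelta) -> timedelta:
    """Recovery stages run one after another; clients may then wait up to one DNS TTL."""
    return sum(stage_durations.values(), timedelta()) + dns_ttl


def worst_case_rpo(snapshot_interval: timedelta, copy_duration: timedelta) -> timedelta:
    """Worst case: the region fails just before the next snapshot copy lands in Region B."""
    return snapshot_interval + copy_duration


# Hypothetical figures, for illustration only.
stages = {
    "deploy_infrastructure": timedelta(minutes=10),
    "restore_databases": timedelta(minutes=20),
    "health_checks": timedelta(minutes=5),
}
print("Estimated RTO:", estimate_rto(stages, dns_ttl=timedelta(seconds=60)))
print("Worst-case RPO:", worst_case_rpo(timedelta(hours=24), timedelta(minutes=30)))
```

An active-active design would push both numbers towards zero, at the price of running a second full environment permanently.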

The pillars of the solution were:

1. Infrastructure as Code is the foundation of everything

The entire infrastructure was defined as code using Terraform (although the approach is equally valid with AWS CloudFormation):

  • VPCs, subnets, route tables
  • Security Groups and NACLs
  • Load Balancers and Target Groups
  • EC2, Auto Scaling Groups
  • RDS and dependencies
  • IAM roles and policies
  • Security and compliance configurations and services required for ENS High

Our guiding principle was as follows: nothing is created manually. If it isn’t in the code, it doesn’t exist.

This enabled us to:

  • Replicate the environment in any region
  • Version changes
  • Execute reproducible and auditable deployments
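The idea of a single definition that can be pointed at any region can be sketched in a few lines of Python. This is only an illustration of the principle, not the project's Terraform code: the `EnvironmentSpec` fields, the values, and the region names are all assumptions.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical, heavily simplified stand-in for an IaC definition."""
    region: str
    vpc_cidr: str
    instance_type: str
    asg_min_size: int
    asg_max_size: int
    db_instance_class: str


def for_region(spec: EnvironmentSpec, region: str) -> EnvironmentSpec:
    """Return the same definition, targeted at another region."""
    return replace(spec, region=region)


def differing_fields(a: EnvironmentSpec, b: EnvironmentSpec) -> list:
    """List the field names whose values differ between two specs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Region names below are placeholders for "Region A" and "Region B".
primary = EnvironmentSpec("eu-west-1", "10.0.0.0/16", "m5.large", 2, 6, "db.m5.large")
recovery = for_region(primary, "eu-central-1")
print(differing_fields(primary, recovery))
```

If anything other than the region shows up in the difference, the two environments have drifted apart.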

2. Data synchronisation and preparation

When it comes to data, we took different approaches depending on the service:

1. Amazon S3

Due to the large volumes of data stored in S3, it proved impossible to restore the buckets within the RTO, so we decided to:

  • Enable cross-region replication
  • Enable versioning for added protection
  • Ensure that the buckets in Region B were always ready
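As a minimal sketch of what such a setup involves, the Python below builds the two settings as plain dictionaries. They follow the shape that boto3's `put_bucket_versioning` and `put_bucket_replication` calls accept, but the rule ID, role ARN, and bucket ARN are hypothetical, and in the project itself this would live in the IaC definition rather than in a script. Note that S3 replication requires versioning on both the source and the destination bucket.

```python
# Versioning has to be enabled on both buckets before replication can be configured.
VERSIONING_CONFIG = {"Status": "Enabled"}


def build_replication_config(role_arn: str, destination_bucket_arn: str) -> dict:
    """Build a replication configuration that copies every object to the Region B bucket."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "replicate-to-region-b",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix: the rule applies to all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": destination_bucket_arn},
            }
        ],
    }


# Hypothetical ARNs, for illustration only.
config = build_replication_config(
    role_arn="arn:aws:iam::123456789012:role/s3-replication-role",
    destination_bucket_arn="arn:aws:s3:::example-bucket-region-b",
)
print(config["Rules"][0]["Destination"]["Bucket"])
```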

2. Amazon RDS

Since we had ruled out an active-active setup with databases permanently running in both regions, the methodology was as follows:

  • Using automatic snapshots
  • Copying snapshots to Region B
  • Defining IaC to restore RDS instances from the latest available snapshot.
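Choosing "the latest available snapshot" is the step most worth getting right, because a copy that is still in progress cannot be restored. The sketch below shows one way to make that choice in Python. It assumes snapshot records shaped like the entries returned by the RDS `describe_db_snapshots` API (`DBSnapshotIdentifier`, `Status`, `SnapshotCreateTime`); the identifiers and dates are invented.

```python
from datetime import datetime, timezone
from typing import Optional


def latest_available_snapshot(snapshots: list) -> Optional[dict]:
    """Return the newest snapshot whose status is 'available', or None if there is none."""
    usable = [s for s in snapshots if s.get("Status") == "available"]
    return max(usable, key=lambda s: s["SnapshotCreateTime"], default=None)


# Hypothetical snapshot copies visible in Region B.
copies = [
    {"DBSnapshotIdentifier": "dr-copy-0101", "Status": "available",
     "SnapshotCreateTime": datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)},
    {"DBSnapshotIdentifier": "dr-copy-0102", "Status": "available",
     "SnapshotCreateTime": datetime(2024, 1, 2, 2, 0, tzinfo=timezone.utc)},
    {"DBSnapshotIdentifier": "dr-copy-0103", "Status": "copying",  # not restorable yet
     "SnapshotCreateTime": datetime(2024, 1, 3, 2, 0, tzinfo=timezone.utc)},
]
chosen = latest_available_snapshot(copies)
print(chosen["DBSnapshotIdentifier"] if chosen else "no usable snapshot")
```

The age of the chosen snapshot at the moment of failure is, in practice, the data loss the business has to accept.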

3. EC2

  • Automated creation and copying of AMIs to Region B
  • The AMIs were used as the basis for Auto Scaling Groups

3. Automation of failover

One of the key aspects of the project was that the DR should not rely on manual commands. We created an automated pipeline that:

  • Detects the recovery scenario
  • Performs a full deployment in Region B from IaC
  • Restores databases from the latest snapshots
  • Launches instances and load balancers
  • Performs basic health checks

The whole process could be initiated with a single controlled action.
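The shape of such a pipeline can be sketched as an ordered list of steps that stops at the first failure. This is an illustrative Python outline, not the pipeline we built: the step names mirror the list above, and the lambdas are placeholders for the real actions (running the IaC deployment, restoring databases, and so on).

```python
from typing import Callable, List, Tuple


def run_failover(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run each step in order and stop at the first one that reports failure.

    Returns the names of the completed steps; raises RuntimeError on failure.
    """
    completed = []
    for name, action in steps:
        if not action():
            raise RuntimeError(f"Failover stopped: step '{name}' failed after {completed}")
        completed.append(name)
    return completed


# Placeholder actions: each one would wrap a real operation and return True on success.
steps = [
    ("deploy_infrastructure", lambda: True),
    ("restore_databases", lambda: True),
    ("launch_compute", lambda: True),
    ("health_checks", lambda: True),
]
print(run_failover(steps))
```

Stopping on the first failure matters because later steps depend on earlier ones: there is no point launching instances if the database restore did not succeed.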

4. Traffic management and DNS

For routing:

  • We used Amazon Route 53
  • DNS records were configured so that they could be pointed to Region B
  • TTLs were adjusted to minimise the impact of the switchover

In the event of a regional outage, traffic is switched over quickly and in a controlled manner.
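As an illustration of the switchover itself, the Python below builds a change batch in the shape accepted by the Route 53 `change_resource_record_sets` API. It assumes a plain CNAME record with a short TTL; the record name and the load balancer hostname are hypothetical. An alias record pointing at the load balancer would be a reasonable alternative, in which case no TTL is set on the record.

```python
def build_failover_change_batch(record_name: str, region_b_target: str,
                                ttl_seconds: int = 60) -> dict:
    """Build a Route 53 change batch that repoints a CNAME at the Region B endpoint."""
    return {
        "Comment": "DR switchover to Region B",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record, or overwrite it if it exists
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl_seconds,  # a short TTL limits how long clients cache the old target
                    "ResourceRecords": [{"Value": region_b_target}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
batch = build_failover_change_batch(
    record_name="app.example.com.",
    region_b_target="dr-alb-123456.eu-central-1.elb.amazonaws.com",
)
print(batch["Changes"][0]["ResourceRecordSet"]["TTL"])
```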

Real results of automated disaster recovery

Thanks to this approach, the customer can now:

  • Recover the entire environment in Region B in minutes
  • Drastically reduce RTO compared with manual deployment
  • Eliminate human error at critical moments
  • Test the DR plan periodically and safely
  • Rely on living documentation: the code itself is the documentation

In addition, the use of IaC kept costs under control, since Region B consumes only minimal resources (storage and backups) until the recovery plan is activated.

5 Key lessons in disaster recovery projects on AWS

Some key conclusions from the project:

  1. If it's not automated, it's not real DR.
  2. Infrastructure as Code is not just for deployments; it's a resiliency tool.
  3. Testing the DR is as important as designing it.
  4. An outdated DR is not a useful DR.
  5. AWS provides all the services needed, but the value is in how they are integrated.


Conclusion

Disaster recovery should not be a document forgotten in a drawer. It should be a live, tested, and automated process. AWS, combined with Infrastructure as Code, allows you to build high-availability and regional-recovery solutions elegantly, securely, and efficiently.

If your platform still relies on manual steps to recover from a major failure, it's probably not as ready as you think. At Unikal Tech Partners, we invite you to review your AWS Disaster Recovery Plan with us, so that we can analyze whether it truly meets the SLAs set by the business.


 

Carlos Valverde  

 
