
Disaster Recovery with Terraform, AWS and a few lessons learned

· fred


What would happen if you unexpectedly had to build your entire production infrastructure from scratch? Would you be able to perform a full recovery of all services and dependencies to an acceptable level? How long would it take? Hours? Days? Would the engineering team know what to do? What problems would you encounter? What about full data recovery and databases? Are backups available? How would you manage operations and set expectations across the business and with clients?

This is the kind of nightmare situation that keeps any SRE awake at night, especially if you're running a SaaS platform. There is a common perception that these events are something out of the Black Swan Theory: they have a profound impact when they (rarely) happen and always arrive as a surprise. But they are less rare than we think. In the last couple of years I've seen major security incidents in the news, human errors, natural disasters and fiber cuts causing mayhem, affecting companies of all sizes. The list goes on. Some of them can be prevented with good planning and by piggybacking on what public cloud providers already offer (think good security practices and tools, automated backups, multiple regions or availability zones, etc.), but on many occasions such events are triggered by something over which we have absolutely no control, and they are not on the same level as dealing with a single system or database outage. At Signal, we successfully planned and completed an end-to-end test of a Disaster Recovery plan during a game day, using Terraform and a few automation tools. It was hard, thorough and mostly "invisible" work, but it's worth sharing a few of the lessons and outcomes we learned, as I think they could be useful to others considering this journey. There's more than one way to do it and I'm curious to know about other experiences.

The doomsday scenario

Before we started to work on this, we had to clearly define our scenario: what situation do we want to recover from? Since we run our infrastructure on AWS, we assumed that all of the AWS accounts used for production services were compromised or completely lost and we would need to recover from backups alone, using a new AWS account (no warm standby architectures or multi-region setups for now). This might be a Black Swan event, but working towards a total failure scenario gives you the confidence to restore any particular service or system.

Playbooks

So how do you recover? You basically execute a playbook, which is usually tested in a Game Day. A playbook should contain all the information and the steps necessary to do the recovery. How do you plan a playbook? After mapping the order in which each system had to be provisioned, we ran a terraform apply on each one of them until we completed the last system. In order to have a working playbook that you are confident in:

  • Each system has to be provisioned by Terraform in one go: there must not be any errors and you should not need multiple Terraform runs. If there are errors, they have to be fixed before moving on to the next system. If it's a temporary issue like a network timeout, do it again. Repeat.
  • Measure the time of each step. With many systems to build, this could take days, so you'll have to time them to get an approximate figure for how long everything will take. This is important because some steps take longer than others, such as restoring a large RDS database or doing an application deployment outside Terraform. This will help in calculating your RTO so it can be tested during a game day.
  • Automate every non-Terraform step to make it easier and simpler (scripts are good). It's worth spending the time to simplify systems and tools to make things easier.
  • Write everything down. Don't assume that the person doing the recovery will know everything you know. Clear instructions, including for anything that can't be automated, are essential for preventing errors and delays.
  • Identify systems that can be recovered immediately. If all dependencies are met, you can build systems in parallel, making the recovery faster.
  • Do it a couple of times so you are confident that provisioning is clean and all dependencies are met.

Game day

I’m not going to be exhaustive on how to run a Game Day since there is abundant information out there. This is what we’ve learned:

  • Have your RTO, RPO or other goals clearly defined before you put a date in the calendar
  • Make sure everything has been tested and is ready for the day (AWS account access credentials, VPN access, SSH keys, all the information each person will need). This prevents wasted time and delays before you start.
  • When preparing for the Game Day, make sure a large part of the engineering team will be able to join and help with the exercise
  • Everyone should stop what they're doing on that day and treat the drill as their only priority
  • Assign a Team Lead, who will coordinate the entire exercise and support all the teams
  • Assign a person with the sole responsibility of writing everything down, creating a log of events. A conference system, chat group or Slack channel will help with ongoing communication between teams
  • Organize people into groups. Each group is responsible for recovering systems that can be built in parallel.
  • Shortly after the Game Day, do a retrospective with the team: discuss what went well and give feedback on what went wrong. Fix the identified issues as soon as possible (from bugs to missing documentation); if you don't, you'll likely forget about them.

AWS account isolation, roles and backups

Following AWS best practices and using different AWS accounts per environment (production, testing, staging, etc.) is essential, as is using a dedicated account for user authentication via IAM, with roles on each account that can be attached to groups with different permissions. To make it more secure, enable Multi-Factor Authentication (MFA) on IAM accounts (for both the Console and the API) and enforce it on any roles being assumed in your production accounts. A quick example of how to enforce MFA on a given IAM role via Terraform:

resource "aws_iam_role" "prod_access_user" {
  name = "access-user"
  max_session_duration = 36000
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "arn:aws:iam::<id>:root"
      },
      "Effect": "Allow",
      "Condition": {
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}
EOF
}

Replace <id> with the AWS account id of the authentication account. This way, any valid IAM user will need MFA present before they can assume the role.
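
On the authentication account side, the matching permission can also live in Terraform. Here is a minimal sketch, assuming a hypothetical "engineers" group and a placeholder production account id, of a group policy that lets its members assume the role above:

resource "aws_iam_group" "engineers" {
  name = "engineers"
}

# Allows members of the group to assume the production role defined above,
# subject to the MFA condition enforced on the role itself.
resource "aws_iam_group_policy" "assume_prod_access" {
  name  = "assume-prod-access-user"
  group = aws_iam_group.engineers.name

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Resource": "arn:aws:iam::<production-account-id>:role/access-user"
    }
  ]
}
EOF
}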

For backups, we use a dedicated AWS account which is completely locked down, with very limited access and MFA enforced. It is also provisioned through Terraform, so if you're using remote state files on S3, make sure access to that bucket is very restricted. This account acts as a failsafe place to keep a replica of the data or backups used in the other accounts. So, what is going to be backed up, and how? As a principle, anything that your services depend on. It really depends on the kind of AWS services you're running, but the most common are:

  • S3 buckets. For busy buckets where it is important to have an almost exact replica of your production data, we recommend using native S3 cross-region replication, so your source buckets are (asynchronously) replicated to a destination bucket in a different region, owned by the backup account. For buckets where data is not updated that frequently, a simple batch job running every hour using the “aws s3 sync” CLI might be good enough. We recommend enabling versioning on the destination buckets, so you don't replicate DELETE operations and you protect yourself from data loss or corruption in the original bucket. Add a lifecycle policy to get rid of older versions after a certain time. Make use of different storage classes like Infrequent Access to optimize storage costs, and be prepared to pay extra charges for replication traffic to a different region. There's a short Terraform sketch of this setup after this list.

  • RDS databases. If you're using Aurora, automatic snapshots cannot be shared with other accounts, so we built a service that copies automatic snapshots to manual ones, which are then shared with the backup account. The backup account gets notified via SNS that there is a new backup from a production account and copies it to a local manual snapshot that is locked down. You should also have full dumps of every database for table-level restores; it's very common to keep them in S3. These dumps should also be replicated to the backup account.

  • EC2 EBS volumes. Same method as database snapshots: snapshots of the volumes are copied across to the backup account.

  • EC2 AMIs. The last thing you want to be doing during Disaster Recovery is spending time baking new AMIs under the new account; it will slow you down. We made changes so that every time an AMI gets baked, it is shared with the backup account, and a Lambda then copies it locally. It's very important to use AMI names instead of ids in your Terraform code, so you don't need to change it to use the “failsafe” id when recovering into a new account. Another option is to bake AMIs exclusively in the backup account and then grant the permissions to make them available to your production accounts. One other thing you'll have to do is include the account id of the backup account under owners so the AMI can be found easily:

    data "aws_ami" "my-ami" {
    filter {
      name = "name"
      values = ["my-ami-1565277065"]
    }
    owners = ["self", "backup_account_id"]
    }
    
  • SSM Parameter Store. If you're using it to store secrets and configuration, you'll need to copy them frequently to the backup account, as any service depending on those secrets will be unable to start without them. A simple Lambda function can do it.

  • DynamoDB. Unfortunately, DynamoDB service backups cannot be shared with other accounts. There are several tools for dumping DynamoDB tables, although you might hit scalability issues depending on the amount of data to be backed up. Sometimes you don't need to back up everything, as some of the data stored there is temporary. If you are using Kinesis within a data pipeline, it's important to back up the tables containing shard leaseKeys and checkpoints so you can recover new data only and avoid duplication. Dumping tables consumes provisioned read capacity, so it's always good to increase it or use on-demand mode if needed.

  • ECS / Docker containers. If you use ECR, you can ship containers to a repository in a different region owned by the backup account, or use an external registry. Make sure your ECS instances have credentials to pull containers from any repository.
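
As referenced in the S3 bullet above, here is a minimal Terraform sketch of the replication setup, assuming an older (pre-4.x) AWS provider where versioning, lifecycle rules and replication are configured inline on aws_s3_bucket; the bucket names, provider aliases and the replication IAM role are placeholders:

resource "aws_s3_bucket" "backup" {
  provider = aws.backup
  bucket   = "my-backup-bucket"
  acl      = "private"

  # Versioning protects against replicated DELETEs and corrupted objects...
  versioning {
    enabled = true
  }

  # ...while a lifecycle rule stops old versions from piling up forever.
  lifecycle_rule {
    id      = "expire-old-versions"
    enabled = true

    noncurrent_version_expiration {
      days = 90
    }
  }
}

resource "aws_s3_bucket" "source" {
  provider = aws.production
  bucket   = "my-production-bucket"
  acl      = "private"

  # Replication also requires versioning on the source bucket.
  versioning {
    enabled = true
  }

  # Asynchronous replication to the bucket owned by the backup account, in a
  # different region. aws_iam_role.replication (not shown) needs the usual
  # s3:GetReplicationConfiguration / s3:GetObjectVersion* permissions on the
  # source and s3:Replicate* on the destination.
  replication_configuration {
    role = aws_iam_role.replication.arn

    rules {
      id     = "replicate-to-backup-account"
      prefix = ""
      status = "Enabled"

      destination {
        bucket        = aws_s3_bucket.backup.arn
        storage_class = "STANDARD_IA"
      }
    }
  }
}

In a real cross-account setup the destination bucket also needs a bucket policy allowing the replication role to write to it, which is omitted here.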

This is not an exhaustive list, of course. There are also services for which backups are not needed, like SQS queues, maybe some ElastiCache Redis clusters, temporary EBS volumes, or any other service where data is ephemeral and can be easily recreated.

From a security perspective, the correct way of dealing with a locked-down backup account is to never push your data there from a production account; always pull data from the other accounts into the backup account. When using KMS keys to encrypt data at rest (S3 buckets, RDS snapshots, etc.), make sure you allow the backup account to use them for decryption. KMS keys in the backup account should never be shared with other accounts unless you need them to restore data or to perform a disaster recovery.
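
As a sketch of that last point, the key policy of a production KMS key can grant the backup account read-only use of the key. Account ids are placeholders, and the exact set of actions depends on what you copy (cross-account copies of encrypted snapshots, for example, also need kms:CreateGrant):

resource "aws_kms_key" "prod_data" {
  description = "Production data encryption key"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KeyAdministration",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<production-account-id>:root" },
      "Action": "kms:*",
      "Resource": "*"
    },
    {
      "Sid": "AllowBackupAccountToDecrypt",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<backup-account-id>:root" },
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey",
        "kms:CreateGrant"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}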

DNS

DNS is a critical part of your recovery. Using Route53 for internal domains is easy: everything will be recreated when provisioning the infrastructure with Terraform. For public domains, if you're using an external registrar, you'll need to update the nameservers to point to the newly created zones in the recovery AWS account. When using Route53 for both registration and hosting, you'll need to take control of the registration and nameservers through a new account. Using Cloudflare or another external DNS service with a Terraform provider is the easiest way, as the records will be updated automatically during provisioning. We also recommend having one or more public mockup domains that mimic your production domains, allowing you to easily test all service URLs without disruption. The most resilient setup would use at least two different external DNS services for increased availability.
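
A minimal sketch of what that looks like with the Cloudflare provider (the zone id variable and the load balancer reference are placeholders, and attribute names vary slightly between provider versions):

resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  type    = "CNAME"
  value   = aws_lb.api.dns_name
  ttl     = 300
}

Because the record only references Terraform outputs, it points at whatever load balancer was just created in the recovery account, with no manual nameserver changes.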

Terraform and system dependencies

For simple platforms with a single environment, where every resource dependency is defined in a single state file, all you need to do is a “terraform apply” and Terraform does its job, calculating the dependency graph. It's not so simple if you have isolation between environments and systems, where multiple state files on S3 are used as the glue between them. This follows best practices around reducing the blast radius of infrastructure changes and scaling provisioning across different teams. We're also using our own Terraform wrapper to make it easier to use a different public domain with no changes required in .tf files, complete account isolation, among other features. In this case, we need to manually map the dependency graph across all systems, or we'll be unable to provision every single resource from scratch without failures, let alone think about using automation. If your backend system depends on RDS databases and they are created elsewhere, you will need to provision them first; likewise if your ECS cluster requires IAM roles and security groups provisioned in another system. VPC settings, subnets and IAM service roles are the first to be applied, as almost every resource depends on them. And so on (there's a short sketch of cross-system state dependencies after the list below). The interesting thing is that infrastructure is not created in one go: it is built in small batches, by different people, over years, systems on top of other systems, sometimes creating issues which are not easily detectable unless you start from zero. Some of the issues we found doing this exercise:

  • Hardcoded values. Hardcoded values prevent us from reproducing the infrastructure on a different AWS account. Examples: AWS account ids and ARNs in IAM policies, S3 bucket names colliding with other accounts, domain names, etc. We're now using variables for all of these (mockup domain names through our wrapper when testing) and ARNs from resource outputs.
  • Snowflakes. This is common when you start building infrastructure using the AWS Console or CloudFormation and later move to Terraform: it's very easy to miss a few resources. We had to import all of them to fix dependencies.
  • Circular dependencies. Very tricky. This happens when you have systems with mutual resource dependencies. Example: the backend system owns a resource with an ARN exported to the frontend system via its state file; the frontend system uses it in an IAM policy and also has a security group that is used by the backend system. This wasn't spotted because the backend system was created first and the resource was only shared back from the frontend system later. It kept working only because “terraform apply” was being run on different systems at different times, making incremental changes to the infrastructure. We had to fix this by either merging resources into a given system or moving the dependency to a “high priority” system. The key benefit of doing this exercise was that we started to look at how to simplify systems in Terraform and make them truly independent, by removing things we didn't need, merging a few of them or moving resources around. This can be painful at times, as you'll have to be careful when importing and removing resources across state files so it doesn't trigger infrastructure changes in production.
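
As referenced above, a minimal sketch (Terraform 0.12-style syntax; the bucket, key, output and resource names are placeholders) of how a system pulls its dependencies from another system's remote state on S3 and avoids hardcoded, account-specific values:

# Outputs published by the VPC system, which therefore has to be applied first,
# exactly the ordering the playbook must capture.
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config = {
    bucket = "my-terraform-state"
    key    = "production/vpc/terraform.tfstate"
    region = "eu-west-1"
  }
}

# The id of whatever account this is applied to, instead of a hardcoded value.
data "aws_caller_identity" "current" {}

resource "aws_security_group" "backend" {
  name   = "backend"
  vpc_id = data.terraform_remote_state.vpc.outputs.vpc_id
}

resource "aws_s3_bucket" "artifacts" {
  # Unique per account, so it never collides with buckets in the old account.
  bucket = "artifacts-${data.aws_caller_identity.current.account_id}"
}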

Conclusion

The time invested in Disaster Recovery was worth it. It allowed us to be more confident in our platform, redefining our systems to make them more secure and resilient. We now have much cleaner Terraform code and systems with clear boundaries that are easier to understand. As with any growing platform, systems will change and maintaining consistency will be hard. It would be a mistake to treat this as a one-off and never touch it again. As with any process, we will need continuity and will have to do it all over again at some point in the future.

Acknowledgments

This work would not have been possible without the collaboration of our entire engineering team during the Game Day, and in particular Sam Burrell, who worked with me tirelessly along this journey. It's always a team effort.