Press "Enter" to skip to content

fmarques.org Posts

Disaster Recovery with Terraform, AWS and a few lessons learned

What would happen if you, unexpectedly, had to build your entire production infrastructure from scratch? Would you be able to perform a full recover off all services and dependencies to an acceptable level? How long would it take? Hours? Days? Would the engineering team know what to do? What problems would you encounter? What about full data recovery and databases? Are backups available? How to manage operations and set expectations across the business and clients? This is the kind of nightmare situation that keeps any SRE awake at night, specially if you’re running a SaaS platform. There is a common perception that these events are similar to something coming out of the Black Swan Theory: they can have a profound impact when they (rarely) happen and always arrive as a surprise. But they are less rare than we think. In the last couple of years, I’ve seen major security incidents…

Leave a Comment

Monitoring Docker Thin Pool Usage with Prometheus

Update (12/4/20): I highly recommend using the latest Amazon ECS-optimized Amazon Linux 2 AMI. It uses Docker’s OverlayFS (overlay2) storage driver. The same partition is used for OS, Docker images  and metadata. It’s easier to monitor filesystem usage using the Prometheus Node exporter. If you still have to use the older ECS AMI v2015.09.d or later, this article might be useful for you. There are plenty of tools out there for monitoring Docker using Prometheus. We can use the Node Exporter to gather useful information for Docker hosts at an OS/kernel level (memory, cpu, network, filesystem) and at a container level there is cAdvisor which reports resource usage and performance data. Unfortunately I couldn’t find any way of monitoring Docker Thin Pool usage with Prometheus so I wrote a quick Python script to generate usage metrics that are exposed using Node Exporter’s textfile collector. So first, what does “Thin Pool”…

Leave a Comment

Using EC2 Spot Instances with ECS

Had the opportunity to write an article for AWS Startups Blog, explaining how we use EC2 Spot Instances with ECS at Signal: “Every day, Signal ingests millions of documents from a growing number of publishers, including online media, print newspapers, broadcast, regulation and legislation. Our text analytics pipeline processes these documents in real time, applying our own AI algorithms and machine learning, preparing them to be searched from our application and distributed via our alerts system. The entire Signal platform is built on a large number of microservices running on Docker containers deployed to Amazon ECS. In fact, we run almost all of our workloads on top of several ECS clusters including ingestion, processing and consumption. With the hyper growth of our platform we have started to face several challenges, primarily on efficiency and capacity planning. We had a lot more questions than answers. What is the best way to…

Leave a Comment

America in the 1920s

Bill Bryson (A Short History of Almost Everything) knows how to write rich and engaging history books (he’s a remarkable storyteller). One Summer is a spectacular book for understanding America in the mad 1920s (and all the craziness around Charles Lindbergh, including early aviation history). A must read.

Leave a Comment

Using Ansible Vault with environment variables

This is a common trend. You’ve been using Ansible to provision your infrastructure for some time and all of a sudden you will have a couple of secrets to manage, usually SSL/SSH private keys, API credentials, passwords, etc. Because you don’t want these secrets to be stored “in the clear” on your git repository, you will declare them as variables inside yaml files and then use Ansible Vault to encrypt them using an AES symmetric key. You can then run ansible-playbook with –ask-vault-pass, so yaml var files will get decrypted on the fly when running the playbook. Sometimes I use Ansible together with other tools under the same repository. For example, I prefer to provision AWS infrastructure with Terraform and then call Ansible as a provisioner to customize an EC2 instance and Cloudflare to update the DNS record . Or use Packer to bake an AMI and use Ansible as a local provisioner. In…

Leave a Comment