Category: Tech

Disaster Recovery with Terraform, AWS and a few lessons learned

What would happen if you unexpectedly had to rebuild your entire production infrastructure from scratch? Would you be able to perform a full recovery of all services and dependencies to an acceptable level? How long would it take? Hours? Days? Would the engineering team know what to do? What problems would you encounter? What about full data recovery and databases? Are backups available? How would you manage operations and set expectations across the business and with clients? This is the kind of nightmare scenario that keeps any SRE awake at night, especially if you’re running a SaaS platform. There is a common perception that these events resemble something out of the Black Swan Theory: they have a profound impact when they (rarely) happen and always arrive as a surprise. But they are less rare than we think. In the last couple of years, I’ve seen major security incidents…

Monitoring Docker Thin Pool Usage with Prometheus

Update (12/4/20): I highly recommend using the latest Amazon ECS-optimized Amazon Linux 2 AMI. It uses Docker’s OverlayFS (overlay2) storage driver. The same partition is used for the OS, Docker images and metadata, so it’s easier to monitor filesystem usage using the Prometheus Node Exporter. If you still have to use the older ECS AMI v2015.09.d or later, this article might be useful for you. There are plenty of tools out there for monitoring Docker using Prometheus. We can use the Node Exporter to gather useful information for Docker hosts at an OS/kernel level (memory, CPU, network, filesystem), and at a container level there is cAdvisor, which reports resource usage and performance data. Unfortunately, I couldn’t find any way of monitoring Docker Thin Pool usage with Prometheus, so I wrote a quick Python script to generate usage metrics that are exposed using Node Exporter’s textfile collector. So first, what does “Thin Pool”…
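The excerpt above does not show the script itself, so what follows is only a minimal sketch of the general idea, not the author's code. It assumes the thin pool figures are available as text lines such as `Data Space Used: 1.5 GB` (the style the devicemapper storage driver reports in `docker info`), that sizes use decimal units, and that the metric names (`docker_thinpool_*`) and the output filename are hypothetical. The file is written atomically because the textfile collector may read it at any moment.

```python
import os
import re
import tempfile

# Decimal unit multipliers; assumes sizes are printed like "1.5 GB".
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

# Hypothetical metric names, keyed by the label assumed to appear in the input.
FIELDS = {
    "Data Space Used": "docker_thinpool_data_used_bytes",
    "Data Space Total": "docker_thinpool_data_total_bytes",
    "Metadata Space Used": "docker_thinpool_metadata_used_bytes",
    "Metadata Space Total": "docker_thinpool_metadata_total_bytes",
}

# Matches lines of the form "<label>: <number> <unit>".
LINE_RE = re.compile(r"^\s*([A-Za-z ]+):\s*([0-9]+(?:\.[0-9]+)?)\s*([kMGT]?B)\s*$")


def render_metrics(info_text):
    """Turn thin pool lines into Prometheus text exposition format."""
    lines = []
    for raw in info_text.splitlines():
        match = LINE_RE.match(raw)
        if not match:
            continue
        label, number, unit = match.groups()
        name = FIELDS.get(label.strip())
        if name is None:
            continue  # a size line we do not export
        value = int(round(float(number) * UNITS[unit]))
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "".join(line + "\n" for line in lines)


def write_textfile(path, content):
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # let the exporter's user read the file
    os.replace(tmp_path, path)
```

A cron job could call `write_textfile` with a `.prom` file inside the directory the Node Exporter's textfile collector is configured to read; the collector only picks up files with that extension.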

Using EC2 Spot Instances with ECS

I had the opportunity to write an article for the AWS Startups Blog, explaining how we use EC2 Spot Instances with ECS at Signal: “Every day, Signal ingests millions of documents from a growing number of publishers, including online media, print newspapers, broadcast, regulation and legislation. Our text analytics pipeline processes these documents in real time, applying our own AI algorithms and machine learning, preparing them to be searched from our application and distributed via our alerts system. The entire Signal platform is built on a large number of microservices running on Docker containers deployed to Amazon ECS. In fact, we run almost all of our workloads on top of several ECS clusters including ingestion, processing and consumption. With the hyper growth of our platform we have started to face several challenges, primarily on efficiency and capacity planning. We had a lot more questions than answers. What is the best way to…

Using Ansible Vault with environment variables

This is a common trend. You’ve been using Ansible to provision your infrastructure for some time and all of a sudden you have a couple of secrets to manage, usually SSL/SSH private keys, API credentials, passwords, etc. Because you don’t want these secrets to be stored “in the clear” in your Git repository, you declare them as variables inside YAML files and then use Ansible Vault to encrypt them using an AES symmetric key. You can then run ansible-playbook with --ask-vault-pass, so the YAML var files get decrypted on the fly when running the playbook. Sometimes I use Ansible together with other tools under the same repository. For example, I prefer to provision AWS infrastructure with Terraform and then call Ansible as a provisioner to customize an EC2 instance and Cloudflare to update the DNS record. Or use Packer to bake an AMI and use Ansible as a local provisioner. In…
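The excerpt is cut off before the actual approach is described, so the following is only one possible sketch, not necessarily what the full post does. Ansible accepts an executable script through `--vault-password-file` and uses whatever the script prints to stdout as the vault password, which makes it possible to keep the password in an environment variable instead of typing it at a prompt. The variable name `VAULT_PASSWORD` and the function names are assumptions made for illustration.

```python
import os
import sys

# Hypothetical name of the environment variable holding the vault password.
ENV_VAR = "VAULT_PASSWORD"


def read_vault_password(environ=None):
    """Return the vault password from the environment, or raise KeyError."""
    environ = os.environ if environ is None else environ
    password = environ.get(ENV_VAR, "")
    if not password:
        raise KeyError(f"{ENV_VAR} is not set or is empty")
    return password


def main(environ=None):
    """Print the password for Ansible to read; return a shell exit code."""
    try:
        print(read_vault_password(environ))
    except KeyError as exc:
        sys.stderr.write(f"vault password script: {exc}\n")
        return 1
    return 0
```

To use it as a password script, the file would need a `#!/usr/bin/env python3` first line, a final `sys.exit(main())` call, and the executable bit set. It could then be passed as `ansible-playbook site.yml --vault-password-file ./vault_pass.py`, where both `site.yml` and `vault_pass.py` are placeholder names. The same script can be shared by Terraform or Packer runs that invoke Ansible, since they only need the variable exported in their environment.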

Upstart and resolvconf cache

I recently came across this while trying to fix a nameserver config issue with resolvconf on Ubuntu. When resolvconf populates /etc/resolv.conf, it reads what we have configured in /etc/resolvconf/resolv.conf.d (head, base, tail, etc.) and also any dns-nameservers declared in /etc/network/interfaces. Something I was populating in the head file (with Puppet) conflicted with something configured under /etc/network/interfaces. So I removed the conflicting dns-nameservers declaration from the interfaces file and ran “resolvconf -u” to update the config. To my surprise, the “deleted” nameservers from /etc/network/interfaces were still included in /etc/resolv.conf. After some debugging, I noticed that resolvconf’s Upstart script now keeps a cache file under /run/resolvconf/interface that is a copy of the previous /etc/network/interfaces. You need to delete this file and restart resolvconf to make it work: “stop resolvconf ; start resolvconf”.

Not long ago I decided to move to a new personal domain and registered I am in the process of moving everything from to my new domain, which will cease to exist in a few months. If you are one of the brave souls still keeping an eye on my feed, I advise you to switch to the new domain before the redirect expires.

Discovering jemalloc and debugging native Java memory leaks

I joined ThoughtWorks last August (awesome!) and I’ve been working with the tech team on everything related to infrastructure automation, code deployment and all things “DevOps” for GOV.UK Verify (part of the Government Digital Service). The last few months were very rewarding for me as I got exposed to a lot of different technologies, although I tend to work with Puppet most of the time and I don’t get the chance to look at other things “from the other side”. Working with the dev team on a Java memory leak issue was a great way to dig into something I was already familiar with, while getting the chance to understand a little more about JVM memory allocation and Linux kernel memory management, and to discover great tools like jemalloc and the excellent jeprof profiler. We lost a lot of time playing the guessing game and using the wrong tools before we found…
