I had the opportunity to write an article for the AWS Startups Blog explaining how we use EC2 Spot Instances with ECS at Signal:
Every day, Signal ingests millions of documents from a growing number of publishers, including online media, print newspapers, broadcast, regulation and legislation. Our text analytics pipeline processes these documents in real time, applying our own AI and machine learning algorithms and preparing them to be searched in our application and distributed via our alerts system. The entire Signal platform is built on a large number of microservices running in Docker containers deployed to Amazon ECS. In fact, we run almost all of our workloads, including ingestion, processing and consumption, on top of several ECS clusters. With the hypergrowth of our platform we started to face several challenges, primarily around efficiency and capacity planning. We had a lot more questions than answers. What is the best way to use the available capacity in a given cluster? Are we over-provisioning at peak times for longer than we should? Can we run our data pipeline at a lower cost while still being able to scale and guarantee the expected service availability for our customers, 24/7? We had an interesting journey finding answers to these questions, and I would like to share how we were able to reduce our EC2 compute costs by up to 70% without any impact on service uptime or scalability.
I would like to thank Luca Grulla and Jo Draeger, as well as several AWS architects, for reviewing my draft.
The Lambda draining function and Terraform config can be found here.
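To give a rough idea of what a draining function of this kind does, here is a minimal Python sketch. It is not the code from the linked repository. It assumes the Lambda is triggered by an EC2 Spot Instance interruption warning event and that the cluster name comes from a hypothetical `ECS_CLUSTER` environment variable; the function names `drain_container_instance` and `handler` are also illustrative. The idea is to find the ECS container instance backed by the interrupted EC2 instance and set it to `DRAINING`, so that ECS stops placing new tasks on it and reschedules service tasks elsewhere.

```python
import os


def drain_container_instance(ecs, cluster, ec2_instance_id):
    """Set the ECS container instance backed by an EC2 instance to DRAINING.

    `ecs` is a boto3 ECS client (or anything exposing the same two methods).
    Returns the container instance ARNs that were drained (possibly empty).
    """
    # Look up the container instance registered for this EC2 instance,
    # using the ECS cluster query language in the `filter` argument.
    listed = ecs.list_container_instances(
        cluster=cluster,
        filter=f"ec2InstanceId == {ec2_instance_id}",
    )
    arns = listed.get("containerInstanceArns", [])
    if not arns:
        # The instance is not (or is no longer) registered in this cluster.
        return []

    # DRAINING prevents new task placement on the instance and lets ECS
    # replace its service tasks on other instances.
    ecs.update_container_instances_state(
        cluster=cluster,
        containerInstances=arns,
        status="DRAINING",
    )
    return arns


def handler(event, context):
    """Hypothetical Lambda entry point for a Spot interruption warning event."""
    import boto3  # imported lazily so the helper above can be tested without AWS

    instance_id = event["detail"]["instance-id"]
    cluster = os.environ["ECS_CLUSTER"]  # assumed environment variable
    return drain_container_instance(boto3.client("ecs"), cluster, instance_id)
```

Keeping the AWS calls in a helper that takes the client as an argument makes the logic easy to exercise with a stub client before wiring it into a real Lambda.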