Creating Termination Protected Tasks In ECS (Part 1)

To run a processing pipeline or batch job (e.g., a set of tasks that process for a finite amount of time) in ECS, one has two main options: use an ECS service or launch a standalone task.

An ECS service will typically listen to a queue, a metric, or an API call that triggers it to start processing. However, when it finishes processing, the service keeps the same number of desired tasks running. A standalone task, probably triggered by the same inputs, has a start and a finish: it terminates right after the processing completes.

There are different considerations for choosing a standalone task versus a DAEMON or REPLICA service task when building a processing pipeline in ECS. However, in this post we won’t go over these, but rather examine the obstacles that arise when choosing to work with a REPLICA task. We’ll review how to use an EC2-based ECS cluster to process jobs and how to handle scale-in events.

Let’s work with an example of a service that creates an ice cream whenever there is a new request coming from an SQS queue. A single ice cream creation takes a few minutes of processing time and each ECS task can create one ice cream at a time.
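To make the example concrete, here is a minimal sketch of the loop each ECS task might run. The `make_ice_cream` function and the queue URL are hypothetical, and `sqs` is assumed to behave like `boto3.client("sqs")`; it is passed in so the logic can be exercised without AWS access.

```python
def make_ice_cream(order: str) -> str:
    """Hypothetical stand-in for the minutes-long processing step."""
    return f"ice cream for {order}"


def process_one(sqs, queue_url: str) -> bool:
    """Receive at most one request, process it, then delete it.

    Returns True if a message was processed, False if the queue was empty.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,  # one ice cream at a time per task
        WaitTimeSeconds=20,     # long polling
    )
    messages = response.get("Messages", [])
    if not messages:
        return False

    message = messages[0]
    make_ice_cream(message["Body"])

    # Delete only after processing succeeded. If the task is stopped before
    # this point, SQS makes the message visible again once its visibility
    # timeout expires, so another task can retry it.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

The window between `receive_message` and `delete_message` is exactly the processing time we would like to protect from scale-in.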

Scale-in Events: How Not To Lose Precious Processing Time

Now, let’s think about what happens when a scale-in event comes into play. First, let’s see when that can happen. Potential scale-in events come from two main sources:

  • An ASG event that reduces the number of desired tasks.
  • A new ECS deployment that causes the tasks to get recreated with a new version.

While seemingly harmless at first, these events can stop our tasks mid-processing, meaning an ice cream gets thrown away while in the making. In that case, although a simple retry mechanism can re-create the ice cream request, we still lose precious time, which can be crucial for our product SLA.

Instance Termination Protection For ECS REPLICA Tasks

ECS offers “the option of having ECS dynamically manage instance termination protection on your behalf ... if enabled for a capacity provider, ECS will protect any instance from scale-in if it is running at least one non-daemon task”. However, ECS doesn’t protect a task during a deployment: in a deployment the EC2 instance doesn’t get terminated, but the task does. That is something we’ll need to implement on our own until the ECS team implements a ‘task termination protection’-like mechanism (you can follow the roadmap here and here).

We have already defined the two sources of potential task termination above, so now we can review the approaches for dealing with them. It’s important to mention that these are merely the methods I have come to use while researching the topic, and there may well be better approaches to the problem.
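For reference, managed termination protection is a setting on the capacity provider. The sketch below only builds the arguments for boto3’s `ecs.create_capacity_provider`; the provider name, the ASG ARN, and the target capacity of 100 are placeholder assumptions.

```python
def capacity_provider_request(name: str, asg_arn: str) -> dict:
    """Build kwargs for boto3.client("ecs").create_capacity_provider(**kwargs)."""
    return {
        "name": name,
        "autoScalingGroupProvider": {
            "autoScalingGroupArn": asg_arn,
            # Managed termination protection requires managed scaling.
            "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
            "managedTerminationProtection": "ENABLED",
        },
    }
```

The Auto Scaling group itself also needs scale-in protection enabled for newly launched instances, otherwise ECS cannot manage the protection flag on them.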

Scale-in source: an ASG event that reduces the number of desired tasks.

This source can be handled by calling the EC2 Auto Scaling API from our code and turning on instance scale-in protection while we process our task. In this case, the ASG will not terminate the instance during a scale-in until the task finishes processing and our code releases the protection.

While this protects from an ASG event, it still doesn’t protect from a new deployment (see below).

import boto3
client = boto3.client('autoscaling')
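A minimal sketch of how that client could be used, under a few assumptions: the `autoscaling` argument behaves like `boto3.client('autoscaling')`, the caller already knows the ID of the EC2 instance the task runs on, and only one task per instance toggles the flag. The helper names are hypothetical.

```python
from contextlib import contextmanager


def find_asg_name(autoscaling, instance_id: str) -> str:
    """Look up the Auto Scaling group that owns the given instance."""
    response = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    return response["AutoScalingInstances"][0]["AutoScalingGroupName"]


@contextmanager
def scale_in_protection(autoscaling, instance_id: str):
    """Protect the instance from ASG scale-in while the body runs."""
    asg_name = find_asg_name(autoscaling, instance_id)

    def set_protection(enabled: bool) -> None:
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=enabled,
        )

    set_protection(True)
    try:
        yield
    finally:
        # Release even if processing fails, so the instance can scale in again.
        set_protection(False)
```

The processing step would then run inside `with scale_in_protection(client, instance_id): ...`. The instance ID can typically be read from the EC2 instance metadata service. Note that if several tasks share one instance, the first task to finish would drop the protection for the others, so this sketch fits best when each instance runs a single task.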

There are cases where it is worth considering the distinctInstance placement constraint together with the built-in support for instance termination protection instead of implementing it on our own. A natural example is a task that uses a GPU when the underlying EC2 instance has a single GPU.
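As an illustration, the constraint is declared when the service is created. The sketch below only builds the arguments for boto3’s `ecs.create_service`; the cluster, service, and task definition names are placeholders.

```python
def service_request(cluster: str, service: str, task_definition: str, count: int) -> dict:
    """Build kwargs for boto3.client("ecs").create_service(**kwargs)."""
    return {
        "cluster": cluster,
        "serviceName": service,
        "taskDefinition": task_definition,
        "desiredCount": count,
        "schedulingStrategy": "REPLICA",
        # Place each task on a different container instance.
        "placementConstraints": [{"type": "distinctInstance"}],
    }
```

With one task per instance, protecting the instance and protecting the task become the same thing as far as ASG scale-in is concerned.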

Scale-in source: a new ECS deployment that causes the tasks to get recreated with a new version.

During a deployment ECS will recreate the running tasks without terminating the underlying EC2 instances. That means that if a task is processing an ice cream when a developer pushes a new version, the task will be replaced and the processing lost, degrading our ability to process ice creams in near real time.

This source can be handled by writing our own deployment flow and managing the EC2 ASG scale during deployment ourselves, as follows:

  • Scale out EC2 ASG
  • Scale out ECS Tasks
  • Wait for the new tasks with the new version to be created
  • Scale in EC2 ASG (assuming OldestInstance termination policy)
  • Scale in ECS Tasks
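One possible reading of these steps, sketched with injected clients that are assumed to behave like `boto3.client('ecs')` and `boto3.client('autoscaling')`. The steps do not say how the new version reaches the new tasks, so the sketch assumes newly started tasks already run it (for instance through a mutable image tag). Doubling the capacity is also an assumption, as are the function and parameter names.

```python
import time


def wait_for_running(ecs, cluster, service, expected, poll_seconds=15, max_polls=40):
    """Poll the service until at least `expected` tasks are running."""
    for _ in range(max_polls):
        described = ecs.describe_services(cluster=cluster, services=[service])
        if described["services"][0]["runningCount"] >= expected:
            return
        time.sleep(poll_seconds)
    raise TimeoutError(f"{service} did not reach {expected} running tasks")


def deploy(ecs, autoscaling, cluster, service, asg_name, size, poll_seconds=15):
    """Run the five steps in order; `size` is the steady-state count."""
    doubled = 2 * size  # must not exceed the ASG's MaxSize

    # 1. Scale out EC2 ASG
    autoscaling.set_desired_capacity(AutoScalingGroupName=asg_name, DesiredCapacity=doubled)
    # 2. Scale out ECS tasks
    ecs.update_service(cluster=cluster, service=service, desiredCount=doubled)
    # 3. Wait for the new tasks to be created
    wait_for_running(ecs, cluster, service, expected=doubled, poll_seconds=poll_seconds)
    # 4. Scale in EC2 ASG (the oldest instances go first under OldestInstance)
    autoscaling.set_desired_capacity(AutoScalingGroupName=asg_name, DesiredCapacity=size)
    # 5. Scale in ECS tasks
    ecs.update_service(cluster=cluster, service=service, desiredCount=size)
```

On its own this sequence does not wait for in-flight ice creams on the old instances; it relies on the scale-in protection described earlier to keep busy instances alive during step 4.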

But things get complicated once we also want to scale in and out automatically (typically managed by a capacity provider), and not only through our own custom deployment. At this point I concluded that a more robust solution is required:

Create a Task Self-destruct Mechanism For ECS

I will elaborate on that in the following post.
