Deep Learning Setup: ECS GPU Task On Ubuntu (Part 2)

Michael Loewenstein
CodeX


To run a GPU-based task in ECS, we need to provision our own EC2 instance, because Fargate still doesn’t support GPUs. That shouldn’t be too hard with the ECS GPU-optimized AMIs.

However, that isn’t always an option: teams often already rely on their favorite AMI setup, such as an Ubuntu CIS-optimized AMI or any other flavor. This means they need to install and configure everything from scratch.

In this series of four articles, we’ll walk through installing and configuring everything needed to run an ECS task that requires GPU resources on Ubuntu 18.04.

Part 1: The NVIDIA driver

Part 2: The ECS agent

Part 3: The NVIDIA-Docker runtime

Part 4: GPU configuration on ECS Agent

Docker & ECS Agent

Although installing and configuring Docker and the ECS agent is very well documented, I ran into a few gotchas, so I found it valuable to write down the required steps myself. I still recommend going over the official documentation, which is much more detailed; the exact steps may also change according to your needs and preferences.

Install Docker

Set up the repository

$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl \
    gnupg-agent software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository \
    "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
    $(lsb_release -cs) \
    stable"

Install Docker Engine

$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io

Install the ECS agent

Allow the port proxy to route traffic using loopback addresses

$ sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
$ sudo sysctl -p /etc/sysctl.conf
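
One gotcha with the `>>` redirect above: it appends every time, so re-running the step leaves duplicate lines in /etc/sysctl.conf. Here is a minimal sketch of a re-runnable variant. Note that `ensure_line` is a hypothetical helper name I made up for illustration, not part of any tool.

```shell
# ensure_line FILE LINE
# Append LINE to FILE only if an identical line is not already present.
# (Hypothetical helper, for illustration only.)
ensure_line() {
    # -q: quiet, -x: match the whole line, -F: treat LINE as a fixed string
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Usage (from a root shell, since /etc/sysctl.conf is owned by root):
#   ensure_line /etc/sysctl.conf 'net.ipv4.conf.all.route_localnet = 1'
#   sysctl -p /etc/sysctl.conf
```

Calling it twice with the same arguments leaves a single copy of the line in the file.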

Enable IAM roles for tasks

sudo apt-get install iptables-persistent
sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679

Add an iptables rule to block off-host access to the introspection API endpoint

sudo iptables -A INPUT -i eth0 -p tcp --dport 51678 -j DROP
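
The rule above assumes the primary network interface is called eth0. Depending on the instance type and image it may have a different name (ens5, for example), in which case the rule would not match any traffic. A small sketch for looking the name up from the default route follows; `default_iface` is a hypothetical helper, and it assumes the usual `ip route` output where the interface name follows the word `dev`.

```shell
# default_iface
# Read `ip route show default` output on stdin and print the interface name
# that follows the "dev" keyword on the first default route.
# (Hypothetical helper, for illustration only.)
default_iface() {
    awk '$1 == "default" {
        for (i = 1; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Usage on the instance:
#   IFACE="$(ip route show default | default_iface)"
#   sudo iptables -A INPUT -i "$IFACE" -p tcp --dport 51678 -j DROP
```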

Write the new iptables configuration to your operating system-specific location (Ubuntu)

sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
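
Before moving on, it can be worth checking that all three rules actually made it into the saved file, since that file is what gets restored after a reboot. A rough sketch is below. `check_ecs_rules` is a hypothetical helper, and it assumes iptables-save writes the rules in its usual `-A CHAIN ...` form.

```shell
# check_ecs_rules FILE
# Succeed only if FILE (iptables-save output) contains the DNAT, REDIRECT
# and DROP rules added in the previous steps.
# (Hypothetical helper, for illustration only.)
check_ecs_rules() {
    grep -qF -- '-j DNAT --to-destination 127.0.0.1:51679' "$1" &&
    grep -qF -- '-j REDIRECT --to-ports 51679' "$1" &&
    grep -qF -- '--dport 51678 -j DROP' "$1"
}

# Usage (from a root shell, as the saved rules may not be world-readable):
#   check_ecs_rules /etc/iptables/rules.v4 && echo "ECS rules saved"
```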

Create the /etc/ecs directory and create the Amazon ECS container agent configuration file

sudo mkdir -p /etc/ecs && sudo touch /etc/ecs/ecs.config

Create the host volume mount points on your container instance.

sudo mkdir -p /var/log/ecs /var/lib/ecs/data

Download the ECS container agent

curl -o ecs-agent.tar https://s3.amazonaws.com/amazon-ecs-agent-us-east-1/ecs-agent-latest.tar

Load the ECS container agent image.

sudo docker load --input ./ecs-agent.tar && rm ecs-agent.tar

Manage Docker as a non-root user

https://docs.docker.com/engine/install/linux-postinstall/

sudo groupadd docker
sudo usermod -aG docker $USER
# Log out and log back in so that your group membership is re-evaluated.

Create a systemd service for the ECS agent:

sudo vi /etc/systemd/system/ecs-agent.service

Add the following content:

[Unit]
Description=AWS ECS Agent
Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
Requires=docker.service
After=docker.service

[Service]
Restart=always
RestartPreventExitStatus=5
ExecStartPre=/bin/mkdir -p /var/lib/ecs/data
ExecStartPre=/bin/mkdir -p /var/log/ecs
ExecStartPre=-/usr/bin/docker kill ecs-agent
ExecStartPre=-/usr/bin/docker rm ecs-agent
ExecStart=/usr/bin/docker run \
--name=ecs-agent \
--restart=on-failure:10 \
--volume=/var/run/docker.sock:/var/run/docker.sock \
--volume=/var/log/ecs/:/log \
--volume=/var/lib/ecs/data:/data \
--volume=/etc/ecs:/etc/ecs \
--volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=-/usr/bin/docker stop ecs-agent

[Install]
WantedBy=multi-user.target
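
systemd only recognizes a section header when it sits on its own line, and a header that gets glued to the end of the previous line while pasting is easy to miss. As a quick sanity check before enabling the service, something like the sketch below can help. `check_unit_sections` is a hypothetical helper, not a systemd tool.

```shell
# check_unit_sections FILE
# Succeed only if [Unit], [Service] and [Install] each appear alone on a line.
# (Hypothetical helper, for illustration only.)
check_unit_sections() {
    for section in Unit Service Install; do
        # -x: the header must be the entire line
        grep -qx -- "\[$section\]" "$1" || {
            echo "missing section header: [$section]" >&2
            return 1
        }
    done
}

# Usage on the instance:
#   check_unit_sections /etc/systemd/system/ecs-agent.service && echo "unit file looks sane"
```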

Start the service:

sudo systemctl enable ecs-agent.service
sudo systemctl start ecs-agent.service

Verify that the ECS agent is running; the output should list a container named ecs-agent:

$ docker ps

Add the ECS cluster to the ECS agent config:

Edit the config file:

sudo vi /etc/ecs/ecs.config

Add the following:

ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=default

Then restart the ECS agent:

sudo systemctl restart ecs-agent.service

If your cluster is named something other than default, set ECS_CLUSTER to that name in the config file before restarting the agent.
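
If you'd rather script the config change than edit it in vi, here is a minimal sketch that sets a single key without leaving stale duplicates behind. `set_config_value` is a hypothetical helper, and `my-gpu-cluster` is just a placeholder cluster name.

```shell
# set_config_value FILE KEY VALUE
# Make FILE contain exactly one KEY=VALUE line, keeping all other lines.
# (Hypothetical helper, for illustration only.)
set_config_value() {
    scv_tmp="$(mktemp)" || return 1
    # Keep every line except existing KEY= entries
    # (grep exits non-zero when nothing is left, which is fine here).
    grep -v -- "^$2=" "$1" > "$scv_tmp" 2>/dev/null || true
    printf '%s=%s\n' "$2" "$3" >> "$scv_tmp"
    # Write back through the original path so ownership and mode are kept.
    cat "$scv_tmp" > "$1" && rm -f "$scv_tmp"
}

# Usage (from a root shell; my-gpu-cluster is a placeholder name):
#   set_config_value /etc/ecs/ecs.config ECS_CLUSTER my-gpu-cluster
#   systemctl restart ecs-agent.service
```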

At this point, the EC2 instance should appear under the ECS Instances tab of the ECS cluster, meaning the agent has registered with the cluster successfully.
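
The same thing can be checked from the instance itself through the agent's introspection API on port 51678, which reports the cluster the agent joined. The sketch below pulls the cluster name out of that response. `agent_cluster` is a hypothetical helper, and it assumes curl is installed and that the metadata JSON arrives on a single line.

```shell
# agent_cluster
# Read the introspection metadata JSON on stdin and print the value of its
# "Cluster" field.
# (Hypothetical helper, for illustration only.)
agent_cluster() {
    sed -n 's/.*"Cluster"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage on the instance:
#   curl -s http://localhost:51678/v1/metadata | agent_cluster
```

If this prints the name you put in ECS_CLUSTER, the agent picked up the config.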

However, since we are trying to run a GPU-based task, the Events tab of the ECS service should show the following event:

service ... was unable to place a task because no container instance met all of its requirements. The closest matching container-instance ... has insufficient GPU resource available. For more information, see the Troubleshooting section.

This happens because the ECS agent isn’t yet configured to use the GPU of the EC2 instance it runs on.

In the next article, we’ll install the NVIDIA-Docker runtime and start connecting the ECS agent to the NVIDIA driver.
