Deep Learning Setup: ECS GPU Task On Ubuntu (Part 2)
To run a GPU-based task in ECS, we need to bring our own EC2 instance, as Fargate still doesn’t support GPUs. That shouldn’t be too hard with the ECS GPU-optimized AMIs.
However, that is not always an option: many teams already rely on their favorite AMI setup, such as an Ubuntu CIS optimized AMI or any other flavor. In that case, they need to install and configure everything from scratch.
In this series of four articles, we’ll walk through installing and configuring an ECS task that requires GPU resources on Ubuntu 18.04.
Part 2: The ECS agent
Docker & ECS Agent
Installing and configuring Docker and the ECS agent is well documented, but I ran into a few gotchas, so I found it valuable to write down the steps that worked for me. I still recommend going over the official documentation, which is much more detailed; the exact steps may also differ depending on your needs and preferences.
Install Docker
Set up the repository
$ sudo apt-get update
$ sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
$ sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
Install Docker Engine
$ sudo apt-get update
$ sudo apt-get install docker-ce docker-ce-cli containerd.io
Install the ECS agent
Allow the port proxy to route traffic using loopback addresses
$ sudo sh -c "echo 'net.ipv4.conf.all.route_localnet = 1' >> /etc/sysctl.conf"
$ sudo sysctl -p /etc/sysctl.conf
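One small gotcha with the echo approach above: running it a second time appends the same line again. A minimal sketch of an idempotent alternative follows. The helper name ensure_line is my own (hypothetical) and is not part of any tool; it has to run as root to be able to write to /etc/sysctl.conf.

```shell
# ensure_line FILE LINE
# Append LINE to FILE only if FILE does not already contain that exact line.
# (Hypothetical helper, shown as a sketch.)
ensure_line() {
    file=$1
    line=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example, run as root:
#   ensure_line /etc/sysctl.conf 'net.ipv4.conf.all.route_localnet = 1'
#   sysctl -p /etc/sysctl.conf
```

Calling it repeatedly leaves a single copy of the setting in the file, which is convenient if these steps end up in a provisioning script.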
Enable IAM roles for tasks
sudo apt-get install iptables-persistent
sudo iptables -t nat -A PREROUTING -p tcp -d 169.254.170.2 --dport 80 -j DNAT --to-destination 127.0.0.1:51679
sudo iptables -t nat -A OUTPUT -d 169.254.170.2 -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 51679
Add an iptables rule to block off-host access to the introspection API endpoint (replace eth0 if your instance's primary network interface has a different name)
sudo iptables -A INPUT -i eth0 -p tcp --dport 51678 -j DROP
Save the new iptables configuration so that it persists across reboots (on Ubuntu, iptables-persistent reads it from /etc/iptables/rules.v4)
sudo sh -c 'iptables-save > /etc/iptables/rules.v4'
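Before moving on, it can be worth confirming that all three rules actually landed in the saved file. The sketch below is one way to do that, under the assumption that the file is in iptables-save format; check_ecs_rules is a hypothetical helper name. The patterns are deliberately loose because iptables-save may normalise the rules it prints (for example, writing 169.254.170.2/32 or adding "-m tcp").

```shell
# check_ecs_rules FILE
# Succeed only if FILE (iptables-save format) contains the three rules
# added above; print a message for each rule that is missing.
# (Hypothetical helper, shown as a sketch.)
check_ecs_rules() {
    rules=$1
    status=0
    # "--" stops option parsing, since every pattern starts with "-A"
    grep -Eq -- '-A PREROUTING .*169\.254\.170\.2.*DNAT.*127\.0\.0\.1:51679' "$rules" \
        || { echo "missing: PREROUTING DNAT rule"; status=1; }
    grep -Eq -- '-A OUTPUT .*169\.254\.170\.2.*REDIRECT.*51679' "$rules" \
        || { echo "missing: OUTPUT REDIRECT rule"; status=1; }
    grep -Eq -- '-A INPUT .*--dport 51678 .*-j DROP' "$rules" \
        || { echo "missing: INPUT DROP rule for port 51678"; status=1; }
    return "$status"
}

# Example (the file may only be readable by root):
#   check_ecs_rules /etc/iptables/rules.v4 && echo "all ECS rules saved"
```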
Create the /etc/ecs directory and the Amazon ECS container agent configuration file
sudo mkdir -p /etc/ecs && sudo touch /etc/ecs/ecs.config
Create the host volume mount points on your container instance.
sudo mkdir -p /var/log/ecs /var/lib/ecs/data
Download the ECS container agent
curl -o ecs-agent.tar https://s3.amazonaws.com/amazon-ecs-agent-us-east-1/ecs-agent-latest.tar
Load the ECS container agent image.
sudo docker load --input ./ecs-agent.tar && rm ecs-agent.tar
Manage Docker as a non-root user
https://docs.docker.com/engine/install/linux-postinstall/
sudo groupadd docker
sudo usermod -aG docker $USER
# Log out and log back in so that your group membership is re-evaluated.
Create a systemd service for the ECS agent:
sudo vi /etc/systemd/system/ecs-agent.service
Add the following content:
[Unit]
Description=AWS ECS Agent
Documentation=https://docs.aws.amazon.com/AmazonECS/latest/developerguide/
Requires=docker.service
After=docker.service
[Service]
Restart=always
RestartPreventExitStatus=5
ExecStartPre=/bin/mkdir -p /var/lib/ecs/data
ExecStartPre=/bin/mkdir -p /var/log/ecs
ExecStartPre=-/usr/bin/docker kill ecs-agent
ExecStartPre=-/usr/bin/docker rm ecs-agent
ExecStart=/usr/bin/docker run \
--name=ecs-agent \
--restart=on-failure:10 \
--volume=/var/run/docker.sock:/var/run/docker.sock \
--volume=/var/log/ecs/:/log \
--volume=/var/lib/ecs/data:/data \
--volume=/etc/ecs:/etc/ecs \
--volume=/sys/fs/cgroup:/sys/fs/cgroup:ro \
--net=host \
--env-file=/etc/ecs/ecs.config \
amazon/amazon-ecs-agent:latest
ExecStop=-/usr/bin/docker stop ecs-agent
[Install]
WantedBy=multi-user.target
Start the service:
sudo systemctl enable ecs-agent.service
sudo systemctl start ecs-agent.service
Verify the ECS agent is working
$ docker ps
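docker ps lists every running container, so in a script a narrower check can be handy. The sketch below assumes the container is named ecs-agent, as in the unit file above; the function name agent_running is hypothetical.

```shell
# agent_running
# Succeed if a running container is named exactly "ecs-agent".
# (Hypothetical helper, shown as a sketch.)
agent_running() {
    # --filter narrows the listing, --format prints only container names,
    # and grep -qx requires an exact whole-line match on the name.
    docker ps --filter "name=ecs-agent" --format '{{.Names}}' | grep -qx 'ecs-agent'
}

# Example:
#   agent_running && echo "ECS agent container is running"
```

If the check fails, sudo systemctl status ecs-agent.service is a reasonable first place to look.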
Add the ECS cluster to the ECS agent config:
Edit the config file:
sudo vi /etc/ecs/ecs.config
Add the following:
ECS_DATADIR=/data
ECS_ENABLE_TASK_IAM_ROLE=true
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST=true
ECS_LOGFILE=/log/ecs-agent.log
ECS_AVAILABLE_LOGGING_DRIVERS=["json-file","awslogs"]
ECS_LOGLEVEL=info
ECS_CLUSTER=default
Then restart the ECS agent:
sudo systemctl restart ecs-agent.service
If your cluster is not named default, set ECS_CLUSTER to your cluster name instead (and restart the agent after changing it).
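If you would rather not edit the file by hand, for example in a provisioning script, the setting can be changed non-interactively. This is only a sketch: set_ecs_option is a hypothetical helper, my-gpu-cluster is a made-up cluster name, and the value must not contain the characters |, & or a backslash because it is passed through sed.

```shell
# set_ecs_option FILE KEY VALUE
# Replace an existing KEY=... line in FILE, or append KEY=VALUE if absent.
# (Hypothetical helper, shown as a sketch.)
set_ecs_option() {
    file=$1
    key=$2
    value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        scratch=$(mktemp)
        # Rewrite via a scratch file, then copy the content back so the
        # original file keeps its owner and permissions.
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$scratch" && cat "$scratch" > "$file"
        rm -f "$scratch"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Example, run as root (my-gpu-cluster is a placeholder name):
#   set_ecs_option /etc/ecs/ecs.config ECS_CLUSTER my-gpu-cluster
#   systemctl restart ecs-agent.service
```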
At this point, we should be able to see the EC2 instance under the ECS Instances tab of the ECS cluster, meaning the agent has registered with the cluster successfully.
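The registration can also be confirmed from the instance itself through the agent's introspection API, which listens on port 51678 (the port we blocked from off-host access earlier). A rough sketch, assuming curl is installed and that the response is a flat JSON object with a Cluster field; json_field is a hypothetical sed-based helper, and a real JSON parser such as jq would be more robust.

```shell
# json_field NAME
# Read a flat JSON object on stdin and print the string value of NAME.
# (Hypothetical helper; a crude sed extraction, not a real JSON parser.)
json_field() {
    sed -n "s/.*\"$1\"[[:space:]]*:[[:space:]]*\"\([^\"]*\)\".*/\1/p"
}

# Example, on the instance:
#   curl -s http://localhost:51678/v1/metadata | json_field Cluster
#   curl -s http://localhost:51678/v1/metadata | json_field ContainerInstanceArn
```

The first command should print the cluster name set in ECS_CLUSTER; an empty result suggests the agent has not registered yet.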
However, since we are trying to run a GPU-based task, the Events tab of the ECS service should show the following event:
service ... was unable to place a task because no container instance met all of its requirements. The closest matching container-instance ... has insufficient GPU resource available. For more information, see the Troubleshooting section.
This happens because the ECS agent is not yet configured to use the GPU of the EC2 instance it runs on.
In the next article, we’ll install the NVIDIA Docker runtime and connect the ECS agent to the NVIDIA driver.