Solving ECS Stuck in Pending and Frozen / Stalled ECS Hosts Problems

We’ve had a strange, hard to track-down problem for months now. It has felt like a bug with Amazon ECS, but everything seems to have been working correctly.

The main way that we’ve observed this problem is that ECS would say that it was launching tasks, but they would stay in a “PENDING” state forever. Conversely, when tasks needed to be killed, the desired state would change to Stopped, but the ECS Console would indicate that they were still running. We discovered quickly, that some of our ECS Host Servers would become completely unresponsive. Sometimes with 100% CPU usage, sometimes with near zero CPU Usage. Terminating the instance, and having the Auto-Scaling group recreate it would generally solve the problem, but its never good to have things frozen without understanding why.

Often, the host servers would be completely unresponsive. We were usually unable to SSH into the server to investigate. When able to access them, looked through logs and found it full of failures about being unable to talk to external resources. After diving pretty deep, we figured out that the route table was missing a default gateway. It’s hard to talk to anything when you can only use a local network.

This is an example of a missing default gateway.

[ec2-user@ip-172-31-45-74 ~]$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.31.32.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0

On a functioning instance, it should look like this. Notice the destination of 0.0.0.0 with the IP Address to the Default Gateway:

[ec2-user@ip-172-31-39-228 ~]$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.31.32.1     0.0.0.0         UG    0      0        0 eth0
169.254.169.254 0.0.0.0         255.255.255.255 UH    0      0        0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U     0      0        0 docker0
172.31.32.0     0.0.0.0         255.255.240.0   U     0      0        0 eth0

It was puzzling how the machine would work for a while, and then its default gateway would disappear.

I’m still not certain how exactly that is happening. However, the system log indicates that there is a period of extremely high load
and it gets frozen for minutes (maybe hours) at a time.

Some of these log entries are indicitive of major delays:

Jan 20 13:26:44 ip-172-31-123-45.ec2.internal crond[21992]: (root) INFO (Job execution of per-minute job scheduled for 13:25 delayed into subsequent minute 13:26. Skipping job run.)

Jan 17 21:20:31 ip-172-31-45-166.ec2.internal chronyd[2696]: Forward time jump detected!

Notice how these logs are out of order too:

Jan 20 13:39:22 ip-172-31-123-45.ec2.internal kernel: R13: 00007faf9dc777a8 R14: 00000000000031f9 R15: 00007faf9dc7d510
Jan 20 13:28:30 ip-172-31-123-45.ec2.internal dockerd[4660]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.MakeErrorHandler.func1 (httputils.go:107)
Jan 20 13:36:03 ip-172-31-123-45.ec2.internal dhclient[3275]: XMT: Solicit on eth0, interval 129760ms.
Jan 20 13:28:30 ip-172-31-123-45.ec2.internal dockerd[4660]: http: superfluous response.WriteHeader call from github.com/docker/docker/api/server/httputils.MakeErrorHandler.func1 (httputils.go:107)

Finally, this may be the thing that ultimately disables the networking. It looks like `oom-killer` killed the `dhclient-script`, which maybe left the network in an very bad state:

Jan 20 15:28:36 ip-172-31-45-74.ec2.internal kernel: dhclient-script invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
Jan 20 15:28:36 ip-172-31-45-74.ec2.internal kernel: dhclient-script cpuset=/ mems_allowed=0

You can simply run

sudo dhclient eth0

to have it grab the default gateway from DHCP again. But its best to put other memory limits in place to prevent it from running out of resources to begin with.

Leave a Reply

Your email address will not be published.