Multiple machine failures in US cluster
Incident Report for Meteor Cloud
Resolved
We have confirmed with AWS support that this outage resulted from an issue with the ECS service in us-east-1 during that time.

We will investigate whether Galaxy could have done more to prevent downtime in this case. This particular outage prevented us from starting and stopping tasks, but already-running tasks were unaffected. The downtime resulted because our scheduler replaced the machines that were having trouble talking to ECS. Replacing unhealthy machines is a normal operation that generally preserves cluster health rather than harming it, but we will consider ensuring that it cannot happen too many times within a bounded period.
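To illustrate the kind of safeguard we have in mind, here is a minimal sketch (in TypeScript, not Galaxy's actual scheduler code) of a limiter that caps how many machine replacements may happen within a sliding time window. All names, thresholds, and machine IDs below are illustrative assumptions.

// Sketch: bound machine replacements to a budget per time window.
class ReplacementLimiter {
  private recent: number[] = []; // timestamps of recent replacements

  constructor(
    private readonly maxReplacements: number, // budget per window
    private readonly windowMs: number         // window length in ms
  ) {}

  // True if another replacement still fits within the current window.
  canReplace(now: number = Date.now()): boolean {
    this.recent = this.recent.filter((t) => now - t < this.windowMs);
    return this.recent.length < this.maxReplacements;
  }

  record(now: number = Date.now()): void {
    this.recent.push(now);
  }
}

// Example: at most 3 replacements per 10 minutes. If many machines lose
// contact with ECS at once (as in this incident), the excess replacements
// would be deferred instead of churning the whole cluster.
const limiter = new ReplacementLimiter(3, 10 * 60 * 1000);
const unhealthyMachines = ["i-0aaa", "i-0bbb", "i-0ccc", "i-0ddd"]; // placeholder IDs

for (const machine of unhealthyMachines) {
  if (limiter.canReplace()) {
    console.log(`replacing ${machine}`); // stand-in for the real replace operation
    limiter.record();
  } else {
    console.log(`deferring ${machine} until the window clears`);
  }
}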

As mentioned before, Galaxy customers who need to avoid downtime even when machines fail should run their apps in High Availability mode with at least three containers.
Posted Mar 20, 2019 - 17:42 EDT
Monitoring
Between 5:00 AM and 5:40 AM (America/Los_Angeles) this morning, about 15% of the machines in our US cluster became unable to connect to the AWS ECS master service, which we use to coordinate running containers. Galaxy's scheduler detected these machine failures and replaced the machines, but during the incident we were unable to schedule new large containers, because the number of broken machines exceeded our usual capacity buffer.
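For readers curious why a spike of broken machines blocks new large containers, here is a simplified sketch (not Galaxy's real scheduler, and with illustrative numbers) of the capacity-buffer idea described above: once the count of unavailable machines exceeds the spare headroom normally kept in reserve, there is nowhere left to place new large containers until replacements come online.

// Sketch of the capacity-buffer check, with assumed field names and figures.
interface ClusterSnapshot {
  totalMachines: number;
  brokenMachines: number; // machines that cannot reach ECS
  bufferMachines: number; // spare machines kept as headroom
}

function canScheduleLargeContainer(cluster: ClusterSnapshot): boolean {
  // The buffer absorbs failures; once failures exceed it, the remaining
  // healthy machines are already committed to running containers.
  return cluster.brokenMachines <= cluster.bufferMachines;
}

// Roughly the shape of this incident: ~15% of machines broken, but a smaller
// fraction held in reserve (the buffer size here is an assumption).
const incident: ClusterSnapshot = {
  totalMachines: 100,
  brokenMachines: 15,
  bufferMachines: 5,
};

console.log(canScheduleLargeContainer(incident)); // false: buffer exceeded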

Machine failures are a normal occurrence in all distributed systems, and customers who require 100% uptime for their production loads must run more than one container, ideally at least three. In general, Galaxy does not consider "my single-container app had downtime due to a machine failure" to be a problem that Galaxy itself can or should prevent.

That said, multiple machine failures (across AWS availability zones) are not common in Galaxy except in the case of an outage in our service provider (AWS). While our investigations so far suggest that the root cause was an AWS ECS outage, AWS has not reported any issues on their status page. We are in contact with AWS support to try to track down the root cause.
Posted Mar 20, 2019 - 11:02 EDT