During yesterday's US cluster scheduled maintenance, an outage caused an interruption in service for some of our customers. Some of our new app machines were unable to run containers, which led to some apps running fewer than their desired number of containers, and in some cases to downtime.
Your Galaxy apps run on a set of AWS EC2 machines, managed with the AWS ECS agent and a custom scheduler. We need to occasionally replace these machines in order to protect our users from Linux security holes and upgrade other aspects of the system.
Our current policy is to do app machine replacements in a pre-scheduled maintenance window outside of business hours in the region's local time zone. We have this policy even though, under normal circumstances, app machine replacements cause no downtime for properly configured Galaxy apps: before stopping containers on a machine and terminating it, the Galaxy scheduler starts replacement containers on a new machine, so apps see zero downtime.
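For readers curious about the mechanism, here is a rough sketch of that replacement flow under normal conditions. It is illustrative pseudologic only, not our scheduler's actual code, and all of the names below are made up:

```go
package main

import "fmt"

// Machine is a stand-in for an app machine in this sketch.
type Machine struct{ ID string }

// replaceMachine illustrates the zero-downtime replacement order described
// above: start replacements first, then stop the originals, then terminate
// the old machine. This is not our scheduler's actual code.
func replaceMachine(old, fresh Machine, containers []string) {
	// 1. Start replacement containers on the new machine first.
	for _, c := range containers {
		fmt.Printf("starting %s on %s\n", c, fresh.ID)
	}
	// 2. Only once the replacements are running do we stop the originals.
	for _, c := range containers {
		fmt.Printf("stopping %s on %s\n", c, old.ID)
	}
	// 3. Finally, terminate the now-empty old machine.
	fmt.Printf("terminating %s\n", old.ID)
}

func main() {
	replaceMachine(Machine{"i-old"}, Machine{"i-new"}, []string{"app-1", "app-2"})
}
```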
We chose this policy for two reasons. First, when we first launched Galaxy several years ago, our machine replacement code had a few specific bugs that led to app downtime. That class of issues has not been a problem for the past year. Second, some of our customers want to know explicitly why their containers were terminated. Previously, announcing scheduled maintenances was the only way they could learn that machines were being replaced. We recently improved this by adding explanations to app logs stating whether a container died due to machine replacement, app deployment, or another reason.
The policy of non-business-hours machine replacements puts a big constraint on how we replace machines: it requires us to ensure that all machines can be replaced in a single overnight period (or use a more elaborate system that pauses during the day, or do replacements on weekends). Specifically, assuming we want to continue to replace machines one at a time (a good heuristic for limiting how many containers for a given app can be restarted at once), this limits the amount of time we can wait for a given machine to restart. We have been focused on ensuring that the process of moving containers from one machine to another can be fast, when arguably the way to have the least impact on our users' apps is to ensure that this process is slow and gradual.
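To make that constraint concrete, here is a back-of-the-envelope calculation. The machine count and window length below are hypothetical, purely for illustration:

```go
package main

import "fmt"

// A back-of-the-envelope look at the time budget imposed by an overnight,
// one-machine-at-a-time replacement policy. The numbers are hypothetical.
func main() {
	const machines = 200         // machines in the cluster (hypothetical)
	const windowMinutes = 8 * 60 // length of the overnight window, in minutes (hypothetical)

	budget := float64(windowMinutes) / float64(machines)
	fmt.Printf("time budget per machine: %.1f minutes\n", budget)
	// With only a couple of minutes per machine, container moves have to be
	// fast, and timeouts for a machine that is slow to drain have to be short.
}
```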
We recently learned that container startup times on Galaxy were starting to degrade. We investigated this and found one major cause: the ECS agent, which is responsible for telling Docker which containers to run, would not run more than one “docker pull” command at a time. Our measurements showed that running “docker pull” in parallel should improve container start times, and we were happy to see that the ECS maintainers had added concurrent “docker pull” support to the ECS agent a few months ago. Our tests in our QA environment showed that this sped up both normal container starts and the deluge of container starts that happens when we replace app machines, so we scheduled a maintenance window to roll out this change. Our deployments to the smaller AP and EU clusters went smoothly, just like in our QA environment.
Unfortunately, the deployment to the US cluster did not go as well. Most machines loaded their containers smoothly and quickly, but a small number of them failed. The change to the ECS agent switched it from running no concurrent “docker pull” commands to running an unlimited number of them concurrently, a fact our engineers did not catch when we audited the ECS agent changes.
The command “docker pull” works (roughly) by downloading any layers that don't already exist to a temporary directory, and then applying them to the overlay directory to form an image store. Layers that exist when “docker pull” starts don't need to be written to the temporary directory, and we believe that layers are de-duplicated when written to the overlay directory. However, if a large number of “docker pull” commands run in parallel, they can all try to download the same layers to the temporary directory. If those include a large base layer, the temporary directory can easily fill up, and Docker does not handle this situation well.
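Here is a rough sketch, with entirely hypothetical numbers, of why an unlimited number of concurrent pulls can exhaust that temporary directory:

```go
package main

import "fmt"

// A rough, hypothetical worst-case estimate of temporary-directory usage when
// many "docker pull" commands race to download the same shared base layer.
// The figures below are illustrative, not measurements from our cluster.
func main() {
	const baseLayerGB = 0.8    // size of a large shared base layer, in GB (hypothetical)
	const concurrentPulls = 20 // containers starting at once on a fresh machine (hypothetical)
	const tmpDiskGB = 10.0     // space available for the temporary download directory (hypothetical)

	// Until layers land in the overlay store and are de-duplicated, each
	// in-flight pull may hold its own copy of the base layer in the temp directory.
	worstCaseGB := baseLayerGB * concurrentPulls
	fmt.Printf("worst-case temp usage: %.1f GB (available: %.1f GB)\n", worstCaseGB, tmpDiskGB)
	if worstCaseGB > tmpDiskGB {
		fmt.Println("the temp directory can fill up before any pull completes")
	}
}
```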
This occurred on about a quarter of the machines in our US non-IP-whitelisting cluster during yesterday's deployment. Our scheduler was able to detect that these machines were broken and automatically replace them with new machines, but many of the new machines had the same problem. The timeouts we use for terminating a machine when not all of its containers can be moved off in time were configured to match an “all machine replacements must occur overnight” policy, and thus were too short to prevent killing working machines before their replacements were ready, so some apps did not have their requested number of containers running at all times. We believe this did not affect any machines with IP whitelisting enabled, and apps configured to run more containers were more likely to avoid downtime.
When our monitoring detected this problem, we realized we needed to fix our system to have a level of concurrency between “none” and “infinite”. While neither Docker nor amazon-ecs-agent lets us set that directly, we quickly rolled out a patch to our scheduler to enforce a finite level of concurrency, and the cluster eventually repaired itself.
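The general technique here is a counting semaphore around the work that triggers image pulls. The sketch below shows the idea using a buffered channel in Go; it is illustrative only, not our actual scheduler patch, and the image names and concurrency limit are made up:

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// pullImagesBounded pulls a set of images while allowing at most maxConcurrent
// "docker pull" commands to run at once. This is a minimal sketch of the
// bounded-concurrency idea, not our actual scheduler code.
func pullImagesBounded(images []string, maxConcurrent int) {
	sem := make(chan struct{}, maxConcurrent) // counting semaphore
	var wg sync.WaitGroup

	for _, image := range images {
		wg.Add(1)
		go func(image string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks if maxConcurrent pulls are in flight
			defer func() { <-sem }() // release the slot when this pull finishes

			if err := exec.Command("docker", "pull", image).Run(); err != nil {
				fmt.Printf("pull %s failed: %v\n", image, err)
			}
		}(image)
	}
	wg.Wait()
}

func main() {
	// Hypothetical usage: pull three images with at most two concurrent pulls.
	pullImagesBounded([]string{"alpine:3.19", "nginx:stable", "redis:7"}, 2)
}
```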
We're deeply sorry that some of our users experienced downtime due to this issue, and we're taking immediate corrective action. Specifically, we are going to rework our system so that app machine replacement can be a much slower process that goes on over a period of days rather than a single overnight period. This will allow us to reschedule a given machine's containers over a period measured in hours rather than minutes. The timeout we apply to turning off a machine will be increased, so that even if there are temporary problems rescheduling containers, we can resolve them long before the machine is terminated.
Moving away from non-business-hours pre-scheduled maintenances will also allow us to perform the first 10-12 hours of an app machine replacement during times when at least one of our US- or Australia-based engineers is awake and working, rather than having most problems occur during the hours when we are either asleep, away from our computers, or least effective. Another advantage of this new policy is that we will be able to roll out fixes for important Linux kernel security issues as soon as they are announced and we have tested them, rather than waiting a longer period during which customer apps are vulnerable. This change has been requested by users and we are happy to deliver it to improve our service.
Another reason we currently announce app machine replacements is an unfortunate aspect of our deploy process. If you deploy a new version of your app and it builds successfully on our image builder machines but every attempt to actually run its containers immediately crashes, we will continue running your older version and trying to run the new version indefinitely. However, if the old containers stop for any reason (including machine replacement), Galaxy will only try to run containers of the new non-starting version, never of the old working version. A small percentage of our users' apps are generally in this state, where they appear to be running properly but will go down if we replace app machines. Before we change our app machine replacement policy, we will fix this by giving new app versions a limited time to deploy successfully, and keeping the old version as the “active” version if the new version can't start.
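The intended behavior can be summarized in a few lines of pseudologic. The sketch below is illustrative only; the function names, the health check, and the deadline are hypothetical, not the actual implementation:

```go
package main

import (
	"fmt"
	"time"
)

// promoteIfHealthy captures the intended policy: give a newly deployed version
// a limited window to start successfully, and keep the old version as the
// "active" one if it can't. Everything here is hypothetical, for illustration.
func promoteIfHealthy(newVersionHealthy func() bool, deadline time.Duration) string {
	timeout := time.After(deadline)
	tick := time.Tick(time.Second)

	for {
		select {
		case <-tick:
			if newVersionHealthy() {
				return "new" // the new version started successfully; make it the active version
			}
		case <-timeout:
			// The new version never started within the window; keep the old
			// version active so machine replacements don't take the app down.
			return "old"
		}
	}
}

func main() {
	// Hypothetical usage: a new version that never becomes healthy.
	active := promoteIfHealthy(func() bool { return false }, 3*time.Second)
	fmt.Println("active version:", active)
}
```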
At MDG we pride ourselves on providing good uptime for our users' apps without making you think too much about the deployment process under the hood. We failed you last night by overwhelming our machines with rushed machine replacements. We are learning from this mistake not just by fixing the precise bug that caused this outage, but also by re-evaluating the policies that led us to build infrastructure focused on quick rather than gradual machine replacement.
As a secondary effect, this outage caused a Meteor APM aggregator job to fall behind, leading to a temporary delay in aggregated metrics (though no data loss). The job has now caught up. We had already identified this job (inherited from the Kadira project) as something that needed to be rebuilt on top of a different storage platform, and that project is already in progress.