Today (September 1st), at around 02:14 UTC, our App scheduler service in us-east-1 was not working properly.
We roll out updates almost every week, and the last update didn't terminate all of the old scheduler machines. Scheduler machines are the ones that coordinate the start and stop actions of containers and host machines.
As a result, some scheduler machines were still running with old configurations and were not working as expected. This caused some containers to be replaced incorrectly, without respecting our policy of always having good containers running first and only then killing the old ones.
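The "good containers first, then kill the old ones" policy can be illustrated with a minimal sketch. This is not the actual scheduler code: the function name `replace_containers` and the callbacks `start_new`, `is_healthy` and `stop` are hypothetical and exist only for this example.

```python
from typing import Callable, List


def replace_containers(
    old: List[str],
    start_new: Callable[[], str],
    is_healthy: Callable[[str], bool],
    stop: Callable[[str], None],
) -> List[str]:
    """Replace each old container only after its replacement is healthy.

    All names here are hypothetical; this only illustrates the policy.
    Returns the IDs of the containers left running.
    """
    running = list(old)
    for old_id in old:
        new_id = start_new()
        if not is_healthy(new_id):
            # The replacement is not good: discard it and keep the old one.
            stop(new_id)
            continue
        running.append(new_id)
        # Only now is it safe to stop the old container.
        stop(old_id)
        running.remove(old_id)
    return running
```

With this ordering, an app never drops below its original number of running containers, even when a replacement fails to become healthy.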
This problem affected a few apps and, in some cases, caused about 6 minutes of downtime because all of the containers were replaced.
We are really sorry for the trouble we have caused. We have already changed our process to double-check, at the end of every update, all the resources that should be destroyed by Terraform.
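One possible shape for such a check is sketched below, assuming the list of resource addresses expected to be destroyed is known in advance and that the current addresses are collected separately (for example from the output of `terraform state list`). The function name `leftover_resources` and the example addresses are hypothetical, not a description of the actual process.

```python
from typing import Iterable, List


def leftover_resources(
    expected_destroyed: Iterable[str],
    state_addresses: Iterable[str],
) -> List[str]:
    """Return resources that should be gone but still appear in the state.

    Both arguments are plain resource addresses (hypothetical examples:
    "aws_instance.scheduler_old[0]"). An empty result means nothing is left over.
    """
    still_present = set(state_addresses)
    return sorted(addr for addr in expected_destroyed if addr in still_present)


# Hypothetical usage: fail the update if any old scheduler machine remains.
leftovers = leftover_resources(
    ["aws_instance.scheduler_old[0]", "aws_instance.scheduler_old[1]"],
    ["aws_instance.scheduler_new[0]", "aws_instance.scheduler_old[1]"],
)
if leftovers:
    print("Resources not destroyed:", ", ".join(leftovers))
```

A check like this only compares addresses; it does not confirm that the underlying machines were actually terminated by the cloud provider.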