[EU Region] Issues with apps starting
Incident Report for Meteor Cloud
Postmortem

This incident stemmed from a problem in the low levels of the Linux stack that we use to run your apps and our services: systemd's journald, the daemon that collects logs from everywhere on the system, crashed with an assertion failure, possibly due to an unusual configuration we rolled out recently.

We didn't have monitoring for this particular failure case because we didn't expect this low-level component to fail. Our monitoring caught some symptoms of the failure, but it took longer to alert than it would have if we had been monitoring for the failure directly, and so our post to status.meteor.com was delayed. Additionally, while we did realize there was a problem with the machine, it seemed at first that fixing journald had fully fixed the machine; it took us a little longer to realize that some customer apps were still affected.

We immediately announced maintenance windows, rolled out a better journald configuration, and enabled direct monitoring of the daemon.
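
For readers curious what direct monitoring of a daemon like journald can look like, here is a minimal sketch. It is an illustration only, not the monitoring Meteor actually deployed. It assumes a host where the `systemctl` command is available and that the unit is named `systemd-journald.service`; the function names `query_unit_state` and `journald_is_healthy` are hypothetical.

```python
import subprocess
from typing import Callable


def query_unit_state(unit: str) -> str:
    """Return the state reported by `systemctl is-active` for a unit.

    Returns "unknown" if systemctl cannot be run or does not answer in time.
    """
    try:
        result = subprocess.run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
            timeout=10,
            check=False,  # a non-zero exit just means "not active"
        )
    except (OSError, subprocess.TimeoutExpired):
        return "unknown"
    return result.stdout.strip() or "unknown"


def journald_is_healthy(query: Callable[[str], str] = query_unit_state) -> bool:
    """Treat journald as healthy only when systemd reports it as active.

    The unit name is an assumption; `query` is injectable so the check
    can be exercised without a real systemd.
    """
    return query("systemd-journald.service") == "active"
```

A check like this would typically run on a schedule and page someone when it returns False, rather than waiting for downstream symptoms to trip other alerts.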

This issue only affected a small number of app-hosting machines. In general, machines can fail for many reasons outside of anyone’s control. We have several mechanisms for automatically fixing failed machines; depending on how the machines fail, we may be able to bring up a new container before the old container goes away, or that may be impossible (say, if the entire machine just crashes immediately). To the best of our knowledge, apps running with multiple containers in High Availability mode did not have downtime due to this outage.

Posted Aug 23, 2016 - 17:25 EDT

Resolved
We believe all apps are running successfully now.

We will probably schedule a maintenance window very soon to add further monitoring for the root cause of this morning's issue and deploy a potential fix.
Posted Aug 18, 2016 - 13:34 EDT
Identified
A fix has been identified and is being applied now.
Posted Aug 18, 2016 - 13:29 EDT