All Systems Operational

About This Site

This status page describes the current operational status of Meteor and Galaxy systems.

Note that some aspects of Meteor and Galaxy depend on third-party systems, which may have their own outages. While we try to update this page when we learn about outages in systems we depend on, the status pages for those systems may be updated more quickly. These include:

• Auto-renewing SSL certificates are provided by Let's Encrypt: https://letsencrypt.status.io/
• Support requests are handled by Zendesk: enter "meteor.zendesk.com" at https://status.zendesk.com/

The status of Apollo Engine is tracked at: http://status.apollographql.com/

Galaxy US Infrastructure Operational
Galaxy management interface Operational
Meteor Developer Accounts server Operational
Meteor package server Operational
Galaxy EU Infrastructure Operational
Galaxy AP Infrastructure Operational
Past Incidents
Mar 26, 2019

No incidents reported today.

Mar 25, 2019

No incidents reported.

Mar 24, 2019

No incidents reported.

Mar 23, 2019

No incidents reported.

Mar 22, 2019

No incidents reported.

Mar 21, 2019

No incidents reported.

Mar 20, 2019
Resolved - We have confirmed with AWS support that this outage resulted from an issue with the ECS service in us-east-1 during that time.

We will investigate whether Galaxy could have done more to prevent downtime in this case. This particular outage prevented us from starting and stopping tasks, but tasks that were already running were unaffected. The downtime resulted because our scheduler replaced the machines that were having trouble talking to ECS. This is a normal operation that generally preserves cluster health rather than harming it, but we will consider limiting how often it can occur within a bounded amount of time.
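
Bounding how often automatic machine replacement can happen amounts to a sliding-window rate limit on a remediation action. The sketch below is purely illustrative and is not Galaxy's actual scheduler code; the ReplacementLimiter class, its thresholds, and the helper callbacks are hypothetical names chosen for this example.

// Hypothetical sliding-window bound on automatic machine replacement (TypeScript).
// All names and thresholds are illustrative; they do not reflect Galaxy internals.

interface Machine {
  id: string;
}

class ReplacementLimiter {
  private replacementTimes: number[] = [];

  constructor(
    private maxReplacements: number, // e.g. allow at most 3 replacements...
    private windowMs: number         // ...per 10-minute window
  ) {}

  /** Returns true if replacing another machine stays within the bound. */
  canReplace(now: number = Date.now()): boolean {
    // Drop replacements that have aged out of the sliding window.
    this.replacementTimes = this.replacementTimes.filter(
      (t) => now - t < this.windowMs
    );
    return this.replacementTimes.length < this.maxReplacements;
  }

  recordReplacement(now: number = Date.now()): void {
    this.replacementTimes.push(now);
  }
}

// Replace unhealthy machines only while under the bound; once the bound is
// hit, stop and alert a human, since mass failures more likely indicate an
// upstream outage (such as the ECS issue above) than isolated bad machines.
function handleUnhealthyMachines(
  unhealthy: Machine[],
  limiter: ReplacementLimiter,
  replace: (m: Machine) => void,
  alertOperator: (msg: string) => void
): void {
  for (const machine of unhealthy) {
    if (limiter.canReplace()) {
      replace(machine);
      limiter.recordReplacement();
    } else {
      alertOperator(
        `Replacement bound reached; leaving ${machine.id} in place for manual review`
      );
      break;
    }
  }
}

The intent of a bound like this is that isolated machine failures are still remediated automatically, while a burst of simultaneous failures pauses the automation and escalates to an operator instead of churning a large fraction of the cluster at once.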

As mentioned before, Galaxy customers who need to avoid downtime even when machines fail should run their apps in High Availability mode with at least three containers.
Mar 20, 14:42 PDT
Monitoring - Between 5:00 AM and 5:40 AM (America/Los_Angeles) this morning, about 15% of the machines in our US cluster became unable to connect to the AWS ECS master service, which we use to coordinate running containers. Galaxy's scheduler detected these machine failures and replaced the machines, but during the incident we were unable to schedule new large containers due to capacity issues, as the number of broken machines was larger than our usual capacity buffer.

Machine failures are a normal occurrence in all distributed systems, and customers who require 100% uptime for their production loads must run more than one container, ideally at least three. In general, Galaxy does not consider "my single-container app had downtime due to a machine failure" to be a problem that Galaxy itself can or should prevent.

That said, multiple machine failures (across AWS availability zones) are not common in Galaxy except in the case of an outage in our service provider (AWS). While our investigations so far suggest that the root cause was an AWS ECS outage, AWS has not reported any issues on their status page. We are in contact with AWS support to try to track down the root cause.
Mar 20, 08:02 PDT
Mar 19, 2019

No incidents reported.

Mar 18, 2019

No incidents reported.

Mar 17, 2019

No incidents reported.

Mar 16, 2019

No incidents reported.

Mar 15, 2019

No incidents reported.

Mar 14, 2019

No incidents reported.

Mar 13, 2019

No incidents reported.

Mar 12, 2019

No incidents reported.