Investigating issues with new deploys
Incident Report for Meteor Cloud
Postmortem

We run an internal Docker Registry in each region, and each registry uses a self-signed certificate. The certificates for the US and EU regions expired yesterday, so pushing new images (new deploys) and pulling images (new containers) failed.

These certificates expire every 5 years, and we had no monitor tracking them.
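
As an illustration of the kind of check that was missing, here is a minimal sketch in Python. It assumes the certificate's end date is available as text in the form printed by `openssl x509 -enddate -noout`. The function names, the 30-day warning window, and the example date are hypothetical and are not taken from our actual tooling.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` until the certificate's notAfter time (negative if expired).

    `now` must be timezone-aware so the comparison is done in UTC.
    """
    # Accept either "notAfter=Jan 10 00:00:00 2030 GMT" or the bare date string.
    value = not_after.strip().split("=", 1)[-1]
    expires_at = ssl.cert_time_to_seconds(value)  # seconds since the epoch, UTC
    return (expires_at - now.timestamp()) / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within `warn_days` (hypothetical window)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical end date, not the real certificate's value.
    sample = "notAfter=Jan 10 00:00:00 2030 GMT"
    current = datetime.now(timezone.utc)
    print(f"days left: {days_until_expiry(sample, current):.1f}")
    print(f"needs renewal: {needs_renewal(sample, current)}")
```

A check like this would run on a schedule for each region's registry certificate and notify us once `needs_renewal` becomes true.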

A few apps went down before the certificates were renewed. Our cluster scales down by turning off machines when there is a lot of spare capacity. If an app was running only one container, and that container was on a machine turned off during a scale-down, the container had to start again on a new machine. Starting it required pulling the image, and the pull failed because of the expired certificate, so the app could not start a new container and stayed down.
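
The failure mode above can be modelled with a small Python sketch. This is a simplified illustration, not our scheduler: the app and machine names are made up, and it assumes every container from the removed machine restarts elsewhere whenever image pulls work.

```python
def apps_down_after_scale_down(placements, removed_machine, pulls_work):
    """Return the sorted names of apps left with zero running containers.

    placements: dict mapping app name -> list of machine names,
                one entry per running container (hypothetical data model).
    """
    down = []
    for app, machines in placements.items():
        survivors = [m for m in machines if m != removed_machine]
        lost = len(machines) - len(survivors)
        # Containers from the removed machine come back only if the image pull succeeds.
        running = len(survivors) + (lost if pulls_work else 0)
        if running == 0:
            down.append(app)
    return sorted(down)


placements = {
    "app-a": ["m1"],        # single container, on the machine being removed
    "app-b": ["m1", "m2"],  # one container survives on m2
    "app-c": ["m2"],        # not on the removed machine
}

if __name__ == "__main__":
    print(apps_down_after_scale_down(placements, "m1", pulls_work=False))
    print(apps_down_after_scale_down(placements, "m1", pulls_work=True))
```

In this model only the single-container app on the removed machine goes down while pulls are failing, which matches why only a few apps were affected.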

Actions to avoid this in the future:
1 - We are going to add monitors for these certificates, as we already have for all our other certificates.
2 - We are going to lower the level of "accepted" errors in the monitors of the services that a) build new images and b) start new containers. These monitors didn't fire during this incident because only a few apps were affected. With these changes, this issue is not going to happen again, and if something similar happens we will be notified sooner, even if only a few apps are affected.
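
To show why a lower "accepted" error level matters for the second action, here is a minimal Python sketch of a rate-based alert. The function name and all numbers are hypothetical and do not reflect our real monitor configuration.

```python
def should_alert(failed: int, total: int, max_error_rate: float) -> bool:
    """Fire when the share of failed operations exceeds the accepted error rate."""
    if total == 0:
        return False  # nothing happened in this window, nothing to alert on
    return failed / total > max_error_rate


if __name__ == "__main__":
    # Hypothetical window: 3 failed container starts out of 1000.
    print(should_alert(3, 1000, max_error_rate=0.05))   # accepted level of 5%
    print(should_alert(3, 1000, max_error_rate=0.001))  # accepted level of 0.1%
```

With the looser level the three failures stay under the threshold and nobody is notified. With the tighter level the same three failures trigger the alert.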

Posted Jul 19, 2021 - 11:21 EDT

Resolved
This incident has been resolved.
Posted Jul 18, 2021 - 14:25 EDT
Monitoring
We've replaced the certificates.

Now we are replacing the app machines. This is not immediate, as we don't want to kill many containers at the same time, but new containers will be created on the new machines, which already have the new certificates to access the registry.

Deploys should be working fine now as well.
Posted Jul 18, 2021 - 13:00 EDT
Update
Galaxy AP was not affected by this incident.
Posted Jul 18, 2021 - 12:16 EDT
Identified
We have a self-signed certificate in our Docker Registry that needs to be renewed every 5 years, and it expired last night.

We are updating these certificates now.

We will post more details here later.

This is affecting new deploys and new containers as any action requiring a Docker image from our Registry is failing.
Posted Jul 18, 2021 - 11:55 EDT
Investigating
We are investigating issues with new deploys.
Posted Jul 18, 2021 - 11:26 EDT
This incident affected: Galaxy US infrastructure, Galaxy EU Infrastructure, and Galaxy AP Infrastructure.