We run an internal Docker Registry in each region, and each Registry uses a self-signed certificate. The certificates for the US and EU regions expired yesterday, so pushing new images (new deploys) and pulling images (new containers) failed.
These certificates expire every 5 years, but we had no monitor tracking them.
A few apps were down while the certificates were not renewed. Our cluster scales down by turning off machines when we have a lot of spare capacity. If an app running only one container was on a machine that was turned off during a scale down, its container had to be started on a new machine, and that start failed because pulling the image failed due to the expired certificate.
These apps were down because they could not start new containers successfully.
Actions to avoid this in the future:
1 - We are going to add monitors for these certificates, as we already have for all our other certificates.
2 - We are going to lower the level of "accepted" errors in the monitors of the services that a) build new images and b) start new containers. These monitors didn't fire because only a few apps were affected. With the lower threshold, this issue should not happen again, and if something similar happens we will be notified sooner even if just a few apps are affected.
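
To illustrate the kind of check a certificate monitor (action 1) could run, here is a minimal sketch in Python. It is not our actual monitor. The 30-day threshold, the function names, and the idea of passing the Registry's self-signed certificate as a PEM file are all assumptions made for the example, and it also assumes the certificate lists the Registry hostname.

```python
import socket
import ssl
import time

WARN_DAYS = 30  # assumed alert threshold, not a value from our setup


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired).

    not_after uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_alert(days_left, warn_days=WARN_DAYS):
    """True when the certificate is expired or expires within warn_days."""
    return days_left <= warn_days


def fetch_not_after(host, port, ca_file):
    """Return the notAfter string of the certificate served at host:port.

    ca_file is the self-signed certificate in PEM form, so the handshake is
    verified against it. If the certificate has already expired, the
    handshake raises ssl.SSLCertVerificationError, which the caller should
    treat as an alert as well.
    """
    context = ssl.create_default_context(cafile=ca_file)
    with socket.create_connection((host, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call fetch_not_after once per region, pass the result to days_until_expiry, and notify whenever needs_alert returns True or the handshake itself fails.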