Between 4:00 PM and approximately 5:36 PM PST on Tuesday, November 23, we experienced an outage on most Coinbase production systems. During this outage, users were unable to access Coinbase using our websites and apps, and therefore were unable to use our products. This post describes what happened and why, and discusses how we plan to avoid such problems in the future.
On November 23, 2021, at 4:00 PM PT (November 24, 2021 00:00 UTC), the SSL certificate for an internal hostname on one of our Amazon Web Services (AWS) accounts expired. The expired certificate was used by many of our internal load balancers, which caused the majority of connections between services to fail. Because our API routing layer communicates with backend services via subdomains of this internal hostname, about 90% of incoming API traffic returned errors.
Error rates returned to normal once we were able to migrate all load balancers to a valid certificate.
Context: Certificates at Coinbase
It’s helpful to provide some background on how we manage SSL certificates at Coinbase. For the most part, certificates for public hostnames like coinbase.com are managed and provisioned by Cloudflare. For certificates for internal hostnames used to route traffic between backend services, we have historically used AWS IAM Server Certificates.
One downside of IAM Server Certificates is that certificates must be generated outside of AWS and uploaded via an API call. So in the last year, our infrastructure team migrated from IAM Server Certificates to AWS Certificate Manager (ACM). ACM solves this security issue because AWS generates both the public and private components of the certificate within ACM and stores the encrypted version in IAM for us. Only connected services such as CloudFront and Elastic Load Balancers can access the certificates. Denying the acm:ExportCertificate permission to all AWS IAM roles ensures that the certificates cannot be exported.
In addition to these security benefits, ACM also automatically renews certificates prior to expiration. Given that ACM certificates are supposed to renew automatically and we had done the migration, how did this happen?
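As an illustration of the export restriction described above, the following is a minimal sketch of an IAM policy document that denies `acm:ExportCertificate`. How the deny was actually attached (for example, per role or at the organization level) is not stated in this post, so the function name, the `Sid`, and the wildcard resource are assumptions made for the example.

```python
import json


def deny_acm_export_policy() -> dict:
    """Build a policy document that denies exporting ACM certificates.

    Hypothetical helper: the Sid and the "*" resource scope are
    illustrative choices, not taken from the incident write-up.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAcmCertificateExport",
                "Effect": "Deny",
                "Action": "acm:ExportCertificate",
                "Resource": "*",
            }
        ],
    }


# Render the document as JSON, the form in which IAM policies are written.
print(json.dumps(deny_acm_export_policy(), indent=2))
```

An explicit `Deny` takes precedence over any `Allow` in IAM policy evaluation, which is why a single deny statement is enough to block the action for the principals it applies to.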
Root Cause Analysis
Incident responders quickly noticed that the expired certificate was an IAM Server Certificate. This was unexpected because the ACM migration mentioned above had been widely publicized in engineering communication channels at the time, so we had been operating under the assumption that we were running exclusively on ACM certificates.
As we later found out, one of the certificate migrations didn’t go as planned. The group of engineers working on the migration uploaded a new IAM certificate and postponed the rest of the migration. Unfortunately, the delay was not communicated as widely as it should have been, and changes to team structure and personnel led to the project being incorrectly assumed complete.
Migration status aside, you may ask the same question we asked ourselves: “Why weren’t we notified that this certificate was expiring?” Answer: We were. The alerts were sent to an email distribution group that we found contained only two people. This group was originally larger, but it shrank as team members left and was not adequately repopulated as new members joined the team.
In short, a critical certificate was allowed to expire due to three factors:
- The migration from IAM to ACM was incomplete.
- Expiry alerts were sent via email only and were either filtered or ignored.
- There were only two people on the email distribution list.
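The first factor, a leftover IAM Server Certificate that nobody expected to exist, is the kind of thing an inventory check can surface. Below is a minimal sketch, assuming certificate metadata shaped like the entries returned by IAM's `ListServerCertificates` call (`ServerCertificateName` and a timezone-aware `Expiration`). The function name, the 30-day window, and the sample certificate names are hypothetical; only the expiry time of November 24, 2021 00:00 UTC comes from this post.

```python
from datetime import datetime, timedelta, timezone


def expiring_iam_certificates(metadata_list, now, warn_within=timedelta(days=30)):
    """Return (name, time_left) pairs for certificates expiring within
    `warn_within` of `now`, soonest first. Already-expired certificates
    are included, with a negative time_left."""
    flagged = []
    for meta in metadata_list:
        time_left = meta["Expiration"] - now
        if time_left <= warn_within:
            flagged.append((meta["ServerCertificateName"], time_left))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical inventory; names are made up for the example.
sample = [
    {
        "ServerCertificateName": "internal-wildcard",
        "Expiration": datetime(2021, 11, 24, tzinfo=timezone.utc),
    },
    {
        "ServerCertificateName": "long-lived",
        "Expiration": datetime(2022, 6, 1, tzinfo=timezone.utc),
    },
]

check_time = datetime(2021, 11, 1, tzinfo=timezone.utc)
for name, left in expiring_iam_certificates(sample, check_time):
    print(f"{name}: {left.days} days until expiry")
```

In a setup that is supposed to run exclusively on ACM, any entry in this inventory is arguably worth flagging regardless of its expiry date.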
Resolution and Improvements
To resolve this incident, we migrated all load balancers that were using the expired IAM certificate to the existing auto-renewing ACM certificate that had been provisioned as part of the original migration plan. This took longer than we would have liked due to the number of load balancers involved and our caution in identifying, testing and implementing the required infrastructure changes.
To ensure we never encounter an issue like this again, we have taken the following steps to address the factors mentioned in the RCA section above:
- We have completed the migration to ACM, are no longer using IAM server certificates and are deleting any old certificates to reduce noise.
- We are adding automated monitoring, connected to our alerting and paging system, to augment the email alerts. It will page upon imminent expiration as well as when ACM certificates are no longer eligible for automatic renewal.
- We have added a permanent group alias to our email distribution list. Moreover, this group is automatically updated when employees join and leave the company.
- We are building a repository of incident handling processes to reduce the time needed to identify, test and implement new changes.
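The paging rule described in the monitoring step above can be sketched as a small decision function. This is an illustrative sketch only, not our actual monitoring code: it assumes certificate descriptions shaped like the `Certificate` object from ACM's `DescribeCertificate` response (`NotAfter` as a timezone-aware datetime and `RenewalEligibility` as `ELIGIBLE` or `INELIGIBLE`), and the function name, the 14-day threshold, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def page_reasons(cert, now, expiry_threshold=timedelta(days=14)):
    """Return the list of reasons this certificate should page on-call.

    An empty list means no page is needed.
    """
    reasons = []
    # Condition 1: the certificate expires within the threshold (or already has).
    if cert["NotAfter"] - now <= expiry_threshold:
        reasons.append("expiring soon")
    # Condition 2: ACM reports that it will not renew this certificate itself.
    if cert.get("RenewalEligibility") == "INELIGIBLE":
        reasons.append("not eligible for automatic renewal")
    return reasons


# Hypothetical certificate descriptions for the example.
current_time = datetime(2021, 11, 20, tzinfo=timezone.utc)
healthy = {
    "DomainName": "internal.example",
    "NotAfter": datetime(2022, 11, 20, tzinfo=timezone.utc),
    "RenewalEligibility": "ELIGIBLE",
}
at_risk = {
    "DomainName": "legacy.internal.example",
    "NotAfter": datetime(2021, 11, 24, tzinfo=timezone.utc),
    "RenewalEligibility": "INELIGIBLE",
}

for cert in (healthy, at_risk):
    print(cert["DomainName"], page_reasons(cert, current_time))
```

Returning the reasons rather than a bare boolean lets the page say why it fired, which matters when the responder has never seen the certificate before.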
We take the uptime and performance of our infrastructure very seriously, and work hard to support the millions of customers who choose Coinbase to manage their cryptocurrency. If you are interested in solving challenges like the ones described here, come work with us.
Incident Post Mortem: November 23, 2021 was originally published on the Coinbase Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.