A friend tagged me yesterday on LinkedIn with an update that Microsoft Teams - a team communication service, something like Slack - had gone down due to an expired certificate. How can this even happen?
Update Feb 5: I received feedback pointing to more expired Microsoft certificates: markets.books.microsoft.com (e-books in the MS app store, no longer supported) and vlscppe.microsoft.com (volume licensing). The latter showed as expired last night but now has a new certificate installed (issued 5 days ago). Both are API endpoints.
I am sure there will be a post-mortem pointing the finger at a number of "regular" near misses that lined up and let a certificate slip through the net. I was curious, though, what users think. The best place for immediate feedback is Twitter, so I spent a little time scrolling through hundreds of tweets.
Interestingly, a lot of people were forgiving and simply said that it could happen to anyone, and they were right. DEFCON is synonymous with the most hard-core geeks, hackers, and security experts. Still, their website fell victim to an expired certificate in mid-2017. They did renew the certificate (as you can see in public CT logs); they just didn't manage to install it properly.
Figure: DefCon website down due to expired certificate.
A large portion of people on Twitter tried to suggest "a simple solution" - calendar reminders, TO-DO tasks. The trouble is that these would have their deadline in a year or two - enough time to cause trouble again when the time comes. With the DEFCON incident in mind, it's actually much harder than it seems.
Figure: Status of "teams.microsoft.com" domain as shown in our domain audit tool.
You can see that it has 19 Critical items (expired, or expiring within a week) and 4 imminent ones (expiring in 7-14 days). You can also see that there are 512 services, i.e., domain names under "teams.microsoft.com", with either a valid certificate or a certificate that expired in the last 4 weeks.
500 is a huge number from a management point of view. Many of them will be abandoned: domains created for a particular project that has finished, temporary domains, or simply services that are no longer relevant. It means that someone has to be able to say whether a particular domain is still "live" or whether it can be ignored. In my opinion, domains with certificates should only be treated as abandoned once they have been explicitly declared as such.
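The bucketing described above can be sketched in a few lines. This is only an illustration, not how the audit tool is actually implemented: the function name `classify_expiry` and the exact boundary handling (under 7 days is critical, under 14 days is imminent) are assumptions.

```python
from datetime import datetime, timedelta, timezone


def classify_expiry(not_after, now):
    """Return a bucket name for a certificate based on its expiry time.

    Hypothetical thresholds, taken from the description in the text:
    critical = already expired or expiring within a week,
    imminent = expiring in 7-14 days.
    """
    remaining = not_after - now
    if remaining < timedelta(days=7):
        return "critical"
    if remaining < timedelta(days=14):
        return "imminent"
    return "ok"


# Example: an audit run on 4 Feb 2020 (assumed date, for illustration only).
now = datetime(2020, 2, 4, tzinfo=timezone.utc)
expiry = datetime(2020, 2, 2, 12, 0, tzinfo=timezone.utc)
print(classify_expiry(expiry, now))
```

A negative remaining time (an already expired certificate) falls into the first bucket, which matches the "expired, or expiring within a week" definition.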
Let's quickly have a look at the certificates that expired (or probably expired - this is a quick audit and change of issuer may cause inaccuracies) in the last 4 weeks.
- auditservice-staging.teams.microsoft.com - 09 Jan 2020, 05:49
- auditservice.teams.microsoft.com - 17 Jan 2020, 03:29
- auditservice-int.teams.microsoft.com - 18 Jan 2020, 15:51
- *.urlp.gcc.teams.microsoft.com - 25 Jan 2020, 12:00
- urlp.gcc.teams.microsoft.com - 25 Jan 2020, 12:00
- stage.urlp.gcc.teams.microsoft.com - 25 Jan 2020, 12:00
- *.stage.urlp.gcc.teams.microsoft.com - 25 Jan 2020, 12:00
- eastus2.fabric.int.teams.microsoft.com - 28 Jan 2020, 12:56
- emailactions.teams.microsoft.com - 29 Jan 2020, 16:20
- emailactions-test.teams.microsoft.com - 29 Jan 2020, 16:20
- emailactions-int.teams.microsoft.com - 29 Jan 2020, 16:20
- retentionhook-int.teams.microsoft.com - 01 Feb 2020, 16:23
- retentionhook-test.teams.microsoft.com - 01 Feb 2020, 16:23
- retentionhook.teams.microsoft.com - 01 Feb 2020, 16:24
- *.smba.gcc.teams.microsoft.com - 02 Feb 2020, 12:00
- smba.gcc.teams.microsoft.com - 02 Feb 2020, 12:00
- cachewriter-int.teams.microsoft.com - 02 Feb 2020, 18:20
You can see that many of them are for test or staging systems - i.e., possibly safe to ignore. Some domains are not active (e.g., emailactions-int.teams.microsoft.com). Some, however, are hard to judge as they are publicly available and still expired (e.g., auditservice.teams.microsoft.com).
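A first pass of this triage can be done on names alone. The sketch below is a rough heuristic, assuming that markers such as "int", "test", "staging", and "stage" in a host name indicate a non-production system. The marker list and the function name are hypothetical, and a name says nothing about whether the service behind it is still live.

```python
import re

# Hypothetical environment markers; adjust to your own naming conventions.
NON_PRODUCTION_MARKERS = {"int", "test", "staging", "stage"}


def looks_non_production(domain):
    """Return True if any dot- or hyphen-separated token is a known marker."""
    tokens = re.split(r"[.-]", domain.lower())
    return any(token in NON_PRODUCTION_MARKERS for token in tokens)


expired = [
    "auditservice-staging.teams.microsoft.com",
    "auditservice.teams.microsoft.com",
    "*.stage.urlp.gcc.teams.microsoft.com",
    "eastus2.fabric.int.teams.microsoft.com",
]

for name in expired:
    label = "probably test/staging" if looks_non_production(name) else "needs review"
    print(f"{name}: {label}")
```

Anything that ends up as "needs review" still requires a person who knows the service to decide whether it matters.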
This is only a "domain name" view. Each domain can have a number of separate services: https, imap, message queues, databases, etc. This makes the management of expiry even more complicated. Further, the number of certificates never goes down. In my experience, 10% annual growth is not a bad place to be.
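To get a feel for what 10% annual growth means, here is a small compound-growth sketch. Applying the rate to a starting point of 500 certificates (the rough count from the audit above) is my assumption for illustration, and the function name is hypothetical.

```python
def projected_count(current, annual_growth, years):
    """Compound growth: current * (1 + annual_growth) ** years."""
    return current * (1 + annual_growth) ** years


# Assumed starting point: roughly 500 certificates, growing 10% a year.
for years in (1, 3, 5):
    print(years, round(projected_count(500, 0.10, years)))
```

Even at this modest rate, the inventory grows by more than half within five years.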
If you now think that managing certificates is hard, this is only the tip of the iceberg. Technology companies would have 10-100x more internal certificates securing their internal infrastructure. If we can see 500 domains within "teams.microsoft.com", it's very likely that there are another 5,000 to 50,000 certificates securing databases, APIs, inter-server communication, message queues, and so on.
The bottom line is that things can get very messy once the number of certificates grows beyond a couple of dozen (with Let's Encrypt) or something like 100 for long-term certificates. Still, one would think that Microsoft is close enough to technology to have mastered this side of its business by now.
There are many services that try to be too clever, with complicated management and workflow approaches. Based on my own experience, I decided that the best way to do it is to copy Google. Do you remember how it started? They wisely figured out that any data classification is hard and error-prone, and that it is much better to simply work with full-text data and search, and to build the logic centrally. That's what we are doing with KeyChest, for companies of any size and any budget.