Let’s Encrypt automates certificate renewals. It sells the idea that you install a client and don’t have to think about it again. This requires reliability that has to start with Let’s Encrypt itself. We can now see that with KeyChest.
The problem with any automation is that it breaks without warning. Let’s Encrypt is brilliant at providing us with certificates. What it does not offer is any service beyond that, such as a service that would mitigate the risk of having six renewals instead of the usual one.
It is not unusual to build your own infrastructure monitoring, but what should you do when renewals suddenly start failing? Has something happened on your servers, or is the Let’s Encrypt certification authority down?
Let’s Encrypt has its own manually edited status page - https://letsencrypt.status.io. I have used it for a couple of uptime analyses before. It provides more detail than any other certification authority (CA) does, but it still leaves a lot of uncertainty.
We have now launched independent monitoring of the Let’s Encrypt CA. It continually requests certificates and stores detailed information about each response. This gives a detailed view of Let’s Encrypt performance.
Figure: Let’s Encrypt latency for the last 60 minutes.
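To make the "request and record" idea concrete, here is a minimal sketch of how one response could be timed and stored. It is not the KeyChest code: the record fields, the `timed_probe` name and the `request` callable are assumptions made for illustration only.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeRecord:
    step: str                   # e.g. "dir", "nonce", "new_account", "new_order"
    started_at: float           # Unix timestamp when the request was sent
    latency_s: float            # wall-clock time until a response or an error
    http_status: Optional[int]  # None when no HTTP response was received
    error: Optional[str]        # error text, or None on success


def timed_probe(step: str, request: Callable[[], int]) -> ProbeRecord:
    """Run one request and record its latency and outcome.

    `request` is a hypothetical callable that performs the HTTP call and
    returns the status code, raising an exception on connection failure.
    """
    started_at = time.time()
    t0 = time.perf_counter()
    status, error = None, None
    try:
        status = request()
        if status >= 400:
            error = f"HTTP {status}"
    except Exception as exc:  # connection refused, timeout, ...
        error = str(exc)
    latency = time.perf_counter() - t0
    return ProbeRecord(step, started_at, latency, status, error)
```

Passing the request in as a callable keeps the timing logic independent of whichever ACME client library actually talks to the CA.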
You can see the last hour of monitoring online at https://keychest.net/letsencrypt. You can also sign up for weekly reports and updates.
First Downtime Detection
When we were checking log files for errors on 25 June, we suddenly saw loads of failures. It didn’t occur to us at first that we were actually watching a Let’s Encrypt downtime in real time. I confess that we had to improve our data processing to see the downtime in the dashboard. Eventually we could see what the monitoring servers had detected.
Figure: The first detection of a Let’s Encrypt Downtime.
What I find amazing is that our servers detected the downtime 11 minutes before Let’s Encrypt did. Its official entry states that the incident started at 7:04am. We looked at our logs, and that was the moment when the internet-facing RESTful API started returning a new error after its caching was exhausted.
Here is a brief dissection of failures recorded by our monitoring servers.
What can we see in the test logs? Let's have a look at the Frankfurt server:
06:53:45 - failed to create a new_account - connection was refused
06:53:44 - new_order - the initial message starting an issuance transaction failed with an error message "No Key ID in JWS header" - HTTP code 400
(at this point "dir" and "nonce" request worked correctly)
06:54:07 - new_account fails again - the message is "Unknown desc = failed to select one blockedKeys" and also "Too many connections"
The server continued attempts to initiate renewals in 30 second intervals. Each attempt included:
- dir - request for a list of the API endpoints - correct response
- nonce - request for a new random transaction nonce - correct response
- new_account - "failed to select one blockedKeys" / "Too many connections"
- new_order - "No Key ID in JWS header"
The responses stayed the same until 07:06:38. At that point the "dir" command failed with HTTP code 503 - Service Unavailable, which suggests that Let's Encrypt had noticed a problem.
The "dir" command continued to fail until 07:16:53, when it returned the first correct response. The server got its first certificate finalized at 07:17:00.
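The Frankfurt log shows two distinct phases: first "dir" and "nonce" still answered while new_account and new_order failed, and later "dir" itself failed. A small sketch of how a single attempt could be classified along those lines follows. The state labels and the function name are my own, hypothetical choices, not something the monitoring servers are known to use.

```python
from typing import Dict

# The four steps of one attempt, in the order listed above.
STEPS = ("dir", "nonce", "new_account", "new_order")


def classify_attempt(ok: Dict[str, bool]) -> str:
    """Map per-step success flags to a coarse state (labels are hypothetical).

    A step missing from `ok` is treated as failed.
    """
    if all(ok.get(step, False) for step in STEPS):
        return "healthy"
    if not ok.get("dir", False):
        # Even the list of endpoints is unavailable (e.g. HTTP 503).
        return "api_unavailable"
    # The front end answers, but account or order processing fails.
    return "backend_failing"


# First phase: dir and nonce work, account and order requests fail.
early = {"dir": True, "nonce": True, "new_account": False, "new_order": False}
# Second phase: the dir request itself fails.
late = {"dir": False}

print(classify_attempt(early))
print(classify_attempt(late))
```

Separating the two states is useful because the first one is invisible to a simple "is the endpoint up" check.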
The Mumbai server showed the first failure on the "finalize" command - submission of the CSR.
06:53:58 - new_order fails with an error "Error retrieving account", HTTP code 500
06:59:27 - attempt for new_account fails with "Unknown desc = failed to select one blockedKeys"
Errors now become identical to those recorded by the Frankfurt server.
07:07:26 - the dir command fails for the first time.
The first correct response was then received at 07:16:54.
The Singapore server recorded the first failure at 06:53:54 when the new_order was refused.
The events were similar to those on the Mumbai server, with the first dir failure at 07:07:09 and the recovery at 07:16:53.
The server in N. Virginia, US detected the first failure at 06:53:33 and the recovery is shown at 07:16:53.
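The timestamps above are enough to work out the detection lead and the observed outage length. Here is a short sketch of that arithmetic. The times are copied from the logs quoted in this post; treating the official start as exactly 07:04:00 is an assumption, since the status page entry gives no seconds.

```python
from datetime import datetime

FMT = "%H:%M:%S"

# First failure seen by each monitoring server (from the logs above).
first_failure = {
    "Frankfurt": "06:53:44",   # earliest Frankfurt entry
    "Mumbai": "06:53:58",
    "Singapore": "06:53:54",
    "N. Virginia": "06:53:33",
}
# First correct response after the outage, per server.
recovery = {
    "Frankfurt": "07:16:53",
    "Mumbai": "07:16:54",
    "Singapore": "07:16:53",
    "N. Virginia": "07:16:53",
}
OFFICIAL_START = "07:04:00"  # assumption: seconds taken as zero


def seconds_between(start: str, end: str) -> int:
    """Whole seconds from `start` to `end`, both on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


earliest = min(first_failure.values())  # zero-padded times sort correctly
lead_s = seconds_between(earliest, OFFICIAL_START)
outage_s = seconds_between(first_failure["N. Virginia"], recovery["N. Virginia"])

print(f"detection lead: {lead_s} s")                     # 627 s
print(f"outage seen from N. Virginia: {outage_s} s")     # 1400 s
```

With seconds included, the lead comes out at about ten and a half minutes, which matches the 11 minutes quoted earlier when you count from 06:53 to 07:04 in whole minutes.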
The monitoring system consists of completely autonomous servers that collect the performance data. The data is transformed into hourly stats and compiled into reports once a week.
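As an illustration of the "hourly stats" step, the sketch below groups raw measurements into hour buckets and summarises each one. The input shape (timestamp, latency, success flag) and the chosen statistics are assumptions of mine; the actual report format may well differ.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import median


def hourly_stats(records):
    """Summarise measurements per UTC hour.

    `records` is an iterable of (unix_ts, latency_s, ok) tuples, which is
    an assumed shape for the raw monitoring data.
    """
    buckets = defaultdict(list)
    for ts, latency, ok in records:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        buckets[hour].append((latency, ok))

    stats = {}
    for hour, items in sorted(buckets.items()):
        failures = sum(1 for _, ok in items if not ok)
        stats[hour] = {
            "requests": len(items),
            "failures": failures,
            "availability": 1 - failures / len(items),
            "median_latency_s": median(lat for lat, _ in items),
        }
    return stats


# Three made-up measurements: two in the first hour, one in the second.
sample = [(0, 1.0, True), (1800, 3.0, False), (3600, 2.0, True)]
for hour, row in hourly_stats(sample).items():
    print(hour.isoformat(), row)
```

The median is used here rather than the mean so that a single very slow response during an incident does not dominate the hourly figure.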
Online reports query the monitoring servers directly to obtain fresh data that is then presented to the visitors of https://keychest.net/letsencrypt.
We believe that monitoring only makes sense if it provides sufficient granularity and reliable results. We decided to use four or more monitoring servers running in different locations, provided by Amazon data centers. The setup consists of:
Production Let’s Encrypt monitoring:
- Mumbai, India
- Frankfurt, Germany
- N. Virginia, US
As we have to comply with rate limits, we monitor the production Let’s Encrypt API (application programming interface) with 400 domain names. This allows us to run monitoring at two transactions per minute.
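To get a feel for these numbers, here is the back-of-the-envelope arithmetic as a small sketch. It assumes that the two transactions per minute are a total across all production servers and that the 400 names are rotated evenly. Neither assumption is confirmed above, so treat the result as a rough estimate only.

```python
MINUTES_PER_WEEK = 60 * 24 * 7  # 10080


def weekly_load(transactions_per_minute: float, domain_names: int):
    """Return (transactions per week, transactions per name per week).

    Assumes every transaction uses exactly one name and that names are
    rotated evenly, which is a simplification.
    """
    per_week = transactions_per_minute * MINUTES_PER_WEEK
    return per_week, per_week / domain_names


total, per_name = weekly_load(2, 400)
print(total)     # 20160 transactions per week
print(per_name)  # roughly 50 per name per week
```

Under those assumptions each name would be used about 50 times a week, which shows why a large pool of names is needed to sustain even a modest request rate.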
Staging Let’s Encrypt monitoring:
- Ohio, US
- N. California, US
- Tokyo, Japan
- London, UK
- Sydney, Australia
The monitoring here is more frequent - at four transactions per minute.
Have a suggestion, interested in more granular information, want to chat? Let me know at Dan.firstname.lastname@example.org.