By Yair Knijn · April 22, 2025
The reliability manager who thought a cert outage was a 30-minute fix
Ask a reliability manager what a cert outage costs and you often get a single number: thirty minutes. Swap the cert, bounce the service, write the postmortem over coffee. That number is why certificate expiry never gets monitoring, never gets a named owner, and never lands in a capacity plan. It sits in the same mental bucket as a stuck cron job.
The estimate is wrong because it prices the last step and ignores the three before it. An expiry incident is discovery, then root-cause, then re-issuance, then deployment, usually spread across three teams who do not normally talk under pressure.
Why cert outages take hours, not minutes
Expiry has a nasty property: nothing fires until the exact second of impact. No warning curve, no slow degradation, no error rate creeping up a dashboard. At notAfter the handshake simply starts failing, and your first signal is a customer or a downstream service complaining. Your discovery clock has already started late.
Root-cause eats the next chunk. The symptom reads as "the API is down," not "a certificate expired." On-call burns twenty minutes ruling out a deploy, a database, a network change, before someone finally runs openssl s_client and reads the date. Re-issuance carries its own latency: who owns the CA account, getting the CSR right, surviving rate limits, passing the challenge. Deployment is the only quick part, and only when you know which hosts and load balancers serve that cert. String it together and the total runs to hours, not minutes.
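The date check that ends the root-cause hunt can also be run before anything breaks. Below is a minimal sketch using only the Python standard library; the function names are illustrative, not taken from any particular tool, and it assumes the endpoint serves a certificate that the local trust store can verify.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days from `now` to a notAfter string such as 'Jan  5 09:34:43 2018 GMT'.

    `now` should be timezone-aware; a naive datetime is read as local time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served at host:port.

    With default verification, an already-expired certificate raises
    ssl.SSLCertVerificationError during the handshake, which is itself the answer.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_left(host: str, port: int = 443) -> float:
    """Convenience wrapper: days of validity left on a live endpoint."""
    return days_until_expiry(fetch_not_after(host, port), datetime.now(timezone.utc))
```

Calling days_left("api.example.com") (a placeholder hostname) on a schedule turns the twenty-minute hunt into a number that can be alerted on days ahead.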
When the failure lands on payment rails
On 21 July 2024 the Bank of England's CHAPS high-value payment system went down, and the reporting attributed it to an expired TLS certificate. CHAPS settles hundreds of billions of pounds a day, so this was not a marketing site falling over. Three days earlier the same RTGS infrastructure had taken a separate hit that ran past four hours. Put two of those in one week and "thirty minutes" stops being a credible planning assumption.
The cadence is tightening, too. In January 2025 Let's Encrypt announced six-day short-lived certificates and certs for raw IP addresses. For anyone who adopts the six-day option, a renewal that used to come around every couple of months has to land every few days. Any process leaning on a human remembering, or a calendar reminder, breaks at that frequency. The only thing that survives a six-day lifetime is automation you actually trust.
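A short lifetime shrinks the room to recover from a failed renewal as well as the interval between renewals. The sketch below assumes a hypothetical policy of renewing once two-thirds of the lifetime has elapsed; the fraction and the function name are illustrative assumptions, not a setting quoted from any ACME client.

```python
from datetime import timedelta


def renewal_schedule(lifetime: timedelta, renew_fraction: float = 2 / 3):
    """Return (certificate age at renewal, slack left if that attempt fails)."""
    if not 0 < renew_fraction < 1:
        raise ValueError("renew_fraction must be strictly between 0 and 1")
    renew_at = lifetime * renew_fraction
    return renew_at, lifetime - renew_at


# Compare a 90-day certificate with a six-day one under the same policy.
for days in (90, 6):
    renew_at, slack = renewal_schedule(timedelta(days=days))
    print(f"{days}-day cert: renew at {renew_at}, slack after a failure {slack}")
```

Under that assumption a failed renewal on a 90-day certificate leaves about a month to notice and fix it, while the same failure on a six-day certificate leaves about two days, which is why the alert has to be tied to the renewal attempt rather than to the expiry date.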
Pricing one incident honestly
Run the arithmetic your thirty-minute number skips. Take a mid-sized org that hits a handful of expiry outages over a couple of years, and put real time against each one:
- Discovery: half an hour of customer impact before anyone even names the cause.
- Root-cause and re-issuance: two to three hours of senior on-call, plus the CA-account owner dragged in.
- Deployment and verification: rolling the new cert across every host and edge, then proving the handshake.
That is a full afternoon of expensive people per incident, before you count revenue lost while the front door was shut. The honest figure is not minutes of engineering time. It is hours of cross-team time and an outage that reaches your board.
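The arithmetic can be written down so each assumption is visible and easy to replace with your own figures. In this sketch the half hour of discovery and the two to three hours of root-cause and re-issuance (taken at the midpoint) come from the list above; the one-hour deployment figure and every headcount are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Phase:
    name: str
    hours: float  # elapsed wall-clock time for the phase
    people: int   # people tied up for that time (assumed)


PHASES = [
    Phase("discovery", hours=0.5, people=1),
    # Midpoint of "two to three hours"; senior on-call plus the CA-account owner.
    Phase("root-cause and re-issuance", hours=2.5, people=2),
    # The article gives no duration here; one hour and two people are assumptions.
    Phase("deployment and verification", hours=1.0, people=2),
]


def incident_cost(phases):
    """Return (elapsed hours, person-hours) for one incident."""
    elapsed = sum(p.hours for p in phases)
    person_hours = sum(p.hours * p.people for p in phases)
    return elapsed, person_hours


elapsed, person_hours = incident_cost(PHASES)
print(f"elapsed: {elapsed} h, person-hours: {person_hours}, "
      f"versus a 0.5 h estimate: {elapsed / 0.5:.0f}x")
```

With these inputs one incident comes to four elapsed hours and seven and a half person-hours, eight times the thirty-minute estimate, before any lost revenue is counted.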
Pre-funding the boring fix
The fix is unglamorous and cheap next to one outage. Monitor expiry as a first-class signal with real lead time, not a 7-day warning that pages into a void. Give every certificate a named owner and a documented CA account, so re-issuance does not begin with "who has the login." Keep a runbook listing which hosts and load balancers serve each cert. Above all, automate renewal and make the alert fire on challenge failure, not on the cert already being dead.
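As a rough illustration of that checklist, the sketch below models one inventory record and the findings it should raise. The field names, the 30-day default lead time, and the example values are all hypothetical; this is not the data model of any specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertRecord:
    name: str
    owner: str              # named owner; empty string means nobody owns it
    ca_account: str         # documented CA account used for re-issuance
    endpoints: list[str]    # hosts and load balancers serving this cert
    not_after: datetime
    last_renewal_ok: bool   # did the most recent renewal attempt succeed?


def findings(cert: CertRecord, now: datetime,
             lead_time: timedelta = timedelta(days=30)) -> list[str]:
    """List everything about this record that should raise an alert."""
    problems = []
    if not cert.owner:
        problems.append("no named owner")
    if not cert.ca_account:
        problems.append("no documented CA account")
    if not cert.endpoints:
        problems.append("no hosts or load balancers listed")
    if not cert.last_renewal_ok:
        # Fires on the failed attempt, while the cert is still valid.
        problems.append("last renewal attempt failed")
    if cert.not_after - now <= lead_time:
        problems.append("inside expiry lead time")
    return problems


now = datetime(2025, 4, 22, tzinfo=timezone.utc)
example = CertRecord("api.example.com", owner="", ca_account="ca-prod",
                     endpoints=["lb-1"], not_after=now + timedelta(days=45),
                     last_renewal_ok=False)
print(findings(example, now))
```

The example record is still 45 days from expiry, so an expiry-only monitor would stay silent, yet it is flagged twice: nobody owns it and its last renewal failed. The lead time is a parameter because a sensible value depends on the certificate lifetime.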
Automate Certificates treats each tenant as an Environment with a live inventory: every cert, its owner, its expiry, and whether the last renewal actually succeeded. Renewal runs without a human in the loop, and the alert fires when a renewal fails, days before notAfter, while you still have time to fix it calmly. If you have been budgeting expiry as a thirty-minute swap, see what the inventory and automated renewal cover, and price the boring fix against one real outage.