By Yair Knijn · December 10, 2025
The monitoring that watches your certs expired its own cert. Now you're blind.
You run reliability. You built the cert dashboard, wired the 30-day and 7-day expiry alerts, and you sleep because the system tells you before anything breaks. The bad assumption is buried in that last word. Your inventory covers the application certs, not the certs underneath the thing doing the watching. The ones that take you down sit on the boxes between your check and your phone.
The certs hiding in your alert pipeline
Walk the path an alert actually travels. Prometheus scrapes exporters over TLS, Alertmanager posts to a webhook, the webhook hits your pager over HTTPS, and Grafana sits behind an ingress with its own cert. Every hop is a certificate, most provisioned once by hand, by someone who has since changed teams, and none of them live in the spreadsheet your dashboard reads from.
The asymmetry is the trap. A cert expiring on a customer-facing endpoint is loud. A cert expiring on alertmanager.internal goes silent, because the only thing positioned to notice just lost its ability to talk. You get nothing, and nothing feels like healthy.
When the watcher needs watching
Here is where oversight turns into cascade. Picture a TCP load balancer fanning connections round-robin across a few proxy instances that terminate TLS. One holds an expired cert. Connections routed there fail TLS negotiation and never reach the access logs, so the failure is invisible from outside and obvious only in the keystore. Some users break, the dashboards stay green, and nobody pages, because it looks like dropped clients, not a cert event.
Now add the circular case. Your monitoring scrapes through that same load balancer, or the lapsed cert is the one Alertmanager presents. The detector cannot reach what it is meant to detect, and the warning would have to travel the exact path the lapse just severed. The cert protecting your alert channel becomes the one point whose failure also deletes the warning about it. Plenty of well-run companies have eaten public outages from nothing more than plain expiry. The version that scares me more is where the expiry takes out the eyes first.
Out-of-band alerting that survives a TLS failure
The answer is not more checks inside the same blast radius. You need a second channel that does not share fate with the thing it watches. A probe running on the same cluster, behind the same ingress, posting through the same webhook as everything else is the same outage wearing a different hostname.
- Run the expiry check from outside your primary infrastructure, against a different DNS path, so a load-balancer cert failure cannot blind the checker.
- Give the alert path a fallback that does not depend on the cert under test: a separate trust chain, or an escalation that degrades to SMS when the webhook fails.
- Page on probe failure and stale heartbeats, not only on a future expiry date. "Nothing heard in 15 minutes" has to wake someone, because silence is how a self-blinding system fails.
Map and test the dependency graph
Draw your renewal system as a graph and ask one question of every node: if this cert expires, can the alert about it still reach a human? Include the ACME client's outbound TLS to the CA, the cert on the vault holding your account key, and the ingress in front of the dashboard you would open to investigate. Then test it. Set a cert to expire in staging, let it lapse, and confirm a page lands within minutes through a path that never touches the expired cert. A renewal system you have never watched fail under controlled conditions is one whose failure mode you meet live.
Automate Certificates inventories every cert you run, including the ones on your monitoring, ingress, and alert hops that hand-built spreadsheets miss, and renews them before they can knock out the channel that would have warned you. See how the dependency-aware inventory maps it.