DNS-01 was your scalable choice. Then your DNS provider had a bad day during a renewal.

You picked DNS-01 for the right reasons. It does wildcards, it works for hosts that never touch the public internet, and you don't have to punch port 80 through a load balancer to satisfy a CA. The platform team wired the provider plugin into the ACME client, watched a few issuances go green, and moved on.

What bites later is the assumption that a renewal is the same operation as that first issuance. It is, except now it runs unattended at 3am, against a DNS API that owes you nothing. One successful issuance tells you little about whether renewal will succeed on every cycle after it.

Why DNS-01 ties renewal to your DNS provider's API

HTTP-01 keeps validation on infrastructure you already run: your web server answers a challenge over a connection you control. DNS-01 moves that onto your DNS provider's control plane. Every renewal makes your automation authenticate to that provider, write a _acme-challenge TXT record, wait for it to become visible to the CA's resolvers, then ask the CA to validate.
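To make the record in question concrete, here is a minimal Python sketch of what gets written, following the DNS-01 definition in RFC 8555: the TXT value is the unpadded base64url SHA-256 digest of the key authorization (the challenge token, a dot, and the account key thumbprint). The function names are illustrative and do not come from any particular ACME client.

```python
import base64
import hashlib


def challenge_record_name(domain: str) -> str:
    """Return the TXT record name used to validate `domain` with DNS-01."""
    # A wildcard such as *.example.com is validated at the base domain.
    base = domain[2:] if domain.startswith("*.") else domain
    return f"_acme-challenge.{base.rstrip('.')}"


def challenge_record_value(token: str, account_thumbprint: str) -> str:
    """Return the TXT value: unpadded base64url(SHA-256(key authorization))."""
    key_authorization = f"{token}.{account_thumbprint}"
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
```

Computing the value is purely local. Everything that can fail on a bad day happens after this point, when the value has to be written through the provider's API and then seen by the CA.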

That is three remote dependencies (the provider's API, DNS propagation, and the CA's validation), each able to fail on its own, none on hardware you can SSH into. Let's Encrypt has flagged how brittle this coupling is and is drafting a persistent-record alternative aimed at the pain of racing ephemeral TXT records on every challenge.

The failure window: outages, rate limits, and propagation lag

Three distinct things go wrong, and they read differently in your logs. A provider API outage returns 5xx or simply hangs, so the client never writes the record. A rate limit returns 429 because some other automation in the account is hammering the same API, so your write is rejected even though the provider is up. Propagation lag is the quiet one: the write succeeds, your client immediately asks the CA to validate, and the CA's resolver either still holds a cached negative answer or has not seen the new record yet.
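Telling these three apart in code is mostly a matter of looking at the write's HTTP status and whether the record is actually visible. The sketch below is one way to label an attempt. The enum, the function name, and the idea of passing `None` for a timed-out request are assumptions made for illustration, not any client's API.

```python
from enum import Enum
from typing import Optional


class DnsFailure(Enum):
    PROVIDER_OUTAGE = "provider_outage"
    RATE_LIMITED = "rate_limited"
    PROPAGATION_LAG = "propagation_lag"


def classify_attempt(status: Optional[int], record_visible: bool) -> Optional[DnsFailure]:
    """Return which of the three failure modes applies, or None if none does.

    `status` is the HTTP status of the provider write, or None if the
    request hung and timed out (an assumption of this sketch).
    """
    if status is None or 500 <= status <= 599:
        return DnsFailure.PROVIDER_OUTAGE
    if status == 429:
        return DnsFailure.RATE_LIMITED
    if 200 <= status <= 299 and not record_visible:
        # The write succeeded but resolvers cannot see the record yet.
        return DnsFailure.PROPAGATION_LAG
    # Success, or a status outside the three cases described here.
    return None
```

Logging the label alongside each attempt makes it clear afterwards whether a failed renewal was the provider's outage, your own account's traffic, or a validation request sent too early.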

That last case is the dangerous one because nothing errored. The record exists, the API returned 200, and the order still fails. TXT records can take minutes to appear globally behind high TTLs and slow edge propagation, and a client that finalizes the instant it writes walks straight into that gap.
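One way to close that gap is to poll for the record before asking the CA to validate. The following is a minimal sketch under stated assumptions: `lookup` is a hypothetical callable you supply (for example a thin wrapper around your DNS library) that returns the set of TXT values a given resolver currently serves for a name, and the timeout and interval defaults are placeholders rather than recommendations.

```python
import time
from typing import Callable, Iterable, Set


def wait_for_txt(
    lookup: Callable[[str, str], Set[str]],
    resolvers: Iterable[str],
    record_name: str,
    expected_value: str,
    timeout_s: float = 300.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every resolver serves `expected_value`, or the timeout passes.

    Returns True when the record is visible everywhere we checked and
    False on timeout. The caller should only ask the CA to validate on True.
    """
    resolver_list = list(resolvers)
    if not resolver_list:
        raise ValueError("at least one resolver is required")
    deadline = clock() + timeout_s
    while True:
        if all(expected_value in lookup(r, record_name) for r in resolver_list):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Pointing `resolvers` at the zone's authoritative nameservers is a common choice, since they do not serve a cached negative answer. It still cannot guarantee what the CA's own resolvers will see, so treat a True result as a strong hint rather than proof.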

Renewal headroom: why you renew well before not_after

Renew at the last possible moment and one bad DNS day becomes an outage. Renew early and that same day becomes a retry. Start when roughly one-third of the certificate's lifetime remains, not when you're days from not_after. For a 90-day cert that puts you near day 60, with weeks of runway to absorb a provider incident, a rate-limit storm, or a propagation stall before anyone is paged.
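The arithmetic is simple enough to write down. This sketch computes the moment to start renewing from the certificate's validity window. The function name and the `remaining_fraction` parameter are illustrative, with one-third as the default to match the rule of thumb above.

```python
from datetime import datetime, timedelta, timezone


def renew_at(
    not_before: datetime,
    not_after: datetime,
    remaining_fraction: float = 1 / 3,
) -> datetime:
    """Return when to start renewing: the point at which
    `remaining_fraction` of the certificate's lifetime is still left."""
    if not 0 < remaining_fraction < 1:
        raise ValueError("remaining_fraction must be between 0 and 1")
    lifetime = not_after - not_before
    return not_after - lifetime * remaining_fraction


# Example: a 90-day certificate issued at the start of the year.
issued = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
start = renew_at(issued, expires)  # about 60 days after `issued`
```

Because the result is a fraction of the lifetime rather than a fixed number of days, the same rule keeps working as validity windows get shorter, though the absolute margin it buys shrinks with them.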

Short-lived certificates make this non-negotiable. As validity windows compress, the absolute margin shrinks with them. "It renewed fine last time" is not headroom.

Retry, backoff, and fallback so a bad DNS day doesn't expire a cert

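The fix is not a longer cron interval. It is making each attempt resilient on its own: confirm the TXT record is visible before asking the CA to validate, retry transient failures such as 5xx responses, timeouts, and 429s with backoff, and start early enough that a retry still lands well before not_after.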

When one provider serves hundreds of names, serialize or stagger renewals so you don't rate-limit yourself, and keep a fallback challenge type for the SANs that can use it.
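A minimal sketch of the retry and staggering ideas in this section, assuming your own code raises a hypothetical `TransientDnsError` for 5xx responses, timeouts, and 429s. The delays are capped exponential backoff with full jitter, and the default values are placeholders to tune, not recommendations.

```python
import random
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class TransientDnsError(Exception):
    """Hypothetical: raised by the caller's code for 5xx, timeouts, and 429s."""


def backoff_delays(attempts: int, base_s: float = 30.0, cap_s: float = 1800.0) -> List[float]:
    """Upper bound for the wait before each retry: base * 2**n, capped."""
    return [min(cap_s, base_s * 2 ** n) for n in range(attempts - 1)]


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_s: float = 30.0,
    cap_s: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> T:
    """Run `operation`, retrying transient failures with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delays = backoff_delays(attempts, base_s, cap_s)
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDnsError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the failure to the caller.
            # Full jitter: wait a random share of the current upper bound.
            sleep(rng() * delays[attempt])
    raise AssertionError("unreachable")


def stagger_offsets(names: List[str], window_s: float) -> Dict[str, float]:
    """Spread renewals evenly across a window so they don't hit the API at once."""
    step = window_s / max(len(names), 1)
    return {name: i * step for i, name in enumerate(names)}
```

The jitter matters when many names fail together: without it, every retry lands on the provider's API at the same moment and recreates the rate limit you were backing off from. A fallback challenge type would sit one level above this, tried only after the retries for DNS-01 are exhausted.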

This is the unglamorous resilience work that decides whether DNS-01 scales or quietly dies on a holiday weekend. Automate Certificates runs propagation-aware DNS-01 with retry, backoff, and early renewal headroom per Environment, so a provider's bad day becomes a retried attempt instead of an expired cert. The feature overview covers the renewal policy and challenge-type controls.