Agency Website Downtime Response Playbook: What to Do in the First 30 Minutes
The difference between a client who stays after an outage and a client who leaves often has nothing to do with how long the site was down. It has everything to do with whether the agency looked like they knew what they were doing during the downtime.
A client who hears nothing for two hours and then gets a vague apology has no way to distinguish a competent agency from an incompetent one. A client who gets a message within 10 minutes saying "we are aware of the issue, we have identified it as an SSL renewal failure, we expect resolution within 45 minutes, we will update you when it is resolved" — that client is not going anywhere.
Here is the playbook.
Minute 0–5: Detect and Triage
Confirm it is real. Your monitoring system should have already fired an alert. Before doing anything else, confirm that the detection is accurate — run a quick check from a different device or network to confirm the site is genuinely down and not a monitoring false positive.
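The independent check described above can be scripted so it runs the same way every time. The following is a minimal sketch using only the Python standard library; the function names, the User-Agent string, and the three-way up/degraded/down classification are illustrative assumptions, not part of any monitoring product.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def probe(url: str, timeout: float = 10.0) -> Tuple[Optional[int], Optional[str]]:
    """Fetch the URL once and return (http_status, error_text)."""
    # The User-Agent value is an arbitrary placeholder.
    request = urllib.request.Request(url, headers={"User-Agent": "downtime-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        return exc.code, None
    except OSError as exc:
        # Covers URLError, timeouts, refused connections, and TLS failures.
        return None, str(exc)


def classify(status: Optional[int], error: Optional[str]) -> str:
    """Map one probe result to 'up', 'degraded', or 'down' (hypothetical labels)."""
    if error is not None or status is None:
        return "down"
    if status >= 500:
        return "down"
    if status >= 400:
        return "degraded"
    return "up"
```

Calling `classify(*probe("https://client-site.example"))` from a second network (a phone hotspot, for instance) gives a quick second opinion on the alert. A single probe is still only one data point, so treat a "down" result as confirmation of the alert rather than as a diagnosis.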
Identify the failure type. The most common causes for agency-managed client sites:
- SSL certificate expired
- DNS record changed (pointing to the wrong server or nothing)
- Hosting outage (server or CDN layer)
- Upstream vendor outage (Cloudflare, Stripe, Shopify) causing partial or full failure
- Deployment error (last code push broke something)
- Domain registration lapse
The failure type determines everything about how you respond and how quickly you can resolve it.
Check vendor status pages. If the outage started within the last 30 minutes and you cannot immediately identify a client-side cause, check the major vendor status pages before assuming the fault is yours. A Cloudflare or AWS regional outage usually shows up there within 5-15 minutes of impact, though status pages can lag the real incident.
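Some vendors host their status page on Atlassian Statuspage, which exposes a small JSON summary. As a sketch, assuming the vendor publishes that format at the URL shown (the URL is an assumption to verify, and vendors such as AWS use a different status system), the check could look like this:

```python
import json
import urllib.request

# Assumption: these vendors expose a Statuspage-style summary at this path.
STATUS_URLS = {
    "cloudflare": "https://www.cloudflarestatus.com/api/v2/status.json",
}


def fetch_status(url: str, timeout: float = 10.0) -> dict:
    """Download and decode the vendor's status summary."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def vendor_is_degraded(payload: dict) -> bool:
    """Return True unless the payload explicitly reports no incident.

    Statuspage summaries carry status.indicator with one of
    'none', 'minor', 'major', or 'critical'. A missing indicator is
    treated as degraded so that a malformed response prompts a manual look.
    """
    indicator = payload.get("status", {}).get("indicator", "unknown")
    return indicator != "none"
```

For example, `vendor_is_degraded(fetch_status(STATUS_URLS["cloudflare"]))` returns `False` when the vendor reports all systems operational. Because the status page itself may lag, a clean result does not rule out an upstream problem in the first few minutes.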
Minute 5–10: Notify the Client
Do not wait until you have a resolution. Notify within 10 minutes of detection, even if you only know that you are investigating.
The first notification should contain:
- Confirmation that you are aware of the issue
- What you know so far (or "we are investigating" if you do not yet know)
- When you will next update them
Template:
Hi [Client], we are aware that [site] is experiencing [brief description of symptom]. We are investigating and will update you within [30 minutes / 1 hour]. We will message you as soon as we have a confirmed resolution time.
Do not say "I think" or "it might be." Say what you know and what you are doing. Admitting that you do not yet know the cause is fine; being vague about what you are doing is not. Once you have a diagnosis, be specific: "We have identified an SSL renewal failure and are fixing it now" is better than "we're looking into it."
Where to send the notification: Use whatever channel the client actually reads. If they are in Slack with you, send it there. If they email, send an email. Do not send a notification to a channel the client does not monitor regularly.
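The first-notification template above can be filled in programmatically so nobody has to retype it under pressure. This is a minimal sketch using the standard library's `string.Template`; the function name, parameter names, and the default update window are illustrative assumptions.

```python
from string import Template

# Wording taken from the first-notification template in this playbook.
FIRST_NOTICE = Template(
    "Hi $client, we are aware that $site is experiencing $symptom. "
    "We are investigating and will update you within $next_update. "
    "We will message you as soon as we have a confirmed resolution time."
)


def first_notice(client: str, site: str, symptom: str, next_update: str = "30 minutes") -> str:
    """Render the first client notification; raises KeyError if a field is missing."""
    return FIRST_NOTICE.substitute(
        client=client, site=site, symptom=symptom, next_update=next_update
    )
```

Calling `first_notice("Acme", "acme.example", "certificate errors in the browser")` produces a message ready to paste into whichever channel the client reads. Sending it is deliberately left out of the sketch, since the channel differs per client.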
Minute 10–25: Diagnose and Fix
Work the problem in order of likelihood.
SSL expiry — Check the certificate expiry date and the renewal configuration. For Let's Encrypt: check whether the auto-renewal job ran. For manually-issued certs: check whether the renewal was in your calendar. Resolution: renew the certificate. Most CDNs propagate a new certificate within 5-15 minutes.
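Checking the expiry date does not require a browser. The sketch below reads the certificate's `notAfter` field with Python's standard `ssl` module and converts it to days remaining; the function names are illustrative, and the approach assumes the site is reachable on port 443.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    With the default context, an already-expired certificate fails the
    handshake and raises ssl.SSLCertVerificationError. That exception is
    itself strong evidence that expiry is the cause.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between now and the certificate's expiry; negative if expired."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400
```

`days_until_expiry(fetch_not_after("client-site.example"))` gives the remaining lifetime for a certificate that is still valid. Running the same check on a schedule, with an alert threshold of a few weeks, is one way to catch a failed auto-renewal before it becomes an outage.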
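DNS error — Check whether the DNS record resolves to the expected IP. Check your DNS provider's change log: when did the record last change, and who changed it? If the change was inadvertent, revert it. Note that the fix is not instant: resolvers that cached the bad record keep serving it until its TTL expires, which can take an hour or more depending on the TTL that was in place. Lowering the TTL at fix time does not shorten that wait, but a low TTL on the corrected record makes any follow-up change take effect faster.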
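The "does the record resolve to the expected IP" check can be sketched with the standard `socket` module. The function names and the injectable `resolver` parameter are illustrative assumptions; the lookup goes through the local machine's resolver, so the answer reflects what this machine currently sees, caches included.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Return the set of IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def dns_matches(
    hostname: str,
    expected: Iterable[str],
    resolver: Callable[[str], Set[str]] = resolve_ipv4,
) -> bool:
    """True only if the name resolves and every resolved address is expected."""
    try:
        actual = resolver(hostname)
    except socket.gaierror:
        # The name does not resolve at all (for example, a deleted record).
        return False
    return bool(actual) and actual <= set(expected)
```

For example, `dns_matches("client-site.example", {"203.0.113.10"})` returns `False` if the record points anywhere else or fails to resolve. Keeping the expected addresses per client in a small config file makes this a one-line check during an incident.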
Hosting outage — If the hosting provider has a confirmed outage, there is nothing to fix on your end. Your job is to communicate status to the client until the provider resolves it. Monitor the provider's status page and relay updates.
Deployment error — If the outage started immediately after a deployment, revert to the previous known-good version first, then diagnose the deployment. Get the site back up first; understand the cause second.
Domain lapse — Check the domain's registration expiry date at your registrar. If the domain has lapsed, the resolution path depends on how recently it expired and which registrar holds it. Speed matters here: the grace period before a lapsed domain is released for re-registration varies by registrar and TLD (often somewhere in the range of 5-40 days), and recovery can get more expensive the longer you wait.
Minute 25–30: Client Update
Even if the issue is not resolved, update the client at the 30-minute mark.
If resolved:
Hi [Client], the issue has been resolved. [Site] is back online as of [time]. The cause was [brief description]. We will send you a full incident summary within 24 hours. Let us know if you notice anything else unusual.
If still in progress:
Update on [site]: we have identified the cause as [description] and are working on resolution. Current estimated resolution time: [time]. We will update you when it is resolved or in [30 minutes], whichever is sooner.
Keep the client on a predictable update cadence — every 30 minutes if unresolved. Even "no change, still working on it" is better than silence.
After Resolution: The Post-Incident Memo
Within 24 hours of resolution, send a one-page incident summary. This does not need to be long, but it needs to cover:
- What happened (the technical cause)
- When it was detected and by what
- How long it lasted
- What was done to resolve it
- What has changed to prevent recurrence
The post-incident memo converts a downtime event into evidence that you are running a professional operation. Clients who receive one tend to be less likely to raise the incident in a renewal conversation.
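The five items the memo must cover map naturally onto a small record type. As a sketch, with hypothetical field names and a plain-text layout chosen only for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Incident:
    site: str
    cause: str            # what happened (the technical cause)
    detected_at: datetime
    detected_by: str      # monitoring alert, client report, etc.
    resolved_at: datetime
    resolution: str       # what was done to resolve it
    prevention: str       # what has changed to prevent recurrence

    def duration_minutes(self) -> int:
        """Minutes from detection to resolution (not from the true start of impact)."""
        return round((self.resolved_at - self.detected_at).total_seconds() / 60)


def render_memo(incident: Incident) -> str:
    """Render the incident as a short plain-text summary."""
    lines = [
        f"Incident summary: {incident.site}",
        f"What happened: {incident.cause}",
        f"Detected: {incident.detected_at:%Y-%m-%d %H:%M} by {incident.detected_by}",
        f"Duration: {incident.duration_minutes()} minutes",
        f"Resolution: {incident.resolution}",
        f"Prevention: {incident.prevention}",
    ]
    return "\n".join(lines)
```

Filling in the record while the incident is still in progress means the memo is mostly written by the time the site is back up. Note that the duration here is measured from detection; if logs show the outage began earlier, report that start time instead.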
Building the Playbook into Your Workflow
The playbook only works if it is accessible when needed — not buried in a document that nobody has looked at in six months. Practical ways to keep it live:
- Pin the notification templates in your team's Slack workspace
- Keep the vendor status page URLs bookmarked on the device you use for incident response
- Run a quarterly drill with your team: someone fakes a client site being down, and the team practices the first 30 minutes of the response
The agencies that handle downtime well are not the ones with the fewest incidents. They are the ones that respond consistently, communicate clearly, and document everything.
Merlonix detects SSL failures, DNS drift, and upstream vendor outages before your clients do, and fires the alert that starts your 30-minute clock.