Agency Monitoring: The Complete Guide to Monitoring Client Websites at Scale

Monitoring client websites is not the same problem as monitoring your own infrastructure. When you monitor your own systems, you know the architecture, you control the deployment decisions, and you are the stakeholder who cares about the outcome. When you monitor on behalf of clients, you are operating across a portfolio of sites you did not build, with SLA obligations that vary by client, in a context where your clients expect you to know about problems before they do.

This guide covers the full scope of agency monitoring — what to monitor, how to structure alerting across a portfolio, how to handle vendor outages, how to communicate with clients, and what to look for in monitoring tools built for multi-client use.


Why Agency Monitoring Is Different

The most common mistake agencies make when setting up client monitoring is applying the same approach they would use for a single site. That approach breaks under the weight of a portfolio.

You are not the only stakeholder. For your own infrastructure, you decide the SLA, the alert thresholds, and the response protocol. For client sites, the SLA is negotiated, the acceptable downtime is defined by contract, and the response protocol has to balance the client's expectations with your team's capacity.

Alert volume scales with portfolio size. A monitoring tool configured to alert on every anomaly is manageable for one site. For 25 sites, the same configuration generates alert noise that desensitises the team to real incidents. Agency monitoring requires deliberate alerting architecture, not default tool configurations.

Per-client isolation is an operational requirement. Client A's uptime history should not be visible to Client B. Status page links you share with clients should show only that client's data. When a compliance question arises, the monitoring data needs to be queryable per client — not extracted from a flat list that mixes all client data.

Vendor incidents propagate across the portfolio. When Cloudflare, AWS, Stripe, or another upstream provider has an incident, it can simultaneously affect multiple clients. Agencies need a way to identify vendor-sourced incidents, separate them from client-specific issues, and communicate them to affected clients without manually checking each account.


What to Monitor for Agency Clients

Agency monitoring should cover four signal categories: SSL certificates, DNS records, uptime, and upstream vendor status.

SSL Certificate Monitoring

SSL certificate expiry is one of the most common preventable causes of client website outages. Every modern browser replaces a page served with an expired certificate with a full-page security warning, and where HSTS is enabled visitors cannot click through it, so the site is effectively offline.

What to monitor:

  • Certificate expiry date: Alert 30 days before expiry for proactive renewal, and alert immediately if a certificate expires.
  • Certificate validity: Confirm the certificate is issued for the correct domain and is not mismatched or self-signed.
  • Certificate chain integrity: Verify the full chain to the root CA is present and valid. A broken chain produces browser warnings even if the certificate itself is valid.
  • HSTS headers: Where clients have enabled HSTS, monitor that the Strict-Transport-Security header is still served with the intended configuration. HPKP is deprecated and no longer supported by major browsers; if a client still sends a Public-Key-Pins header, flag it for removal rather than monitoring it.

A common failure mode is monitoring the primary domain but not the subdomains. If a client runs their marketing site at www.example.com and their client portal at portal.example.com, both hostnames require monitoring. Unless a single wildcard or multi-domain certificate covers them, each has its own expiry date and renewal process.
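
The expiry check above can be sketched with Python's standard library alone. This is a minimal illustration, not a production monitor: the function names, the 30-day default, and the three status labels are assumptions made for the example.

```python
import socket
import ssl
from datetime import datetime, timedelta, timezone


def fetch_cert_expiry(hostname: str, port: int = 443, timeout: float = 10.0) -> datetime:
    """Return the notAfter timestamp of the certificate served for hostname.

    The default context also validates the chain and the hostname, so a
    mismatched, self-signed, or already-expired certificate raises
    ssl.SSLCertVerificationError instead of returning a date. Callers
    should treat that exception as a failed check.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expiry_epoch = ssl.cert_time_to_seconds(cert["notAfter"])
    return datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)


def classify_expiry(expires_at: datetime, now: datetime, warn_days: int = 30) -> str:
    """Map an expiry date to 'expired', 'renew_soon', or 'ok' (labels are illustrative)."""
    if expires_at <= now:
        return "expired"
    if expires_at - now <= timedelta(days=warn_days):
        return "renew_soon"
    return "ok"
```

Running fetch_cert_expiry once per hostname, including each subdomain such as www.example.com and portal.example.com, and passing the result to classify_expiry gives one status per certificate rather than one per client.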

For a detailed implementation guide, see SSL Certificate Monitoring for Agencies: How to Stop Client Outages Before They Happen.

DNS Record Monitoring

DNS changes are the second most common cause of unexpected client site outages. They are also among the hardest to diagnose: resolvers keep serving cached records until the TTL expires, so a change reaches users gradually and the symptoms vary depending on where the affected user is located.

What to monitor:

  • A and AAAA records: Alert when the IP addresses associated with a domain change unexpectedly. Expected changes — CDN rotations, planned migrations — should be logged in advance and excluded from alerting.
  • MX records: Changes to mail exchange records affect email delivery. A client whose email stops working is a severe incident regardless of whether the website itself is functional.
  • CNAME records: Especially relevant for clients using CDN services or third-party hosting platforms where CNAME records point to provider infrastructure.
  • NS records: Changes to nameserver records indicate the DNS authority for a domain has changed — typically intentional (registrar migration) but occasionally the result of unauthorised access.
  • TTL values: Unusually short TTL values can indicate an imminent planned change or a misconfiguration.
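
One way to detect the record changes listed above is to compare each lookup against a stored baseline. The sketch below assumes the records have already been resolved by some other component; the RecordSet shape, the approved parameter, and the message format are hypothetical choices for illustration.

```python
from typing import AbstractSet, Dict, FrozenSet, List, Tuple

# (record name, record type) -> set of record values
RecordSet = Dict[Tuple[str, str], FrozenSet[str]]


def diff_records(
    baseline: RecordSet,
    observed: RecordSet,
    approved: AbstractSet[Tuple[str, str]] = frozenset(),
) -> List[str]:
    """Describe every record whose values differ from the baseline.

    Keys listed in `approved` represent changes logged in advance
    (planned migrations, CDN rotations) and are skipped.
    """
    changes = []
    for key in sorted(baseline.keys() | observed.keys()):
        if key in approved:
            continue
        expected = baseline.get(key, frozenset())
        actual = observed.get(key, frozenset())
        if expected == actual:
            continue
        name, rtype = key
        added = sorted(actual - expected)
        removed = sorted(expected - actual)
        changes.append(f"{rtype} {name}: added={added} removed={removed}")
    return changes
```

Comparing sets rather than ordered lists avoids false alerts when a resolver returns the same addresses in a different order.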

For implementation guidance, see DNS Monitoring for Marketing Agencies: Catching Changes Before They Break Client Sites.

Uptime Monitoring

Uptime monitoring confirms that a site is responding to requests. Done well, it provides the SLA compliance data you need to deliver against client contracts. Done poorly, it generates false positives that erode team trust in the alerting system.

What to monitor:

  • HTTP status codes: A 5xx response from two or more independent check locations in a 2-minute window should trigger an immediate alert. A single 5xx may be a transient network issue.
  • Response time: Track baseline response times per client and alert on significant deviations. A site that normally responds in 200ms but starts consistently responding in 4 seconds has a performance issue even if it is technically "up."
  • Content validation: For critical pages, verify that the page content matches an expected pattern — not just that the server returned 200. A broken deployment can return 200 with an error page.
  • Multi-location checks: Use check nodes in at least two geographic locations to distinguish client-specific incidents from check-node network issues.
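
The multi-location rule above (a 5xx from two or more independent locations within a 2-minute window) can be expressed as a small predicate. This is a sketch under assumptions: CheckResult and its fields are hypothetical names, and the check results are presumed to come from whatever probe system is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    location: str        # identifier of the check node, e.g. a region name
    status_code: int     # HTTP status returned to that node
    checked_at: datetime


def is_confirmed_outage(
    results: Iterable[CheckResult],
    now: datetime,
    window: timedelta = timedelta(minutes=2),
    min_locations: int = 2,
) -> bool:
    """True when enough distinct locations saw a 5xx inside the window."""
    failing_locations = {
        r.location
        for r in results
        if 500 <= r.status_code <= 599
        and timedelta(0) <= now - r.checked_at <= window
    }
    return len(failing_locations) >= min_locations
```

Counting distinct locations, not failed checks, is what keeps a single flaky check node from paging anyone.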

Uptime SLAs require documented uptime monitoring. For guidance on setting defensible SLAs, see Uptime SLAs for Agency Clients: What to Promise and What Monitoring You Need.

Upstream Vendor Monitoring

Agencies are increasingly exposed to upstream vendor incidents that affect client sites even when the agency's own infrastructure is functioning correctly. A Stripe outage breaks checkout on e-commerce clients. A Cloudflare incident degrades performance for clients using Cloudflare CDN. A Shopify platform issue affects the entire cohort of Shopify-hosted client sites.

What to monitor:

  • Key vendor status pages: Track the status of platforms your clients rely on, such as Cloudflare, AWS, GCP, Stripe, Shopify, HubSpot, and any platform specific to your client portfolio.
  • Vendor-to-client impact mapping: Maintain a record of which vendor dependencies each client has. When a vendor incident is detected, immediately identify which clients are affected before they report the issue.
  • Communication cadence during vendor incidents: Have a defined communication protocol for vendor incidents — when you inform clients, what you say, and how you distinguish vendor issues from your own operational responsibility.
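
The vendor-to-client impact mapping can be as simple as a dictionary lookup. In this sketch the client names and the shape of the dependency record are invented for illustration; a real portfolio would load the mapping from wherever client records are kept.

```python
from typing import Dict, List, Set


def affected_clients(dependencies: Dict[str, Set[str]], vendor: str) -> List[str]:
    """Return the clients whose recorded stack includes the given vendor."""
    return sorted(
        client for client, vendors in dependencies.items() if vendor in vendors
    )


# Hypothetical portfolio: client name -> vendors that client depends on.
portfolio = {
    "acme-store": {"Shopify", "Stripe", "Cloudflare"},
    "harbour-clinic": {"AWS", "Cloudflare"},
    "northwind-blog": {"HubSpot"},
}

stripe_impact = affected_clients(portfolio, "Stripe")
cloudflare_impact = affected_clients(portfolio, "Cloudflare")
```

The mapping is only as useful as it is current, so it is worth updating whenever a client adds or drops a platform.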

For the full vendor incident playbook, see How Vendor Outages Affect Marketing Agencies and Their Clients and How to Handle Third-Party Downtime with Clients: Communication and Escalation.


Structuring Alerts for a Client Portfolio

Alert architecture is where most agency monitoring setups fail. The default configuration for most monitoring tools is optimised for single-site operators who want to know about everything. Agency monitoring needs to be deliberately quieter — and reliably loud when something actually requires action.

The Alert Fatigue Problem

Alert fatigue develops when the volume of non-critical alerts trains the team to treat all alerts as low-priority. The typical progression: a team receives dozens of alerts per day, many of which resolve themselves or are minor SSL pre-expiry warnings. Within a few weeks, the team stops treating alerts as urgent. A real incident then sits unaddressed for 45 minutes because nobody checked.

The solution is not better tools — it is deliberate alert threshold configuration.

Immediate alert triggers (page on-call, any hour):

  • SSL certificate expired (not: expiring in 60 days)
  • DNS record points to an unexpected IP
  • Site returning 5xx from two or more independent check locations
  • Upstream vendor with confirmed major incident affecting client's stack
  • Client has contacted agency to report an issue

Digest items (daily or weekly summary, no immediate alert):

  • SSL expiring in 30–60 days
  • Brief outages that resolve on their own in under 2 minutes
  • Minor vendor incidents with confirmed low client impact
  • DNS TTL changes within expected patterns
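
The two lists above amount to a classification rule. The following sketch encodes it, assuming a hypothetical Event record; anything not on the immediate list falls through to the digest, which is a default chosen for this example rather than a rule stated in the lists.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    kind: str                             # "ssl", "dns", "uptime", "vendor", "client_report"
    days_to_expiry: Optional[int] = None  # ssl events only
    unexpected_ip: bool = False           # dns events only
    failing_locations: int = 0            # uptime events only
    major_incident: bool = False          # vendor events only


def urgency(event: Event) -> str:
    """Return 'immediate' for the paging triggers, otherwise 'digest'."""
    if event.kind == "client_report":
        return "immediate"
    if event.kind == "ssl":
        expired = event.days_to_expiry is not None and event.days_to_expiry <= 0
        return "immediate" if expired else "digest"
    if event.kind == "dns":
        return "immediate" if event.unexpected_ip else "digest"
    if event.kind == "uptime":
        return "immediate" if event.failing_locations >= 2 else "digest"
    if event.kind == "vendor":
        return "immediate" if event.major_incident else "digest"
    # Unknown event kinds default to the digest (assumption of this sketch).
    return "digest"
```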

SLA Tiering for Portfolio Alert Routing

Not all clients warrant the same on-call urgency. A brochure site on a basic retainer does not require a 2am paged response. An e-commerce client on a premium SLA does.

Define priority tiers before configuring alert routing:

Priority A — immediate paging, any hour: Clients on SLA contracts with material downtime impact. E-commerce sites, booking platforms, lead generation landing pages with active campaigns.

Priority B — business hours immediate, off-hours digest: Mid-tier retainer clients. Off-hours incidents go to a digest reviewed first thing in the morning.

Priority C — digest only: Low-value or static sites. Incidents go to a daily digest. Clients are notified next business day.

Map each client to a tier before going live with alerting. This mapping is the most commonly skipped configuration step, and its absence is the primary cause of alert fatigue in agency monitoring setups.
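
Tier-based routing can be sketched as a function of the client's tier and the time of the incident. The business-hours window (09:00 to 17:00, Monday to Friday) and the return labels are assumptions for the example; the tiers themselves follow the A/B/C definitions above.

```python
from datetime import datetime, time


def route_alert(
    tier: str,
    at: datetime,
    business_start: time = time(9, 0),
    business_end: time = time(17, 0),
) -> str:
    """Return 'immediate' or 'digest' for an incident on a client of the given tier."""
    if tier == "A":
        return "immediate"  # paged at any hour
    if tier == "B":
        is_weekday = at.weekday() < 5  # Monday=0 ... Friday=4
        in_hours = business_start <= at.time() < business_end
        return "immediate" if is_weekday and in_hours else "digest"
    if tier == "C":
        return "digest"
    raise ValueError(f"unknown tier: {tier!r}")
```

Raising on an unknown tier is deliberate: a client with no tier assigned should fail loudly during setup, not silently fall into a default.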

For implementation guidance, see How to Manage a Multi-Client Monitoring Dashboard Without Losing Your Mind.


Client-Facing Monitoring Output

Monitoring creates data. Client-facing monitoring translates that data into value for clients. These are different activities with different outputs.

Status Pages

A client-facing status page is a public or password-protected URL that shows the real-time status and recent history of the client's monitored assets. When a client asks "is our site up?", the status page is the answer — without requiring the agency to manually check.

What an agency status page should show:

  • Current status of each monitored asset (SSL, DNS, uptime)
  • Real-time incident status with a plain-English description
  • Uptime percentage for the last 30 and 90 days
  • Recent incident history with resolution timestamps

What it should not show:

  • Any data from other clients
  • Internal agency monitoring tool configuration
  • Alerts that are not yet confirmed incidents
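
The show and do-not-show rules above can be enforced at the point where status page data is assembled. This sketch assumes a hypothetical Incident record; the field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Incident:
    client_id: str
    asset: str           # e.g. "uptime", "ssl", "dns"
    summary: str         # plain-English description shown to the client
    confirmed: bool      # False while the alert is still being verified
    internal_note: str = ""  # agency-only detail, never published


def status_page_incidents(
    incidents: Iterable[Incident], client_id: str
) -> List[Dict[str, str]]:
    """Build the public incident list for one client.

    Filters out other clients' incidents and unconfirmed alerts, and copies
    only the fields intended for client view.
    """
    return [
        {"asset": i.asset, "summary": i.summary}
        for i in incidents
        if i.client_id == client_id and i.confirmed
    ]
```

Building the output from an explicit list of fields, instead of removing private ones, means a field added later stays private until someone chooses to publish it.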

For implementation guidance, see Why Agencies Need a Client-Facing Status Page (and How to Set One Up).

Monthly Monitoring Reports

Monthly reports serve two purposes: they demonstrate the value of the monitoring service, and they create the audit trail for SLA compliance.

An effective agency monitoring report covers:

  • Uptime percentage for the period, against the contracted SLA
  • Incidents: what happened, when, how long the resolution took
  • SSL and DNS status: current certificate expiry dates, any changes detected
  • Vendor incidents that affected the client's stack
  • Upcoming: certificates due for renewal in the next 60 days, any planned maintenance
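
The uptime figure in the report is simple arithmetic: the share of the period during which the site was not down. As a worked example, a 99.9% SLA over a 30-day period allows about 43.2 minutes of downtime, so a single 45-minute outage breaches it. The function names below are illustrative, and the sketch counts all downtime without excluding planned maintenance.

```python
def uptime_percent(period_seconds: float, downtime_seconds: float) -> float:
    """Uptime as a percentage of the reporting period."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if not 0 <= downtime_seconds <= period_seconds:
        raise ValueError("downtime_seconds must lie within the period")
    return 100.0 * (period_seconds - downtime_seconds) / period_seconds


def allowed_downtime_seconds(period_seconds: float, sla_percent: float) -> float:
    """Downtime budget implied by an SLA percentage over the period."""
    return period_seconds * (100.0 - sla_percent) / 100.0


def meets_sla(period_seconds: float, downtime_seconds: float, sla_percent: float) -> bool:
    """True when recorded downtime is within the SLA's downtime budget."""
    return downtime_seconds <= allowed_downtime_seconds(period_seconds, sla_percent)


THIRTY_DAYS = 30 * 24 * 60 * 60       # reporting period in seconds
outage = 45 * 60                      # one 45-minute incident
report_uptime = uptime_percent(THIRTY_DAYS, outage)
within_sla = meets_sla(THIRTY_DAYS, outage, 99.9)
```

Reporting the downtime budget alongside the percentage tends to be easier for clients to read than a figure with several decimal places.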

What clients do not need in a monitoring report:

  • Raw alert logs
  • Technical infrastructure detail
  • Check-by-check granularity

For the full report template, see How to Report Website Monitoring to Clients: What to Include and What to Skip and How to Automate Monthly Monitoring Reports for Agency Clients.


Incident Response for Agency Monitoring

When a real incident occurs, the agency's response protocol determines client perception as much as the actual resolution.

The First 30 Minutes

The first 30 minutes of a client incident determine whether the agency appears in control or reactive. The sequence:

  1. Confirm the incident — verify from at least two independent data sources (monitoring tool + manual check) before notifying the client. A single failed check may be a false positive.
  2. Assess client impact — is the main site down, or a single page? Is the issue affecting all users or a subset? Is it an agency-controlled issue (SSL expired) or a vendor issue?
  3. Notify the client — plain language, no jargon. "Your site is showing errors for visitors. We are investigating." Not "We are observing elevated 5xx rates from our uptime check nodes."
  4. Identify the cause — check SSL status, DNS records, recent deployments, and vendor status pages in parallel.
  5. Resolve or escalate — if the issue is agency-controlled (expired certificate, misconfigured DNS), resolve it. If it is vendor-caused, communicate the vendor's status and expected resolution timeline.

For the complete playbook, see Agency Website Downtime Response Playbook: What to Do in the First 30 Minutes.


Evaluating Monitoring Tools for Agency Use

Most uptime monitoring tools are designed for DevOps teams monitoring their own infrastructure. The features that matter for agency portfolio monitoring are substantially different.

What Agency Monitoring Tools Must Support

Per-client grouping: Assets organised by client, not by asset type. The view for a client account should show all of that client's monitored assets together — SSL, DNS, uptime — without manual filtering.

Tiered alert routing: Different on-call rules for different clients. Priority A clients get immediate pages; Priority C clients get daily digests. This routing must be configurable per client without affecting other clients.

Client-facing status views: Per-client status pages you can share directly with clients. The URL must not expose data from other clients.

Portfolio-level summary: A single view showing the health of the entire portfolio — how many clients have active incidents, how many have upcoming certificate renewals, and the aggregate uptime across the portfolio.

Role-based access: Account managers should see their clients' data without access to the entire portfolio. On-call engineers should see everything during active incidents.

Exportable per-client history: Monthly reports pulled per client. The export should cover the reporting period, be formatted for client consumption, and not require manual filtering from a full-portfolio export.

For a full evaluation framework, see Monitoring Tools for Digital Agencies in 2026: What to Look For.

