Incident communication
Tell users when auth is broken
When Olympus is down (or degraded), users can't sign in. They're stuck. Communicate clearly.
Status page
Host it on a separate domain (status.your-domain.com) outside your main infra, so the status page stays up even when the main infra is down.
Tools:
- Atlassian Statuspage (paid).
- BetterStack.
- Self-hosted cState / Upptime (static sites; Upptime runs on GitHub Actions and Pages).
Manual incident lifecycle
Update the status page as the incident progresses:
[Investigating - 2026-05-13 14:32 UTC]
We're aware of users unable to sign in. Investigating.
[Identified - 14:45 UTC]
We've identified the issue: database connection pool exhaustion.
Working on a fix.
[Monitoring - 15:02 UTC]
Fix deployed. Monitoring recovery.
[Resolved - 15:30 UTC]
Authentication is back to normal. Postmortem to follow.
Time-stamped, factual, no jargon.
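The lifecycle can be modeled as a small state machine so that tooling rejects out-of-order updates. This is a minimal sketch, assuming the common four-state progression (investigating, identified, monitoring, resolved); the names `IncidentStatus`, `canTransition`, and `formatUpdate` are illustrative, not part of any status page product.

```typescript
// Incident states, in the order an incident normally moves through them.
type IncidentStatus = "investigating" | "identified" | "monitoring" | "resolved";

// Allowed forward transitions. Assumption: states may be skipped (a quick fix
// can go straight to "monitoring"), but an incident never moves backward.
const NEXT: Record<IncidentStatus, IncidentStatus[]> = {
  investigating: ["identified", "monitoring", "resolved"],
  identified: ["monitoring", "resolved"],
  monitoring: ["resolved"],
  resolved: [],
};

interface StatusUpdate {
  status: IncidentStatus;
  at: Date;
  message: string;
}

function canTransition(from: IncidentStatus, to: IncidentStatus): boolean {
  return NEXT[from].indexOf(to) !== -1;
}

// Render an update in the "[Status - HH:MM UTC]" style used above.
function formatUpdate(u: StatusUpdate): string {
  const hhmm = u.at.toISOString().slice(11, 16); // ISO strings are always UTC
  const label = u.status.charAt(0).toUpperCase() + u.status.slice(1);
  return `[${label} - ${hhmm} UTC]\n${u.message}`;
}
```

Calling `canTransition` before publishing each update is enough to catch a mislabeled entry.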
Notification channels
When you post an update (for example, marking the incident "Identified"), notifications go out via:
- Email to subscribers.
- Webhook to Slack channel.
- RSS feed.
- Twitter / Bluesky (status page integrations).
Customers subscribe to whatever they prefer.
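Hosted status pages handle this fan-out for you. If you wire it up yourself, the sketch below shows one way to do it, assuming each channel is a function you supply. `IncidentNotice`, `Channel`, and `notifyAll` are hypothetical names. The only external detail relied on is that Slack incoming webhooks accept a JSON body with a `text` field.

```typescript
interface IncidentNotice {
  title: string;
  status: string;
  url: string;
}

// Body for a Slack incoming webhook: a JSON object with a "text" field.
function slackPayload(n: IncidentNotice): string {
  return JSON.stringify({ text: `[${n.status}] ${n.title}\n${n.url}` });
}

// A channel is any async sender (email, Slack, social). Hypothetical shape.
interface Channel {
  name: string;
  send: (n: IncidentNotice) => Promise<void>;
}

// Send to every channel in parallel. A failure in one channel must not block
// the others; the names of the failed channels are returned for retry.
async function notifyAll(n: IncidentNotice, channels: Channel[]): Promise<string[]> {
  const ok = await Promise.all(
    channels.map((c) => c.send(n).then(() => true, () => false))
  );
  return channels.filter((_, i) => !ok[i]).map((c) => c.name);
}
```

Returning the failed channel names, rather than throwing, keeps a Slack outage from suppressing the email to subscribers.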
Severity levels
Define:
- Major outage: total inability to sign in.
- Partial outage: some users can't sign in (specific provider).
- Degraded: slow but functional.
- Maintenance: scheduled, communicated in advance.
Use consistent labels. Customers learn what each means.
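To keep labels consistent across responders, the mapping from health signals to a label can be written down as code. The following is a sketch only: the thresholds (95% and 5% failure rate, 2000 ms p95 latency) and the `AuthHealth` fields are assumptions chosen for illustration, not recommended values.

```typescript
type Severity =
  | "major_outage"
  | "partial_outage"
  | "degraded"
  | "maintenance"
  | "operational";

// Hypothetical health snapshot; feed it from whatever metrics you already have.
interface AuthHealth {
  signInFailureRate: number; // fraction of failed sign-ins, 0..1
  p95LatencyMs: number;
  inMaintenanceWindow: boolean;
}

function classify(h: AuthHealth): Severity {
  if (h.inMaintenanceWindow) return "maintenance";
  // Thresholds below are illustrative assumptions; tune them to your traffic.
  if (h.signInFailureRate >= 0.95) return "major_outage"; // effectively nobody can sign in
  if (h.signInFailureRate >= 0.05) return "partial_outage"; // some users or one provider
  if (h.p95LatencyMs >= 2000) return "degraded"; // slow but functional
  return "operational";
}
```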
What to say (and not say)
Say:
- What's happening (high level).
- Estimated time to fix.
- What to do (or "no action needed").
Don't say (yet):
- An unconfirmed cause (while the investigation is still in progress).
- Blame (vendor, team).
- Speculation.
Save those for the postmortem.
In-app banner
Beyond the status page, show a banner in the app:
{incident && (
  <Banner intent={incident.severity}>
    {incident.message}
    <Link href={`https://status.your-domain.com/incident/${incident.id}`}>
      View status
    </Link>
  </Banner>
)}
The banner is fed from the status page API.
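The `incident` object has to come from somewhere. Below is a sketch of the mapping step, assuming the status page API returns a list of incidents with `id`, `name`, `status`, and `impact` fields. That response shape and the `Intent` values are assumptions, so check them against your provider and your `Banner` component.

```typescript
type Impact = "none" | "minor" | "major" | "critical";
type Intent = "info" | "warning" | "danger"; // hypothetical Banner intents

// Assumed shape of one incident in the status page API response.
interface ApiIncident {
  id: string;
  name: string;
  status: string; // e.g. "investigating", "resolved"
  impact: Impact;
}

// Shape consumed by the banner snippet above.
interface BannerIncident {
  id: string;
  severity: Intent;
  message: string;
}

const RANK: Record<Impact, number> = { none: 0, minor: 1, major: 2, critical: 3 };
const INTENT: Record<Impact, Intent> = {
  none: "info",
  minor: "warning",
  major: "danger",
  critical: "danger",
};

// Pick the most severe unresolved incident, or null when there is nothing to show.
function toBannerIncident(incidents: ApiIncident[]): BannerIncident | null {
  const open = incidents.filter((i) => i.status !== "resolved");
  if (open.length === 0) return null;
  const worst = open.reduce((a, b) => (RANK[b.impact] > RANK[a.impact] ? b : a));
  return { id: worst.id, severity: INTENT[worst.impact], message: worst.name };
}
```

Returning `null` when nothing is open lets the `{incident && (...)}` guard hide the banner without extra logic.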
Comms during the incident
Update cadence:
- Every 15 min while investigating.
- Every 30 min once the cause is identified.
- After the fix ships: post a "monitoring" update, then resolve about 30 min later if recovery holds.
Even "no update, still investigating" is communication. Silence is the worst.
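A reminder that fires when an update is overdue helps guard against silence. The sketch below encodes the intervals above. Treating the monitoring phase as a 30-minute interval is an interpretation, and `nextUpdateDue` and `isOverdue` are illustrative names.

```typescript
type Phase = "investigating" | "identified" | "monitoring";

// Minutes allowed between public updates in each phase.
const INTERVAL_MIN: Record<Phase, number> = {
  investigating: 15,
  identified: 30,
  monitoring: 30, // assumption: same 30-minute window while watching recovery
};

function nextUpdateDue(phase: Phase, lastUpdate: Date): Date {
  return new Date(lastUpdate.getTime() + INTERVAL_MIN[phase] * 60 * 1000);
}

function isOverdue(phase: Phase, lastUpdate: Date, now: Date): boolean {
  return now.getTime() > nextUpdateDue(phase, lastUpdate).getTime();
}
```

Wiring `isOverdue` to a ping in the incident channel is usually enough; the update itself can be as short as "no update, still investigating".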
After resolution
Within 5 business days, post the postmortem:
Postmortem: Authentication outage on 2026-05-13
Summary
-------
On May 13, between 14:32 and 15:30 UTC, ~80% of sign-in attempts failed.
Root cause: connection pool exhaustion after a config change.
Timeline
--------
14:32 - Config change rolled out.
14:35 - First reports of sign-in failures.
14:42 - Incident identified.
14:55 - Mitigation: rolled back config.
15:02 - Auth recovering.
15:30 - Fully resolved.
Root cause
----------
The config change was meant to increase the pool size, but a faulty entry
left the pool with fewer connections than expected. Under traffic, the pool was exhausted.
Action items
------------
1. Add automated validation of config changes (Owner: SRE, due May 25).
2. Add canary deploy for config (Owner: Platform, due June 1).
3. Add connection pool monitoring (Owner: Platform, done).
Honest. Future-focused. No blame.
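The "within 5 business days" deadline is easy to miscount across a weekend. This is a small sketch for computing it, assuming business days are Monday to Friday in UTC and ignoring public holidays; `addBusinessDays` is a hypothetical helper.

```typescript
// Add N business days (Mon-Fri, UTC) to a date. Holidays are not considered.
function addBusinessDays(start: Date, days: number): Date {
  const d = new Date(start.getTime());
  let remaining = days;
  while (remaining > 0) {
    d.setUTCDate(d.getUTCDate() + 1);
    const dow = d.getUTCDay(); // 0 = Sunday, 6 = Saturday
    if (dow !== 0 && dow !== 6) remaining--;
  }
  return d;
}

// Incident resolved on Wednesday 2026-05-13: postmortem due by 2026-05-20.
const postmortemDue = addBusinessDays(new Date("2026-05-13T15:30:00Z"), 5);
```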
Customer notification
For B2B with SLAs, notify directly:
Subject: Authentication outage on 2026-05-13, postmortem
Dear Customer,
On May 13, we experienced an authentication outage affecting approximately
80% of sign-in attempts between 14:32 and 15:30 UTC.
We've published a full postmortem here: [URL]
As required by our SLA, you're entitled to a 5% service credit for the
affected period, which will be applied to your next invoice.
We're sorry for the disruption.
[Your Team]
Apologize. Don't hide. Spell out the credit.
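How a credit like the 5% above might be derived depends entirely on your contract. The sketch below assumes a monthly availability SLA with tiered credits. The tier table is hypothetical, and the 58-minute window (14:32 to 15:30) is counted as full downtime even though only about 80% of sign-ins failed.

```typescript
// Hypothetical credit tiers, ordered from the lowest availability threshold up.
const CREDIT_TIERS = [
  { belowPct: 99.0, creditPct: 10 },
  { belowPct: 99.9, creditPct: 5 },
];

// Availability over a billing period, as a percentage.
function availabilityPct(downtimeMin: number, periodMin: number): number {
  return 100 * (1 - downtimeMin / periodMin);
}

// First tier whose threshold the availability falls below; 0 if the SLA was met.
function creditPct(availability: number): number {
  for (const tier of CREDIT_TIERS) {
    if (availability < tier.belowPct) return tier.creditPct;
  }
  return 0;
}

// May has 31 days; the outage ran 14:32-15:30 UTC, i.e. 58 minutes.
const mayAvailability = availabilityPct(58, 31 * 24 * 60);
const mayCredit = creditPct(mayAvailability);
```

Under these assumed tiers the May outage lands in the 5% band, since 58 minutes out of 44,640 leaves availability at roughly 99.87%.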
When NOT to mark
Some events feel like outages but aren't:
- One user can't sign in (might be user-side).
- A specific edge feature broken (not "authentication").
Mark only when meaningfully widespread. Otherwise: noise.
Maintenance
For scheduled work:
Scheduled maintenance - 2026-05-20 04:00-04:30 UTC
We're upgrading our database. Authentication may be briefly unavailable.
Subscribed users will receive a notification 48h beforehand.
Announce. Conduct. Confirm complete.
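The 48-hour notice can be scheduled from the window itself. This is a minimal sketch, with `MaintenanceWindow`, `noticeTime`, and `maintenanceHeadline` as illustrative names; it assumes the window starts and ends on the same UTC day.

```typescript
interface MaintenanceWindow {
  start: Date;
  end: Date;
}

// When to send the advance notice (default: 48 hours before the window opens).
function noticeTime(w: MaintenanceWindow, leadHours: number = 48): Date {
  return new Date(w.start.getTime() - leadHours * 60 * 60 * 1000);
}

// Headline in the "Scheduled maintenance - YYYY-MM-DD HH:MM-HH:MM UTC" style.
function maintenanceHeadline(w: MaintenanceWindow): string {
  const day = w.start.toISOString().slice(0, 10);
  const from = w.start.toISOString().slice(11, 16);
  const to = w.end.toISOString().slice(11, 16);
  return `Scheduled maintenance - ${day} ${from}-${to} UTC`;
}
```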
Practice
Quarterly: run a tabletop exercise. Pretend an outage is happening and walk through:
- Detection: how fast?
- Internal coordination: Slack channel ready?
- Status page: who can update?
- Customer comms: who writes the email?
Tighten the process. When a real incident hits, you'll move faster.