Operational runbook template
How to write a useful runbook for Olympus operations
A runbook is "what to do when [thing] happens." For 3 AM incidents, having one beats not having one.
Anatomy
# Runbook: Auth flow failures spike
## When this fires
Login success rate < 95% over 5 min OR error rate > 5%.
## Severity
- 90-95% success: warn (slack ping).
- < 90%: page (on-call).
## Detect
- Grafana alert: olympus-login-success-rate.
- Status page: yellow if < 95%, red if < 80%.
## Diagnose (in order)
1. Is Kratos up? `podman ps | grep ciam-kratos` → if down, restart.
2. Is Postgres up? `podman exec ciam-postgres pg_isready` → if not, see [Postgres recovery].
3. Network: `curl https://ciam.your-domain.com/health/ready` from same network as ops.
4. Recent deploys? `git log --since="1 hour ago"`. A push within the last hour is the prime suspect.
5. Rate limit: are we hitting our own DDoS protections? Check Caddy logs.
## Mitigate
- If recent deploy is the suspect:
  cd /opt/olympus && git revert HEAD --no-edit && git push
  podman-compose up -d
- If DB is slow:
  - Check `pg_stat_activity` for stuck queries.
  - Restart Postgres only as last resort.
- If Kratos / Hydra panicking:
  - Restart container.
  - If keeps crashing, check disk space, RAM, configs.
## Verify resolution
- Login success rate climbs to > 99%.
- Synthetic test passes.
- Status page back to green.
## Communicate
- Update status page when resolved.
- Slack message to #alerts: "Resolved, RCA tomorrow."
## Followup (within 5 business days)
- Postmortem document.
- Action items in tracker.
Principles
Be specific
Bad: "If Kratos is having issues, restart it."
Good: "Check `podman ps | grep ciam-kratos`. If status is not 'Up', run `podman start ciam-kratos`. Wait 30s. Check `curl ...health/ready`."
The person on call at 3 AM doesn't remember the context. Spell it out.
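As a rough sketch, the "Good" instruction can be written as a small shell function so the responder runs one command instead of four. The function name `ensure_kratos_up` and the 10-second curl timeout are illustrative assumptions, and the health URL is passed in by the caller because the instruction above elides it.

```shell
# Sketch only: restart ciam-kratos if podman does not list it as "Up",
# then confirm the health endpoint answers.
# Assumption: the caller supplies the full health URL as the first argument.
ensure_kratos_up() {
  health_url="$1"

  # Step 1: `podman ps` lists running containers; look for an "Up" status.
  if podman ps | grep ciam-kratos | grep -q 'Up'; then
    echo "ciam-kratos is Up"
  else
    echo "ciam-kratos is not Up; starting it"
    podman start ciam-kratos || return 1
    # Step 2: wait 30s for the service to boot before probing it.
    sleep 30
  fi

  # Step 3: -f makes curl exit non-zero on HTTP error responses.
  if curl -fsS --max-time 10 "$health_url" >/dev/null; then
    echo "health check passed"
  else
    echo "health check failed; continue with the next diagnose step"
    return 1
  fi
}
```

A non-zero exit status means the restart or the health probe failed, which tells the responder to move on rather than retry blindly.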
One thing per runbook
Don't keep a single "general troubleshooting" runbook. Split:
runbook-login-failures.md
runbook-database-down.md
runbook-email-not-sending.md
Each one stays focused and quick to find.
Up-to-date
Runbooks decay. Review:
- After every incident: was it useful? Update.
- Quarterly: walk through. Find broken commands.
- After dependency changes (Hydra v2 → v3): update commands.
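Part of the quarterly walk-through can be automated. A minimal sketch, assuming runbooks put their commands in inline backticks: `runbook_commands` and `missing_commands` are hypothetical helper names. The check only confirms that a binary is on the PATH of the current host, so it will also flag non-commands such as `pg_stat_activity` and cannot tell whether flags are still valid.

```shell
# Print the first word of every `inline code` span in a markdown file.
# Assumption: the first word of a backticked span is the command name.
runbook_commands() {
  grep -o '`[^`]*`' "$1" | tr -d '`' | awk '{print $1}' | sort -u
}

# Print each given command name that is not available on this host.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}
```

For example, `missing_commands $(runbook_commands docs/runbooks/runbook-login-failures.md)` prints the names worth a manual look. Treat the output as a review list, not a verdict.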
Test in non-emergency
Practice runbooks during chaos drills (see chaos engineering). Find bugs when not stressed.
Sections
When this fires
Trigger condition. Specific.
Alert "high_error_rate", rate(http_requests{status=~"5.."}[5m]) > 0.05
Severity
When to wake people. Don't page for routine.
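The severity split is easy to state as code. The sketch below uses the thresholds from the example runbook (90-95% success warns, below 90% pages); the function names are illustrative, and the rate is handled as a whole-number percentage rounded down, which keeps the "less than" comparisons exact.

```shell
# Whole-number login success rate, rounded down. Fails if there were no logins.
success_rate_pct() {
  ok="$1"
  total="$2"
  [ "$total" -gt 0 ] || return 1
  echo $(( ok * 100 / total ))
}

# Map a success rate to an action using the example runbook's thresholds:
#   < 90%  -> page the on-call
#   90-94% -> warn in Slack
#   >= 95% -> nothing fires
classify_severity() {
  rate="$1"
  if [ "$rate" -lt 90 ]; then
    echo page
  elif [ "$rate" -lt 95 ]; then
    echo warn
  else
    echo ok
  fi
}
```

With 949 successful logins out of 1000, `classify_severity "$(success_rate_pct 949 1000)"` prints `warn`.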
Detect
How to confirm the issue exists. Multiple sources:
- Grafana metrics.
- Logs.
- Direct check.
Diagnose
Steps in priority order. Most common first.
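Priority order can be encoded directly: run the checks in sequence and stop at the first one that fails. A sketch under assumptions: `first_failing_check` and the `check_*` wrappers are hypothetical names, and the wrapped commands are the ones from the example runbook.

```shell
# Run the named checks in the order given; print the first one that fails.
first_failing_check() {
  for check in "$@"; do
    if ! "$check" >/dev/null 2>&1; then
      echo "$check"
      return 1
    fi
  done
  echo "all checks passed"
}

# Hypothetical wrappers around the example runbook's diagnose commands.
check_kratos()   { podman ps | grep -q ciam-kratos; }
check_postgres() { podman exec ciam-postgres pg_isready; }
check_network()  { curl -fsS --max-time 10 https://ciam.your-domain.com/health/ready; }

# Usage, most common cause first:
#   first_failing_check check_kratos check_postgres check_network
```

Stopping at the first failure keeps the output to one line, which is all the responder needs to pick the matching mitigation.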
Mitigate
Actions to take. Each command verbatim.
Verify resolution
Concrete check that "the bad thing stopped."
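A concrete check can be a short polling loop. The sketch below assumes some command that prints the current login success rate as a number; how that number is obtained (Grafana, logs, a synthetic test) is left to the caller, and `wait_for_recovery` is a hypothetical name. The comparison uses awk so fractional rates such as 99.4 work.

```shell
# Poll a rate-printing command until the rate exceeds the threshold.
# Arguments: command name, threshold, max attempts, delay in seconds (default 30).
wait_for_recovery() {
  rate_cmd="$1"
  threshold="$2"
  attempts="$3"
  delay="${4:-30}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    rate="$("$rate_cmd")"
    # awk exits 0 only when rate is strictly greater than the threshold.
    if awk -v r="$rate" -v t="$threshold" 'BEGIN { exit !(r + 0 > t + 0) }'; then
      echo "recovered: ${rate}% after $i check(s)"
      return 0
    fi
    # No need to wait after the final attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "not recovered after $attempts check(s)"
  return 1
}
```

For the example runbook's "climbs to > 99%" criterion, the call would look like `wait_for_recovery my_rate_command 99 10`, where `my_rate_command` stands in for whatever prints the rate in your setup.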
Communicate
Who needs to know. How.
Followup
Postmortem, action items.
Index
# Olympus Runbooks
## Authentication
- [Login failures spike](./runbook-login-failures.md)
- [MFA enrollment broken](./runbook-mfa-broken.md)
- [OAuth2 token endpoint down](./runbook-oauth2-token.md)
## Database
- [Postgres unavailable](./runbook-pg-down.md)
- [Postgres slow queries](./runbook-pg-slow.md)
- [Disk full](./runbook-disk-full.md)
## Email
- [Verification emails not sending](./runbook-email-fail.md)
- [Bounce rate spike](./runbook-email-bounce.md)
## Security
- [Account takeover reported](./runbook-ato.md)
- [Brute force detected](./runbook-brute-force.md)
## Operational
- [Cert expiring soon](./runbook-cert-expiry.md)
- [Backup failed](./runbook-backup-fail.md)
Easy to navigate.
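An index is only easy to navigate while its links resolve. A minimal sketch of a link check, assuming the index uses relative `./name.md` links like the example and sits in the same directory as the runbooks; `broken_index_links` is a hypothetical helper name.

```shell
# Print every ./something.md link target in the index that is missing on disk.
broken_index_links() {
  index="$1"
  base="$(dirname "$index")"
  grep -o '](\./[^)]*\.md)' "$index" |
    sed 's/^](//; s/)$//' |
    while read -r target; do
      [ -f "$base/$target" ] || echo "$target"
    done
}
```

Empty output means every linked runbook exists. Running this in CI or during the quarterly review would catch renamed or deleted files.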
Tools
Markdown in git
Simple, version-controlled. Reviewable changes.
Slab / Notion
Web-friendly. Search. Comments.
Atlassian Confluence
If you're already on it. Not the easiest to maintain.
Code annotations
For tighter coupling:
/**
* If this throws frequently, see [runbook-login-failures.md].
*/
function processLogin() { ... }
Find runbook from code.
Templates
For consistent runbooks:
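# scripts/new-runbook.sh
# Assumes the runbook name is passed as the first argument.
NAME="${1:?usage: new-runbook.sh <name>}"
cp templates/runbook.md "docs/runbooks/$NAME.md"
sed -i "s/{name}/$NAME/" "docs/runbooks/$NAME.md"
# templates/runbook.md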
# Runbook: {name}
## When this fires
[Specific trigger]
## Severity
[]
## Detect
1.
2.
## Diagnose
1.
2.
## Mitigate
1.
2.
## Verify resolution
-
## Communicate
-
## Followup
-
Filling in is fast.
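A template also makes it possible to check that a filled-in runbook still has every section. A sketch, assuming the section headings listed in the template; `missing_sections` is a hypothetical helper name. It matches on the start of the heading, so variants like "## Diagnose (in order)" still count.

```shell
# Print each template section that the given runbook file lacks.
missing_sections() {
  file="$1"
  for section in "When this fires" "Severity" "Detect" "Diagnose" \
                 "Mitigate" "Verify resolution" "Communicate" "Followup"; do
    grep -q "^## $section" "$file" || echo "$section"
  done
}
```

Empty output means the runbook follows the template. It says nothing about whether the sections are filled in well, which still needs a human review.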
Avoid
"Use your judgment"
3 AM judgment is worse than rested judgment. Tell them what to do.
Open-ended steps
"Check the logs" → for what?
Specific:
Check logs for "panic" or "fatal":
podman logs ciam-kratos --since 30m | grep -E "(panic|fatal)"
Outdated tools
If the runbook says "ssh to the Hetzner box" but you've migrated to AWS, it's actively misleading. Keep it maintained.