On-call rotation for Olympus
Set up paging and handoff so auth is reliable
Olympus is critical infrastructure for your users. When it breaks, someone has to fix it. Set up an on-call rotation so it is always clear who that is.
What's on-call
The on-call is the person responsible for responding to alerts during a given night or week. They:
- Get paged when something breaks.
- Diagnose and mitigate.
- Escalate if needed.
- Hand off at rotation end.
Tooling
PagerDuty / OpsGenie / Pushover
Paid services that:
- Manage rotations.
- Receive alerts.
- Page the right person.
- Track responses.
Free / cheap alternatives:
- Better Stack.
- Self-hosted: Cabot.
Slack + cron
For a solo operator or a small team:
- Alerts to Slack channel.
- Designated on-call sees them.
- No automatic paging.
Less rigorous, but workable when alert volume is low.
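The Slack half of this setup can be sketched in a few lines of Python. This is a minimal sketch, assuming a Slack incoming webhook has already been created for the alert channel; the webhook URL, the function names, and the message format are all illustrative assumptions, not part of Olympus.

```python
import json
import urllib.request


def build_slack_alert(summary: str, severity: str, on_call: str) -> dict:
    """Build the JSON payload for a Slack incoming webhook."""
    # Incoming webhooks accept a JSON object with a "text" field.
    return {"text": f"[{severity.upper()}] {summary} (on-call: {on_call})"}


def post_to_slack(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder URL (assumption): use the one Slack issues for your channel.
    url = "https://hooks.slack.com/services/REPLACE/WITH/YOURS"
    payload = build_slack_alert("Login error rate above 5%", "high", "alice")
    print(payload["text"])
    # post_to_slack(url, payload)  # enable once the URL is real
```

A cron entry would run a health check and call `post_to_slack` only when the check fails. Nothing here pages anyone, which is exactly the limitation described above.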
Rotation
For a team:
on_call:
  rotation:
    - person: alice@your-corp.com
      days: Monday, Tuesday, Wednesday
    - person: bob@your-corp.com
      days: Thursday, Friday, Saturday, Sunday
  start: 09:00 UTC
  end: 09:00 UTC next day  # 24 hours
Or weekly:
Week 1: Alice
Week 2: Bob
Week 3: Carol
Each on-call shift is a full 168 hours.
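To answer "who is on-call right now?" from either schedule, a lookup along these lines would do. It is a sketch under assumptions: the 09:00 UTC handoff and the names come from the schedules above, while `WEEKLY_EPOCH` (the Monday on which Alice's first week starts) and the function names are invented for illustration.

```python
from datetime import datetime, timedelta, timezone

HANDOFF_HOUR_UTC = 9

# Day-based schedule, indexed by datetime.weekday() (Monday == 0).
ALICE, BOB = "alice@your-corp.com", "bob@your-corp.com"
DAY_ROTATION = [ALICE, ALICE, ALICE, BOB, BOB, BOB, BOB]

# Weekly schedule. The epoch is an assumed anchor: a Monday at 09:00 UTC.
WEEKLY_ROTATION = ["Alice", "Bob", "Carol"]
WEEKLY_EPOCH = datetime(2026, 1, 5, HANDOFF_HOUR_UTC, tzinfo=timezone.utc)


def on_call_by_day(now: datetime) -> str:
    """Return the on-call person for a timezone-aware instant."""
    # A shift starting Monday 09:00 UTC runs until Tuesday 09:00 UTC, so moving
    # the clock back by the handoff hour yields the day the shift started.
    shift_day = now.astimezone(timezone.utc) - timedelta(hours=HANDOFF_HOUR_UTC)
    return DAY_ROTATION[shift_day.weekday()]


def on_call_by_week(now: datetime) -> str:
    """Return the on-call person under the weekly rotation."""
    weeks_elapsed = (now.astimezone(timezone.utc) - WEEKLY_EPOCH) // timedelta(weeks=1)
    return WEEKLY_ROTATION[weeks_elapsed % len(WEEKLY_ROTATION)]
```

Note that a page at 08:00 UTC on Thursday still belongs to Wednesday's shift, because the handoff has not happened yet.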
Solo
If you're solo (small project / hobby):
- You're always on-call.
- Configure alerts to page you during waking hours only.
- Accept that some incidents will be delayed.
That's fine for non-critical systems. Don't pretend you have 24/7 coverage if you don't.
Alert categories
critical_immediate_page:
- Auth completely down (all logins failing).
- Data loss event.
- Active security incident.
high_within_1h:
- Login error rate > 5% but < 50%.
- Latency p99 > 1s.
- Backup failing.
medium_within_business_hours:
- Disk > 80%.
- Cert expires in < 14 days.
- Slow query alerts.
low_review_weekly:
- Anomalies in audit log.
- Email bounce rate slightly high.
Don't page for non-critical alerts.
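The thresholds above can be turned into a small classifier. This is only a sketch: the `Snapshot` fields and the `classify` function are hypothetical names, and one rule is an assumption of this example rather than something the tiers state, namely that a login error rate of 50% or more (which the tiers leave unassigned) is treated as critical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    login_error_rate: float  # fraction of failed logins, 0.0 to 1.0
    latency_p99_s: float     # p99 latency in seconds
    disk_used_pct: float     # disk usage, 0 to 100
    cert_days_left: int      # days until the certificate expires
    backup_ok: bool = True


def classify(s: Snapshot) -> Optional[str]:
    """Return the most severe matching category, or None if all is well."""
    # Assumption: error rates of 50% or more are handled as critical.
    if s.login_error_rate >= 0.50:
        return "critical_immediate_page"
    if s.login_error_rate > 0.05 or s.latency_p99_s > 1.0 or not s.backup_ok:
        return "high_within_1h"
    if s.disk_used_pct > 80 or s.cert_days_left < 14:
        return "medium_within_business_hours"
    return None
```

Checks run from most to least severe, so a snapshot that matches several rules is reported once, at its highest category.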
Runbook readily available
When the pager goes off, the on-call shouldn't have to remember everything. Link the runbook directly from the alert:
PagerDuty alert: "Login error rate spike"
→ description includes runbook URL
→ on-call clicks → reads steps → acts
See Runbook format.
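One way to keep alerts and runbooks in step is to generate alert descriptions from a single table and refuse to build an alert that has no runbook. The sketch below assumes such a table exists; the alert names, the `docs.example.com` URLs, and both function names are hypothetical.

```python
from typing import Iterable, List

# Hypothetical alert names and runbook URLs, for illustration only.
RUNBOOKS = {
    "Login error rate spike": "https://docs.example.com/runbooks/login-error-rate",
    "Backup failing": "https://docs.example.com/runbooks/backup-failing",
}


def alert_description(alert_name: str, details: str) -> str:
    """Return alert text ending in the runbook link; KeyError if none exists."""
    if alert_name not in RUNBOOKS:
        raise KeyError(f"no runbook registered for alert {alert_name!r}")
    return f"{alert_name}: {details}\nRunbook: {RUNBOOKS[alert_name]}"


def alerts_missing_runbooks(alert_names: Iterable[str]) -> List[str]:
    """List configured alerts that have no runbook yet."""
    return sorted(name for name in alert_names if name not in RUNBOOKS)
```

Running `alerts_missing_runbooks` over the full alert list in CI would catch a new alert that ships without instructions.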
Escalation
If the primary on-call doesn't respond within 5 minutes, escalate to the secondary.
Don't expect heroics. People miss pages.
escalation:
  - level: 1
    notify: primary_on_call
    timeout: 5m
  - level: 2
    notify: secondary_on_call
    timeout: 5m
  - level: 3
    notify: tech_lead
    timeout: 5m
  - level: 4
    notify: cto
Tiered: an unanswered page cascades to the next level.
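Paging tools implement this policy for you, but the logic is simple enough to sketch. The example below assumes the four levels and 5-minute timeouts shown above; the `ESCALATION` table layout and the function name are illustrative, not any tool's API.

```python
from datetime import timedelta
from typing import List

# (who to notify, how long to wait before moving to the next level)
ESCALATION = [
    ("primary_on_call", timedelta(minutes=5)),
    ("secondary_on_call", timedelta(minutes=5)),
    ("tech_lead", timedelta(minutes=5)),
    ("cto", None),  # last level: nobody left to escalate to
]


def notified_so_far(unanswered_for: timedelta) -> List[str]:
    """Everyone paged once an alert has gone unanswered for this long."""
    notified = []
    level_starts_at = timedelta(0)
    for target, timeout in ESCALATION:
        if unanswered_for < level_starts_at:
            break
        notified.append(target)
        if timeout is None:
            break
        level_starts_at += timeout
    return notified
```

With these timeouts, the CTO is paged after 15 minutes of silence, which is the worst case to plan around.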
Handoff
At the end of each rotation, write a handoff note:
# On-call handoff: 2026-05-13 → 2026-05-14
## Open incidents
- None.
## Recent fixes
- Restarted ciam-kratos at 04:30 after OOM. Should investigate memory leak.
## Watch list
- Disk at 76% (cleanup script will run tomorrow).
- Login rate elevated this morning (suspected botnet, IPs blocked).
## Notes for next on-call
- Don't restart Postgres without taking a backup first.
- New employee Carol joining team Monday, keep an eye on her access provisioning.
Keep the note as Markdown in a shared doc, so the next person picks up cleanly.
Hours
Pick a coverage tier. The pager is live during these hours:
| Tier | Hours |
|---|---|
| 24/7 ops | Always |
| Business + nights | 09:00-22:00 weekdays; alerts only on weekends |
| Business only | 09:00-18:00 weekdays; alerts queued otherwise |
For a solo operator or a small team, business-hours coverage is fine for non-emergencies.
Don't burn out
On-call is stressful. Mitigate:
- Rotate equally (no person on-call > 25% of the time).
- Time off after high-stress weeks.
- Compensate (extra time off, bonus).
Burnout = mistakes = more incidents.
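The 25% cap is easy to check against a real shift log. Here is a minimal sketch, assuming shifts are recorded as `(person, hours)` pairs; the record format and function names are assumptions of this example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

MAX_SHARE = 0.25  # the 25% cap from the list above


def on_call_share(shifts: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Map each person to their fraction of all on-call hours."""
    hours = Counter()
    for person, shift_hours in shifts:
        hours[person] += shift_hours
    total = sum(hours.values())
    return {person: h / total for person, h in hours.items()} if total else {}


def over_limit(shifts: Iterable[Tuple[str, float]], limit: float = MAX_SHARE) -> List[str]:
    """People whose on-call share exceeds the limit."""
    return sorted(p for p, share in on_call_share(shifts).items() if share > limit)
```

The arithmetic has a practical consequence: staying at or under 25% each requires at least four people in the rotation, so a three-person weekly rotation already exceeds the cap.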
Practice
Quarterly tabletop:
- Pretend incident.
- On-call walks through response.
- Identify gaps.
Without practice, you won't find out whether your runbooks are actually usable until a real incident.
What you measure
Track:
- MTTR (mean time to resolve).
- Number of pages per week.
- False-positive rate.
Goals:
- MTTR trending down.
- Pages per week not increasing (or decreasing).
- False positives < 10% of pages.
If pages increase, your system is getting more fragile. Invest in reliability.
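All three metrics fall out of a simple page log. The sketch below assumes each page is recorded with an open time, a resolve time, and a flag for whether it was a real issue; the `Page` record and the function names are hypothetical, and computing MTTR over real issues only is a choice made here, not something the list above prescribes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List, Optional, Tuple


@dataclass
class Page:
    opened: datetime
    resolved: datetime
    real_issue: bool  # False marks a false positive


def mttr(pages: List[Page]) -> Optional[timedelta]:
    """Mean time to resolve over real issues; None if there were none."""
    durations = [(p.resolved - p.opened).total_seconds() for p in pages if p.real_issue]
    return timedelta(seconds=mean(durations)) if durations else None


def false_positive_rate(pages: List[Page]) -> float:
    """Fraction of pages that were not real issues."""
    return sum(not p.real_issue for p in pages) / len(pages) if pages else 0.0


def pages_per_week(pages: List[Page]) -> Dict[Tuple[int, int], int]:
    """Count pages per (ISO year, ISO week), keyed by when they opened."""
    counts = Counter()
    for p in pages:
        iso = p.opened.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)
```

Comparing these numbers from one month to the next shows the trends the goals above care about.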
Tooling integration
PagerDuty integrates with:
- Slack: alerts show in channel + ping individual.
- GitHub: link incidents to PRs.
- Datadog: alerts forwarded.
- Custom webhooks: anything.
Set up these channels for visibility and context.
Postmortem culture
After every page, ask:
- Was it a real issue?
- What was the cause?
- How was it resolved?
- How to prevent recurrence?
Document the answers in a postmortem, track the action items, and implement them.