On-call rotation for Olympus

Olympus is critical infrastructure for your users. When it breaks, someone has to fix it. Set up an on-call rotation.

What's on-call

On-call is the person responsible for responding to alerts during their shift (a night or a week). They:

Get paged when something breaks.
Diagnose, mitigate.
Escalate if needed.
Hand off at rotation end.

Tooling

PagerDuty / OpsGenie / Pushover

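Paid services. PagerDuty and OpsGenie do all of the following; Pushover delivers push notifications but does not manage rotations: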

Manage rotations.
Receive alerts.
Page the right person.
Track responses.

Free / cheap alternatives:

Better Stack.
Self-hosted: Cabot.

Slack + cron

For a solo operator or a small team:

Alerts go to a Slack channel.
The designated on-call watches the channel.
There is no automatic paging.

Less rigorous, but it works when alert volume is low.
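A minimal sketch of the Slack side, assuming a Slack incoming webhook. The webhook URL is a placeholder and the function names are hypothetical; a cron job could run a check script that calls `post_alert` when something looks wrong.

```python
import json
import urllib.request

# Placeholder: replace with your own Slack incoming-webhook URL.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_slack_payload(alert_name: str, severity: str, on_call: str) -> bytes:
    """Encode an alert as the JSON body an incoming webhook accepts."""
    text = f"[{severity.upper()}] {alert_name} (on-call: {on_call})"
    return json.dumps({"text": text}).encode("utf-8")


def post_alert(alert_name: str, severity: str, on_call: str) -> None:
    """POST the alert to the channel; raises on network or HTTP errors."""
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=build_slack_payload(alert_name, severity, on_call),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

Nothing here pages anyone: the message only lands in the channel, so someone still has to be looking at it.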

Rotation

For a team:

on_call:
  rotation:
    - person: alice@your-corp.com
      days: [Monday, Tuesday, Wednesday]
    - person: bob@your-corp.com
      days: [Thursday, Friday, Saturday, Sunday]
  start: 09:00 UTC
  end: 09:00 UTC next day  # 24 hours
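One way to resolve the schedule above to a person, sketched in Python. The function name and data layout are assumptions for illustration, not part of any paging tool. The point it shows is the 09:00 UTC boundary: before 09:00, the previous day's shift is still running.

```python
from datetime import datetime, timedelta, timezone

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Mirrors the example schedule; each shift starts at 09:00 UTC.
ROTATION = {
    "alice@your-corp.com": {"Monday", "Tuesday", "Wednesday"},
    "bob@your-corp.com": {"Thursday", "Friday", "Saturday", "Sunday"},
}
SHIFT_START_HOUR = 9


def on_call_at(moment: datetime) -> str:
    """Return who is on call at `moment` (must be timezone-aware)."""
    utc = moment.astimezone(timezone.utc)
    # Shift the clock back so a "shift day" runs 09:00 to 09:00.
    shift_day = utc - timedelta(hours=SHIFT_START_HOUR)
    day_name = DAY_NAMES[shift_day.weekday()]
    for person, days in ROTATION.items():
        if day_name in days:
            return person
    raise LookupError(f"no one is scheduled for {day_name}")
```

For example, at 08:59 UTC on a Thursday this still returns Alice, because her Wednesday shift has not ended yet.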

Or weekly:

Week 1: Alice
Week 2: Bob
Week 3: Carol

Each on-call shift is a full 168 hours (7 × 24).
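A weekly rotation can be computed from a fixed starting point. The sketch below assumes a rotation epoch (the moment Alice's first week starts); that date is made up for illustration and should be replaced with your own.

```python
from datetime import datetime, timedelta, timezone

ROSTER = ["Alice", "Bob", "Carol"]
# Assumed start of Alice's first week (a Monday, 09:00 UTC).
ROTATION_EPOCH = datetime(2026, 5, 11, 9, 0, tzinfo=timezone.utc)
SHIFT_LENGTH = timedelta(hours=168)  # 7 days x 24 hours


def weekly_on_call(moment: datetime) -> str:
    """Return who holds the weekly shift at `moment` (timezone-aware)."""
    # Floor division of two timedeltas gives whole shifts elapsed.
    week_index = (moment - ROTATION_EPOCH) // SHIFT_LENGTH
    return ROSTER[week_index % len(ROSTER)]
```

After Carol's week the roster wraps around to Alice again.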

Solo

If you're solo (small project / hobby):

You're always on-call.
Set up alerts to notify you during waking hours only.
Accept that some incidents will be delayed.

That's fine for non-critical services. Don't pretend you have 24/7 coverage if you don't.
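A sketch of what "waking hours only" could mean in practice: alerts raised overnight are held until the morning. The 08:00 to 23:00 window and the function name are assumptions; adjust them to your own schedule.

```python
from datetime import datetime, time, timedelta

# Assumed waking hours in your local time.
AWAKE_START = time(8, 0)
AWAKE_END = time(23, 0)


def delivery_time(raised_at: datetime) -> datetime:
    """Return when an alert raised at `raised_at` should be delivered."""
    if AWAKE_START <= raised_at.time() < AWAKE_END:
        return raised_at  # awake: deliver immediately
    wake_up = datetime.combine(
        raised_at.date(), AWAKE_START, tzinfo=raised_at.tzinfo
    )
    if raised_at.time() >= AWAKE_END:
        wake_up += timedelta(days=1)  # late evening: wait for tomorrow
    return wake_up
```

The gap between `raised_at` and the returned time is the delay you are accepting for each overnight incident.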

Alert categories

critical_immediate_page:
  - Auth completely down (all logins failing).
  - Data loss event.
  - Active security incident.
  
high_within_1h:
  - Login error rate > 5% but < 50%.
  - Latency p99 > 1s.
  - Backup failing.

medium_within_business_hours:
  - Disk > 80%.
  - Cert expires in < 14 days.
  - Slow query alerts.

low_review_weekly:
  - Anomalies in audit log.
  - Email bounce rate slightly high.

Don't page for non-critical alerts.
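The thresholds above could be encoded roughly as follows. This is a sketch, not a monitoring rule set: the function and parameter names are hypothetical, and it assumes a login error rate of 50% or more counts as critical, which the list does not state explicitly.

```python
from typing import Optional

CRITICAL = "critical_immediate_page"
HIGH = "high_within_1h"
MEDIUM = "medium_within_business_hours"


def classify(login_error_rate: float = 0.0,
             p99_latency_s: float = 0.0,
             disk_used: float = 0.0,
             cert_days_left: Optional[int] = None) -> Optional[str]:
    """Return the most severe category that applies, or None."""
    # Assumption: 50% or more failing logins is treated as critical.
    if login_error_rate >= 0.50:
        return CRITICAL
    if login_error_rate > 0.05 or p99_latency_s > 1.0:
        return HIGH
    cert_expiring = cert_days_left is not None and cert_days_left < 14
    if disk_used > 0.80 or cert_expiring:
        return MEDIUM
    return None


def should_page(category: Optional[str]) -> bool:
    """Only the critical category wakes someone up."""
    return category == CRITICAL
```

Rates and disk usage are fractions here (0.05 means 5%).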

Runbook readily available

When the pager goes off, the on-call shouldn't have to remember everything. Link the runbook directly from the alert:

PagerDuty alert: "Login error rate spike"
→ description includes runbook URL
→ on-call clicks → reads steps → acts

See Runbook format.
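One way to make sure every alert carries its runbook link is to build the description in code and refuse alerts that have no runbook. The base URL, slug table, and function name below are hypothetical placeholders.

```python
# Hypothetical base URL and slugs; substitute your own documentation site.
RUNBOOK_BASE_URL = "https://docs.example.com/runbooks"
RUNBOOK_SLUGS = {
    "Login error rate spike": "login-error-rate-spike",
}


def alert_description(alert_name: str, summary: str) -> str:
    """Build an alert description that always ends with the runbook link."""
    try:
        slug = RUNBOOK_SLUGS[alert_name]
    except KeyError:
        # Fail loudly: an alert without a runbook should not ship.
        raise ValueError(f"alert {alert_name!r} has no runbook") from None
    return f"{summary}\nRunbook: {RUNBOOK_BASE_URL}/{slug}"
```

Raising on a missing entry is a design choice: it surfaces the gap when the alert is defined rather than when someone is paged at night.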

Escalation

If the primary on-call doesn't acknowledge the page within 5 minutes, escalate to the secondary.

Don't expect heroics. People miss pages.

escalation:
  - level: 1
    notify: primary_on_call
    timeout: 5m
  - level: 2
    notify: secondary_on_call
    timeout: 5m
  - level: 3
    notify: tech_lead
    timeout: 5m
  - level: 4
    notify: cto

Escalation is tiered: each page that goes unacknowledged cascades to the next level.
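To see how the timeouts add up, the policy above can be modelled in a few lines. This is an illustrative sketch with a hypothetical function name, not how any paging service implements escalation.

```python
from datetime import timedelta

# (who to notify, how long to wait before moving to the next level)
ESCALATION = [
    ("primary_on_call", timedelta(minutes=5)),
    ("secondary_on_call", timedelta(minutes=5)),
    ("tech_lead", timedelta(minutes=5)),
    ("cto", None),  # last level: nowhere left to escalate
]


def notified_so_far(unacknowledged_for: timedelta) -> list[str]:
    """Return everyone paged once a page has gone unacknowledged this long."""
    notified = []
    level_start = timedelta(0)
    for target, timeout in ESCALATION:
        if unacknowledged_for < level_start:
            break  # this level has not been reached yet
        notified.append(target)
        if timeout is None:
            break
        level_start += timeout
    return notified
```

With three 5-minute timeouts, the CTO is paged 15 minutes after the first page if nobody acknowledges.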

Handoff

At the end of each rotation, write a handoff note:

# On-call handoff: 2026-05-13 → 2026-05-14

## Open incidents
- None.

## Recent fixes
- Restarted ciam-kratos at 04:30 after OOM. Should investigate memory leak.

## Watch list
- Disk at 76% (cleanup script will run tomorrow).
- Login rate elevated this morning (suspected botnet, IPs blocked).

## Notes for next on-call
- Don't restart Postgres without taking a backup first.
- New employee Carol joining team Monday, keep an eye on her access provisioning.
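If handoff notes are written every rotation, a small helper keeps the layout consistent. The sketch below renders the same structure as the example note; the function name and the choice to pass sections as a dict are assumptions.

```python
from datetime import date


def render_handoff(start: date, end: date,
                   sections: dict[str, list[str]]) -> str:
    """Render a handoff note; empty sections are written as "- None."."""
    parts = [f"# On-call handoff: {start.isoformat()} → {end.isoformat()}"]
    for title, items in sections.items():
        bullets = [f"- {item}" for item in items] or ["- None."]
        parts.append("\n".join([f"## {title}", *bullets]))
    return "\n\n".join(parts) + "\n"


note = render_handoff(
    date(2026, 5, 13),
    date(2026, 5, 14),
    {
        "Open incidents": [],
        "Watch list": ["Disk at 76% (cleanup script will run tomorrow)."],
    },
)
```

Writing "- None." for empty sections makes it explicit that the section was considered rather than forgotten.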