Olympus Docs
CookbookOperations

On-call rotation for Olympus

Set up paging and handoff so auth is reliable

Olympus is critical infrastructure for your users. When it breaks, someone has to fix it. Set up on-call rotation.

What's on-call

The person responsible for responding to alerts that night / week. They:

  • Get paged when something breaks.
  • Diagnose, mitigate.
  • Escalate if needed.
  • Hand off at rotation end.

Tooling

PagerDuty / OpsGenie / Pushover

Paid services that:

  • Manage rotations.
  • Receive alerts.
  • Page the right person.
  • Track responses.

Free / cheap alternatives:

Slack + cron

For solo or small team:

  • Alerts to Slack channel.
  • Designated on-call sees them.
  • No automatic paging.

Less rigorous; works for low-volume.

Rotation

For a team:

on_call:
  rotation:
    - person: alice@your-corp.com
      days: Monday, Tuesday, Wednesday
    - person: bob@your-corp.com
      days: Thursday, Friday, Saturday, Sunday
  start: 09:00 UTC
  end: 09:00 UTC next day  # 24 hours

Or weekly:

Week 1: Alice
Week 2: Bob
Week 3: Carol

Each on-call is full 168 hours.

Solo

If you're solo (small project / hobby):

  • You're always on-call.
  • Set up alerts during awake hours only.
  • Accept that some incidents will be delayed.

That's fine for non-critical. Don't pretend you have 24/7 if you don't.

Alert categories

critical_immediate_page:
  - Auth completely down (all logins failing).
  - Data loss event.
  - Active security incident.
  
high_within_1h:
  - Login error rate > 5% but < 50%.
  - Latency p99 > 1s.
  - Backup failing.

medium_within_business_hours:
  - Disk > 80%.
  - Cert expires in < 14 days.
  - Slow query alerts.

low_review_weekly:
  - Anomalies in audit log.
  - Email bounce rate slightly high.

Don't page for non-critical.

Runbook readily available

When pager goes off, on-call shouldn't have to remember everything. Link directly:

PagerDuty alert: "Login error rate spike"
→ description includes runbook URL
→ on-call clicks → reads steps → acts

See Runbook format.

Escalation

If primary on-call doesn't respond in 5 min: escalate to secondary.

Don't expect heroics. People miss pages.

escalation:
  - level: 1
    notify: primary_on_call
    timeout: 5m
  - level: 2
    notify: secondary_on_call
    timeout: 5m
  - level: 3
    notify: tech_lead
    timeout: 5m
  - level: 4
    notify: cto

Tiered. Failures cascade.

Handoff

End of rotation:

# On-call handoff: 2026-05-13 → 2026-05-14

## Open incidents
- None.

## Recent fixes
- Restarted ciam-kratos at 04:30 after OOM. Should investigate memory leak.

## Watch list
- Disk at 76% (cleanup script will run tomorrow).
- Login rate elevated this morning (suspected botnet, IPs blocked).

## Notes for next on-call
- Don't restart Postgres without taking a backup first.
- New employee Carol joining team Monday, keep an eye on her access provisioning.

Markdown in shared doc. Next person picks up cleanly.

Hours

Pager during these hours = on-call:

TierHours
24/7 opsAlways
Business + nights09-22 weekdays, alerts only weekends
Business only09-18 weekdays, queue alerts otherwise

For solo / small: business hours is fine for non-emergencies.

Don't burn out

On-call is stressful. Mitigate:

  • Rotate equally (no person on-call > 25% of the time).
  • Time off after high-stress weeks.
  • Compensate (extra time off, bonus).

Burnout = mistakes = more incidents.

Practice

Quarterly tabletop:

  • Pretend incident.
  • On-call walks through response.
  • Identify gaps.

Without practice, your runbooks aren't actually usable.

What you measure

Track:

  • MTTR (mean time to resolve).
  • Number of pages per week.
  • False-positive rate.

Goals:

  • MTTR trending down.
  • Pages per week not increasing (or decreasing).
  • False-positives < 10%.

If pages increase: your system is getting fragile. Invest in reliability.

Tooling integration

PagerDuty integrates with:

  • Slack: alerts show in channel + ping individual.
  • GitHub: link incidents to PRs.
  • Datadog: alerts forwarded.
  • Custom webhooks: anything.

Set up these channels for visibility and context.

Postmortem culture

After every page:

  • Was it a real issue?
  • What was the cause?
  • How was it resolved?
  • How to prevent recurrence?

Document in postmortems. Track action items. Implement.

On this page