Health checks and uptime monitoring

External and internal health monitoring

A health check answers the question "Is the service reachable and responsive?" Different stakeholders need different checks.

Levels of health check

Liveness

"Is the process running?"

GET /healthz
→ 200 (process answers)

Used by the container orchestrator (Podman, K8s) to decide whether to restart the container. Cheap: it returns 200 unless the process is wedged.

Readiness

"Can the service handle requests right now?"

GET /healthz/ready
→ Checks DB connection, Kratos/Hydra reachable, etc.
→ 200 only if all OK

Used by the load balancer to decide whether to send traffic. If the service is not ready, the LB routes traffic away from it.

Deep health

"Is everything internally consistent?"

GET /healthz/deep
→ Runs a test query, calls upstream APIs, checks cache
→ 200 if all green, 503 with details otherwise

Slower than the other checks, so run it from monitoring, not from the LB.
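
The three levels differ mainly in which dependency checks they run and how much detail they report. A minimal sketch, assuming each dependency is wrapped in an async function that resolves to true when healthy. The names `Check`, `runOne`, and `runChecks`, and the 2-second default timeout, are illustrative and not part of Olympus:

```typescript
// Hypothetical types for this sketch: a check resolves to true when the dependency is healthy.
type Check = () => Promise<boolean>;
type Outcome = "ok" | "fail" | "timeout";

interface HealthResult {
  status: 200 | 503;
  details: Record<string, Outcome>;
}

// Run a single check, reporting "timeout" if it takes longer than timeoutMs
// and "fail" if it resolves to false or throws.
function runOne(check: Check, timeoutMs: number): Promise<Outcome> {
  return new Promise<Outcome>((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    Promise.resolve()
      .then(() => check())
      .then(
        (ok) => resolve(ok ? "ok" : "fail"),
        () => resolve("fail"),
      )
      .then(() => clearTimeout(timer));
  });
}

// Run all checks in parallel: 200 only if every check is "ok", otherwise 503 with details.
async function runChecks(
  checks: Record<string, Check>,
  timeoutMs = 2000,
): Promise<HealthResult> {
  const names = Object.keys(checks);
  const outcomes = await Promise.all(
    names.map((name) => runOne(checks[name], timeoutMs)),
  );
  const details: Record<string, Outcome> = {};
  names.forEach((name, i) => {
    details[name] = outcomes[i];
  });
  const healthy = outcomes.every((o) => o === "ok");
  return { status: healthy ? 200 : 503, details };
}
```

Under this sketch, liveness corresponds to `runChecks({})` (no dependencies, so always 200), readiness passes the DB and Kratos/Hydra checks, and deep health adds the test query, upstream API, and cache checks and returns `details` in the response body.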

What Olympus exposes

| Endpoint | Service | Returns |
|---|---|---|
| /health/alive | Kratos | 200 always |
| /health/ready | Kratos | 200 if DB up |
| /health/alive | Hydra | 200 always |
| /health/ready | Hydra | 200 if DB up |
| /healthz | Hera | 200 if process running |
| /healthz | Athena | 200 if process running |
| /healthz | Caddy | 200 if it can respond |

See Health endpoints for details.

External monitoring

Uptime monitor

Free or cheap hosted services (UptimeRobot, for example) hit your endpoint periodically. Configure each to ping a readiness endpoint such as /healthz/ready every 1-5 minutes, and alert on failure.
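
An external monitor boils down to a periodic probe plus a rule for when a run of failures becomes an alert. The sketch below shows only that rule. `DownTracker` and its threshold parameter are hypothetical names for illustration, not part of any monitoring service's API:

```typescript
// Tracks probe results for one target and reports when it has been
// down for longer than a threshold.
class DownTracker {
  private readonly thresholdMs: number;
  private downSince: number | null = null;
  private alerted = false;

  constructor(thresholdMs: number) {
    this.thresholdMs = thresholdMs;
  }

  // Record one probe result taken at nowMs (epoch milliseconds).
  // Returns true exactly once per outage, when the outage first
  // exceeds the threshold. A successful probe resets the state.
  record(ok: boolean, nowMs: number): boolean {
    if (ok) {
      this.downSince = null;
      this.alerted = false;
      return false;
    }
    if (this.downSince === null) {
      this.downSince = nowMs;
    }
    if (!this.alerted && nowMs - this.downSince > this.thresholdMs) {
      this.alerted = true;
      return true;
    }
    return false;
  }
}
```

Alerting on duration rather than on the first failed probe avoids paging for a single dropped request, at the cost of a slightly later alert.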

What to monitor

| Monitor | Endpoint | Threshold |
|---|---|---|
| Hera up | https://ciam.your-domain.com/healthz | Down > 1 min |
| Hydra up | https://ciam.your-domain.com/health/ready | Down > 1 min |
| Login works | Synthetic test (real login) | Failure > 1 sample |
| Cert validity | openssl s_client check | < 14 days |
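
The certificate threshold is a date comparison once the expiry date is known. A minimal sketch, assuming the expiry comes from the `notAfter=` line that `openssl x509 -noout -enddate` prints for the served certificate. The function names and the 14-day default are illustrative:

```typescript
const MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"];
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Parse a line such as "notAfter=Jun  1 12:00:00 2025 GMT" into a UTC Date.
function parseNotAfter(line: string): Date {
  const m = /notAfter=(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s+(\d{4})\s+GMT/.exec(line);
  if (!m) {
    throw new Error(`unrecognized notAfter line: ${line}`);
  }
  const month = MONTHS.indexOf(m[1]);
  if (month < 0) {
    throw new Error(`unknown month: ${m[1]}`);
  }
  return new Date(Date.UTC(
    Number(m[6]), month, Number(m[2]),
    Number(m[3]), Number(m[4]), Number(m[5]),
  ));
}

// Days (possibly fractional, negative if already expired) until the certificate expires.
function daysUntilExpiry(notAfter: Date, now: Date): number {
  return (notAfter.getTime() - now.getTime()) / MS_PER_DAY;
}

// True when the certificate expires in fewer than thresholdDays days.
function certNeedsRenewal(line: string, now: Date, thresholdDays = 14): boolean {
  return daysUntilExpiry(parseNotAfter(line), now) < thresholdDays;
}
```

A cron job could feed the openssl output into `certNeedsRenewal` and raise an alert when it returns true.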

Synthetic tests

A pure HTTP up-check is useful but doesn't catch logical breakage. Synthetic tests simulate real user flows:

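// scripts/synth-login.ts (cron every 5 min)
import { chromium } from "playwright";

const password = process.env.SYNTH_PASSWORD;
if (!password) throw new Error("SYNTH_PASSWORD is not set");

const browser = await chromium.launch();
try {
  const page = await browser.newPage();
  await page.goto("https://ciam.your-domain.com/login");
  await page.fill('input[name="identifier"]', "synth@your-domain.com");
  await page.fill('input[name="password"]', password);
  await page.click('button[type="submit"]');
  // Throws on timeout, so a failed login makes the script exit non-zero
  await page.waitForURL(/\/dashboard/);
  console.log("login OK");
} finally {
  // Always release the browser, even when a step above fails
  await browser.close();
}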

Alert if it fails. Catches: cert expiry, DB issues, broken UI, code regressions.

Internal monitoring

For each service, expose /metrics (Prometheus):

http_requests_total{path="/login",status="200"} 1234
http_request_duration_seconds_bucket{...}

Scrape with Prometheus. Visualize with Grafana.
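
The sample lines above follow the Prometheus text exposition format. In practice a client library produces this output; the hand-rolled sketch below only illustrates what a counter looks like behind /metrics. `CounterRegistry` and its helpers are hypothetical names, not a real library's API:

```typescript
type Labels = Record<string, string>;

// Escape backslash, double quote, and newline, as the text format requires for label values.
function escapeLabelValue(value: string): string {
  return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Build a series key such as http_requests_total{path="/login",status="200"}.
// Labels are sorted so the same label set always maps to the same series.
function seriesKey(name: string, labels: Labels): string {
  const parts = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${escapeLabelValue(labels[k])}"`);
  return parts.length > 0 ? `${name}{${parts.join(",")}}` : name;
}

class CounterRegistry {
  private readonly counts = new Map<string, number>();

  // Increment a counter series, creating it at zero if it does not exist yet.
  inc(name: string, labels: Labels = {}, by = 1): void {
    const series = seriesKey(name, labels);
    this.counts.set(series, (this.counts.get(series) ?? 0) + by);
  }

  // Render every series as one "<series> <value>" line, in insertion order.
  render(): string {
    let out = "";
    this.counts.forEach((value, series) => {
      out += `${series} ${value}\n`;
    });
    return out;
  }
}
```

A request handler would call `inc("http_requests_total", { path, status })` after each response, and the /metrics route would return `render()` as plain text. The sketch omits `# HELP` and `# TYPE` lines and histograms, which a client library would emit.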

Alerts:

# alerts.yml
groups:
  - name: olympus
    rules:
      - alert: HighErrorRate
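        # Ratio of 5xx requests to all requests, so 0.05 means 5%
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05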
        for: 5m
        annotations:
          summary: "5xx rate above 5%"
      
      - alert: KratosDown
        expr: up{job="kratos"} == 0
        for: 2m
        annotations:
          summary: "Kratos has been down for 2+ minutes"

Status page

Public-facing: status.your-domain.com.

Tools:

Host the status page OUTSIDE your main infrastructure: if your infra is down, the status page should still work.

Incident notification

Tie monitor failures to notifications:

UptimeRobot detects down
→ PagerDuty / OpsGenie / Pushover
→ On-call gets paged

For non-critical:

  • Slack notification.
  • Email.

For critical (auth completely down):

  • Phone call, SMS.

Test your monitoring

Regularly:

  • Kill a service for 30 seconds. Did alerts fire?
  • Trigger a synthetic failure. Was it caught?

If alerts didn't fire, fix the monitoring. Untested monitoring is no monitoring.

SLOs

Set Service Level Objectives:

| SLO | Target |
|---|---|
| Login success rate | > 99.5% over 30 days |
| Login latency p99 | < 300ms |
| Uptime (overall) | > 99.9% per month |

Measure against these. Discuss with stakeholders when missed.

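Track SLO burn rates over two windows: fast burn (1h) and slow burn (1d). The window distinguishes alert urgency: a fast burn needs immediate attention, while a slow burn can be handled during working hours.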
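
Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the error budget exactly over the SLO period. A minimal sketch of the arithmetic using the targets from the table above. The function names are illustrative and not Olympus settings:

```typescript
// Fraction of requests (or time) allowed to fail under the SLO, e.g. 0.995 -> 0.005.
function errorBudget(sloTarget: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  return 1 - sloTarget;
}

// Burn rate over a window: observed failure ratio divided by the allowed failure ratio.
function burnRate(sloTarget: number, failed: number, total: number): number {
  if (total <= 0) {
    return 0; // no traffic in the window, so nothing was burned
  }
  return failed / total / errorBudget(sloTarget);
}

// Allowed downtime in minutes for an availability SLO over a period of days.
function allowedDowntimeMinutes(sloTarget: number, periodDays: number): number {
  return errorBudget(sloTarget) * periodDays * 24 * 60;
}
```

For the 99.9% uptime target, `allowedDowntimeMinutes(0.999, 30)` comes to about 43.2 minutes per 30 days. Against the 99.5% login success target, a 1h window with 2 failed logins out of 100 gives `burnRate(0.995, 2, 100)` of about 4, meaning the budget is being spent four times faster than the SLO allows.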
