Health checks and uptime monitoring
External and internal health monitoring
A health check answers one question: "is the service reachable and responsive?" Different stakeholders need different checks.
Levels of health check
Liveness
"Is the process running?"
GET /healthz
→ 200 (process answers)
Used by the container orchestrator (Podman, K8s) to decide whether to restart the process. Cheap; always 200 unless the process is wedged.
Readiness
"Can the service handle requests right now?"
GET /healthz/ready
→ Checks DB connection, Kratos/Hydra reachable, etc.
→ 200 only if all OK
Used by the load balancer to decide whether to send traffic. If the service is unready, the LB routes traffic away.
Deep health
"Is everything internally consistent?"
GET /healthz/deep
→ Runs a test query, calls upstream APIs, checks cache
→ 200 if all green, 503 with details otherwise
Slower, so run it from monitoring, not from the LB.
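The three levels differ mainly in which checks they run and how a failure is reported. A minimal TypeScript sketch of that idea follows; the `CheckResult` shape and the `toHealthResponse` helper are hypothetical names chosen for illustration, not part of Olympus:

```typescript
// Hypothetical types for illustration; not an Olympus API.
type CheckResult = { name: string; ok: boolean; detail?: string };
type HealthResponse = {
  status: 200 | 503;
  body: { ok: boolean; failed: CheckResult[] };
};

// 200 only if every check passed; otherwise 503 with the failing checks as details.
function toHealthResponse(results: CheckResult[]): HealthResponse {
  const failed = results.filter((r) => !r.ok);
  const ok = failed.length === 0;
  return { status: ok ? 200 : 503, body: { ok, failed } };
}

// Liveness runs no dependency checks, so an answering process always yields 200.
const liveness = toHealthResponse([]);

// Readiness with one failing dependency yields 503 and names the failure.
const readiness = toHealthResponse([
  { name: "db", ok: true },
  { name: "kratos", ok: false, detail: "connection refused" },
]);

console.log(liveness.status, readiness.status, readiness.body.failed);
```

A deep check would pass a longer list of results (test query, upstream APIs, cache) through the same function, which is why it is slower and better suited to monitoring than to the load balancer.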
What Olympus exposes
| Endpoint | Service | Returns |
|---|---|---|
| /health/alive | Kratos | 200 always |
| /health/ready | Kratos | 200 if DB up |
| /health/alive | Hydra | 200 always |
| /health/ready | Hydra | 200 if DB up |
| /healthz | Hera | 200 if process running |
| /healthz | Athena | 200 if process running |
| /healthz | Caddy | 200 if it can respond |
See Health endpoints for details.
External monitoring
Uptime monitor
Free / cheap services that hit your endpoint periodically:
- UptimeRobot: free for 50 monitors, 5-min interval.
- BetterUptime: nicer UI, free tier.
- Pingdom: mature, paid.
- Cronitor: cron-aware.
Each service pings the endpoint you configure (for example, /healthz/ready) every 1-5 minutes. Configure it to alert on failure.
What to monitor
| Monitor | Endpoint | Threshold |
|---|---|---|
| Hera up | https://ciam.your-domain.com/healthz | Down > 1 min |
| Hydra up | https://ciam.your-domain.com/health/ready | Down > 1 min |
| Login works | Synthetic test (real login) | Failure > 1 sample |
| Cert validity | openssl s_client check | < 14 days |
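The thresholds in the table can be read as small predicates over probe results. The sketch below assumes a hypothetical `Sample` record (timestamp plus success flag) sorted oldest first; the function names are illustrative and do not correspond to any monitoring product's API:

```typescript
// Hypothetical probe sample: when it ran and whether it succeeded.
type Sample = { at: Date; ok: boolean };

// Returns the most recent unbroken run of failures, oldest first.
function trailingFailures(samples: Sample[]): Sample[] {
  const run: Sample[] = [];
  for (const s of [...samples].reverse()) {
    if (s.ok) break;
    run.unshift(s);
  }
  return run;
}

// "Down > 1 min": the current failure run started more than thresholdMs ago.
function downLongerThan(samples: Sample[], thresholdMs: number, now: Date): boolean {
  const first = trailingFailures(samples)[0];
  return first !== undefined && now.getTime() - first.at.getTime() > thresholdMs;
}

// "Failure > 1 sample": more than maxSamples consecutive failures.
function failedMoreThan(samples: Sample[], maxSamples: number): boolean {
  return trailingFailures(samples).length > maxSamples;
}

// "Cert validity < 14 days": the certificate expires within the given number of days.
function certExpiresWithin(notAfter: Date, days: number, now: Date): boolean {
  return notAfter.getTime() - now.getTime() < days * 24 * 60 * 60 * 1000;
}

const now = new Date("2025-01-01T00:05:00Z");
const samples: Sample[] = [
  { at: new Date("2025-01-01T00:00:00Z"), ok: true },
  { at: new Date("2025-01-01T00:03:00Z"), ok: false },
  { at: new Date("2025-01-01T00:04:00Z"), ok: false },
];

console.log(downLongerThan(samples, 60000, now)); // failing since 00:03, so true
console.log(failedMoreThan(samples, 1)); // two failures in a row, so true
console.log(certExpiresWithin(new Date("2025-01-10T00:00:00Z"), 14, now)); // 9 days left, so true
```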
Synthetic tests
A pure HTTP up-check is useful but doesn't catch logical breakage. Synthetic tests simulate real user flows:
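// scripts/synth-login.ts (cron every 5 min)
import { chromium } from "playwright";
const password = process.env.SYNTH_PASSWORD;
if (!password) throw new Error("SYNTH_PASSWORD is not set");
const browser = await chromium.launch();
try {
  const page = await browser.newPage();
  await page.goto("https://ciam.your-domain.com/login");
  await page.fill('input[name="identifier"]', "synth@your-domain.com");
  await page.fill('input[name="password"]', password);
  await page.click('button[type="submit"]');
  // Throws if the dashboard is not reached before the timeout, so the script exits non-zero.
  await page.waitForURL(/\/dashboard/);
  console.log("login OK");
} finally {
  // Always release the browser, even when a step above fails.
  await browser.close();
}
Alert if the script fails. It catches cert expiry, DB issues, broken UI, and code regressions.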
Internal monitoring
For each service, expose /metrics (Prometheus):
http_requests_total{path="/login",status="200"} 1234
http_request_duration_seconds_bucket{...}
Scrape with Prometheus. Visualize with Grafana.
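To show what a /metrics response body looks like on the wire, here is a minimal hand-rolled counter that renders the Prometheus text format. It is a sketch for illustration only: the `Counter` class is hypothetical, and a real service would more likely rely on a client library such as prom-client than format the output itself.

```typescript
// Minimal in-memory counter that renders Prometheus text exposition format.
// Hypothetical helper for illustration; not a real library API.
class Counter {
  private readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Increments the series identified by the given label set.
  inc(labels: Record<string, string>, by = 1): void {
    const key = Object.entries(labels)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${k}="${Counter.escape(v)}"`)
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + by);
  }

  // Label values must escape backslashes, double quotes, and newlines.
  private static escape(value: string): string {
    return value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
  }

  // One "# TYPE" line, then one line per label set.
  render(): string {
    const lines = Array.from(this.values.entries()).map(
      ([labels, value]) => `${this.name}{${labels}} ${value}`,
    );
    return [`# TYPE ${this.name} counter`, ...lines].join("\n") + "\n";
  }
}

const requests = new Counter("http_requests_total");
requests.inc({ path: "/login", status: "200" });
requests.inc({ path: "/login", status: "200" });
requests.inc({ path: "/login", status: "500" });
console.log(requests.render());
```

Serving the string returned by `render()` from a /metrics route is enough for Prometheus to scrape it.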
Alerts:
# alerts.yml
groups:
  - name: olympus
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses, so 0.05 really means 5%.
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        annotations:
          summary: "5xx rate above 5%"
      - alert: KratosDown
        expr: up{job="kratos"} == 0
        for: 2m
        annotations:
          summary: "Kratos has been down for 2+ minutes"
Status page
Public-facing: status.your-domain.com.
Tools:
- Statuspage (Atlassian): paid plans, with a free tier.
- Status.io: paid.
- cState: free, statically generated.
Host the status page OUTSIDE your main infrastructure: if your infra is down, the status page should still work.
Incident notification
Tie monitor failures to notifications:
UptimeRobot detects down
↓
PagerDuty / OpsGenie / Pushover
↓
On-call gets paged
For non-critical:
- Slack notification.
- Email.
For critical (auth completely down):
- Phone call, SMS.
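The routing above amounts to a lookup from severity to notification channels. A small sketch, with hypothetical `Severity` and `Channel` names that do not correspond to any PagerDuty or OpsGenie API:

```typescript
// Hypothetical names for illustration only.
type Severity = "non-critical" | "critical";
type Channel = "slack" | "email" | "phone" | "sms";

// Mirrors the lists above: quiet channels for minor issues, paging for outages.
const routes: Record<Severity, Channel[]> = {
  "non-critical": ["slack", "email"],
  critical: ["phone", "sms"],
};

// Treats "auth completely down" as critical and everything else as non-critical.
function classify(alert: { monitor: string; authDown: boolean }): Severity {
  return alert.authDown ? "critical" : "non-critical";
}

function channelsFor(alert: { monitor: string; authDown: boolean }): Channel[] {
  return routes[classify(alert)];
}

console.log(channelsFor({ monitor: "Login works", authDown: true }));
console.log(channelsFor({ monitor: "Cert validity", authDown: false }));
```

Keeping the mapping in one table makes it easy to review who gets woken up for what.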
Test your monitoring
Regularly:
- Kill a service for longer than your alert thresholds (1-2 minutes in the examples above). Did the alerts fire?
- Trigger a synthetic failure. Was it caught?
If alerts didn't fire, fix the monitoring. Untested monitoring is no monitoring.
SLOs
Set Service Level Objectives:
| SLO | Target |
|---|---|
| Login success rate | > 99.5% over 30 days |
| Login latency p99 | < 300ms |
| Uptime (overall) | > 99.9% per month |
Measure against these. Discuss with stakeholders when missed.
Track SLO burn rates over a fast window (1h) and a slow window (1d). A fast burn warrants an urgent alert; a slow burn can be handled with lower urgency.
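Burn rate is commonly defined as the observed error ratio divided by the error budget (1 minus the SLO target): a burn rate of 1 uses up the budget exactly over the SLO period. The sketch below works through that arithmetic. The thresholds of 14.4 (1h) and 3 (1d) are illustrative assumptions, not values from this document; on a 30-day period they correspond to spending roughly 2% and 10% of the budget within the window.

```typescript
// Fraction of requests allowed to fail under the SLO, e.g. 0.995 leaves about 0.005.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// Burn rate 1 spends the budget exactly over the SLO period; higher spends it faster.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / errorBudget(sloTarget);
}

// Share of the period's budget consumed if this burn rate holds for windowHours.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return (rate * windowHours) / periodHours;
}

// Allowed downtime in minutes for an uptime SLO over the given number of days.
function downtimeBudgetMinutes(sloTarget: number, periodDays: number): number {
  return errorBudget(sloTarget) * periodDays * 24 * 60;
}

type Urgency = "page" | "ticket" | "none";

// Threshold defaults are illustrative assumptions; tune them to your own SLOs.
function urgency(fastBurn1h: number, slowBurn1d: number, fastThreshold = 14.4, slowThreshold = 3): Urgency {
  if (fastBurn1h >= fastThreshold) return "page";
  if (slowBurn1d >= slowThreshold) return "ticket";
  return "none";
}

// Example: 10% of logins failed in the last hour against the 99.5% login SLO.
const fastBurn = burnRate(0.1, 0.995); // about 20
console.log(urgency(fastBurn, 1)); // "page"
console.log(downtimeBudgetMinutes(0.999, 30)); // about 43.2 minutes per 30 days
```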