Health endpoints
Every /health endpoint across the Olympus stack
Every service in Olympus exposes a health endpoint. This page lists each one, what it actually checks, and how to interpret a failure.
Inventory
| Service | Path | Container port | Public? |
|---|---|---|---|
| Caddy | /healthz | 80, 443 | yes (via ingress) |
| Hera CIAM | /health | 3000 | no (intranet only) |
| Hera IAM | /health | 4000 | no |
| Athena CIAM | /api/health | 3001 | no |
| Athena IAM | /api/health | 4001 | no |
| Site | /health | 2000 | no |
| Kratos CIAM | /health/alive, /health/ready | 3100 (public), 3101 (admin) | partial |
| Kratos IAM | /health/alive, /health/ready | 4100, 4101 | partial |
| Hydra CIAM | /health/alive, /health/ready | 3102 (public), 3103 (admin) | partial |
| Hydra IAM | /health/alive, /health/ready | 4102, 4103 | partial |
| Postgres | pg_isready | 5432 | no |
| pgAdmin | /misc/ping | - | no |
What each one actually checks
Caddy /healthz
Caddy responds 200 OK to /healthz if the proxy process is up. It does not verify that any backend is reachable. A 200 from Caddy means "Caddy is alive"; it does not mean Hera or Kratos is alive.
Hera /health
Hera responds 200 OK if the Next.js server is up. It does not check Kratos connectivity. A failed Hera health check means the Next.js process is dead and the container should be restarted.
Athena /api/health
Athena's /api/health is the most thorough of the app-level health checks:
- 200 OK if the process is up.
- It validates SESSION_SIGNING_KEY is set (the container would have refused to start otherwise, but the endpoint re-confirms).
- It validates ENCRYPTION_KEY is set.
- It does not check Postgres connectivity (separate endpoint planned in [athena#TBD]).
Even though it doesn't check the DB, a failing /api/health from Athena always means the container is broken.
Site /health
Site responds 200 OK if the Next.js server is up. Same shape as Hera.
Kratos /health/alive vs /health/ready
Kratos splits liveness from readiness:
- /health/alive: process is up and able to accept requests. Does not verify DB connectivity.
- /health/ready: DB is reachable, migrations are at the expected version, and the courier is operational.
For most monitoring, probe /health/ready on every Kratos instance. A non-200 means Kratos cannot do work even though the process is alive.
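The liveness/readiness split can be folded into a single state for a dashboard or alert rule. The following is a minimal sketch, not part of Olympus: the function name and the three state labels are hypothetical, and it assumes the caller has already collected the two HTTP status codes.

```python
from typing import Optional


def classify_ory_health(alive_status: Optional[int], ready_status: Optional[int]) -> str:
    """Combine the two probe results into one state.

    Each argument is the HTTP status code returned by the probe,
    or None if the request failed before any response arrived.
    """
    if alive_status != 200:
        # No healthy liveness response: the process is down or not accepting requests.
        return "down"
    if ready_status != 200:
        # Alive but not ready: a dependency (DB, migrations, courier) is failing.
        return "alive-not-ready"
    return "ready"


print(classify_ory_health(200, 200))
print(classify_ory_health(200, 503))
print(classify_ory_health(None, None))
```

Only the "alive-not-ready" state distinguishes a dependency problem from a dead process, which is why /health/ready is the more useful probe.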
Hydra /health/alive vs /health/ready
Same split as Kratos. Probe /health/ready for the meaningful signal.
Postgres
Use pg_isready -h <host> -p 5432 -U <user> from a container that has pg_isready installed (any Postgres client image works). An exit code of 0 means the server is accepting connections.
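pg_isready distinguishes several failure modes through its exit code, not only 0 versus non-zero. As a rough sketch, a wrapper could translate them as below. The helper names are hypothetical, and check_postgres assumes pg_isready is on the PATH of whatever runs the script.

```python
import subprocess

# Exit codes documented for pg_isready.
PG_ISREADY_MEANING = {
    0: "accepting connections",
    1: "rejecting connections (for example, still starting up)",
    2: "no response to the connection attempt",
    3: "no attempt made (for example, invalid parameters)",
}


def describe_pg_isready(exit_code: int) -> str:
    """Translate a pg_isready exit code into a readable description."""
    return PG_ISREADY_MEANING.get(exit_code, f"unexpected exit code {exit_code}")


def check_postgres(host: str, user: str, port: int = 5432, timeout_s: int = 5) -> str:
    """Run pg_isready and describe the result. Requires pg_isready on PATH."""
    result = subprocess.run(
        ["pg_isready", "-h", host, "-p", str(port), "-U", user, "-t", str(timeout_s)],
        capture_output=True,
        text=True,
    )
    return describe_pg_isready(result.returncode)


print(describe_pg_isready(0))
print(describe_pg_isready(2))
```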
pgAdmin
Internal endpoint; not generally useful for production monitoring because pgAdmin is a single-user admin tool, not a service users depend on.
Probe configuration
Compose healthcheck: blocks
Most services have a healthcheck: configured in compose.{dev,prod}.yml:
ciam-kratos:
  healthcheck:
    test: ["CMD", "wget", "-q", "-O-", "http://localhost:5001/health/ready"]
    interval: 30s
    timeout: 5s
    retries: 3
    start_period: 30s

If a service fails 3 consecutive checks, the container is marked unhealthy in podman ps. By default, Compose does not restart unhealthy containers (the container is still "running" from Podman's point of view), so your supervisor (systemd unit, kubelet, etc.) needs to supply the restart logic.
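To make the retries setting concrete, the sketch below models how a run of check results turns into a reported state. It is a deliberate simplification written for illustration: it ignores start_period and the initial "starting" state, and the function name is hypothetical.

```python
from typing import Iterable, List


def health_states(results: Iterable[bool], retries: int = 3) -> List[str]:
    """Return the reported state after each check result.

    True is a passing check, False a failing one. The state flips to
    "unhealthy" only after `retries` consecutive failures, and a single
    passing check resets the counter.
    """
    states = []
    consecutive_failures = 0
    for passed in results:
        if passed:
            consecutive_failures = 0
        else:
            consecutive_failures += 1
        states.append("unhealthy" if consecutive_failures >= retries else "healthy")
    return states


# Two failures, a recovery, then three failures in a row.
print(health_states([True, False, False, True, False, False, False]))
```

With interval: 30s and retries: 3, a hard failure therefore takes roughly a minute and a half to surface as unhealthy, which is worth factoring into alert thresholds.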
External monitoring
For external uptime monitoring, the simplest probes are:
GET https://ciam.<domain>/.well-known/openid-configuration
GET https://iam.<domain>/.well-known/openid-configuration

A 200 means Caddy is up, the upstream Hydra is up, and the DB is reachable. This single probe covers most of the stack.
For a second tier of probes:
GET https://ciam.<domain>/.ory/kratos/sessions/whoami
# Returns 401 if Kratos is up but there's no session; that's a healthy 401

Treat the 401 as healthy here (Kratos returns 200 only if there's a valid session cookie).
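Both probe tiers can be scripted with the standard library alone. The sketch below is one possible shape, not a shipped Olympus tool: fetch_status, is_healthy, and the probe labels "discovery" and "whoami" are hypothetical names. Note that urllib raises on 4xx/5xx responses, so the 401 has to be recovered from the exception.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(url: str, timeout_s: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for a GET, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as exceptions but still carry a status code.
        return err.code
    except OSError:
        # Covers URLError (DNS failure, refused connection) and socket timeouts.
        return None


def is_healthy(probe: str, status: Optional[int]) -> bool:
    """Apply the per-probe rule: discovery needs 200, whoami accepts 200 or 401."""
    if probe == "discovery":
        return status == 200
    if probe == "whoami":
        return status in (200, 401)
    raise ValueError(f"unknown probe: {probe}")


print(is_healthy("discovery", 200))
print(is_healthy("whoami", 401))
print(is_healthy("whoami", 502))
```

A caller would pass the result of fetch_status for each URL above into is_healthy with the matching probe label.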
Interpreting failures
| Symptom | Likely cause |
|---|---|
| Caddy 502 | Backend container is down or unhealthy. Caddy is healthy itself. |
| Caddy 504 | Backend is slow (timeout). Either backend is overloaded or there's a DB issue. |
| Caddy 200 on /healthz but /.well-known/openid-configuration 502 | Hydra is down. |
| Kratos /health/ready 503 with body database not available | Postgres is down or unreachable. |
| Kratos /health/ready 503 with body database migrations missing | Kratos was deployed but the migration job didn't run. |
| Athena /api/health 200 but every route 500 | Likely SESSION_SIGNING_KEY rotation went wrong. See incident response. |
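Part of the table can be encoded as a lookup for use in alert annotations. This is an illustrative sketch only: the function name and the "caddy" / "kratos-ready" source labels are hypothetical, and the body substrings are taken from the rows above rather than verified against Kratos output.

```python
def likely_cause(source: str, status: int, body: str = "") -> str:
    """Map an observed symptom to the likely cause listed in the table.

    `source` is "caddy" for a response from a proxied route, or
    "kratos-ready" for a direct /health/ready probe.
    """
    if source == "caddy" and status == 502:
        return "backend container is down or unhealthy"
    if source == "caddy" and status == 504:
        return "backend is slow: overloaded, or there is a DB issue"
    if source == "kratos-ready" and status == 503:
        if "database not available" in body:
            return "Postgres is down or unreachable"
        if "migrations missing" in body:
            return "the migration job did not run"
    return "not covered by the table; check the logs"


print(likely_cause("caddy", 502))
print(likely_cause("kratos-ready", 503, "database migrations missing"))
```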
Where next
- Operate, Incident Response: full on-call playbook.
- Operate, Network Topology: what's host-bound vs internal.
- Operate, Logs and Observability: how to read logs for failure modes.