Implementing /healthz
A proper health endpoint for your services
Olympus services expose /health. Your own apps need something similar, and getting it right matters for operations.
Two endpoints
/healthz/live (liveness)
"Is the process running and able to respond?"
app.get("/healthz/live", (req, res) => {
  res.status(200).send("OK");
});
Always returns 200 if the process is alive. Used by:
- Container orchestrator (Kubernetes, Podman) to decide whether to restart.
- Load balancer for basic availability.
Should NEVER fail: if the process is alive enough to handle this request, return 200.
/healthz/ready (readiness)
"Is the process able to serve traffic right now?"
app.get("/healthz/ready", async (req, res) => {
  try {
    // Each dependency the app needs to serve traffic gets one cheap probe
    await db.queryOne("SELECT 1");
    await olympusClient.ping();
    res.status(200).send("OK");
  } catch (err) {
    res.status(503).json({ error: err.message });
  }
});
Returns 503 if dependencies are unhealthy. Used by:
- Load balancer to route traffic elsewhere.
- Deployment system, to wait until the instance is ready before announcing it.
Keep it quick: the endpoint should respond in under 1 second.
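One way to keep readiness inside a time budget is to race each dependency check against a timer. The sketch below is illustrative only: `withTimeout` and `isReady` are hypothetical helpers, and `checks` is assumed to be an object mapping a name to an async function that rejects on failure.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = "check") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical readiness helper: every check must pass within the budget.
async function isReady(checks, budgetMs = 1000) {
  try {
    await Promise.all(
      Object.entries(checks).map(([name, fn]) => withTimeout(fn(), budgetMs, name))
    );
    return { ok: true };
  } catch (err) {
    return { ok: false, error: err.message };
  }
}
```

Note that the timer only stops waiting; it does not cancel the underlying query, so the dependency client should have its own timeout as well.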
What to check
| Check | Liveness | Readiness | Deep |
|---|---|---|---|
| Process alive | ✓ | ✓ | ✓ |
| DB reachable | | ✓ | ✓ |
| External APIs | | (maybe) | ✓ |
| Cache | | | ✓ |
| Disk space | | | ✓ |
Deep health: a full system check, kept off the hot path.
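The split between the three endpoints can be written down as data rather than scattered across handlers. This is a sketch under assumptions: `TIERS`, `runTier`, and the `registry` object are hypothetical, and tier membership follows the handlers in this guide (liveness runs nothing, readiness runs the DB and Olympus checks, deep runs everything).

```javascript
// Which named checks belong to which endpoint (assumed layout).
const TIERS = {
  live: [],
  ready: ["db", "olympus"],
  deep: ["db", "olympus", "redis", "disk", "memory"],
};

// `registry` maps a check name to an async function that throws on failure.
async function runTier(tier, registry) {
  const names = TIERS[tier];
  // The async wrapper turns synchronous throws into rejections too.
  const settled = await Promise.allSettled(names.map(async (name) => registry[name]()));
  const checks = settled.map((s, i) => ({
    name: names[i],
    status: s.status === "fulfilled" ? "ok" : "fail",
  }));
  const httpStatus = checks.every((c) => c.status === "ok") ? 200 : 503;
  return { httpStatus, checks };
}
```

With this shape, a failing cache check turns /healthz/deep into a 503 while /healthz/ready keeps returning 200.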
Liveness pitfalls
If liveness checks the DB and the DB is down:
- Liveness fails → orchestrator kills the process.
- The process restarts, the DB is still down, and the loop repeats.
Don't check dependencies in liveness. Only answer "is the process alive?"
// BAD
app.get("/healthz/live", async (req, res) => {
  await db.query("SELECT 1"); // ← restarts on DB issue
  res.send("OK");
});
// GOOD
app.get("/healthz/live", (req, res) => res.send("OK"));
Readiness can fail temporarily
Readiness CAN fail when the DB is down. The load balancer routes traffic elsewhere, and the process is not restarted.
This is correct: the process is fine, the system isn't.
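The difference between the two probes comes down to what the platform does with each answer. The function below is a deliberately simplified model, not how any particular orchestrator is implemented (real ones apply failure thresholds and periods); `decide` and its return values are hypothetical names.

```javascript
// Simplified model of the platform's reaction to the two probe results.
function decide({ live, ready }) {
  if (!live) return "restart";        // process is wedged: a restart can help
  if (!ready) return "stop-routing";  // process is fine, a dependency is not: wait it out
  return "serve";
}

// DB outage with a correct liveness probe: no restart loop.
decide({ live: true, ready: false }); // "stop-routing"
```

If liveness also depended on the DB, the same outage would yield `live: false` and the model would answer "restart", which is the loop described above.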
Deep health
For more detail:
app.get("/healthz/deep", async (req, res) => {
  // Run every check in parallel; allSettled reports each outcome separately
  const checks = await Promise.allSettled([
    checkDb(),
    checkRedis(),
    checkOlympus(),
    checkDisk(),
    checkMemory(),
  ]);
  const results = checks.map((c, i) => ({
    name: ["db", "redis", "olympus", "disk", "memory"][i],
    status: c.status === "fulfilled" ? "ok" : "fail",
    detail: c.status === "fulfilled" ? c.value : c.reason.message,
  }));
  const overall = results.every(r => r.status === "ok") ? 200 : 503;
  res.status(overall).json({ checks: results });
});
Output:
{
  "checks": [
    { "name": "db", "status": "ok" },
    { "name": "redis", "status": "fail", "detail": "connection refused" },
    { "name": "olympus", "status": "ok" },
    { "name": "disk", "status": "ok", "detail": "12% used" },
    { "name": "memory", "status": "ok", "detail": "234 MB / 1024 MB" }
  ]
}
Use it to diagnose which dependency is broken.
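The deep handler calls `checkMemory()` without showing it. A minimal sketch of what such a check might look like follows, assuming a 1024 MB limit and a 90% failure threshold; both numbers are illustrative and not taken from this guide. It resolves with a detail string in the same format as the sample output and rejects when usage is too high.

```javascript
// Assumed limit; in a container you would likely derive it from the memory limit.
const MEMORY_LIMIT_MB = 1024;

async function checkMemory(limitMb = MEMORY_LIMIT_MB, maxRatio = 0.9) {
  // Resident set size of this process, in whole megabytes.
  const usedMb = Math.round(process.memoryUsage().rss / (1024 * 1024));
  if (usedMb > limitMb * maxRatio) {
    const pct = Math.round(maxRatio * 100);
    throw new Error(`${usedMb} MB / ${limitMb} MB exceeds ${pct}% threshold`);
  }
  return `${usedMb} MB / ${limitMb} MB`;
}
```

Because it throws on failure and returns a string on success, it slots into the `Promise.allSettled` pattern used by the handler.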
Don't expose to public
Health endpoints shouldn't be public:
- They reveal infrastructure details.
- They can be probed for fingerprinting.
@health path /healthz*
@internal_only remote_ip 10.0.0.0/8
handle @health {
    handle @internal_only {
        reverse_proxy app:3000
    }
    respond 404
}
Only the internal network can reach these endpoints.
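The same allowlist can be repeated inside the app as a second layer. This is a sketch only: `isInternal` and `internalOnly` are hypothetical names, the check covers just the 10.0.0.0/8 range used in the proxy rule, and it assumes the socket address is the real client address. Behind a reverse proxy the socket address is the proxy's, so the proxy rule stays the primary control.

```javascript
// True if `ip` is an IPv4 address (optionally IPv4-mapped IPv6) inside 10.0.0.0/8.
function isInternal(ip) {
  const v4 = ip.startsWith("::ffff:") ? ip.slice(7) : ip;
  const parts = v4.split(".");
  if (parts.length !== 4) return false;
  if (!parts.every((p) => /^\d{1,3}$/.test(p))) return false;
  const octets = parts.map(Number);
  if (octets.some((o) => o > 255)) return false;
  return octets[0] === 10;
}

// Express-style middleware: answer 404 so the path does not reveal itself.
function internalOnly(req, res, next) {
  if (isInternal(req.socket.remoteAddress || "")) return next();
  res.status(404).end();
}

// Usage (assuming an Express `app`):
// app.get("/healthz/ready", internalOnly, readyHandler);
```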
For public uptime checks (status page): a separate /status endpoint that exposes less:
app.get("/status", (req, res) => {
  res.status(200).json({ status: "operational" });
});
This doesn't leak details.
Response time
Health checks should be fast: aim for under 100 ms.
If /healthz/ready is slow:
- Use cached results.
- Don't re-check dependencies on every hit.
let lastCheck = { time: 0, ok: false };
app.get("/healthz/ready", async (req, res) => {
  // Serve the cached verdict while it is younger than 5 seconds
  if (Date.now() - lastCheck.time < 5000) {
    return res.status(lastCheck.ok ? 200 : 503).send();
  }
  const ok = await fullCheck();
  lastCheck = { time: Date.now(), ok };
  res.status(ok ? 200 : 503).send();
});
Results are cached for 5 seconds, so the endpoint tolerates heavy probe traffic.
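One gap in a simple time-based cache is that several probes arriving just after expiry each trigger their own full check. A possible refinement, sketched here with a hypothetical `cached` wrapper, shares a single in-flight run between concurrent callers. The injectable `now` parameter exists only to make the behaviour easy to test.

```javascript
// Wrap an async boolean check so results are cached for `ttlMs`
// and concurrent callers share one in-flight run.
function cached(check, ttlMs = 5000, now = Date.now) {
  let last = { time: -Infinity, ok: false };
  let inFlight = null;
  return function get() {
    if (now() - last.time < ttlMs) return Promise.resolve(last.ok);
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(check)
        .catch(() => false) // a throwing check counts as not ready
        .then((ok) => {
          last = { time: now(), ok };
          inFlight = null;
          return ok;
        });
    }
    return inFlight;
  };
}

// Usage (assuming the fullCheck() from the handler above):
// const ready = cached(fullCheck);
// app.get("/healthz/ready", async (req, res) => res.status((await ready()) ? 200 : 503).send());
```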
Startup probe
Some orchestrators also have a startup probe, which allows more time than liveness:
# Kubernetes
startupProbe:
  httpGet: { path: /healthz/live, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5
This is for services that take minutes to warm up (cache primer, large DB). The settings above allow roughly 30 × 5 = 150 seconds before the container is considered failed.
Once startup succeeds, liveness takes over.
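If warm-up happens after the HTTP server is already listening, a liveness-style endpoint answers 200 before the cache is primed. One option, offered here as an assumption rather than something this guide prescribes, is a small gate that a readiness or startup handler can consult. `createStartupGate` and `warmUp` are hypothetical names.

```javascript
// Hypothetical warm-up gate: reports 503 until warm-up has finished successfully.
function createStartupGate(warmUp) {
  let state = "starting"; // "starting" | "ready" | "failed"
  const done = Promise.resolve()
    .then(warmUp)
    .then(
      () => { state = "ready"; },
      () => { state = "failed"; }
    );
  return {
    done, // resolves once warm-up has settled either way
    status: () => (state === "ready" ? 200 : 503),
  };
}

// Usage (assuming an Express `app` and an async primeCache()):
// const gate = createStartupGate(primeCache);
// app.get("/healthz/started", (req, res) => res.status(gate.status()).send());
```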
Custom checks
Beyond the DB:
async function checkCustom() {
  // E.g., verify the featureFlag service is reachable
  // Or: verify a specific data invariant
  // Await the query first, then read the aliased column from the row
  const row = await db.queryOne("SELECT COUNT(*) AS count FROM identities");
  const usersCount = Number(row.count);
  if (usersCount < 1) {
    return { ok: false, reason: "no_users_imported" };
  }
  return { ok: true };
}
Use this for app-specific health.
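`checkCustom()` reports failure by returning `{ ok: false, reason }`, while the deep handler's `Promise.allSettled` pattern only sees a failure when a check rejects. A small adapter bridges the two styles. `asThrowingCheck` is a hypothetical helper, shown as a sketch.

```javascript
// Turn a function that resolves to { ok, reason } into one that rejects on failure,
// so it can sit alongside checks that signal failure by throwing.
function asThrowingCheck(fn) {
  return async function check() {
    const result = await fn();
    if (!result.ok) throw new Error(result.reason || "check failed");
  };
}

// Usage (assuming the checkCustom() defined above):
// const checks = await Promise.allSettled([checkDb(), asThrowingCheck(checkCustom)()]);
```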
Outbound network
If your app depends on an external API (Stripe, etc.):
async function checkStripe() {
  // fetch() has no `timeout` option; abort through a signal instead
  const res = await fetch("https://api.stripe.com/v1/healthcheck", {
    signal: AbortSignal.timeout(1000),
  });
  return res.ok;
}
Be careful: external dependencies make YOUR health check flaky. If Stripe is slow, your readiness fails and your app appears down.
Better: track external dependencies in /healthz/deep, not /healthz/ready.
Versioning
Include the version in the deep response:
res.json({
  version: process.env.GIT_SHA,
  build_time: process.env.BUILD_TIME,
  checks: [...],
});
This helps debug: "is this old version still running?"
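`JSON.stringify` drops properties whose value is `undefined`, so an unset `GIT_SHA` silently removes the field from the response. A small helper with an explicit fallback makes the gap visible. `buildInfo` and the "unknown" placeholder are assumptions for illustration.

```javascript
// Read build metadata, falling back to "unknown" when a variable is unset.
function buildInfo(env = process.env) {
  return {
    version: env.GIT_SHA || "unknown",
    build_time: env.BUILD_TIME || "unknown",
  };
}

// Usage inside the deep handler:
// res.json({ ...buildInfo(), checks: results });
```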
Common pitfalls
Same endpoint for liveness and readiness
This conflates "should restart" with "should route traffic." These are different decisions.
Health checks doing too much
Slow checks cascade into probe timeouts. Keep them fast.
Health checks behind authn
app.get("/healthz", requireAuth, (req, res) => ...);
Bad: the load balancer can't authenticate. Use an IP allowlist instead.
Returning 200 for "almost ready"
if (allChecks.some(c => c.ok)) return 200; // some passed → OK?
No, ALL should pass. Otherwise return 503.
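The fix is `every` instead of `some`, with one edge case worth knowing: `[].every(...)` is `true`, so an empty check list would count as ready. The sketch below treats an empty list as not ready, which is an assumption on our part; `readinessStatus` is a hypothetical helper.

```javascript
// 200 only when there is at least one check and every check passed.
function readinessStatus(checks) {
  const allOk = checks.length > 0 && checks.every((c) => c.ok);
  return allOk ? 200 : 503;
}

readinessStatus([{ ok: true }, { ok: false }]); // 503: one failure is enough
```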