Implementing /healthz

Olympus services expose /health. Your own apps need something similar, and getting it right matters for operations.

Two endpoints

/healthz/live (liveness)

"Is the process running and able to respond?"

app.get("/healthz/live", (req, res) => {
  res.status(200).send("OK");
});

Always returns 200 if process is alive. Used by:

Container orchestrator (Kubernetes, Podman) to decide whether to restart.
Load balancer for basic availability.

Should NEVER fail: if the process is alive enough to handle this request, return 200.

/healthz/ready (readiness)

"Is the process able to serve traffic right now?"

app.get("/healthz/ready", async (req, res) => {
  try {
    await db.queryOne("SELECT 1");
    await olympusClient.ping();
    res.status(200).send("OK");
  } catch (err) {
    res.status(503).json({ error: err.message });
  }
});

Returns 503 if dependencies are unhealthy. Used by:

Load balancer to route traffic elsewhere.
Deployment system to wait for ready before announcing.

Keep it quick: respond in under 1 second.
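One way to stay inside that budget is to put a deadline on every dependency check and run the checks in parallel, so the total time is bounded by the deadline rather than by the sum of the checks. A minimal sketch, not a definitive implementation: withTimeout, isReady, and the 800 ms default are hypothetical names and values chosen for illustration.

```javascript
// Hypothetical helper: reject if a check takes longer than `ms`.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${name} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins. Always clear the timer so it cannot
  // keep the event loop busy after the check has finished.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run all readiness checks in parallel, each bounded by the same deadline.
// `checks` maps a name to a function that returns a promise.
async function isReady(checks, ms = 800) {
  const pending = Object.entries(checks).map(([name, fn]) =>
    // Promise.resolve().then(fn) also turns a synchronous throw into a rejection.
    withTimeout(Promise.resolve().then(fn), ms, name)
  );
  const settled = await Promise.allSettled(pending);
  return settled.every((s) => s.status === "fulfilled");
}
```

In the readiness handler this could be called as isReady({ db: () => db.queryOne("SELECT 1"), olympus: () => olympusClient.ping() }), answering 200 on true and 503 on false.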

What to check

Check          Liveness  Readiness  Deep
Process alive  ✓         ✓          ✓
DB reachable   -         ✓          ✓
External APIs  -         (maybe)    ✓
Cache          -         -          ✓
Disk space     -         -          ✓

Deep health: full system check. Not on hot path.

Liveness pitfalls

If liveness checks DB and DB is down:

Liveness fails → orchestrator kills process.
Restart. DB still down. Loop.

Don't check dependencies in liveness. Just "is process alive."

// BAD
app.get("/healthz/live", async (req, res) => {
  await db.query("SELECT 1");  // ← restarts on DB issue
  res.send("OK");
});

// GOOD
app.get("/healthz/live", (req, res) => res.send("OK"));

Readiness can fail temporarily

Readiness CAN fail when the DB is down. The LB routes traffic elsewhere. The orchestrator doesn't restart the process.

This is correct: the process is fine, the system isn't.

Deep health

For more detail:

app.get("/healthz/deep", async (req, res) => {
  const checks = await Promise.allSettled([
    checkDb(),
    checkRedis(),
    checkOlympus(),
    checkDisk(),
    checkMemory(),
  ]);
  
  const results = checks.map((c, i) => ({
    name: ["db", "redis", "olympus", "disk", "memory"][i],
    status: c.status === "fulfilled" ? "ok" : "fail",
    detail: c.status === "fulfilled" ? c.value : c.reason.message,
  }));
  
  const overall = results.every(r => r.status === "ok") ? 200 : 503;
  res.status(overall).json({ checks: results });
});

Output:

{
  "checks": [
    { "name": "db", "status": "ok" },
    { "name": "redis", "status": "fail", "detail": "connection refused" },
    { "name": "olympus", "status": "ok" },
    { "name": "disk", "status": "ok", "detail": "12% used" },
    { "name": "memory", "status": "ok", "detail": "234 MB / 1024 MB" }
  ]
}

Use this to diagnose which dependency is broken.
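The handler above pairs results with names by array index, which breaks silently if the two lists drift apart. A sketch of the same aggregation keyed by name, assuming each check is a function that returns a promise and throws on failure (runChecks is a hypothetical helper, not part of any library):

```javascript
// Run every named check and report each result next to its own name.
async function runChecks(checks) {
  const names = Object.keys(checks);
  const settled = await Promise.allSettled(
    // The wrapper turns a synchronous throw into a rejection as well.
    names.map((name) => Promise.resolve().then(() => checks[name]()))
  );
  const results = settled.map((c, i) => ({
    name: names[i],
    status: c.status === "fulfilled" ? "ok" : "fail",
    // Fall back to String(reason) when something other than an Error was thrown.
    detail:
      c.status === "fulfilled"
        ? c.value
        : String(c.reason?.message ?? c.reason),
  }));
  const httpStatus = results.every((r) => r.status === "ok") ? 200 : 503;
  return { httpStatus, checks: results };
}
```

The handler body then reduces to calling runChecks({ db: checkDb, redis: checkRedis, ... }) and sending checks with httpStatus.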

Don't expose to public

Health endpoints shouldn't be public:

Reveal infrastructure.
Can be probed for fingerprinting.

@health path /healthz*
@internal_only remote_ip 10.0.0.0/8
handle @health {
  handle @internal_only {
    reverse_proxy app:3000
  }
  respond 404
}

Only internal network can hit.

For public uptime checks (such as a status page), use a separate /status endpoint that exposes less:

app.get("/status", (req, res) => {
  res.status(200).json({ status: "operational" });
});

Doesn't leak details.
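If /status should reflect real health rather than always answering "operational", the detailed results can be collapsed into one coarse label before they leave the process. A sketch under assumptions: publicStatus is a hypothetical helper, the input uses the same { name, status } shape as the deep check results, and the "degraded" label is illustrative.

```javascript
// Collapse detailed check results into a single coarse label.
function publicStatus(results) {
  const allOk = results.every((r) => r.status === "ok");
  // Only the label is returned: no dependency names, no error messages.
  return { status: allOk ? "operational" : "degraded" };
}
```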

Response time

Health checks should be fast. < 100ms.

If /healthz/ready is slow:

Use cached results.
Don't re-check dependencies on every hit.

let lastCheck = { time: 0, ok: false };
app.get("/healthz/ready", async (req, res) => {
  if (Date.now() - lastCheck.time < 5000) {
    return res.status(lastCheck.ok ? 200 : 503).send();
  }
  const ok = await fullCheck();
  lastCheck = { time: Date.now(), ok };
  res.status(ok ? 200 : 503).send();
});

5s cache. Tolerates heavy probe load: dependencies are re-checked about once every 5 seconds, however often the endpoint is hit.
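One gap in the handler above: when the cache has just expired, every request that arrives before fullCheck() finishes starts its own check. A sketch of a wrapper that also shares the in-flight run; cachedCheck and its injectable clock parameter are hypothetical, added here for illustration and testability.

```javascript
// Wrap an async check so results are reused for `ttlMs` and concurrent
// callers share one in-flight run instead of each starting their own.
function cachedCheck(check, ttlMs = 5000, now = Date.now) {
  let last = { time: -Infinity, ok: false };
  let inFlight = null;

  return async function cached() {
    if (now() - last.time < ttlMs) return last.ok;
    if (!inFlight) {
      inFlight = Promise.resolve()
        .then(check)
        .then(Boolean, () => false) // a throwing check counts as not ready
        .then((ok) => {
          last = { time: now(), ok };
          inFlight = null;
          return ok;
        });
    }
    return inFlight;
  };
}
```

Usage would be const ready = cachedCheck(fullCheck); and then res.status((await ready()) ? 200 : 503).send() inside the handler.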

Startup probe

Some orchestrators have a startup probe, which tolerates a slower start than liveness would:

# Kubernetes
startupProbe:
  httpGet: { path: /healthz/live, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5

For services that take minutes to warm up (cache primer, large DB). The settings above allow up to 30 × 5 s = 150 s before the container is restarted.

Once startup succeeds, liveness takes over.

Custom checks

Beyond DB:

async function checkCustom() {
  // E.g., verify featureFlag service is reachable
  // Or: verify a specific data invariant
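  // Await the query first, then read the column: `await q(...).count` would
  // read .count off the pending promise and always yield undefined.
  const row = await db.queryOne("SELECT COUNT(*) AS count FROM identities");
  if (Number(row.count) < 1) {  // some drivers return COUNT(*) as a string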
    return { ok: false, reason: "no_users_imported" };
  }
  return { ok: true };
}

Health checks specific to your app.
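A custom check like this returns { ok, reason }, while the deep endpoint earlier on this page treats a rejected promise as failure. A small adapter can bridge the two shapes. This is a sketch: throwOnFail is a hypothetical name, and the optional detail field is an assumption.

```javascript
// Convert a check that returns { ok, reason } into one that throws on
// failure, the shape a Promise.allSettled-based aggregation expects.
function throwOnFail(check) {
  return async function wrapped() {
    const result = await check();
    if (!result.ok) {
      throw new Error(result.reason || "check failed");
    }
    return result.detail; // undefined unless the check supplies a detail
  };
}
```

With it, throwOnFail(checkCustom) can sit in the same list as checkDb and the other deep checks.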

Outbound network

If your app depends on external API (Stripe, etc.):

async function checkStripe() {
  // Standard fetch ignores a `timeout` option; abort through a signal instead.
  const res = await fetch("https://api.stripe.com/v1/healthcheck", { signal: AbortSignal.timeout(1000) });
  return res.ok;
}

Be careful: external dependencies make YOUR healthcheck flaky. If Stripe is slow, your readiness fails and your app appears down.

Better: track external dependencies in /healthz/deep, not /healthz/ready.
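One way to express that split is to tag each check as critical or not, then derive both answers from the same results. A sketch only: summarize and the critical flag are hypothetical, and which dependencies count as critical is a per-app decision.

```javascript
// Each result looks like { name, ok, critical }.
function summarize(results) {
  const failed = results.filter((r) => !r.ok);
  return {
    // Readiness ignores non-critical failures, so a slow third party
    // does not pull the instance out of the load balancer.
    ready: failed.every((r) => !r.critical),
    // Deep health reports everything.
    healthy: failed.length === 0,
    failing: failed.map((r) => r.name),
  };
}
```

Here /healthz/ready would answer from ready, and /healthz/deep from healthy and failing.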

Versioning

Include version in deep:

res.json({
  version: process.env.GIT_SHA,
  build_time: process.env.BUILD_TIME,
  checks: [...],
});

Helps debug: "is this old version still running?"

Common pitfalls

Same endpoint for liveness and readiness

Conflates "should restart" with "should route traffic." Different decisions.

Health checks doing too much

Slow checks cascade: a probe that times out counts as a failure. Keep them fast.

Health checks behind authn

app.get("/healthz", requireAuth, (req, res) => ...);

Bad. LB can't authenticate. Use IP allowlist instead.
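If the allowlist has to live in the app rather than in the proxy, Node's built-in net.BlockList can do the subnet match. A sketch under assumptions: isInternal and internalOnly are hypothetical names, and the 10.0.0.0/8 range is taken from the proxy example earlier on this page. Note that req.socket.remoteAddress is the direct peer, so behind a reverse proxy it is the proxy's address and the filtering belongs in the proxy.

```javascript
const net = require("node:net");

// Allow only the internal range 10.0.0.0/8.
const internal = new net.BlockList();
internal.addSubnet("10.0.0.0", 8);

function isInternal(ip) {
  // Dual-stack sockets may report IPv4 peers as "::ffff:a.b.c.d".
  const addr = ip.startsWith("::ffff:") ? ip.slice(7) : ip;
  return net.isIPv4(addr) && internal.check(addr);
}

// Express-style middleware: answer 404 to anything outside the allowlist.
function internalOnly(req, res, next) {
  if (isInternal(req.socket.remoteAddress || "")) return next();
  res.status(404).end();
}
```

It would replace requireAuth in the route above: app.get("/healthz", internalOnly, handler).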

Returning 200 for "almost ready"

if (allChecks.some(c => c.ok)) return 200;  // some passed → OK?

No, ALL should pass. Otherwise return 503.
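The corrected condition uses every instead of some. A minimal sketch, where readinessStatus is a hypothetical helper; treating an empty list as not ready is a deliberate choice made here, since [].every(...) is true and would otherwise report 200 with nothing verified.

```javascript
// Ready only when at least one check ran and every check passed.
function readinessStatus(allChecks) {
  const allPassed = allChecks.length > 0 && allChecks.every((c) => c.ok);
  return allPassed ? 200 : 503;
}
```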
