Distributed tracing across Olympus

For complex auth flows, tracing shows which step took how long. Olympus's services support OpenTelemetry: set it up once and every sampled request produces a trace you can inspect.

What to trace

For each auth request:

Caddy ingress.
Kratos public API call.
Hydra admin API call (for OAuth2 flows).
Postgres queries.
External HTTP (Postmark, OIDC providers).

A single login might involve 3-5 services and 10+ DB queries. Tracing visualizes this.

Setup

OTLP collector

Run an OpenTelemetry Collector that accepts traces and forwards them to a backend:

# docker-compose.yml
otel-collector:
  image: otel/opentelemetry-collector-contrib:0.95.0
  command: ["--config=/etc/otel-config.yaml"]
  volumes: ["./otel-config.yaml:/etc/otel-config.yaml"]
  ports:
    - "4317:4317"  # gRPC
    - "4318:4318"  # HTTP

# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp/jaeger:
    endpoint: http://jaeger:4318   # Jaeger's OTLP/HTTP receiver
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${HONEYCOMB_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp/jaeger]

Kratos

# kratos.yml
tracing:
  service_name: kratos
  provider: otel
  providers:
    otlp:
      server_url: otel-collector:4318   # Ory's OTLP exporter sends over HTTP
      insecure: true
      sampling:
        sampling_ratio: 0.1   # 10% of requests

Hydra

# hydra.yml
tracing:
  service_name: hydra
  provider: otel
  providers:
    otlp:
      server_url: otel-collector:4318   # Ory's OTLP exporter sends over HTTP
      insecure: true

Hera / Athena (Node)

// otel-init.ts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "hera",
  traceExporter: new OTLPTraceExporter({ url: "http://otel-collector:4317" }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

Load it from Next.js:

// instrumentation.ts (Next.js conventions)
export async function register() {
  // Only load the Node SDK in the Node.js runtime, not in the edge runtime.
  if (process.env.NEXT_RUNTIME === "nodejs") {
    await import("./otel-init");
  }
}

What auto-instrumentation gives you

HTTP requests (incoming + outgoing).
Postgres queries.
Redis ops.
Fetch / http.request.

All of this comes without writing instrumentation for each call.

Custom spans

For app logic:

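import { trace, SpanStatusCode } from "@opentelemetry/api";
const tracer = trace.getTracer("hera-app");

// doLogin is the application's own login routine (not shown here).
export async function processLogin(creds: { method: string }) {
  return tracer.startActiveSpan("login.process", async (span) => {
    span.setAttributes({ "auth.method": creds.method });
    try {
      const result = await doLogin(creds);
      span.setAttribute("auth.outcome", "success");
      return result;
    } catch (err) {
      span.setAttribute("auth.outcome", "failure");
      span.recordException(err as Error);
      // Mark the span as failed so error-based sampling policies can match it.
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // Always end the span, on success or failure.
      span.end();
    }
  });
}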

Context propagation

For a single user request hitting multiple services, traces should connect:

Browser → Hera → Kratos → Postgres
              → Hydra → Postgres

Auto-instrumentation handles W3C Trace Context (the traceparent header): each service reads it from the incoming request and propagates it on outgoing calls.

Verify in the Jaeger UI: the same trace ID should appear on the spans from every service.
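
When a trace does not connect, it can help to log the incoming traceparent header in each service and compare trace IDs by hand. The following is a minimal sketch of a parser for the version 00 header format (version-traceid-parentid-flags). parseTraceParent and sameTrace are hypothetical debugging helpers, not part of any OpenTelemetry package.

```typescript
// Parsed form of a W3C `traceparent` header (version 00 layout).
interface TraceParent {
  version: string;
  traceId: string;   // 32 lowercase hex characters
  parentId: string;  // 16 lowercase hex characters
  sampled: boolean;  // lowest bit of the trace-flags byte
}

// Hypothetical helper: returns null when the header is malformed.
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Hypothetical helper: true when two headers belong to the same trace.
function sameTrace(a: string, b: string): boolean {
  const pa = parseTraceParent(a);
  const pb = parseTraceParent(b);
  return pa !== null && pb !== null && pa.traceId === pb.traceId;
}
```

If the header logged by Hera and the one logged by a downstream service share a trace ID but differ in parent ID, propagation is working as intended.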

Sampling

A 10% sampling ratio is a reasonable default for production. Higher ratios mean more data and more cost.
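
To get a feel for the trade-off, a rough volume estimate can be computed from request rate, spans per request, and sampling ratio. This is a back-of-the-envelope sketch; estimateStoredSpansPerDay is a hypothetical helper and the numbers in the usage note are illustrative assumptions, not measurements from Olympus.

```typescript
// Hypothetical helper: rough number of spans stored per day.
function estimateStoredSpansPerDay(
  requestsPerDay: number,
  spansPerRequest: number,
  samplingRatio: number,
): number {
  if (samplingRatio < 0 || samplingRatio > 1) {
    throw new RangeError("samplingRatio must be between 0 and 1");
  }
  // Round to a whole span count to avoid floating-point noise.
  return Math.round(requestsPerDay * spansPerRequest * samplingRatio);
}
```

For example, assuming 100000 logins a day at roughly 15 spans each, estimateStoredSpansPerDay(100000, 15, 0.1) gives 150000 stored spans per day, versus 1500000 at a ratio of 1.0.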

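For critical errors, aim to keep 100%. One approach is to tag the span so that a tail-sampling policy in the collector can match on it (the attribute has no effect on its own):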

if (result.outcome === "failure") {
  span.setAttribute("force_sample", "1");
}

Tail-based sampling in the collector:

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-policy
        type: latency
        latency: { threshold_ms: 1000 }
      - name: forced-policy          # matches spans tagged force_sample=1
        type: string_attribute
        string_attribute: { key: force_sample, values: ["1"] }
      - name: ten-percent
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

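This keeps every error, every trace slower than 1s, and 10% of the rest. Two caveats: tail_sampling must be listed under processors in the traces pipeline to take effect, and the collector can only decide on traces it receives, so services that sample at 10% themselves will have already dropped most errors before these policies run.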
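
The combined effect of these policies can be sketched as a single keep-or-drop decision per finished trace. This is an illustration of the logic only, not the collector's implementation: TraceSummary, traceIdToFraction, and shouldKeep are hypothetical names, and the way the trace ID is mapped to a fraction is an assumption chosen to make the baseline deterministic.

```typescript
// Hypothetical summary of a completed trace.
interface TraceSummary {
  traceId: string;     // 32 hex characters
  hasError: boolean;   // any span ended with ERROR status
  durationMs: number;  // end-to-end duration
}

// Map a trace ID to a stable value in [0, 1) using its last 8 hex digits,
// so the same trace always gets the same baseline decision.
function traceIdToFraction(traceId: string): number {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

// Keep every error, every slow trace, and a fixed share of the rest.
function shouldKeep(
  t: TraceSummary,
  thresholdMs = 1000,
  baselineRatio = 0.1,
): boolean {
  if (t.hasError) return true;
  if (t.durationMs > thresholdMs) return true;
  return traceIdToFraction(t.traceId) < baselineRatio;
}
```

Deriving the baseline from the trace ID rather than a random number means the decision is reproducible, which makes it easier to reason about why a given trace was kept.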

Backends

Free / self-hosted:

Jaeger (UI + storage).
Tempo (Grafana).

Paid:

Honeycomb (great UX).
Datadog APM.
New Relic.
Lightstep (now ServiceNow).

At typical Olympus scale, self-hosted Jaeger is plenty. Grafana Cloud's free tier (Tempo) is an alternative.

What to look for

Latency outliers

In Jaeger or Honeycomb, filter for traces with duration > 1s and investigate the slow ones.

Common culprits:

Slow DB queries (missing index).
Hot external dependencies (OIDC provider slow).
Cold starts (rare in Olympus's long-running services).

Errors with full context

When an error happens, the trace shows what else was happening at the same time. Useful for "was the DB also slow?" or "did Hydra return an error?"

Service dependency

Tracing backends can build a service dependency map automatically from trace data, showing which services call which.

Cost

Traces are typically higher-volume than logs. To manage the cost:

Sample aggressively (10% default).
Truncate large attributes.
Set retention (7-30 days).
Compress.
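
Truncating large attributes can be done in application code before they are set on a span. A minimal sketch, assuming a 256-character limit; truncateAttributes is a hypothetical helper and the limit is an arbitrary choice.

```typescript
type AttributeValue = string | number | boolean;

// Hypothetical helper: cap string attribute values at maxLength characters.
// Non-string values are passed through unchanged.
function truncateAttributes(
  attrs: Record<string, AttributeValue>,
  maxLength = 256,
): Record<string, AttributeValue> {
  const out: Record<string, AttributeValue> = {};
  for (const [key, value] of Object.entries(attrs)) {
    out[key] =
      typeof value === "string" && value.length > maxLength
        ? value.slice(0, maxLength)
        : value;
  }
  return out;
}
```

The result can be passed straight to span.setAttributes, so a long user agent string or error message does not inflate every stored span.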

Don't trace everything

Some endpoints are noisy and uninteresting (health checks). Exclude:

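import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-http": {
        // Skip span creation for health-check requests.
        ignoreIncomingRequestHook: (req) => req.url?.includes("/health") ?? false,
      },
    }),
  ],
});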
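
The substring match above also skips any route that merely contains /health, such as a hypothetical /api/healthcare. If that matters, an exact path match is a stricter alternative. This is a sketch: isIgnoredPath is a hypothetical helper and the path list is an assumption to adapt to the service's real probe endpoints.

```typescript
// Assumed probe endpoints; adjust to the service's real health routes.
const IGNORED_PATHS = new Set(["/health", "/healthz", "/ready"]);

// Hypothetical helper: true only when the request path is exactly one of
// the ignored paths, disregarding the query string and trailing slashes.
function isIgnoredPath(rawUrl: string | undefined): boolean {
  if (!rawUrl) return false;
  const path = rawUrl.split("?")[0].replace(/\/+$/, "") || "/";
  return IGNORED_PATHS.has(path);
}
```

It can be used as the hook body: ignoreIncomingRequestHook: (req) => isIgnoredPath(req.url).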
