Chaos engineering for Olympus

Chaos engineering means deliberately injecting failures to see how your system handles them. For Olympus, the high-value experiments target dependency failures (database, network, time).

Practice scenarios

Scenario A: Postgres restarts

Expected: services lose their connections, fail any in-flight requests, and reconnect once the database is back. No data loss.

podman restart ciam-postgres
# Watch logs:
podman logs -f ciam-kratos

Observe:

How long are auth requests failing?
Does Kratos recover automatically?
Are sessions still valid after recovery?

If recovery is poor, increase connection pool retry settings.
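
One way to put a number on "how long are auth requests failing" is to poll a readiness endpoint while the restart happens. The sketch below assumes Kratos is reachable at the hypothetical address in PROBE_URL (port 4433 and /health/ready are the usual Kratos defaults, but check your deployment); measure_outage and probe are illustrative helper names, not part of Olympus.

```shell
#!/usr/bin/env bash
# Sketch: measure how long a probe keeps failing during a dependency restart.
# PROBE_URL is an assumption; point it at wherever Kratos is reachable.
PROBE_URL="${PROBE_URL:-http://127.0.0.1:4433/health/ready}"

# probe: succeeds only when the endpoint answers with a 2xx status.
probe() {
  curl --silent --fail --max-time 2 --output /dev/null "$PROBE_URL"
}

# measure_outage CMD [INTERVAL_SECONDS] [MAX_CHECKS]
# Runs CMD repeatedly and prints the seconds between the first failed check
# and the next successful one. Returns 1 if no full outage was observed.
measure_outage() {
  local cmd="$1"
  local interval="${2:-1}"
  local max_checks="${3:-120}"
  local down_since=""
  local now
  local i=0
  while [ "$i" -lt "$max_checks" ]; do
    now=$(date +%s)
    if "$cmd"; then
      if [ -n "$down_since" ]; then
        echo "outage_seconds=$((now - down_since))"
        return 0
      fi
    elif [ -z "$down_since" ]; then
      down_since="$now"   # remember when the first failure was seen
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "no completed outage seen in $max_checks checks" >&2
  return 1
}
```

With the functions loaded, run measure_outage probe 1 120 in one terminal and podman restart ciam-postgres in another. The result is only as precise as the polling interval plus the probe timeout.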

Scenario B: Postgres slow

Inject latency:

# Needs tc (iproute2) inside the container and the NET_ADMIN capability
podman exec ciam-postgres tc qdisc add dev eth0 root netem delay 500ms
# Run test traffic
# Remove:
podman exec ciam-postgres tc qdisc del dev eth0 root

Observe:

p99 login latency goes from ~150ms to ~700ms.
Are connection pools saturated? Do requests queue?
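
It is easy to forget the removal step and leave the delay in place. A minimal sketch of a wrapper that adds the delay, runs your test traffic, and removes the delay once the command returns, whether it succeeded or not. with_latency and CONTAINER_CLI are hypothetical names introduced here; the tc commands are the ones from this scenario.

```shell
#!/usr/bin/env bash
# Sketch: run a command while netem latency is active, then clean up.
# CONTAINER_CLI is an assumption that lets you swap in a stub for dry runs.
CONTAINER_CLI="${CONTAINER_CLI:-podman}"

# with_latency CONTAINER DELAY CMD [ARGS...]
# Returns the exit status of CMD, or 1 if the delay could not be added.
with_latency() {
  local container="$1"
  local delay="$2"
  shift 2
  local status=0
  "$CONTAINER_CLI" exec "$container" tc qdisc add dev eth0 root netem delay "$delay" || return 1
  "$@" || status=$?
  # Always remove the qdisc so the container is not left slow.
  "$CONTAINER_CLI" exec "$container" tc qdisc del dev eth0 root
  return "$status"
}
```

For example, with_latency ciam-postgres 500ms ./run-login-traffic.sh, where run-login-traffic.sh stands in for whatever load script you use. The wrapper does not handle being interrupted with Ctrl-C, so check the qdisc by hand if you abort a run.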

Scenario C: DNS failure

Email provider DNS fails:

# In Kratos container:
podman exec ciam-kratos sh -c "echo '0.0.0.0 smtp.postmarkapp.com' >> /etc/hosts"
# User triggers recovery email

Observe:

Recovery flow itself succeeds (UI says "check email").
Email is queued by courier, retried, eventually fails.
User has no recovery for the duration.

This is a real concern: if your email provider goes down, account recovery breaks. Have a fallback or escalation path.

Scenario D: Clock skew

Kratos clock drifts 5 minutes:

# Needs --cap-add SYS_TIME; containers share the host clock, so this skews the whole host
podman exec ciam-kratos date -s "+5 minutes"

Observe:

TOTP codes from authenticator apps fail (codes are time-based).
Recovery tokens expire prematurely.
OAuth2 tokens might be issued with an iat claim in the future, causing clients to reject them.

Mitigation: ensure NTP is healthy. See Troubleshooting, Clock skew.
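
To see why a 5-minute drift breaks TOTP, it helps to count time steps. The sketch below assumes the RFC 6238 default step of 30 seconds and a validator that tolerates one step of drift in either direction; the tolerance Kratos actually applies is not stated here, so treat that value as an assumption. Both function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: how many TOTP time steps apart can two clocks be for a given skew?

# totp_worst_case_steps SKEW_SECONDS [STEP_SECONDS]
# Prints ceil(|skew| / step): the largest counter difference the skew can cause.
totp_worst_case_steps() {
  local skew="${1#-}"     # drop a leading minus sign; direction does not matter
  local step="${2:-30}"   # RFC 6238 default time step
  echo $(( (skew + step - 1) / step ))
}

# totp_skew_tolerated SKEW_SECONDS [STEP_SECONDS] [TOLERATED_STEPS]
# Succeeds if the worst case still falls inside the assumed tolerance window.
totp_skew_tolerated() {
  local tolerated="${3:-1}"   # assumption: validator accepts +/- 1 step
  [ "$(totp_worst_case_steps "$1" "${2:-30}")" -le "$tolerated" ]
}
```

Under these assumptions, totp_worst_case_steps 300 prints 10, far outside a one-step window, while a 20-second skew stays within it.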

Scenario E: Caddy restart

podman restart ciam-caddy

Observe: brief unavailability (5-10s). All in-flight requests fail. Session cookies survive (in browser).

Without persistent storage, Caddy requests new certificates on every start; if Let's Encrypt is rate-limiting you, issuance fails. Persist certificates by mounting the caddy_data volume.

Scenario F: Disk full

Fill up disk:

fallocate -l 10G /tmp/filler

Observe:

Postgres write failures.
Kratos can't create new sessions (write to DB).
Cleanup: run rm /tmp/filler, then confirm that services recover.

This is a graceful-degradation test. Should the app return "system unavailable" or accept reads while rejecting writes?

Scenario G: Memory pressure

stress --vm 1 --vm-bytes 14G --timeout 60s

Observe:

OOM killer activates.
Containers might be killed (depending on cgroup config).
Service unavailability.

Mitigation: set per-container memory limits via podman, so one runaway doesn't OOM everything.

Game days

Schedule a Game Day quarterly:

A team member injects a fault (kept secret from the rest of the team).
Others detect, diagnose, mitigate.
Track time-to-detect and time-to-mitigate.

Practice fielding the alert, not just designing for it.
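
Tracking the two numbers is simple if someone records three timestamps during the exercise. A minimal sketch, assuming Unix epoch seconds (for example from date +%s) and assuming time-to-mitigate is measured from injection rather than from detection; gameday_metrics is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: derive Game Day metrics from three epoch timestamps.

# gameday_metrics INJECTED DETECTED MITIGATED
# All arguments are Unix epoch seconds.
gameday_metrics() {
  local injected="$1"
  local detected="$2"
  local mitigated="$3"
  if [ "$detected" -lt "$injected" ] || [ "$mitigated" -lt "$detected" ]; then
    echo "timestamps must be ordered: injected <= detected <= mitigated" >&2
    return 1
  fi
  echo "time_to_detect_s=$((detected - injected))"
  # Assumption: mitigation time is counted from the moment of injection.
  echo "time_to_mitigate_s=$((mitigated - injected))"
}
```

Comparing these values across quarters shows whether alerting and runbooks are actually improving.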

Production chaos

Only run chaos in production if you've practiced extensively in staging. Even then:

During business hours, with operators on standby.
Bounded blast radius (small experiment).
Clear "abort" criteria.
Customer comms ready if it spills.

For Olympus single-host deployments, production chaos is rarely valuable: the system is small enough that its failure modes are already known.

For multi-host deployments, production chaos is worth running. Tools: Gremlin, AWS Fault Injection Simulator, kubectl-chaos.

Lessons chaos experiments typically surface

Restarts cascade: restarting Kratos kills the in-flight Hydra flows it was processing.
Retry storms: services retry endlessly during a dependency outage, multiplying load just as the dependency recovers.
Stale caches: caches outlive their dependency and keep serving stale data.
Bad fallbacks: code paths written as "if X fails, do Y" were never tested, and Y turns out to be broken.
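
The usual counter to retry storms is a bounded number of attempts with exponential backoff and random jitter, so clients do not all come back at the same instant. The sketch below is one generic way to do that in shell, not how any Olympus service retries today; the function names and the RETRY_BASE_SECONDS and RETRY_CAP_SECONDS variables are hypothetical. It relies on bash's RANDOM for jitter.

```shell
#!/usr/bin/env bash
# Sketch: bounded retries with capped exponential backoff and full jitter.
RETRY_BASE_SECONDS="${RETRY_BASE_SECONDS:-1}"
RETRY_CAP_SECONDS="${RETRY_CAP_SECONDS:-60}"

# backoff_ceiling ATTEMPT [BASE] [CAP]
# Prints min(CAP, BASE * 2^ATTEMPT) in whole seconds.
backoff_ceiling() {
  local attempt="$1"
  local delay="${2:-$RETRY_BASE_SECONDS}"
  local cap="${3:-$RETRY_CAP_SECONDS}"
  local n=0
  while [ "$n" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    n=$((n + 1))
  done
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# backoff_with_jitter ATTEMPT
# Prints a random delay between 0 and the ceiling for this attempt.
backoff_with_jitter() {
  local ceiling
  ceiling=$(backoff_ceiling "$1")
  echo $(( ${RANDOM:-0} % (ceiling + 1) ))
}

# retry MAX_ATTEMPTS CMD [ARGS...]
# Runs CMD until it succeeds or MAX_ATTEMPTS tries have failed.
retry() {
  local max="$1"
  shift
  local attempt=0
  until "$@"; do
    attempt=$((attempt + 1))
    if [ "$attempt" -ge "$max" ]; then
      return 1   # give up instead of retrying forever
    fi
    sleep "$(backoff_with_jitter "$attempt")"
  done
}
```

A chaos run is a good place to confirm that whatever retry policy your services use behaves like this under a real outage, rather than hammering the dependency the moment it returns.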
