Chaos engineering for Olympus
Find weaknesses before incidents do
Chaos engineering means deliberately injecting failures to see how your system handles them. For Olympus, the high-value experiments target dependency failures (DB, network, time).
Practice scenarios
Scenario A: Postgres restarts
Expected: services lose their connections, in-flight requests fail, and services reconnect when the DB comes back. No data loss.
podman restart ciam-postgres
# Watch logs:
podman logs -f ciam-kratos
Observe:
- How long are auth requests failing?
- Does Kratos recover automatically?
- Are sessions still valid after recovery?
If recovery is slow or needs a manual restart, tune the connection pool's retry and reconnect settings.
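To put a number on "how long are auth requests failing?", a small probe loop can run alongside the restart. This is a minimal sketch, not part of Olympus: measure_outage is a hypothetical helper, and the probe URL in the usage comment assumes Kratos's readiness endpoint is reachable on its default public port (4433), which may not match your deployment.

```shell
# measure_outage PROBE INTERVAL SAMPLES
# Runs PROBE every INTERVAL seconds, SAMPLES times, then prints how many
# probes failed and the longest run of consecutive failures.
# Outage duration is roughly longest_streak * INTERVAL.
measure_outage() {
  local probe="$1" interval="$2" samples="$3"
  local failed=0 streak=0 longest=0 i=0
  while [ "$i" -lt "$samples" ]; do
    if eval "$probe" >/dev/null 2>&1; then
      streak=0
    else
      failed=$((failed + 1))
      streak=$((streak + 1))
      if [ "$streak" -gt "$longest" ]; then longest=$streak; fi
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "failed=$failed longest_streak=$longest"
}

# Usage (assumed URL; start the probe, then restart Postgres in another shell):
#   measure_outage 'curl -fsS --max-time 2 http://127.0.0.1:4433/health/ready' 1 120
#   podman restart ciam-postgres
```

A one-second interval is coarse but usually enough to tell a five-second blip from a two-minute outage.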
Scenario B: Postgres slow
Inject latency:
# Requires tc (iproute2) in the image and the NET_ADMIN capability on the container.
podman exec ciam-postgres tc qdisc add dev eth0 root netem delay 500ms
# Run test traffic
# Remove:
podman exec ciam-postgres tc qdisc del dev eth0 root
Observe:
- p99 login latency goes from ~150ms to ~700ms.
- Are connection pools saturated? Do requests queue?
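To compare p99 latency before and during the injected delay, the test traffic can record one latency per request and reduce it to a percentile. The sketch below uses the nearest-rank method; percentile is a hypothetical helper, and the login URL in the usage comment is a placeholder, not an Olympus endpoint.

```shell
# percentile P
# Reads one number per line on stdin and prints the P-th percentile
# using the nearest-rank method (rank = ceil(P * N / 100)).
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = p * NR / 100
      idx = int(rank)
      if (idx < rank) idx++     # round up to the next whole rank
      if (idx < 1) idx = 1
      print v[idx]
    }'
}

# Usage (placeholder URL; curl prints the total request time in seconds):
#   for i in $(seq 200); do
#     curl -s -o /dev/null -w '%{time_total}\n' https://login.example.test/
#   done | percentile 99
```

Run it once without the netem rule and once with it, so the baseline comes from the same host and the same traffic.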
Scenario C: DNS failure
Email provider DNS fails:
# In Kratos container:
podman exec ciam-kratos sh -c "echo '0.0.0.0 smtp.postmarkapp.com' >> /etc/hosts"
# User triggers recovery email
Observe:
- Recovery flow itself succeeds (UI says "check email").
- Email is queued by courier, retried, eventually fails.
- User has no recovery for the duration.
This is a real concern: if your email provider goes down, recovery breaks. Have a fallback or escalation path.
Scenario D: Clock skew
Kratos clock drifts 5 minutes:
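# Containers share the host kernel clock: this needs CAP_SYS_TIME and shifts time for every container on the host.
podman exec ciam-kratos date -s "+5 minutes"
Observe: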
- TOTP codes from authenticator apps fail (codes are time-based).
- Recovery tokens expire prematurely.
- OAuth2 tokens might be issued with a future iat, causing client rejection.
Mitigation: ensure NTP is healthy. See Troubleshooting, Clock skew.
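The TOTP failure follows from simple arithmetic: TOTP (RFC 6238) derives each code from a counter equal to the Unix time divided by the time step, which defaults to 30 seconds. The sketch below shows how far a given skew moves that counter. The helper names are hypothetical, and the tolerance of one step either side is an assumption about a typical validator, not a documented Kratos setting.

```shell
# totp_step_drift SKEW_SECONDS [STEP_SECONDS]
# Prints the minimum number of whole time steps the TOTP counter shifts by.
totp_step_drift() {
  local skew="$1" step="${2:-30}"
  echo $((skew / step))
}

# totp_within_window SKEW_SECONDS WINDOW_STEPS [STEP_SECONDS]
# Succeeds if the drift is within the validator's tolerance (assumed value).
totp_within_window() {
  local drift
  drift=$(totp_step_drift "$1" "${3:-30}")
  [ "$drift" -le "$2" ]
}

# Usage: a 5-minute skew is 10 steps, far outside a 1-step tolerance.
#   totp_step_drift 300
#   totp_within_window 300 1 || echo "codes will be rejected"
```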
Scenario E: Caddy restart
podman restart ciam-caddy
Observe: brief unavailability (5-10s). All in-flight requests fail. Session cookies survive (in browser).
Caddy fetches new certs on first start; if Let's Encrypt is rate-limiting you, this fails. Use cert persistence: mount the caddy_data volume.
Scenario F: Disk full
Fill up disk:
fallocate -l 10G /tmp/filler
Observe:
- Postgres write failures.
- Kratos can't create new sessions (write to DB).
Cleanup:
rm /tmp/filler
Services should recover once space is freed.
This is a graceful-degradation test. Should the app return "system unavailable" or accept reads while rejecting writes?
Scenario G: Memory pressure
stress --vm 1 --vm-bytes 14G --timeout 60s
Observe:
- OOM killer activates.
- Containers might be killed (depending on cgroup config).
- Service unavailability.
Mitigation: set per-container memory limits via podman, so one runaway doesn't OOM everything.
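As a sketch of that mitigation, podman run accepts --memory and --memory-swap; setting both to the same value caps the container's memory and leaves it no swap. The wrapper name, the 512m limit, and the image placeholder below are illustrative assumptions, not Olympus defaults.

```shell
# run_with_memory_limit NAME LIMIT IMAGE
# Starts a detached container with a hard memory cap. --memory-swap is the
# combined memory+swap limit, so an equal value means no swap for this container.
run_with_memory_limit() {
  local name="$1" limit="$2" image="$3"
  podman run -d --name "$name" --memory "$limit" --memory-swap "$limit" "$image"
}

# Usage (the limit is illustrative; substitute the image you actually deploy):
#   run_with_memory_limit ciam-kratos 512m <kratos-image>
```

With limits in place, rerunning the stress test inside one container should get that container killed while the others stay up.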
Game days
Schedule a Game Day quarterly:
- A team member injects a fault (kept secret from the rest of the team).
- Others detect, diagnose, mitigate.
- Track time-to-detect and time-to-mitigate.
Practice fielding the alert, not just designing for it.
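Tracking the two timings needs nothing more than three timestamps per exercise. A minimal sketch, assuming timestamps are recorded as Unix epoch seconds (for example with date +%s) and that time-to-mitigate is measured from injection rather than from detection; gameday_metrics is a hypothetical helper.

```shell
# gameday_metrics INJECTED DETECTED MITIGATED   (all Unix epoch seconds)
# Prints time-to-detect and time-to-mitigate, both measured from injection.
gameday_metrics() {
  local injected="$1" detected="$2" mitigated="$3"
  if [ "$detected" -lt "$injected" ] || [ "$mitigated" -lt "$detected" ]; then
    echo "timestamps out of order" >&2
    return 1
  fi
  echo "time_to_detect=$((detected - injected))s time_to_mitigate=$((mitigated - injected))s"
}

# Usage:
#   injected=$(date +%s)    # when the fault goes in
#   detected=$(date +%s)    # when the first responder acknowledges the alert
#   mitigated=$(date +%s)   # when service is restored
#   gameday_metrics "$injected" "$detected" "$mitigated"
```

Comparing these numbers quarter to quarter shows whether alerting and runbooks are improving.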
Production chaos
Only run chaos in production if you've practiced extensively in staging. Even then:
- During business hours, with operators on standby.
- Bounded blast radius (small experiment).
- Clear "abort" criteria.
- Customer comms ready if it spills.
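The bounded-blast-radius and abort points can be enforced mechanically: inject, poll a health check, and roll back either when the check fails or when the time budget runs out. The sketch below is one way to do that; run_bounded_experiment is a hypothetical helper, it does not handle Ctrl-C or a crashed shell, and the health-check URL in the usage comment is an assumption.

```shell
# run_bounded_experiment INJECT ROLLBACK HEALTHCHECK MAX_SECONDS
# Runs INJECT, then polls HEALTHCHECK once per second for up to MAX_SECONDS.
# ROLLBACK always runs afterwards. Exit status: 0 = ran to the time limit,
# 1 = aborted on a failed health check, 2 = inject failed, 3 = rollback failed.
run_bounded_experiment() {
  local inject="$1" rollback="$2" check="$3" max="$4"
  local elapsed=0 status=0
  eval "$inject" || return 2
  while [ "$elapsed" -lt "$max" ]; do
    if ! eval "$check"; then
      echo "abort: health check failed after ${elapsed}s" >&2
      status=1
      break
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  if ! eval "$rollback"; then
    echo "rollback failed, fix manually" >&2
    return 3
  fi
  return "$status"
}

# Usage with the Scenario B commands (health-check URL is an assumption):
#   run_bounded_experiment \
#     'podman exec ciam-postgres tc qdisc add dev eth0 root netem delay 500ms' \
#     'podman exec ciam-postgres tc qdisc del dev eth0 root' \
#     'curl -fsS --max-time 5 http://127.0.0.1:4433/health/ready >/dev/null' \
#     60
```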
For Olympus single-host deployments, prod chaos isn't typically valuable: the system is small enough to know its failure modes.
For multi-host: yes, run prod chaos. Tools: Gremlin, AWS Fault Injection Simulator, kubectl-chaos.
Lessons typical chaos finds
- Restarts cascade: restarting Kratos kills the in-flight Hydra flows it was processing.
- Retry storms: services retry endlessly during a dependency outage, multiplying load when the dependency recovers.
- Stale caches: caches outlive their dependency and serve stale data.
- Bad fallbacks: "if X fails, do Y" code paths are never tested, so Y turns out to be broken.
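The usual counter to retry storms is a capped exponential backoff with random jitter and a bounded number of attempts, so clients spread out instead of retrying in lockstep. The sketch below illustrates the idea in shell; all three function names are hypothetical, and the base, cap, and attempt count in the usage comment are illustrative values rather than recommendations from Olympus.

```shell
# backoff_ceiling ATTEMPT BASE CAP
# Prints min(CAP, BASE * 2^ATTEMPT) in seconds.
backoff_ceiling() {
  local attempt="$1" cap="$3" delay="$2" i=0
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$cap" ]; then delay=$cap; fi
  echo "$delay"
}

# random_below N
# Prints an integer in [0, N), drawn from /dev/urandom (slight modulo bias).
random_below() {
  local r
  r=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  echo $((r % $1))
}

# retry_with_backoff MAX_ATTEMPTS BASE CAP COMMAND...
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached. Between attempts
# it sleeps a random 0..ceiling seconds ("full jitter").
retry_with_backoff() {
  local max="$1" base="$2" cap="$3" attempt=0 ceiling
  shift 3
  while :; do
    "$@" && return 0
    attempt=$((attempt + 1))
    [ "$attempt" -ge "$max" ] && return 1
    ceiling=$(backoff_ceiling "$((attempt - 1))" "$base" "$cap")
    sleep "$(random_below "$((ceiling + 1))")"
  done
}

# Usage (illustrative values: up to 6 attempts, 1s base, 30s cap):
#   retry_with_backoff 6 1 30 curl -fsS --max-time 2 http://127.0.0.1:4433/health/ready
```

The attempt limit matters as much as the delay: a client that gives up and surfaces an error adds no load when the dependency comes back.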