Identity graph for fraud detection
Connect identities across signals to spot abuse
A single user might create multiple accounts (anti-spam evasion, multi-account abuse, sock-puppets). Identity graph connects accounts that share signals.
Data model
CREATE TABLE identity_signals (
identity_id UUID,
signal_type TEXT, -- "ip", "device_fingerprint", "phone", "address", "stripe_customer"
signal_value TEXT,
observed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ix_signals ON identity_signals(signal_type, signal_value);
-- Find linked identities
SELECT DISTINCT s1.identity_id, s2.identity_id
FROM identity_signals s1
JOIN identity_signals s2 ON s1.signal_value = s2.signal_value
AND s1.signal_type = s2.signal_type
AND s1.identity_id != s2.identity_id;Two identities share a signal → connected.
Signals to capture
- IP address (changes but stable for periods).
- Device fingerprint.
- Phone number (rare same-phone reuse).
- Credit card token / Stripe customer (very strong link).
- Address (delivery, billing).
- User agent (weak; many share).
- Email domain.
Strength of signals
| Signal | Strength | Reuse rate |
|---|---|---|
| Identifier | (not a link, that's the user) | |
| Credit card | High | 1-2% legitimate reuse |
| Phone | High | < 5% legitimate (family) |
| Device fingerprint | Medium | 5-10% (shared computers) |
| IP | Low | 20-30% (shared NAT) |
| User agent | Very low | many share |
Score per signal:
const weights = {
credit_card: 0.9,
phone: 0.8,
device: 0.6,
ip: 0.3,
user_agent: 0.1,
};Sum scores between two identities = link strength.
Graph queries
"Find identities likely the same as X":
WITH linked AS (
SELECT
s2.identity_id AS related_id,
SUM(weights[s1.signal_type]) AS strength
FROM identity_signals s1
JOIN identity_signals s2 ON s1.signal_value = s2.signal_value
AND s1.signal_type = s2.signal_type
AND s1.identity_id != s2.identity_id
WHERE s1.identity_id = $X
GROUP BY s2.identity_id
)
SELECT * FROM linked WHERE strength > 0.5
ORDER BY strength DESC
LIMIT 20;Threshold 0.5: high confidence same person.
Use cases
Multi-account abuse
User created 10 accounts to abuse free-tier. Signals:
- Same credit card across 10 accounts.
- Same device fingerprint across 10 accounts.
Detection:
SELECT identity_id, COUNT(DISTINCT related_id) FROM ... GROUP BY 1 HAVING COUNT > 3;Flag, suspend, investigate.
Account takeover
Existing user's account suddenly has new device/IP. Cross-reference: does the new device match any other identity's known device?
If yes: known bad actor's device. Strong evidence of ATO.
Sock puppets (forums, reviews)
User has multiple accounts to post fake content. Detect by signal cluster.
SELECT graph.cluster_id, COUNT(*) AS member_count
FROM identity_graph
GROUP BY 1
HAVING COUNT(*) > 5;Clusters of 5+ identities sharing signals = suspect.
Privacy
The identity graph is a powerful tool, and a privacy concern. Document:
- What signals you collect.
- Why.
- How long.
Don't sell the data. Don't share without legal basis.
Storage
Signals can be voluminous. For 1M users with 10 signals each = 10M rows. Indexable, manageable.
For long-term: aggregate after 90 days (frequency counts, no per-event records).
Manual review
Don't auto-ban based on graph alone. False positives:
- Family share IPs.
- Shared device (library, kiosk).
- Same card across accounts (parent paying for child).
Use graph as INPUT to manual review queue.
Combine with anomaly
Graph + anomaly + behavior = stronger signal.
Graph alone: suspect.
Graph + behavior anomaly: stronger suspect.
Graph + anomaly + new device + bad IP: act.Weighted score combines.
Cluster IDs
ALTER TABLE identities ADD COLUMN cluster_id UUID;
-- Background job: assign cluster IDs based on link strength
UPDATE identities SET cluster_id = ... WHERE id IN (...);Membership in cluster useful for grouping investigations.
Visualizations
For investigators, graph viz:
// Build edges
const links = await db`SELECT id1, id2, strength FROM identity_links`;
// Render with d3 or similarManual inspection of who-is-connected-to-who. Detective work.
Tools
For complex graphs:
- Neo4j (graph database).
- TigerGraph.
- Self-hosted with Postgres + pg_graph (sufficient for most).
For Olympus scale: Postgres is fine.
When you don't need this
For B2C apps with low abuse / low fraud, identity graph is overkill. Worth it when:
- Free tier with valuable resources.
- Marketplace fraud risk.
- Multi-account incentive (referral programs).
- Regulated industries.