Identity graph for fraud detection

A single user might create multiple accounts (anti-spam evasion, multi-account abuse, sock-puppets). Identity graph connects accounts that share signals.

Data model

CREATE TABLE identity_signals (
  identity_id UUID,
  signal_type TEXT,    -- "ip", "device_fingerprint", "phone", "address", "stripe_customer"
  signal_value TEXT,
  observed_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE INDEX ix_signals ON identity_signals(signal_type, signal_value);

-- Find linked identities
SELECT DISTINCT s1.identity_id, s2.identity_id
FROM identity_signals s1
JOIN identity_signals s2 ON s1.signal_value = s2.signal_value
                         AND s1.signal_type = s2.signal_type
                         AND s1.identity_id != s2.identity_id;

Two identities share a signal → connected.

Signals to capture

IP address (changes but stable for periods).
Device fingerprint.
Phone number (rare same-phone reuse).
Credit card token / Stripe customer (very strong link).
Address (delivery, billing).
User agent (weak; many share).
Email domain.

Strength of signals

Signal	Strength	Reuse rate
Email	Identifier	(not a link, that's the user)
Credit card	High	1-2% legitimate reuse
Phone	High	< 5% legitimate (family)
Device fingerprint	Medium	5-10% (shared computers)
IP	Low	20-30% (shared NAT)
User agent	Very low	many share

Score per signal:

const weights = {
  credit_card: 0.9,
  phone: 0.8,
  device: 0.6,
  ip: 0.3,
  user_agent: 0.1,
};

Sum scores between two identities = link strength.

Graph queries

"Find identities likely the same as X":

WITH linked AS (
  SELECT 
    s2.identity_id AS related_id,
    SUM(weights[s1.signal_type]) AS strength
  FROM identity_signals s1
  JOIN identity_signals s2 ON s1.signal_value = s2.signal_value
                           AND s1.signal_type = s2.signal_type
                           AND s1.identity_id != s2.identity_id
  WHERE s1.identity_id = $X
  GROUP BY s2.identity_id
)
SELECT * FROM linked WHERE strength > 0.5
ORDER BY strength DESC
LIMIT 20;

Threshold 0.5: high confidence same person.

Use cases

Multi-account abuse

User created 10 accounts to abuse free-tier. Signals:

Same credit card across 10 accounts.
Same device fingerprint across 10 accounts.

Detection:

SELECT identity_id, COUNT(DISTINCT related_id) FROM ... GROUP BY 1 HAVING COUNT > 3;

Flag, suspend, investigate.

Account takeover

Existing user's account suddenly has new device/IP. Cross-reference: does the new device match any other identity's known device?

If yes: known bad actor's device. Strong evidence of ATO.

Sock puppets (forums, reviews)

User has multiple accounts to post fake content. Detect by signal cluster.

SELECT graph.cluster_id, COUNT(*) AS member_count
FROM identity_graph
GROUP BY 1
HAVING COUNT(*) > 5;

Clusters of 5+ identities sharing signals = suspect.

Privacy

The identity graph is a powerful tool, and a privacy concern. Document:

What signals you collect.
Why.
How long.

Don't sell the data. Don't share without legal basis.

Storage

Signals can be voluminous. For 1M users with 10 signals each = 10M rows. Indexable, manageable.

For long-term: aggregate after 90 days (frequency counts, no per-event records).

Manual review

Don't auto-ban based on graph alone. False positives:

Family share IPs.
Shared device (library, kiosk).
Same card across accounts (parent paying for child).

Use graph as INPUT to manual review queue.

Combine with anomaly

Graph + anomaly + behavior = stronger signal.

Graph alone: suspect.
Graph + behavior anomaly: stronger suspect.
Graph + anomaly + new device + bad IP: act.

Weighted score combines.

Cluster IDs

ALTER TABLE identities ADD COLUMN cluster_id UUID;

-- Background job: assign cluster IDs based on link strength
UPDATE identities SET cluster_id = ... WHERE id IN (...);

Membership in cluster useful for grouping investigations.

Visualizations

For investigators, graph viz:

// Build edges
const links = await db`SELECT id1, id2, strength FROM identity_links`;
// Render with d3 or similar

Manual inspection of who-is-connected-to-who. Detective work.

Tools

For complex graphs:

Neo4j (graph database).
TigerGraph.
Self-hosted with Postgres + pg_graph (sufficient for most).

For Olympus scale: Postgres is fine.

When you don't need this

For B2C apps with low abuse / low fraud, identity graph is overkill. Worth it when:

Free tier with valuable resources.
Marketplace fraud risk.
Multi-account incentive (referral programs).
Regulated industries.

Identity graph for fraud detection

On this page