
Graph Neural Networks for Real-Time Fraud Detection at Scale

April 12, 2026 · ImmovableTech Team

  • Fraud Detection
  • Graph Neural Networks
  • Real-Time Systems
  • Kafka
  • Financial AI

Rules Don’t Catch Rings

Rule-based fraud detection works until it doesn’t. It catches the obvious stuff — a transaction from an unusual country, a sudden spike in spending, a card used at two locations simultaneously. But the fraud that costs real money in 2026 isn’t some stolen card number getting tested with a $1 charge. It’s coordinated fraud rings: networks of synthetic identities, mule accounts, and layered transactions designed to look perfectly normal individually.

That’s the problem our client, a Series D fintech processing $8.2B annually, brought to us. Their rule-based system caught 70% of fraud attempts but was blind to the other 30%: transactions that looked fraudulent only when you examined the relationships between accounts.

Why Graphs

A transaction isn’t just an event. It’s a connection. Account A sends money to Account B, which sends to Account C, which converts to crypto. Each transaction is clean. The pattern is fraud.

Traditional ML models — gradient boosted trees, logistic regression — look at each transaction independently. They see features like amount, time, merchant category, device fingerprint. They’re good at catching outliers. But they can’t see that Accounts A, B, and C were all created from the same IP range, funded from the same source, and started transacting within 72 hours of each other.

Graph neural networks can. A GNN takes a graph of transactions and accounts as input, learns structural features (how many hops to a known fraudulent account? does this cluster have unusual symmetry?), and classifies nodes or edges based on both local features and global topology.
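
One of the structural features mentioned above, hop distance to a known fraudulent account, can be computed by hand with a breadth-first search. The sketch below is illustrative only: the toy adjacency map, the account names, and the `max_hops` cutoff are assumptions, and a trained GNN learns signals like this implicitly rather than calling a function for them.

```python
from collections import deque


def hops_to_fraud(adjacency, known_fraud, max_hops=3):
    """Hop distance from each reachable account to the nearest known-fraud account.

    Multi-source breadth-first search that stops expanding at max_hops.
    """
    distance = {node: 0 for node in known_fraud}
    queue = deque(known_fraud)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the cutoff
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical toy graph: a chain A - B - C - X, plus an unconnected account D.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "X"],
    "X": ["C"],
    "D": [],
}
distances = hops_to_fraud(adjacency, known_fraud=["X"])
```

Accounts that never appear in `distances` (such as `D` here) have no path to a known-fraud node within the cutoff.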

The Architecture

Source DBs → Debezium CDC → Kafka → Flink (enrichment)
  → Neo4j (graph update) → GNN Inference (PyTorch Geometric)
  → Risk Score → Decision Engine → Block/Flag/Pass

Graph Construction

We model the transaction network as a heterogeneous graph:

  • Nodes: accounts, devices, IP addresses, merchants
  • Edges: transactions (with amount, timestamp, type), logins (account → device), registrations (account → IP)

Neo4j holds the live graph, updated in near-real-time via Kafka. When a new transaction arrives, Flink enriches it with account age, historical velocity, and device metadata, then pushes the transaction as a new edge into Neo4j.
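
To make the node and edge types concrete, here is a small in-memory sketch of a heterogeneous graph. It is not the Neo4j schema used in the project; the class names, field names, and sample values are all hypothetical, and the plain dictionary stands in for the graph database.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    kind: str  # "account", "device", "ip", or "merchant"
    key: str


@dataclass
class Edge:
    kind: str  # "transaction", "login", or "registration"
    src: Node
    dst: Node
    attrs: dict = field(default_factory=dict)


class HeteroGraph:
    """In-memory stand-in for the live graph."""

    def __init__(self):
        self.out_edges = defaultdict(list)

    def add_edge(self, edge):
        self.out_edges[edge.src].append(edge)

    def edges_of_kind(self, node, kind):
        return [e for e in self.out_edges.get(node, []) if e.kind == kind]


graph = HeteroGraph()
alice, bob = Node("account", "acct-1"), Node("account", "acct-2")
phone = Node("device", "dev-9")

# A login edge (account -> device) and an enriched transaction edge.
graph.add_edge(Edge("login", alice, phone))
graph.add_edge(Edge("transaction", alice, bob,
                    {"amount": 125.0, "timestamp": "2026-04-12T10:00:00Z", "type": "transfer"}))
```

Keeping the edge type explicit is what lets later steps treat a login differently from a transfer when walking the neighbourhood.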

GNN Model

We use a 3-layer GraphSAGE model trained on 6 months of labelled fraud data (~14M transactions, ~80K confirmed fraud). The model learns 128-dimensional embeddings for each account node that capture both the account’s own features and its neighbourhood structure.
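
For readers new to GraphSAGE, the core of one layer is simple: each node combines its own features with the mean of its neighbours' features. The sketch below shows a single mean-aggregation layer in plain Python with 2-dimensional toy features. It is a teaching aid under stated assumptions (identity weight matrices, hand-written features), not the production model, which stacks three such layers and learns 128-dimensional embeddings.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def mean_vectors(vectors, dim):
    """Element-wise mean; a node with no neighbours gets a zero vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sage_layer(features, adjacency, w_self, w_neigh):
    """One GraphSAGE-style layer: ReLU(W_self * h_v + W_neigh * mean(h_u))."""
    updated = {}
    for node, h in features.items():
        neighbour_feats = [features[n] for n in adjacency.get(node, ())]
        agg = mean_vectors(neighbour_feats, len(h))
        combined = [a + b for a, b in zip(mat_vec(w_self, h), mat_vec(w_neigh, agg))]
        updated[node] = [max(0.0, x) for x in combined]
    return updated


# Hypothetical toy inputs.
features = {"A": [1.0, 0.0], "B": [0.0, 2.0], "C": [4.0, 2.0]}
adjacency = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
out = sage_layer(features, adjacency, w_self=identity, w_neigh=identity)
```

Stacking layers widens the receptive field: after k layers, a node's embedding can depend on nodes up to k hops away.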

At inference time, when a new transaction arrives:

  1. Flink enriches the transaction and updates Neo4j
  2. A subgraph of 2-hop neighbours around the involved accounts gets extracted (typically 50-200 nodes)
  3. The GNN scores the transaction based on the subgraph embeddings
  4. The risk score feeds into the decision engine alongside the traditional rule-based score
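
Step 2 above, the 2-hop subgraph extraction, can be sketched as a bounded frontier expansion. This version works on a plain adjacency dictionary rather than Neo4j or the Redis cache, and the function names and toy graph are assumptions made for illustration.

```python
def k_hop_nodes(adjacency, seeds, k=2):
    """Collect every node within k hops of any seed node."""
    visited = set(seeds)
    frontier = set(seeds)
    for _ in range(k):
        # Expand one hop, keeping only nodes not seen before.
        frontier = {n for node in frontier for n in adjacency.get(node, ())} - visited
        visited |= frontier
    return visited


def induced_edges(adjacency, nodes):
    """Edges of the original graph whose endpoints are both in `nodes`."""
    return [(u, v) for u in nodes for v in adjacency.get(u, ()) if v in nodes]


# Hypothetical chain: s - a - b - c - d
adjacency = {
    "s": ["a"],
    "a": ["s", "b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}
nodes = k_hop_nodes(adjacency, seeds=["s"], k=2)
edges = sorted(induced_edges(adjacency, nodes))
```

In the toy chain, `c` and `d` sit three and four hops from the seed, so they are left out of the subgraph the model sees.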

The two-model approach was deliberate. The GNN catches patterns that rules miss (coordinated rings). Rules catch obvious fraud that doesn’t require graph context (stolen card used in a new country). Combining both gives us 99.6% precision at 94% recall — significantly better than either model alone.
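
How the two scores are combined is not spelled out here, so the following is only one plausible shape for the decision engine: take the higher of the two scores and compare it against adjustable thresholds. The `max` rule, the threshold values, and the function name are all assumptions.

```python
def decide(gnn_score, rule_score, block_at=0.9, flag_at=0.6):
    """Map two risk scores in [0, 1] to Block / Flag / Pass.

    Assumption: the engine acts on the higher of the two scores.
    """
    risk = max(gnn_score, rule_score)
    if risk >= block_at:
        return "Block"
    if risk >= flag_at:
        return "Flag"
    return "Pass"


# A ring member the rules consider clean, and a rule hit with little graph signal.
ring_member = decide(gnn_score=0.95, rule_score=0.10)
rule_hit = decide(gnn_score=0.20, rule_score=0.70)
```

Keeping the thresholds outside the models means they can be tuned without retraining either one.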

Latency Engineering

The hardest constraint was latency. Payment processors require a fraud decision within their authorization window — typically under 2 seconds. Our budget was 800ms from transaction arrival to risk score.

Breaking it down:

  • Kafka delivery: ~20ms
  • Flink enrichment: ~50ms
  • Neo4j subgraph extraction: ~150ms (the bottleneck)
  • GNN inference: ~80ms (ONNX-optimized, batched)
  • Decision engine: ~10ms
  • Network overhead: ~100ms
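
Adding up the approximate stage figures above shows how much of the 800ms budget they consume. The numbers come straight from the list; only the variable names are made up.

```python
budget_ms = 800

# Approximate per-stage latencies from the breakdown above.
stages_ms = {
    "kafka_delivery": 20,
    "flink_enrichment": 50,
    "neo4j_subgraph_extraction": 150,
    "gnn_inference": 80,
    "decision_engine": 10,
    "network_overhead": 100,
}

total_ms = sum(stages_ms.values())
headroom_ms = budget_ms - total_ms
slowest_stage = max(stages_ms, key=stages_ms.get)
```

The typical path comes to roughly 410ms, which leaves about 390ms of headroom for slow requests such as highly connected accounts.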

The Neo4j subgraph extraction was the pain point. Our first approach — Cypher queries for 2-hop neighbours — took 400ms+ for highly connected accounts. We switched to a pre-computed neighbourhood cache in Redis, updated asynchronously. The cache serves 95% of requests in under 30ms. Cache misses fall back to Neo4j with a slightly higher latency budget.
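
The cache-then-fallback lookup can be sketched as follows. The dictionary stands in for Redis, `fetch_from_graph` is a hypothetical callable standing in for the Cypher query, and writing the result back on a miss is an assumption: the description above only says the cache is refreshed asynchronously.

```python
class NeighbourhoodCache:
    """Serve pre-computed neighbourhoods, falling back to the graph on a miss."""

    def __init__(self, fetch_from_graph):
        self._store = {}                # stand-in for Redis
        self._fetch = fetch_from_graph  # slow path, e.g. a graph query
        self.hits = 0
        self.misses = 0

    def get(self, account_id):
        cached = self._store.get(account_id)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        neighbours = self._fetch(account_id)
        self._store[account_id] = neighbours  # assumption: populate on miss
        return neighbours

    def refresh(self, account_id):
        """Recompute one entry; in production this would run off the hot path."""
        self._store[account_id] = self._fetch(account_id)


calls = []


def slow_lookup(account_id):
    calls.append(account_id)  # record each trip to the slow path
    return ["n1", "n2"]


cache = NeighbourhoodCache(slow_lookup)
first = cache.get("acct-1")   # miss: goes to the slow path
second = cache.get("acct-1")  # hit: served from the cache
```

Tracking hits and misses separately makes it easy to check a hit rate like the 95% figure against live traffic.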

The Training Pipeline

Fraud models go stale fast. Fraudsters adapt, new attack patterns emerge, and the data distribution shifts monthly. We retrain the GNN weekly on a rolling 6-month window.
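
A rolling window is just a date filter recomputed at each retrain. The sketch below approximates six months as 182 days, which is an assumption, and uses hypothetical function names.

```python
from datetime import date, timedelta


def training_window(as_of, days=182):
    """Return (start, end) for a rolling window ending at `as_of`."""
    return as_of - timedelta(days=days), as_of


def in_window(tx_date, window):
    """True if the transaction date falls in the half-open range (start, end]."""
    start, end = window
    return start < tx_date <= end


window = training_window(date(2026, 4, 12))
recent = in_window(date(2026, 1, 15), window)
stale = in_window(date(2025, 6, 1), window)
```

Each weekly run moves both ends of the window forward by seven days, so the oldest week of data drops out as the newest arrives.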

The tricky part is labelling. Confirmed fraud labels arrive days or weeks after the transaction (when the customer disputes a charge or an investigation concludes). We handle this with a two-stage approach:

  1. Preliminary labels from the rule-based system and customer reports (available within 24-48 hours)
  2. Confirmed labels from investigations (available within 2-4 weeks)

We train on preliminary labels for rapid model updates and validate against confirmed labels to measure true accuracy. The gap between preliminary and confirmed accuracy is our “label noise” metric — when it exceeds 5%, we investigate whether the rule-based system is mislabelling.
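
One way to read the “label noise” metric is as the absolute difference between the model's accuracy measured against preliminary labels and its accuracy measured against confirmed labels. That reading, the interpretation of 5% as five percentage points, and the toy data below are all assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    matches = sum(p == y for p, y in zip(predictions, labels))
    return matches / len(labels)


def label_noise(predictions, preliminary, confirmed):
    """Gap between accuracy on preliminary labels and on confirmed labels."""
    return abs(accuracy(predictions, preliminary) - accuracy(predictions, confirmed))


# Hypothetical toy data: 1 = fraud, 0 = legitimate.
predictions = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
preliminary = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
confirmed = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # one preliminary fraud label overturned

noise = label_noise(predictions, preliminary, confirmed)
needs_review = noise > 0.05
```

In the toy data a single overturned label moves accuracy from 100% to 90%, which is enough to trip the review threshold.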

Results

After 14 weeks:

  • 99.6% precision at 94% recall — meaning only 0.4% of flagged transactions are false positives
  • Sub-800ms end-to-end latency at the 99th percentile
  • $47M in annual fraud prevented (up from $33M with rules alone)
  • 2.3M transactions scored daily without a single missed SLA
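
For reference, precision and recall follow directly from confusion-matrix counts. The counts below are not client data: they are hypothetical numbers chosen only so that the ratios reproduce the 99.6% and 94% figures.

```python
def precision(tp, fp):
    """Share of flagged transactions that were actually fraud."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual fraud that was flagged."""
    return tp / (tp + fn)


# Hypothetical counts, picked to match the reported ratios.
true_positives = 11703
false_positives = 47
false_negatives = 747

p = precision(true_positives, false_positives)
r = recall(true_positives, false_negatives)
false_positive_share = 1 - p  # share of flagged transactions that were legitimate
```

The two numbers answer different questions: precision describes how trustworthy a flag is, while recall describes how much fraud slips through unflagged.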

The most satisfying metric: ring detection. In the first month, the GNN identified 23 coordinated fraud rings that the rule-based system had scored as completely clean. One ring involved 47 accounts and $3.2M in fraudulent transactions over 6 weeks.

What Didn’t Work

End-to-end GNN training. We tried training the GNN to output binary fraud/not-fraud directly. It worked on the test set but had terrible precision in production — too many false positives on legitimate but unusual transaction patterns (e.g., a small business doing a large bulk purchase). The risk-score approach, where the GNN outputs a continuous score that feeds into a decision engine with adjustable thresholds, gave us the control we needed.

Full graph inference. Running GNN inference on the entire transaction graph is computationally impractical at our scale. The subgraph extraction approach — pulling 2-hop neighbours for each transaction — is an approximation. We lose some long-range patterns (3+ hop fraud chains), but the latency trade-off is worth it.

Real-time retraining. We tried online learning — updating the model with each new confirmed fraud label. The model destabilised after a week. Weekly batch retraining with proper validation is slower but far more reliable.

