VynFi — synthetic data for audit, fraud & financial-network ML

Privacy-preserving, fully synthetic financial datasets and ML demos, generated by a deterministic engine. The engine produces balanced double-entry general ledgers, multi-entity group consolidations, AML/banking transactions, and OCEL process logs. All outputs carry ground-truth fraud and anomaly labels and are grounded in real accounting standards (IFRS · US/French/German GAAP · ISA) and the ISO 21378 audit-data model.

Everything here is synthetic — no client or real-world data — so it can be used freely to train, benchmark, and stress-test audit, fraud-detection, and graph-ML systems.
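
As a rough illustration of what "balanced double-entry" means for the ledgers described above, the sketch below groups ledger lines by journal entry and reports any entry whose debits and credits do not net to zero. The field names (`je_id`, `debit`, `credit`) and the sample rows are assumptions made for this example, not the published VynFi schema.

```python
from collections import defaultdict
from decimal import Decimal


def unbalanced_entries(lines):
    """Return the IDs of journal entries whose debits and credits do not net to zero."""
    net = defaultdict(lambda: Decimal("0"))
    for line in lines:
        # Decimal avoids float rounding noise on monetary amounts.
        net[line["je_id"]] += Decimal(line["debit"]) - Decimal(line["credit"])
    return sorted(je_id for je_id, total in net.items() if total != 0)


# Hypothetical rows: JE-1 balances, JE-2 is off by 0.01.
sample = [
    {"je_id": "JE-1", "debit": "100.00", "credit": "0.00"},
    {"je_id": "JE-1", "debit": "0.00", "credit": "100.00"},
    {"je_id": "JE-2", "debit": "50.00", "credit": "0.00"},
    {"je_id": "JE-2", "debit": "0.00", "credit": "49.99"},
]

print(unbalanced_entries(sample))  # ['JE-2']
```

On a ledger that is balanced by construction, a check like this should return an empty list, which makes it a cheap sanity test after loading or filtering the data.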

🚀 Interactive demos

| Space | What it does |
| --- | --- |
| 🔍 Inverse-Audit Detector | Label-free anomaly detection on a synthetic GL: fit the normal-system manifold, then flag journal entries by deviation via two fit-on-self residual arms (per-JE density + relational account-flow-graph) routed into one risk score. Pick a fraud scenario, then see per-arm ROC, recall @ audit budget, and the top suspicious entries. |
| 🔀 Counterfactual GL Explorer | Seed-locked baseline vs counterfactual ledgers from a causal-DAG intervention: pick a scenario (control-stress / SoD breakdown), then see the effect-field distribution shift, the intervention trace, and the exact changed lines. Generation is byte-deterministic, so the diff is signal, not noise. |
| 🛡️ Fraud-GNN Demo | Graph-neural-network fraud detection on the JE network: edge fraud predictor, node anomaly explorer, and a live check with confusion matrix + ROC. |
| 🔗 Accounting Network Explorer | Interactive ISO 21378 account-class flow graph: filter by business process, fraud, anomaly, amount, top-N; drill from Level-2 classes into Level-3 sub-classes. |
| 📊 Process Mining Demo | pm4py directly-follows graphs, variants, and statistics on the supply-chain OCEL 2.0 event log. |
| 🗂️ Data Explorer | Browse and inspect the VynFi synthetic datasets. |
| 🕵️ Perfect Audit Crime Challenge | Two-track community leaderboard: flag the planted fraud in synthetic GLs and help map the detectability frontier. Track A (ledger only): the mimetic perfect crime, fraud drawn from the ledger's own normal distribution, is provably uncatchable. Track B (ledger + ISA-520/505 evidence): it becomes catchable. Upload a submission to get PR-AUC and per-observability recall on held-out labels. |
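
The "recall @ audit budget" figure reported by the Inverse-Audit Detector can be read as follows, assuming the budget is the number of journal entries an auditor can afford to review: rank entries by risk score, review the top `k`, and measure the share of all fraudulent entries that fall inside that set. This definition, the function name, and the toy data are assumptions for illustration; the demo may compute the metric differently.

```python
def recall_at_k(scores, labels, k):
    """Share of positive labels captured among the k highest-scoring entries."""
    if k <= 0:
        raise ValueError("k must be a positive number of entries to review")
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("recall is undefined without any positive labels")
    # Indices of entries ordered from highest to lowest risk score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    caught = sum(labels[i] for i in ranked[:k])
    return caught / total_positives


# Hypothetical risk scores and ground-truth fraud labels for ten entries.
scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
labels = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]

# Reviewing the top 3 entries catches 2 of the 3 fraudulent ones.
print(round(recall_at_k(scores, labels, 3), 3))  # 0.667
```

Framing the metric around a fixed review budget reflects how audit work is constrained in practice: the question is how much fraud surfaces within the sample that can actually be inspected.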

📦 Datasets

Group audit & consolidation

| Dataset | Highlights |
| --- | --- |
| vynfi-group-audit-enterprise-2000 | End-to-end 2 000-entity group: matched intercompany pairs, eliminations, IFRS-consolidated financial statements + schedules + notes + CTA/NCI/equity-method rollforwards. |
| vynfi-group-audit-3yr-medium | Multi-period (3-year) group-audit bundle: period N+1 opens from period N's closing trial balance. |
| vynfi-je-network-2k | 68.5 M-edge consolidated journal-entry network from the 2 000-entity group, a drop-in for GNN training (PyG / DGL), with is_fraud, ic_pair_id, is_eliminated. |
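
The multi-period property above (period N+1 opens from period N's closing trial balance) lends itself to a simple roll-forward check. The sketch below compares two balance mappings account by account. It assumes both trial balances are available as `{account: balance}` dictionaries on a comparable (post-closing) basis; the account codes and amounts are invented for the example and do not come from the dataset.

```python
from decimal import Decimal


def rollforward_breaks(closing_prev, opening_next):
    """Return accounts whose opening balance differs from the prior period's closing balance."""
    zero = Decimal("0")
    accounts = set(closing_prev) | set(opening_next)
    return sorted(
        acct
        for acct in accounts
        # An account missing on one side is treated as a zero balance.
        if closing_prev.get(acct, zero) != opening_next.get(acct, zero)
    )


# Hypothetical balances: account 2000 does not roll forward cleanly.
closing_y1 = {"1000": Decimal("250.00"), "2000": Decimal("-150.00"), "3000": Decimal("-100.00")}
opening_y2 = {"1000": Decimal("250.00"), "2000": Decimal("-140.00"), "3000": Decimal("-100.00")}

print(rollforward_breaks(closing_y1, opening_y2))  # ['2000']
```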

General ledger / journal entries

| Dataset | Highlights |
| --- | --- |
| vynfi-journal-entries-1m | ~1 M-entry manufacturing GL with ISA 240 manual flags, fraud labels, and chart of accounts. |
| vynfi-journal-entries-10m | Research-scale ~10.9 M-entry synthetic GL. |
| vynfi-audit-p2p | Procure-to-Pay document chain (PO/GR/VI/Payment) with fraud labels, audit-engagement grade. |

Causal / counterfactual

| Dataset | Highlights |
| --- | --- |
| vynfi-counterfactual-gl | Seed-locked, byte-deterministic baseline ↔ counterfactual GL pairs under named causal-DAG interventions (control-environment, SoD). Each pair differs only by the intervention's effect; the diff split isolates the changed lines. A clean treatment/control substrate for causal ML, treatment-effect estimation, and residual-based audit analytics. |
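
Because each baseline/counterfactual pair differs only by the intervention's effect, the changed lines can in principle be recovered with a plain keyed comparison. The sketch below assumes each ledger is a dictionary keyed by a stable line identifier; the keys and fields (`account`, `amount`, `approved_by`) are hypothetical and chosen only to make the example concrete.

```python
def changed_lines(baseline, counterfactual):
    """Return (line_id, before, after) for every line that differs between the two ledgers."""
    diff = []
    for line_id in sorted(set(baseline) | set(counterfactual)):
        before = baseline.get(line_id)          # None if the line only exists after the intervention
        after = counterfactual.get(line_id)     # None if the intervention removed the line
        if before != after:
            diff.append((line_id, before, after))
    return diff


# Hypothetical pair: only line L2 is affected by the intervention.
baseline = {
    "L1": {"account": "1000", "amount": "100.00", "approved_by": "U1"},
    "L2": {"account": "2000", "amount": "-100.00", "approved_by": "U2"},
}
counterfactual = {
    "L1": {"account": "1000", "amount": "100.00", "approved_by": "U1"},
    "L2": {"account": "2000", "amount": "-100.00", "approved_by": "U1"},
}

for line_id, before, after in changed_lines(baseline, counterfactual):
    print(line_id, before["approved_by"], "->", after["approved_by"])  # L2 U2 -> U1
```

A comparison this simple is only meaningful when generation is deterministic, as the dataset states; with any run-to-run noise, every line would show up as changed.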

AML / banking

| Dataset | Highlights |
| --- | --- |
| vynfi-aml-100k | 748 K banking transactions with AML/SAR-style labels and velocity features. |
| vynfi-sar-narratives | 156 K transactions paired with suspicious-activity-report narratives + AML labels. |
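
For readers unfamiliar with velocity features, one common form is a trailing-window count: how many transactions the same account made in the preceding period. The sketch below is a generic illustration of that idea. The 24-hour window, the `(account_id, timestamp)` layout, and the sample rows are all assumptions; the features shipped with the dataset may be defined differently.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def trailing_counts(transactions, window=timedelta(hours=24)):
    """For each transaction, count earlier transactions by the same account within the window.

    `transactions` must be an iterable of (account_id, timestamp) sorted by timestamp.
    """
    recent = defaultdict(deque)  # account_id -> timestamps still inside the window
    counts = []
    for account_id, ts in transactions:
        queue = recent[account_id]
        # Drop timestamps that have fallen out of the trailing window.
        while queue and ts - queue[0] > window:
            queue.popleft()
        counts.append(len(queue))
        queue.append(ts)
    return counts


# Hypothetical transactions, already sorted by time.
txs = [
    ("A", datetime(2024, 1, 1, 9, 0)),
    ("A", datetime(2024, 1, 1, 12, 0)),
    ("B", datetime(2024, 1, 1, 13, 0)),
    ("A", datetime(2024, 1, 2, 10, 0)),
    ("A", datetime(2024, 1, 3, 11, 0)),
]

print(trailing_counts(txs))  # [0, 1, 0, 1, 0]
```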

Process mining (OCEL 2.0)

| Dataset | Highlights |
| --- | --- |
| vynfi-ocel-manufacturing | Manufacturing OCEL 2.0 event log: production-order lifecycle + quality inspections. |
| vynfi-supply-chain-ocel | 5-company manufacturing supply-chain OCEL 2.0 event log for cross-process mining. |
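
A directly-follows graph, the kind of view the Process Mining Demo builds with pm4py, counts how often one activity is immediately followed by another within the same trace. The sketch below shows the idea in plain Python rather than through pm4py's API. It assumes the object-centric log has already been flattened into one ordered activity list per object, and the activity names are invented for illustration.

```python
from collections import Counter


def directly_follows(traces):
    """Count (activity, next_activity) pairs across all traces."""
    dfg = Counter()
    for trace in traces:
        # Pair each activity with its immediate successor in the same trace.
        for current, following in zip(trace, trace[1:]):
            dfg[(current, following)] += 1
    return dfg


# Hypothetical production-order traces.
traces = [
    ["Create Order", "Start Production", "Inspect Quality", "Complete Order"],
    ["Create Order", "Start Production", "Complete Order"],
]

dfg = directly_follows(traces)
print(dfg[("Create Order", "Start Production")])  # 2
print(dfg[("Start Production", "Complete Order")])  # 1
```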

Challenge

| Dataset | Highlights |
| --- | --- |
| perfect-audit-crime-data | The ledgers behind the Perfect Audit Crime Challenge: 3 multi-entity GLs across two tracks (ledger / ledger + ISA-520/505 evidence) with a planted mimetic perfect-crime family; labels held out for scoring. |
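
Challenge submissions are scored with PR-AUC on the held-out labels. One common estimator of the area under the precision-recall curve is average precision, sketched below for local validation before submitting. Whether the leaderboard uses this exact estimator is an assumption, and the scores and labels are toy values.

```python
def average_precision(scores, labels):
    """Average precision: mean of the precision values at each rank where a positive is found.

    Tied scores are ordered arbitrarily in this sketch.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("average precision is undefined without any positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Hypothetical predictions: positives are ranked 1st and 3rd.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]

print(round(average_precision(scores, labels), 4))  # 0.8333
```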

🤖 Models

| Model | What it is |
| --- | --- |
| je-fraud-gnn | GraphSAGE 2-layer journal-entry fraud classifier (test AUC 0.914) + attribute-reconstruction GAE node-anomaly scorer (per-edge AUC 0.654, unsupervised). Includes weights, preprocessor, and full metrics. |

All datasets and demos are synthetic and contain no client or real-world data.