Introduction
Every VeriSynth run produces a proof receipt: a small, cryptographically verifiable JSON file that records exactly how a synthetic dataset was generated. This proof enables reproducibility, auditability, and trust in the synthetic data lifecycle without revealing the original data itself.
Why Proofs Matter
When working with sensitive datasets (e.g. healthcare, finance, clinical trials), you need to prove that:
- No real individuals were exposed
- The data wasn’t tampered with
- The synthetic results can be verified independently
What’s in a Proof Receipt?
Each proof file (proof.json) contains key metadata about the synthesis process:
Core Components
| Field | Description |
|---|---|
| verisynth_version | The exact software version used for the run |
| timestamp_utc | UTC time of generation |
| input / output | File paths, row counts, SHA-256 hashes, and Merkle roots |
| model | Model type, seed, and fidelity metrics |
| proof | Final Merkle root representing dataset lineage |
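As a quick sanity check, a receipt can be loaded and compared against the fields in the table above. The sketch below is illustrative only: it assumes the fields are top-level JSON keys and that `input` and `output` are separate keys, neither of which is confirmed here. `EXPECTED_FIELDS`, `missing_fields`, and the truncated sample receipt are hypothetical.

```python
import json

# Field names taken from the Core Components table.
# Assumption: they are top-level keys, with "input" and "output" stored separately.
EXPECTED_FIELDS = ("verisynth_version", "timestamp_utc", "input", "output", "model", "proof")


def missing_fields(receipt: dict) -> list[str]:
    """Return the expected top-level fields that are absent from a receipt."""
    return [name for name in EXPECTED_FIELDS if name not in receipt]


# Hypothetical, heavily truncated receipt used only to exercise the check.
sample = json.loads(
    '{"verisynth_version": "x.y.z", "timestamp_utc": "...", '
    '"input": {}, "output": {}, "model": {}}'
)

print(missing_fields(sample))  # the sample omits "proof", so that name is reported
```

To check a real receipt, replace `sample` with `json.load(open("proof.json"))`.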
1. File Hashing (SHA-256)
Each dataset (input and synthetic output) is hashed using SHA-256, producing a 64-character hexadecimal fingerprint that is, for practical purposes, unique to the file’s contents.
2. Merkle Roots
VeriSynth combines all row-level hashes into a Merkle tree, creating a single compact fingerprint (Merkle root) representing the entire dataset. This allows independent verification without needing the full dataset.
How it works
- Each record (row) is hashed individually.
- Pairs of hashes are combined and re-hashed up the tree.
- The final root hash represents the entire dataset’s integrity.
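The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not VeriSynth's actual implementation. How rows are serialized before hashing, how an odd number of hashes at a level is handled (here the last hash is duplicated), and how paired hashes are concatenated are all assumptions. `leaf_hash` and `merkle_root` are hypothetical names.

```python
import hashlib


def leaf_hash(row: str) -> bytes:
    """Hash one record. Assumption: a row is hashed as its UTF-8 encoded text."""
    return hashlib.sha256(row.encode("utf-8")).digest()


def merkle_root(rows: list[str]) -> str:
    """Combine row hashes pairwise, level by level, until one root hash remains."""
    if not rows:
        # Assumption: an empty dataset is represented by the hash of empty input.
        return hashlib.sha256(b"").hexdigest()
    level = [leaf_hash(row) for row in rows]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: duplicate the last hash when a level has an odd count.
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


rows = ["1,alice,34", "2,bob,41", "3,carol,29"]
root = merkle_root(rows)
changed = merkle_root(["1,alice,34", "2,bob,42", "3,carol,29"])
print(len(root), root != changed)  # a one-character edit yields a different root
```

Because the leaf encoding and pairing rules are guesses, a root computed this way will not necessarily match the one recorded in a receipt. The sketch only shows why changing any single row changes the root.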
3. Statistical Metrics
Each proof includes quantitative fidelity and privacy diagnostics:
| Metric | Meaning | Goal |
|---|---|---|
| corr_mean_abs_delta | Mean absolute difference between the column correlations of the real and synthetic data | Lower = better realism |
| naive_reid_risk | Fraction of synthetic rows statistically too similar to any real record | Lower = better privacy |
| ks_pvalues (optional) | Kolmogorov–Smirnov test p-values for numeric columns | Higher = better alignment |
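To make the first metric concrete, here is one plausible way to compute a mean absolute correlation delta in plain Python. It is a sketch under assumptions: it uses Pearson correlation, averages over each unordered pair of columns, and excludes the diagonal. VeriSynth's exact definition may differ. `pearson` and `corr_mean_abs_delta` are illustrative helpers, not VeriSynth functions, and the toy columns are made up.

```python
import math
from itertools import combinations


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length columns (undefined for constant columns)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def corr_mean_abs_delta(real: dict, synth: dict) -> float:
    """Average |corr_real - corr_synth| over all unordered pairs of column names."""
    pairs = list(combinations(sorted(real), 2))
    deltas = [
        abs(pearson(real[a], real[b]) - pearson(synth[a], synth[b]))
        for a, b in pairs
    ]
    return sum(deltas) / len(deltas)


# Toy data: the synthetic set flips the sign of the age/income relationship.
real = {"age": [1.0, 2.0, 3.0, 4.0], "income": [2.0, 4.0, 6.0, 8.0]}
synth = {"age": [1.0, 2.0, 3.0, 4.0], "income": [8.0, 6.0, 4.0, 2.0]}

print(corr_mean_abs_delta(real, real))   # identical data: no correlation drift
print(corr_mean_abs_delta(real, synth))  # correlation flips from +1 to -1
```

A value near zero means the synthetic data preserves the pairwise relationships of the real data. The toy synthetic set above is the worst case for a single column pair.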
4. Deterministic Seeds
Each run uses a random seed, recorded in the proof, that makes the entire synthesis process reproducible.
5. Proof Verification (Coming Soon)
You’ll soon be able to verify proofs against your own copies of the data. Verification will:
- Re-hash your input and output files
- Recompute Merkle roots
- Compare against the proof receipt
- Report if the dataset is unchanged and valid
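Until built-in verification is available, the re-hash and compare steps can be approximated by hand. The sketch below streams a file through SHA-256 and compares the digest with an expected value. It assumes the receipt stores file hashes as lowercase hex strings and deliberately does not guess at the receipt's key names. `file_sha256` and `file_matches` are hypothetical helpers, and the demo file is created on the fly.

```python
import hashlib
import os
import tempfile


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's current hash with the value copied from a receipt."""
    return file_sha256(path) == expected_hex.strip().lower()


# Demo with a throwaway file standing in for a dataset.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"id,value\n1,10\n")
    demo_path = tmp.name

recorded = file_sha256(demo_path)              # stands in for the hash in a receipt
unchanged = file_matches(demo_path, recorded)  # file untouched so far

with open(demo_path, "ab") as handle:          # simulate tampering
    handle.write(b"2,20\n")
tampered_detected = not file_matches(demo_path, recorded)

os.remove(demo_path)
print(unchanged, tampered_detected)
```

In practice, `recorded` would be the SHA-256 value copied from the input or output entry of `proof.json`.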
6. Reproducibility in Action
Try running VeriSynth twice with the same input, parameters, and seed. Both runs should yield the same proof.
7. Future Extensions
Planned features for the proof system:
| Feature | Description |
|---|---|
| Ed25519 signatures | Allow signed, verifiable proofs with public keys |
| ZK verification layer | Optional zero-knowledge proof mode for third-party attestations |
| Differential privacy tracking (ε) | Embed DP parameters directly in proofs |
| Remote proof viewer | Visualize proofs via the VeriSynth web dashboard |
Summary
| Principle | Description |
|---|---|
| Integrity | Every dataset is hashed, verified, and traceable |
| Reproducibility | Same input + seed → identical proof |
| Privacy | No real records retained or exposed |
| Auditability | Proof receipts can be independently verified offline |