Introduction

Every VeriSynth run produces a proof receipt: a small, cryptographically verifiable JSON file that records exactly how a synthetic dataset was generated. The receipt supports reproducibility, auditability, and trust in the synthetic data lifecycle without revealing the original data.

Why Proofs Matter

When working with sensitive datasets (e.g. healthcare, finance, clinical trials), you need to prove that:

No real individuals were exposed
The data wasn’t tampered with
The synthetic results can be verified independently

Traditional anonymization or “black-box” synthetic data tools can’t provide this level of assurance. VeriSynth changes that by generating verifiable cryptographic receipts with every run.

What’s in a Proof Receipt?

Each proof file (proof.json) contains key metadata about the synthesis process:

{
  "verisynth_version": "core-0.1.0",
  "timestamp_utc": "2025-10-16T00:23:54Z",
  "input": {
    "path": "data/patients.csv",
    "rows": 10,
    "sha256": "05c493ca63a4...82b7",
    "merkle_root": "ddc322ce1f1d...a4c"
  },
  "output": {
    "path": "out/synthetic.csv",
    "rows": 1000000,
    "sha256": "c4452526fa8c...acb9",
    "merkle_root": "8045dd531825...c31"
  },
  "model": {
    "engine": "GaussianCopulaSynthesizer",
    "seed": 42,
    "metrics": {
      "corr_mean_abs_delta": 0.23,
      "naive_reid_risk": 0.0
    }
  },
  "proof": "merkle_root: 8045dd531825e51b8241d67732074492cad53fb415b8b393f556a7483eac8c31"
}

Core Components

Field	Description
verisynth_version	The exact software version used for the run
timestamp_utc	UTC time of generation
input / output	File paths, row counts, SHA-256 hashes, and Merkle roots
model	Model type, seed, and fidelity metrics
proof	Final Merkle root representing dataset lineage
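
To make the field layout concrete, the sketch below parses a receipt and collects the values an auditor would typically read first. It relies only on the structure of the sample receipt; the helper name `summarize_receipt` is hypothetical and is not part of VeriSynth.

```python
import json

# Sample receipt mirroring the structure of the example receipt.
# The sha256 and merkle_root fields are left out here for brevity.
SAMPLE_RECEIPT = """
{
  "verisynth_version": "core-0.1.0",
  "timestamp_utc": "2025-10-16T00:23:54Z",
  "input": {"path": "data/patients.csv", "rows": 10},
  "output": {"path": "out/synthetic.csv", "rows": 1000000},
  "model": {
    "engine": "GaussianCopulaSynthesizer",
    "seed": 42,
    "metrics": {"corr_mean_abs_delta": 0.23, "naive_reid_risk": 0.0}
  }
}
"""


def summarize_receipt(receipt: dict) -> dict:
    """Collect the headline fields of a receipt into a flat dictionary."""
    return {
        "version": receipt["verisynth_version"],
        "generated_at": receipt["timestamp_utc"],
        "input_rows": receipt["input"]["rows"],
        "output_rows": receipt["output"]["rows"],
        "engine": receipt["model"]["engine"],
        "seed": receipt["model"]["seed"],
        "metrics": dict(receipt["model"]["metrics"]),
    }


summary = summarize_receipt(json.loads(SAMPLE_RECEIPT))
for key, value in summary.items():
    print(f"{key}: {value}")
```

A missing field raises `KeyError`, which is usually the desired behavior when a receipt is incomplete.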

1. File Hashing (SHA-256)

Each dataset (input and synthetic output) is hashed with SHA-256, producing a 64-character hexadecimal fingerprint. Example:

sha256sum data/patients.csv
# → 05c493ca63a434a419da68828ec08eef23b997c94f7588ccdf5f8c5ac4ee82b7

Changing even one character in the file produces a completely different hash. This lets you confirm that the file used to create the synthetic data has not changed since the receipt was written.
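
The same digest can be computed from Python with the standard `hashlib` module. This is a minimal sketch, not VeriSynth's own code: the function name `sha256_file` and the tiny CSV contents are made up for illustration.

```python
import hashlib
import os
import tempfile


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Two illustrative files that differ by a single character.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.csv")
    edited = os.path.join(tmp, "edited.csv")
    with open(original, "wb") as f:
        f.write(b"id,age\n1,34\n")
    with open(edited, "wb") as f:
        f.write(b"id,age\n1,35\n")
    digest_original = sha256_file(original)
    digest_edited = sha256_file(edited)

print(digest_original)
print(digest_edited)
print("digests differ:", digest_original != digest_edited)
```

Reading in chunks keeps memory use flat, which matters for outputs with millions of rows.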

2. Merkle Roots

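VeriSynth combines all row-level hashes into a Merkle tree, producing a single compact fingerprint (the Merkle root) that represents the entire dataset. Because the root commits to every row, a verifier can in principle confirm that a row belongs to the dataset from that row and a short chain of sibling hashes, without needing the full dataset.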

How it works

Each record (row) is hashed individually.
Pairs of hashes are combined and re-hashed up the tree.
The final root hash represents the entire dataset’s integrity.

This approach is inspired by blockchain data structures, but it runs fully offline.
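
The three steps can be sketched in a few lines of Python. The details here are assumptions, since this page does not specify them: rows are hashed as UTF-8 strings, sibling hashes are concatenated as raw bytes, and the last node is duplicated when a level has an odd number of nodes. VeriSynth's actual construction may differ, so roots from this sketch should not be expected to match a real receipt.

```python
import hashlib


def _sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(rows: list[str]) -> str:
    """Compute a Merkle root over rows using the assumed scheme described above."""
    if not rows:
        raise ValueError("cannot build a Merkle tree from zero rows")
    # Step 1: hash each row individually.
    level = [_sha256(row.encode("utf-8")) for row in rows]
    # Step 2: combine pairs of hashes and re-hash until one node remains.
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [_sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    # Step 3: the single remaining node is the root.
    return level[0].hex()


rows = ["1,34,F", "2,51,M", "3,29,F"]
root = merkle_root(rows)
tampered_root = merkle_root(["1,34,F", "2,52,M", "3,29,F"])

print(root)
print("root changes after editing one row:", root != tampered_root)
```

Production schemes often add domain separation between leaf and interior hashes to rule out certain collision tricks; that is omitted here to keep the sketch short.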

3. Statistical Metrics

Each proof includes quantitative fidelity and privacy diagnostics:

Metric	Meaning	Goal
corr_mean_abs_delta	Mean absolute difference between the pairwise correlations of the real and synthetic data	Lower = better realism
naive_reid_risk	Fraction of synthetic rows statistically too similar to any real record	Lower = better privacy
ks_pvalues (optional)	Kolmogorov–Smirnov test p-values for numeric columns	Higher = better alignment

These values help users balance realism vs. privacy, and are logged for auditability.
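
As an illustration of the first metric, the sketch below computes a mean absolute difference between pairwise Pearson correlations using only the standard library. The exact formula VeriSynth uses is not given on this page, so treat this as one plausible reading of the name `corr_mean_abs_delta`; the column names and values are invented.

```python
from itertools import combinations
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length columns (constant columns not handled)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corr_mean_abs_delta(real: dict, synthetic: dict) -> float:
    """Average |corr_real - corr_synthetic| over all pairs of columns."""
    columns = sorted(real)
    deltas = [
        abs(pearson(real[a], real[b]) - pearson(synthetic[a], synthetic[b]))
        for a, b in combinations(columns, 2)
    ]
    return sum(deltas) / len(deltas)


real = {"age": [30, 40, 50, 60], "bp": [110, 120, 130, 140]}
# Same marginals, but the relationship between the columns is reversed.
synthetic = {"age": [30, 40, 50, 60], "bp": [140, 130, 120, 110]}

print(corr_mean_abs_delta(real, real))
print(corr_mean_abs_delta(real, synthetic))
```

The second dataset keeps each column's values but flips the relationship between them, which is exactly the kind of drift this metric is meant to surface.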

4. Deterministic Seeds

Each run uses a random seed that makes the entire synthesis process reproducible. Example:

verisynth data/patients.csv -o out/ --rows 1000000 --seed 42

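Re-running with the same seed, input, and model produces identical output, and therefore identical hashes and Merkle roots in proof.json. Run-specific fields such as timestamp_utc will still differ between runs. This is a strong guarantee for regulated environments.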
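
The principle can be shown with a toy generator. In this sketch a seeded `random.Random` stands in for the synthesizer; `generate_rows` and `fingerprint` are hypothetical helpers and have nothing to do with VeriSynth's internals.

```python
import hashlib
import random


def generate_rows(seed: int, count: int) -> list[str]:
    """Stand-in for a synthesizer: draw rows from a locally seeded generator."""
    rng = random.Random(seed)  # local generator, independent of global random state
    return [f"{rng.randint(18, 90)},{rng.gauss(120, 15):.2f}" for _ in range(count)]


def fingerprint(rows: list[str]) -> str:
    """Hash the generated rows so two runs can be compared by digest."""
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


first = fingerprint(generate_rows(seed=42, count=1000))
second = fingerprint(generate_rows(seed=42, count=1000))
other = fingerprint(generate_rows(seed=7, count=1000))

print("same seed, same digest:", first == second)
print("different seed, different digest:", first != other)
```

Using a local generator rather than the module-level `random` functions keeps the result independent of whatever else in the process touches global random state.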

5. Proof Verification (Coming Soon)

You’ll soon be able to verify proofs using:

verisynth verify out/proof.json

This command will:

Re-hash your input and output files
Recompute Merkle roots
Compare against the proof receipt
Report whether the dataset is unchanged and valid

Expected output:

Verifying proof.json ...
✅ Merkle roots match
✅ Input/output hashes verified
✅ Metrics within tolerance

Result: VERIFIED (deterministic run)
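
Until that command ships, the file-hash part of the check can be approximated by hand. The sketch below is not the planned `verify` implementation: `check_file_hashes` is a hypothetical helper, it assumes the receipt stores full (untruncated) hex digests under `input.sha256` and `output.sha256`, and it skips Merkle roots because their construction is not specified here.

```python
import hashlib
import os
import tempfile


def sha256_file(path: str) -> str:
    """Return the hex SHA-256 digest of a file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file_hashes(receipt: dict) -> dict:
    """Re-hash the input and output files and compare with the recorded digests."""
    results = {}
    for role in ("input", "output"):
        entry = receipt[role]
        results[role] = sha256_file(entry["path"]) == entry["sha256"]
    return results


# Demo with throwaway files and a receipt built on the spot.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, "patients.csv")
    out_path = os.path.join(tmp, "synthetic.csv")
    with open(in_path, "wb") as f:
        f.write(b"id,age\n1,34\n")
    with open(out_path, "wb") as f:
        f.write(b"id,age\n1,36\n2,41\n")
    receipt = {
        "input": {"path": in_path, "sha256": sha256_file(in_path)},
        "output": {"path": out_path, "sha256": sha256_file(out_path)},
    }
    before = check_file_hashes(receipt)
    with open(out_path, "ab") as f:
        f.write(b"3,99\n")  # simulate tampering with the synthetic output
    after = check_file_hashes(receipt)

print("before tampering:", before)
print("after tampering:", after)
```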

6. Reproducibility in Action

Try running VeriSynth twice with the same parameters:

verisynth data/patients.csv -o out1/ --rows 1000 --seed 42
verisynth data/patients.csv -o out2/ --rows 1000 --seed 42

Then compare proofs:

diff out1/proof.json out2/proof.json

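If the proof system is working correctly, the hashes, Merkle roots, and metrics will be identical. Given the receipt format shown earlier, expect diff to report only run-specific fields: timestamp_utc and the output path (out1/ versus out2/).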
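
Because a receipt also records when and where a run wrote its output, a field-aware comparison can be easier to read than a raw diff. The sketch below drops `timestamp_utc` and `output.path` before comparing; treating exactly those two fields as run-specific is an assumption based on the sample receipt, and the `comparable` helper and abbreviated values are illustrative only.

```python
import copy

# Assumption: these are the only fields expected to vary between identical runs.
RUN_SPECIFIC = (("timestamp_utc",), ("output", "path"))


def comparable(receipt: dict) -> dict:
    """Return a copy of the receipt with run-specific fields removed."""
    trimmed = copy.deepcopy(receipt)
    for path in RUN_SPECIFIC:
        node = trimmed
        for key in path[:-1]:
            node = node.get(key, {})
        node.pop(path[-1], None)
    return trimmed


run1 = {
    "timestamp_utc": "2025-10-16T00:23:54Z",
    "output": {"path": "out1/synthetic.csv", "rows": 1000},
    "model": {"seed": 42},
    "proof": "merkle_root: abc123",  # placeholder value for illustration
}
run2 = {
    "timestamp_utc": "2025-10-16T00:25:10Z",
    "output": {"path": "out2/synthetic.csv", "rows": 1000},
    "model": {"seed": 42},
    "proof": "merkle_root: abc123",
}

print("raw receipts equal:", run1 == run2)
print("equal after trimming:", comparable(run1) == comparable(run2))
```

With real receipts, load each file with `json.load` and pass the resulting dictionaries to `comparable`.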

7. Future Extensions

Planned features for the proof system:

Feature	Description
Ed25519 signatures	Allow signed, verifiable proofs with public keys
ZK verification layer	Optional zero-knowledge proof mode for third-party attestations
Differential privacy tracking (ε)	Embed DP parameters directly in proofs
Remote proof viewer	Visualize proofs via the VeriSynth web dashboard

Summary

Principle	Description
Integrity	Every dataset is hashed, verified, and traceable
Reproducibility	Same input + seed → identical proof
Privacy	No real records retained or exposed
Auditability	Proof receipts can be independently verified offline
