Models in VeriSynth
VeriSynth uses machine learning models to learn the statistical structure of real data and generate synthetic data that behaves like the real world — without exposing sensitive records.
Each model is designed to balance realism, privacy, and computational efficiency.
The system is modular, so users can plug in new synthesis engines over time.
Current Model: Gaussian Copula (GC)
The Gaussian Copula Synthesizer is the first and default model in VeriSynth Core.
It is a fast, lightweight model for tabular synthetic data, deterministic for a fixed seed, and a good fit for most structured datasets.
How It Works
- Learn relationships: GC maps each column into a continuous Gaussian space, where relationships are easier to model, and learns the pairwise correlations between all columns in your dataset.
- Model dependencies: A multivariate Gaussian distribution is fit to the transformed data. This captures how each variable depends on the others (e.g., age ↔ BMI ↔ blood pressure).
- Sample synthetic data: New synthetic samples are drawn from this learned distribution, then mapped back to the original data space.
- Output verification: The generated dataset is post-processed and evaluated for correlation deltas, privacy risk, and consistency metrics.
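The first three steps can be sketched in plain Python. This is a minimal illustration of the general Gaussian copula technique, not VeriSynth's implementation: the function names (`to_gaussian`, `cholesky`, `fit_and_sample`) are hypothetical, and modeling each column's marginal with its empirical ranks is an assumption made to keep the example dependency-free.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def to_gaussian(column):
    """Step 1: map a column to normal scores through its empirical ranks."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is defined
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def cholesky(m):
    """Lower-triangular factor of a positive-definite matrix."""
    size = len(m)
    low = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(m[i][i] - s)
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return low


def fit_and_sample(columns, n_rows, seed):
    """Fit a Gaussian copula to numeric columns and draw n_rows synthetic rows."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    scores = [to_gaussian(c) for c in columns]
    # Step 2: the correlation matrix of the scores defines the multivariate Gaussian
    corr = [[pearson(a, b) for b in scores] for a in scores]
    low = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    synthetic = [[] for _ in columns]
    for _ in range(n_rows):
        z = [rng.gauss(0.0, 1.0) for _ in columns]  # independent normal draws
        for j, col in enumerate(sorted_cols):
            # Step 3: correlate the draws, then map back via the empirical quantiles
            g = sum(low[j][k] * z[k] for k in range(j + 1))
            u = STD_NORMAL.cdf(g)
            idx = min(int(u * len(col)), len(col) - 1)
            synthetic[j].append(col[idx])
    return synthetic
```

Calling `fit_and_sample([age, bmi], 2000, seed=42)` on two numeric columns returns two synthetic columns whose values are drawn from the originals' observed ranges. Because this sketch maps back through empirical quantiles, it can only emit values already present in the real column; a production engine would more likely fit smooth marginal distributions.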
Example
Let’s say your real dataset has these correlations:
| Variable Pair | Correlation |
|---|---|
| Age ↔ BMI | 0.42 |
| BMI ↔ Systolic BP | 0.61 |
| Age ↔ Glucose | 0.58 |
The synthetic dataset produced by VeriSynth will preserve these relationships closely, often within a correlation delta (synthetic correlation minus real correlation) of ±0.1, which is enough to be statistically realistic for most analytical and ML tasks.
Instead of random noise, GC learns your dataset’s statistical DNA and regenerates it faithfully.
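To make the ±0.1 tolerance concrete, here is a small sketch of a per-pair delta check. The real correlations come from the table above; the synthetic correlations are hypothetical values invented for illustration, and the helper names are not part of VeriSynth.

```python
# Real correlations, taken from the table above.
real = {
    ("Age", "BMI"): 0.42,
    ("BMI", "Systolic BP"): 0.61,
    ("Age", "Glucose"): 0.58,
}

# Hypothetical synthetic correlations, invented purely for illustration.
synthetic = {
    ("Age", "BMI"): 0.39,
    ("BMI", "Systolic BP"): 0.66,
    ("Age", "Glucose"): 0.51,
}


def correlation_deltas(real, synthetic):
    """Synthetic minus real correlation for every variable pair."""
    return {pair: synthetic[pair] - real[pair] for pair in real}


def mean_abs_delta(deltas):
    """Average magnitude of the deltas across all pairs."""
    return sum(abs(d) for d in deltas.values()) / len(deltas)


deltas = correlation_deltas(real, synthetic)
within_tolerance = all(abs(d) <= 0.1 for d in deltas.values())
```

With these illustrative numbers every pair falls inside the tolerance, and the mean absolute delta is about 0.05.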
Strengths
| Feature | Description |
|---|---|
| Lightweight | Runs entirely on CPU — no GPUs required |
| Deterministic | Same seed → same result, ideal for reproducibility |
| Explainable | Transparent mathematical model (no black box) |
| Fast | Generates 1M+ rows in seconds |
| Auditable | Outputs proof receipts with correlation metrics |
Limitations
| Limitation | Description |
|---|---|
| Linear correlations only | GC struggles with nonlinear or multimodal relationships |
| Limited categorical complexity | High-cardinality categories may be oversimplified |
| No temporal or sequential modeling | Best for static, tabular datasets |
Roadmap: Upcoming Models
VeriSynth Core is model-agnostic — future releases will expand into deep generative and privacy-enhanced models.
| Model | Type | Status | Description |
|---|---|---|---|
| CTGAN | GAN-based tabular | 🧩 Planned | Captures nonlinear relationships and rare categories |
| TVAE | Variational Autoencoder | 🧩 Planned | Produces smoother synthetic distributions and uncertainty estimates |
| DP-GC | Differentially Private Gaussian Copula | 🧩 Planned | Adds formal privacy bounds (ε) to correlation modeling |
| TimeSynth | Sequential/time-series | 🧩 Planned | Models synthetic patient timelines, transactions, or sensor data |
| ImageSynth | Generative Vision | 🚧 Exploratory | Extends proof-based synthesis to visual data |
Model Selection Philosophy
We believe every synthetic data generator should be:
- Understandable — transparent about how it learns and samples
- Reproducible — deterministic seeds, reproducible metrics
- Verifiable — accompanied by measurable fidelity and privacy proofs
- Modular — easy to swap models as new techniques evolve
VeriSynth is built as a modular framework, so you can run:
verisynth data/patients.csv -o out/ --model gaussian_copula
# (future)
verisynth data/patients.csv -o out/ --model ctgan
Model Registry (Coming Soon)
We’re working on a model registry system that will let users:
- View available models (verisynth models list)
- Inspect metadata, dependencies, and required hardware
- Register custom synthesis engines via plugin
- Benchmark models on fidelity vs privacy
Example:
verisynth models list
# → gaussian_copula, ctgan, tvae, dp_gc
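Since the registry has not shipped yet, its real interface is unknown. The following is only a sketch of the general plugin-registry pattern such a system might follow; `register_model`, `list_models`, `create_model`, and `GaussianCopulaStub` are all hypothetical names.

```python
from typing import Callable, Dict, List

# Hypothetical registry: maps a model name to a factory that builds an engine.
_REGISTRY: Dict[str, Callable[[], object]] = {}


def register_model(name: str):
    """Decorator that registers a synthesis engine under a unique name."""
    def decorator(factory):
        if name in _REGISTRY:
            raise ValueError(f"model {name!r} is already registered")
        _REGISTRY[name] = factory
        return factory
    return decorator


def list_models() -> List[str]:
    """Names of all registered engines, sorted alphabetically."""
    return sorted(_REGISTRY)


def create_model(name: str):
    """Instantiate a registered engine, or fail with the list of valid names."""
    if name not in _REGISTRY:
        raise ValueError(f"unknown model {name!r}; available: {list_models()}")
    return _REGISTRY[name]()


@register_model("gaussian_copula")
class GaussianCopulaStub:
    """Placeholder engine; a real one would implement fitting and sampling."""
    engine = "gaussian_copula"
```

Keeping the lookup behind a single function means a CLI flag such as `--model` only needs to pass a string, and a custom engine becomes available by importing the module that registers it.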
Model Validation Metrics
Each model is evaluated using:
| Metric | Description |
|---|---|
| Correlation delta (Δ) | Measures difference in variable relationships |
| KS test (p-value) | Checks distributional similarity |
| Naive re-identification risk | Approximates privacy exposure |
| Synthetic utility | Performance of ML models trained on synthetic data |
These metrics are logged to proof.json for transparency and reproducibility.
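Two of these metrics are simple enough to sketch directly. The code below computes the two-sample Kolmogorov-Smirnov statistic (the p-value is left out) and an exact-match rate. Treating "naive re-identification risk" as the share of synthetic rows that duplicate a real row is an assumption for illustration; VeriSynth's actual definition may differ, and both function names are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def exact_match_rate(real_rows, synthetic_rows):
    """Share of synthetic rows that reproduce a real row exactly (assumed definition)."""
    real_set = set(map(tuple, real_rows))
    hits = sum(tuple(row) in real_set for row in synthetic_rows)
    return hits / len(synthetic_rows)
```

A KS statistic near 0 means the real and synthetic columns have nearly identical distributions, while a value near 1 means they barely overlap.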
Example CLI Run (GC Model)
verisynth data/finance.csv -o out/ --rows 50000 --model gaussian_copula --seed 42
Produces:
📁 out/synthetic.csv
🧾 out/proof.json
With proof:
{
  "model": { "engine": "gaussian_copula", "seed": 42 },
  "metrics": { "corr_mean_abs_delta": 0.15, "naive_reid_risk": 0.0 }
}
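A receipt like this can gate a downstream pipeline. The sketch below assumes the proof has exactly the shape shown above; the `check_proof` helper and the threshold values are hypothetical and should be tuned to your own requirements.

```python
import json

# The receipt shown above, embedded as text so the example is self-contained.
PROOF_TEXT = """
{
  "model": { "engine": "gaussian_copula", "seed": 42 },
  "metrics": { "corr_mean_abs_delta": 0.15, "naive_reid_risk": 0.0 }
}
"""


def check_proof(proof, max_corr_delta, max_reid_risk):
    """Return a list of failure messages; an empty list means the receipt passes."""
    metrics = proof["metrics"]
    failures = []
    if metrics["corr_mean_abs_delta"] > max_corr_delta:
        failures.append(
            f"corr_mean_abs_delta {metrics['corr_mean_abs_delta']} exceeds {max_corr_delta}"
        )
    if metrics["naive_reid_risk"] > max_reid_risk:
        failures.append(
            f"naive_reid_risk {metrics['naive_reid_risk']} exceeds {max_reid_risk}"
        )
    return failures


proof = json.loads(PROOF_TEXT)
# Hypothetical thresholds, chosen only for illustration.
failures = check_proof(proof, max_corr_delta=0.2, max_reid_risk=0.01)
```

In practice you would load `out/proof.json` from disk instead of the embedded string and fail the job when the returned list is non-empty.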
Summary
| Principle | Description |
|---|---|
| Start simple | GC gives accurate, transparent tabular synthesis |
| Build modularly | New models can plug into the same proof pipeline |
| Stay verifiable | Every model, no matter how complex, will always produce a proof receipt |