The science behind synthetic data

Synthetic respondents have moved from speculation to practice. Peer-reviewed studies in Political Analysis, the Journal of Marketing, and Psychology & Marketing, together with replications by EY, Harvard, MIT Sloan, and Qualtrics, show that calibrated synthetic data can match, and in some cases exceed, traditional human-only research on measures such as correlation with fielded results, test-retest reliability, and theme recovery.

95% correlation in EY's brand-survey replication
90% of human test-retest reliability (arXiv 2025)
77% of human-analyst themes recovered (Journal of Marketing 2025)
THE HEADLINE FINDING

EY replicated its CEO brand survey with 1,000 synthetic personas

95% correlation with the original survey

In a double-blind test, professional services firm EY took its annual Global Brand Survey—aimed at CEOs of US companies with $1B+ in revenue—and ran it twice: once through traditional fielding, once through 1,000 synthetic personas built by Aaru.

The synthetic survey showed a 95% correlation with the traditionally fielded one. EY also recreated its annual Global Wealth Research Report in a single day, with a median correlation above 90% to the original six-month study.

Source: Toni Clayton-Hine, EY CMO, as reported by Solomon Partners (Sept 2025).
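The source does not say how EY's correlation figure was computed. One common way to compare a synthetic survey with a fielded one is to take the same aggregate for each question (for example, the share of respondents who agree) and compute the Pearson correlation across questions. The sketch below does that on made-up numbers; `human_share` and `synthetic_share` are hypothetical and are not EY's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: share of respondents answering "agree" on five questions.
human_share = [0.62, 0.48, 0.71, 0.35, 0.55]
synthetic_share = [0.60, 0.51, 0.69, 0.38, 0.52]

r = pearson(human_share, synthetic_share)
print(f"correlation between human and synthetic results: {r:.2f}")
```

A high correlation on aggregates like these says the two surveys rank and space the questions similarly. It does not by itself show that individual synthetic respondents behave like individual people.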

PEER-REVIEWED RESEARCH

The academic case for synthetic respondents

Three peer-reviewed journal articles and one arXiv preprint make the case that calibrated synthetic data can reproduce human survey responses.

Journal of Marketing
Arora, Chakraborty & Nishimura · 2025 · Vol. 89(2)

AI–Human Hybrids for Marketing Research

The AI–human hybrid generates information-rich, coherent data that surpasses human-only data in depth and insightfulness, and matches human performance in theme generation. The LLM hybrid recovered 77% of the themes identified by human analysts.

DOI: 10.1177/00222429241276529
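A "themes recovered" figure is a recall measure: of the themes the human analysts found, what share also shows up in the AI-assisted output? The sketch below is a minimal illustration that assumes themes have already been normalized to short labels and can be matched by exact text. In practice, deciding whether two themes are the same usually takes human judgment. The theme labels are hypothetical and do not come from the paper.

```python
def theme_recall(human_themes, llm_themes):
    """Share of human-identified themes that also appear in the LLM output."""
    # Assumption: themes match when their labels are equal, ignoring case
    # and surrounding whitespace. Real studies rely on human coders instead.
    human = {t.strip().lower() for t in human_themes}
    llm = {t.strip().lower() for t in llm_themes}
    if not human:
        raise ValueError("human_themes must not be empty")
    return len(human & llm) / len(human)


# Hypothetical theme labels, for illustration only.
human = ["price sensitivity", "brand trust", "convenience", "sustainability"]
llm = ["Brand trust", "convenience", "price sensitivity", "packaging design"]

recall = theme_recall(human, llm)
print(f"{recall:.0%} of human-identified themes recovered")
```

Here three of the four human themes appear in the LLM list, so the recall is 75%. The extra LLM theme ("packaging design") does not affect recall.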
arXiv
Maier et al. · October 2025 · arXiv:2510.08338

LLMs Reproduce Human Purchase Intent via Semantic Similarity

Tested against 9,300 human responses across 57 personal-care surveys, the Semantic Similarity Rating method achieved 90% of human test-retest reliability. The similarity between its response distributions and the human ones, measured with a Kolmogorov–Smirnov statistic, exceeded 0.85.
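To make the Kolmogorov–Smirnov figure concrete, the sketch below compares two sets of 5-point Likert responses. It assumes "similarity" means one minus the KS distance, where the KS distance is the largest gap between the two cumulative distributions. That definition is an assumption made for illustration, and the response counts are hypothetical.

```python
from collections import Counter


def likert_cdf(responses, scale=(1, 2, 3, 4, 5)):
    """Cumulative share of responses at or below each scale point."""
    counts = Counter(responses)
    total = len(responses)
    cdf, running = [], 0
    for point in scale:
        running += counts.get(point, 0)
        cdf.append(running / total)
    return cdf


def ks_similarity(a, b, scale=(1, 2, 3, 4, 5)):
    """1 minus the KS distance (assumed definition of 'similarity')."""
    if not a or not b:
        raise ValueError("both response lists must be non-empty")
    cdf_a = likert_cdf(a, scale)
    cdf_b = likert_cdf(b, scale)
    ks_distance = max(abs(p - q) for p, q in zip(cdf_a, cdf_b))
    return 1.0 - ks_distance


# Hypothetical purchase-intent ratings from 100 respondents each.
human = [1] * 5 + [2] * 10 + [3] * 25 + [4] * 40 + [5] * 20
synthetic = [1] * 3 + [2] * 12 + [3] * 30 + [4] * 38 + [5] * 17

print(f"KS similarity: {ks_similarity(human, synthetic):.2f}")
```

A value of 1.0 means the two distributions are identical. Lower values mean that at some point on the scale the cumulative shares diverge by that much.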

Political Analysis
Argyle et al. · 2023 · Cambridge University Press

Out of One, Many: Using Language Models to Simulate Human Samples

The foundational “silicon samples” paper. When conditioned on sociodemographic backstories, GPT-3 accurately emulates response distributions across human subgroups and replicates real survey results across diverse populations.

DOI: 10.1017/pan.2023.2
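Conditioning a language model on a sociodemographic backstory amounts to placing a short first-person description of the respondent in front of the survey question. The sketch below shows one way such a prompt could be assembled. The field names, sentence templates, and prompt layout are all hypothetical and are not the templates used in the paper, and no model is called.

```python
def build_backstory(profile):
    """Turn a dict of sociodemographic fields into a first-person backstory."""
    # Hypothetical templates; fields missing from the profile are skipped.
    templates = {
        "age": "I am {} years old.",
        "gender": "I am {}.",
        "region": "I live in {}.",
        "education": "My highest level of education is {}.",
        "party": "Politically, I identify as {}.",
    }
    sentences = [
        template.format(profile[field])
        for field, template in templates.items()
        if field in profile
    ]
    return " ".join(sentences)


def build_prompt(profile, question):
    """Combine the backstory and a survey question into a single prompt."""
    return f"{build_backstory(profile)}\n\nQuestion: {question}\nAnswer:"


# Hypothetical respondent profile.
profile = {"age": 42, "region": "Ohio", "party": "an independent"}
prompt = build_prompt(profile, "How closely do you follow national politics?")
print(prompt)
```

Running the same question across many profiles drawn to match a target population is what produces a "silicon sample" whose answers can then be compared with the human subgroup distributions.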
Psychology & Marketing
Sarstedt, Adler, Rau & Schmitt · 2024 · Vol. 41(6)

Using LLMs to Generate Silicon Samples in Consumer & Marketing Research

Establishes formal academic guidelines for silicon sampling. Concludes that synthetic samples hold particular promise in the upstream parts of the research process: qualitative pretesting, pilot studies, and hypothesis generation.

DOI: 10.1002/mar.21982

See the science in action

Generate personas, run a survey or interview, and see what synthetic respondents reveal, all in minutes.