ContinuousBench: does DP synthetic text actually transfer knowledge?
TL;DR. We built ContinuousBench, a benchmark that measures whether differentially private (DP) synthetic text retains the corpus-specific knowledge that motivated using a sensitive dataset in the first place. On standard benchmarks like IMDB sentiment and OpenReview, every method scores roughly the same (mid-90s accuracy), leaving no room to distinguish them. On ContinuousBench, training on the real corpus jumps a Gemma 3 4B model from ~1% to ~96% on the Geminon track and from ~14% to ~70% on the News track. SOTA DP synthesis barely closes that gap, even at ε = 100. We release a corpus, QA set, and standardized harness, and we regenerate the benchmark quarterly to stay contamination-free.
What's wrong with current evaluations
DP text synthesis promises a way to learn from sensitive corpora — clinical notes, legal records, internal documents — through a privacy-preserving release. But the question that matters isn't "does the output look like the original?" It's: does it transmit the capabilities you would have gotten by training on the real data?
Existing evaluations don't answer this, for three reasons:
- The knowledge is already in the base model. IMDB sentiment, classification-style tasks on familiar domains — modern pretrained LLMs already solve these out of the box. Strong downstream numbers don't prove the corpus mattered; they prove pretraining was enough.
- The metrics measure style, not knowledge. MAUVE, BLEU, next-token loss, distributional similarity — all reward stylistic matching. A DP generator can ace these while losing every specific fact in the source corpus.
- Teacher–student distillation contaminates the signal. Synthetic data is usually generated by a model much stronger than the one trained on it. The downstream gains then conflate corpus signal with ordinary distillation from a more capable teacher.
The visible symptom is saturation. On the benchmarks the field actually uses, every method sits within a few points of the ceiling.
ContinuousBench
Each release packages three things: a fresh training corpus, a derived short-answer QA set, and a standardized training/eval harness. Three design choices fall out of the diagnosis above:
- Freshness. QA tests knowledge unavailable to the base model — either procedurally generated fictional facts (the Geminon track, a Pokémon-inspired invented world) or post-cutoff news (the News track, from CC-News).
- Grounded QA, not style. Every question has a normalized short answer scored automatically. No MAUVE-style proxies.
- DP-learnable by construction. Every test question is supported by ≥200 independent records, so the target is population-level knowledge rather than singleton memorization. A singleton split is kept as a privacy sanity check.
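The "grounded QA, not style" choice means scoring reduces to a normalized exact match on short answers. The exact normalization the harness uses isn't spelled out here, but a SQuAD-style comparison (lowercase, strip punctuation and articles, collapse whitespace) is a reasonable sketch; `normalize` and `exact_match` below are illustrative names, not the harness API:

```python
import re
import string


def normalize(ans: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    ans = ans.lower()
    ans = "".join(ch for ch in ans if ch not in string.punctuation)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return " ".join(ans.split())


def exact_match(prediction: str, gold: str) -> bool:
    """Score a short answer automatically: 1 if normalized strings agree."""
    return normalize(prediction) == normalize(gold)


print(exact_match("The Frost Geminon.", "frost geminon"))  # True
```

Because answers are short, normalized strings like these avoid the failure mode of MAUVE-style metrics: a stylistically perfect paragraph with the wrong fact scores zero.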
To eliminate teacher–student confounds, the harness uses the same base checkpoint for the generator and the downstream model. The two tracks differ in how the corpus is seeded — Geminon from a procedurally generated index, News from CC-News articles:
{
"name": "Boreling",
"classification": "Frost Geminon",
"type1": "ice",
"type2": null,
"ability": "Berserk",
"hp": 69,
"attack": 60,
"defense": 63,
"special attack": 67,
"special defense": 68,
"speed": 40,
"base_stat_total": 367,
"weight": 52,
"height": 12,
"idx": 10003,
"evolution_line": ["Boreling", "Borelash", "Borastat"],
"move": {
"name": "Powder Snow",
"short_description": "Has a chance to freeze the target."
}
}
{
"text": "As the first stage in the evolutionary line that culminates in Borastat, Boreling is a quintessential Frost Geminon. This pure ice-type, known as Gemidex entry #10003, measures 12 meters in height and weighs 52 lbs. Its passive ability is Berserk, and it can be taught the move Powder Snow, which has a percentage-based chance to freeze an opponent. Boreling's base stat total of 367 is comprised of 69 HP, 60 attack, 63 defense, 67 special attack, 68 special defense, and a speed rating of 40.",
"tag": [
{
"idx": 10003,
"info": [
"idx",
"name",
"classification",
"type1",
"ability",
"evolution_line",
"height",
"weight",
"base_stat_total",
"hp",
"attack",
"defense",
"special attack",
"special defense",
"speed",
"move.name",
"move.short_description"
]
}
],
"type": "wiki"
}
{
"question": "What is the classification of Boreling?",
"answer": "Frost Geminon"
}
{
"question": "What are the types of Boreling?",
"answer": "Ice"
}
...
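Because the Geminon index is procedural, QA pairs can be derived mechanically from each record. A minimal sketch, assuming simple question templates over the fields shown in the example record (the actual generation pipeline may be richer):

```python
# Hypothetical template-based QA derivation from a Geminon index record.
# Field names match the example record above; qa_pairs is an illustrative
# helper, not part of the released harness.
record = {
    "name": "Boreling",
    "classification": "Frost Geminon",
    "type1": "ice",
    "type2": None,
    "base_stat_total": 367,
}


def qa_pairs(rec: dict) -> list[dict]:
    types = rec["type1"] if rec["type2"] is None else f"{rec['type1']}/{rec['type2']}"
    return [
        {"question": f"What is the classification of {rec['name']}?",
         "answer": rec["classification"]},
        {"question": f"What are the types of {rec['name']}?",
         "answer": types.title()},
        {"question": f"What is the base stat total of {rec['name']}?",
         "answer": str(rec["base_stat_total"])},
    ]


for qa in qa_pairs(record):
    print(qa["question"], "->", qa["answer"])
```

Since the facts are invented, any correct answer must have come from the corpus (or a synthetic copy of it), never from pretraining.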
{
"url": "https://www.hindustantimes.com/cricket/pakistan-knock-on-iccs-door-...",
"hostname": "www.hindustantimes.com",
"title": "Pakistan knocks on ICC's door, demands immediate removal of match referee...",
"date": "2025-09-15",
"crawl_date": "2025-09-15T09:31:13Z",
"language": "en",
"text": "Pakistan knocks on ICC's door, demands immediate removal of match referee for staying silent on India's no handshake. The PCB has escalated India's no-handshake controversy to the ICC, asking for the immediate removal of match referee Andy Pycroft..."
}
{
"question": "Who was the match referee for the India vs. Pakistan Asia Cup match on September 14, 2025?",
"answer": "Andy Pycroft",
"support_count": 648,
"closedbook_gemini-2.5-pro": {
"answer": "Javagal Srinath",
"is_correct": false
},
"openbook_gemini-2.5-flash-lite": [
{ "article_id": 126152, "answer": "Not mentioned", "is_correct": false },
...
{ "article_id": 1713242, "answer": "Not mentioned", "is_correct": false }
]
}
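The extra fields in the News record above imply a two-sided filter: a question survives only if a strong model fails it closed-book (so pretraining alone can't answer it) and enough independent articles attest the answer (so it is DP-learnable rather than a singleton). A hedged sketch of that filter, where the 200-record threshold comes from the design-choice list and the simplified `closedbook` key stands in for the per-model fields in the record:

```python
# Hypothetical sketch of the News-track question filter implied by the
# record fields above. keep_question is an illustrative helper, not the
# released harness API.
MIN_SUPPORT = 200  # "supported by >= 200 independent records"


def keep_question(q: dict) -> bool:
    fresh = not q["closedbook"]["is_correct"]      # base knowledge fails closed-book
    learnable = q["support_count"] >= MIN_SUPPORT  # population-level, not singleton
    return fresh and learnable


q = {
    "question": "Who was the match referee ...?",
    "support_count": 648,
    "closedbook": {"answer": "Javagal Srinath", "is_correct": False},
}
print(keep_question(q))  # True
```

Questions that fail the support threshold are not discarded outright: per the design-choice list, a singleton split is retained as a privacy sanity check.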
Results
The headline number is the gap between non-private and DP synthesis. Non-private synthesis (ε = ∞) nearly matches training on the real corpus — 92.5% vs 96.4% on Geminon, 65.5% vs 70.4% on News. SOTA DP synthesis collapses: 13.7% / 20.6% at ε = 100, and barely above no-training at ε = 10. Private Evolution — a popular training-free DP method — falls below no-training on Geminon, because it can only resample from the base model's proposal distribution, which doesn't contain the fresh facts we test.
Put differently: DP generators today preserve the form of a corpus — its style, topics, document structure — while losing the specific factual content that makes the data valuable. Between ε = ∞ and ε = 10, MAUVE on Geminon drops by 0.10; QA accuracy drops by 88.6 percentage points. A method that cannot recover information attested across hundreds of independent records has not meaningfully preserved the corpus, and closing that gap is the central open problem for DP text synthesis. We hope ContinuousBench can serve as a yardstick.
The corpus, QA sets, and eval harness are released. See the paper for further curation and experimental details.