ContinuousBench: does DP synthetic text actually transfer knowledge?
Co-led by Peihan Liu and Alex Bie; joint work with Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, and Lillian Tsai.
TL;DR. We built ContinuousBench, a benchmark for whether DP synthetic text retains the corpus-specific knowledge that motivated using a sensitive dataset in the first place. Existing benchmarks like IMDB and OpenReview are saturated — train-on-real beats no-training by only a few points (51% → 62% on OpenReview, 94% → 97% on IMDB), leaving no room to distinguish DP synthesis methods.
On ContinuousBench, training on the real corpus lifts a Gemma 3 4B model from ~1% → ~97% on Geminon and from ~14% → ~70% on News. Yet with differential privacy, even at ε = 100, SOTA DP synthesis only reaches ~4% on Geminon and ~26% on News, nowhere near train-on-real.
What's wrong with current evaluations for DP synthetic text?
The pitch for DP text synthesis is simple: take a sensitive corpus, e.g. internal docs, and release synthetic text that others can safely use, for example to train models. But looking similar is not enough. The goal is to preserve the capabilities we would have gained from training on the real data: if the real data would teach a model to answer questions about internal knowledge, the synthetic data should too.
Existing evaluations often miss this for three reasons:
- The knowledge is already in the base model. Modern LMs can solve IMDB, Yelp, OpenReview, and other familiar classification-style benchmarks out of the box, even at ~1B scale. Hence, if a modern pretrained LM gets good numbers after training on DP synthetic data, that does not necessarily mean the synthetic data taught it anything new.
- The metrics reward style, not knowledge. MAUVE, BLEU, classification accuracy, etc. can all improve when the synthetic text only gets the right "vibe". A DP synthetic copy can score well on these metrics while losing the specific facts in the source corpus, so good metric scores do not necessarily mean real information transfer.
- Teacher-student distillation muddies the signal. Synthetic data is often generated by a stronger model and then used to train a weaker model for eval. The gains may come from the stronger model teaching the weaker one, rather than from knowledge learned from the private corpus.
The visible symptom is saturation. On the benchmarks the field actually uses, every method sits within a few points of the ceiling.
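To make the saturation point concrete, here is a small sketch that computes the headroom (train-on-real accuracy minus no-training accuracy) from the numbers quoted in the TL;DR. The function name `headroom` and the dictionary layout are illustrative choices, not part of the benchmark's harness, and the Geminon and News figures are the approximate values given above.

```python
# Accuracy (%) without training and after training on the real corpus,
# as quoted in the TL;DR. Geminon and News values are approximate ("~").
BENCHMARKS = {
    "OpenReview": (51, 62),
    "IMDB": (94, 97),
    "Geminon": (1, 97),
    "News": (14, 70),
}


def headroom(no_training: float, train_on_real: float) -> float:
    """Accuracy points gained by training on the real corpus."""
    return train_on_real - no_training


if __name__ == "__main__":
    for name, (base, real) in BENCHMARKS.items():
        print(f"{name:<10} headroom = {headroom(base, real):.0f} points")
```

A benchmark with only a few points of headroom cannot separate a good DP synthesis method from a bad one, since both must land inside that narrow band.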
ContinuousBench
Each release packages three things: a fresh training corpus, a derived short-answer QA set, and a standardized training/eval harness. The benchmark is built around three design goals:
- Freshness. The corpus is constructed to be unseen by the tested model. In Geminon, we procedurally generate a Pokémon-inspired fictional world, and generate articles from it with Gemini. In News, we source news articles after the training cutoff from periodically released CommonCrawl-News dumps. The QA set tests knowledge from that corpus.
- Grounded QA. Each question is linked to supporting records in the corpus and has a normalized short answer, so evaluation can be automatic and verifiable.
- DP-learnable facts by construction. Each main test question is supported by at least 200 independent records, so the target is population-level knowledge rather than singleton memorization. We also keep a singleton split, where each fact appears in only one record, as a privacy sanity check.
To avoid teacher-student confounds, we use the same base checkpoint for the generator and the downstream evaluator. Geminon gives us a fully controlled fictional domain for studying DP learnability, while News gives us a continuously collected real-world corpus with natural messiness and time-evolving facts.
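The support-count rule above can be sketched as a simple filter. This is a minimal illustration, not the released harness: the `QAItem` type and `split_by_support` function are hypothetical, counting distinct record IDs is used here as a stand-in for "independent records", and dropping questions with between 2 and 199 supports is an assumption made for the example.

```python
from typing import TypedDict


class QAItem(TypedDict):
    question: str
    answer: str
    supports: list[int]  # IDs of corpus records that support the answer


MIN_SUPPORT = 200  # threshold stated in the post for main test questions


def split_by_support(items: list[QAItem]) -> tuple[list[QAItem], list[QAItem]]:
    """Split QA items into a main split and a singleton split.

    Assumption: distinct record IDs approximate independent records.
    Items with 2..MIN_SUPPORT-1 supports are dropped in this sketch.
    """
    main: list[QAItem] = []
    singleton: list[QAItem] = []
    for item in items:
        n_support = len(set(item["supports"]))
        if n_support >= MIN_SUPPORT:
            main.append(item)
        elif n_support == 1:
            singleton.append(item)
    return main, singleton
```

The main split measures population-level knowledge that a DP method should be able to learn, while a method that scores well on the singleton split is a warning sign for the privacy guarantee.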
Example Geminon record:
- name: Boreling
- classification: Frost Geminon
- type: ice
- ability: Berserk
- stats: HP 69 / Atk 60 / Def 63 / SpA 67 / SpD 68 / Spe 40 / BST 367
- size: 12 m, 52 lbs
- evolution: Boreling → Borelash → Borastat
- move: Powder Snow - Has a chance to freeze the target
- text: Boreling is an ice-type Geminon, also known as the Frost Geminon, and it is the first form of a three-stage evolution line that includes Borelash and Borastat. Physically, it is 12 meters tall and weighs 52 lbs. Boreling has a base stat total of 367, with its individual stats being 69 HP, 60 attack, 63 defense, 67 special attack, 68 special defense, and 40 speed. Its ability is Berserk, and it can learn the move Powder Snow, which has a chance to inflict the frozen status on an opponent.
- type: wiki
- tag: name, classification, type, ability, all six stats, evolution_line, size, move
Example Geminon QA items:
- Q: What is the classification of Boreling?
  - A: Frost Geminon
  - supports: 7410, 8929, 10420, … # articles that support this answer
- Q: What are the types of Boreling?
  - A: Ice
  - supports: 3, 1515, 5964, … # articles that support this answer
Example News record:
- url: hindustantimes.com/cricket/pakistan-knock-on-iccs-door-…
- date: 2025-09-15
- title: Pakistan knocks on ICC's door, demands immediate removal of match referee…
- text: Pakistan knocks on ICC's door, demands immediate removal of match referee for staying silent on India's no handshake. The PCB has escalated India's no-handshake controversy to the ICC, asking for the immediate removal of match referee Andy Pycroft…
- question: Who was the match referee for the India vs. Pakistan Asia Cup match on September 14, 2025?
- answer: Andy Pycroft
- support_count: 648
- closedbook: Javagal Srinath ✘ # zeroshot Gemini
- openbook:
  - "Not mentioned" ✘ # Gemini with context of articles 126152…
  - "Andy Pycroft" ✓ # Gemini with context of articles 747116
The gap under differential privacy
The main result is simple: synthesis works, while DP synthesis mostly does not. In the standard “fine-tune then sample” pipeline, public synthetic data (ε = ∞) nearly matches the real corpus. With DP, accuracy collapses. On Geminon, DP synthesis reaches only 4% at ε = 100 and 3% at ε = 10. On News, it reaches 26% at ε = 100 and 20% at ε = 10. Both results are barely above no-training. Private Evolution, a promising training-free DP method, does not close the gap either: it falls below no-training on both Geminon and News.
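One way to read these numbers is as the fraction of the train-on-real gain that a method recovers. The sketch below computes that ratio from the approximate accuracies quoted in this post. The "gap recovered" metric and the function name `gap_recovered` are assumptions introduced here for illustration, not a metric defined by the benchmark.

```python
def gap_recovered(no_training: float, train_on_real: float, method: float) -> float:
    """Fraction of the no-training -> train-on-real gain that a method recovers.

    0.0 means no better than no-training, 1.0 means it matches train-on-real,
    and negative values mean the method falls below no-training.
    """
    gain = train_on_real - no_training
    if gain == 0:
        raise ValueError("train-on-real and no-training accuracies are equal")
    return (method - no_training) / gain


if __name__ == "__main__":
    # (no-training %, train-on-real %, DP at eps=100 %, DP at eps=10 %),
    # using the approximate figures quoted in this post.
    results = {
        "Geminon": (1, 97, 4, 3),
        "News": (14, 70, 26, 20),
    }
    for name, (base, real, dp_100, dp_10) in results.items():
        print(
            f"{name}: eps=100 -> {gap_recovered(base, real, dp_100):.1%}, "
            f"eps=10 -> {gap_recovered(base, real, dp_10):.1%}"
        )
```

Under these figures, DP synthesis at ε = 100 recovers roughly 3% of the available gain on Geminon and roughly 21% on News.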
The pattern is clear. Today's DP synthetic text can preserve the form of a corpus (its style, topics, and document structure) while losing the factual content that makes the data valuable. On Geminon, for example, going from ε = ∞ to ε = 10 drops MAUVE by only 0.10 but drops QA accuracy by 89%. If a method cannot recover facts repeated across hundreds of independent records, it has not really preserved the corpus :(
Release. Closing this gap is a central challenge for DP text synthesis, and a key roadblock for deploying DP in the AI era. We release the corpora and QA sets, along with the evaluation harness, and we hope ContinuousBench can serve as a yardstick for progress. More curation and experimental details are in the paper.