
ContinuousBench: does DP synthetic text actually transfer knowledge?

Co-led by Peihan Liu and Alex Bie; joint work with Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, and Lillian Tsai.

TL;DR. We built ContinuousBench, a benchmark for whether DP synthetic text retains the corpus-specific knowledge that motivated using a sensitive dataset in the first place. Existing benchmarks like IMDB and OpenReview are saturated — train-on-real beats no-training by only a few points (51% → 62% on OpenReview, 94% → 97% on IMDB), leaving no room to distinguish DP synthesis methods.

On ContinuousBench, training on the real corpus lifts a Gemma 3 4B model from ~1% → ~97% on Geminon and from ~14% → ~70% on News. Yet under differential privacy, even at ε = 100, SOTA DP synthesis reaches only ~4% on Geminon and ~26% on News, nowhere near train-on-real.

What's wrong with current evaluations for DP synthetic text?

The pitch for DP text synthesis is simple: take a sensitive corpus, e.g. internal docs, and release synthetic text that others can safely use, for example to train models. But looking similar to the real data is not enough. The goal is to preserve the capabilities we would have gained from training on the real data: if the real data would teach a model to answer questions about internal knowledge, the synthetic data should too.

Existing evaluations often miss this for three reasons:

  1. The knowledge is already in the base model. Modern LMs, even at ~1B scale, can solve IMDB, Yelp, OpenReview, and other familiar classification-style benchmarks out of the box. Hence, if a modern pretrained LM gets good numbers after training on DP synthetic data, that does not necessarily mean the synthetic data taught it anything new.
  2. The metrics reward style, not knowledge. MAUVE, BLEU, classification accuracy, etc. can all improve when the synthetic text only gets the right "vibe". A DP synthetic copy can score well on these metrics while losing the specific facts in the source corpus, so good metric scores do not necessarily mean real information transfer.
  3. Teacher-student distillation muddies the signal. Synthetic data is often generated by a stronger model and then used to train a weaker model for eval. The gains may come from the stronger model teaching the weaker one, rather than from knowledge learned from the private corpus.

The visible symptom is saturation. On the benchmarks the field actually uses, every method sits within a few points of the ceiling:

Accuracy (%)      No training   Train on DP-synth   Train on real
OpenReview        51            55                  62
IMDB              94            96                  97
Geminon (ours)    1             3                   97
News (ours)       14            20                  70
Standard benchmarks are nearly saturated: the base model (Gemma 3 4B) is already strong. ContinuousBench creates headroom to measure whether synthetic data transfers new knowledge.
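
To make the saturation concrete, here is a minimal sketch that computes each benchmark's headroom, meaning the percentage points gained by training on the real corpus instead of not training at all. The numbers are the ones in the figure; the `headroom` helper and the `RESULTS` layout are illustrative names, not part of the released harness.

```python
# Accuracies (%) from the figure: (no training, train on DP-synth, train on real).
RESULTS = {
    "OpenReview": (51, 55, 62),
    "IMDB": (94, 96, 97),
    "Geminon": (1, 3, 97),
    "News": (14, 20, 70),
}


def headroom(no_training: float, train_on_real: float) -> float:
    """Percentage points a model can gain by training on the real corpus."""
    return train_on_real - no_training


# Hypothetical summary structure: benchmark name -> headroom in points.
HEADROOM = {name: headroom(base, real) for name, (base, _, real) in RESULTS.items()}

# Print benchmarks from least to most headroom.
for name, points in sorted(HEADROOM.items(), key=lambda kv: kv[1]):
    print(f"{name:<11} headroom: {points:>3} points")
```

A benchmark with only a few points of headroom cannot separate a good DP synthesis method from a bad one, since both must land inside that narrow band.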

ContinuousBench

Each release packages three things: a fresh training corpus, a derived short-answer QA set, and a standardized training/eval harness. The benchmark is built around three design goals:

To avoid teacher-student confounds, we use the same base checkpoint for the generator and the downstream evaluator. Geminon gives us a fully controlled fictional domain for studying DP learnability, while News gives us a continuously collected real-world corpus with natural messiness and time-evolving facts.

Source article (post-cutoff news collected from CC-News, deduplicated):
  url: hindustantimes.com/cricket/pakistan-knock-on-iccs-door-…
  date: 2025-09-15
  title: Pakistan knocks on ICC's door, demands immediate removal of match referee…
  text: Pakistan knocks on ICC's door, demands immediate removal of match referee for staying silent on India's no handshake. The PCB has escalated India's no-handshake controversy to the ICC, asking for the immediate removal of match referee Andy Pycroft…

Pipeline: cluster, generate QA, then verify support with open-book retrieval.

QA pair with closed- and open-book baselines:
  question: Who was the match referee for the India vs. Pakistan Asia Cup match on September 14, 2025?
  answer: Andy Pycroft
  support_count: 648
  closedbook: Javagal Srinath  # zeroshot Gemini
  openbook:
    "Not mentioned"  # Gemini with context of articles 126152
    "Andy Pycroft"  # Gemini with context of articles 747116
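
As a rough illustration of how a short-answer QA record like the one above might be represented and scored, here is a sketch in Python. The field names mirror the example record; the `QAPair` class, the `normalize` function, and the use of normalized exact match are assumptions made for illustration and may differ from what the released harness does.

```python
import re
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    """Hypothetical container for one QA record; fields mirror the example above."""
    question: str
    answer: str
    support_count: int  # value copied from the example record


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, pair: QAPair) -> bool:
    """True if the prediction equals the reference answer after normalization."""
    return normalize(prediction) == normalize(pair.answer)


example = QAPair(
    question=(
        "Who was the match referee for the India vs. Pakistan "
        "Asia Cup match on September 14, 2025?"
    ),
    answer="Andy Pycroft",
    support_count=648,
)

# Score the three baseline outputs shown in the example record.
closed_book = exact_match("Javagal Srinath", example)
open_book_miss = exact_match('"Not mentioned"', example)
open_book_hit = exact_match('"Andy Pycroft"', example)
```

Under this assumed metric the closed-book guess and the open-book answer with unhelpful context both score as wrong, and only the open-book answer with the supporting article scores as correct.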

The gap under differential privacy

The main result is simple: synthesis works, while DP synthesis mostly does not. In the standard “fine-tune then sample” pipeline, public synthetic data (ε = ∞) nearly matches the real corpus. With DP, accuracy collapses. On Geminon, DP synthesis reaches only 4% at ε = 100 and 3% at ε = 10. On News, it reaches 26% at ε = 100 and 20% at ε = 10. In both cases that is far closer to no-training than to train-on-real. Private Evolution, a promising training-free DP method, does not close the gap either: it falls below no-training on both Geminon and News.

Accuracy (%)     Geminon   News
No training      1         14
DP-Syn ε=10      3         20
DP-Syn ε=100     4         26
Syn (ε=∞)        93        66
Train on real    97        70
Downstream QA accuracy for the standard “fine-tune then sample” synthesis pipeline. Non-private synthesis nearly matches the real corpus, while DP synthesis falls far behind. Results use Gemma 3 4B as generator and evaluator.
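
One way to read these numbers is as the share of the no-training to train-on-real gap that each method recovers. The sketch below computes that share from the accuracies in the figure; the `gap_recovered` name and this particular normalization are illustrative choices, not a metric defined by the benchmark.

```python
# Accuracies (%) from the figure.
NO_TRAINING = {"Geminon": 1, "News": 14}
TRAIN_ON_REAL = {"Geminon": 97, "News": 70}
METHODS = {
    "DP-Syn eps=10": {"Geminon": 3, "News": 20},
    "DP-Syn eps=100": {"Geminon": 4, "News": 26},
    "Syn eps=inf": {"Geminon": 93, "News": 66},
}


def gap_recovered(acc: float, base: float, real: float) -> float:
    """Fraction of the (real - base) gap that a method closes; 0 = no-training, 1 = train-on-real."""
    return (acc - base) / (real - base)


# Hypothetical summary structure: method -> corpus -> fraction of gap recovered.
RECOVERED = {
    method: {
        corpus: gap_recovered(acc, NO_TRAINING[corpus], TRAIN_ON_REAL[corpus])
        for corpus, acc in scores.items()
    }
    for method, scores in METHODS.items()
}

for method, per_corpus in RECOVERED.items():
    cells = ", ".join(f"{corpus}: {frac:.0%}" for corpus, frac in per_corpus.items())
    print(f"{method:<15} {cells}")
```

Read this way, non-private synthesis recovers over 90% of the gap on both corpora, while DP synthesis at ε = 100 recovers about 3% on Geminon and about 21% on News.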

The pattern is clear. Today's DP synthetic text can preserve the form of a corpus (its style, topics, and document structure) while losing the factual content that makes the data valuable. On Geminon, for example, going from ε = ∞ to ε = 10 only drops MAUVE by 0.10, but drops QA accuracy by 89%. If a method cannot recover facts repeated across hundreds of independent records, it has not really preserved the corpus :(

Release. Closing this gap is a central challenge for DP text synthesis, and a key roadblock for deploying DP in the AI era. We release the corpora and QA sets, along with the evaluation harness, and we hope ContinuousBench can serve as a yardstick for progress. More curation and experimental details are in the paper.