ContinuousBench: does DP synthetic text actually transfer knowledge?

June 1, 2026 · paper · dataset · eval harness

Co-led by Peihan Liu and Alex Bie; joint work with Lucas Rosenblatt, Weiwei Kong, Natalia Ponomareva, Gautam Kamath, Rachel Cummings, Roxana Geambasu, Yu Gan, and Lillian Tsai.

TL;DR. We built ContinuousBench, a regularly regenerated benchmark that tests whether DP synthetic text retains the corpus-specific knowledge that motivated using a sensitive dataset in the first place. Existing benchmarks like IMDB and OpenReview are saturated: train-on-real beats no-training by only 3 to 11 points (51% → 62% on OpenReview, 94% → 97% on IMDB), leaving little room to distinguish DP synthesis methods.

On ContinuousBench, training on the real corpus lifts a Gemma 3 4B model from ~1% to ~97% on Geminon and from ~14% to ~70% on News. Yet with differential privacy, even at ε = 100, SOTA DP synthesis reaches only ~4% on Geminon and ~26% on News, nowhere near train-on-real.

What's wrong with current evaluations for DP synthetic text?

The pitch for DP text synthesis is simple: take a sensitive corpus, e.g. clinical notes, and release synthetic text that others can safely use, for example to train models. But synthetic text that merely looks similar to the original is not enough. The goal is to preserve the capabilities we would have gained from training on the real data: if the real data would teach a model what's in the records and what to do about it, the synthetic data should too.

Existing evaluations often miss this for three reasons:

The knowledge is already in the base model. Modern LMs already solve IMDB, Yelp, OpenReview, and other familiar classification-style benchmarks out of the box, even at ~1B scale. Hence, if a modern pretrained LM gets good numbers after training on DP synthetic data, that does not necessarily mean the synthetic data taught it anything new.
The metrics reward style, not knowledge. MAUVE, BLEU, classification accuracy, etc. can all improve when the synthetic text only gets the right "vibe." A DP synthetic copy can score well on these metrics while losing the specific facts in the source corpus, so good metric scores do not necessarily mean real information transfer.
Teacher-student distillation muddies the signal. Synthetic data is often generated by a stronger model and then used to train a weaker model for evaluation. The gains may come from the stronger model teaching the weaker one, rather than from knowledge learned from the private corpus.

The visible symptom is saturation. On the benchmarks the field actually uses, every method sits within a few points of the ceiling:

Standard benchmarks are nearly saturated: the base model (Gemma 3 4B) is already strong. ContinuousBench creates headroom to measure whether synthetic data transfers new knowledge.

ContinuousBench

ContinuousBench is released on a recurring schedule, so it stays ahead of new model knowledge cutoffs. Each release packages three things: a fresh training corpus, a derived short-answer QA set, and a standardized training/eval harness. The benchmark is built around three design goals:

Freshness. The corpus is constructed to be unseen by the tested model. In Geminon, we procedurally generate a Pokémon-inspired fictional world, and generate articles from it with Gemini. In News, we source news articles after the training cutoff from periodically released CommonCrawl-News dumps. The QA set tests knowledge from that corpus.
Grounded QA. Each question is linked to supporting records in the corpus and has a normalized short answer, so evaluation can be automatic and verifiable.
DP-learnable facts by construction. Each main test question is supported by at least 200 independent records, so the target is population-level knowledge rather than singleton memorization. We also keep a singleton split, where each fact appears in only one record, as a privacy sanity check.
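
As a rough illustration of how normalized short answers and support counts could be used, the sketch below scores a prediction by exact match after normalization and splits questions into a main set (at least 200 supporting records) and a singleton set (exactly one supporting record). The normalization rules, the `QAItem` fields, and all function names are assumptions made for illustration; the released harness may differ.

```python
import re
import string
from dataclasses import dataclass


@dataclass
class QAItem:
    question: str
    answer: str
    support_count: int  # number of corpus records that support the answer


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, strip punctuation and articles, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize(prediction) == normalize(gold)


def split_by_support(items, min_support=200):
    """Main split: facts backed by many records. Singleton split: facts seen once."""
    main = [q for q in items if q.support_count >= min_support]
    singleton = [q for q in items if q.support_count == 1]
    return main, singleton
```

Under this scheme a DP method is expected to do well on the main split, while high accuracy on the singleton split would suggest that individual records are leaking.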

To make sure downstream gains reflect knowledge the synthetic data carries, rather than the generator simply being more capable than the evaluator, we use the same model checkpoint for the generator and the downstream evaluator. Geminon gives us a fully controlled fictional domain for studying DP learnability, while News gives us a continuously collected real-world corpus with natural messiness and time-evolving facts.

Index entry: structured source fact

name: Boreling
classification: Frost Geminon
type: ice
ability: Berserk
stats: HP 69 / Atk 60 / Def 63 / SpA 67 / SpD 68 / Spe 40 / BST 367
size: 12 m, 52 lbs
evolution: Boreling → Borelash → Borastat
move: Powder Snow - Has a chance to freeze the target

↓ sample Gemini for training articles grounded in the index, then dedup

Training article: tagged with mentioned attributes

text: Boreling is an ice-type Geminon, also known as the Frost Geminon, and it is the first form of a three-stage evolution line that includes Borelash and Borastat. Physically, it is 12 meters tall and weighs 52 lbs. Boreling has a base stat total of 367, with its individual stats being 69 HP, 60 attack, 63 defense, 67 special attack, 68 special defense, and 40 speed. Its ability is Berserk, and it can learn the move Powder Snow, which has a chance to inflict the frozen status on an opponent.
type: wiki
tag: name, classification, type, ability, all six stats, evolution_line, size, move

How records are generated. We prompt Gemini to write articles based on index entries, in several document types: single-Geminon wiki entries and field journals, head-to-head comparisons of two Geminons, and evolutionary-line analyses. During generation we also record which attributes of each Geminon the article actually mentions (the tag field).

↓ each attribute becomes a question

QA pair: with support article ids

Q: What is the classification of Boreling?
A: Frost Geminon
supports: 7410, 8929, 10420, … # articles that support this answer

Q: What are the types of Boreling?
A: Ice
supports: 3, 1515, 5964, … # articles that support this answer

…
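
The attribute-to-question step can be sketched roughly as follows: for each attribute of an index entry, collect the ids of articles whose tags mention that attribute, then emit one question with those ids as its supports. The question templates, the `subject` field, and the dictionary layout are hypothetical, and the sketch assumes one Geminon per article, which ignores the comparison and evolutionary-line document types.

```python
from collections import defaultdict

# Hypothetical templates; the benchmark's actual phrasing may differ.
TEMPLATES = {
    "classification": "What is the classification of {name}?",
    "type": "What are the types of {name}?",
    "ability": "What is the ability of {name}?",
}


def build_qa(index_entry, articles):
    """Turn one index entry into QA pairs backed by tagged articles.

    index_entry: dict of attribute -> value, including "name".
    articles: iterable of dicts with "id", "subject", and "tags" (a set of
              attribute names the article mentions).
    """
    supports = defaultdict(list)
    for article in articles:
        if article["subject"] != index_entry["name"]:
            continue
        for tag in article["tags"]:
            supports[tag].append(article["id"])

    qa_pairs = []
    for attribute, template in TEMPLATES.items():
        # Skip attributes that the entry lacks or that no article mentions.
        if attribute in index_entry and supports[attribute]:
            qa_pairs.append({
                "question": template.format(name=index_entry["name"]),
                "answer": index_entry[attribute],
                "supports": sorted(supports[attribute]),
            })
    return qa_pairs
```

Because the tags are recorded at generation time, the support list comes for free in this setting and no retrieval or judging step is needed.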

Source article: collected post-cutoff news from CC-News + dedup

url: hindustantimes.com/cricket/pakistan-knock-on-iccs-door-…
date: 2025-09-15
title: Pakistan knocks on ICC's door, demands immediate removal of match referee…
text: Pakistan knocks on ICC's door, demands immediate removal of match referee for staying silent on India's no handshake. The PCB has escalated India's no-handshake controversy to the ICC, asking for the immediate removal of match referee Andy Pycroft…

↓ cluster, generate QA, then verify support with open-book retrieval

QA pair: with closed- and open-book baselines

question: Who was the match referee for the India vs. Pakistan Asia Cup match on September 14, 2025?
answer: Andy Pycroft
support_count: 648
closedbook: Javagal Srinath ✘ # zeroshot Gemini
openbook: "Not mentioned" ✘ # Gemini with context of articles 126152

…

"Andy Pycroft" ✓ # Gemini with context of articles 747116

How News QA is built

Cluster + extract. Cluster the deduped articles and keep the 500 largest clusters. Per cluster, sample up to 50 articles and generate QA pairs with Gemini.

Filter. Drop questions that are not fully interpretable on their own, or that Gemini already answers correctly zero-shot.

Estimate supports. Embed every article and every surviving QA, retrieve the top-1000 candidate articles per QA, send each candidate with its associated QAs to Gemini for an open-book answer, and use Gemini to judge correctness. Articles judged correct count toward the question's support_count.
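
The support-estimation step can be sketched as retrieval followed by judging. This is a minimal sketch that assumes embeddings are already computed and passed in as plain lists of floats; `answer_open_book` and `judge` are hypothetical callables standing in for the Gemini calls. It also makes one call per (question, article) pair, whereas the pipeline described above batches each candidate article with its associated QAs.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def top_k_candidates(qa_vec, article_vecs, k=1000):
    """Indices of the k articles most similar to the QA embedding, best first."""
    scored = ((cosine(qa_vec, vec), idx) for idx, vec in enumerate(article_vecs))
    return [idx for _, idx in heapq.nlargest(k, scored)]


def estimate_support(qa_items, qa_vecs, articles, article_vecs,
                     answer_open_book, judge, k=1000):
    """Count, per QA, how many retrieved articles yield a correct open-book answer."""
    counts = []
    for item, qa_vec in zip(qa_items, qa_vecs):
        count = 0
        for idx in top_k_candidates(qa_vec, article_vecs, k):
            predicted = answer_open_book(item["question"], articles[idx])
            if judge(predicted, item["answer"]):
                count += 1
        counts.append(count)
    return counts
```

Retrieving only the top-k candidates bounds the number of model calls per question, at the cost of undercounting support when relevant articles rank below the cutoff.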

The gap under differential privacy

The main result is simple: non-private synthesis works, while DP synthesis mostly does not. In the standard “fine-tune then sample” pipeline, non-private synthetic data (ε = ∞) nearly matches the real corpus. With DP, accuracy collapses. On Geminon, DP synthesis reaches only 4% at ε = 100 and 3% at ε = 10. On News, it reaches 26% at ε = 100 and 20% at ε = 10. Both are barely above no-training. Private Evolution, a promising training-free DP method, does not close the gap either: it falls below no-training on both Geminon and News.

Downstream QA accuracy for the standard “fine-tune then sample” synthesis pipeline. Non-private synthesis nearly matches the real corpus, while DP synthesis falls far behind. Results use Gemma 3 4B as generator and evaluator.

The pattern is clear. Today's DP synthetic text can preserve the form of a corpus (its style, topics, and document structure) while losing the factual content that makes the data valuable. On Geminon, for example, going from ε = ∞ to ε = 10 lowers MAUVE by only 0.10 but lowers QA accuracy by 89%. If a method cannot recover facts repeated across hundreds of independent records, it has not really preserved the corpus :(

Release. Closing this gap is a central challenge for DP text synthesis, and a key roadblock for deploying DP in the AI era. We release the corpora and QA sets, along with the evaluation harness, and we hope ContinuousBench can serve as a yardstick for progress. More curation and experimental details are in the paper.