
ContinuousBench: does DP synthetic text actually transfer knowledge?

TL;DR. We built ContinuousBench, a benchmark that measures whether differentially private (DP) synthetic text retains the corpus-specific knowledge that motivated using a sensitive dataset in the first place. On standard benchmarks like IMDB sentiment and OpenReview, every method looks roughly the same (mid-90s%) — there is no room left to distinguish them. On ContinuousBench, training on the real corpus jumps a Gemma 3 4B model from ~1% to ~96% on the Geminon track and from ~14% to ~70% on the News track. SOTA DP synthesis barely closes that gap — even at ε = 100. We release a corpus, QA set, and standardized harness, and we regenerate it quarterly to stay contamination-free.

What's wrong with current evaluations

DP text synthesis promises a way to learn from sensitive corpora — clinical notes, legal records, internal documents — through a privacy-preserving release. But the question that matters isn't "does the output look like the original?" It's: does it transmit the capabilities you would have gotten by training on the real data?

Existing evaluations don't answer this, for three reasons:

  1. The knowledge is already in the base model. IMDB sentiment, classification-style tasks on familiar domains — modern pretrained LLMs already solve these out of the box. Strong downstream numbers don't prove the corpus mattered; they prove pretraining was enough.
  2. The metrics measure style, not knowledge. MAUVE, BLEU, next-token loss, distributional similarity — all reward stylistic matching. A DP generator can ace these while losing every specific fact in the source corpus.
  3. Teacher–student distillation contaminates the signal. Synthetic data is usually generated by a model much stronger than the one trained on it. The downstream gains then conflate corpus signal with ordinary distillation from a more capable teacher.

The visible symptom is saturation. On the benchmarks the field actually uses, every method sits within a few points of the ceiling:

Downstream accuracy (%) as plotted:

                   No training   Train on DP-synth   Train on real
  OpenReview            51              55                 62
  IMDB                  94              96                 97
  Geminon (ours)         1               3                 97
  News (ours)           14              20                 64
Downstream accuracy of Gemma 3 4B after three increasingly informative training regimes. IMDB and OpenReview are saturated (≤ 3 pp gap between “no training” and “real corpus”). ContinuousBench has +94 pp on Geminon and +50 pp on News.

ContinuousBench

Each release packages three things: a fresh training corpus, a derived short-answer QA set, and a standardized training/eval harness. Three design choices fall out of the diagnosis above:

  1. The corpus is regenerated quarterly, so its facts cannot already sit in any base model's pretraining data.
  2. Evaluation is short-answer QA over corpus facts, so stylistic mimicry scores nothing.
  3. The harness uses the same base checkpoint for the generator and the downstream model, eliminating teacher–student confounds.

The two tracks differ in how the corpus is seeded: Geminon from a procedurally generated index, News from CC-News articles. Each record moves through three stages:

Index entry (a structured source fact):

                {
                  "name": "Boreling",
                  "classification": "Frost Geminon",
                  "type1": "ice",
                  "type2": null,
                  "ability": "Berserk",
                  "hp": 69,
                  "attack": 60,
                  "defense": 63,
                  "special attack": 67,
                  "special defense": 68,
                  "speed": 40,
                  "base_stat_total": 367,
                  "weight": 52,
                  "height": 12,
                  "idx": 10003,
                  "evolution_line": ["Boreling", "Borelash", "Borastat"],
                  "move": {
                    "name": "Powder Snow",
                    "short_description": "Has a chance to freeze the target."
                  }
                }
              
Rendered into prose for training, each index entry becomes a synthesized record (one of ≥200 paraphrases):

                {
                  "text": "As the first stage in the evolutionary line that culminates in Borastat, Boreling is a quintessential Frost Geminon. This pure ice-type, known as Gemidex entry #10003, measures 12 meters in height and weighs 52 lbs.  Its passive ability is Berserk, and it can be taught the move Powder Snow, which has a percentage-based chance to freeze an opponent. Boreling's base stat total of 367 is comprised of 69 HP, 60 attack, 63 defense, 67 special attack, 68 special defense, and a speed rating of 40.",
                  "tag": [
                    {
                      "idx": 10003,
                      "info": [
                        "idx",
                        "name",
                        "classification",
                        "type1",
                        "ability",
                        "evolution_line",
                        "height",
                        "weight",
                        "base_stat_total",
                        "hp",
                        "attack",
                        "defense",
                        "special attack",
                        "special defense",
                        "speed",
                        "move.name",
                        "move.short_description"
                      ]
                    }
                  ],
                  "type": "wiki"
                }
              
From each record, a short-answer eval is derived (QA pairs, normalized and exact-match scored):

                {
                  "question": "What is the classification of Boreling?",
                  "answer": "Frost Geminon"
                }
                {
                  "question": "What are the types of Boreling?",
                  "answer": "Ice"
                }
                ...
              
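The three stages above can be sketched end to end. The following is a hypothetical reconstruction, not the released generator: the helper names (`render_record`, `derive_qa`), template wording, and field selection are illustrative, and the real corpus draws each record from ≥200 paraphrase templates rather than the two shown here.

```python
# Sketch of the record pipeline: structured index entry -> tagged prose
# paraphrase -> short-answer QA pairs. Names and templates are illustrative.

def render_record(entry, template_id=0):
    """Render an index entry into one prose paraphrase, tagged with
    which fields the text attests (so fact coverage can be audited)."""
    templates = [
        ("{name} is a {classification} of type {type1}. Its ability is "
         "{ability} and its base stat total is {base_stat_total}."),
        ("Known for the ability {ability}, {name} ({classification}) is a "
         "{type1}-type with a base stat total of {base_stat_total}."),
    ]
    text = templates[template_id % len(templates)].format(**entry)
    return {
        "text": text,
        "tag": [{"idx": entry["idx"],
                 "info": ["name", "classification", "type1",
                          "ability", "base_stat_total"]}],
        "type": "wiki",
    }

def derive_qa(entry):
    """Derive QA pairs from the structured fields, never from the rendered
    prose, so gold answers stay unambiguous."""
    return [
        {"question": f"What is the classification of {entry['name']}?",
         "answer": entry["classification"]},
        {"question": f"What is the ability of {entry['name']}?",
         "answer": entry["ability"]},
    ]

entry = {"name": "Boreling", "classification": "Frost Geminon",
         "type1": "ice", "ability": "Berserk",
         "base_stat_total": 367, "idx": 10003}
record = render_record(entry)
qa = derive_qa(entry)
```

Deriving QA from the structured entry rather than the prose is what keeps the eval a test of knowledge transfer: a model can only answer if the fact survived synthesis and training, not because the question leaks the phrasing of any one paraphrase.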

Results

The headline number is the gap between non-private and DP synthesis. Non-private synthesis (ε = ∞) nearly matches training on the real corpus — 92.5% vs 96.4% on Geminon, 65.5% vs 70.4% on News. SOTA DP synthesis collapses: 13.7% / 20.6% at ε = 100, and barely above no-training at ε = 10. Private Evolution — a popular training-free DP method — falls below no-training on Geminon, because it can only resample from the base model's proposal distribution, which doesn't contain the fresh facts we test.

Downstream QA accuracy (%) as plotted:

                  Geminon   News
  No training         1      14
  DP-Syn ε=10         4       6
  DP-Syn ε=100       14      21
  Syn (ε=∞)          93      66
  Train on real      96      70
Downstream QA accuracy on ContinuousBench. Non-private synthesis nearly matches the real corpus; SOTA DP synthesis at ε = 100 leaves an 80-point gap on Geminon. Gemma 3 4B, matched generator × evaluator.

Put differently: DP generators today preserve the form of a corpus — its style, topics, document structure — while losing the specific factual content that makes the data valuable. Between ε = ∞ and ε = 10, MAUVE on Geminon drops by 0.10; QA accuracy drops by 88.6 percentage points. A method that cannot recover information attested across hundreds of independent records has not meaningfully preserved the corpus, and closing that gap is the central open problem for DP text synthesis. We hope ContinuousBench can serve as a yardstick.
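The QA numbers throughout are normalized exact match. A minimal sketch of such a scorer, assuming a SQuAD-style normalization (lowercase, strip punctuation and articles, collapse whitespace); the released harness may normalize differently:

```python
import re
import string

def normalize(s):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def qa_accuracy(predictions, golds):
    """Fraction of predictions matching their gold answer after
    normalization."""
    assert len(predictions) == len(golds)
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

For example, `exact_match("The Frost Geminon.", "frost geminon")` is true, while a stylistically plausible but factually wrong answer scores zero, which is exactly the asymmetry that distributional metrics like MAUVE miss.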

The corpus, QA set, and eval harness are released. See the paper for curation and experimental details.