Dataset explainer

PDNA-1M was built to be trusted.

PDNA-1M exists to make XLerate™ DNA benchmark claims for Microsoft PhotoDNA inspectable, repeatable, and credible. It is not a random pile of vectors. It is an internal benchmark corpus built around clean geometry, explicit operating boundaries, and measurement discipline so speed and recall claims can be tested instead of merely asserted.

1,000,000 base images

10 qualifying variants per base image

Base corpus

1.0M

One million source images with hashes that respect the chosen uniqueness rule.

Full target

11M

Eleven million hashes after augmentations: one base hash plus ten qualifying variants per source image.

Uniqueness ceiling

199.0

The conservative L2 ceiling adopted after false positives were observed between 199.0 and 200.0.

Insertion-safe ceiling

99.5

The conservative insertion-side radius used for the primary clean-geometry benchmark band.

Why the dataset exists

The benchmark is only as credible as the data under it.

PhotoDNA matching lives in a narrow, high-trust operating window. The dataset has to control the things that would distort measurement: hidden near-duplicates, weak source diversity, accidental geometric collisions, and pathological images that generate low-value hashes.

Step 1

Collect a diverse internal benchmark corpus

The pipeline starts from a target of one million source images that pass the corpus uniqueness screen, broad enough to avoid a toy benchmark and clean enough to support trustworthy recall measurements.

Step 2

Remove duplicates and near-duplicates

Candidate images are checked with exact pixel hashes, perceptual hashes, neural embeddings, and high-recall review passes before they become part of the base corpus.
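
The exact-hash stage of this screen can be sketched roughly as follows. This is a minimal illustration, assuming each image's decoded pixel data is available as a byte string; the function name `drop_exact_duplicates` is hypothetical, and the perceptual-hash, embedding, and review stages are not shown.

```python
import hashlib

def drop_exact_duplicates(images):
    """Keep the first occurrence of each byte-identical image.

    `images` is an iterable of (image_id, pixel_bytes) pairs. Only the
    exact-hash stage is covered; near-duplicates need the later screens.
    """
    seen = set()
    kept = []
    for image_id, pixel_bytes in images:
        digest = hashlib.sha256(pixel_bytes).hexdigest()
        if digest in seen:
            continue  # identical pixel data to an image already kept
        seen.add(digest)
        kept.append(image_id)
    return kept
```

An exact hash only catches identical pixel data, which is why the source pairs it with perceptual and embedding-based checks.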

Step 3

Run a PhotoDNA sanity pass

Every base image is hashed and compared against the corpus so the base geometry can be screened for collisions before the benchmark measures system behavior.
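
A collision screen of this kind might look like the sketch below. It assumes hashes are plain numeric vectors and that two distinct base images count as colliding when their L2 distance is at or below the 199.0 ceiling; both the comparison rule and the name `find_collisions` are assumptions for illustration, not the pipeline's actual implementation.

```python
import math
from itertools import combinations

UNIQUENESS_CEILING = 199.0  # L2 ceiling stated in this explainer

def find_collisions(hashes, ceiling=UNIQUENESS_CEILING):
    """Return (id_a, id_b, distance) for every pair at or below `ceiling`.

    `hashes` maps image id -> sequence of numbers. Brute force, O(n^2):
    fine for a toy example, not for a million-image corpus.
    """
    collisions = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(hashes.items(), 2):
        distance = math.dist(vec_a, vec_b)
        if distance <= ceiling:
            collisions.append((id_a, id_b, distance))
    return collisions
```

At corpus scale the pairwise loop would have to be replaced by an indexed nearest-neighbor search; the sketch only shows the acceptance rule.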

Step 4

Generate variants that are meaningfully different

For each base image, augmentations are generated until ten variants land inside the intended PhotoDNA distance window.
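
The generate-until-ten loop can be sketched as rejection sampling. This is an illustration under stated assumptions: `make_variant_hash` is a hypothetical stand-in for "augment the image, then hash it", and the window bounds `low` and `high` are parameters because the explainer does not state them at this point.

```python
import math

def collect_variants(base_hash, make_variant_hash, low, high,
                     wanted=10, max_tries=1000):
    """Draw candidates until `wanted` fall inside [low, high] L2 of the base."""
    accepted = []
    for _ in range(max_tries):
        candidate = make_variant_hash(base_hash)  # hypothetical augment + hash
        distance = math.dist(base_hash, candidate)
        if low <= distance <= high:
            accepted.append(candidate)
            if len(accepted) == wanted:
                return accepted
    # Give up rather than loop forever on an image that never qualifies.
    raise RuntimeError(f"only {len(accepted)} of {wanted} variants qualified")
```

The `max_tries` cap is a design choice of this sketch: an image whose augmentations rarely land in the window is reported instead of stalling the run.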

Step 5

Split the variants by geometric difficulty

Easy, Medium, Hard, Hardest, and Extreme are shorthand for explicit L2 ranges with different implications for insertion-safe geometry.

Step 6

Separate system limits from data limits

A trustworthy benchmark should tell you whether a miss or slowdown comes from the system being tested, or from geometry that has already been pushed beyond safe operating conditions.
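
One simplified way to express that separation is to label each miss by where its query sits relative to the 99.5 L2 insertion-safe radius. The function below is a hypothetical sketch, not the benchmark's actual attribution logic, and it ignores slowdowns entirely.

```python
INSERTION_SAFE_RADIUS = 99.5  # L2, from this explainer

def attribute_miss(distance_to_base, radius=INSERTION_SAFE_RADIUS):
    """Label a missed match by its query's L2 distance from the base hash."""
    if distance_to_base <= radius:
        return "system"    # inside the clean radius: the system should have found it
    return "geometry"      # beyond the clean radius: the label may be ambiguous
```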

Rejected examples

Images that should not be hashed

These examples show low-signal failure modes that can generate low-value hashes and should be filtered during curation.

Under-saturation

Dim or low-information regions do not carry enough stable signal.

Under-saturated pathological image examples A and B.

Repetitive texture

Repetitive, uninformative detail creates geometry you should not trust.

Repetitive texture pathological image example D versus C.

Distance bands

Five bands, one clean ceiling.

The augmented PhotoDNA hashes are split into five distance bands that all fall within the 199.0 L2 uniqueness distance. When these alias vectors are inserted into a database, however, we enforce an upper bound of half that clean-separation distance: 99.5 L2.

The half-distance rule uses the triangle inequality to keep same-source alias clusters compact: if two accepted variants are each within 99.5 L2 of the same base, the distance between them is at most 99.5 + 99.5 = 199.0 L2, so the pair stays within the clean-separation ceiling. Global label purity still depends on the corpus screens and on XLerate DNA's purity-preserving insertion modes. This is not a preprocessing shortcut: XLerate DNA offers multiple automatic, purity-preserving operating modes at runtime.

PDNA-1M distance-band counts

Eligible benchmark query counts in the split files. Hard is the mathematically safe insertion ceiling. Hardest and Extreme are robustness-oriented bands.

Bar chart of eligible queries per band: Easy 666k, Medium 1.27M, Hard 1.06M, Hardest 3.67M, Extreme 3.32M.

What the band names mean


A system claiming full-recall PhotoDNA search should return every configured match through Hard on the clean primary benchmark. Hardest and Extreme remain useful stress-test bands, but they push beyond the clean 99.5 insertion-safe radius, so they are not part of the primary recall contract.
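
A check of that recall contract could be sketched as below. The input format, a list of `(band, found)` pairs, and both function names are assumptions made for illustration; the benchmark's real result format is not described here.

```python
PRIMARY_BANDS = ("Easy", "Medium", "Hard")  # bands inside the recall contract

def recall_by_band(results):
    """results: iterable of (band, found) pairs -> {band: recall}."""
    totals, hits = {}, {}
    for band, found in results:
        totals[band] = totals.get(band, 0) + 1
        hits[band] = hits.get(band, 0) + (1 if found else 0)
    return {band: hits[band] / totals[band] for band in totals}

def meets_primary_contract(recall):
    """True when every primary band present in `recall` has recall 1.0."""
    return all(recall[band] == 1.0 for band in PRIMARY_BANDS if band in recall)
```

Misses in Hardest or Extreme lower their own recall figures but do not affect `meets_primary_contract`, which mirrors how the source treats those bands as stress tests.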

That restriction is not a limitation of PhotoDNA or XLerate DNA. Beyond the clean radius, some rows become geometrically ambiguous under the dataset's label-purity rules. The primary benchmark is designed to test high-trust retrieval, not to hide ambiguity inside the labels.

Split files in the bundle

Band	Eligible queries	Reading
Easy	666,086	Clear within-threshold variations. A full-recall system should return every configured match here.
Medium	1,274,098	Harder, but still inside the clean operating range. A full-recall system should still return every configured match here.
Hard	1,062,769	The primary clean-geometry edge for insertion-side benchmark rows in this dataset.
Hardest	3,673,946	Rows beyond the clean 99.5 insertion-safe radius; useful for robustness analysis, but outside the primary recall contract.
Extreme	3,322,827	Outer-limit material for understanding how far transformed hashes can move before label purity becomes ambiguous.
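
The counts in the table can be totalled to see how much of the bundle falls inside the primary recall contract. The snippet below uses only the figures listed above; the variable names are illustrative.

```python
ELIGIBLE_QUERIES = {
    "Easy": 666_086,
    "Medium": 1_274_098,
    "Hard": 1_062_769,
    "Hardest": 3_673_946,
    "Extreme": 3_322_827,
}

total = sum(ELIGIBLE_QUERIES.values())
primary = sum(ELIGIBLE_QUERIES[band] for band in ("Easy", "Medium", "Hard"))

print(f"all bands:     {total:,}")            # 9,999,726
print(f"through Hard:  {primary:,}")          # 3,002,953
print(f"primary share: {primary / total:.1%}")  # 30.0%
```

The total sits slightly under the nominal ten million variants (ten per base image). The table counts eligible queries, and this page does not break down the difference.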

Visual examples

What the distance bands look like

These image pairs show how the bands progress from gentle transformations to larger movements in PhotoDNA space.

Easy

Brightness and intensity shift with highlight clipping.

Medium

Slight zoom, moderate blur, and mild darkening.

Hard

Perspective tilt, rotation, and black borders.

Hardest

Zoom, rotation, translation, borders, and resampling blur.

The geometric rule

Why 99.5 is the insertion limit.

The dataset's uniqueness ceiling is set at 199.0 L2 because false positives, as defined by our criteria, were observed between 199.0 and 200.0. Halving that ceiling gives the clean insertion-side rule used for the primary benchmark: variants admitted as clean aliases of a base image must stay within 99.5 L2 of that base.

The ruler behind the benchmark

The clean insertion-safe zone ends at 99.5 L2 because the dataset's conservative uniqueness ceiling is 199.0. Beyond that radius, the data can still be useful for robustness testing, but it no longer provides the same clean label-purity guarantee for the primary benchmark.

Distance ruler (L2) showing the Easy, Medium, Hard, Hardest, and Extreme bands, with tick marks at 0, 25, 40, 60, 99.5, 155, and 199 and the Hard safe ceiling marked at 99.5.
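
Assigning a distance to a band can be sketched with a sorted list of edges. Note the assumption: this page states only that Hard ends at 99.5 and that every band lies within 199.0. The remaining edges below are read off the ruler's tick marks (25, 40, 60, 155) and their mapping to bands is a guess, as is the choice to make upper edges inclusive.

```python
import bisect

BANDS = ["Easy", "Medium", "Hard", "Hardest", "Extreme"]

# ASSUMPTION: edges inferred from the ruler's tick marks, not stated in the text.
LOWER_EDGE = 25.0
UPPER_EDGES = [40.0, 60.0, 99.5, 155.0, 199.0]  # one inclusive upper edge per band

def band_for(distance, lower=LOWER_EDGE, upper_edges=UPPER_EDGES):
    """Return the band name for an L2 distance, or None if outside all bands."""
    if distance < lower or distance > upper_edges[-1]:
        return None
    return BANDS[bisect.bisect_left(upper_edges, distance)]
```

Passing different `lower` and `upper_edges` values lets the same function follow whatever ranges the split files actually use.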

Why it matters

If two variants are each within 99.5 L2 of the same base, the triangle inequality keeps that same-source alias cluster within the 199.0 L2 clean-separation ceiling.
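
The bound can be checked numerically. The sketch below uses toy two-dimensional points as stand-ins for hash vectors, which is an assumption for readability; the triangle inequality holds in any number of dimensions.

```python
import math

INSERTION_SAFE_RADIUS = 99.5   # L2, from this explainer
UNIQUENESS_CEILING = 199.0     # L2, from this explainer

def max_variant_separation(radius=INSERTION_SAFE_RADIUS):
    """Triangle inequality: d(a, b) <= d(a, base) + d(base, b) <= 2 * radius."""
    return 2 * radius

# Toy points around a base at the origin.
variant_a = (99.5, 0.0)    # exactly on the radius
variant_b = (-99.5, 0.0)   # on the radius, opposite side: the worst case
variant_c = (0.0, 80.0)    # comfortably inside the radius

assert max_variant_separation() == UNIQUENESS_CEILING
for v, w in [(variant_a, variant_b), (variant_a, variant_c), (variant_b, variant_c)]:
    assert math.dist(v, w) <= UNIQUENESS_CEILING
```

The worst case is two variants on opposite sides of the base, each exactly 99.5 away, which lands on the 199.0 ceiling and never beyond it.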

Bias check

Private OOD validation checks that the result travels.

PDNA-1M is not the only validation signal behind the benchmark story. We also run private held-out out-of-distribution tests, including adult-content material, to verify that XLerate DNA's measured performance is not specific to PDNA-1M and translates to material closer to real PhotoDNA workflows.

Companion page

Measured performance

See the benchmark story in full.

The benchmark page walks through XLerate DNA's measured performance, including local SDK throughput, AWS single-query throughput, p95 latency, cost parity, stress behavior, and the deployment story behind the service results.
