One million source images with hashes that respect the chosen uniqueness rule.
PDNA-1M was built to be trusted.
PDNA-1M exists to make XLerate™ DNA benchmark claims for Microsoft PhotoDNA3 inspectable, repeatable, and credible. It is not a random pile of vectors. It is an internal benchmark corpus built around clean geometry, explicit operating boundaries, and measurement discipline so speed and recall claims can be tested instead of merely asserted.
Eleven million hashes after augmentations: one base hash plus ten qualifying variants per source image.
199.0: the conservative L2 ceiling adopted after false positives were observed between 199.0 and 200.0.
99.5: the conservative insertion-side L2 radius used for the primary clean-geometry benchmark band.
Why the dataset exists
The benchmark is only as credible as the data under it.
PhotoDNA matching lives in a narrow, high-trust operating window. The dataset has to control the things that would distort measurement: hidden near-duplicates, weak source diversity, accidental geometric collisions, and pathological images that generate low-value hashes.
Collect a diverse internal benchmark corpus
The pipeline starts from a target of one million source images that pass the corpus uniqueness screen, broad enough to avoid a toy benchmark and clean enough to support trustworthy recall measurements.
Remove duplicates and near-duplicates
Candidate images are checked with exact pixel hashes, perceptual hashes, neural embeddings, and high-recall review passes before they become part of the base corpus.
Run a PhotoDNA sanity pass
Every base image is hashed and compared against the corpus so the base geometry can be screened for collisions before the benchmark measures system behavior.
Generate variants that are meaningfully different
For each base image, augmentations are generated until ten variants land inside the intended PhotoDNA distance window.
Split the variants by geometric difficulty
Easy, Medium, Hard, Hardest, and Extreme are shorthand for explicit L2 ranges with different implications for insertion-safe geometry.
Separate system limits from data limits
A trustworthy benchmark should tell you whether a miss or slowdown comes from the system being tested, or from geometry that has already been pushed beyond safe operating conditions.
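The sanity pass described above can be pictured as a pairwise distance screen over the base hashes. The following is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the function name `find_collisions`, the image ids, and the 4-dimensional toy vectors are hypothetical, and treating a distance of exactly 199.0 as a collision is an assumption. The brute-force scan is for illustration only and would not scale to a million images.

```python
import math
from itertools import combinations

# Uniqueness ceiling taken from the dataset description.
UNIQUENESS_CEILING_L2 = 199.0


def find_collisions(base_hashes, ceiling=UNIQUENESS_CEILING_L2):
    """Return (id_a, id_b, distance) for every pair of base hashes within `ceiling`.

    `base_hashes` maps a hypothetical image id to its hash as a sequence of numbers.
    Assumption: a distance exactly equal to the ceiling counts as a collision.
    """
    collisions = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(base_hashes.items(), 2):
        distance = math.dist(vec_a, vec_b)  # Euclidean (L2) distance
        if distance <= ceiling:
            collisions.append((id_a, id_b, distance))
    return collisions


# Toy vectors standing in for real hashes (hypothetical data).
toy_hashes = {
    "img_a": [0, 0, 0, 0],
    "img_b": [30, 40, 0, 0],         # 50.0 L2 from img_a, so it is flagged
    "img_c": [255, 255, 255, 255],   # far from both other images
}

collisions = find_collisions(toy_hashes)
for id_a, id_b, distance in collisions:
    print(f"{id_a} vs {id_b}: {distance:.1f} L2")
```

Any pair the screen reports would need review before both images stay in the base corpus, so that later recall measurements are not distorted by hidden near-duplicates.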
Rejected examples
Images that should not be hashed
These examples show low-signal failure modes that can generate low-value hashes and should be filtered during curation.
Under-saturation
Dim or low-information regions do not carry enough stable signal.


Repetitive texture
Repetitive, uninformative detail creates geometry you should not trust.


Distance bands
Five bands, one clean ceiling.
The augmented PhotoDNA hashes are split into five distance bands, all within the 199.0 L2 uniqueness distance. When these alias vectors are inserted into a database, however, we enforce an upper bound of half that clean-separation distance: 99.5 L2.
The half-distance rule uses the triangle inequality to keep same-source alias clusters compact: two accepted variants, each within 99.5 L2 of the same base, are at most 199.0 L2 apart, which is the clean-separation ceiling. Global label purity still depends on the corpus screens and XLerate DNA's purity-preserving insertion modes. The rule is not a preprocessing shortcut: XLerate DNA offers multiple automatic, purity-preserving operating modes at runtime.
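Written out, the half-distance argument is one application of the triangle inequality. The symbols here are introduced only for illustration: $b$ is a base hash and $v_1, v_2$ are two accepted variants of the same source image.

$$
\lVert v_1 - b \rVert_2 \le 99.5
\quad\text{and}\quad
\lVert v_2 - b \rVert_2 \le 99.5
\;\;\Longrightarrow\;\;
\lVert v_1 - v_2 \rVert_2
\le \lVert v_1 - b \rVert_2 + \lVert b - v_2 \rVert_2
\le 99.5 + 99.5 = 199.0
$$

The bound covers only variants of the same base. It says nothing about distances between variants of different source images, which is why global label purity still depends on the corpus screens.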
PDNA-1M distance-band counts
Eligible benchmark query counts in the split files. Hard is the mathematically safe insertion ceiling. Hardest and Extreme are robustness-oriented bands.
What the band names mean
A system claiming full-recall PhotoDNA search should return every configured match through Hard on the clean primary benchmark. Hardest and Extreme remain useful stress-test bands, but they push beyond the clean 99.5 insertion-safe radius, so they are not part of the primary recall contract.
That restriction is not a limitation of PhotoDNA or XLerate DNA. Beyond the clean radius, some rows become geometrically ambiguous under the dataset's label-purity rules. The primary benchmark is designed to test high-trust retrieval, not to hide ambiguity inside the labels.
Split files in the bundle
| Band | Eligible queries | Reading |
|---|---|---|
| Easy | 666,086 | Clear within-threshold variations. A full-recall system should return every configured match here. |
| Medium | 1,274,098 | Harder, but still inside the clean operating range. A full-recall system should still return every configured match here. |
| Hard | 1,062,769 | The primary clean-geometry edge for insertion-side benchmark rows in this dataset. |
| Hardest | 3,673,946 | Rows beyond the clean 99.5 insertion-safe radius; useful for robustness analysis, but outside the primary recall contract. |
| Extreme | 3,322,827 | Outer-limit material for understanding how far transformed hashes can move before label purity becomes ambiguous. |
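As a quick arithmetic check on the table, the counts can be totalled and the share covered by the primary recall contract (Easy through Hard) computed. This is a small sketch using only the numbers in the table; the variable names are illustrative.

```python
# Eligible query counts copied from the table above.
BAND_COUNTS = {
    "Easy": 666_086,
    "Medium": 1_274_098,
    "Hard": 1_062_769,
    "Hardest": 3_673_946,
    "Extreme": 3_322_827,
}

# Bands inside the clean 99.5 insertion-safe radius, per the text.
PRIMARY_BANDS = ("Easy", "Medium", "Hard")

total_queries = sum(BAND_COUNTS.values())
primary_queries = sum(BAND_COUNTS[band] for band in PRIMARY_BANDS)
primary_share = primary_queries / total_queries

print(f"total eligible queries:  {total_queries:,}")    # 9,999,726
print(f"primary recall contract: {primary_queries:,}")  # 3,002,953
print(f"share of all queries:    {primary_share:.1%}")  # 30.0%
```

Roughly three in ten eligible queries fall under the primary recall contract. The remainder sits in the Hardest and Extreme robustness bands.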
Visual examples
What the distance bands look like
These image pairs show how the bands progress from gentle transformations to larger movements in PhotoDNA space.
Easy
Brightness and intensity shift with highlight clipping.


Medium
Slight zoom, moderate blur, and mild darkening.


Hard
Perspective tilt, rotation, and black borders.


Hardest
Zoom, rotation, translation, borders, and resampling blur.


The geometric rule
Why 99.5 is the insertion limit.
The dataset's uniqueness ceiling is set at 199.0 L2 because false positives, as defined by our criteria, were observed between 199.0 and 200.0. That gives the clean insertion-side rule used for the primary benchmark: variants admitted as clean aliases of a base image must stay within 99.5 L2 of that base.
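The admission rule can be sketched as a simple filter, with an empirical check of the resulting cluster diameter. This is an illustration under stated assumptions rather than the dataset's actual tooling: `admit_clean_aliases`, `cluster_diameter`, and the 3-dimensional toy vectors are hypothetical, and treating the 99.5 radius as inclusive is an assumption.

```python
import math
from itertools import combinations

UNIQUENESS_CEILING_L2 = 199.0                     # from the dataset description
INSERTION_RADIUS_L2 = UNIQUENESS_CEILING_L2 / 2   # 99.5


def admit_clean_aliases(base, candidates, radius=INSERTION_RADIUS_L2):
    """Keep only candidate variants within `radius` L2 of the base hash."""
    return [vec for vec in candidates if math.dist(base, vec) <= radius]


def cluster_diameter(vectors):
    """Largest pairwise L2 distance among the given vectors (0.0 if fewer than two)."""
    return max((math.dist(a, b) for a, b in combinations(vectors, 2)), default=0.0)


# Toy vectors standing in for real hashes (hypothetical data).
base = [0.0, 0.0, 0.0]
candidates = [
    [99.0, 0.0, 0.0],    # 99.0 from base: admitted
    [-99.0, 0.0, 0.0],   # 99.0 from base, opposite side: admitted
    [0.0, 60.0, 0.0],    # 60.0 from base: admitted
    [0.0, 0.0, 120.0],   # 120.0 from base: rejected
]

aliases = admit_clean_aliases(base, candidates)
diameter = cluster_diameter(aliases)

print(f"admitted {len(aliases)} of {len(candidates)} candidates")
print(f"cluster diameter: {diameter:.1f} L2 (ceiling {UNIQUENESS_CEILING_L2})")
```

The two admitted variants on opposite sides of the base are 198.0 L2 apart, close to the worst case the rule allows and still under the 199.0 ceiling.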
The ruler behind the benchmark
The clean insertion-safe zone ends at 99.5 L2 because the dataset's conservative uniqueness ceiling is 199.0. Beyond that radius, the data can still be useful for robustness testing, but it no longer provides the same clean label-purity guarantee for the primary benchmark.
Why it matters
If two variants are each within 99.5 L2 of the same base, the triangle inequality bounds the distance between them by 99.5 + 99.5 = 199.0 L2, so that same-source alias cluster stays within the clean-separation ceiling.
Bias check
Private OOD validation checks that the result travels.
PDNA-1M is not the only validation signal behind the benchmark story. We also run private held-out out-of-distribution tests, including adult-content material, to verify that XLerate DNA's measured performance is not specific to PDNA-1M and translates to material closer to real PhotoDNA workflows.
Measured performance
See the benchmark story in full.
The benchmark page walks through XLerate DNA's measured performance, including local SDK throughput, AWS single-query throughput, p95 latency, cost parity, stress behavior, and the deployment story behind the service results.
Open the benchmark summary