u/Tasty_Pressure_5618 — reddlx

▲ 2 r/datasets+1 crossposts

I built a dataset on SDXL + InstantID architecture and tested 14 popular deepfake detectors

I tested 14 popular deepfake detectors on SDXL + InstantID architecture. Six of them performed at or below random (dataset and blog below).

About a year removed from my last research project, I've gotten an itch to dip a toe back in. Releasing full-blown papers would be difficult to sustain, so I've opted for a Substack instead. Here is the TL;DR:

What did I do?
I compiled 26K real + generated face crops across 12 demographic cells and benchmarked 14 popular open-source models on them.

What were the results?
Only two detectors achieve near-perfect rank ordering. Only one is deployable as shipped.
Fairness drift is visible in 12 of 14 detectors. Per-cell AUC spread ranges from 0 (cell-invariant) to 0.54 (catastrophic). The aggregate AUC hides where they break.
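
For anyone wondering what "per-cell AUC spread" means in practice, here is a minimal sketch, not the code from the blog post. It assumes each image has a binary label (1 = generated, 0 = real), a detector score, and a demographic cell ID, and it takes spread to be max minus min of the per-cell AUCs, which is my reading of the metric. The function names and toy numbers are made up for illustration.

```python
from collections import defaultdict

def auc(labels, scores):
    """ROC AUC as the fraction of (fake, real) pairs where the fake
    scores higher; ties count half. O(n^2), fine for small cells."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one real and one generated sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def per_cell_auc(labels, scores, cells):
    """Group samples by demographic cell and compute AUC inside each cell."""
    grouped = defaultdict(lambda: ([], []))
    for y, s, c in zip(labels, scores, cells):
        grouped[c][0].append(y)
        grouped[c][1].append(s)
    return {c: auc(ys, ss) for c, (ys, ss) in grouped.items()}

def auc_spread(cell_aucs):
    """Assumed definition: best cell AUC minus worst cell AUC."""
    return max(cell_aucs.values()) - min(cell_aucs.values())

# Hypothetical toy data: two cells, four images each.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.7, 0.3]
cells = ["A", "A", "A", "A", "B", "B", "B", "B"]

aggregate = auc(labels, scores)
by_cell = per_cell_auc(labels, scores, cells)
spread = auc_spread(by_cell)
```

On this toy data the aggregate AUC is 0.9375, while cell B sits at 0.75 and the spread is 0.25, which is the "aggregate hides where they break" effect in miniature. For a dataset of this size, a rank-based implementation such as scikit-learn's `roc_auc_score` would be a faster drop-in for the pairwise loop.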

Next, I'll most likely be targeting liveness detection and working with a more frontier architecture. If you have a model in mind for the next benchmark, please comment.

Read the full blog post here: https://babalolad.substack.com/p/i-tested-14-deepfake-detectors-on
Access the dataset here: https://huggingface.co/datasets/danb21/synthetic-face-sdxl-instantid-bench

u/Tasty_Pressure_5618 — 20 hours ago

▲ 4 r/datasets+1 crossposts

What deepfake detection models can I test my validation dataset on?

Hello, I built a validation dataset of real and generated images (with a vanilla SDXL + InstantID architecture). I'm running low on AWS credits and have a small budget, but I want to benchmark the performance of detection models against it. Can anyone recommend open-source detection models that I can test?

I know there is a mix of models created by universities and by members of the open-source community, but any opinions on which 4-5 I should test would be greatly appreciated.

u/Tasty_Pressure_5618 — 6 days ago

▲ 1 r/computervision

How are people evaluating demographic fairness of deepfake/synthetic-face detectors?

I keep finding that FF++, DFDC, and GenImage aren't balanced enough by skin tone/gender to get stable per-group accuracy numbers. Is there a balanced eval benchmark I'm missing, or does everyone just report aggregate AUC?

u/Tasty_Pressure_5618 — 7 days ago