Beyond MVTec AD: collecting your own anomaly dataset that survives reality


Aior



MVTec AD is a benchmark, not a dataset

Recent anomaly detection papers routinely report close to 99 % image-AUROC on MVTec AD. That number is the reason teams confidently deploy a model and then watch it fail in production. MVTec AD is small (~5k images), pristine (lab lighting, clean backgrounds, one object or texture per image), and curated (most anomalies are visible to a human in <1 s). Your factory floor is none of those things.

If you're collecting your own dataset, here's what we've learned the hard way.

Class imbalance is the whole problem

Anomalies are rare by definition. A line running at a 1 % defect rate gives you, in a typical week, a hundred or so anomalies for every ten thousand parts produced. This isn't a "balance the loss" problem. It's a "you don't have enough anomalies for supervised learning, ever" problem. Hence unsupervised methods.

But: that 1 % includes maybe twenty distinct defect types. If you train on the data you have, you'll cover the common defects fine and miss the rare ones — the rare ones being, of course, the most expensive ones to miss.
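To put rough numbers on that, here is a back-of-the-envelope sketch. The weekly volume of 10,000 parts and the Zipf-like split across defect types are assumptions made for illustration, not measurements from any line; only the 1 % rate and the twenty types come from the text above.

```python
def expected_weekly_counts(parts_per_week, defect_rate, n_types, zipf_exponent=1.0):
    """Expected weekly examples per defect type, most common first.

    Assumes (hypothetically) that type frequencies fall off as 1 / rank**zipf_exponent.
    """
    weights = [1.0 / rank ** zipf_exponent for rank in range(1, n_types + 1)]
    total = sum(weights)
    defects_per_week = parts_per_week * defect_rate
    return [defects_per_week * w / total for w in weights]


counts = expected_weekly_counts(parts_per_week=10_000, defect_rate=0.01, n_types=20)
for rank in (1, 5, 20):
    print(f"type ranked {rank:>2}: about {counts[rank - 1]:.1f} examples per week")
```

Under these assumptions the most common type yields fewer than thirty examples a week and the rarest between one and two, which is why the long tail stays under-sampled no matter how long you wait.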

The cold start problem

On day one of a project you have no good images and no anomaly images. Two weeks of data collection later, you have a few hundred good images and zero confirmed anomalies. The decision: deploy a "good only" anomaly detector now and find out what it flags, or wait until a couple of confirmed anomalies show up?

We've converged on: deploy in shadow mode at the end of week 2, meaning the model scores every part but never drives the reject decision. Treat the operator's manual rejections as provisional anomaly labels, and don't trust them until you've reviewed them.
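A minimal sketch of what a shadow-mode step could look like. The record fields, the `shadow_step` helper, and the stand-in scorer are all hypothetical names chosen for illustration; the point is only that the model's verdict is logged next to the operator's decision and never acted on.

```python
import io
import json
from dataclasses import asdict, dataclass


@dataclass
class ShadowRecord:
    part_id: str
    score: float
    would_reject: bool       # what the model would have done; never acted on
    operator_rejected: bool  # what actually happened on the line
    label_reviewed: bool = False  # operator rejections stay provisional until reviewed


def shadow_step(part_id, image, score_fn, threshold, operator_rejected, log):
    """Score one part and log the result without touching the reject gate."""
    score = float(score_fn(image))
    record = ShadowRecord(part_id, score, score >= threshold, operator_rejected)
    log.write(json.dumps(asdict(record)) + "\n")
    return record


def mean_brightness(image):
    """Placeholder scorer for the example, not a real anomaly detector."""
    return sum(image) / len(image)


# Usage with an in-memory log; in production this would be a file or a database.
log = io.StringIO()
shadow_step("part-001", [0.1, 0.2, 0.1], mean_brightness, 0.5, operator_rejected=False, log=log)
shadow_step("part-002", [0.9, 0.8, 0.7], mean_brightness, 0.5, operator_rejected=True, log=log)
```

Keeping `label_reviewed` as an explicit column makes it hard to accidentally train on operator rejections nobody has looked at yet.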

Active learning loops that actually work

  • Run inference on every part. Log score + image.
  • Human review queues, assuming a higher score means more anomalous: highest-score parts labelled good (potential false rejects) and lowest-score parts labelled bad (potential false accepts).
  • The operator labels each image in <30 s, in a UI built for it. Not a spreadsheet.
  • Daily delta: 50-100 new labels. Retrain weekly on the cumulative set.

This is the pattern that took our worst-performing project from 92 % to 99.4 % image-AUROC over six weeks of production. No new model architecture; just a better dataset.
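The two review queues from the list above can be sketched in a few lines. This assumes a higher score means more anomalous and that each logged part already carries a provisional "good" or "bad" label; `ScoredPart` and `build_review_queues` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPart:
    part_id: str
    score: float  # assumption: higher means more anomalous
    label: str    # provisional label, "good" or "bad"


def build_review_queues(parts, k):
    """Return (suspected_false_rejects, suspected_false_accepts), k items each at most."""
    good = [p for p in parts if p.label == "good"]
    bad = [p for p in parts if p.label == "bad"]
    # Good parts the model scores highest are the likeliest false rejects.
    suspected_false_rejects = sorted(good, key=lambda p: p.score, reverse=True)[:k]
    # Bad parts the model scores lowest are the likeliest false accepts.
    suspected_false_accepts = sorted(bad, key=lambda p: p.score)[:k]
    return suspected_false_rejects, suspected_false_accepts


parts = [
    ScoredPart("a", 0.90, "good"),
    ScoredPart("b", 0.10, "good"),
    ScoredPart("c", 0.20, "bad"),
    ScoredPart("d", 0.95, "bad"),
    ScoredPart("e", 0.60, "good"),
]
false_rejects, false_accepts = build_review_queues(parts, k=2)
```

With a daily budget of 50-100 labels, `k` would be set so the two queues together fill that budget.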

Synthetic anomalies (CutPaste, DRAEM)

A surprisingly strong tool. The trick: cut a random patch and paste it elsewhere in the same image (CutPaste), or blend in Perlin-noise-masked texture to simulate structural anomalies (DRAEM-style). The model learns "this region is statistically inconsistent" rather than "this looks like the anomalies I've seen". That generalises better to unseen defect types than naive supervised approaches.

We don't ship synthetic-only models. We ship models trained on real good samples + synthetic perturbations.
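For reference, a stripped-down sketch of the CutPaste idea on a grayscale image stored as a list of rows. It is a simplification: it omits the aspect-ratio and colour jitter used in the published method, the patch-size fractions are arbitrary defaults, and real pipelines would normally work on NumPy arrays or tensors rather than nested lists.

```python
import random


def cutpaste(image, rng, min_frac=0.1, max_frac=0.3):
    """Return (augmented_image, mask) for a grayscale image given as a list of rows."""
    h, w = len(image), len(image[0])
    ph = max(1, int(h * rng.uniform(min_frac, max_frac)))  # patch height
    pw = max(1, int(w * rng.uniform(min_frac, max_frac)))  # patch width
    if ph == h and pw == w:
        raise ValueError("patch covers the whole image; nowhere else to paste it")

    # Source corner, then a destination corner that differs from it.
    sy, sx = rng.randrange(h - ph + 1), rng.randrange(w - pw + 1)
    dy, dx = sy, sx
    while (dy, dx) == (sy, sx):
        dy, dx = rng.randrange(h - ph + 1), rng.randrange(w - pw + 1)

    patch = [row[sx:sx + pw] for row in image[sy:sy + ph]]
    augmented = [list(row) for row in image]  # copy, so the input is left untouched
    mask = [[0] * w for _ in range(h)]
    for i in range(ph):
        augmented[dy + i][dx:dx + pw] = patch[i]
        mask[dy + i][dx:dx + pw] = [1] * pw
    return augmented, mask


rng = random.Random(0)
good = [[y * 8 + x for x in range(8)] for y in range(8)]  # stand-in for a real good image
augmented, mask = cutpaste(good, rng)
```

The returned mask marks the pasted region, so the same function can feed either an image-level "is this perturbed" classifier or a pixel-level localisation loss.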

Things to actually capture, beyond the image

  • Camera ID, lens config, lighting state — different cameras drift differently
  • Shift, operator ID, line speed — operator-driven variance is a real signal
  • Upstream process variables (temperature, pressure) when available — sometimes the anomaly is upstream
  • Material lot — different supplier batches look different to the camera

These are the columns that let you debug a regression three months later instead of staring at a confusion matrix.
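One way to make those columns hard to forget is to write them next to every image from day one. The sketch below uses a dataclass and a CSV file; the `image_path` and `captured_at` fields, the field types, and the CSV format are assumptions added for the example, and the remaining fields mirror the list above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class CaptureRecord:
    image_path: str   # hypothetical: where the image is stored
    captured_at: str  # hypothetical: ISO 8601 timestamp
    camera_id: str
    lens_config: str
    lighting_state: str
    shift: str
    operator_id: str
    line_speed: float  # record the unit somewhere alongside the data
    material_lot: str
    temperature: Optional[float] = None  # upstream variables, when available
    pressure: Optional[float] = None


def write_records(records, out):
    """Write records as CSV with a header row; missing values become empty cells."""
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(CaptureRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


out = io.StringIO()
write_records(
    [CaptureRecord("img/000123.png", "2024-01-15T06:42:10", "cam-03", "f8-25mm",
                   "ring-on", "early", "op-17", 1.2, "lot-A91", temperature=71.5)],
    out,
)
```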

One last thing

Don't store your training images with lossy compression. Lossy JPEG compression hides exactly the kind of low-amplitude defects you're trying to detect. Keep the original lossless PNGs in cold storage, and downsample at training time if the model needs it.
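A toy illustration of the mechanism, assuming nothing beyond the standard library. JPEG quantises DCT coefficients, and anything smaller than about half the quantisation step rounds to zero. The sketch uses a single 8-pixel row and a uniform step of 16, both simplifications: real JPEG works on 8x8 blocks with per-frequency quantisation tables that depend on the quality setting.

```python
import math

N = 8  # JPEG uses 8x8 blocks; this toy uses one 8-pixel row


def dct(x):
    """Orthonormal 1-D DCT-II."""
    out = []
    for k in range(N):
        scale = math.sqrt((1 if k == 0 else 2) / N)
        out.append(scale * sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N)))
    return out


def idct(coeffs):
    """Inverse of dct() above."""
    out = []
    for n in range(N):
        out.append(sum(
            math.sqrt((1 if k == 0 else 2) / N) * coeffs[k] * math.cos(math.pi * (n + 0.5) * k / N)
            for k in range(N)
        ))
    return out


def lossy_roundtrip(row, step=16):
    """Level-shift, transform, quantise with a uniform step (an assumption), and invert."""
    shifted = [v - 128 for v in row]
    quantised = [round(c / step) * step for c in dct(shifted)]
    return [v + 128 for v in idct(quantised)]


def contrast(row):
    return max(row) - min(row)


flat = [128] * N
faint = flat[:3] + [131] + flat[4:]   # defect 3 grey levels above background
strong = flat[:3] + [188] + flat[4:]  # defect 60 grey levels above background

print("faint defect contrast after round trip:", contrast(lossy_roundtrip(faint)))
print("strong defect contrast after round trip:", contrast(lossy_roundtrip(strong)))
```

In this toy setup the faint defect is flattened completely while the strong one survives, which is the failure mode to avoid when the defects you care about are subtle.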

What's your dataset cadence? Weekly retrain, monthly, on-demand only?
 
