SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

A perceptually grounded metric for camouflage evaluation and generation.

Amin Karimi Monsefi^1† Abolfazl Meyarian^2† Mridul Khurana³ Shuheng Wang¹
Pouyan Navard^1‡ Cheng Zhang⁴ Anuj Karpatne³ Wei-Lun Chao⁵ Rajiv Ramnath¹

† Equal contribution · ‡ PhD alumni, now in industry

¹The Ohio State University · ²Path Robotics · ³Virginia Tech · ⁴Texas A&M University · ⁵Boston University

Paper arXiv Code CamFG-1.5K BibTeX

SeamCam vs CamOT comparison across diverse camouflage scenarios

Figure 1. SeamCam vs. CamOT. SeamCam produces consistent and accurate camouflage difficulty scores across diverse scenarios, whereas CamOT exhibits notable inconsistencies. As illustrated by the polar bear — where superficial color and lighting similarity between subject and background causes CamOT to erroneously overestimate camouflage effectiveness despite the subject being plainly visible — CamOT assigns a disproportionately high score relative to SeamCam.

01 — Abstract

What we built, and why

Camouflage is what prevents an observer from confidently localizing an animal — not what makes its colors match the background. SeamCam measures camouflage as the failure of that localization.

Animals are described as effectively camouflaged when they blend seamlessly with their surroundings, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known.

We introduce SeamCam, a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask — the SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage.

In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming the strongest baseline by ~25 points. We further use SeamCam as a preference signal for Direct Preference Optimization to fine-tune a diffusion-based inpainting model, and release CamFG-1.5K, a curated benchmark of 1,521 fully-visible animals for unbiased camouflage-generation evaluation.

02 — Highlights

Three numbers worth remembering

78.82%

Human Alignment

+25pt

Margin over CamOT

1,521

CamFG-1.5K Images

SeamCam is a category-aware ideal-observer metric. It cares about whether evidence integration succeeds, not about what made it fail — so it correctly rewards both standard background-matching camouflage and disruptive coloration, while remaining stable across detector/segmenter backbones (77.5–79.7% across four pipelines).

03 — Method

Camouflage as failed localization

Given an image \(I\), ground-truth mask \(M^*\), and species name \(c\), SeamCam produces category-conditioned proposals, segments each, and asks: what is the best subset of those segments at recovering the animal? Whatever that best subset can achieve is the upper bound on detectability — and one minus that is the score.

Propose

Grounding DINO queries the image with the species prompt, returning candidate boxes with text-alignment \(\alpha_i\) and confidence \(\beta_i\) scores.

Gate

Two thresholds (\(\tau_\alpha{=}0.5\), \(\tau_\beta{=}0.1\)) filter irrelevant proposals while keeping the suppressed-confidence true positives typical of camouflage.

Segment

SAM-2 turns each surviving box into a mask. We keep Top-\(K{=}7\) by confidence — the elbow of the accuracy/latency curve.

Aggregate

Across all \(2^K{-}1\) non-empty subsets, take the union and compute IoU against \(M^*\). The maximum is \(D\); the score is \(\zeta = 1 - D\).

\[ D(I, \mathcal{M}, c) \;=\; \max_{\emptyset \neq S \subseteq \{1,\ldots,K\}} \mathrm{IoU}\!\left(\bigcup_{i \in S}\widehat{M}_i,\, M^*\right) \qquad \zeta(I, \mathcal{M}, c) = 1 - D \]

This formulation is naturally robust to two common detector pathologies: duplicate proposals don't change the max-union IoU, and noisy weak detections are absorbed because the maximization only ever picks subsets that help.

SeamCam framework overview: Grounding DINO proposals, SAM-2 segmentation, subset IoU evaluation

Figure 3. SeamCam framework. Given an image and species name, we generate category-conditioned proposals via Grounding DINO, apply semantic and confidence gating, and obtain segmentation masks from SAM-2. All proposal subsets are evaluated; the maximum-IoU subset defines detectability \(D\). The camouflage score is \(1 - D\).

04 — Human Study

What 94 people thought

We ran a two-alternative forced-choice study: participants saw image pairs and chose "which is more camouflaged?". With ties dropped, attention-screened catch trials, and pairs with fewer than three valid responses excluded, we had 2,290 valid pairs and 94 participants.

Metric	Human Agreement	Δ vs. Chance
CamOT	53.89%	+3.89
SeamCam (ours)	78.82%	+28.82

McNemar's paired test on the 2,271-pair shared subset gives \(\chi^2_{cc} = 300.67\), \(p < 10^{-67}\) — SeamCam agrees with humans where CamOT disagrees roughly 3.2× more often than the reverse. Per-species Wilson intervals place SeamCam above chance for all 20 species in the study.

Per-species accuracy bar chart: SeamCam vs CamOT across 20 species

Figure 5. Per-species accuracy comparison. SeamCam (green) outperforms CamOT (brown) across all 20 categories. Largest gaps appear in Fish (Δ = 0.45), Sea Horse (Δ = 0.39), and Cat (Δ = 0.36); the smallest in Cicada and Toad.

05 — Generation

SeamCam as a training signal

We use SeamCam to mine hard negatives for Direct Preference Optimization of an SD-V2 inpainting model: real camouflage images are winners; the highest-scoring synthetic candidates (most convincingly camouflaged among generated alternatives) become losers.

SeamCam-based DPO pipeline: prompt generation, candidate synthesis, hard-negative selection

Figure 4. SeamCam-based DPO sample selection. InternVL3-14B generates 12 environmental prompt variants per training image; SD-V2 Inpainting synthesizes candidates; the highest-SeamCam candidate is selected as the hard negative against the real image (winner).

Ablation: training signal on SD-V2 base

Training Signal	KID ↓	FID ↓	HPS-v2 ↑	CamOT ↑	SeamCam ↑
Zero-shot	0.0258	88.28	0.190	0.3619	0.1824
LoRA + SFT	0.0248	84.28	0.193	0.3641	0.2519
DPO (Random)	0.0187	78.04	0.192	0.3874	0.2565
DPO (CamOT)	0.0168	77.26	0.193	0.4352	0.2995
DPO (SeamCam)	0.0139	71.25	0.195	0.4144	0.3811

DPO (CamOT) gets the highest CamOT score — but that's in-distribution reward hacking. The independent metrics (KID, FID, HPS-v2) all favor DPO (SeamCam), confirming the gains are real.

Comparison on CamFG-1.5K

Method	KID ↓	FID ↓	HPS-v2 ↑	CamOT ↑	SeamCam ↑
LCG-Net	0.1455	248.58	0.158	0.4263	0.2954
TFill	0.0669	131.30	0.148	0.5067	0.3571
LDM	0.0423	99.43	0.165	0.3751	0.3474
RePaint-L	0.0266	101.21	0.172	0.3930	0.3501
LAKE-RED	0.0157	70.65	0.171	0.3967	0.3639
DPO (SeamCam)	0.0139	71.25	0.195	0.4144	0.3811

Qualitative comparison of SeamCam-DPO vs baseline camouflage generation methods

Figure 6. Qualitative comparison. Given only the foreground mask, SeamCam-DPO produces backgrounds that match each animal's natural habitat. TFill loses color fidelity, RePaint-L generates implausible scenes, LDM over-saturates, LAKE-RED places animals in ecologically mismatched environments.

06 — Dataset

CamFG-1.5K

Existing camouflage datasets contain images that are already partially camouflaged, cropped, or occluded — confounds that systematically bias the evaluation of generation models. CamFG-1.5K eliminates these confounds by construction.

1,521 high-resolution iNaturalist images, each containing a complete, well-segmented animal with minimal occlusion. The dataset spans more than 1,000 species across six taxonomic groups (insects, reptiles, amphibians, fish, birds, mammals) and multiple biome types. It will be publicly released to support reproducible benchmarking.

CamFG-1.5K vs existing camouflage datasets: image quality comparison

Figure 2. Quality comparison between CamFG-1.5K and existing datasets. Existing collections often include subjects that are cropped or partially camouflaged, biasing the evaluation of camouflage models. CamFG-1.5K features clearly visible animals with minimal obstructions, enabling unbiased assessment of generation methods.

07 — Citation

Cite this work

If SeamCam, CamFG-1.5K, or our DPO recipe is useful in your research, please cite:

@inproceedings{monsefi2026seamcam,
  title     = {SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability},
  author    = {Monsefi, Amin Karimi and Meyarian, Abolfazl and Khurana, Mridul
               and Wang, Shuheng and Navard, Pouyan and Zhang, Cheng
               and Karpatne, Anuj and Chao, Wei-Lun and Ramnath, Rajiv},
  
  year      = {2026}
}