
CNN vs Transformer, quantized: YOLO26-seg and RF-DETR-Seg race for instance masks on an Intel iGPU

Two very different small segmentation models, a 2026-era CNN and a DETR transformer, both quantized to INT8 with NNCF and run on a laptop Intel iGPU. YOLO26-seg's forward pass is ~7.4× faster; RF-DETR-Seg keeps its masks more faithful under INT8. The whole shootout, numbers first.

Luis Condados
Tags: computer-vision, instance-segmentation, openvino, IntelSoftwareInnovator, quantization, edge-ai, yolo, rf-detr

A 2026-era CNN and a DETR transformer, both quantized to INT8 and run on the laptop iGPU you already own. On an Intel Iris Xe, YOLO26-seg’s forward pass is ~2.1× faster at INT8 than at FP32 on the same iGPU (9.3 ms, ~107 forward passes/s), and ~7.4× faster than RF-DETR-Seg at INT8. But the transformer keeps its masks more faithfully under INT8 (0.95 vs 0.87 IoU, each measured against the model’s own FP32 output). And weight-only INT8? Zero iGPU speedup. Here’s the whole shootout.

Figure: YOLO26-seg vs RF-DETR-Seg, INT8 on the iGPU. Same frame, both at INT8 on the Intel iGPU. Left: YOLO26n-seg, ~9.7 ms. Right: RF-DETR-Seg-Nano, ~70 ms. The CNN is roughly 7× cheaper per frame.


Why bother doing this on an iGPU?

Instance segmentation — a mask per object, not just a box — is the expensive end of real-time vision. The reflex is to reach for an NVIDIA GPU. But most laptops, mini-PCs and industrial edge boxes ship with an Intel integrated GPU sitting idle, and OpenVINO turns it into a real inference accelerator for free: no cloud, no discrete card, no per-frame egress cost.

The open question for a practitioner is which model family. Two strong, very different small seg models landed in early 2026:

  • YOLO26n-seg — Ultralytics, a CNN, NMS-free and DFL-free, 640×640.
  • RF-DETR-Seg-Nano — Roboflow, a DETR transformer (DINOv2 backbone), 384×384.

CNN vs transformer, same iGPU, one shared INT8 pipeline. Which wins on latency, and do the masks survive quantization? This is a segmentation sequel to earlier detection benchmarks — same hardware, now with masks.

The approach in one flow

coco128-seg → export each model to OpenVINO FP32 IR → one NNCF pipeline
            → {weight-only INT8, full-PTQ INT8} per model
            → benchmark forward latency + mask-IoU retention on {CPU, iGPU}
  • Models: yolo26n-seg.pt (Ultralytics) and RFDETRSegNano (Roboflow), nano tier.
  • Data: coco128-seg (128 COCO images with masks) — calibration and eval.
  • Runtime: OpenVINO 2026.2, NNCF INT8.
  • Hardware: Intel Core i7-12700H — Iris Xe iGPU (GPU.0) and the CPU.
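The flow above fans out into a small grid of benchmark cells: two models, three precisions, two devices. A minimal sketch of that enumeration, using the precision labels that appear in the results tables (fp32, int8_woq, int8_full); the constant and function names are illustrative and not taken from the repository:

```python
from itertools import product

# Labels as they appear in the results tables.
MODELS = ("yolo26", "rfdetr")
PRECISIONS = ("fp32", "int8_woq", "int8_full")
DEVICES = ("CPU", "GPU")


def benchmark_cells():
    """Every (model, precision, device) combination that gets timed."""
    return list(product(MODELS, PRECISIONS, DEVICES))


cells = benchmark_cells()
for model, precision, device in cells:
    print(f"{model:7s} {precision:10s} {device}")
```

That gives 12 cells, which matches the 12 rows of the speed table.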

Everything is a uv run seg-<verb> command; the whole thing reproduces from uv sync.

Build walkthrough

1. Export — two export paths, one IR layout

The two families could not export more differently. YOLO26 exports to OpenVINO in one Ultralytics call; RF-DETR goes through ONNX and ov.convert_model (cli/export.py):

# YOLO26: Ultralytics emits the IR folder directly
YOLO("yolo26n-seg.pt").export(format="openvino", imgsz=640, half=False, dynamic=False)

# RF-DETR: export ONNX (opset 18, static shape) → convert to OpenVINO IR
model = RFDETRSegNano()
model.export(output_dir=tmp, opset_version=18, shape=(res, res))
ov_model = ov.convert_model(onnx_path)

Two gotchas surfaced immediately, both worth knowing:

YOLO26 is end-to-end. Its seg export does not produce the legacy (1, 116, 8400) raw grid. Being NMS-free, it emits (1, 300, 6 + 32) — up to 300 already-deduplicated instances, each [x1, y1, x2, y2, conf, cls, 32 mask coeffs], plus a (1, 32, 160, 160) prototype bank. No NMS, no transpose — just threshold and assemble masks (core/yolo26.py):

det = det[0]                  # (300, 38): [x1,y1,x2,y2,conf,cls,coeffs]
boxes, conf, cls, coeffs = det[:, :4], det[:, 4], det[:, 5], det[:, 6:38]
masks = sigmoid(coeffs @ proto.reshape(nm, mh * mw)).reshape(-1, mh, mw)
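To make the "threshold and assemble" step concrete, here is a dependency-free sketch of the same decode on plain Python lists. It assumes the row layout described above ([x1, y1, x2, y2, conf, cls, coeffs]) and flattened prototypes; the 0.25 confidence threshold is the YOLO default quoted later in the post, while the 0.5 mask binarization threshold and the function names are assumptions. Cropping masks to their boxes and resizing to the source image are left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_end_to_end(rows, protos, conf_thres=0.25, num_coeffs=32):
    """rows: [x1, y1, x2, y2, conf, cls, *coeffs] per candidate instance.
    protos: num_coeffs flattened prototype maps of equal length."""
    n_pixels = len(protos[0])
    instances = []
    for row in rows:
        conf = row[4]
        if conf < conf_thres:
            continue  # no NMS needed: the model already deduplicated
        coeffs = row[6:6 + num_coeffs]
        # Mask logit at pixel i is the coefficient-weighted sum of prototypes.
        mask = [
            sigmoid(sum(c * p[i] for c, p in zip(coeffs, protos))) >= 0.5
            for i in range(n_pixels)
        ]
        instances.append(
            {"box": row[:4], "conf": conf, "cls": int(row[5]), "mask": mask}
        )
    return instances


# Toy example: 2 prototypes over a 2x2 grid (flattened to 4 pixels).
toy_protos = [[1.0, -2.0, 0.0, 2.0], [0.0, 1.0, -3.0, 1.0]]
toy_rows = [
    [0, 0, 10, 10, 0.9, 3, 1.0, 1.0],  # kept
    [0, 0, 5, 5, 0.1, 1, 1.0, 1.0],    # below the confidence threshold
]
toy_instances = decode_end_to_end(toy_rows, toy_protos, num_coeffs=2)
```

The matrix product in the excerpt does the same weighted sum for all instances and pixels at once; the loop form only spells out what each output pixel is.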

RF-DETR’s mask head won’t trace through the legacy ONNX exporter. It upsamples with antialiased bicubic (aten::_upsample_bicubic2d_aa), which the TorchScript exporter can’t emit. We disable antialiasing just for the export trace — masks are thresholded afterward, so the effect is nil:

def _no_antialias(*args, **kwargs):
    if kwargs.get("antialias"):
        kwargs["antialias"] = False
    return orig_interpolate(*args, **kwargs)

The RF-DETR IR then emits three tensors: boxes (1, 100, 4), logits (1, 100, 91) (COCO’s sparse 1-90 indexing, not contiguous-80 like YOLO), and mask logits at a compressed (1, 100, 96, 96). Those low-res logits must be bilinearly upsampled before thresholding — the documented RF-DETR gotcha, handled in core/rfdetr.py:

up = cv2.resize(mask_logits[q], (orig_w, orig_h), interpolation=cv2.INTER_LINEAR)
mask = sigmoid(up) >= 0.5
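The class-index mismatch mentioned above matters as soon as the two models are compared on the same labels. A small sketch of the mapping from COCO's sparse category ids (1 to 90, with ten unused) to the contiguous 0 to 79 range; the names are illustrative, and whether the repository maps classes this way is not stated in the post.

```python
# COCO category ids run from 1 to 90, but these ten ids are unused,
# which leaves the familiar 80 classes.
_UNUSED_COCO_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

COCO_IDS = [i for i in range(1, 91) if i not in _UNUSED_COCO_IDS]
COCO91_TO_80 = {coco_id: idx for idx, coco_id in enumerate(COCO_IDS)}


def to_contiguous(coco_id):
    """Map a sparse COCO id to a contiguous 0-79 index, or None if unused."""
    return COCO91_TO_80.get(coco_id)


print(to_contiguous(1), to_contiguous(13), to_contiguous(90), to_contiguous(12))
```

Index 0 of the 91-wide logit vector has no COCO category, so it also maps to None here.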

2. Quantize — one NNCF pipeline for both

The fair-comparison move: instead of letting each framework quantize its own way, both FP32 IRs go through the same NNCF code (cli/quantize.py), with weight-only (compress_weights) and full PTQ (nncf.quantize calibrated on coco128-seg). The only per-model difference is the preset — RF-DETR, being a transformer, gets ModelType.TRANSFORMER so SmoothQuant protects its attention and LayerNorm:

kwargs = {"model_type": nncf.ModelType.TRANSFORMER} if name == "rfdetr" else {}
quantized = nncf.quantize(core.read_model(src), nncf.Dataset(tensors), **kwargs)

Calibration reuses each model’s inference preprocessing (YOLO letterboxes; RF-DETR squashes + ImageNet-normalizes), so the quantized path sees the exact input distribution it will see at runtime.
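The two preprocessing styles differ in geometry, which is why calibration has to reuse each model's own. A rough sketch of the arithmetic, assuming a conventional centered letterbox (aspect-preserving resize plus padding) for YOLO and a plain stretch to the target size followed by the standard ImageNet mean/std for RF-DETR; the function names and the centered-padding detail are assumptions, not taken from the repository.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def letterbox_geometry(width, height, target=640):
    """Aspect-preserving resize into a target square, padded to fill it."""
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) / 2, (target - new_h) / 2
    return scale, (new_w, new_h), (pad_x, pad_y)


def squash_geometry(width, height, target=384):
    """Stretch to the target square: independent x and y scale factors."""
    return target / width, target / height


def imagenet_normalize(rgb):
    """Normalize one 8-bit RGB pixel with ImageNet statistics."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


lb = letterbox_geometry(1280, 720)   # a 16:9 frame into 640x640
sq = squash_geometry(1280, 720)      # the same frame into 384x384
```

For a 1280×720 frame the letterbox keeps the aspect ratio (640×360 plus 140 px of padding above and below), while the squash scales x by 0.3 and y by about 0.53, so objects are distorted. Calibrating on the wrong variant would show the quantizer activations it never sees at runtime.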

3. The device-family trap

On a machine with more than one GPU, get_available_devices() returns GPU.0 / GPU.1 rather than a bare GPU. A naive "GPU" in available check then fails and the code silently falls back to CPU, so your “iGPU” benchmark is really a CPU benchmark. The resolver in core/common.py matches families:

if any(d == requested or d.startswith(requested + ".") for d in available):
    return requested

On this machine GPU.0 is the Intel Iris Xe (our target) and GPU.1 is a discrete NVIDIA card OpenVINO also enumerates — GPU correctly resolves to the iGPU.
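A self-contained version of that family match, for illustration. The matching expression is the one from the excerpt; raising an error when nothing matches (instead of falling back to CPU) is a choice made for this sketch, not necessarily what core/common.py does.

```python
def resolve_device(requested, available):
    """Accept an exact device name or a family name such as "GPU"
    when the runtime lists "GPU.0", "GPU.1", and so on."""
    if any(d == requested or d.startswith(requested + ".") for d in available):
        return requested
    raise ValueError(f"{requested!r} not found among {available}")


available = ["CPU", "GPU.0", "GPU.1"]

naive_hit = "GPU" in available            # False: the trap described above
resolved = resolve_device("GPU", available)
exact = resolve_device("GPU.0", available)
```

Failing loudly is the safer default for a benchmark: a silent CPU fallback produces plausible-looking numbers for the wrong device.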

Results

Speed is the raw OpenVINO forward latency — the part quantization actually accelerates, measured identically for both families (mask post-processing is CPU-side Python either way). Each cell is the median of 3 runs of 50 timed passes after warmup, batch 1.
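The timing protocol described above can be sketched with the standard library alone. Here forward is any zero-argument callable standing in for one compiled-model inference. The warmup count, the nearest-rank p90, and taking the median of each statistic across runs are assumptions, since the post does not spell them out.

```python
import math
import statistics
import time


def time_passes(forward, passes=50, warmup=10):
    """Per-pass latencies in milliseconds, after discarding warmup calls."""
    for _ in range(warmup):
        forward()
    samples = []
    for _ in range(passes):
        start = time.perf_counter()
        forward()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Mean, p50, p90 (nearest rank) and passes per second for one run."""
    ordered = sorted(samples_ms)
    mean = statistics.fmean(ordered)
    p90_index = max(0, math.ceil(0.9 * len(ordered)) - 1)
    return {
        "mean": mean,
        "p50": statistics.median(ordered),
        "p90": ordered[p90_index],
        "img_s": 1000.0 / mean if mean > 0 else float("inf"),
    }


def benchmark(forward, runs=3, passes=50, warmup=10):
    """Median across runs of each per-run statistic."""
    per_run = [summarize(time_passes(forward, passes, warmup)) for _ in range(runs)]
    return {key: statistics.median(run[key] for run in per_run) for key in per_run[0]}


# A tail-heavy sample: nine fast passes and one slow one.
skewed = summarize([10.0] * 9 + [100.0])
```

The skewed sample shows why both mean and p50 are worth reporting: a single slow pass nearly doubles the mean while the median does not move.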

Speed — FP32 vs INT8, CPU vs iGPU

model    precision   device   mean ms   p50 ms   p90 ms   img/s
yolo26   fp32        CPU        56.89    35.47   101.95    17.6
yolo26   fp32        GPU        19.74    19.20    21.52    50.7
yolo26   int8_woq    CPU        57.42    34.77   103.93    17.4
yolo26   int8_woq    GPU        19.96    19.53    21.95    50.1
yolo26   int8_full   CPU        32.98    17.22    64.94    30.3
yolo26   int8_full   GPU         9.32     9.36     9.52   107.3
rfdetr   fp32        CPU       429.18   302.36   730.56     2.3
rfdetr   fp32        GPU        86.61    85.63    91.65    11.5
rfdetr   int8_woq    CPU       322.23   193.74   596.09     3.1
rfdetr   int8_woq    GPU        90.75    87.33   102.95    11.0
rfdetr   int8_full   CPU       233.51   142.05   430.10     4.3
rfdetr   int8_full   GPU        68.81    66.46    77.43    14.5

Figure: Forward latency, FP32 vs INT8, CPU vs iGPU.

Mask quality retention (INT8 vs FP32 reference)

Foreground-mask IoU of each INT8 model against its own FP32 output, over 30 coco128-seg images, plus the mean instance count. (Computed in ACCURACY execution mode — see Measurement notes for why that matters.)

model    precision   mask IoU vs fp32   mean #inst   fp32 #inst
yolo26   int8_woq    0.959              4.5          4.4
yolo26   int8_full   0.870              3.5          4.4
rfdetr   int8_woq    0.958              4.5          4.5
rfdetr   int8_full   0.950              4.4          4.5
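The retention metric in this table is a foreground-union IoU: all instance masks of one prediction are merged into a single foreground map, then compared with the merged FP32 map. A dependency-free sketch on nested lists follows; the function names are illustrative, and returning 1.0 when both maps are empty is an assumption.

```python
def foreground_union(instance_masks, height, width):
    """OR all per-instance boolean masks into a single foreground map."""
    union = [[False] * width for _ in range(height)]
    for mask in instance_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    union[y][x] = True
    return union


def foreground_iou(masks_a, masks_b, height, width):
    """Instance-agnostic IoU between two sets of instance masks."""
    fa = foreground_union(masks_a, height, width)
    fb = foreground_union(masks_b, height, width)
    inter = sum(fa[y][x] and fb[y][x] for y in range(height) for x in range(width))
    union = sum(fa[y][x] or fb[y][x] for y in range(height) for x in range(width))
    return inter / union if union else 1.0


# Toy 2x3 frame. FP32 finds two instances covering 4 pixels in total.
fp32 = [
    [[True, True, False], [False, False, False]],
    [[False, False, False], [False, True, True]],
]
# INT8 finds the first instance intact but only part of the second.
int8 = [
    [[True, True, False], [False, False, False]],
    [[False, False, False], [False, False, True]],
]
# One merged instance covering exactly the FP32 foreground.
merged = [[[True, True, False], [False, True, True]]]

iou_partial = foreground_iou(fp32, int8, 2, 3)
iou_merged = foreground_iou(fp32, merged, 2, 3)
```

The merged case scores 1.0 even though the instance count dropped from two to one, which is why the table reports instance counts next to the IoU.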

The headline numbers

I lead with same-device ratios, because those are the stable signal — the CPU baseline on this P/E-core part is noisy (see Measurement notes).

  • Full INT8 is ~2.1× faster than FP32 on the same iGPU (9.32 vs 19.74 ms) — the clean, apples-to-apples quantization win.
  • YOLO26 is ~7.4× faster than RF-DETR at INT8 on the iGPU (9.32 vs 68.81 ms), despite running at 640 while RF-DETR runs at 384.
  • In absolute terms that’s ~107 forward passes/s on the iGPU for YOLO26. Against the CPU means in the table (32.98 ms at full INT8, 56.89 ms at FP32) that is very roughly 3.5–6×, but treat the CPU multiplier as a ballpark, because the CPU latency distribution is heavily tail-skewed on this machine.
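The headline ratios follow directly from the mean latencies in the speed table. A short check of the arithmetic, with the values copied from the table and an illustrative speedup helper:

```python
# Mean forward latency in ms, copied from the speed table.
MEAN_MS = {
    ("yolo26", "fp32", "GPU"): 19.74,
    ("yolo26", "int8_full", "GPU"): 9.32,
    ("yolo26", "fp32", "CPU"): 56.89,
    ("yolo26", "int8_full", "CPU"): 32.98,
    ("rfdetr", "fp32", "GPU"): 86.61,
    ("rfdetr", "int8_full", "GPU"): 68.81,
}


def speedup(slow, fast):
    """How many times faster the `fast` cell is than the `slow` cell."""
    return MEAN_MS[slow] / MEAN_MS[fast]


yolo_gpu_int8 = ("yolo26", "int8_full", "GPU")

int8_gain_yolo = speedup(("yolo26", "fp32", "GPU"), yolo_gpu_int8)
int8_gain_rfdetr = speedup(("rfdetr", "fp32", "GPU"), ("rfdetr", "int8_full", "GPU"))
yolo_vs_rfdetr = speedup(("rfdetr", "int8_full", "GPU"), yolo_gpu_int8)
passes_per_s = 1000.0 / MEAN_MS[yolo_gpu_int8]
cpu_low = speedup(("yolo26", "int8_full", "CPU"), yolo_gpu_int8)
cpu_high = speedup(("yolo26", "fp32", "CPU"), yolo_gpu_int8)

print(f"{int8_gain_yolo:.2f}x  {int8_gain_rfdetr:.2f}x  {yolo_vs_rfdetr:.2f}x  "
      f"{passes_per_s:.1f}/s  {cpu_low:.1f}-{cpu_high:.1f}x")
```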

Measurement notes

A few things to read the tables honestly — and to reproduce them fairly:

  • Forward-only. Latency and “img/s” are the raw OpenVINO forward pass (batch 1), not end-to-end FPS. Mask assembly (prototype matmul / query upsampling) and NMS-free decode run in CPU Python on top, so real per-frame FPS is lower. Forward-only is deliberate: it isolates what quantization actually changes and measures both families identically.
  • Run-to-run variance. Each cell is the median of 3 runs, but the iGPU forward latency still moves ±10–20% across runs (thermal + whatever else the laptop is doing) — FP32-iGPU in particular drifted between 19.7 and 21.6 ms across my runs. The same-device ratios are the stable signal — lean on them, not the third decimal.
  • Quality eval runs in ACCURACY mode for a reason. Speed uses the deployment-realistic PERFORMANCE hint (reduced precision — where INT8 gets its iGPU win). But PERFORMANCE mode can intermittently corrupt a weight-only-INT8 model’s outputs when the same IR was compiled on the GPU earlier in the same process — I caught it producing 300 garbage detections on one run. ACCURACY execution mode (full precision on non-INT8 ops) is immune and isolates the pure quantization effect, so all quality numbers use it. A real OpenVINO gotcha worth knowing if you mix devices in one process.
  • mean vs p50. We report mean. On the iGPU the distribution is tight so it barely matters; on the CPU it is tail-heavy (12700H P/E cores + thermal throttling), so the mean sits well above p50 (56.89 vs 35.47 ms for YOLO26 FP32). If you compare devices, p50 is the fairer central stat.
  • The RF-DETR decode is a simplified reimplementation. I do per-query argmax + threshold; RF-DETR’s official post-processor does a global top-K over the flattened (query × class) score matrix. I validated mine against the native .predict() — instance counts land within ~0.4/image — so it’s faithful, but it is not bit-identical. The within-model INT8-vs-FP32 retention is unaffected either way (same decode on both precisions).
  • The mask metric is retention, not accuracy. “Mask IoU vs fp32” is the instance-agnostic foreground-union IoU against each model’s own FP32 output — it answers “does INT8 still draw the same masks?”, not “are the masks correct?”. Absolute quality would need a COCO mask-mAP run.
  • Confidence thresholds differ (YOLO 0.25, RF-DETR 0.5 — each family’s native default), so cross-model instance counts (4.4 vs 4.5) aren’t directly comparable; the per-model retention is.
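The decode difference noted in the list above (per-query argmax versus a global top-K over the flattened query × class scores) is easiest to see on a toy score matrix. This is a simplified sketch of both strategies on plain lists of per-class probabilities; it is not RF-DETR's actual post-processor, and the function names and the choice of K are illustrative.

```python
def per_query_decode(scores, threshold=0.5):
    """One candidate per query: its best class, kept if above threshold."""
    out = []
    for q, row in enumerate(scores):
        cls = max(range(len(row)), key=row.__getitem__)
        if row[cls] >= threshold:
            out.append((q, cls, row[cls]))
    return out


def global_topk_decode(scores, k, threshold=0.5):
    """Top-K over every (query, class) pair, then the same threshold."""
    flat = [(s, q, c) for q, row in enumerate(scores) for c, s in enumerate(row)]
    flat.sort(key=lambda item: item[0], reverse=True)
    return [(q, c, s) for s, q, c in flat[:k] if s >= threshold]


# 3 queries x 3 classes. Query 0 is confident about two classes.
toy_scores = [
    [0.90, 0.60, 0.10],
    [0.20, 0.30, 0.10],
    [0.10, 0.70, 0.05],
]

argmax_hits = per_query_decode(toy_scores)
topk_hits = global_topk_decode(toy_scores, k=3)
```

Here the global top-K emits query 0 twice (once per confident class) while the per-query argmax emits it once, which is the kind of small instance-count difference the validation against .predict() measures.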

The honest findings

1. Weight-only INT8 buys you nothing on the iGPU. YOLO26 weight-only is 19.96 ms vs 19.74 ms FP32 — indistinguishable. RF-DETR weight-only is actually slower on the iGPU (90.8 vs 86.6 ms): the on-the-fly weight decompression costs more than it saves when activations stay FP. Only full PTQ (activations quantized) unlocks the iGPU’s INT8 math. Weight-only still has a place — it shrinks the model and helps a little on CPU — but it is not your iGPU speed lever.

2. INT8 helps the CNN far more than the transformer on the iGPU. YOLO26 gets a clean ~2.1× from full INT8 on the iGPU. RF-DETR gets only ~1.26× (86.6 → 68.8 ms). Most of RF-DETR’s win comes simply from being on the iGPU at all (CPU 429 ms → iGPU 87 ms); its attention matmuls and the einsum mask head don’t accelerate much under INT8. If you picked RF-DETR expecting INT8 to close the gap to YOLO, it doesn’t.

3. The surprise: the transformer keeps its masks better. I expected DETR’s quant-sensitive attention to degrade masks more. The opposite happened — RF-DETR-Seg held 0.95 foreground-mask IoU under full INT8 while YOLO26 fell to 0.87, dropping from 4.4 to 3.5 instances per image (it loses marginal-confidence objects). NNCF’s TRANSFORMER preset (SmoothQuant) earned its keep. So the choice is a genuine trade-off: YOLO26 for raw speed, RF-DETR when INT8 mask fidelity matters.

4. Resolution caveat, stated plainly. YOLO26 runs at 640 and RF-DETR at 384, their native sizes. Even with that handicap the CNN is ~7.4× cheaper per forward pass at INT8 on the iGPU, so the speed verdict is robust; but the IoU and instance-count numbers are each-model-vs-its-own-FP32, not cross-model accuracy, which would need a COCO mAP run.

Reproduce it

uv sync
uv run seg-download                              # coco128-seg
uv run seg-export   --model all                  # FP32 IR for both
uv run seg-quantize weight-only --model all
uv run seg-quantize full        --model all      # the iGPU speed lever
uv run seg-benchmark --devices CPU,GPU           # tables + latency.png
uv run seg-compare  --precision int8_full --device GPU   # the hero image

Stack: Python 3.11, ultralytics 8.4.56, rfdetr 1.7.x, OpenVINO 2026.2, NNCF, torch 2.12. Hardware: Intel Core i7-12700H, Iris Xe iGPU. Every number above comes from output/benchmark.md / speed.csv produced by seg-benchmark.

Takeaways

  • On an Intel iGPU, full-PTQ INT8 is the speed lever; weight-only is not. Always quantize activations if latency is the goal.
  • YOLO26-seg is the speed king at the nano tier — ~107 forward passes/s on a laptop iGPU (forward-only; see Measurement notes), no NVIDIA in sight.
  • RF-DETR-Seg trades speed for mask fidelity under INT8 (0.95 vs 0.87 IoU). Pick by your budget: throughput → YOLO26; mask quality under quantization → RF-DETR.
  • Match the export to the model. YOLO26 is end-to-end (no NMS to write); RF-DETR needs the antialias-export workaround and explicit mask upsampling.
  • Next: push the same two models onto the Intel NPU, and add a COCO mask-mAP pass to turn “IoU vs FP32” into absolute accuracy.