From Photons to Pixels: Image Formation and Color Spaces, with OpenCV in Python and C++

How a camera turns light into an array of numbers — projection, sampling, quantization, the Bayer sensor — and the color spaces (RGB/BGR, grayscale, HSV, YCrCb, Lab) you convert between every day. The equations, plus runnable OpenCV in both Python and C++.

Luis Condados · June 4, 2026 · Updated June 6, 2026

computer-vision image-processing opencv color-spaces fundamentals

Before any model, filter, or detector touches an image, that image has already been through a whole pipeline: light bounced off a scene, was focused by a lens, landed on a sensor, got sampled and quantized into integers, and was arranged into a grid you call an array. Understanding that pipeline — and the color spaces you reshuffle those integers into — is the foundation everything else sits on. Here it is end to end, with the math and runnable OpenCV in Python and C++.

Figure: The image formation pipeline. Every stage is lossy, and each one shows up later as noise, blur, aliasing, or banding.

1. What a digital image actually is

A scene in front of a camera is continuous: at every point and every wavelength there’s some amount of light. Mathematically we can write the image reaching the sensor as a continuous function

f(x, y) \in \mathbb{R}_{\ge 0}.

A computer can’t store a continuous function, so two things happen. Sampling reads $f$ only on a grid of points spaced $\Delta x, \Delta y$ apart, and quantization rounds each reading to one of a finite set of levels:

I[m, n] = Q\big(f(m\,\Delta x,\, n\,\Delta y)\big), \qquad Q : \mathbb{R} \to \{0, 1, \dots, L-1\}, \quad L = 2^{b}.

For a standard 8-bit image $b = 8$, so $L = 256$ and every pixel is an integer in $[0, 255]$. That’s the whole reason an image is a 2-D array of uint8: sampling gives it width and height, quantization gives it the integer values.
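To make the two steps concrete, here is a minimal pure-Python sketch. The function name `sample_and_quantize`, the assumption that $f$ returns values in $[0, 1]$, and the round-to-nearest quantizer are illustrative choices, not something a real camera exposes. The result is stored row-major, so `image[n][m]` holds the sample at $x = m\,\Delta x$, $y = n\,\Delta y$.

```python
def sample_and_quantize(f, width, height, dx, dy, bits=8):
    """Sample f on a width x height grid, then quantize to L = 2**bits levels.

    Assumes f returns values in [0, 1]; anything outside is clipped.
    """
    top = 2 ** bits - 1                      # largest level, L - 1
    image = []
    for n in range(height):                  # n indexes rows (y)
        row = []
        for m in range(width):               # m indexes columns (x)
            reading = f(m * dx, n * dy)      # sampling: read f only on the grid
            reading = min(max(reading, 0.0), 1.0)
            row.append(int(reading * top + 0.5))   # quantization: round to a level
        image.append(row)
    return image


ramp = lambda x, y: x                        # brightness rises left to right
print(sample_and_quantize(ramp, 5, 1, 0.25, 1.0, bits=8))   # [[0, 64, 128, 191, 255]]
print(sample_and_quantize(ramp, 5, 1, 0.25, 1.0, bits=2))   # [[0, 1, 2, 2, 3]]
```

With only 2 bits, two neighboring samples collapse onto the same level. That is the banding mentioned in the pipeline overview, in miniature.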

2. Image formation inside the camera

Projection: from 3-D scene to 2-D plane

A lens (idealized as a pinhole) projects a 3-D point $(X, Y, Z)$ in camera coordinates onto the image plane at focal length $f$ (a scalar here, not the image function $f(x, y)$ from §1):

x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}.

Converting those metric coordinates to pixel indices adds the focal lengths in pixels $(f_x, f_y)$ and the principal point $(c_x, c_y)$ — the camera intrinsic matrix $K$ [3]:

\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim K \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.

This is the matrix you recover from camera calibration, and it’s what lets you go back and forth between pixels and rays.
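As a quick sanity check of the projection, here is a small sketch that applies $K$ by hand. The helper names `project` and `back_project` and the intrinsics (a 1000-pixel focal length, principal point at the center of a 1920×1080 frame) are made-up values for illustration. Real values come from calibration.

```python
def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (X, Y, Z) to pixel (u, v)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the camera (Z > 0)")
    return fx * X / Z + cx, fy * Y / Z + cy


def back_project(u, v, fx, fy, cx, cy):
    """Pixel (u, v) to the direction of its viewing ray, scaled so Z = 1."""
    return (u - cx) / fx, (v - cy) / fy, 1.0


# Hypothetical intrinsics, not taken from any real camera.
fx, fy, cx, cy = 1000.0, 1000.0, 960.0, 540.0

u, v = project((0.5, -0.25, 2.0), fx, fy, cx, cy)
print(u, v)                                  # 1210.0 415.0
print(back_project(u, v, fx, fy, cx, cy))    # (0.25, -0.125, 1.0)
```

Back-projection recovers a ray, not the point: every $(0.25\,Z, -0.125\,Z, Z)$ lands on the same pixel, which is why depth is lost in a single image.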

From light to numbers: the sensor

Each sensor pixel (“photosite”) collects photons over the exposure time and turns them into a charge. A simplified but useful model of the digital value is linear in the incident irradiance $E$:

I \approx Q\big(g \cdot E \cdot t + n\big),

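where $t$ is exposure time, $g$ is the analog/ISO gain, $n$ lumps all noise sources together (referred to the output), and $Q$ is the analog-to-digital quantizer from §1. Two practical consequences follow. First, photon shot noise is part of the collected charge, so it passes through the amplifier with the signal: raising the gain brightens the image without improving its signal-to-noise ratio. Second, the final rounding to $2^b$ levels is where banding in smooth gradients comes from.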
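The model is easy to play with in code. The sketch below follows the formula as written, with assumptions that are mine rather than the sensor’s: units are chosen so that $g \cdot E \cdot t$ is already in digital levels, noise is Gaussian, and the converter clips at both ends. The name `sensor_value` is hypothetical.

```python
import random


def sensor_value(E, t, g, bits=8, noise_sigma=0.0, rng=random):
    """Digital value I ~ Q(g*E*t + n) for one photosite (toy model)."""
    levels = 2 ** bits
    n = rng.gauss(0.0, noise_sigma) if noise_sigma > 0 else 0.0
    x = g * E * t + n
    x = min(max(x, 0.0), levels - 1.0)       # the converter saturates at both ends
    return int(x + 0.5)                      # Q: round to the nearest level


# A smooth ramp of 17 slightly different irradiances...
ramp = [100 + k / 16 for k in range(17)]
codes = {sensor_value(E, t=1.0, g=1.0) for E in ramp}
print(sorted(codes))                         # [100, 101]: only two levels survive

print(sensor_value(E=1000.0, t=1.0, g=1.0))  # 255: a blown-out highlight
```

Seventeen distinct inputs collapse to two output levels, which is the banding effect. Pass a positive `noise_sigma` to see how noise spreads those values across neighboring levels.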

Color from a gray sensor: the Bayer filter

A silicon sensor only measures intensity — it’s colorblind. To capture color, manufacturers overlay a color filter array (CFA), most commonly the Bayer pattern: a mosaic of red, green, and blue filters with twice as many greens (your eye is most sensitive to green). Each photosite therefore records only one of R, G, or B; the missing two channels at every pixel are interpolated in a step called demosaicing. OpenCV does it for you:

import cv2

# `raw` is a single-channel Bayer mosaic from the sensor. OpenCV's "BayerBG"
# code corresponds to an RGGB layout; pick the code that matches your sensor.
bgr = cv2.cvtColor(raw, cv2.COLOR_BayerBG2BGR)   # demosaic -> 3-channel BGR

#include <opencv2/opencv.hpp>

// `raw` is a single-channel Bayer mosaic from the sensor. OpenCV's "BayerBG"
// code corresponds to an RGGB layout; pick the code that matches your sensor.
cv::Mat bgr;
cv::cvtColor(raw, bgr, cv::COLOR_BayerBG2BGR);   // demosaic -> 3-channel BGR

By the time you call imread, all of this has already happened, but it explains why greens look cleanest and where demosaicing artifacts near sharp edges come from. (The B, G, R channel order, by contrast, is simply OpenCV’s historical convention, not something the sensor imposes.)
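To see what the sensor actually records before demosaicing, the sketch below simulates the mosaic itself. It assumes an RGGB tile (red at the top-left photosite); other layouts only permute the cases. The names `bayer_channel` and `mosaic` are illustrative.

```python
from collections import Counter


def bayer_channel(row, col):
    """Channel recorded at a photosite, assuming an RGGB tile layout."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def mosaic(rgb_image):
    """Keep one channel per pixel from a full image given as rows of (R, G, B)."""
    index = {"R": 0, "G": 1, "B": 2}
    return [[px[index[bayer_channel(r, c)]] for c, px in enumerate(row)]
            for r, row in enumerate(rgb_image)]


counts = Counter(bayer_channel(r, c) for r in range(4) for c in range(4))
print(counts["R"], counts["G"], counts["B"])   # 4 8 4: twice as many greens

flat = [[(10, 20, 30)] * 2 for _ in range(2)]  # a uniform 2x2 RGB patch
print(mosaic(flat))                            # [[10, 20], [20, 30]]
```

Two thirds of the color information at each pixel is simply absent from the mosaic. Demosaicing has to estimate it from neighbors, which is where edge artifacts come from.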

3. The image as an array

Loading an image hands you that grid of integers. The one detail that trips up everyone new to OpenCV: channels are ordered B, G, R, not R, G, B.

import cv2

img = cv2.imread("street.jpg")     # BGR, dtype uint8
print(img.shape, img.dtype)        # (1080, 1920, 3) uint8
h, w, c = img.shape                # rows, cols, channels

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    cv::Mat img = cv::imread("street.jpg");   // BGR, type CV_8UC3
    std::cout << img.rows << "x" << img.cols
              << " channels=" << img.channels() << "\n";  // 1080x1920 channels=3
    int h = img.rows, w = img.cols, c = img.channels();
}

Reading and writing a single pixel

Indexing is (row, col) — i.e. (y, x) — and each pixel is a 3-vector in BGR order:

b, g, r = img[100, 200]            # one pixel at row 100, col 200 (uint8 each)
print(int(b), int(g), int(r))

img[100, 200] = (0, 0, 255)        # paint it pure red (B=0, G=0, R=255)

cv::Vec3b px = img.at<cv::Vec3b>(100, 200);   // (row, col), BGR order
uchar b = px[0], g = px[1], r = px[2];

img.at<cv::Vec3b>(100, 200) = cv::Vec3b(0, 0, 255);  // pure red

// `img` is a flat H*W*3 byte buffer, BGR interleaved (no OpenCV):
//   std::vector<unsigned char> img(H * W * 3);
// Index (row, col, channel) into it by hand:
auto at = [&](int y, int x, int c) -> unsigned char& {
    return img[(y * W + x) * 3 + c];   // (row, col), BGR order
};

unsigned char b = at(100, 200, 0), g = at(100, 200, 1), r = at(100, 200, 2);

at(100, 200, 0) = 0;                    // paint it pure red (B=0, G=0, R=255)
at(100, 200, 1) = 0;
at(100, 200, 2) = 255;

4. Color spaces

A color space is just a choice of axes for the same color information. You convert between them because some tasks are far easier in the right coordinate system. In OpenCV every conversion goes through one function, cvtColor.

Grayscale (luma)

Dropping color collapses three channels to one. It isn’t a plain average — the weights match human luminance sensitivity (Rec. 601):

Y = 0.299\,R + 0.587\,G + 0.114\,B.

gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # shape (H, W), single channel

cv::Mat gray;
cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);   // single-channel CV_8UC1

// img: flat H*W*3 BGR bytes. gray: H*W bytes. Just the Rec. 601 formula,
// applied pixel by pixel. cvtColor computes the same weighted sum (in
// fixed-point arithmetic, so its rounding may differ by one level).
std::vector<unsigned char> gray(H * W);
for (int i = 0; i < H * W; ++i) {
    float b = img[i*3 + 0], g = img[i*3 + 1], r = img[i*3 + 2];
    gray[i] = static_cast<unsigned char>(0.299f*r + 0.587f*g + 0.114f*b + 0.5f);
}
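A one-pixel worked example shows how much the weights matter. The sketch below applies the formula to an orange pixel $(R, G, B) = (240, 120, 30)$ and compares it with a plain average. The helper name `luma_601` is just for illustration.

```python
def luma_601(r, g, b):
    """Rec. 601 luma of one 8-bit RGB pixel, rounded to the nearest integer."""
    return int(0.299 * r + 0.587 * g + 0.114 * b + 0.5)


r, g, b = 240, 120, 30                  # an orange pixel
print(luma_601(r, g, b))                # 146  (71.76 + 70.44 + 3.42 = 145.62)
print((r + g + b) // 3)                 # 130  (what a plain average would give)

# Equal 8-bit values, very different perceived brightness:
print(luma_601(255, 0, 0), luma_601(0, 255, 0), luma_601(0, 0, 255))   # 76 150 29
```

Pure green comes out about five times brighter than pure blue, which matches how the two look side by side on a screen.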

HSV — hue, saturation, value

RGB mixes color and brightness together, which makes “find the red things” hard when lighting changes. HSV separates what the color is (hue) from how vivid (saturation) and how bright (value). With $R,G,B \in [0,1]$, let $M = \max(R,G,B)$, $m = \min(R,G,B)$, and chroma $C = M - m$:

V = M, \qquad S = \begin{cases} 0 & M = 0 \\[2pt] C / M & \text{otherwise} \end{cases}

H = 60^\circ \times \begin{cases} 0 & C = 0 \\[2pt] \big((G - B)/C\big) \bmod 6 & M = R \\[2pt] (B - R)/C + 2 & M = G \\[2pt] (R - G)/C + 4 & M = B \end{cases}

A gotcha worth memorizing: in 8-bit OpenCV, hue is stored in $[0, 179]$ (degrees halved to fit a byte), while $S$ and $V$ use the full $[0, 255]$.

Worked example — the same orange pixel in HSV

Normalize $R{=}240,G{=}120,B{=}30$ to $[0,1]$: $R{=}0.941,G{=}0.471,B{=}0.118$. Then $M{=}0.941$ (red is largest), $m{=}0.118$, chroma $C{=}0.824$:

V = M \approx 0.94, \quad S = \tfrac{C}{M} = \tfrac{0.824}{0.941} \approx 0.88, \quad H = 60^\circ \times \tfrac{G-B}{C} = 60^\circ \times \tfrac{0.471-0.118}{0.824} \approx 26^\circ.

What it means: orange is hue ≈ 26° (just past red at 0°), highly saturated (88%), and bright (94%). In OpenCV’s 8-bit frame that hue becomes 13 (26 ÷ 2). If you thresholded around 26 instead, you would miss the orange entirely and pick up yellows near 52°: the halved-hue gotcha, made concrete.

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)     # H in [0,179], S,V in [0,255]

cv::Mat hsv;
cv::cvtColor(img, hsv, cv::COLOR_BGR2HSV);     // H in [0,179], S,V in [0,255]

// img: flat H*W*3 BGR bytes -> hsv: flat H*W*3 bytes (H in [0,179]).
// Needs <algorithm> and <cmath>. This is the §4 formula, byte by byte.
std::vector<unsigned char> hsv(H * W * 3);
for (int i = 0; i < H * W; ++i) {
    float b = img[i*3+0]/255.f, g = img[i*3+1]/255.f, r = img[i*3+2]/255.f;
    float mx = std::max({r, g, b}), mn = std::min({r, g, b});
    float c = mx - mn;                              // chroma
    float h = 0.f;
    if (c > 0.f) {
        if      (mx == r) h = std::fmod((g - b) / c, 6.f);
        else if (mx == g) h = (b - r) / c + 2.f;
        else              h = (r - g) / c + 4.f;
        h *= 60.f;
        if (h < 0.f) h += 360.f;
    }
    float s = (mx == 0.f) ? 0.f : c / mx;
    // degrees/2, wrapped so hues just below 360 map to 0 instead of 180
    hsv[i*3+0] = static_cast<unsigned char>(static_cast<int>(h * 0.5f + 0.5f) % 180);
    hsv[i*3+1] = static_cast<unsigned char>(s * 255.f + 0.5f);
    hsv[i*3+2] = static_cast<unsigned char>(mx * 255.f + 0.5f);
}
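For a quick cross-check without OpenCV, Python’s standard `colorsys` module implements the same hue/saturation/value definition on $[0, 1]$ inputs. The wrapper below rescales its output to the 8-bit convention described above. `bgr_to_opencv_hsv` is a hypothetical helper, and its rounding may differ slightly from what `cvtColor` produces.

```python
import colorsys


def bgr_to_opencv_hsv(b, g, r):
    """8-bit BGR pixel to (H, S, V) with H in [0, 179] and S, V in [0, 255]."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)  # each in [0, 1]
    h_cv = int(h * 180.0 + 0.5) % 180        # degrees / 2, wrapping 180 back to 0
    return h_cv, int(s * 255.0 + 0.5), int(v * 255.0 + 0.5)


print(bgr_to_opencv_hsv(30, 120, 240))   # (13, 223, 240): the orange pixel
print(bgr_to_opencv_hsv(0, 0, 255))      # (0, 255, 255): pure red
print(bgr_to_opencv_hsv(1, 0, 255))      # hue wraps to 0, not 180
```

The last call is the edge case worth testing in any hand-written converter: a red with a trace of blue has a hue just under 360°, which must wrap to 0 after halving and rounding.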

Now you try

Start from our orange (240, 120, 30) and work the formulas by hand, or run the conversions above, to confirm $Y = 146$, $H \approx 26^\circ$, and OpenCV hue 13. Then vary R, G, and B one at a time and watch the same color re-expressed as grayscale luma and HSV, with the OpenCV 8-bit hue always at half the degrees.

YCrCb — luma plus chroma

This is the space behind JPEG and most video. It keeps the luma $Y$ and stores two color-difference channels (Rec. 601, 8-bit, with offset $\delta = 128$):

Y = 0.299R + 0.587G + 0.114B, \quad C_r = (R - Y)\cdot 0.713 + \delta, \quad C_b = (B - Y)\cdot 0.564 + \delta.

Because the eye is far more sensitive to luma than chroma, codecs subsample $C_r, C_b$ (4:2:0) and almost nobody notices — a direct, daily payoff of the camera→color-space chain.

ycrcb = cv2.cvtColor(img, cv2.COLOR_BGR2YCrCb)   # channels: Y, Cr, Cb

cv::Mat ycrcb;
cv::cvtColor(img, ycrcb, cv::COLOR_BGR2YCrCb);   // channels: Y, Cr, Cb

// img: flat H*W*3 BGR bytes -> ycrcb: flat H*W*3 bytes (Y, Cr, Cb).
// The linear Rec. 601 transform with offset delta = 128.
std::vector<unsigned char> ycrcb(H * W * 3);
const float delta = 128.f;
for (int i = 0; i < H * W; ++i) {
    float b = img[i*3+0], g = img[i*3+1], r = img[i*3+2];
    float y  = 0.299f*r + 0.587f*g + 0.114f*b;
    float cr = (r - y) * 0.713f + delta;
    float cb = (b - y) * 0.564f + delta;
    ycrcb[i*3+0] = static_cast<unsigned char>(y  + 0.5f);
    ycrcb[i*3+1] = static_cast<unsigned char>(cr + 0.5f);
    ycrcb[i*3+2] = static_cast<unsigned char>(cb + 0.5f);
}
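Here is the same transform on a single pixel, followed by a toy version of 4:2:0 subsampling. Both helper names are illustrative, and averaging each 2×2 block is only one possible downsampling filter; real codecs differ in filter choice and sample siting.

```python
def bgr_to_ycrcb(b, g, r, delta=128):
    """Rec. 601 YCrCb of one 8-bit BGR pixel, using the formulas above."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = (r - y) * 0.713 + delta
    cb = (b - y) * 0.564 + delta
    clamp = lambda x: max(0, min(255, int(x + 0.5)))
    return clamp(y), clamp(cr), clamp(cb)


def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (even dimensions assumed)."""
    return [[(plane[r][c] + plane[r][c + 1]
              + plane[r + 1][c] + plane[r + 1][c + 1] + 2) // 4
             for c in range(0, len(plane[0]), 2)]
            for r in range(0, len(plane), 2)]


print(bgr_to_ycrcb(30, 120, 240))        # (146, 195, 63): the orange pixel
print(bgr_to_ycrcb(128, 128, 128))       # (128, 128, 128): gray has neutral chroma

cr_plane = [[100, 102], [104, 106]]
print(subsample_420(cr_plane))           # [[103]]: four chroma samples become one
```

After subsampling, each chroma plane holds a quarter of its original samples, so the three planes together need half the storage of full-resolution data while luma detail is untouched.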

CIELAB — perceptually uniform

Lab is designed so that equal numerical distances look like roughly equal color differences to a human — handy for color comparison and matching [4]. It’s a nonlinear transform through CIE XYZ, with $X_n, Y_n, Z_n$ the reference white:

L^* = 116\,f\!\left(\tfrac{Y}{Y_n}\right) - 16, \quad a^* = 500\left[f\!\left(\tfrac{X}{X_n}\right) - f\!\left(\tfrac{Y}{Y_n}\right)\right], \quad b^* = 200\left[f\!\left(\tfrac{Y}{Y_n}\right) - f\!\left(\tfrac{Z}{Z_n}\right)\right]

f(t) = \begin{cases} t^{1/3} & t > \delta^3 \\[2pt] \dfrac{t}{3\delta^2} + \dfrac{4}{29} & \text{otherwise} \end{cases}, \qquad \delta = \tfrac{6}{29}.

lab = cv2.cvtColor(img, cv2.COLOR_BGR2Lab)   # L in [0,255], a,b offset by 128

cv::Mat lab;
cv::cvtColor(img, lab, cv::COLOR_BGR2Lab);   // L in [0,255], a,b offset by 128
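The formulas above can be followed step by step in plain Python. This sketch makes several assumptions that are worth stating: the input is sRGB-encoded and is linearized first, the RGB→XYZ matrix and D65 white point are the values I recall from the OpenCV documentation, and the 8-bit packing follows the ranges in the comment above. The function names are illustrative and the output is not guaranteed to match `cvtColor` bit for bit.

```python
def srgb_to_linear(c):
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def f_lab(t, delta=6 / 29):
    """The piecewise cube-root function f(t) from the Lab definition."""
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def bgr_to_lab(b, g, r):
    """8-bit sRGB pixel (BGR order) to CIELAB (L in [0, 100], signed a and b)."""
    R, G, B = (srgb_to_linear(v / 255.0) for v in (r, g, b))
    # Linear RGB -> CIE XYZ (assumed sRGB / Rec. 709 primaries, D65 white).
    X = 0.412453 * R + 0.357580 * G + 0.180423 * B
    Y = 0.212671 * R + 0.715160 * G + 0.072169 * B
    Z = 0.019334 * R + 0.119193 * G + 0.950227 * B
    Xn, Yn, Zn = 0.950456, 1.0, 1.088754         # assumed D65 reference white
    fx, fy, fz = f_lab(X / Xn), f_lab(Y / Yn), f_lab(Z / Zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def to_opencv_8bit(L, a, b):
    """Pack Lab into bytes: L scaled to [0, 255], a and b offset by 128."""
    return int(L * 255 / 100 + 0.5), int(a + 128 + 0.5), int(b + 128 + 0.5)


print(to_opencv_8bit(*bgr_to_lab(255, 255, 255)))   # (255, 128, 128): white
print(to_opencv_8bit(*bgr_to_lab(0, 0, 0)))         # (0, 128, 128): black
print(bgr_to_lab(30, 120, 240))                     # orange: positive a* and b*
```

White and black both land on the neutral axis ($a^* = b^* = 0$), and the orange pixel comes out with positive $a^*$ (toward red) and positive $b^*$ (toward yellow), as expected.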

Splitting and merging channels

Whatever space you’re in, you can pull it apart and put it back:

b, g, r = cv2.split(img)        # three single-channel images
merged  = cv2.merge([b, g, r])  # back to one 3-channel image

std::vector<cv::Mat> ch;
cv::split(img, ch);             // ch[0]=B, ch[1]=G, ch[2]=R
cv::Mat merged;
cv::merge(ch, merged);

// Split an interleaved BGR buffer into three planar channels, then merge back.
int n = H * W;
std::vector<unsigned char> B(n), G(n), R(n);
for (int i = 0; i < n; ++i) {        // de-interleave
    B[i] = img[i*3 + 0];
    G[i] = img[i*3 + 1];
    R[i] = img[i*3 + 2];
}

std::vector<unsigned char> merged(n * 3);
for (int i = 0; i < n; ++i) {        // re-interleave
    merged[i*3 + 0] = B[i];
    merged[i*3 + 1] = G[i];
    merged[i*3 + 2] = R[i];
}

5. A practical payoff: segmenting by color in HSV

Here’s why all of this matters. Picking out red objects in RGB is fiddly; in HSV it’s a hue window. Red is the awkward case because its hue wraps around 0, so we union two ranges:

import cv2
import numpy as np

hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# Red wraps around hue = 0, so combine the low and high ends.
mask1 = cv2.inRange(hsv, np.array([0, 120, 70]),   np.array([10, 255, 255]))
mask2 = cv2.inRange(hsv, np.array([170, 120, 70]), np.array([179, 255, 255]))
mask  = mask1 | mask2

result = cv2.bitwise_and(img, img, mask=mask)   # keep only the red pixels

cv::Mat hsv, mask1, mask2, mask, result;
cv::cvtColor(img, hsv, cv::COLOR_BGR2HSV);

// Red wraps around hue = 0, so combine the low and high ends.
cv::inRange(hsv, cv::Scalar(0, 120, 70),   cv::Scalar(10, 255, 255),  mask1);
cv::inRange(hsv, cv::Scalar(170, 120, 70), cv::Scalar(179, 255, 255), mask2);
cv::bitwise_or(mask1, mask2, mask);

cv::bitwise_and(img, img, result, mask);        // keep only the red pixels

// hsv: flat H*W*3 bytes (from the conversion above). inRange + bitwise_and
// are just a per-pixel test and a copy — no library needed.
int n = H * W;
auto in = [](unsigned char x, int lo, int hi) { return x >= lo && x <= hi; };

std::vector<unsigned char> mask(n);
for (int i = 0; i < n; ++i) {
    unsigned char h = hsv[i*3+0], s = hsv[i*3+1], v = hsv[i*3+2];
    bool red = (in(h, 0, 10)   && in(s, 120, 255) && in(v, 70, 255))   // low end
            || (in(h, 170, 179) && in(s, 120, 255) && in(v, 70, 255)); // hue wraps
    mask[i] = red ? 255 : 0;
}

std::vector<unsigned char> result(n * 3, 0);    // keep only the red pixels
for (int i = 0; i < n; ++i)
    if (mask[i])
        for (int c = 0; c < 3; ++c) result[i*3+c] = img[i*3+c];

A threshold that would be brittle in RGB holds up far better under lighting changes in HSV, purely because we chose better axes for the question.
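The two-range trick for red generalizes. If you treat hue as a circle of 180 steps, a single circular-distance test handles any color, including ones whose window straddles 0. The helpers below are a sketch with illustrative names, working in OpenCV’s 8-bit hue units.

```python
def degrees_to_cv_hue(deg):
    """Hue in degrees [0, 360) to OpenCV's 8-bit hue units [0, 179]."""
    return int(deg / 2 + 0.5) % 180


def hue_in_window(h, center, half_width):
    """True if hue h lies within half_width of center on the 180-step circle."""
    diff = abs(h - center) % 180
    return min(diff, 180 - diff) <= half_width


print(degrees_to_cv_hue(26))            # 13: the orange from the worked example
print(hue_in_window(175, 0, 10))        # True: 175 is only 5 steps from 0
print(hue_in_window(5, 175, 10))        # True: distance is symmetric
print(hue_in_window(13, 0, 10))         # False: orange falls outside the red window
```

A window centered on 0 with half-width 10 covers hues 0 to 10 and 170 to 179, so the red mask above could be written as one test per pixel, combined with the same saturation and value limits.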

Limitations & caveats

The pinhole model is an idealization. Real lenses add radial/tangential distortion, vignetting, and depth-of-field effects the projection equations ignore — calibration recovers $K$ and distortion coefficients [3].
The linear sensor model is simplified. Actual pipelines apply gamma/tone curves, white balance, and denoising before you see the pixels, so values aren’t linear in scene radiance by the time you imread them [1].
8-bit ranges and conventions bite. OpenCV is BGR, indexed (row, col), with hue in $[0, 179]$ — mixing these up is the most common color-space bug [2].
Color conversions assume a color space. cvtColor’s Lab math assumes sRGB primaries and a D65 reference white, while YCrCb is a fixed linear transform of whatever values you pass in; feed either a wide-gamut or linear image and the perceptual guarantees no longer hold [4].

Takeaways

An image is sampling + quantization of continuous light: that’s why it’s a grid of integers in $[0, 255]$, and where aliasing and banding originate.
The camera pipeline is lossy at every stage (projection, sensor noise, demosaicing, quantization); artifacts you fight later are born here.
OpenCV is BGR, indexed (row, col) — internalize this once and stop fighting it.
Color spaces are coordinate choices. Convert with cvtColor; reach for grayscale to drop color, HSV for color thresholding, YCrCb for compression, Lab for perceptual distance.
Watch the ranges: 8-bit hue lives in $[0, 179]$, not $[0, 360]$.

Once light is a clean array of numbers, the fun starts — like quantizing the models that consume those arrays and running them on an integrated GPU. That’s exactly what we do in YOLO26-seg vs RF-DETR-Seg: INT8 instance segmentation on an Intel iGPU.

References

[1] Szeliski, R. (2022). Computer Vision: Algorithms and Applications (2nd ed.), ch. 2 (image formation, sampling & quantization). Springer. Free PDF.

[2] Gonzalez, R. C., & Woods, R. E. (2018). Digital Image Processing (4th ed.), ch. 2 & 6 (digital images and color models). Pearson.

[3] Hartley, R., & Zisserman, A. (2004). Multiple View Geometry in Computer Vision (2nd ed.), ch. 6 (camera models). Cambridge University Press.

[4] Reinhard, E., Khan, E. A., Akyüz, A. O., & Johnson, G. M. (2008). Color Imaging: Fundamentals and Applications. A K Peters/CRC Press.

[5] OpenCV documentation. Color conversions (cvtColor) and Bayer demosaicing.
