Vincent Dumoulin and Francesco Visin. “A Guide to Convolution Arithmetic for Deep Learning.” arXiv preprint arXiv:1603.07285, 2016. https://arxiv.org/abs/1603.07285

Diagrams drawn with Excalidraw.


Setup: Input and Kernel

We start with a $5 \times 5$ input image and a $2 \times 2$ kernel.

5×5 input image and 2×2 kernel


Convolution with Stride $s = 1$

The $2 \times 2$ kernel slides across the $5 \times 5$ input with stride $s = 1$, visiting all 16 positions ($4 \times 4$ grid) from top-left to bottom-right.

2×2 kernel sliding across 5×5 input at all 16 positions with stride 1

Computing one output value. Let

$$K = \begin{pmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{pmatrix}, \qquad I = \begin{pmatrix} i_{11} & i_{12} \\ i_{21} & i_{22} \end{pmatrix}$$

where $I$ contains the entries of the top-left $2 \times 2$ patch of the image. Applying $K$ to $I$ gives:

$$O_{11} = k_{11}\,i_{11} + k_{12}\,i_{12} + k_{21}\,i_{21} + k_{22}\,i_{22}$$

The convolution results in a feature map of size $4 \times 4$.

Kernel K applied to image patch I to compute output value O₁₁


Convolution with Stride $s = 2$

With stride $s = 2$, the kernel jumps two pixels at a time. The convolution results in a feature map of size $2 \times 2$.

Note: the kernel is not applied to the last row and the last column of the input image.

2×2 kernel sliding across 5×5 input with stride 2, last row and column not covered


Zero Padding

Zero padding is used so that we don’t lose the information at the edges of the image.

With padding $p = 1$, the $5 \times 5$ input becomes $7 \times 7$ after padding. In general, with padding $p$, the image grows to $(i + 2p) \times (i + 2p)$.

5×5 input zero-padded with p=1

Applying the $2 \times 2$ kernel with stride $s = 2$ to the zero-padded image:

2×2 kernel with stride 2 applied to the zero-padded image


Feature Map Size Formula (no padding, $s = 1$)

For an $i \times i$ input, $k \times k$ kernel, stride $s = 1$, and padding $p = 0$:

  • $i - k$ is the last index where the kernel can be applied.
  • The number of times the kernel slides horizontally is $i - k + 1$ (indices $0, 1, \ldots, i-k$).
  • The same holds vertically.
$$\boxed{o = (i - k + 1) \times (i - k + 1)}$$

Example: $5 \times 5$ input, $2 \times 2$ kernel $\Rightarrow o = (5 - 2 + 1)^2 = 4 \times 4$.

Derivation of feature map size formula for no padding and stride 1


Feature Map Size Formula (with padding, $s = 1$)

With stride $s = 1$ and non-zero padding $p$, the padded image has size $i + 2p$, so the kernel slides $(i + 2p) - k + 1$ times in each direction:

$$\boxed{o = (i + 2p - k + 1) \times (i + 2p - k + 1)}$$

Derivation of feature map size formula with non-zero padding and stride 1


Special Padding Cases ($s = 1$)

Half Padding

The feature map has the same dimensions as the input: $o = i$.

Substituting $o = i$ into $o = i + 2p - k + 1$:

$$i = i + 2p - k + 1 \implies 2p = k - 1 \implies p = \frac{k-1}{2}$$

When $k$ is odd this is an integer; in general $p = \lfloor (k-1)/2 \rfloor$.

Full Padding

Every pixel of the input is visited as the first element of some patch. Without padding the last kernel position is $i - k$, so the first $k - 1$ pixels of the input are never a patch’s starting element. Setting $p = k - 1$ lets the kernel start at $-(k-1)$, fixing that:

$$o = i + 2(k-1) - k + 1 = i + k - 1$$

This produces a larger feature map than the $p = 0$ case.

Examples:

$p$ Input Kernel $o$
1 $5 \times 5$ $3 \times 3$ $5 + 2 - 3 + 1 = 5$
0 $5 \times 5$ $2 \times 2$ $5 + 0 - 2 + 1 = 4$
1 $5 \times 5$ $2 \times 2$ $5 + 2 - 2 + 1 = 6$

Half padding and full padding cases with computed output size examples


Feature Map Size Formula (stride $s > 1$)

The last index covered is $i - k$. With stride $s > 1$ and no padding:

$$o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1$$

With non-zero padding:

$$\boxed{o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1}$$

Derivation of feature map size formula for stride greater than 1


Multi-Channel (RGB) Convolution

For an RGB image, the same kernel is applied to each channel independently, producing feature maps $C_1$, $C_2$, $C_3$. The combined output is:

$$C = C_1 + C_2 + C_3$$

RGB image convolved channel-independently producing feature maps C1, C2, C3


Pooling

Pooling achieves translation invariance. Two common types:

  • Max pooling (stride $s = 2$): takes the maximum pixel value from each $k \times k$ region.
  • Average pooling (stride $s = 2$): takes the average of all pixels in each $k \times k$ region.

The output size formula is the same as convolution with $p = 0$:

$$o = \left\lfloor \frac{i - k}{s} \right\rfloor + 1$$

Example ($4 \times 4$ input, $2 \times 2$ pool, $s = 2$):

$$\text{Max pooling} = \begin{pmatrix} 6 & 8 \\ 14 & 16 \end{pmatrix} \qquad \text{Average pooling} = \begin{pmatrix} 3.5 & 5.5 \\ 11.5 & 13.5 \end{pmatrix}$$

Max pooling and average pooling on a 4×4 input with 2×2 pool and stride 2


Convolution as Matrix Multiplication

Consider a $4 \times 4$ input $A$, $2 \times 2$ kernel $W$, and stride $s = 2$:

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}, \qquad W = \begin{pmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \end{pmatrix}$$

Flatten each $2 \times 2$ patch (left-to-right, top-to-bottom) into a row, and flatten the kernel into a column vector $\mathbf{w} = (w_{11}, w_{12}, w_{21}, w_{22})^\top$:

$$\begin{pmatrix} a_{11} & a_{12} & a_{21} & a_{22} \\ a_{13} & a_{14} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{41} & a_{42} \\ a_{33} & a_{34} & a_{43} & a_{44} \end{pmatrix} \begin{pmatrix} w_{11} \\ w_{12} \\ w_{21} \\ w_{22} \end{pmatrix} = \begin{pmatrix} a_{11}w_{11} + a_{12}w_{12} + a_{21}w_{21} + a_{22}w_{22} \\ a_{13}w_{11} + a_{14}w_{12} + a_{23}w_{21} + a_{24}w_{22} \\ a_{31}w_{11} + a_{32}w_{12} + a_{41}w_{21} + a_{42}w_{22} \\ a_{33}w_{11} + a_{34}w_{12} + a_{43}w_{21} + a_{44}w_{22} \end{pmatrix}$$

Reshaping this $4 \times 1$ result as $2 \times 2$ gives the feature map:

$$\text{feature map} = \begin{pmatrix} a_{11}w_{11}+a_{12}w_{12}+a_{21}w_{21}+a_{22}w_{22} & a_{13}w_{11}+a_{14}w_{12}+a_{23}w_{21}+a_{24}w_{22} \\ a_{31}w_{11}+a_{32}w_{12}+a_{41}w_{21}+a_{42}w_{22} & a_{33}w_{11}+a_{34}w_{12}+a_{43}w_{21}+a_{44}w_{22} \end{pmatrix}$$

Convolution expressed as matrix multiplication using im2col patch flattening


The Sparse Matrix $C$

Setup: $4 \times 4$ input, $2 \times 2$ kernel, stride $s = 2$.

$$o = \left\lfloor \frac{4-2}{2} \right\rfloor + 1 = 2 \quad \Rightarrow \quad \text{feature map is } 2 \times 2$$
  • Total pixels in feature map $= 2 \times 2 = 4$ → number of rows in $C$
  • Total pixels in input $= 4 \times 4 = 16$ → number of columns in $C$

So $C$ is a $4 \times 16$ sparse matrix:

$$C = \begin{pmatrix} w_1 & w_2 & 0 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & w_1 & w_2 & 0 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & w_1 & w_2 & 0 & 0 & w_3 & w_4 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & w_1 & w_2 & 0 & 0 & w_3 & w_4 \end{pmatrix}$$

$C \cdot I$ (where $I$ is the input flattened in row-major order) has dimensions $4 \times 1$, which reshapes to the output $O$ ($2 \times 2$).

  • $C$ is used in the forward pass
  • $C^\top$ is used in the backward pass

Sparse matrix C constructed from kernel weights, mapping 4×4 input to 2×2 output


Transposed Convolution

  • Convolution: downsampling
  • Transposed convolution: upscaling

Using the same $4 \times 4$ input, $2 \times 2$ kernel, stride $s = 2$, the feature map $O$ is $2 \times 2$, flattened as $O = (a,\, b,\, c,\, d)^\top$.

The transpose $C^\top$ has dimensions $16 \times 4$. The upscaled feature map is:

$$C^\top \cdot O = \begin{pmatrix} w_1 a \\ w_2 a \\ w_1 b \\ w_2 b \\ w_3 a \\ w_4 a \\ w_3 b \\ w_4 b \\ w_1 c \\ w_2 c \\ w_1 d \\ w_2 d \\ w_3 c \\ w_4 c \\ w_3 d \\ w_4 d \end{pmatrix}$$

Reshaped as $4 \times 4$:

$$\begin{pmatrix} w_1 a & w_2 a & w_1 b & w_2 b \\ w_3 a & w_4 a & w_3 b & w_4 b \\ w_1 c & w_2 c & w_1 d & w_2 d \\ w_3 c & w_4 c & w_3 d & w_4 d \end{pmatrix}$$

Transposed convolution via C-transpose multiplied by flattened feature map, reshaped to 4×4

Transposed Convolution Output Size

Let $[a\ b\ c]$ be the input. Each element is expanded by applying the kernel, and results are accumulated with overlaps determined by the stride.

  1. $k=3$, $s=1$, $p=0$: Since $s < k$, there is a $k - s = 2$ overlap between blocks.

    $$a_1\ (a_2+b_1)\ (a_3+b_2+c_1)\ (b_3+c_2)\ c_3 \qquad \text{output size} = 5$$
  2. $k=3$, $s=3$, $p=0$: Since $s = k$, there is no overlap.

    $$a_1\ a_2\ a_3\ b_1\ b_2\ b_3\ c_1\ c_2\ c_3 \qquad \text{output size} = 9$$
  3. $k=3$, $s=4$, $p=0$: Since $s > k$, zeros are inserted between blocks.

    $$a_1\ a_2\ a_3\ 0\ b_1\ b_2\ b_3\ 0\ c_1\ c_2\ c_3 \qquad \text{output size} = 11$$
  4. $k=3$, $s=3$, $p=2$: Expand to $a_1\,a_2\,a_3\,b_1\,b_2\,b_3\,c_1\,c_2\,c_3$, then remove the first and last $p=2$ elements (which interacted with padded zeros): $a_3\ b_1\ b_2\ b_3\ c_1$, output size $= 5$.

In general:

$$\boxed{o' = s(i - 1) + k - 2p}$$

Transposed convolution output size examples for different strides and padding values


Dilated Convolution

Dilation increases the effective kernel size. For a dilation rate $d$, $(d-1)$ zeros are inserted between each pair of kernel elements.

Example: let $k = [a\ b\ c]$ and $d = 3$. The dilated kernel becomes $[a\ d_1\ d_2\ b\ d_3\ d_4\ c]$, effectively increasing the size by $(k-1)(d-1)$.

New kernel size:

$$k' = k + (k-1)(d-1)$$

For input $i$, stride $s$, dilation rate $d$, padding $p$, kernel $k$:

$$\boxed{o = \left\lfloor \frac{i + 2p - k - (k-1)(d-1)}{s} \right\rfloor + 1}$$

Example: a $3 \times 3$ kernel with dilation $d = 3$ becomes a $7 \times 7$ dilated kernel (with zeros at non-kernel positions).

3×3 kernel with dilation rate 3 expanded to an effective 7×7 kernel