**THIS UNIT IS STILL UNDER CONSTRUCTION**

We begin with some background about convolutions and then move on to neural networks.

A **convolution** is an operation on two vectors, matrices, or tensors, that returns a third vector, matrix, or tensor.

To get a feel for the importance of convolutions, let us first consider **linear time invariant** (LTI) systems. These are concepts from control theory and signal processing that have influenced machine learning and led to the development of convolutional neural networks. Here we focus on scalar valued, discrete time systems (e.g. \(t = \ldots,-2,-1,0,1,2,\ldots\)).

A **system** maps an input sequence to an output sequence. An LTI system, denoted \({\cal O}\Big(\cdot\Big)\), is a system with the following two properties:

**Linearity**: For input signals \(x_1(t)\) and \(x_2(t)\) and scalars \(\alpha_1\) and \(\alpha_2\), \[ {\cal O}\Big(\alpha_1 x_1(\cdot) + \alpha_2 x_2(\cdot) \Big)(t) = \alpha_1 {\cal O}\Big( x_1(\cdot)\Big)(t) + \alpha_2 {\cal O}\Big( x_2(\cdot) \Big)(t). \]

**Time invariance**: Assume the input signal \(x(\cdot)\) has output \(y(\cdot) = {\cal O}\Big(x(\cdot) \Big)\). Take now the time shifted input \(\tilde{x}(t) = x(t-\tau)\); then the output \(\tilde{y}(\cdot) = {\cal O}\Big(\tilde{x}(\cdot)\Big)\) is the time shifted version of \(y(\cdot)\), namely \(\tilde{y}(t) = y(t - \tau)\).

It turns out that many operations on signals are well described by LTI systems. Further, it turns out that the operation of any LTI system on an input signal is equivalent to a convolution of the input signal with a special signal associated with the system, denoted the **impulse response**. We describe both convolutions and the impulse response shortly.

Let \(\delta(t)\) be the sequence such that \(\delta(0) = 1\) and \(\delta(t) = 0\) for \(t \neq 0\). Observe that \(\delta(t-\tau)\), the time shifted version of \(\delta(t)\), is \(1\) at \(t = \tau\) and \(0\) everywhere else. Any signal \(x(t)\) can then be represented as, \[ x(t) = \sum_{\tau = -\infty}^\infty x(\tau)\delta(t-\tau). \]

We can now use this representation together with the LTI properties:
\[
y(t) = {\cal O}\Big(x(t) \Big)
= {\cal O}\Big(\sum_{\tau = -\infty}^\infty x(\tau)\delta(t-\tau) \Big)
= \sum_{\tau = -\infty}^\infty x(\tau) {\cal O} \Big( \delta(t-\tau) \Big).
\]
Here we used the linearity property, since the values \(x(\tau)\) are constants with respect to the time variable \(t\).
Now denote the output for input \(\delta(t)\) as \(h(t)\). This is called the **impulse response**. Then, due to time invariance, \({\cal O}\Big( \delta(t-\tau) \Big) = h(t-\tau)\). We thus arrive at,
\[
y(t) = \sum_{\tau = -\infty}^\infty x(\tau) h(t-\tau).
\]
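For finite-length signals (taken as zero outside their support), this sum can be computed directly. A minimal sketch, with made-up example values for \(x\) and \(h\):

```python
import numpy as np

# Direct implementation of y(t) = sum_tau x(tau) h(t - tau) for finite signals,
# treating x and h as zero outside their supports.
def convolve(x, h):
    n = len(x) + len(h) - 1              # length of the full convolution
    y = [0.0] * n
    for t in range(n):
        for tau in range(len(x)):
            if 0 <= t - tau < len(h):
                y[t] += x[tau] * h[t - tau]
    return y

x = [1, 2, 3]                            # arbitrary example input signal
h = [0, 1, 0.5]                          # arbitrary example impulse response
print(convolve(x, h))                    # agrees with np.convolve(x, h)
```

The double loop makes the definition explicit; in practice one would use a library routine such as `np.convolve`.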
Analogous results exist for continuous time systems, where the impulse response requires considering the generalized function \(\delta(t)\), called the Dirac delta function. In that case the convolution \(x \star h\) is determined via, \[
\int_{\tau = -\infty}^\infty x(\tau)h(t-\tau) \, d\tau.
\]
This is called the **convolution** of \(x(\cdot)\) and \(h(\cdot)\) and is denoted \(y = x \star h\). Hence we see that convolutions arise naturally when considering linear time invariant systems, because the operation of a system on any signal is simply given via the convolution of the signal with the system’s impulse response.

This video illustrates the operation of a (discrete time) convolution.

Convolutions also appear naturally in probability when considering the distribution of a random variable that is the sum of two independent random variables. Consider the discrete valued independent random variables \(X\) and \(Y\) and define \(Z = X+Y\). For example, say that \(X\) is a payoff yielding \(1\), \(2\), or \(3\) with probabilities \(1/4\), \(1/2\), and \(1/4\) respectively. Further, say that \(Y\) is the payoff based on a fair die throw, yielding payoffs \(1,2,3,4,5\), or \(6\), each with probability \(1/6\). What then is the probability distribution of the total payoff \(Z\)?

\[\begin{align*} {\mathbb P}\big(Z = t \big) &= {\mathbb P}\big(X+Y = t\big)\\ &= \sum_{\tau=1}^6 {\mathbb P}\big(X + Y = t ~|~ Y=\tau\big){\mathbb P}\big(Y=\tau \big)\\ &= \sum_{\tau=1}^6 {\mathbb P}\big(X + \tau = t\big){\mathbb P}\big(Y=\tau \big)\\ &= \sum_{\tau=1}^6 {\mathbb P}\big(Y=\tau \big){\mathbb P} \big(X = t - \tau\big) \end{align*}\]

Note that a probability mass function is defined for all integer values and is zero outside the support of the distribution. This allows us to extend the summation beyond \(\tau \in \{1,\ldots,6\}\). Denote now the probability mass functions of \(X\), \(Y\), and \(Z\) (sometimes called probability density functions even in this discrete case) via \(p_X(\cdot)\), \(p_Y(\cdot)\), and \(p_Z(\cdot)\) respectively. Then we see that, \[ p_Z(t) = \sum_{\tau=1}^6 p_Y(\tau) p_X(t-\tau) = \sum_{\tau=-\infty}^\infty p_Y(\tau) p_X(t-\tau) = (p_Y \star p_X)(t). \]

Note that \(p_Z(t)\) is non-zero for \(t \in \{2,\ldots,9\}\). For example, \[ p_Z(2) = \sum_{\tau=1}^6 \frac{1}{6} p_X(2-\tau) = \frac{1}{6} p_X(2-1) = \frac{1}{6} \cdot \frac{1}{4} = \frac{1}{24}. \]

or, \[ p_Z(4) = \sum_{\tau=1}^6 \frac{1}{6} p_X(4-\tau) = \frac{1}{6} \big(p_X(4-1) + p_X(4-2) + p_X(4-3)\big) = \frac{1}{6}, \]

or \[ p_Z(9) = \sum_{\tau=1}^6 \frac{1}{6} p_X(9-\tau) = \frac{1}{6} p_X(9-6) = \frac{1}{6} \cdot \frac{1}{4} = \frac{1}{24}. \] Exercise: Work out the distribution of \(Z\) for all other values of \(t \in \{2,\ldots,9\}\). We thus see that the convolution operation appears naturally in the context of probability.
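As a sanity check, the whole distribution of \(Z\) can be computed with a single discrete convolution of the two probability mass functions. A sketch using NumPy:

```python
import numpy as np

# p_X on support {1, 2, 3}, p_Y on support {1, ..., 6}.
p_X = np.array([1/4, 1/2, 1/4])
p_Y = np.array([1/6] * 6)

# p_Z = p_Y * p_X (convolution); the support of Z is {2, ..., 9},
# so index 0 of p_Z corresponds to t = 2.
p_Z = np.convolve(p_X, p_Y)

print(p_Z[0])       # P(Z = 2) = 1/24
print(p_Z[4 - 2])   # P(Z = 4) = 1/6
print(p_Z[-1])      # P(Z = 9) = 1/24
```

Note that the eight entries of `p_Z` sum to \(1\), as they must for a probability distribution.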

The convolution also arises naturally when multiplying polynomials. Consider for example the polynomial, \[ x(t) = x_0 + x_1 t + x_2 t^2, \] and the polynomial, \[ y(t) = y_0 + y_1 t+ y_2 t^2 + \ldots + y_5 t^5. \] Now set \(z(t) = x(t)\times y(t)\). This is a polynomial of degree \(7 = 5+2\) with coefficients \(z_0,\ldots, z_7\) as follows: \[\begin{align*} z_0 &= x_0 y_0 \\ z_1 &= x_0 y_1 + x_1 y_0 \\ z_2 &= x_0 y_2 + x_1 y_1 + x_2 y_0 \\ z_3 &= x_0 y_3 + x_1 y_2 + x_2 y_1\\ z_4 &= x_0 y_4 + x_1 y_3 + x_2 y_2\\ z_5 &= x_0 y_5 + x_1 y_4 + x_2 y_3\\ z_6 &= x_1 y_5 + x_2 y_4\\ z_7 &= x_2 y_5 \\ \end{align*}\]

If we were to set \(x_\tau = 0\) for \(\tau \notin \{0,1,2\}\) and similarly \(y_\tau = 0\) for \(\tau \notin \{0,1,2,3,4,5\}\) then we could denote, \[ z_t = \sum_{\tau = -\infty}^\infty x_\tau y_{t-\tau} = (x \star y)_t. \] However it is also useful to consider the finite sums and denote, \[ z_t = \sum_{i + j = t} x_i y_j, \] where the sum is over \((i,j)\) pairs with \(i+j=t\), further requiring \(i \in \{0,1,2\}\) and \(j \in \{0,1,2,3,4,5\}\).

In this context we can also view \(z\) as a vector of length \(8\), \(x\) as a vector of length \(3\), and \(y\) as a vector of length \(6\). We can then create an \(8 \times 6\) matrix \(T(x)\) that encodes the values of \(x\) such that \(z = T(x) y\). More specifically, \[ \left[ \begin{matrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \\ \end{matrix} \right] = \underbrace{\left[ \begin{matrix} x_0 & 0 & 0 & 0 & 0 & 0 \\ x_1 & x_0 & 0 & 0 & 0 & 0 \\ x_2 & x_1 & x_0 & 0 & 0 & 0 \\ 0 & x_2 & x_1 & x_0 & 0 & 0 \\ 0 & 0 & x_2 & x_1 & x_0 & 0 \\ 0 & 0 & 0 & x_2 & x_1 & x_0 \\ 0 & 0 & 0 & 0 & x_2 & x_1 \\ 0 & 0 & 0 & 0 & 0 & x_2 \\ \end{matrix} \right]}_{T(x)} \left[ \begin{matrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{matrix} \right] \]

This shows that the convolution of \(y\) with \(x\) is a linear transformation, and the matrix of that linear transformation is \(T(x)\). Such a matrix, with constant entries along each diagonal, is called a **Toeplitz matrix**.

Convolutions are commutative operations. That is, \(x \star y = y \star x\). Hence we can also represent the output \(z\) as \(z = T(y) x\), where \(T(y)\) is an \(8 \times 3\) matrix in this case,

\[ \left[ \begin{matrix} z_0 \\ z_1 \\ z_2 \\ z_3 \\ z_4 \\ z_5 \\ z_6 \\ z_7 \\ \end{matrix} \right] = \underbrace{\left[ \begin{matrix} y_0 & 0 & 0 \\ y_1 & y_0 & 0 \\ y_2 & y_1 & y_0 \\ y_3 & y_2 & y_1 \\ y_4 & y_3 & y_2 \\ y_5 & y_4 & y_3 \\ 0 & y_5 & y_4 \\ 0 & 0 & y_5 \\ \end{matrix} \right]}_{T(y)} \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \end{matrix} \right] \]
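Both Toeplitz representations can be built and checked directly. A sketch, where `toeplitz_conv_matrix` is a hypothetical helper name:

```python
import numpy as np

# Build the (n_x + n_y - 1) x n_y matrix T(x):
# column j is a copy of x shifted down by j rows.
def toeplitz_conv_matrix(x, n_y):
    n_x = len(x)
    T = np.zeros((n_x + n_y - 1, n_y))
    for j in range(n_y):
        T[j:j + n_x, j] = x
    return T

x = np.array([1.0, 2.0, 3.0])                    # length 3
y = np.array([1.0, 0.0, 2.0, 1.0, 0.0, 4.0])     # length 6 (example values)

z1 = toeplitz_conv_matrix(x, len(y)) @ y         # z = T(x) y  (8x6 times 6)
z2 = toeplitz_conv_matrix(y, len(x)) @ x         # z = T(y) x  (8x3 times 3)
print(np.allclose(z1, z2), np.allclose(z1, np.convolve(x, y)))  # True True
```

The agreement of `z1` and `z2` is exactly the commutativity \(x \star y = y \star x\) in matrix form.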

In signal processing, filters are specified by means of LTI systems or convolutions and are often hand-crafted. One may specify filters for smoothing, edge detection, frequency reshaping, and similar operations. With neural networks, however, the idea is to learn the filters automatically.

As an example, consider a neural network operating on sound sequence data. Say the input \(x\) is of size \(d=10^6\), based on a million sound samples (at \(8{,}000\) samples per second this translates to \(125\) seconds, or just over \(2\) minutes of sound recording). Assume we wish to classify whether the sound recording is human conversation (positive) or other sounds (negative).

A possible deep neural network architecture may involve several layers and result in a network such as, \[ \hat{f}(x)= \text{sign}(2\sigma_{\text{sigmoid}}(W_3\sigma_{\text{relu}}(W_2\sigma_{\text{relu}}(W_1 x + b_1)+b_2)+b_3)-1). \]

Such a network passes the input \(x\) through a first layer with relu activation, a second layer with relu activation, and finally a third layer with sigmoid activation, eventually returning \(+1\) for a positive sample (say human conversation) and \(-1\) for a negative sample. Here \(W_3\) is a matrix with a single row and \(b_3\) is a bias of length \(1\).
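A minimal forward pass of such a network can be sketched as follows. The dimensions here are tiny stand-ins for \(10^6\), \(10^4\), and \(10^3\), and the weights are random and untrained, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1, d2 = 20, 8, 4                     # stand-ins for 10^6, 10^4, 10^3
W1, b1 = rng.standard_normal((d1, d)), np.zeros(d1)
W2, b2 = rng.standard_normal((d2, d1)), np.zeros(d2)
W3, b3 = rng.standard_normal((1, d2)), np.zeros(1)

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.standard_normal(d)               # a random "input recording"
f_hat = np.sign(2 * sigmoid(W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3) - 1)
print(f_hat)                             # +1 or -1
```

The expression mirrors the formula above term by term: two relu layers, a sigmoid output layer, and the \(\text{sign}(2\sigma - 1)\) mapping to \(\pm 1\).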

The best dimensions of the inner layers (and associated weight matrices and bias vectors) may depend on the nature of the problem and data at hand, however say we were to construct such a network and end up with the following:

- \(W_1\) has dimension \(10^4 \times 10^6\) and \(b_1\) has dimension \(10^4\). This means the first hidden layer has \(10,000\) neurons.
- \(W_2\) has dimension \(10^3 \times 10^4\) and \(b_2\) has dimension \(10^3\). This means the second hidden layer has \(1,000\) neurons.
- \(W_3\) has dimension \(1 \times 10^3\) and \(b_3\) has dimension \(1\). This means the output layer has a single neuron.

The total number of parameters of this network is then \[ 10^4\cdot 10^6 + 10^4 + 10^3\cdot10^4+ 10^3 + 1\cdot 10^3 + 1 \approx 10^{10} + 10^7 \approx 10~\text{billion}. \]
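The count can be verified directly; each layer contributes rows \(\times\) columns weights plus one bias per output neuron:

```python
# Weights plus biases for each of the three layers.
layer1 = 10**4 * 10**6 + 10**4   # W1 and b1
layer2 = 10**3 * 10**4 + 10**3   # W2 and b2
layer3 = 1 * 10**3 + 1           # W3 and b3

total = layer1 + layer2 + layer3
print(total)                     # 10_010_012_001, i.e. about 10 billion
```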

With today’s hardware one can train such neural networks; however, this is a huge number of parameters for the task at hand. In general, using dense matrices as parameters here is very wasteful and inefficient.

The alternative is to use convolutional layers, where \(W_1\) is replaced by a Toeplitz matrix, say \(T_1(h_1)\), and \(W_2\) is replaced by a Toeplitz matrix \(T_2(h_2)\). Here \(h_1\) and \(h_2\) represent filters, similar to the impulse response presented above. In this case \(T_1(h_1)\) can still have the same dimension as \(W_1\), and similarly \(T_2(h_2)\) can be set to have the same dimension as \(W_2\). However, the number of parameters is greatly reduced, since instead of \(W_1\) and \(W_2\) we only need the filters \(h_1\) and \(h_2\). Each of these filters can be of length \(10\), \(50\), or \(1{,}000\); still a much lower number of parameters.

The main principle is that of locality of information: sound samples at adjacent positions of the input are much more likely to affect each other than samples that are far apart. While slightly simplistic, this sound example demonstrates the principle.

Convolutional neural networks present an additional key idea: **multiple channels**. The idea is that in each layer, we don’t keep a single representation of the transformed input (voice in this case) but rather a collection of representations, each resulting from the output of a different filter.

For example, in the input we have \(x\). We then operate on \(x\) via the filters \(h_1^1\), \(h_1^2\), and \(h_1^3\). Here the subscript \(1\) indicates the first layer whereas the superscripts \(1\), \(2\), and \(3\) indicate the channels. Then in the first hidden layer we have, \[ z_1^1 = \text{Hidden layer 1 channel 1} = \sigma_{\text{relu}}\big(T(h_1^1)x + b_1^1 \big) \]

\[ z_1^2 = \text{Hidden layer 1 channel 2} = \sigma_{\text{relu}}\big(T(h_1^2)x + b_1^2 \big) \]

\[ z_1^3 = \text{Hidden layer 1 channel 3} = \sigma_{\text{relu}}\big(T(h_1^3)x + b_1^3 \big) \]

This allows us to use three different types of filters in the first layer, each passing through its own non-linearity. This intermediate representation (in layer 1) via the three channels \(z_1^1, z_1^2\), and \(z_1^3\) can then be further processed. When we do so, we apply convolution filters that operate on all channels jointly. That is, the channels of layer \(2\) (of which we may have \(3\) again, or some other number) are each functions of the three channels of layer \(1\). We can represent this via, \[ z_2^k = \text{Hidden layer 2 channel~}k = \sigma_{\text{relu}}\big(\tilde{T}(H_2^k) \left[ \begin{matrix} z_1^1\\ z_1^2\\z_1^3\\ \end{matrix} \right]+ b_2^k \big), \] where \(k\) is the index of the channel in hidden layer 2. In this expression note that we use the matrix \(\tilde{T}(H_2^k)\) to represent the linear operation based on the two dimensional filter \(H_2^k\). We describe such two dimensional convolutions below, and also generalize to three dimensions. Later, when we move to the common CNNs used for images, the filter will actually be a three dimensional (volume) filter.

With each layer \(i\) we then continue with \[ z_{i}^k = \text{Hidden layer}~i~\text{channel~}k = \sigma_{\text{relu}}\big(\tilde{T}(H_i^k) \left[ \begin{matrix} z_{i-1}^1\\ z_{i-1}^2 \\ \vdots\\z_{i-1}^{K_{i-1}}\\ \end{matrix} \right]+ b_i^k \big), \] for \(k \in \{1,\ldots,K_i\}\) where \(K_{i}\) denotes the number of channels in layer \(i\).
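The recursion above can be sketched in one dimension as follows. This is a minimal sketch with assumed filter lengths and random, untrained filters; `mode="same"` keeps the signal length fixed so that channels can be stacked layer after layer:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# One multi-channel convolutional layer: output channel k applies its own
# filter to every input channel and sums the results before the non-linearity.
def conv_layer(z_in, filters, biases):
    out = []
    for k in range(len(filters)):
        acc = sum(np.convolve(z_in[c], filters[k][c], mode="same")
                  for c in range(len(z_in)))
        out.append(relu(acc + biases[k]))
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(100)                       # a toy input signal
h1 = [[rng.standard_normal(5)] for _ in range(3)]  # layer 1: 3 channels, 1 input channel
z1 = conv_layer([x], h1, np.zeros(3))
h2 = [[rng.standard_normal(5) for _ in range(3)] for _ in range(2)]  # layer 2: 2 channels
z2 = conv_layer(z1, h2, np.zeros(2))
print(len(z1), len(z2), z2[0].shape)               # 3 2 (100,)
```

Each output channel of layer 2 depends on all three channels of layer 1, which is exactly the role of \(\tilde{T}(H_i^k)\) acting on the stacked vector of previous-layer channels.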

As motivated by the discussion above, even when we consider voice data, two dimensional convolutions arise due to channels. In the case of image data, two dimensional convolutions pop up naturally from the start, and when considering channels we even need volume (3 dimensional) convolutions. We now explain these concepts.

As an illustration, consider a monochrome image as an \(m \times n\) matrix \(x\), with \(x_{ij}\) the entry of the matrix for \(i=1,\ldots,m\) and \(j=1,\ldots,n\). Consider now a convolution kernel \(H\), taken as an \(\ell \times \ell\) matrix (with \(\ell\) typically a small integer such as \(3\), \(4\), etc.). The (two dimensional) convolution of the image \(x\) with the kernel \(H\) is another image matrix \(y\), where each \(y_{ij}\) is obtained via, \[ y_{ij} = \sum_{u=1}^\ell \sum_{s=1}^\ell H_{us}x_{a(i,u),b(j,s)}, \] where we can initially just think of \(a(i,u)\) and \(b(j,s)\) as \(i+u\) and \(j+s\); with certain variants we can modify these index functions. We also treat \(x_{ij}\) as \(0\) for \(i\) and \(j\) that are out of range.
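A direct implementation of this double sum, as a minimal sketch with no padding and stride \(1\), so that the kernel window simply starts at \((i,j)\):

```python
import numpy as np

# 2-D convolution of image x with an l x l kernel H: each output pixel is the
# elementwise product of H with the l x l window of x at (i, j), summed.
def conv2d(x, H):
    m, n = x.shape
    l = H.shape[0]
    y = np.zeros((m - l + 1, n - l + 1))   # no padding, stride 1
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(H * x[i:i + l, j:j + l])
    return y

x = np.arange(16.0).reshape(4, 4)          # a tiny example "image"
H = np.ones((3, 3)) / 9                    # 3x3 averaging kernel
print(conv2d(x, H))                        # each entry is a local 3x3 average
```

On this \(4 \times 4\) image with a \(3 \times 3\) kernel the output is \(2 \times 2\), matching the "cover a box and take the inner product" picture in the text.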

Hence the output of such a two dimensional convolution at index \(i\) and \(j\) “covers” a box in the input image \(x\) at the neighborhood of \((i,j)\) with the kernel \(H\). It then carries out the vectorized inner product of \(H\) and the covered area. See this illustrative animation of multiple such convolutions as used in a convolutional neural network:

The specific dimensions of the output image \(y\) and the specific way of determining the neighborhood of \((i,j)\) via the functions \(a(i,u)\) and \(b(j,s)\) can vary. There are two common parameters that play a role: **stride** and **padding**. The stride parameter, \(S\), indicates the jump size with which the convolution kernel slides over the image. The default is \(S=1\), meaning that the kernel moves one pixel at a time; however \(S=2\) or slightly higher is also common. The padding parameter, \(P\), indicates how much zero padding should be placed around the input image.

To see these in action, let us return to a one dimensional example with a very short audio signal \(x\) of length \(9\). Assume that \[ x = [1~~2~~3~~1~~2~~3~~1~~2~~3]^T. \] Assume we have a convolution kernel \(h = \frac{1}{3}[1~1~1]^T\); this is a moving average convolution. In the basic form, if we again think of \(x\) and \(h\) as coefficients of polynomials, then we have an \(8\)th degree polynomial multiplied by a \(2\)nd degree polynomial, so we get a polynomial of degree \(10\) (having \(11\) coefficients). Notice that the length of \(y = x \star h\) is \(11 = 9+3-1\). In this case we implicitly had the maximal padding possible, \(P=2\), which is one less than the size of the convolution kernel. We also implicitly had a stride of \(S=1\). We can think of this convolution as sliding the filter as follows, creating \(y(t)\) for \(t=1,\ldots,11\): \[ y(1) = \left\{ \begin{array}{ccccccccccc} \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ \times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = \frac{1}{3} \]

Note that the two zeros to the left and the two to the right of the sequence are the padding.

\[ y(2) = \left\{ \begin{array}{ccccccccccc} &\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = 1 \]

\[ y(3) = \left\{ \begin{array}{ccccccccccc} &&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = 2 \]

\[ y(4) = \left\{ \begin{array}{ccccccccccc} &&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = 2 \]

\[ \vdots \]

\[ y(9) = \left\{ \begin{array}{ccccccccccc} &&&&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&&&\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = 2 \]

Remember that a convolution actually needs to reverse the kernel. However, in this example our kernel is symmetric, so it is the same even when reversed. Further, in neural networks we typically don’t worry about reversing the kernel.

\[ y(10) = \left\{ \begin{array}{ccccccccccc} &&&&&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&&&&\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = \frac{5}{3} \]

\[ y(11) = \left\{ \begin{array}{ccccccccccc} &&&&&&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&&&&&\times & \times & \times \\ 0 & 0 & 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 & 0 & 0 \end{array} \right\} = 1 \] Now in case we were to avoid using any padding (\(P=0\)) the result \(y\) would have been of length \(9-3+1 = 8\), namely,

\[ y(1) = \left\{ \begin{array}{ccccccccc} \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ \times & \times & \times \\ 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \end{array} \right\} = 2 \]

\[ y(2) = \left\{ \begin{array}{ccccccccc} &\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &\times & \times & \times \\ 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \end{array} \right\} = 2 \]

\[ \vdots \]

\[ y(7) = \left\{ \begin{array}{ccccccccc} &&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&\times & \times & \times \\ 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3 \end{array} \right\} = 2 \] The above two cases were for a stride of \(S=1\), meaning that we slide the convolution kernel by \(1\) step each time. However, we sometimes wish to get a (much) smaller output signal (and/or run much faster). In this case we can slide the kernel with a bigger stride. For example, let us now consider \(S=2\), and also demonstrate \(P=1\) in this case.

\[ y(1) = \left\{ \begin{array}{ccccccccccc} \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ \times & \times & \times \\ 0& 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3&0 \end{array} \right\} = 1 \]

\[ y(2) = \left\{ \begin{array}{ccccccccccc} &&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&\times & \times & \times \\ 0& 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3&0 \end{array} \right\} = 2 \]

\[ y(3) = \left\{ \begin{array}{ccccccccccc} &&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&\times & \times & \times \\ 0& 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3&0 \end{array} \right\} = 2 \]

\[ y(4) = \left\{ \begin{array}{ccccccccccc} &&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&\times & \times & \times \\ 0& 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3&0 \end{array} \right\} = 2 \]

\[ y(5) = \left\{ \begin{array}{ccccccccccc} &&&&&&&&\frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ &&&&&&&&\times & \times & \times \\ 0& 1 & 2 & 3 & 1 & 2 & 3 & 1 & 2 & 3&0 \end{array} \right\} = \frac{5}{3} \] In this case the output is of length \(5\). In general the length of the output is, \[\begin{equation} \label{eq:convSize} \left\lfloor \frac{n_x + 2P - n_h}{S} \right\rfloor+1, \end{equation}\] where \(n_x\) is the length of the input signal, \(n_h\) is the length of the filter, and the floor handles strides that do not divide evenly. For the first case above (\(P=2\) and \(S=1\)) we have: \[ \frac{9+2\cdot 2 - 3}{1}+1 = 11. \] For the second case, \[ \frac{9+2\cdot 0 - 3}{1}+1 = 7. \] And for the third case, \[ \frac{9+2\cdot 1 - 3}{2}+1 = 5. \]
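These lengths can be checked in code. A sketch; note that NumPy’s `full` and `valid` modes correspond to the \(P=2\) and \(P=0\) cases with \(S=1\):

```python
import numpy as np

# Output length of a one dimensional convolution with padding P and stride S
# (integer division implements the floor).
def conv_output_length(n_x, n_h, P, S):
    return (n_x + 2 * P - n_h) // S + 1

print(conv_output_length(9, 3, 2, 1))  # 11 (P=2, S=1)
print(conv_output_length(9, 3, 0, 1))  # 7  (P=0, S=1)
print(conv_output_length(9, 3, 1, 2))  # 5  (P=1, S=2)

# NumPy's "full" and "valid" modes match the first two cases.
x = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
h = np.ones(3) / 3                     # the moving-average kernel from the text
print(len(np.convolve(x, h, mode="full")), len(np.convolve(x, h, mode="valid")))  # 11 7
```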

Now in the case of a two dimensional convolution, this formula still holds; however, \(n_x\) is replaced by either the horizontal or the vertical dimension of the image, and similarly for \(n_h\) in cases where the filter is not square.

In addition to stride and padding there is a third element which is sometimes introduced: **dilation**. The idea of dilation is to “thin out” the convolution by inserting holes within the kernel. The dilation parameter is \(D=1\) by default, in which case there is no dilation.

So far we have spoken about convolutions but, with the exception of the averaging kernel used above, haven’t shown useful kernels. This video presents a variety of examples:

The point is that we can “hand craft” kernels to carry out a variety of tasks such as smoothing, sharpening, edge detection, and more. However, with CNNs the network automatically learns the best kernels for the task at hand.

- Convolutions on volumes
- Convolutions on all channels

- Usage of multiple convolutions layers with multiple channels
- One or more dense layers towards the output

- MNIST classification

We now survey four famous architectures that were both research landmarks and are still used today. A complete lecture about these architectures is here:

Our purpose is simply to highlight the key features.

Page built: 2021-01-17