Artificial neural networks form a class of functions used for tasks such as image classification. This project considers only the supervised learning scenario, wherein a neural network is trained on a set $\mathbf{X}_{\text{train}} = \{(\mathbf{x},y) \mid \mathbf{x} \in \mathbb{R}^k,\ y \in \mathbb{R}\}$ of labelled training data and we test the neural network's classification accuracy using a distinct test dataset $\mathbf{X}_{\text{test}}$ of the same form as $\mathbf{X}_{\text{train}}$, but containing fewer entries.

The process of training produces a neural network, which is a function $\hat{y} = f_{\Theta}(\mathbf{x})$, parameterized by a set $\Theta = \{(\mathbf{\omega}, b) \mid \mathbf{\omega} \in \mathbb{R}^k,\ b \in \mathbb{R} \}$ of weights and biases. The trained neural network $f_{\Theta}$ maps $\mathbf{x} \in \mathbb{R}^k$ to a prediction $\hat{y}$, and we can test its performance by evaluating the differences between $\hat{y}$ and the $y \in \mathbf{X}_{\text{test}}$.

Before we get to specific details about neural networks, it is instructive to discuss one of the oldest statistical learning algorithms, \textit{linear regression}. Like a neural network, a linear regression model is a function which maps an $\mathbf{x} \in \mathbb{R}^k$ to a prediction $\hat{y}$. The model is parameterized by a set of learnable weights $\mathbf{w} \in \mathbb{R}^k$ and an optional bias term $b \in \mathbb{R}$. Once we have trained the model, usually by minimizing the mean squared error, we can use it to predict $y$ given any $\mathbf{x} \in \mathbb{R}^k$. The formula for linear regression is as follows:

\begin{align} \hat{y} = \sum_{i=1}^k x_iw_i + b. \end{align}
The weight vector $\mathbf{w}$ of a linear regression model can be thought of as an expression for how important each entry of $\mathbf{x}$ is for a specific output $y$. In a sense, when we train a classifier it is like we are trying to figure out a recipe (i.e. the correct proportions of ingredients, given a list of ingredients and examples of the desired final product) through trial and error.

At the end of training, the best set of proportions is stored in $\mathbf{w}$, which tells us in what proportion to combine ingredients so that we obtain a reasonably good approximation of the desired result.
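As a concrete illustration, the prediction formula above can be sketched in a few lines of Python (assuming NumPy is available); the weights and bias here are made-up values standing in for parameters that training would produce:

```python
import numpy as np

# Hypothetical learned parameters for a 3-feature model (not fit to real data).
w = np.array([0.5, -1.0, 2.0])  # weight vector: one weight per input entry
b = 0.25                        # optional bias term

def predict(x):
    """Linear regression prediction: the weighted sum of the inputs plus a bias."""
    return np.dot(x, w) + b

x = np.array([1.0, 2.0, 3.0])
print(predict(x))  # 0.5*1 + (-1)*2 + 2*3 + 0.25 = 4.75
```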

In the following, we will see how linear regression relates to neural networks, which I will define now. An \textit{artificial neural network} is a function defined by a directed graph (the computations flow in a specified direction), whose connectivity implements function composition. A weighted summation occurs at each node of the graph, and each such summation is wrapped in an activation function. The weighted summations of inputs inside each node are not unlike a linear regression. So in a sense, an artificial neural network describes a way of composing linear regression models.
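To make this "composition of linear regressions" idea concrete, here is a minimal sketch of a two-layer forward pass with made-up weights; ReLU is used as the activation function purely for illustration, since the text does not fix a particular choice:

```python
import numpy as np

def relu(z):
    # Activation function: keeps positive values, zeros out negatives.
    return np.maximum(0.0, z)

def layer(x, W, b):
    # Each node computes a weighted sum of its inputs plus a bias,
    # wrapped in an activation function -- like many linear regressions.
    return relu(W @ x + b)

# Hypothetical weights for a small network mapping R^3 -> R^2 -> R^1.
W1, b1 = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]), np.array([0.0, -0.25])
W2, b2 = np.array([[1.0, 2.0]]), np.array([0.1])

def network(x):
    # Function composition: the output of one layer is the input of the next.
    return W2 @ layer(x, W1, b1) + b2

print(network(np.array([1.0, 1.0, 1.0])))
```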

Digit classification using the MNIST dataset of handwritten digits is one of oldest image classification benchmarks. The goal of the task is to correctly classify an image of a number. Since handwritten digits are variable, any classifier will need to see many examples of each digit in order to learn what features are essential to each digit class. The more variable the concept, the more data we usually need in order to learn its defining properties.

When using a classification algorithm other than a neural network, practitioners usually have to hand-engineer kernels or ``feature maps''. This requires domain experts, which can be costly and time consuming. Deep neural networks, on the other hand, learn feature maps from the data during training. This has the added advantage of introducing less of the bias that hand-engineered feature maps may carry.

Intuitively, each kernel represents a type of feature, and training fits weights so that the data can be described in terms of objects built from those features.

Neural network methods are currently the leaders in digit classification, but support vector machines (SVMs) also perform well. According to Wikipedia, the current state of the art in digit classification accuracy is 99.79\% with a convolutional neural network; for reference, the best SVM method achieves an accuracy of 99.44\%. The MNIST dataset contains 60,000 training examples (roughly 6,000 for each digit from 0 to 9) and 10,000 test examples (roughly 1,000 for each digit). The reason for having separate training and testing datasets will be made clear later on.

Some traditional methods for classification start by flattening the $28 \times 28$ pixel images into 784-dimensional vectors, since the method may require the input to be a vector; this enables a weighted sum to be computed. The problem with this is that we lose contextual information: we no longer know the coordinates of each pixel, and we know little about the neighboring pixel values. Contextual information is important for identifying shapes, which gives a bit more meaning to pixel values; e.g. a dark pixel adjacent to light pixels has a different meaning to us than a dark pixel surrounded by other dark pixels. The way we retain contextual information about the pixel values is by using a convolutional kernel, which operates on the original (unflattened) $28 \times 28$ pixel image.

Convolutions are at the heart of many image transformations; for instance, the GNU Image Manipulation Program has a generic convolution filter where the user can enter desired values for a convolutional kernel to be applied to the image in order to achieve edge detection, smoothing, embossing, etc.

In our context, the convolution operation is a way of adding neighboring pixel values together (the number of neighbors determined by the dimensions of the kernel), weighted by values stored in the convolution kernel. For example, given a matrix $M$ of pixel values and a kernel $K$ of weights, we can depict the discrete convolution operation as follows.

\begin{align} M & = \begin{bmatrix} m_{0,0} & m_{0,1} & \dots & m_{0,n} \\ m_{1,0} & m_{1,1} & \dots & m_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{m,0} & m_{m,1} & \dots & m_{m,n} \end{bmatrix} \end{align} \begin{align} K & = \begin{bmatrix} k_{0,0} & k_{0,1} \\ k_{1,0} & k_{1,1} \end{bmatrix} \end{align} \begin{align} \left( M * K \right) [0,0] & = m_{0,0} \times k_{0,0} + m_{0,1} \times k_{0,1} + m_{1,0} \times k_{1,0} + m_{1,1} \times k_{1,1}. \label{eq:convolution_3} \end{align}
In a convolutional neural network, each neuron computes the activation function of a convolution of a small patch of pixels with weights stored in a kernel. In (\ref{eq:convolution_3}), we convolved the matrix $M$ with the kernel $K$ at position $(0,0)$ of $M$.
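A minimal sketch of this operation in Python, assuming NumPy is available (note that, like most deep learning libraries, this uses the cross-correlation convention, i.e. the kernel is not flipped):

```python
import numpy as np

def conv2d_valid(M, K):
    """'Valid' discrete convolution: slide K over M and take weighted sums,
    producing an output smaller than M (no padding)."""
    mh, mw = M.shape
    kh, kw = K.shape
    out = np.zeros((mh - kh + 1, mw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Weighted sum of the patch of M under the kernel at (i, j).
            out[i, j] = np.sum(M[i:i + kh, j:j + kw] * K)
    return out

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]])
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])
# Entry (0, 0) is m00*k00 + m01*k01 + m10*k10 + m11*k11 = 1 - 5 = -4.
print(conv2d_valid(M, K))
```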

One by one, each neuron of a given layer computes a weighted sum (more precisely, a convolution) for each coordinate $(x,y)$ of $M$. All neurons in the same feature map use the same kernel, which means they share the same weights. Weight sharing makes the neural network's ability to detect features invariant under translation (i.e. since the kernel passes over the entire input space, it is able to identify features anywhere in the input space).

If we use padding, we can control the size of the output matrix; we can even obtain a matrix of the same dimensions as the original. If we don't use padding, we will produce a new matrix of smaller dimensions. In either case, the output matrix's entries are weighted sums of the entries of $M$.

Design choices with respect to hyperparameters include the dimensions of the kernel $K$, the size of the stride (the number of coordinates to skip over each time we convolve the kernel with the image), and whether or not to use padding. Kernel size, stride, and padding together determine the dimensions of the output of the convolutional layer, while the number of kernels determines the number of channels it returns.
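The effect of these hyperparameters on the output dimensions follows a standard formula, sketched below:

```python
def conv_output_size(n, k, stride=1, padding=0):
    # Standard formula for one spatial output dimension of a convolution:
    # floor((n + 2*padding - k) / stride) + 1
    return (n + 2 * padding - k) // stride + 1

# A 28x28 MNIST image with a 2x2 kernel, stride 1, no padding -> 27x27.
print(conv_output_size(28, 2))             # 27
# 'Same' padding with a 3x3 kernel keeps the 28x28 dimensions.
print(conv_output_size(28, 3, padding=1))  # 28
# A stride of 2 sub-samples the input, roughly halving each dimension.
print(conv_output_size(28, 2, stride=2))   # 14
```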

Each convolutional layer has its own set of kernel weights, and through training, some of these sets of weights become more sensitive to specific features. A layer and its kernel may become experts at identifying lines, curves, etc.

Pooling is another common operation performed in convolutional neural nets. It is similar to convolution in the sense that both are methods of sub-sampling. Max pooling in particular outputs the maximum value within each region of some specified size. Pooling further decreases the number of degrees of freedom, which helps prevent over-fitting. Many modern convolutional neural networks instead use the convolution stride to achieve a similar sub-sampling effect.
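A minimal sketch of $2 \times 2$ max pooling with stride 2, assuming NumPy:

```python
import numpy as np

def max_pool(M, size=2, stride=2):
    """Max pooling: keep only the maximum of each region, sub-sampling the
    input and reducing the number of degrees of freedom."""
    h = (M.shape[0] - size) // stride + 1
    w = (M.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = M[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

M = np.array([[1.0, 3.0, 2.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 2.0, 1.0, 0.0],
              [3.0, 4.0, 5.0, 6.0]])
# Maxima of the four 2x2 blocks: [[6, 8], [9, 6]].
print(max_pool(M))
```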

Because the output of layer $k$ is the input of layer $k+1$, it is like the lower layers learn simple, low level concepts and the higher layers learn higher order concepts, all of which are weighted sums of weighted sums of some input data.

The way I implemented classification is as follows. I first encoded the labels of the MNIST dataset using one-hot encoding. As a result, each image in the training and test sets is accompanied by a vector of length 10. Each of the 10 indices corresponds to a unique digit. If we let the zeroth entry correspond to the number zero and the first entry correspond to the number 1, we obtain the following encoding:

\begin{align} \large 0 = \begin{bmatrix} 1\\0\\0\\0\\0\\0\\0\\0\\0\\0\\ \end{bmatrix}, \quad 1 = \begin{bmatrix} 0\\1\\0\\0\\0\\0\\0\\0\\0\\0\\ \end{bmatrix}, \quad 2 = \begin{bmatrix} 0\\0\\1\\0\\0\\0\\0\\0\\0\\0\\ \end{bmatrix}, \quad \cdots \quad 9 = \begin{bmatrix} 0\\0\\0\\0\\0\\0\\0\\0\\0\\1\\ \end{bmatrix}. \end{align}The final layer has only ten neurons, one for each of the classes of concepts we wanted to learn; in our case they correspond with the digits 0 to 9. We can think of the final layer as any other layer and each neuron will respond to input from the layers before it. For example, the neuron corresponding to the number 9 will output a number close to 1 when the input to the neuron is of the right combination of concepts from the layers before it.
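This encoding is straightforward to implement; a small sketch, assuming NumPy:

```python
import numpy as np

def one_hot(label, num_classes=10):
    # Vector of length 10 with a 1 at the index of the digit, 0 elsewhere.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(0))  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(9))  # a 1 in the last entry, 0 elsewhere
```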

Once the neural net has returned its predicted label, we will want to compare it with the true label from our test dataset. We devise a notion of distance to measure how far away our prediction was from what it was supposed to be. We call this the loss function, and we want to minimize it through iterative learning. Each full pass over the training data is called an epoch, and in our case, since we are working with a mini-batch method, each epoch comprises many batch updates. Batch size is one of the hyperparameters.
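The text does not fix a particular loss function; cross-entropy is a common choice for one-hot classification targets, and a minimal sketch (assuming NumPy, with predictions already normalized to probabilities) looks like this:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss as a 'distance' between the one-hot true label and the network's
    predicted class probabilities (smaller is better); eps avoids log(0)."""
    return -np.sum(y_true * np.log(y_pred + eps))

y_true = np.array([0.0, 1.0, 0.0])  # one-hot: the true class is 1
good = np.array([0.1, 0.8, 0.1])    # mostly correct prediction -> small loss
bad = np.array([0.7, 0.2, 0.1])     # mostly wrong prediction -> large loss
print(cross_entropy(y_true, good) < cross_entropy(y_true, bad))  # True
```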

Once we have obtained a loss, we use the back-propagation algorithm to send information from the loss backward through the network in order to compute the gradient, which is the derivative of the loss with respect to the weights of the graph. Since we can think of a neural network as a composite of many functions, when we compute the gradient, we compute it for a composition of functions of multiple variables. We can use the chain rule from calculus to derive the gradients for the nodes in each layer, and once we have those, we can perform gradient descent, which pushes the weights down the error curve at a rate set by the learning rate, usually denoted by $\epsilon$.

The learning rate is a number, typically in $(0,1) \subset \mathbb{R}$.

\begin{align} \mathbf{\omega'} = \mathbf{\omega} - \epsilon \nabla_{\mathbf{\omega}}f(\mathbf{\omega}) \end{align}

If we choose $\epsilon$ too large, then gradient descent may not converge, and if we choose $\epsilon$ too small, then gradient descent may converge too slowly to be practical.
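A toy sketch of the update rule above on a one-dimensional quadratic (not the neural-network loss itself, just an illustration of the iteration converging to the minimum):

```python
def grad_descent(grad, w0, lr=0.1, steps=100):
    # Repeatedly apply the update rule w' = w - lr * grad(w).
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is w = 3.
w_star = grad_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_star)  # close to 3.0
```

With a learning rate this size the iterates contract toward the minimum; a much larger rate would make them overshoot and diverge, as the text warns.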

In a sense, designing a neural net involves increasing the number of degrees of freedom to maximize representational capacity. At the same time, we need to combat the tendency to overfit the training data through regularization, which effectively reduces the number of degrees of freedom. Convolution and pooling greatly decrease the number of degrees of freedom while retaining much of the information about the pixels.

Regularization techniques help the graph learn the correct probability distribution for the data by forcing it to forget less common observations. In other words, when the dataset contains many instances of a feature, then we know a lot about it, and we can be more confident in the parameters learned from observing those features. But if we have a feature that rarely occurs in the dataset, then regularization will prevent the weights learned from that feature from influencing the classifier too strongly. Dropout decreases the number of effective degrees of freedom by randomly zeroing out a subset of the neurons' outputs during training. Concepts formed from features commonly observed by the network will be resistant to dropout.
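A minimal sketch of dropout as a random mask over a layer's activations, assuming NumPy; this is the common "inverted dropout" variant, where scaling the survivors by $1/(1-p)$ keeps the expected activation unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p=0.5):
    """Randomly zero out each neuron's output with probability p during
    training; the surviving outputs are scaled up to compensate."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

a = np.ones(10)
print(dropout(a))  # roughly half the entries zeroed, survivors scaled to 2.0
```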

There are many optimization algorithms for training a neural net, including various forms of gradient descent, but the focus of this research project is on determining optimal values for the hyperparameters which define the graph's structure and control its behavior.

An ad hoc approach to optimizing hyperparameters is to perform multiple experiments and learn from trial and error. The practitioner may gain insight and intuition which may be helpful for the learning task at hand. The problem with this approach is that such insight is difficult to transfer to another practitioner.

Methodical approaches include grid search (which quickly becomes computationally infeasible if we want to optimize more than two or three hyperparameters) and random sampling from a predefined hyperparameter search space, as in \cite{bb_2012}. Some background knowledge is helpful when defining the search space to improve efficiency. The experiments I performed used \textit{hyperopt}, a hyperparameter optimization software package authored primarily by James Bergstra, which is the focus of \cite{bergstra_making_2012}.
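To illustrate the random-sampling idea of \cite{bb_2012}, here is a self-contained sketch using only the standard library; the objective function is a made-up stand-in for an actual training run, and the search space (learning rate and batch size) is merely an example. Hyperopt's \texttt{fmin} automates a smarter, model-based version of this loop.

```python
import random

random.seed(0)

def accuracy(params):
    # Stand-in for training a network and returning its test accuracy;
    # a real objective would build and train a model with these values.
    lr, batch = params["lr"], params["batch_size"]
    return 1.0 - abs(lr - 0.01) * 10 - abs(batch - 64) / 1000

def sample():
    # Draw one configuration from a predefined search space.
    return {"lr": random.uniform(0.0001, 0.1),
            "batch_size": random.choice([16, 32, 64, 128])}

# Random search: evaluate many sampled configurations, keep the best.
trials = [sample() for _ in range(50)]
best = max(trials, key=accuracy)
print(best)
```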