Hyperparameter Optimization and Experimentation for Convolutional Neural Nets

This research project was inspired by my own experience building a deep learning image classifier in the winter semester of 2017. My goal for this project was to study hyperparameter optimization of convolutional neural networks in the context of digit classification using the MNIST dataset. I will begin with an introduction to artificial neural networks and hyperparameter optimization; the final sections are dedicated to the experiments and their results.
\subsection{Artificial Neural Networks}
Artificial neural networks form a class of functions which we use in machine learning for tasks such as image classification and machine translation. This project considers only the supervised learning scenario, wherein a neural network is trained on a set $\mathbf{X}_{\text{train}} = \{(\mathbf{x},y) \mid \mathbf{x} \in \mathbb{R}^m,\ y \in \mathbb{R}\}$ of labelled training data and we test the neural network's classification accuracy using a distinct test dataset $\mathbf{X}_{\text{test}}$ of the same form as $\mathbf{X}_{\text{train}}$, but containing fewer entries. The process of training produces a neural network, which is a function $\hat{y} = f_{\Theta}(\mathbf{x})$, parameterized by a set $\Theta = \{(\mathbf{\omega}, b) \mid \mathbf{\omega} \in \mathbb{R}^m,\ b \in \mathbb{R} \}$ of weights and biases. The trained neural network $f_{\Theta}$ maps $\mathbf{x} \in \mathbb{R}^m$ to a prediction $\hat{y}$, and we can test its performance by evaluating the differences between $\hat{y}$ and the true labels $y$ in $\mathbf{X}_{\text{test}}$. Before we get to specific details about neural networks, it is instructive to discuss one of the oldest predictive modelling methods, \textit{linear regression}. Like a neural network, a linear regression model is a function which maps an $\mathbf{x} \in \mathbb{R}^m$ to a prediction $\hat{y}$. The model is parameterized by a set of learnable weights $\mathbf{w} \in \mathbb{R}^m$ and an optional bias term $b \in \mathbb{R}$. Once we have trained the model, usually by minimizing the mean squared error, we can use it to predict $y$ for any new $\mathbf{x} \in \mathbb{R}^m$. The formula for linear regression is as follows:
\begin{equation} \hat{y} = \sum_{k=1}^m x_kw_k + b. \label{fig:lin_reg} \end{equation}
The weight vector $\mathbf{w}$ of a linear regression model can be thought of as an expression of how important each entry of $\mathbf{x}$ is for a specific output $y$. In a sense, when we train a classifier it is as if we are trying to figure out a recipe (i.e. the correct proportions of ingredients, given a list of ingredients and examples of the desired final product) through trial and error. At the end of training, the best set of proportions is stored in $\mathbf{w}$, which tells us in what proportion to combine the ingredients so that we obtain a reasonably good approximation of the desired result, as in (\ref{fig:lin_reg}).
\begin{figure}[H] \centering
\begin{tikzpicture}
[ plain/.style={ draw=none, fill=none },
  net/.style={ matrix of nodes, nodes={ draw, circle, inner sep = 6pt }, nodes in empty cells, column sep = 1cm, row sep = -6pt },
  >=latex ]
\matrix[net] (mat)
{ |[plain]| & |[plain]| & |[plain]| \\
  $h_1$ & |[plain]| \\
  |[plain]| & $g_1$ \\
  $h_2$ & |[plain]| \\
  |[plain]| & |[plain]| \\
  $h_3$ & $g_2$ & $f$ \\
  |[plain]| & |[plain]| \\
  $h_4$ & |[plain]| \\
  |[plain]| & $g_3$ \\
  $h_5$ & |[plain]| \\
};
\foreach \ai [ count=\mi ] in {2,4,...,10}
  \draw[<-] (mat-\ai-1) -- node[above] {$x_\mi$} +(-2cm,0);
\foreach \ai in {2,4,...,10}
{ \foreach \aii in {3,6,9}
  \draw[->] (mat-\ai-1) -- (mat-\aii-2); }
\foreach \ai in {3,6,9}
  \draw[->] (mat-\ai-2) -- (mat-6-3);
\draw[->] (mat-6-3) -- node[above] {$y$} +(2cm,0);
\end{tikzpicture}
\caption{A Simple Deep Neural Network with a Single Hidden Layer} \label{fig:simple_deep_nn} \end{figure}
In the following, we will see how linear regression relates to neural networks, which I will define shortly.
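To make (\ref{fig:lin_reg}) concrete, here is a minimal sketch of a linear regression prediction (assuming NumPy; the input, weights, and bias are illustrative placeholders rather than values from this project):
\begin{minted}{python}
import numpy as np

# Illustrative data: m = 3 features; weights and bias as if already trained.
x = np.array([0.5, -1.2, 3.0])   # input vector, x in R^m
w = np.array([0.8,  0.1, -0.4])  # learned weights, w in R^m
b = 0.25                         # learned bias

# Prediction: y_hat = sum_k x_k * w_k + b, exactly as in the formula above.
y_hat = np.dot(x, w) + b
print(y_hat)  # 0.4 - 0.12 - 1.2 + 0.25 = -0.67
\end{minted}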
An \textit{artificial neural network} is a function defined by a directed graph (the computations flow in a specified direction), whose connectivity implements function composition. A weighted summation occurs at each node of the graph, and each of these summations is wrapped in an activation function. The weighted summations of inputs inside each node are not unlike a linear regression, so in a sense an artificial neural network describes a way of composing linear regression models. A \textit{deep neural network} is a neural network with at least one hidden layer, as depicted in Figure~\ref{fig:simple_deep_nn}.
\begin{figure}[H] \centering
\begin{tikzpicture}
[ neuron/.style={ draw, circle, inner sep=1pt, join = by -latex },
  activation/.style={ draw, inner sep=2pt, join = by -latex },
  start chain=3, node distance = 6mm ]
\begin{scope}[start chain=1]
\node[on chain=1] at (-2,2cm) (x1) {$x_1$};
\node[on chain=1,join=by -latex] (w1) {$w_1$};
\end{scope}
\begin{scope}[start chain=2]
\node[on chain=2] at (-2,1cm) (x2) {$x_2$};
\node[on chain=2,join=by -latex] (w2) {$w_2$};
\end{scope}
\node[on chain=3] at (-2, 0cm) (x3) {$\ \vdots\ $};
\node[on chain=3] (w3) {$\ \vdots\ \ $};
\node[on chain=3, neuron] at (1, 0) (sigma) {$\displaystyle \sum_{k=1}^m w_kx_k + b$};
\begin{scope}[start chain=4]
\node[on chain=4] at (-2,-1cm) (x4) {$x_4$};
\node[on chain=4,join=by -latex] (w4) {$w_4$};
\end{scope}
\begin{scope}[start chain=5]
\node[on chain=5] at (-2,-2cm) (xm) {$x_m$};
\node[on chain=5,join=by -latex] (wm) {$w_m$};
\end{scope}
\begin{scope}[start chain=6]
\node[on chain=6, join=by -latex] at (4,0cm) (activation) {};
\node[on chain=6, join=by -latex] () { Activation } ;
\node[on chain=6, join=by -latex] (activation) {$ \cdots $};
\end{scope}
\draw[-latex] (w1) -- (sigma);
\draw[-latex] (w2) -- (sigma);
\draw[-latex] (w3) -- (sigma);
\draw[-latex] (w4) -- (sigma);
\draw[-latex] (wm) -- (sigma);
\end{tikzpicture}
\caption{A Neuron} \label{fig:node_of_nn} \end{figure}
We can express the neural network in Figure~\ref{fig:simple_deep_nn} as a composition of functions. If, for simplicity, we take the activation functions to be the identity, so that each node computes only its weighted sum, then
\begin{align}
\label{eq:composition}
y & = f(g_1(h_1, h_2, \dots, h_5),\ g_2(h_1, h_2, \dots, h_5),\ g_3(h_1, h_2, \dots, h_5)) \\
& = f\left(\sum_{j=1}^5\omega_{1,j}h_{j} + c_1,\ \sum_{k=1}^5\omega_{2,k}h_{k} + c_2,\ \sum_{l=1}^5\omega_{3,l}h_{l} + c_3\right) \\
& = \sum_{i=1}^3\omega_i\left(\sum_{j=1}^5\omega_{i,j}h_{j} + c_i \right) + b \\
& = \sum_{i=1}^3\sum_{j=1}^5\omega_i\omega_{i,j}h_{j} + \sum_{i=1}^3\omega_ic_i + b
\end{align}
\subsection{Digit Classification}
Digit classification using the MNIST dataset of handwritten digits is one of the oldest image classification benchmarks. The goal of the task is to correctly classify an image of a handwritten digit. Since handwritten digits are variable, any classifier will need to see many examples of each digit in order to learn what features are essential to each digit class. The more variable the concept, the more data we usually need in order to learn its defining properties. When using a classification algorithm other than a neural network, people usually have to hand-engineer kernels or ``feature maps''. This requires domain experts, which can be costly and time-consuming. Deep neural networks, on the other hand, learn their kernels from the data when we train them. This has the added advantage of not introducing additional bias, as hand-engineered feature maps may do.
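Returning for a moment to the network of Figure~\ref{fig:simple_deep_nn}, the composition in (\ref{eq:composition}) can be computed layer by layer. The following minimal sketch (assuming NumPy, identity activations, and randomly generated placeholder weights) verifies that the layer-by-layer forward pass agrees with the fully expanded double sum:
\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameters for the network in the figure:
# five hidden values h_j, three middle nodes g_i, and one output node f.
h = rng.normal(size=5)          # h_1, ..., h_5
W = rng.normal(size=(3, 5))     # omega_{i,j}
c = rng.normal(size=3)          # c_1, c_2, c_3
w = rng.normal(size=3)          # omega_i
b = 0.0                         # output bias

# Layer by layer, following the composition step by step.
g = W @ h + c                   # g_i = sum_j omega_{i,j} h_j + c_i
y = w @ g + b                   # y   = sum_i omega_i g_i + b

# The fully expanded double sum gives the same value.
y_expanded = (w[:, None] * W * h[None, :]).sum() + (w * c).sum() + b
assert np.isclose(y, y_expanded)
\end{minted}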
% It's like a kernel represents a class and we need to learn how to describe the
% data using objects from these classes, and objects from these classes are fit to
% the data using weights.
Neural network methods are currently the leaders in digit classification, but support vector machines (SVMs) also perform well. According to Wikipedia, the current state-of-the-art in digit classification accuracy is 99.79\% with a convolutional neural network; for reference, the best SVM method achieves an accuracy of 99.44\%. The MNIST dataset contains 60,000 training examples (6,000 for each digit from 0 to 9) and 10,000 test examples (1,000 for each digit). The reason for having separate training and testing datasets will be made clear later on.
\begin{figure}[H] \centering \includegraphics[width=0.4\linewidth]{mnist} \caption{We need many examples to learn the general form of a number} \label{fig:mnist} \end{figure}
Some traditional methods for classification start by flattening the $28 \times 28$ pixel images into 784-dimensional vectors, since the method may require the input to be a vector; this enables a weighted sum to be computed. The problem with this is that we lose contextual information: we no longer know the coordinates of each pixel, and we do not know much about the neighboring pixel values. Contextual information is important for identifying shapes, which gives a bit more meaning to pixel values; e.g. a dark pixel adjacent to light pixels has a different meaning to us than a dark pixel surrounded by other dark pixels. The way we retain contextual information about the pixel values is by using a convolutional kernel, which operates on the original (unflattened) $28 \times 28$ pixel image.
\begin{figure}[H] \centering \includegraphics[width=0.2\linewidth]{mnist_single} \caption{A single digit} \label{fig:mnist_single} \end{figure}
% Convolutions are at the heart of many image transformations, for
% instance the GNU Image Manipulation Program has a generic convolution filter
% where the user can enter desired values for a convolutional kernel to be applied
% to the image in order to achieve edge detection, smoothing, embossing, etc...
In our context, the convolution operation is a way of adding neighboring pixel values together (how many of them is determined by the dimensions of the kernel), weighted by values stored in the convolution kernel. For example, given a matrix $M$ of pixel values and a kernel $K$ of weights, we can depict the discrete convolution operation as follows.
\begin{align} \label{eq:convolution_1} M & = \begin{bmatrix} m_{0,0} & m_{0,1} & \dots & m_{0,n} \\ m_{1,0} & m_{1,1} & \dots & m_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ m_{m,0} & m_{m,1} & \dots & m_{m,n} \end{bmatrix} \end{align}
\begin{align} \label{eq:convolution_2} K & = \begin{bmatrix} k_{0,0} & k_{0,1} \\ k_{1,0} & k_{1,1} \end{bmatrix} \end{align}
\begin{align} \label{eq:convolution_3} \left( M * K \right) [0,0] & = m_{0,0} \times k_{0,0} + m_{0,1} \times k_{0,1} + m_{1,0} \times k_{1,0} + m_{1,1} \times k_{1,1}. \end{align}
In a convolutional neural network, each neuron computes the activation function of a convolution of a small patch of pixels with weights stored in a kernel. In (\ref{eq:convolution_3}), we convolved the matrix $M$ with the kernel $K$ at position $(0,0)$ of $M$. One by one, each neuron of a given layer computes a weighted sum (more precisely, a convolution) for each coordinate $(x,y)$ of $M$.
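To make the arithmetic of (\ref{eq:convolution_3}) concrete before continuing, here is a minimal sketch (assuming NumPy; the matrix and kernel values are illustrative) of a stride-1, unpadded convolution of a small $M$ with a $2 \times 2$ kernel $K$:
\begin{minted}{python}
import numpy as np

# Illustrative 3x4 "image" M and 2x2 kernel K of weights.
M = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 8, 7, 6]])
K = np.array([[1, 0],
              [0, 1]])

# Slide K over M with stride 1 and no padding; each output entry is a
# weighted sum of the 2x2 patch of M under the kernel.
out_h = M.shape[0] - K.shape[0] + 1
out_w = M.shape[1] - K.shape[1] + 1
out = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        out[i, j] = np.sum(M[i:i + 2, j:j + 2] * K)

# (M * K)[0,0] = m00*k00 + m01*k01 + m10*k10 + m11*k11 = 1*1 + 2*0 + 5*0 + 6*1 = 7
print(out[0, 0])
\end{minted}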
All neurons in the same layer use the same kernel, which means the neurons of each layer share the same weights. Weight sharing makes the neural network's ability to detect features invariant under translation (i.e. since the kernel passes over the entire input space, it is able to identify features anywhere in the input space). If we use padding, we can control the size of the output matrix; we can even obtain a matrix of the same dimensions as the original, as illustrated in Figure~\ref{fig:conv_padding}. If we do not use padding, we produce a new matrix of smaller dimensions. In either case, the output matrix's entries are weighted sums of the entries of $M$. Design choices with respect to hyperparameters include the dimensions of the kernel $K$, the size of the stride (the number of coordinates to skip over each time we convolve the kernel with the image), and whether or not to use padding. Kernel size and stride also affect the dimensions of, and the number of channels returned by, the convolutional layer.
\begin{figure}[H] \centering
\begin{tikzpicture}
\node [fill=green!25] at (-0.75,+0.75) {$a$};
\node [fill=green!25!yellow!25] at (-0.25,+0.75) {$b$};
\node [fill=green!25!blue!25] at (-0.75,+0.25) {$e$};
\node [fill=green!75!yellow!75!blue!75!red!75] at (-0.25,+0.25) {$f$};
\node [fill=green!25!yellow!25] at (+0.25,+0.75) {$c$};
\node [fill=yellow!25] at (+0.75,+0.75) {$d$};
\node [fill=green!75!yellow!75!blue!75!red!75] at (+0.25,+0.25) {$g$};
\node [fill=red!25!yellow!25] at (+0.75,+0.25) {$h$};
\node [fill=green!25!blue!25] at (-0.75,-0.25) {$i$};
\node [fill=green!75!yellow!75!blue!75!red!75] at (-0.25,-0.25) {$j$};
\node [fill=blue!25] at (-0.75,-0.75) {$m$};
\node [fill=red!25!blue!25] at (-0.25,-0.75) {$n$};
\node [fill=green!75!yellow!75!blue!75!red!75] at (+0.25,-0.25) {$k$};
\node [fill=red!25!yellow!25] at (+0.75,-0.25) {$l$};
\node [fill=red!25!blue!25] at (+0.25,-0.75) {$o$};
\node [fill=red!25] at (+0.75,-0.75) {$p$};
\draw [->] (+1.25,0) -- node[above] {3x3} (+3.5,0);
\draw [->] (+1.25,0) -- node[below] {Kernel} (+3.5,0);
\node at (+4,+0.25) [fill=green!25] {$a'$};
\node at (+4.5,+0.25) [fill=yellow!25] {$b'$};
\node at (+4,-0.25) [fill=blue!25] {$c'$};
\node at (+4.5,-0.25) [fill=red!25] {$d'$};
\end{tikzpicture} \\ \bigskip
\begin{tikzpicture}
% zeros left
\node at (-1.25,+1.25) [fill=gray!10] {0};
\node at (-1.25,+0.75) [fill=gray!10] {0};
\node at (-1.25,+0.25) [fill=gray!10] {0};
\node at (-1.25,-0.25) [fill=gray!10] {0};
\node at (-1.25,-0.75) [fill=gray!10] {0};
\node at (-1.25,-1.25) [fill=gray!10] {0};
% zeros right
\node at (+1.25,+1.25) [fill=gray!10] {0};
\node at (+1.25,+0.75) [fill=gray!10] {0};
\node at (+1.25,+0.25) [fill=gray!10] {0};
\node at (+1.25,-0.25) [fill=gray!10] {0};
\node at (+1.25,-0.75) [fill=gray!10] {0};
\node at (+1.25,-1.25) [fill=gray!10] {0};
% zeros bottom
\node at (-0.75,-1.25) [fill=gray!10] {0};
\node at (-0.25,-1.25) [fill=gray!10] {0};
\node at (+0.25,-1.25) [fill=gray!10] {0};
\node at (+0.75,-1.25) [fill=gray!10] {0};
% zeros top
\node at (-0.75,+1.25) [fill=gray!10] {0};
\node at (-0.25,+1.25) [fill=gray!10] {0};
\node at (+0.25,+1.25) [fill=gray!10] {0};
\node at (+0.75,+1.25) [fill=gray!10] {0};
\draw [->] (+1.6,0) -- node[above] {3x3} (+3.5,0);
\draw [->] (+1.6,0) -- node[below] {Kernel} (+3.5,0);
\node at (-0.75,+0.75) {$a$};
\node at (-0.25,+0.75) {$b$};
\node at (-0.75,+0.25) {$e$};
\node at (-0.25,+0.25) {$f$};
\node at (+0.25,+0.75) {$c$};
\node at (+0.75,+0.75) {$d$};
\node at (+0.25,+0.25) {$g$};
\node at (+0.75,+0.25) {$h$};
\node at (-0.75,-0.25) {$i$};
\node at (-0.25,-0.25) {$j$};
\node at (-0.75,-0.75) {$m$};
\node at (-0.25,-0.75) {$n$};
\node at (+0.25,-0.25) {$k$};
\node at (+0.75,-0.25) {$l$};
\node at (+0.25,-0.75) {$o$};
\node at (+0.75,-0.75) {$p$};
\begin{scope}[xshift=4.6cm]
\node at (-0.75,+0.75) {$a'$};
\node at (-0.25,+0.75) {$b'$};
\node at (-0.75,+0.25) {$e'$};
\node at (-0.25,+0.25) {$f'$};
\node at (+0.25,+0.75) {$c'$};
\node at (+0.75,+0.75) {$d'$};
\node at (+0.25,+0.25) {$g'$};
\node at (+0.75,+0.25) {$h'$};
\node at (-0.75,-0.25) {$i'$};
\node at (-0.25,-0.25) {$j'$};
\node at (-0.75,-0.75) {$m'$};
\node at (-0.25,-0.75) {$n'$};
\node at (+0.25,-0.25) {$k'$};
\node at (+0.75,-0.25) {$l'$};
\node at (+0.25,-0.75) {$o'$};
\node at (+0.75,-0.75) {$p'$};
\end{scope}
\end{tikzpicture}
\caption{Convolution (with stride 1) with and without padding of size 1} \label{fig:conv_padding} \end{figure}
Each convolutional layer has its own set of weights, and through training some of these sets of weights become more sensitive to specific features: a layer and its kernel may become expert at identifying lines, curves, etc. Pooling is another common operation performed in convolutional neural nets. It is similar to convolution in the sense that both are methods of sub-sampling. Max pooling in particular picks the maximum value out of a region of some specified size, as seen in Figure~\ref{fig:maxpooling}. Pooling further decreases the number of degrees of freedom, which helps prevent over-fitting. Most modern convolutional neural networks make use of the convolution stride to achieve a similar sub-sampling effect.
\begin{figure}[H] \centering
\begin{tikzpicture}
\node [fill=green!25] at (-0.75,+0.75) {1};
\node [fill=green!25] at (-0.25,+0.75) {2};
\node [fill=green!25] at (-0.75,+0.25) {3};
\node [fill=green!25] at (-0.25,+0.25) {4};
\node [fill=yellow!25] at (+0.25,+0.75) {5};
\node [fill=yellow!25] at (+0.75,+0.75) {6};
\node [fill=yellow!25] at (+0.25,+0.25) {7};
\node [fill=yellow!25] at (+0.75,+0.25) {8};
\node [fill=blue!25] at (-0.75,-0.25) {9};
\node [fill=blue!25] at (-0.25,-0.25) {9};
\node [fill=blue!25] at (-0.75,-0.75) {0};
\node [fill=blue!25] at (-0.25,-0.75) {9};
\node [fill=red!25] at (+0.25,-0.25) {3};
\node [fill=red!25] at (+0.75,-0.25) {4};
\node [fill=red!25] at (+0.25,-0.75) {5};
\node [fill=red!25] at (+0.75,-0.75) {6};
\draw [->] (+1.25,0) -- node[above] {2x2} (+3.5,0);
\draw [->] (+1.25,0) -- node[below] {Max Pooling} (+3.5,0);
\node at (+4,+0.25) [fill=green!25] {4};
\node at (+4.5,+0.25) [fill=yellow!25] {8};
\node at (+4,-0.25) [fill=blue!25] {9};
\node at (+4.5,-0.25) [fill=red!25] {6};
\end{tikzpicture} \\ \bigskip
\caption{Max Pooling} \label{fig:maxpooling} \end{figure}
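The pooling operation in Figure~\ref{fig:maxpooling} is easy to express directly. The sketch below (assuming NumPy; the input values match the figure) computes $2 \times 2$ max pooling with stride 2:
\begin{minted}{python}
import numpy as np

# The 4x4 input from the max pooling figure.
A = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 9, 3, 4],
              [0, 9, 5, 6]])

# 2x2 max pooling with stride 2: take the maximum of each 2x2 block.
# Reshape to (block_row, row_in_block, block_col, col_in_block) and
# reduce over the within-block axes.
pooled = A.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[4 8]
#  [9 6]]
\end{minted}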
Because the output of layer $k$ is the input of layer $k+1$, it is as if the lower layers learn simple, low-level concepts and the higher layers learn higher-order concepts, all of which are weighted sums of weighted sums of some input data. The way I implemented classification is as follows. I first encoded the labels of the MNIST dataset using a one-hot encoding. The result is that each image in the training and test sets is accompanied by a vector of length 10, where each of the 10 indices corresponds to a unique digit. If we let the zeroth entry correspond with the number zero and the first entry correspond with the number 1, we obtain the following encoding:
\begin{equation} \tiny{ \label{eq:onehot} 0 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad 1 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad 2 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad \cdots \quad 9 = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}}. \end{equation}
The final layer has only ten neurons, one for each of the classes we want to learn; in our case they correspond with the digits 0 to 9. We can think of the final layer as any other layer: each neuron responds to input from the layers before it. For example, the neuron corresponding to the number 9 will output a number close to 1 when the input to the neuron is the right combination of concepts from the layers before it. Once the neural net has returned its predicted label, we want to compare it with the true label from our test dataset. We devise a notion of distance to measure how far away our prediction was from what it was supposed to be. We call this the loss function, and we want to minimize it through iterative learning. Each iteration is called an epoch and, in our case, since we are working with a mini-batch method, each epoch comprises many batch operations. Batch size is one of the hyperparameters. Once we have obtained a loss, we use the back-propagation algorithm to send information from the loss backward through the network in order to compute the gradient, which is the derivative of the loss with respect to the weights of the graph. Since we can think of a neural network as a composite of many functions, when we compute the gradient, we compute it for a composition of functions of multiple variables. We can use the chain rule from calculus to derive the gradients for the nodes in each layer, and once we have those, we can perform \textit{gradient descent}, which pushes the weights down the error curve at a velocity set by a \textit{learning rate}, usually denoted by $\epsilon$; see (\ref{eq:grad_update}) below. The learning rate is a number, typically in $(0,1) \subset \mathbb{R}$.
\begin{equation} \label{eq:grad_update} \mathbf{\omega'} = \mathbf{\omega} - \epsilon \nabla_{\mathbf{\omega}}f(\mathbf{\omega}) \end{equation}
If we choose $\epsilon$ too large, then gradient descent may not converge, and if we choose $\epsilon$ too small, then gradient descent may converge too slowly to be practical.
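As a small illustration of the update rule in (\ref{eq:grad_update}), the sketch below (assuming NumPy; the quadratic loss and its gradient are illustrative stand-ins for a real network's loss and back-propagated gradient) performs repeated gradient descent steps:
\begin{minted}{python}
import numpy as np

# A stand-in loss with a known minimum at w = [1, -2], and its gradient.
def loss(w):
    return np.sum((w - np.array([1.0, -2.0])) ** 2)

def grad(w):
    return 2.0 * (w - np.array([1.0, -2.0]))

epsilon = 0.1                  # learning rate
w = np.array([5.0, 5.0])       # initial weights

for step in range(100):
    w = w - epsilon * grad(w)  # w' = w - epsilon * grad_w f(w)

print(w, loss(w))              # w approaches [1, -2]; the loss approaches 0
\end{minted}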
\subsubsection{Regularization}
In a sense, training a neural net involves increasing the number of degrees of freedom to maximize representation capacity. At the same time, we need to combat the tendency to overfit the training data through regularization, which effectively reduces the number of degrees of freedom. Convolution and pooling greatly decrease the number of degrees of freedom while retaining some of the information about the pixels. Regularization techniques help the graph learn the correct probability distribution for the data by forcing it to forget less common observations. In other words, when the dataset contains many instances of a feature, then we know a lot about it, and we can be more confident in the parameters learned from observing those features. But if we have a feature that rarely occurs in the dataset, then regularization will prevent the weights learned from that feature from influencing the classifier too strongly. Dropout decreases the number of degrees of freedom by simply forgetting a subset of the neurons' outputs. Concepts formed from features commonly observed by the network will be resistant to dropout.
\subsection{Hyperparameter Optimization}
There are many optimization algorithms for training a neural net, including various forms of gradient descent, but the focus of this research project is on determining optimal values for the hyperparameters which define the graph's structure and control its behavior. For the purposes of this research project, I have limited my scope to the hyperparameters found in Figure~\ref{fig:hyperparams}.
\begin{figure}[H] \centering \scriptsize
\begin{tabular}{l|l}
Symbol & Description \\ \hline \\
$K_i \in \mathbb{N}$ & Dimension of the convolution kernel for each convolutional layer. \\
$S_i \in \mathbb{N}$ & Stride for the convolution kernel for each convolutional layer. \\
$BS \in \mathbb{N}$ & Training batch size. \\
$E \in \mathbb{N}$ & Number of training epochs. \\
$DP_i \in [0,1] \subset \mathbb{R}$ & The dropout probability for each dropout layer. \\
$PS \in \mathbb{N}$ & The dimensions of pooling for the pooling layer. \\
$BI \in \text{Keras Initializers}$ & The method for initializing the biases of each layer. \\
$WI \in \text{Keras Initializers}$ & The method for initializing the weights of each layer. \\
$TP \in [0,1] \subset \mathbb{R}$ & The penalty for training time. \\
$LR \in (0,1) \subset \mathbb{R}$ & The learning rate. \\
\end{tabular}
\caption{Hyperparameters} \label{fig:hyperparams} \end{figure}
An ad hoc approach to optimizing hyperparameters is to perform multiple experiments and learn from trial and error. The practitioner may gain insight and intuition which may be helpful for the learning task at hand. The problem with this approach is the difficulty of transferring that insight to another practitioner. Methodical approaches include grid search (which quickly becomes computationally infeasible if we want to optimize more than 2 or 3 hyperparameters) and random sampling from a predefined hyperparameter search space, as in \cite{bb_2012}. Some background knowledge is helpful when defining the search space in order to improve efficiency. The experiments I performed used \textit{``hyperopt''}, a hyperparameter optimization software package authored primarily by James Bergstra and the focus of \cite{bergstra_making_2012}.
\subsubsection{Random Sampling}
Bergstra and Bengio explore random sampling from a hyperparameter search space, denoted $\Lambda$, as an alternative to grid search in \cite{bb_2012}. We start with a learning algorithm $\mathcal{A}_{\lambda^*}$ which maps a given dataset $\mathbf{X}^{\text{train}}$ to a function $f$, where $\lambda^* \in \Lambda$ and $\Lambda$ is the hyperparameter search space for $\mathcal{A}$. I have written $\lambda^{*}$ because that is what we hope our hyperparameter optimization algorithm will find, where
\begin{equation} \label{eq:argmin_of_expectation} \lambda^{*} = \text{argmin}_{\lambda \in \Lambda}\ \mathbb{E}_{x \sim \mathcal{G}_x}\left[ \mathcal{L}\left( x, \mathcal{A}_{\lambda}\left( \mathbf{X}^{\text{train}} \right) \right) \right]. \end{equation}
In words, we have a learning algorithm whose behaviour is partially determined by a set of hyperparameters.
We hope to find the optimal set of hyperparameters; by optimal, we mean the set of hyperparameters which minimizes the expected loss. Since we never know $\mathcal{G}_x$, the true distribution from which our data originate, we must accept this limitation and instead seek
\begin{equation} \label{eq:argmin_empirical} \lambda^{\dagger} \approx \text{argmin}_{\lambda \in \Lambda} {1 \over m} \sum_{k=1}^m \mathcal{L}\left( x_k, \mathcal{A}_{\lambda}\left( \mathbf{X}^{\text{train}} \right) \right); \quad x_k \in \mathbf{X}^{\text{test}}, \end{equation}
where $m$ is the number of examples in the test set. We need a way to find $\lambda^\dagger$. Bergstra and Bengio write about what they call the \textit{hyper-parameter response function} $\Psi$: the empirical loss obtained over the test dataset $\mathbf{X}^{\text{test}}$ by the learning algorithm $\mathcal{A}_\lambda$ trained on the training dataset $\mathbf{X}^{\text{train}}$ and hyperparameterized by $\lambda \in \Lambda$,
\begin{equation} \label{eq:hyper_parameter_loss_function} \Psi\left( \lambda \right) = {1 \over m} \sum_{k=1}^m \mathcal{L}\left( x_k, \mathcal{A}_{\lambda}\left( \mathbf{X}^{\text{train}} \right) \right) ; \quad x_k \in \mathbf{X}^{\text{test}}. \end{equation}
The point of defining a hyperparameter response function is to see how the empirical loss changes as we vary the hyperparameter configuration. When the dataset is fixed, the learning algorithm $\mathcal{A}$ and the hyperparameters $\lambda$ are still free to change: we can optimize the algorithm's parameters (the inner optimization) and we can optimize the hyperparameters (the outer optimization). The task of hyperparameter selection then becomes the task of minimizing the response function, i.e. we desire
\begin{equation} \label{eq:out_optimization_task} \lambda^\dagger = \text{argmin}_{\lambda \in \Lambda} \Psi(\lambda). \end{equation}
Searching through all of $\Lambda$ would be equivalent to performing a grid search. In reality, $\Psi$ and $\Lambda$ depend on the dataset, the goals, and the learning algorithm. This brings us to the main hypothesis of Bergstra and Bengio's paper: the response function is more sensitive to changes in some hyperparameters than others, so grid search wastes time exploring values of the less sensitive hyperparameters. The reason for this non-uniformly distributed sensitivity, as they claim, is that $\Psi$ has a low effective dimensionality. In the most basic sense, we can say that a function $f(x,y)$ has a low effective dimensionality if it can be approximated with sufficient accuracy by another function of lower dimension $g(x)$, i.e. $f(x, y) \approx g(x)$. One way to characterize the shape of a high-dimensional function is to look at how much it varies in each dimension. Bergstra and Bengio used Gaussian process regression to gather statistics on the response function and then measured its effective dimensionality. From the experimental data collected from the Gaussian process regression, they were able to measure the relevance of each hyperparameter to $\Psi$. Their conclusion was that only a small fraction of the hyperparameters were influential on $\Psi$ for a given dataset.
% bergstra and bengio also claim that grid search wastes too much time
% exploring the irrelevant dimensions and they claim that random search is more
% efficient than grid search in higher dimensions because it samples more
% effectively from the sensitive dimensions.
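To illustrate the outer optimization in code, the following minimal sketch (assuming NumPy; the search space, the synthetic response function, and the trial budget are illustrative and are not the ones used in my experiments) approximates $\lambda^\dagger$ by random search, i.e. by sampling configurations from $\Lambda$ and keeping the one with the smallest empirical loss:
\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(0)

def sample_configuration():
    """Draw one hyperparameter configuration (a lambda) from the search space."""
    return {
        'learning_rate': 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        'batch_size': int(rng.integers(20, 81)),
        'dropout': rng.uniform(0.0, 1.0),
    }

def response(config):
    """Stand-in for Psi(lambda): in practice this trains A_lambda on X_train and
    returns the empirical loss on X_test. Here a synthetic surface plays that
    role; only the learning rate matters much, mimicking low effective
    dimensionality."""
    return (np.log10(config['learning_rate']) + 2.0) ** 2 + 0.01 * config['dropout']

best_config, best_loss = None, np.inf
for trial in range(100):              # fixed trial budget
    config = sample_configuration()   # lambda sampled from Lambda
    psi = response(config)            # Psi(lambda)
    if psi < best_loss:
        best_config, best_loss = config, psi

print(best_loss, best_config)         # approximation of lambda-dagger
\end{minted}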
\section{Software}
Using any of a number of deep learning software libraries can greatly reduce the programming time needed to build a state-of-the-art classifier, but given the task at hand and the dataset, the programmer will find there are still many design decisions to be made with respect to the model's structure. Once a model has been decided upon, the values of the hyperparameters need to be optimized, and may need to be reoptimized if the model is to be reused for a different task with a different dataset. To quote Bergstra, ``It is sometimes difficult to know whether a given [model] is genuinely better, or simply better tuned''.
\subsection{Keras}
I chose to use Keras as my programming interface to the Tensorflow engine. It worked better on my machine than Tensorflow on its own.
\subsection{Tensorflow}
Tensorflow is an open source software library authored by members of the Google Brain Team. It is commonly used in industry and research, which contributed to my choice to work with it. There are other good libraries such as Caffe, Microsoft Cognitive Toolkit, MXNet, Theano, and Torch. Software libraries like Tensorflow are used to perform numerical computations on directed graphs. Within this framework, nodes represent mathematical operations and edges represent the multi-dimensional arrays of arguments passed to the nodes.
\subsection{Hyperopt}
For my experiments I used a hyperparameter optimization package called ``Hyperopt''. Hyperopt employs a novel \textit{Null Distribution Specification Language}, which is used to define the hyperparameter search space. The language is made up of an arbitrary number of dictionary entries, which may be arbitrarily nested. The user of this package must know a priori useful intervals and distributions for the hyperparameters, so while hyperopt does take some of the guesswork out of hyperparameter selection and optimization, some knowledge is still required. The Null Distribution Specification Language describes the distributions that would be used should we perform a simple random, unoptimized search of the configuration space. A \textit{null prior distribution} is an expression written in this specification language. Hyperopt replaces the nodes of the null prior distribution with Gaussian mixture models derived from the experimental history. The Gaussian mixture models' means and variances are refined after each experimental trial. The experimental history is stored by hyperopt in a database which tracks the hyperparameter configuration and the loss obtained at the end of each experiment. From this history it can learn which hyperparameters the learning algorithm is more sensitive to.
\subsection{Tree of Parzen Estimators}
Hyperopt provides the user with two algorithms for sampling from the null prior distribution: Random Search and Tree of Parzen Estimators (TPE). I used the latter algorithm for my experiments. TPE sequentially constructs tree-like models from the null prior distribution (which can have a nested, tree-like structure) to compute the expected performance of a set of hyperparameters based on historical measurements, and then chooses new hyperparameters to test based on this model. These tree-like models are made by replacing the entries in the null prior distribution with ratios of Gaussian Mixture Models (GMMs). A GMM is a probabilistic model based on the assumption that each datum is generated by a mixture of a finite number of Gaussian distributions with unknown parameters.
The Scikit Learn documentation likens GMMs to a more generalized $k$-means clustering algorithm (a method for partitioning $n$ data points into $k$ clusters, where each data point is assigned to the cluster whose mean is nearest to it).
% TPE is an iterative algorithm which fits one GMM $f(x)$ to the configuration of
% hyperparameters which previously minimized the loss function on the training
% set. Another GMM $g(x)$ is fit to the remaining hyperparameter values. TPE then
% chooses the hyperparameter value that maximizes $f(x)/g(x)$.
% Parzen windows classification is a technique for nonparametric density
% estimation, which can also be used for classification. Given a kernel, the
% technique approximates a given training set distribution via a linear
% combination of kernels.
% http://optunity.readthedocs.io/en/latest/user/solvers/TPE.html
% https://en.wikipedia.org/wiki/K-means_clustering
% http://scikit-learn.org/stable/modules/mixture.html
% https://compbio.soe.ucsc.edu/genex/genexTR2html/node11.html
% http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
\section{Experiments}
The MNIST dataset does not require a lot of computing power to train classifiers on. I took advantage of this and ran multiple experiments. I used Keras as an interface to Tensorflow to build a basic deep neural network. There is nothing novel about this architecture; using the default hyperparameter settings from Keras' GitHub page, the model achieves a test accuracy of 99.25\% after 12 epochs. The purpose of these experiments was not to achieve an extremely accurate classifier; rather, I was interested in exploring the surfaces of hyperparameter configurations over which we can train the model and achieve good results.
\subsection{Experiment 1}
In my first experiment, I trained my networks using Adadelta, an optimizer equipped with its own procedure for adapting the learning rate during training. I chose Adadelta because I wanted to optimize the learning rate (and related hyperparameters) after I had optimized all of the other hyperparameters found in Figure~\ref{fig:hyperparams}. As an alternative to using Adadelta, I could have chosen a fixed learning rate while optimizing the other hyperparameters; I realize this may have been a better choice. For my first experiment, I used an initial search space $\mathcal{H}$ defined by the following Python dictionary.
\begin{figure}[H] \centering
\begin{minted}{python}
from hyperopt import hp

H = {
    'keep_prob_1': hp.uniform('keep_prob_1', 0, 1),
    'keep_prob_2': hp.uniform('keep_prob_2', 0, 1),
    'batch_size': hp.choice('batch_size', list(range(20, 81))),
    'epochs': hp.choice('epochs', list(range(6, 21))),
    'time_penalty': hp.uniform('time_penalty', 0, 0.5),
    'pool_size': hp.choice('pool_size', [2, 3]),
    'kernel_size_1': hp.choice('kernel_size_1', [1, 2, 3, 4, 5]),
    'kernel_size_2': hp.choice('kernel_size_2', [1, 2, 3, 4, 5]),
    'strides_1': hp.choice('strides_1', [1, 2, 3]),
    'strides_2': hp.choice('strides_2', [1, 2, 3]),
    'bias_initializer': hp.choice('bias_initializer', [
        'zeros', 'ones', 'lecun_uniform', 'lecun_normal',
        'glorot_normal', 'he_normal', 'he_uniform', 'glorot_uniform'
    ]),
    'weight_initializer': hp.choice('weight_initializer', [
        'zeros', ..., 'glorot_uniform'
    ])}
\end{minted}
\caption{Initial Hyper-parameter Search Space} \label{fig:initial_search_space} \end{figure}
\subsubsection{Results}
Figure~\ref{fig:results_exp_1_good} presents statistics from the top 20 performing hyperparameter configurations, all of which achieved test accuracy between 99.00\% and 99.29\%.
\begin{figure}[H] \footnotesize \centering
\begin{tabular}{l l l l l l l l l l l l l}
 & E & K1 & K2 & PS & TP & S1 & S2 & BS & DP1 & DP2 & VA \\ \hline \\
Mean & 11.45 & 4.2 & 3.4 & 2.1 & 0.13 & 1.15 & 1.40 & 56.65 & 0.23 & 0.25 & 0.991170 \\
SD & 4.35 & 1.0052 & 1.3 & 0.31 & 0.12 & 0.37 & 0.60 & 18.48 & 0.17 & 0.17 & 0.000866 \\
Min & 6 & 2 & 2 & 2 & 0.0011 & 1 & 1 & 23 & 0.0063 & 0.0019 & 0.990000 \\
Max & 20 & 5 & 5 & 3 & 0.4 & 2 & 3 & 78 & 0.68 & 0.64 & 0.992900
\end{tabular}
\caption{Top Performing Results from Experiment 1 (150 Trials)} \label{fig:results_exp_1_good} \end{figure}
% \begin{figure}[H]
% \small
% \begin{centering}
% \begin{tabular}{l l l l l l l l l l l l l}
% \end{tabular}
% \caption{Bottom Performing Results from Experiment 1 (150 Trials)}
% \label{fig:results_exp_1_poor}
% \end{centering}
% \end{figure}
\begin{figure}[H] \centering \scriptsize
\begin{tabular}{l l l l}
Legend: \\ \hline \\
VA : & Test Accuracy & K1 : & Kernel 1 Dimensions \\
K2 : & Kernel 2 Dimensions & BS : & Batch Size \\
DP1 : & Dropout Prob. 1 & DP2 : & Dropout Prob. 2 \\
PS : & Pooling Size & S1 : & Strides 1 \\
S2 : & Strides 2 & TP : & Time Penalty \\
E : & Epochs \\
\end{tabular}
\end{figure}
\newpage
The results from Experiment 1 are depicted graphically in Figures~\ref{fig:bs_vs_dropout} to~\ref{fig:pooling_vs_kernel}. In those figures, I plotted three hyperparameters against each other and used a colormap to depict the test accuracy. Figure~\ref{fig:exp1overtime} shows that the test accuracy under hyperopt does not increase with the number of trials.
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis}
[
% title=Experiment 1,every axis title/.style={above right,at={(0,1)}},
width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=114, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp1.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 1: Test Accuracy vs. Trial Number} \label{fig:exp1overtime} \end{figure}
\subsection{Experiment 2}
I changed the learning rate optimizer to standard gradient descent and tuned the learning rate, momentum, and decay while keeping the other hyperparameters constant (at the values learned from Experiment 1).
My hyperparameter search space was:
\begin{centering}
\begin{minted}{python3}
import hyperopt

H = {
    'lr': hyperopt.hp.uniform('lr', 0.0001, 0.1),
    'nesterov': hyperopt.hp.choice('nesterov', ['True', 'False']),
    'momentum': hyperopt.hp.uniform('momentum', 0.0001, 1),
    'decay': hyperopt.hp.uniform('decay', 0.0001, 1)
}
\end{minted}
\end{centering}
The hyperparameters carried over from Experiment 1 are shown in Figure~\ref{fig:hyp_exp2_onward}.
\begin{figure}[H] \footnotesize \centering
\begin{tabular}{l l}
Kernel 1 Dimensions: & 3x3 \\
Kernel 2 Dimensions: & 5x5 \\
Batch Size: & 55 \\
Number of Epochs: & 6 \\
Keep Probability 1: & 0.0094 \\
Keep Probability 2: & 0.3078 \\
Pooling Dimensions: & 2x2 \\
Strides 1: & 1 \\
Strides 2: & 1
% Bias Initializer & He Normal \\
% Weight Initializer & He Normal
\end{tabular}
\caption{Hyperparameters used in Experiment 2 Onward} \label{fig:hyp_exp2_onward} \end{figure}
For reference, the hyperparameter configuration used by the Keras authors to achieve a 99.25\% classification accuracy is shown in Figure~\ref{fig:hyp_keras}.
\begin{figure}[H] \footnotesize \centering
\begin{tabular}{l l }
Kernel 1 Dimensions: & 3x3 \\
Kernel 2 Dimensions: & 3x3 \\
Batch Size: & 128 \\
Number of Epochs: & 12 \\
Keep Probability 1: & 0.25 \\
Keep Probability 2: & 0.5 \\
Pooling Dimensions: & 2x2 \\
Strides 1: & 1 \\
Strides 2: & 1
\end{tabular}
\caption{Hyperparameters used by Keras Authors} \label{fig:hyp_keras} \end{figure}
\subsubsection{Results}
I ran this experiment four times: once with model $A$, which used the top-performing hyperparameter configuration arrived at in Experiment 1; model $B$, which used the average of the top-performing hyperparameter configurations; model $C$, which used the average of the worst-performing hyperparameter configurations; and model $D$, which used the worst-performing hyperparameter configuration. The results of the experiment can be seen in Figures~\ref{fig:exp2_75_best} to~\ref{fig:exp2_75_worst}. Figures~\ref{fig:exp2_best_nesterov_true} to~\ref{fig:exp2_worst_nesterov_false} may also be of interest.
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis} [ width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=75, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_best.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model A): Test Accuracy vs. Trial Number} \label{fig:exp2_75_best} \end{figure}
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis} [ width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=75, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_average.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model B): Test Accuracy vs. Trial Number} \label{fig:exp2_75_average} \end{figure}
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis} [ width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=75, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_average_bad.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model C): Test Accuracy vs. Trial Number} \label{fig:exp2_75_average_bad} \end{figure}
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis} [ width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=75, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_worst.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model D): Test Accuracy vs. Trial Number} \label{fig:exp2_75_worst} \end{figure}
\subsection{Experiment 3}
I optimized the learning rate and decay for 150 trials. I used the following null prior distribution; the other hyperparameters were as found in Figure~\ref{fig:hyp_exp2_onward}.
\begin{centering}
\begin{minted}{python3}
import hyperopt

H = {
    'lr': hyperopt.hp.uniform('lr', 0.03, 0.05),
    'decay': hyperopt.hp.uniform('decay', 0.0001, 1)
}
\end{minted}
\end{centering}
The results can be seen in Figures~\ref{fig:exp3overtime}, \ref{fig:exp3_150_trials}, and~\ref{fig:exp3_150_trials_zoomed}. As with Experiments 1 and 2, the test loss did not decrease over time; the probability of choosing a good configuration seems to be the same throughout the trials.
\begin{figure}[H] \centering
\begin{tikzpicture}
\begin{axis} [ width=0.5*\linewidth, grid=major, grid style={dashed, gray!30}, xlabel=Trial Number, ylabel=Test Accuracy, xmin=0, xmax=151, ymin=0, ymax=1, mark=o ]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp4_150.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 3: Test Accuracy vs. Trial Number} \label{fig:exp3overtime} \end{figure}
\section{Discussion}
The experimental part of this project gave me the time I needed to develop a feel for how hyperparameter selection affects performance. I see now that while hyperparameter optimization can make all the difference in classification performance, there is not a single best hyperparameter configuration. Rather, there is a distribution of good hyperparameter configurations. For example, the classification performance achieved in Experiment 1 was only slightly better than that of the out-of-the-box configuration, but the hyperparameter values were quite different from the default values.
% The results depicted in Figure~\ref{fig:exp3_150_trials} could be
% misinterpreted. It looks like hyperopt is choosing better and better
% hyperparameter configurations over time, but as we saw in
% Figure~\ref{fig:exp1overtime}, \ref{fig:exp2_75_best},
% \ref{fig:exp2_75_average}, \ref{fig:exp2_75_average_bad},
% \ref{fig:exp2_75_worst}, and~\ref{fig:exp3overtime}, hyperopt does not choose
% better configurations over time. I would rather use a hyperparameter optimizing
% algorithm that converges to an optimal set of hyperparameters.
Figures~\ref{fig:exp3_150_trials} and~\ref{fig:exp3_150_trials_zoomed} probably do support the claims made by Bergstra and Bengio that only a relatively small fraction of all possible hyperparameter configurations lead to high-performing models. I would like to spend more time with Tree of Parzen Estimators and other hyperparameter optimization algorithms during my master's degree. I will implement my own version of a neural network with back-propagation, gradient descent, and a reinforcement learning algorithm to tune the hyperparameters, and I will write about it in a blog dedicated to documenting the process.