\section{Hyperparameter Optimization and Experimentation for Convolutional Neural Networks}

This research project was inspired by my own experience building a
deep learning image classifier in the winter semester of 2017. My
goal was to study hyperparameter optimization of convolutional
neural networks in the context of digit classification using the
MNIST dataset. I begin with an introduction to artificial neural
networks and hyperparameter optimization; the final section is
dedicated to experiments and their results.
\subsection{Artificial Neural Networks}
Artificial neural networks form a class of functions which we use in machine
learning for tasks such as image classification and machine translation. This
project considers only the supervised learning scenario, wherein a neural
network is trained on a set $\mathbf{X}_{\text{train}} = \{(\mathbf{x},y) \mid
\mathbf{x} \in \mathbb{R}^k,\ y \in \mathbb{R}\}$ of labelled training data and
we test the neural network's classification accuracy using a distinct test
dataset $\mathbf{X}_{\text{test}}$ of the same form as
$\mathbf{X}_{\text{train}}$, but containing fewer entries.
The process of training produces a neural network, which is a function $\hat{y}
= f_{\Theta}(\mathbf{x})$, parameterized by a set $\Theta = \{(\boldsymbol{\omega}, b) \mid
\boldsymbol{\omega} \in \mathbb{R}^k,\ b \in \mathbb{R} \}$ of weights and biases. The
trained neural network $f_{\Theta}$ maps $\mathbf{x} \in \mathbb{R}^k$ to a
prediction $\hat{y}$, and we can test its performance by evaluating the
differences between $\hat{y}$ and the $y \in \mathbf{X}_{\text{test}}$.
Before we get to the specific details of neural networks, it is instructive to
discuss one of the oldest predictive modelling algorithms, \textit{linear regression}.
Like a neural network, a linear regression model is a function which maps an
$\mathbf{x} \in \mathbb{R}^k$ to a prediction $\hat{y}$. The model is
parameterized by a set of learnable weights $\mathbf{w} \in \mathbb{R}^k$ and an
optional bias term $b \in \mathbb{R}$. Once we have trained the model, usually
by minimizing the mean squared error, we can use it for predicting $y$ given any
$\mathbf{x} \in \mathbf{X}$. The formula for linear regression is as follows
\begin{equation}
\hat{y} = \sum_{i=1}^k x_iw_i + b.
\label{fig:lin_reg}
\end{equation}
The weight vector $\mathbf{w}$ of a linear regression model can be thought of as
an expression for how important each entry of $\mathbf{x}$ is for a specific
output $y$. In a sense, when we train a classifier it is like we are trying to
figure out a recipe (i.e. the correct proportions of ingredients given a list of
ingredients and examples of the desired final product) through trial and error.
At the end of training, the best set of proportions is stored in $\mathbf{w}$,
which tells us in what proportion to combine ingredients so that we obtain a
reasonably good approximation of the desired result, as in (\ref{fig:lin_reg}).
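To make the recipe analogy concrete, here is a minimal numerical sketch of (\ref{fig:lin_reg}); the weights, inputs, and bias are made-up values chosen purely for illustration.

```python
import numpy as np

# A minimal sketch of the linear regression prediction above.
# The weights, inputs, and bias are illustrative (made-up) values.
x = np.array([1.0, 2.0, 3.0])   # input vector with k = 3 features
w = np.array([0.5, -1.0, 2.0])  # learned weight vector
b = 0.25                        # optional bias term

y_hat = np.dot(x, w) + b        # sum of x_i * w_i, plus b
print(y_hat)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```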
\begin{figure}[H]
\centering
\begin{tikzpicture}
[
plain/.style={
draw=none, fill=none
},
net/.style={
matrix of nodes,
nodes={
draw,
circle,
inner sep = 6pt
},
nodes in empty cells,
column sep = 1cm,
row sep = -6pt
},
>=latex
]
\matrix[net] (mat) {
|[plain]| & |[plain]| & |[plain]| \\
$h_1$ & |[plain]| \\
|[plain]| & $g_1$ \\
$h_2$ & |[plain]| \\
|[plain]| & |[plain]| \\
$h_3$ & $g_2$ & $f$ \\
|[plain]| & |[plain]| \\
$h_4$ & |[plain]| \\
|[plain]| & $g_3$ \\
$h_5$ & |[plain]| \\
};
\foreach \ai [ count=\mi ] in {2,4,...,10} \draw[<-] (mat-\ai-1) -- node[above] {$x_\mi$} +(-2cm,0);
\foreach \ai in {2,4,...,10} { \foreach \aii in {3,6,9} \draw[->] (mat-\ai-1) -- (mat-\aii-2); }
\foreach \ai in {3,6,9} \draw[->] (mat-\ai-2) -- (mat-6-3);
\draw[->] (mat-6-3) -- node[above] {$y$} +(2cm,0);
\end{tikzpicture}
\caption{A Simple Deep Neural Network with a Single Hidden Layer}
\label{fig:simple_deep_nn}
\end{figure}
In the following, we will see how linear regression relates to neural networks,
which I will define now. An \textit{artificial neural network} is a function
defined by a directed graph (computations flow in a specified direction)
whose connectivity implements function composition. A weighted summation occurs
at each node of the graph, and each node's output is wrapped in an activation
function. The weighted summation of inputs inside each node is not unlike a
linear regression, so in a sense an artificial neural network describes a way of
composing linear regression models. A \textit{deep neural network} is a neural
network with at least one hidden layer, as depicted in
Figure~\ref{fig:simple_deep_nn}.
\begin{figure}[H]
\centering
\begin{tikzpicture}
[
neuron/.style={
draw, circle, inner sep=1pt, join = by -latex
},
activation/.style={
draw, inner sep=2pt, join = by -latex
},
start chain=3,
node distance = 6mm
]
\begin{scope}[start chain=1]
\node[on chain=1] at (-2,2cm) (x1) {$x_1$};
\node[on chain=1,join=by -latex] (w1) {$w_1$};
\end{scope}
\begin{scope}[start chain=2]
\node[on chain=2] at (-2,1cm) (x2) {$x_2$};
\node[on chain=2,join=by -latex] (w2) {$w_2$};
\end{scope}
\node[on chain=3] at (-2, 0cm) (x3) {$\ \vdots\ $};
\node[on chain=3] (w3) {$\ \vdots\ \ $};
\node[on chain=3, neuron] at (1, 0) (sigma) {$\displaystyle \sum_{k=1}^m w_kx_k + b$};
\begin{scope}[start chain=4]
\node[on chain=4] at (-2,-1cm) (x4) {$x_4$};
\node[on chain=4,join=by -latex] (w4) {$w_4$};
\end{scope}
\begin{scope}[start chain=5]
\node[on chain=5] at (-2,-2cm) (xm) {$x_m$};
\node[on chain=5,join=by -latex] (wm) {$w_m$};
\end{scope}
\begin{scope}[start chain=6]
\node[on chain=6, join=by -latex] at (4,0cm) (preactivation) {};
\node[on chain=6, join=by -latex] () { Activation } ;
\node[on chain=6, join=by -latex] (activation) {$ \cdots $};
\end{scope}
\draw[-latex] (w1) -- (sigma);
\draw[-latex] (w2) -- (sigma);
\draw[-latex] (w3) -- (sigma);
\draw[-latex] (w4) -- (sigma);
\draw[-latex] (wm) -- (sigma);
\end{tikzpicture}
\caption{A Neuron}
\label{fig:node_of_nn}
\end{figure}
We can express the neural network in Figure~\ref{fig:simple_deep_nn} as a
composition of functions (for clarity, the activation functions are taken to
be the identity):
\begin{align}
\label{eq:composition}
y & = f(g_1(h_1, h_2, \dots, h_5),\ g_2(h_1, h_2, \dots, h_5),\ g_3(h_1, h_2, \dots, h_5)) \\
& = f\left(\sum_{j=1}^5\omega_{1,j}h_{j} + c_1,\ \sum_{k=1}^5\omega_{2,k}h_{k} + c_2,\ \sum_{l=1}^5\omega_{3,l}h_{l} + c_3\right) \\
& = \sum_{i=1}^3\omega_i\left(\sum_{j=1}^5\omega_{i,j}h_{j} + c_i \right) + b \\
& = \sum_{i=1}^3\sum_{j=1}^5\omega_i\omega_{i,j}h_{j} + \sum_{i=1}^3\omega_ic_i + b
\end{align}
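Since every operation in the derivation above is linear (no activation functions are applied), the two-layer network collapses to a single weighted sum. The sketch below, using arbitrary random weights, checks this numerically.

```python
import numpy as np

# Evaluate the network layer by layer, then via the collapsed
# single-sum form derived above, and check that they agree.
rng = np.random.default_rng(0)
h = rng.standard_normal(5)       # hidden-layer outputs h_1, ..., h_5
W = rng.standard_normal((3, 5))  # weights omega_{i,j} of g_1, g_2, g_3
c = rng.standard_normal(3)       # biases c_1, c_2, c_3
w_out = rng.standard_normal(3)   # weights omega_i of f
b = 0.1                          # bias of f

g = W @ h + c                    # y = f(g_1(h), g_2(h), g_3(h))
y_composed = w_out @ g + b
y_collapsed = w_out @ W @ h + w_out @ c + b  # the collapsed form

assert np.isclose(y_composed, y_collapsed)
```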
\subsection{Digit Classification}
Digit classification using the MNIST dataset of handwritten digits is one of
the oldest image classification benchmarks. The goal of the task is to correctly
classify an image of a handwritten digit. Since handwriting varies, any
classifier will need to see many examples of each digit in order to learn which
features are essential to each digit class. The more variable the concept, the
more data we usually need in order to learn its defining properties.
When using a classification algorithm other than a neural network, practitioners
usually have to hand-engineer kernels or ``feature maps''. This requires domain
experts, which can be costly and time consuming. Deep neural networks, on the
other hand, learn kernels from the data during training. This has the
added advantage of not introducing the additional bias that hand-engineered
feature maps may carry.
% It's like a kernel represents a class and we need to learn how to describe the
% data using objects from these classes, and objects from these classes are fit to
% the data using weights.
Neural network methods are currently the leaders in digit classification, though
support vector machines (SVMs) remain competitive. According to Wikipedia, the
current state of the art in digit classification accuracy is 99.79\%, achieved
with a convolutional neural network; for reference, the best SVM method achieves
an accuracy of 99.44\%. The MNIST dataset contains 60,000 training examples
(6,000 for each digit from 0 to 9) and 10,000 test examples (1,000 for each
digit). The reason for having separate training and testing datasets will be
made clear later on.
\begin{figure}[H]
\centering \includegraphics[width=0.4\linewidth]{mnist}
\caption{We need many examples to learn the general form of a number}
\label{fig:mnist}
\end{figure}
Some traditional methods for classification start by flattening the $28 \times
28$ pixel images into 784-dimensional vectors, since the method may require the
input to be a vector; this enables a weighted sum to be computed. The problem
with this is that we lose contextual information: we no longer know the
coordinates of each pixel, and we know little about the neighbouring pixel
values. Contextual information is important for identifying shapes, which gives
more meaning to pixel values; e.g.\ a dark pixel adjacent to light pixels
has a different meaning to us than a dark pixel surrounded by other dark pixels.
We retain contextual information about the pixel values by using a
convolutional kernel, which operates on the original (unflattened) $28 \times
28$ pixel image.
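Flattening itself is a one-line operation; the stand-in array below takes the place of a real MNIST image. Note how pixels that were vertical neighbours end up 28 positions apart in the flattened vector.

```python
import numpy as np

# A stand-in for a 28x28 MNIST image: pixel (r, c) holds value 28*r + c.
image = np.arange(28 * 28).reshape(28, 28)
flat = image.flatten()          # 784-dimensional vector

print(flat.shape)               # (784,)
# Pixels that were vertical neighbours are now 28 positions apart:
print(flat[0], flat[28])        # 0 28  (pixel (0,0) and pixel (1,0))
```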
\begin{figure}[H]
\centering \includegraphics[width=0.2\linewidth]{mnist_single}
\caption{A single digit}
\label{fig:mnist_single}
\end{figure}
% Convolutions are at the heart of many image transformations, for
% instance the GNU Image Manipulation Program has a generic convolution filter
% where the user can enter desired values for a convolutional kernel to be applied
% to the image in order to achieve edge detection, smoothing, embossing, etc...
In our context, the convolution operation is a way of summing neighbouring pixel
values (how many is determined by the dimensions of the kernel),
weighted by the values stored in the convolution kernel. For example, given a matrix
$M$ of pixel values and a kernel $K$ of weights, we can depict the discrete
convolution operation as follows.
\begin{align}
\label{eq:convolution_1}
M & =
\begin{bmatrix}
m_{0,0} & m_{0,1} & \dots & m_{0,n} \\
m_{1,0} & m_{1,1} & \dots & m_{1,n} \\
\vdots & \vdots & \ddots & \vdots \\
m_{m,0} & m_{m,1} & \dots & m_{m,n}
\end{bmatrix}
\end{align}
\begin{align}
\label{eq:convolution_2}
K & =
\begin{bmatrix}
k_{0,0} & k_{0,1} \\
k_{1,0} & k_{1,1}
\end{bmatrix}
\end{align}
\begin{align}
\label{eq:convolution_3}
\left( M * K \right) [0,0]& = m_{0,0} \times k_{0,0} + m_{0,1} \times k_{0,1} + m_{1,0} \times k_{1,0} + m_{1,1} \times k_{1,1}.
\end{align}
In a convolutional neural network, each neuron computes the activation function
of a convolution of a small patch of pixels with the weights stored in a kernel. In
(\ref{eq:convolution_3}), we convolved the matrix $M$ with the kernel $K$ at
position $(0,0)$ of $M$.
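The single convolution step in (\ref{eq:convolution_3}) can be checked numerically; the matrix and kernel values below are made up for illustration.

```python
import numpy as np

# (M * K)[0,0] with a 2x2 kernel: elementwise multiply the 2x2 patch
# of M at position (0,0) with K, then sum. Values are illustrative.
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
K = np.array([[1, 0],
              [0, -1]])

patch = M[0:2, 0:2]            # the 2x2 patch at position (0,0)
out_00 = np.sum(patch * K)
print(out_00)  # 1*1 + 2*0 + 4*0 + 5*(-1) = -4
```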
One by one, each neuron of a given layer computes a weighted sum (more
precisely, a convolution) for each coordinate $(x,y)$ of $M$. All neurons in the
same layer use the same kernel, which means the neurons of each layer share the
same weights. Weight sharing makes the neural network's ability to detect
features invariant under translation: since the kernel passes over the
entire input space, it can identify features anywhere in the input space.
If we use padding, we can control the size of the output matrix; we can even
obtain a matrix of the same dimensions as the original, as illustrated in
Figure~\ref{fig:conv_padding}. If we do not use padding, we produce a new
matrix of smaller dimensions. In either case, the output matrix's entries are
weighted sums of the entries of $M$.
Design choices with respect to hyperparameters include the dimensions of the
kernel $K$, the size of the stride (the number of coordinates to skip each
time we slide the kernel over the image), and whether or not to use padding.
Kernel size and stride also affect the dimensions of the output and the number
of channels returned by the convolution layer.
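These three choices determine the output size through a standard formula, sketched below as a small helper (the function name is my own).

```python
def conv_output_size(n, k, s=1, p=0):
    """Output size along one dimension of a convolution:
    n = input size, k = kernel size, s = stride, p = padding."""
    return (n + 2 * p - k) // s + 1

# A 28x28 MNIST image convolved with a 3x3 kernel:
print(conv_output_size(28, 3))            # 26 -> no padding shrinks the output
print(conv_output_size(28, 3, p=1))       # 28 -> padding 1 preserves the size
print(conv_output_size(28, 3, s=2, p=1))  # 14 -> stride 2 sub-samples
```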
\begin{figure}[H]
\centering
\begin{tikzpicture}
\node [fill=green!25] at (-0.75,+0.75) {$a$}; \node
[fill=green!25!yellow!25] at (-0.25,+0.75) {$b$}; \node
[fill=green!25!blue!25] at (-0.75,+0.25) {$e$}; \node
[fill=green!75!yellow!75!blue!75!red!75] at (-0.25,+0.25) {$f$};
\node [fill=green!25!yellow!25] at (+0.25,+0.75) {$c$}; \node
[fill=yellow!25] at (+0.75,+0.75) {$d$}; \node
[fill=green!75!yellow!75!blue!75!red!75] at (+0.25,+0.25) {$g$}; \node
[fill=red!25!yellow!25] at (+0.75,+0.25) {$h$};
\node [fill=green!25!blue!25] at (-0.75,-0.25) {$i$}; \node
[fill=green!75!yellow!75!blue!75!red!75] at (-0.25,-0.25) {$j$}; \node
[fill=blue!25] at (-0.75,-0.75) {$m$}; \node [fill=red!25!blue!25] at
(-0.25,-0.75) {$n$};
\node [fill=green!75!yellow!75!blue!75!red!75] at (+0.25,-0.25) {$k$}; \node
[fill=red!25!yellow!25] at (+0.75,-0.25) {$l$}; \node [fill=red!25!blue!25]
at (+0.25,-0.75) {$o$}; \node [fill=red!25] at (+0.75,-0.75) {$p$};
\draw [->] (+1.25,0) -- node[above] {3x3} (+3.5,0); \draw [->] (+1.25,0) --
node[below] {Kernel} (+3.5,0);
\node at (+4,+0.25) [fill=green!25] {$a'$}; \node at (+4.5,+0.25)
[fill=yellow!25] {$b'$};
\node at (+4,-0.25) [fill=blue!25] {$c'$}; \node at (+4.5,-0.25)
[fill=red!25] {$d'$};
\end{tikzpicture} \\ \bigskip
\begin{tikzpicture}
% zeros left
\node at (-1.25,+1.25) [fill=gray!10] {0}; \node at (-1.25,+0.75)
[fill=gray!10] {0}; \node at (-1.25,+0.25) [fill=gray!10] {0}; \node at
(-1.25,-0.25) [fill=gray!10] {0}; \node at (-1.25,-0.75) [fill=gray!10] {0};
\node at (-1.25,-1.25) [fill=gray!10] {0};
% zeros right
\node at (+1.25,+1.25) [fill=gray!10] {0}; \node at (+1.25,+0.75)
[fill=gray!10] {0}; \node at (+1.25,+0.25) [fill=gray!10] {0}; \node at
(+1.25,-0.25) [fill=gray!10] {0}; \node at (+1.25,-0.75) [fill=gray!10] {0};
\node at (+1.25,-1.25) [fill=gray!10] {0};
% zeros bottom
\node at (-0.75,-1.25) [fill=gray!10] {0}; \node at (-0.25,-1.25)
[fill=gray!10] {0}; \node at (+0.25,-1.25) [fill=gray!10] {0}; \node at
(+0.75,-1.25) [fill=gray!10] {0};
% zeros top
\node at (-0.75,+1.25) [fill=gray!10] {0}; \node at (-0.25,+1.25)
[fill=gray!10] {0}; \node at (+0.25,+1.25) [fill=gray!10] {0}; \node at
(+0.75,+1.25) [fill=gray!10] {0};
\draw [->] (+1.6,0) -- node[above] {3x3} (+3.5,0); \draw [->] (+1.6,0) --
node[below] {Kernel} (+3.5,0);
\node at (-0.75,+0.75) {$a$};
\node at (-0.25,+0.75) {$b$};
\node at (-0.75,+0.25) {$e$};
\node at (-0.25,+0.25) {$f$};
\node at (+0.25,+0.75) {$c$};
\node at (+0.75,+0.75) {$d$};
\node at (+0.25,+0.25) {$g$};
\node at (+0.75,+0.25) {$h$};
\node at (-0.75,-0.25) {$i$};
\node at (-0.25,-0.25) {$j$};
\node at (-0.75,-0.75) {$m$};
\node at (-0.25,-0.75) {$n$};
\node at (+0.25,-0.25) {$k$};
\node at (+0.75,-0.25) {$l$};
\node at (+0.25,-0.75) {$o$};
\node at (+0.75,-0.75) {$p$};
\begin{scope}[xshift=4.6cm]
\node at (-0.75,+0.75) {$a'$}; \node at (-0.25,+0.75) {$b'$}; \node at
(-0.75,+0.25) {$e'$}; \node at (-0.25,+0.25) {$f'$};
\node at (+0.25,+0.75) {$c'$}; \node at (+0.75,+0.75) {$d'$}; \node at
(+0.25,+0.25) {$g'$}; \node at (+0.75,+0.25) {$h'$};
\node at (-0.75,-0.25) {$i'$}; \node at (-0.25,-0.25) {$j'$}; \node at
(-0.75,-0.75) {$m'$}; \node at (-0.25,-0.75) {$n'$};
\node at (+0.25,-0.25) {$k'$}; \node at (+0.75,-0.25) {$l'$}; \node at
(+0.25,-0.75) {$o'$}; \node at (+0.75,-0.75) {$p'$};
\end{scope}
\end{tikzpicture}
\caption{Convolution (with stride 1) with and without padding of size 1}
\label{fig:conv_padding}
\end{figure}
Each convolutional layer has its own set of kernel weights, and through
training some of these become more sensitive to specific features: a layer and
its kernel may become expert at identifying lines, curves, etc.
Pooling is another common operation performed in convolutional neural nets. It
is similar to convolution in the sense that both are methods of sub-sampling.
Max pooling in particular picks the maximum out of a region of some
specified size, as seen in Figure~\ref{fig:maxpooling}. Pooling further
decreases the number of degrees of freedom, which helps prevent over-fitting.
Most modern convolutional neural networks make use of the convolution stride to
achieve a similar sub-sampling effect.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\node [fill=green!25] at (-0.75,+0.75) {1}; \node [fill=green!25] at
(-0.25,+0.75) {2}; \node [fill=green!25] at (-0.75,+0.25) {3}; \node
[fill=green!25] at (-0.25,+0.25) {4};
\node [fill=yellow!25] at (+0.25,+0.75) {5}; \node [fill=yellow!25] at
(+0.75,+0.75) {6}; \node [fill=yellow!25] at (+0.25,+0.25) {7}; \node
[fill=yellow!25] at (+0.75,+0.25) {8};
\node [fill=blue!25] at (-0.75,-0.25) {9}; \node [fill=blue!25] at
(-0.25,-0.25) {9}; \node [fill=blue!25] at (-0.75,-0.75) {0}; \node
[fill=blue!25] at (-0.25,-0.75) {9};
\node [fill=red!25] at (+0.25,-0.25) {3}; \node [fill=red!25] at
(+0.75,-0.25) {4}; \node [fill=red!25] at (+0.25,-0.75) {5}; \node
[fill=red!25] at (+0.75,-0.75) {6};
\draw [->] (+1.25,0) -- node[above] {2x2} (+3.5,0); \draw [->] (+1.25,0) --
node[below] {Max Pooling} (+3.5,0);
\node at (+4,+0.25) [fill=green!25] {4}; \node at (+4.5,+0.25)
[fill=yellow!25] {8};
\node at (+4,-0.25) [fill=blue!25] {9}; \node at (+4.5,-0.25) [fill=red!25]
{6};
\end{tikzpicture} \\ \bigskip
\caption{Max Pooling}
\label{fig:maxpooling}
\end{figure}
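Max pooling is simple enough to reproduce directly; the input below uses the same values as Figure~\ref{fig:maxpooling}.

```python
import numpy as np

# The 4x4 input from the max pooling figure.
x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 9, 3, 4],
              [0, 9, 5, 6]])

# Group the matrix into 2x2 blocks, then take the max of each block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[4 8]
               #  [9 6]]
```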
Because the output of layer $k$ is the input of layer $k+1$, the lower layers
learn simple, low-level concepts while the higher layers learn higher-order
concepts, all of which are weighted sums of weighted sums of some input
data.
I implemented classification as follows. I first encoded the labels
of the MNIST dataset using one-hot encoding, so that each image in the
training and test sets is accompanied by a vector of length 10. Each of the 10
indices corresponds to a unique digit. If we let the zeroth entry correspond
to the number 0 and the first entry correspond to the number 1, we obtain
the following encoding:
\begin{equation}
\tiny{
\label{eq:onehot}
0 =
\begin{bmatrix}
1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}, \quad 1 =
\begin{bmatrix}
0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}, \quad 2 =
\begin{bmatrix}
0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}, \quad \cdots \quad 9 =
\begin{bmatrix}
0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1
\end{bmatrix}}.
\end{equation}
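A minimal sketch of this encoding (the helper name is my own):

```python
import numpy as np

def one_hot(label, num_classes=10):
    """Return the length-10 indicator vector for a digit label."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

print(one_hot(0))  # [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(9))  # [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
```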
The final layer has only ten neurons, one for each of the classes we want to
learn; in our case they correspond to the digits 0 to 9. We can
think of the final layer as any other layer: each neuron responds to
input from the layers before it. For example, the neuron corresponding to the
number 9 will output a number close to 1 when the input to the neuron is the
right combination of concepts from the layers before it.
Once the neural net has returned its predicted label, we want to compare it
with the true label from our test dataset. We devise a notion of distance to
measure how far our prediction is from what it is supposed to be. We call
this the loss function, and we seek to minimize it through iterative learning.
Each full pass over the training data is called an epoch, and in our case, since
we are working with a mini-batch method, each epoch comprises many batch
operations. Batch size is one of the hyperparameters.
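The arithmetic relating epochs, batch size, and weight updates is straightforward; the batch size below is an illustrative value, not a recommendation.

```python
import math

# One epoch is a full pass over the training set; with mini-batches,
# each epoch comprises ceil(N / batch_size) weight updates.
n_train = 60000    # MNIST training examples
batch_size = 128   # a hyperparameter (illustrative value)
epochs = 12

batches_per_epoch = math.ceil(n_train / batch_size)
total_updates = batches_per_epoch * epochs
print(batches_per_epoch, total_updates)  # 469 5628
```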
Once we have obtained a loss, we use the back-propagation algorithm to send
information from the loss backward through the network in order to compute the
gradient, which is the derivative of the loss with respect to the weights of the
graph. Since we can think of a neural network as a composite of many functions,
when we compute the gradient, we compute it for a composition of functions of
multiple variables. We can use the chain rule from calculus to derive the
gradients for the nodes in each layer, and once we have those, we can perform
\textit{gradient descent}, which pushes the weights down the error curve at a
velocity set by a \textit{learning rate}, usually denoted by $\epsilon$; see
(\ref{eq:grad_update}) below. The learning rate is a number, typically in $(0,1)
\subset \mathbb{R}$.
\begin{equation}
\label{eq:grad_update}
\boldsymbol{\omega}' = \boldsymbol{\omega} - \epsilon \nabla_{\boldsymbol{\omega}}f(\boldsymbol{\omega})
\end{equation}
If we choose $\epsilon$ too large, gradient descent may overshoot the minimum
and fail to converge; if we choose $\epsilon$ too small, gradient descent may
converge too slowly to be practical, or stall before reaching a good minimum.
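The update rule (\ref{eq:grad_update}) can be watched in action on a toy one-dimensional loss; the quadratic below is chosen purely for illustration.

```python
# Gradient descent on the toy loss f(w) = (w - 3)^2, whose minimum
# is at w = 3. Each step applies w <- w - epsilon * f'(w).
def grad(w):
    return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

w = 0.0
epsilon = 0.1               # learning rate in (0, 1)
for _ in range(100):
    w = w - epsilon * grad(w)

print(round(w, 6))  # 3.0 -- the weight has descended to the minimum
```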
\subsubsection{Regularization}
In a sense, training a neural net involves increasing the number of degrees of
freedom to maximize representational capacity. At the same time, we need to
combat the tendency to overfit the training data through regularization, which
effectively reduces the number of degrees of freedom. Convolution and pooling
greatly decrease the number of degrees of freedom while retaining some of the
information about the pixels.
Regularization techniques help the graph learn the correct probability
distribution for the data by forcing it to forget less common observations. In
other words, when the dataset contains many instances of a feature, then we know
a lot about it, and we can be more confident in the parameters learned from
observing those features. But if a feature rarely occurs in the
dataset, then regularization will prevent the weights learned from that feature
from influencing the classifier too strongly. Dropout decreases the number of
degrees of freedom by simply forgetting a subset of the neurons' outputs.
Concepts formed from features commonly observed by the network will be resistant
to dropout.
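A sketch of one common dropout variant (``inverted'' dropout, in which surviving outputs are rescaled so the expected activation is unchanged); this illustrates the idea only, and is not Keras' implementation.

```python
import numpy as np

def dropout(activations, p, rng):
    """Zero each output with probability p; rescale survivors by 1/(1-p)."""
    mask = rng.random(activations.shape) >= p   # keep with probability 1 - p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(10000)
dropped = dropout(a, p=0.5, rng=rng)

print((dropped == 0).mean())   # fraction forgotten, close to p = 0.5
print(dropped.mean())          # expected activation preserved, close to 1.0
```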
\subsection{Hyperparameter Optimization}
There are many optimization algorithms for training a neural net, including
various forms of gradient descent, but the focus of this research project is on
determining the optimal values for the hyperparameters which define the graph's
structure and control its behaviour. For the purposes of this research project, I
have limited my scope to the hyperparameters found in Figure~\ref{fig:hyperparams}.
\begin{figure}[H]
\centering
\scriptsize
\begin{tabular}{l|l}
Symbol & Description \\
\hline \\
$K_i \in \mathbb{N}$ & Dimension of the convolution kernel for each convolutional layer. \\
$S_i \in \mathbb{N}$ & Stride of the convolution kernel for each convolutional layer. \\
$BS \in \mathbb{N}$ & Training batch size. \\
$E \in \mathbb{N}$ & Number of training epochs. \\
$DP_i \in [0,1] \subset \mathbb{R}$ & The dropout probability for each dropout layer. \\
$PS \in \mathbb{N}$ & The dimensions of pooling for the pooling layer. \\
$BI \in \text{Keras Initializers}$ & The method for initializing the biases of each layer. \\
$WI \in \text{Keras Initializers}$ & The method for initializing the weights of each layer. \\
$TP \in [0,1] \subset \mathbb{R}$ & The penalty applied to training time. \\
$LR \in (0,1) \subset \mathbb{R}$ & The learning rate. \\
\end{tabular}
\caption{Hyperparameters}
\label{fig:hyperparams}
\end{figure}
An ad hoc approach to optimizing hyperparameters is to perform multiple
experiments and learn from trial and error. The practitioner may gain insight
and intuition which may be helpful for the learning task at hand. The problem
with this approach is that such insight is difficult to transfer to another
practitioner.
Methodical approaches include grid search (this quickly becomes computationally
infeasible if we want to optimize more than 2 or 3 hyperparameters) and random
sampling from a predefined hyperparameter search space as in \cite{bb_2012}.
Some background knowledge is helpful when defining the search space in order to
improve efficiency. For my experiments I used \textit{``hyperopt''}, a
hyperparameter optimization software package authored primarily by James
Bergstra, which is the focus of \cite{bergstra_making_2012}.
\subsubsection{Random Sampling}
Bergstra and Bengio explore random sampling from a hyperparameter search space,
denoted $\Lambda$, as an alternative to grid search in \cite{bb_2012}. We start
with a learning algorithm $\mathcal{A}_{\lambda}$ which maps a given dataset
$\mathbf{X}^{\text{train}}$ to a function $f$, where $\lambda \in \Lambda$ and
$\Lambda$ is the hyperparameter search space for $\mathcal{A}$. We hope our
hyperparameter optimization algorithm will find the optimal configuration
$\lambda^{*}$, where
\begin{equation}
\label{eq:argmin_of_expectation}
\lambda^{*} = \text{argmin}_{\lambda \in \Lambda} \mathbb{E}_{x \sim \mathcal{G}_x}\left[
\mathcal{L}\left( x, \mathcal{A}_{\lambda}\left( \mathbf{X}^{\text{train}}
\right) \right) \right].
\end{equation}
In words: we have a learning algorithm whose behaviour is partially
determined by a set of hyperparameters, and we hope to find the optimal set of
hyperparameters; by optimal, we mean the set of hyperparameters which
minimizes the expected loss over the true data distribution.
Since we never know the true distribution $\mathcal{G}_x$ from which our data
originate, we must accept our limitations and instead seek
\begin{equation}
\label{eq:argmin_of_empirical}
\lambda^{\dagger} \approx
\text{argmin}_{\lambda \in \Lambda} {1 \over m} \sum_{k=1}^m
\mathcal{L}\left( x_k, \mathcal{A}_{\lambda}\left(
\mathbf{X}^{\text{train}} \right) \right); \quad x_k \in \mathbf{X}^{\text{test}},
\end{equation}
where $m$ is the number of examples in the test set. We need a way to
find $\lambda^\dagger$.
Bergstra and Bengio define what they call the \textit{hyper-parameter
response function} $\Psi$: the empirical loss obtained over the test
dataset $\mathbf{X}^{\text{test}}$ by the learning algorithm
$\mathcal{A}_\lambda$ trained on the training dataset
$\mathbf{X}^{\text{train}}$ and hyperparameterized by $\lambda \in \Lambda$,
\begin{equation}
\label{eq:hyper_parameter_loss_function}
\Psi\left( \lambda \right) = {1 \over m} \sum_{k=1}^m
\mathcal{L}\left( x_k, \mathcal{A}_{\lambda}\left( \mathbf{X}^{\text{train}} \right) \right)
; \quad x_k \in \mathbf{X}^{\text{test}}.
\end{equation}
The point of defining a hyperparameter response function is to see how the
empirical loss changes as we vary the hyperparameter configuration. When the
dataset is fixed, the learning algorithm
$\mathcal{A}$ and the hyperparameters $\lambda$ are still free to change: we can
optimize the algorithm's parameters (the inner optimization) and we can optimize
the hyperparameters (the outer optimization). The task of hyperparameter
selection then becomes the task of minimizing the response function, i.e.\ we desire
\begin{equation}
\label{eq:out_optimization_task}
\lambda^\dagger = \text{argmin}_{\lambda \in \Lambda} \Psi(\lambda).
\end{equation}
If we searched through all of $\Lambda$, this would be equivalent to performing a
grid search. In reality, $\Psi$ and $\Lambda$ depend on the dataset, goals, and
learning algorithm. This brings us to the main hypothesis of Bergstra and
Bengio's paper: the response function is more sensitive to changes in some
hyperparameters than others, so grid search wastes time exploring values of the
less sensitive hyperparameters.
The reason for this non-uniformly distributed sensitivity, as they claim, is
that $\Psi$ has a low effective dimensionality. In a most basic sense, we can
say that a function $f(x,y)$ has a low effective dimensionality if it can be
approximated with sufficient accuracy by another function of lower dimension
$g(x)$, i.e. $f(x, y) \approx g(x)$.
One way to characterize the shape of a high-dimensional function is to look at
how much it varies in each dimension. Bergstra and Bengio used Gaussian process
regression to gather statistics on the response function and then measured its
effective dimensionality.
From the experimental data collected via the Gaussian process regression, they
were able to measure the relevance of each hyperparameter to the value of
$\Psi$. Their conclusion was that, for a given dataset, only a small fraction of
the hyperparameters had a strong influence on $\Psi$.
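Their argument can be illustrated with a toy response function that effectively depends on only one of its two hyperparameters; the function and trial budget below are my own invented example, not taken from the paper.

```python
import numpy as np

# With a budget of 16 trials, a 4x4 grid tests only 4 distinct values
# of each hyperparameter, while random search tests a fresh value in
# every dimension on every trial. If Psi effectively depends on the
# learning rate alone, random search probes it four times as finely.
rng = np.random.default_rng(42)

def psi(lr, other):        # toy response function: 'other' barely matters
    return (lr - 0.3) ** 2 + 1e-4 * other

budget = 16
grid_lrs = np.repeat(np.linspace(0, 1, 4), 4)  # grid: 4 distinct lr values
rand_lrs = rng.random(budget)                  # random: 16 distinct lr values

grid_best = min(psi(lr, 0.5) for lr in grid_lrs)
rand_best = min(psi(lr, 0.5) for lr in rand_lrs)
print(len(set(grid_lrs.tolist())), len(set(rand_lrs.tolist())))  # 4 16
print(grid_best, rand_best)
```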
% bergstra and bengio also claim that grid search wastes too much time
% exploring the irrelevant dimensions and they claim that random search is more
% efficient than grid search in higher dimensions because it samples more
% effectively from the sensitive dimensions.
\section{Software}
Using any of a number of deep learning software libraries can greatly reduce the
programming time needed to build a state-of-the-art classifier, but given the
task at hand and the dataset, the programmer will find there are still many
design decisions to be made with respect to the model's structure. Once a
model has been decided upon, the values of the hyperparameters need to be
optimized, and may need to be re-optimized if the model is to be reused for a
different task with a different dataset. To quote Bergstra, ``It is sometimes
difficult to know whether a given [model] is genuinely better, or simply better
tuned''.
\subsection{Keras}
I chose to use Keras as my programming interface to the Tensorflow engine; it
worked better on my machine than Tensorflow on its own.
\subsection{Tensorflow}
Tensorflow is an open source software library authored by members of the Google
Brain Team. It is commonly used in industry and research, which contributed to
my choice to work with it. There are other good libraries, such as Caffe,
Microsoft Cognitive Toolkit, MXNet, Theano, and Torch. Software libraries
like Tensorflow are used to perform numerical computations on directed graphs.
Within this framework, nodes represent mathematical operations and edges
represent multi-dimensional arrays of arguments to be passed to the nodes.
\subsection{Hyperopt}
For my experiments I used a hyperparameter optimization package called
``Hyperopt''. Hyperopt employs a novel \textit{Null Distribution Specification
Language}, which is used to define the hyperparameter search space. The
language is made up of an arbitrary number of dictionary entries, which may be
arbitrarily nested. The user of this package must know a priori useful intervals
and distributions for the hyperparameters, so while hyperopt does take some of
the guesswork out of hyperparameter selection and optimization, some knowledge
is still required.
The Null Distribution Specification Language describes the distributions that
would be used were we to perform a simple, unoptimized random search of the
configuration space. A \textit{null prior distribution} is an expression written
in this specification language.
Hyperopt replaces the nodes of the null prior distribution with Gaussian mixture
models derived from the experimental history. The Gaussian mixture models'
means and variances are refined after each experimental trial. The experimental
history is stored by hyperopt in a database which tracks the hyperparameter
configuration and the loss obtained at the end of each experiment. From this
history it can learn which hyperparameters the learning algorithm is more
sensitive to.
\subsection{Tree of Parzen Estimators}
Hyperopt provides the user with two algorithms for sampling from the null prior
distribution: random search and the Tree of Parzen Estimators (TPE). I used the
latter for my experiments. TPE sequentially constructs tree-like
models from the null prior distribution (which can have a nested, tree-like
structure) to compute the expected performance of a set of hyperparameters based
on historical measurements, and then chooses new hyperparameters
to test based on this model.
These tree-like models are made by replacing the entries in the null prior
distribution with ratios of Gaussian mixture models (GMMs). A GMM is a
probabilistic model based on the assumption that each datum is generated by a
mixture of a finite number of Gaussian distributions with unknown parameters.
The Scikit-Learn documentation likens GMMs to a generalized $k$-means
clustering algorithm (a method for partitioning $n$ data points into
$k$ clusters, where each data point is assigned to the cluster whose mean is
nearest to the data point's value).
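The core idea can be sketched in one dimension. The code below is a heavily simplified caricature of TPE, not hyperopt's implementation: it splits past trials into ``good'' and ``bad'' groups by loss, fits a single Gaussian to each group (a stand-in for the mixture models), and proposes the candidate that maximizes the ratio of the two densities.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(x):                 # toy response function to minimize
    return (x - 0.7) ** 2

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Random startup trials, sorted from best (lowest loss) to worst.
history = sorted(((x, loss(x)) for x in rng.random(50)), key=lambda t: t[1])
good = np.array([x for x, _ in history[:10]])   # best 20% of trials
bad = np.array([x for x, _ in history[10:]])    # the rest

# Propose the candidate where the "good" density most dominates the "bad".
candidates = rng.random(100)
ratio = (gaussian_pdf(candidates, good.mean(), good.std() + 1e-6)
         / gaussian_pdf(candidates, bad.mean(), bad.std() + 1e-6))
proposal = candidates[np.argmax(ratio)]
print(round(float(proposal), 3))  # lands near the minimizer at 0.7
```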
% TPE is an iterative algorithm which fits one GMM $f(x)$ to the configuration of
% hyperparameters which previously minimized the loss function on the training
% set. Another GMM $g(x)$ is fit to the remaining hyperparameter values. TPE then
% chooses the hyperparameter value that maximizes $f(x)/g(x)$.
% Parzen windows classification is a technique for nonparametric density
% estimation, which can also be used for classification. Given a kernel, the
% technique approximates a given training set distribution via a linear
% combination of kernels.
% http://optunity.readthedocs.io/en/latest/user/solvers/TPE.html
% https://en.wikipedia.org/wiki/K-means_clustering
% http://scikit-learn.org/stable/modules/mixture.html
% https://compbio.soe.ucsc.edu/genex/genexTR2html/node11.html
% http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html
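In Bergstra et al.'s formulation, TPE fits one density $l(x)$ to the
hyperparameter values from the best-scoring fraction of past trials and
another density $g(x)$ to the rest, then proposes the candidate maximizing
$l(x)/g(x)$. The following is a simplified pure-Python sketch of that rule
under strong assumptions (one continuous hyperparameter, a fixed kernel
bandwidth, and plain Parzen windows standing in for hyperopt's adaptive
Parzen estimator); it is an illustration, not hyperopt's implementation.
\begin{minted}{python}
import math
import random

def parzen_density(x, data, bandwidth=0.1):
    """Parzen-window estimate: a mean of Gaussian kernels centred on data."""
    norm = bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) / norm
               for d in data) / len(data)

def tpe_suggest(history, gamma=0.25, n_candidates=50, seed=0):
    """Propose the candidate x maximizing l(x)/g(x), where l is fit to the
    best gamma-fraction of trials and g is fit to the remainder."""
    rng = random.Random(seed)
    history = sorted(history, key=lambda t: t[1])  # (x, loss), best first
    split = max(1, int(gamma * len(history)))
    good = [x for x, _ in history[:split]]
    bad = [x for x, _ in history[split:]]
    # Candidates are drawn from (a perturbed version of) the good density.
    candidates = [rng.choice(good) + rng.gauss(0, 0.1) for _ in range(n_candidates)]
    return max(candidates,
               key=lambda x: parzen_density(x, good) / (parzen_density(x, bad) + 1e-12))

# Toy history: losses are small near x = 0.2 and large near x = 0.8,
# so the suggestion should land near the good cluster.
history = [(x, abs(x - 0.2)) for x in
           [0.1, 0.15, 0.2, 0.25, 0.3, 0.7, 0.75, 0.8, 0.85, 0.9]]
suggestion = tpe_suggest(history)
\end{minted}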
\section{Experiments}
Training classifiers on the MNIST dataset does not require much computing
power, so I took advantage of this and ran multiple experiments. I used Keras
as an interface to TensorFlow to build a basic deep convolutional network.
There is nothing novel about this architecture; using the default
hyperparameter settings from Keras' GitHub repository, the model achieves a
test accuracy of 99.25\% after 12 epochs. The purpose of these experiments was
not to achieve an extremely accurate classifier; rather, I was interested in
exploring the surfaces of hyperparameter configurations over which we can
train the model and achieve good results.
\subsection{Experiment 1}
In my first experiment, I trained my networks using Adadelta, an optimizer
equipped with its own procedure for adapting the learning rate. I chose
Adadelta because I wanted to optimize the learning rate (and related
hyperparameters) only after I had optimized all of the other hyperparameters
listed in Figure~\ref{fig:hyperparams}. As an alternative, I could have fixed
the learning rate while optimizing the other hyperparameters; in hindsight,
that may have been a better choice.
For this experiment, I used an initial search space $\mathcal{H}$ defined by
the following Python dictionary.
\begin{figure}[H]
\centering
\begin{minted}{python}
H = {'keep_prob_1': hp.uniform('keep_prob_1', 0, 1),
     'keep_prob_2': hp.uniform('keep_prob_2', 0, 1),
     'batch_size': hp.choice('batch_size', list(range(20, 81))),
     'epochs': hp.choice('epochs', list(range(6, 21))),
     'time_penalty': hp.uniform('time_penalty', 0, 0.5),
     'pool_size': hp.choice('pool_size', [2, 3]),
     'kernel_size_1': hp.choice('kernel_size_1', [1, 2, 3, 4, 5]),
     'kernel_size_2': hp.choice('kernel_size_2', [1, 2, 3, 4, 5]),
     'strides_1': hp.choice('strides_1', [1, 2, 3]),
     'strides_2': hp.choice('strides_2', [1, 2, 3]),
     'bias_initializer': hp.choice('bias_initializer',
                                   ['zeros',
                                    'ones',
                                    'lecun_uniform',
                                    'lecun_normal',
                                    'glorot_normal',
                                    'he_normal',
                                    'he_uniform',
                                    'glorot_uniform']),
     'weight_initializer': hp.choice('weight_initializer',
                                     ['zeros',
                                      ...,
                                      'glorot_uniform'])}
\end{minted}
\caption{Initial Hyper-parameter Search Space}
\label{fig:initial_search_space}
\end{figure}
\subsubsection{Results}
Figure~\ref{fig:results_exp_1_good} presents statistics from the 20 best
performing hyperparameter configurations, all of which achieved a test
accuracy between 99.00\% and 99.29\%.
\begin{figure}[H]
\footnotesize
\centering
\begin{tabular}{l l l l l l l l l l l l}
& E & K1 & K2 & PS & TP & S1 & S2 & BS & DP1 & DP2 & VA \\
\hline \\
Mean & 11.45 & 4.2 & 3.4 & 2.1 & 0.13 & 1.15 & 1.40 & 56.65 & 0.23 & 0.25 & 0.991170 \\
SD & 4.35 & 1.0052 & 1.3 & 0.31 & 0.12 & 0.37 & 0.60 & 18.48 & 0.17 & 0.17 & 0.000866 \\
Min & 6 & 2 & 2 & 2 & 0.0011 & 1 & 1 & 23 & 0.0063 & 0.0019 & 0.990000 \\
Max & 20 & 5 & 5 & 3 & 0.4 & 2 & 3 & 78 & 0.68 & 0.64 & 0.992900
\end{tabular}
\caption{Top Performing Results from Experiment 1 (150 Trials)}
\label{fig:results_exp_1_good}
\end{figure}
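Summary statistics of this kind can be reproduced with a short script; the
sketch below uses only Python's standard library, and the accuracy values in
it are hypothetical stand-ins for one column of per-trial results.
\begin{minted}{python}
import statistics

def summarize(values):
    """Mean, sample standard deviation, min, and max of a column of results."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

# Hypothetical per-trial test accuracies standing in for real results.
accuracies = [0.9900, 0.9912, 0.9905, 0.9929, 0.9918]
stats = summarize(accuracies)
\end{minted}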
\begin{figure}[H]
\centering
\scriptsize
\begin{tabular}{l l l l}
Legend: \\
\hline \\
VA : & Test Accuracy & K1 : & Kernel 1 Dimensions \\
K2 : & Kernel 2 Dimensions & BS : & Batch Size \\
DP1 : & Dropout Prob. 1 & DP2 : & Dropout Prob. 2 \\
PS : & Pooling Size & S1 : & Strides 1 \\
S2 : & Strides 2 & TP : & Time Penalty \\
E : & Epochs \\
\end{tabular}
\end{figure}
\newpage
The results from Experiment 1 are depicted graphically in
Figures~\ref{fig:bs_vs_dropout} to~\ref{fig:pooling_vs_kernel}. In those
figures, I plotted three hyperparameters against one another and used a
colormap to depict the test accuracy. Figure~\ref{fig:exp1overtime} shows that
the test accuracy under hyperopt does not increase with the number of trials.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
% title=Experiment 1,every axis title/.style={above right,at={(0,1)}},
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=114,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp1.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 1: Test Accuracy vs. Trial Number}
\label{fig:exp1overtime}
\end{figure}
\subsection{Experiment 2}
For my second experiment, I changed the optimizer to standard stochastic
gradient descent and tuned the learning rate, momentum, and decay while
holding the other hyperparameters constant at the values learned in
Experiment 1. My hyperparameter search space was:
\begin{centering}
\begin{minted}{python}
H = {'lr': hyperopt.hp.uniform('lr', 0.0001, 0.1),
     'nesterov': hyperopt.hp.choice('nesterov', [True, False]),
     'momentum': hyperopt.hp.uniform('momentum', 0.0001, 1),
     'decay': hyperopt.hp.uniform('decay', 0.0001, 1)}
\end{minted}
\end{centering}
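To clarify what these hyperparameters control, here is a hedged pure-Python
sketch of an SGD update with momentum, time-based learning-rate decay, and an
optional Nesterov lookahead, in the style of Keras' classic SGD optimizer (a
simplified scalar stand-in, not the library code).
\begin{minted}{python}
def sgd_step(param, grad, velocity, step, lr0, momentum, decay, nesterov):
    """One SGD update on a scalar parameter.

    lr0:      initial learning rate, shrunk as 1 / (1 + decay * step)
    momentum: fraction of the previous velocity carried forward
    nesterov: if True, apply the momentum 'lookahead' correction
    """
    lr = lr0 / (1.0 + decay * step)
    velocity = momentum * velocity - lr * grad
    if nesterov:
        param = param + momentum * velocity - lr * grad
    else:
        param = param + velocity
    return param, velocity

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.0.
x, v = 5.0, 0.0
for step in range(200):
    x, v = sgd_step(x, 2 * x, v, step,
                    lr0=0.1, momentum=0.9, decay=0.01, nesterov=False)
\end{minted}
Note that a large decay shrinks the learning rate quickly, so decay and the
number of epochs interact when these hyperparameters are tuned jointly.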
The hyperparameters carried over from Experiment 1 are shown in
Figure~\ref{fig:hyp_exp2_onward}.
\begin{figure}[H]
\footnotesize
\centering
\begin{tabular}{l l}
Kernel 1 Dimensions: & 3x3 \\
Kernel 2 Dimensions: & 5x5 \\
Batch Size: & 55 \\
Number of Epochs: & 6 \\
Keep Probability 1: & 0.0094 \\
Keep Probability 2: & 0.3078 \\
Pooling Dimensions: & 2x2 \\
Strides 1: & 1 \\
Strides 2: & 1
% Bias Initializer & He Normal \\
% Weight Initializer & He Normal
\end{tabular}
\caption{Hyperparameters used in Experiment 2 Onward}
\label{fig:hyp_exp2_onward}
\end{figure}
For reference, the hyperparameter configuration used by the Keras authors to
achieve a 99.25\% classification accuracy is shown in
Figure~\ref{fig:hyp_keras}.
\begin{figure}[H]
\footnotesize
\centering
\begin{tabular}{l l }
Kernel 1 Dimensions: & 3x3 \\
Kernel 2 Dimensions: & 3x3 \\
Batch Size: & 128 \\
Number of Epochs: & 12 \\
Keep Probability 1: & 0.25 \\
Keep Probability 2: & 0.5 \\
Pooling Dimensions: & 2x2 \\
Strides 1: & 1 \\
Strides 2: & 1
\end{tabular}
\caption{Hyperparameters used by Keras Authors}
\label{fig:hyp_keras}
\end{figure}
\subsubsection{Results}
I ran this experiment four times. Model $A$ was built from the top performing
hyperparameter configuration found in Experiment 1; model $B$ used the average
of the top performing configurations; model $C$ used the average of the worst
performing configurations; and model $D$ used the single worst performing
configuration. The results of the experiment can be seen in
Figures~\ref{fig:exp2_75_best} to~\ref{fig:exp2_75_worst}.
Figures~\ref{fig:exp2_best_nesterov_true}
to~\ref{fig:exp2_worst_nesterov_false} may also be of interest.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=75,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_best.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model A): Test Accuracy vs. Trial Number}
\label{fig:exp2_75_best}
\end{figure}
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=75,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_average.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model B): Test Accuracy vs. Trial Number}
\label{fig:exp2_75_average}
\end{figure}
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=75,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_average_bad.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model C): Test Accuracy vs. Trial Number}
\label{fig:exp2_75_average_bad}
\end{figure}
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=75,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp2_75_worst.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 2 (Model D): Test Accuracy vs. Trial Number}
\label{fig:exp2_75_worst}
\end{figure}
\subsection{Experiment 3}
I optimized the learning rate and decay for 150 trials, using the following
null prior distribution; the other hyperparameters were as listed in
Figure~\ref{fig:hyp_exp2_onward}.
\begin{centering}
\begin{minted}{python}
H = {'lr': hyperopt.hp.uniform('lr', 0.03, 0.05),
     'decay': hyperopt.hp.uniform('decay', 0.0001, 1)}
\end{minted}
\end{centering}
The results can be seen in Figures~\ref{fig:exp3overtime},
\ref{fig:exp3_150_trials}, and~\ref{fig:exp3_150_trials_zoomed}. As in
Experiments 1 and 2, the test accuracy did not improve over the course of the
trials; the probability of choosing a good configuration appears to remain
constant over time.
\begin{figure}[H]
\centering
\begin{tikzpicture}
\begin{axis}
[
width=0.5*\linewidth,
grid=major,
grid style={dashed, gray!30},
xlabel=Trial Number,
ylabel=Test Accuracy,
xmin=0,
xmax=151,
ymin=0,
ymax=1,
mark=o
]
\addplot [only marks] table[x=trial, y=valacc, col sep=comma] {exp4_150.csv};
\end{axis}
\end{tikzpicture}
\caption{Experiment 3: Test Accuracy vs. Trial Number}
\label{fig:exp3overtime}
\end{figure}
\section{Discussion}
The experimental part of this project gave me the time I needed to develop a
feel for how hyperparameter selection affects performance. I see now that,
while hyperparameter optimization can make all the difference in
classification performance, there is no single best hyperparameter
configuration. Rather, there is a distribution of good hyperparameter
configurations. For example, the classification performance achieved in
Experiment 1 was only slightly better than that of the out-of-the-box
configuration, yet the hyperparameter values were quite different from the
defaults.
% The results depicted in Figure~\ref{fig:exp3_150_trials} could be
% misinterpreted. It looks like hyperopt is choosing better and better
% hyperparameter configurations over time, but as we saw in
% Figure~\ref{fig:exp1overtime}, \ref{fig:exp2_75_best},
% \ref{fig:exp2_75_average}, \ref{fig:exp2_75_average_bad},
% \ref{fig:exp2_75_worst}, and~\ref{fig:exp3overtime}, hyperopt does not choose
% better configurations over time. I would rather use a hyperparameter optimizing
% algorithm that converges to an optimal set of hyperparameters.
Figures~\ref{fig:exp3_150_trials} and~\ref{fig:exp3_150_trials_zoomed} appear
to support the claim made by Bergstra and Bengio that only a relatively small
fraction of all possible hyperparameter configurations leads to high
performing models.
I would like to spend more time with the Tree of Parzen Estimators and other
hyperparameter optimization algorithms during my master's degree. I plan to
implement my own neural network with back-propagation and gradient descent,
along with a reinforcement learning algorithm to tune the hyperparameters,
and to document the process in a dedicated blog.