\section{Hyperparameter Optimization}
\label{sec:background-hypopt}

While a network is learning, the parameters of its layers are
updated. These parameters are \emph{learnable} in the sense that
changing them should bring the model closer to solving a problem;
updating them happens during the training phase. Hyperparameters, on
the other hand, are not part of the learning process because they are
fixed before the model starts to train. They have to be fixed in
advance because they determine the structure, the architecture and the
learning behavior of the model, and without those in place a model
cannot start training.
Model designers have to carefully choose values for a wide range of
hyperparameters. Which hyperparameters have to be set is determined by
the type of model being used. A \gls{svm}, for example, has a penalty
parameter $C$ which controls how lenient the model is when
misclassifying training examples. The type of kernel to use is another
hyperparameter of any \gls{svm} and can only be chosen by looking at
the distribution of the underlying data. In neural networks the range
of hyperparameters is even greater because every part of the network
architecture, such as how many layers to stack, which layers to stack,
which kernel sizes to use in each \gls{cnn} layer and which activation
function(s) to place between the layers, is a parameter which can be
altered. Finding the best combination of some or all of the available
hyperparameters is called \emph{hyperparameter tuning}.

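As a minimal illustration of the \gls{svm} example, the sketch below
uses scikit-learn on synthetic data; the library, the data and the
chosen values are illustrative assumptions, not the models used in
this work. It shows the split between hyperparameters, which are fixed
before training, and learnable parameters, which emerge from it.

```python
# Illustration of the SVM example: C and the kernel are hyperparameters
# fixed before training, while the support vectors and coefficients are
# learned during fit(). scikit-learn and the synthetic data are used
# purely for illustration.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by the model designer beforehand.
clf = SVC(C=1.0, kernel="rbf")

# Learnable parameters: determined automatically during training.
clf.fit(X, y)
print(clf.support_vectors_.shape)  # support vectors are learned, not set
```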
Hyperparameter tuning can be, and often is, done manually by
researchers who select values which \emph{have been known to work
well}. While this approach works to some extent, it is not optimal
because adhering to \emph{best practice} precludes parameter
configurations which would be closer to optimal for a given data
set. Furthermore, manual tuning requires a deep understanding of the
model itself and of how each parameter influences it. Biases in a
researcher's understanding are detrimental to finding optimal
hyperparameters, and the number of possible combinations can quickly
become intractable. Automated methods which search the hyperparameter
space instead offer an unbiased and more efficient approach to
hyperparameter tuning. This type of algorithmic search is called
\emph{hyperparameter optimization}.

\subsection{Grid Search}
\label{ssec:grid-search}

There are multiple possible strategies to choose from when optimizing
hyperparameters. The most straightforward approach is grid search. In
grid search, all hyperparameters are discretized and all possible
combinations are mapped to a search space. The search space is then
sampled at evenly spaced points and the resulting vectors of
hyperparameter values are evaluated. For example, if a model has seven
hyperparameters and three of those can take on continuous values,
these three variables have to be discretized first. In practical terms
this means that the model engineer chooses suitable discrete values
for said hyperparameters. Once all hyperparameters are discrete, all
possible combinations of their values are evaluated. If each of the
seven hyperparameters has three discrete values, the number of
possible combinations is

\begin{equation}
  \label{eq:hypopt-nums}
  3\cdot3\cdot3\cdot3\cdot3\cdot3\cdot3 = 3^{7} = 2187.
\end{equation}

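The combination count can be checked directly by enumerating the full
Cartesian product of the grid; the hyperparameter names and candidate
value sets below are hypothetical placeholders.

```python
# Counting the grid from the example: seven hyperparameters with three
# discrete candidate values each. The value sets are hypothetical.
import itertools

grid = {f"hp{i}": [0.1, 1.0, 10.0] for i in range(7)}

combinations = list(itertools.product(*grid.values()))
print(len(combinations))  # 3**7 = 2187
```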
For this example, evaluating $2187$ possible combinations can already
be intractable depending on the time required for each run. Further,
grid search requires that the resolution of the grid is determined
beforehand. If the points on the grid (combinations) are spaced too
far apart, the chance of finding a global optimum is lower than with a
dense grid. A dense grid, however, results in a higher number of
possible combinations and thus more time is required for an exhaustive
search. Additionally, grid search suffers from the \emph{curse of
dimensionality} because the number of evaluations scales exponentially
with the number of hyperparameters.

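The exhaustive procedure can be sketched in a few lines; the objective
function below is a hypothetical stand-in for training and validating
a model, and the hyperparameter names and value grids are illustrative
assumptions.

```python
# A minimal grid search sketch: every combination of the discretized
# hyperparameter values is evaluated exhaustively. evaluate() is a
# hypothetical stand-in for a full training-and-validation run.
import itertools

def evaluate(lr, batch_size):
    # Stand-in validation score, maximized at lr=0.01, batch_size=64.
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 / 10_000

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [16, 64, 256]}

# Enumerate the full Cartesian product of the grid (3 * 3 = 9 runs).
configs = [dict(zip(grid, values))
           for values in itertools.product(*grid.values())]
best_config = max(configs, key=lambda cfg: evaluate(**cfg))
print(best_config)  # {'lr': 0.01, 'batch_size': 64}
```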
\subsection{Random Search}
\label{ssec:hypopt-random-search}

Random search \cite{pinto2009} is an alternative to grid search which
often yields configurations that are similar to or better than those
obtained with grid search in the same amount of time
\cite{bergstra2012}. Random search performs especially well in
high-dimensional settings because the hyperparameter response surface
often has \emph{low effective dimensionality}
\cite{bergstra2012}. That is, a small number of hyperparameters
disproportionately affects the performance of the resulting model
while the rest have a negligible effect. We use random search in this
work to improve the hyperparameters of our classification model.

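Random search replaces the fixed grid with random draws from chosen
ranges, as in the sketch below; the objective function, hyperparameter
names and sampling ranges are hypothetical assumptions, not the setup
used for the classifier in this work.

```python
# A minimal random search sketch: configurations are sampled at random
# instead of walking a fixed grid. evaluate() is a hypothetical
# stand-in for a full training-and-validation run.
import random

def evaluate(lr, dropout):
    # Stand-in validation score, maximized at lr=0.01, dropout=0.3.
    return -(lr - 0.01) ** 2 - (dropout - 0.3) ** 2

rng = random.Random(42)  # seeded only to make the sketch reproducible

# Sample 50 random configurations; the learning rate is drawn on a log
# scale, which random search handles naturally.
configs = [
    {"lr": 10 ** rng.uniform(-4, -1), "dropout": rng.uniform(0.0, 0.5)}
    for _ in range(50)
]

best_config = max(configs, key=lambda cfg: evaluate(**cfg))
print(best_config)
```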
\subsection{Evolution Strategies}
\label{ssec:hypopt-evo}

Evolution strategies follow a population-based model in which the
search starts from random initial configurations and evolves the
hyperparameters through \emph{mutation} and \emph{crossover}. Mutation
randomly changes the value of a hyperparameter and crossover creates a
new configuration by mixing the values of two existing
configurations. Hyperparameter optimization with evolution strategies
roughly goes through the following stages \cite{bischl2023}.

\begin{enumerate}
\item Set the hyperparameters to random initial values and create a
  starting population of configurations.
\item Evaluate each configuration.
\item Rank all configurations according to a fitness function.
\item Select the best-performing configurations as \emph{parents}.
\item Create child configurations from the parent configurations by
  mutation and crossover.
\item Evaluate the child configurations.
\item Go to step three and repeat the process until a termination
  condition is reached.
\end{enumerate}

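The steps above can be sketched as a short evolutionary loop; the
fitness function is a hypothetical stand-in for training and
evaluating a model, and the population sizes, ranges and mutation
scale are illustrative assumptions rather than the settings used in
this work.

```python
# The enumerated steps as a minimal evolution-strategy sketch:
# initialize a random population, rank by fitness, select parents, and
# produce children via crossover and mutation.
import random

rng = random.Random(0)
KEYS = ["lr", "momentum"]

def fitness(cfg):
    # Stand-in score, maximized at lr=0.01, momentum=0.9.
    return -(cfg["lr"] - 0.01) ** 2 - (cfg["momentum"] - 0.9) ** 2

def random_config():
    return {"lr": rng.uniform(0.0, 0.1), "momentum": rng.uniform(0.5, 1.0)}

def crossover(a, b):
    # Mix the values of two parent configurations.
    return {k: rng.choice([a[k], b[k]]) for k in KEYS}

def mutate(cfg):
    # Randomly perturb one hyperparameter value.
    child = dict(cfg)
    key = rng.choice(KEYS)
    child[key] += rng.gauss(0.0, 0.01)
    return child

population = [random_config() for _ in range(20)]    # step 1: initial population
for _ in range(30):                                  # step 7: fixed-budget termination
    population.sort(key=fitness, reverse=True)       # steps 2-3: evaluate and rank
    parents = population[:5]                         # step 4: select parents
    children = [                                     # step 5: crossover and mutation
        mutate(crossover(rng.choice(parents), rng.choice(parents)))
        for _ in range(15)
    ]
    population = parents + children                  # step 6: children enter next round

best = max(population, key=fitness)
print(best)
```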
This strategy is more efficient than grid search or random search, but
requires a substantial number of iterations to reach good solutions
and can thus be too expensive for hyperparameter optimization
\cite{bischl2023}. We use an evolution strategy based on a genetic
algorithm in this work to optimize the hyperparameters of our object
detection model.

\section{Related Work}
\label{sec:related-work}