\section{Hyperparameter Optimization}
\label{sec:background-hypopt}

While a network is learning, the parameters of its layers are
updated. These parameters are \emph{learnable} in the sense that
changing them should bring the model closer to solving a
problem. Updating these parameters happens during the
learning/training phase. Hyperparameters, on the other hand, are not
included in the learning process because they are fixed before the
model starts to train. They are fixed because hyperparameters concern
the structure, architecture and learning parameters of the model, and
without having those in place, a model cannot start training.

Model designers have to carefully define values for a wide range of
hyperparameters. Which hyperparameters have to be set is determined by
the type of model being used. A \gls{svm}, for example, has a penalty
parameter $C$ which indicates to the model how lenient it should be
when misclassifying training examples. The type of kernel to use is
also a hyperparameter for any \gls{svm} and can only be chosen by
looking at the distribution of the underlying data. In neural networks
the range of hyperparameters is even greater because every part of the
network architecture, such as how many layers to stack, which layers
to stack, which kernel sizes to use in each \gls{cnn} layer and which
activation function(s) to use between the layers, is a parameter which
can be altered. Finding the best combination of some or all of the
available hyperparameters is called \emph{hyperparameter tuning}.
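
As an illustration, the following Python snippet, assuming
scikit-learn is available, shows that the penalty parameter $C$ and
the kernel of a \gls{svm} are fixed when the model is constructed,
before any training takes place; the data set here is synthetic.

\begin{verbatim}
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hyperparameters such as C and the kernel are fixed *before*
# training, unlike the learnable parameters updated by fit().
X, y = make_classification(n_samples=100, random_state=0)
clf = SVC(C=1.0, kernel="rbf")  # penalty parameter C, kernel choice
clf.fit(X, y)
\end{verbatim}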

Hyperparameter tuning can be, and often is, done manually by
researchers who select values which \emph{have been known to work
well}. This approach, while it works to some extent, is not optimal
because adhering to \emph{best practice} precludes parameter
configurations which would be closer to optimal for a given data
set. Furthermore, manual tuning requires a deep understanding of the
model itself and of how each parameter influences it. Biases present
in a researcher's understanding are detrimental to finding optimal
hyperparameters, and the number of possible combinations can quickly
become intractable. Automated methods that search the hyperparameter
space instead offer an unbiased and more efficient approach to
hyperparameter tuning. This type of algorithmic search is called
\emph{hyperparameter optimization}.

\subsection{Grid Search}
\label{ssec:grid-search}

There are multiple possible strategies to opt for when optimizing
hyperparameters. The straightforward approach is grid search. In grid
search, all hyperparameters are discretized and all possible
combinations are mapped to a search space. The search space is then
sampled for configurations at evenly spaced points and the resulting
vectors of hyperparameter values are evaluated. For example, if a
model has seven hyperparameters and three of those can take on a
continuous value, these three variables have to be discretized. In
practical terms this means that the model engineer chooses suitable
discrete values for said hyperparameters. Once all hyperparameters are
discrete, all possible combinations of the hyperparameters are
evaluated. If each of the seven hyperparameters has three discrete
values, the number of possible combinations is
\begin{equation}
  \label{eq:hypopt-nums}
  3\cdot3\cdot3\cdot3\cdot3\cdot3\cdot3 = 3^{7} = 2187.
\end{equation}

For this example, evaluating $2187$ possible combinations can already
be intractable depending on the time required for each run. Further,
grid search requires that the resolution of the grid be determined
beforehand. If the points on the grid (combinations) are spaced too
far apart, the chance of finding a global optimum is lower than if the
grid is dense. However, a dense grid results in a higher number of
possible combinations, and thus more time is required for an
exhaustive search. Additionally, grid search suffers from the
\emph{curse of dimensionality} because the number of evaluations
scales exponentially with the number of hyperparameters.
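
To make the procedure concrete, the following Python sketch
exhaustively enumerates a small, hypothetical search space; the
hyperparameter names and the \texttt{evaluate} function are
illustrative stand-ins for an actual training and validation run.

\begin{verbatim}
from itertools import product

# Hypothetical discretized search space (names are illustrative).
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [16, 32, 64],
    "num_layers": [2, 3, 4],
}

def evaluate(config):
    # Stand-in for training the model with `config` and
    # returning its validation score.
    return -abs(config["learning_rate"] - 0.01) \
           - abs(config["num_layers"] - 3)

best_score, best_config = float("-inf"), None
keys = list(grid)
for values in product(*grid.values()):  # 3*3*3 = 27 combinations
    config = dict(zip(keys, values))
    score = evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
\end{verbatim}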

\subsection{Random Search}
\label{ssec:hypopt-random-search}

Random search \cite{pinto2009} is an alternative to grid search which
often finds configurations that are similar to or better than those
obtained with grid search in the same amount of time
\cite{bergstra2012}. Random search performs especially well in
high-dimensional settings because the hyperparameter response surface
is often of \emph{low effective dimensionality}
\cite{bergstra2012}. That is, a small number of hyperparameters
disproportionately affects the performance of the resulting model
while the rest have a negligible effect. We use random search in this
work to improve the hyperparameters of our classification model.
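
A minimal sketch of random search over the same hypothetical search
space as above: continuous hyperparameters are sampled directly, here
log-uniformly for the learning rate, so no discretization is required.

\begin{verbatim}
import random

random.seed(0)

def evaluate(config):
    # Stand-in for an actual training and validation run.
    return -abs(config["learning_rate"] - 0.01) \
           - abs(config["num_layers"] - 3)

def sample_config():
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),  # log-uniform
        "batch_size": random.choice([16, 32, 64]),
        "num_layers": random.randint(2, 4),
    }

best_score, best_config = float("-inf"), None
for _ in range(30):  # fixed evaluation budget
    config = sample_config()
    score = evaluate(config)
    if score > best_score:
        best_score, best_config = score, config
\end{verbatim}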

\subsection{Evolution Strategies}
\label{ssec:hypopt-evo}

Evolution strategies follow a population-based model where the search
strategy starts from initial random configurations and evolves the
hyperparameters through \emph{mutation} and \emph{crossover}. Mutation
randomly changes the value of a hyperparameter and crossover creates a
new configuration by mixing the values of two
configurations. Hyperparameter optimization with evolutionary
strategies roughly goes through the following stages \cite{bischl2023}.

\begin{enumerate}
\item Set the hyperparameters to random initial values and create a
  starting population of configurations.
\item Evaluate each configuration.
\item Rank all configurations according to a fitness function.
\item Select the best-performing configurations as \emph{parents}.
\item Create child configurations from the parent configurations by
  mutation and crossover.
\item Evaluate the child configurations.
\item Go to step three and repeat the process until a termination
  condition is reached.
\end{enumerate}

This strategy is more efficient than grid search or random search, but
it requires a substantial number of iterations to reach good solutions
and can thus be too expensive for hyperparameter optimization
\cite{bischl2023}. We use an evolution strategy based on a genetic
algorithm in this work to optimize the hyperparameters of our object
detection model. A minimal sketch of the evolutionary loop is given
below.
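
The following Python sketch implements the stages above as a simple
genetic algorithm; the search space, the fitness function and all
numeric settings (population size, parent count, generations) are
hypothetical and not the configuration used in this work.

\begin{verbatim}
import random

random.seed(0)
SPACE = {"log_learning_rate": (-4.0, -1.0),
         "num_layers": (2.0, 6.0)}

def random_config():  # step 1: random initial values
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def fitness(config):
    # Stand-in for training the model and measuring validation accuracy.
    return -abs(config["log_learning_rate"] + 2.0) \
           - abs(config["num_layers"] - 4.0)

def mutate(config):
    # Randomly perturb one hyperparameter, clipped to its range.
    key = random.choice(list(SPACE))
    lo, hi = SPACE[key]
    child = dict(config)
    child[key] = min(hi, max(lo, child[key] + random.gauss(0, 0.3)))
    return child

def crossover(a, b):
    # Mix the values of two parent configurations.
    return {k: random.choice((a[k], b[k])) for k in SPACE}

population = [random_config() for _ in range(10)]
for generation in range(20):
    population.sort(key=fitness, reverse=True)  # steps 2-3: rank
    parents = population[:4]                    # step 4: select
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(len(population) - len(parents))]
    population = parents + children             # steps 5-7: iterate
best = max(population, key=fitness)
\end{verbatim}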

\section{Related Work}
\label{sec:related-work}