Add object detection implementation

Tobias Eidelpes 2023-12-08 15:54:58 +01:00
parent 7b0662b728
commit 6267db9485
3 changed files with 375 additions and 332 deletions

File diff suppressed because one or more lines are too long

Binary file not shown.


@@ -1598,7 +1598,7 @@ computational cost of between eight to nine times. MobileNet v2
\emph{squeeze and excitation layers} among other improvements. These
concepts led to better classification accuracy at the same or smaller
model size. The authors evaluate a large and a small variant of
MobileNet v3 on ImageNet on single-core phone processors and achieve a
top-1 accuracy of 75.2\% and 67.4\%, respectively.

\section{Transfer Learning}
@@ -1664,7 +1664,7 @@ which have to be made as a result of using transfer learning can
introduce more complexity than would otherwise be necessary for a
particular problem. It does, however, allow researchers to get started
quickly and to iterate faster because popular network architectures
pretrained on ImageNet are integrated into the major machine learning
frameworks. Transfer learning is used extensively in this work to
train a classifier as well as an object detection model.
@@ -2300,7 +2300,7 @@ the \gls{coco} test data set.
The authors of \gls{yolo}v6 \cite{li2022a} use a new backbone based on
RepVGG \cite{ding2021} which they call EfficientRep. They also use
different losses for classification (varifocal loss \cite{zhang2021})
and bounding box regression (\gls{siou}
\cite{gevorgyan2022}/\gls{giou} \cite{rezatofighi2019}). \gls{yolo}v6
is made available in eight scaled versions, of which the largest
@@ -2310,7 +2310,7 @@ achieves a \gls{map} of 57.2\% on the \gls{coco} test set.
\label{sssec:yolov7}

At the time of implementation of our own plant detector, \gls{yolo}v7
\cite{wang2022} was the newest version within the \gls{yolo}
family. Similarly to \gls{yolo}v4, it introduces more trainable
bag-of-freebies methods which do not impact inference time. The
improvements include the use of \glspl{eelan} (based on \glspl{elan}
\cite{wang2022a}),
@@ -2444,31 +2444,79 @@ random value within a range with a specified probability.
\chapter{Prototype Implementation}
\label{chap:implementation}
In this chapter, we describe the implementation of the prototype: how
the two models were trained and with which data sets, how the models
are deployed to the \gls{sbc}, and how they were optimized.
\section{Object Detection}
\label{sec:development-detection}
As mentioned before, our approach is split into a detection and a
classification stage. The object detector detects all plants in an
image during the first stage and passes the cutouts on to the
classifier. In this section, we describe the data set the object
detector was trained on, the results of the training phase, and how
the model was optimized with respect to its hyperparameters.
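To make the data flow concrete, the following minimal sketch outlines
the two-stage pipeline in Python; the helper names (\texttt{crop},
\texttt{analyze\_image}) and the array-based cropping are illustrative
and not part of the actual implementation.

\begin{verbatim}
# Illustrative sketch of the two-stage pipeline, not the actual code.
def crop(image, box):
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]          # numpy-style cutout

def analyze_image(image, detector, classifier):
    results = []
    for box in detector(image):         # stage 1: detect all plants
        cutout = crop(image, box)
        label = classifier(cutout)      # stage 2: stressed or healthy
        results.append((box, label))
    return results
\end{verbatim}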
\subsection{Data Set}
\label{ssec:obj-train-dataset}

The object detection model has to correctly detect plants in various
locations, under different lighting conditions, and in partially
occluded settings. Fortunately, there are many data sets available
which contain a large number of classes and samples of common everyday
objects. Most of these data sets contain at least one plant-related
class, and multiple related classes such as \emph{houseplant} and
\emph{potted plant} can be merged to form a single \emph{plant} class
which exhibits a great variety of samples. One such data set which
includes the aforementioned classes is the \gls{oid}
\cite{kuznetsova2020,krasin2017}.
The \gls{oid} has been published in multiple versions starting in 2016
with version one. The most recent iteration is version seven, which
was released in October 2022. We use version six of the data set in
our own work which contains \num{9011219} training, \num{41620}
validation, and \num{125436} testing images. The data set provides
image-level labels, bounding boxes, object segmentations, visual
relationships, and localized narratives on those images. For our own
work, we are only interested in the labeled bounding boxes of all
images which belong to the classes \emph{Houseplant} and \emph{Plant}
with their respective class identifiers \texttt{/m/03fp41} and
\texttt{/m/05s2s}. These images have been extracted from the data set
and arranged in the directory structure which \gls{yolo}v7
requires. The bounding boxes themselves are collapsed into one single
label \emph{Plant} and converted to the \gls{yolo}v7 label format. In
total, there are \num{79204} images with \num{284130} bounding boxes
in the training set. \gls{yolo}v7 continuously validates the training
progress after every epoch on a validation set of \num{3091} images
with \num{4092} bounding boxes.
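Concretely, the conversion collapses both class identifiers into a
single class index and rewrites each box as normalized center
coordinates plus width and height. The sketch below assumes
\gls{oid}-style normalized \texttt{XMin}/\texttt{XMax}/\texttt{YMin}/\texttt{YMax}
values; the function is our illustration rather than the actual
conversion script.

\begin{verbatim}
# Sketch: one OID box -> one line in the YOLO label format
# "<class> <x_center> <y_center> <width> <height>" (all normalized).
PLANT_CLASSES = {"/m/03fp41", "/m/05s2s"}   # Houseplant, Plant

def oid_box_to_yolo(label, x_min, x_max, y_min, y_max):
    if label not in PLANT_CLASSES:
        return None                         # not a plant: skip
    x_center = (x_min + x_max) / 2
    y_center = (y_min + y_max) / 2
    width, height = x_max - x_min, y_max - y_min
    # both classes collapse into the single class 'Plant' (index 0)
    return f"0 {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
\end{verbatim}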
\subsection{Training Phase}
\label{ssec:obj-training-phase}
We use the smallest \gls{yolo}v7 model which has \num{36.9e6}
parameters \cite{wang2022} and has been pretrained on the \gls{coco}
data set \cite{lin2015} with an input size of \num{640} by \num{640}
pixels. The object detection model was then fine-tuned for \num{300}
epochs on the training set. The weights from the best-performing epoch
were saved. The model's fitness for each epoch is calculated as the
weighted average of \gls{map}@0.5 and \gls{map}@0.5:0.95:
\begin{equation}
  \label{eq:fitness}
  f_{epoch} = 0.1 \cdot \mathrm{\gls{map}}@0.5 + 0.9 \cdot \mathrm{\gls{map}}@0.5\mathrm{:}0.95
\end{equation}
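As an illustration, selecting the best epoch by this criterion can be
sketched as follows; the metric values are placeholders for the actual
per-epoch validation results.

\begin{verbatim}
# Fitness as the weighted average from equation (eq:fitness).
def fitness(map_50, map_50_95):
    return 0.1 * map_50 + 0.9 * map_50_95

# Placeholder per-epoch (mAP@0.5, mAP@0.5:0.95) values.
results = [(0.78, 0.52), (0.80, 0.55), (0.79, 0.54)]
best_epoch = max(range(len(results)), key=lambda e: fitness(*results[e]))
\end{verbatim}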
Figure~\ref{fig:fitness} shows the model's fitness over the training
period of \num{300} epochs. The gray vertical line indicates the
maximum fitness of \num{0.61} at epoch \num{133}. The weights of that
epoch were frozen to be the final model parameters. Since the fitness
metric assigns the overwhelming weight to the \gls{map} at the higher
\gls{iou} range, the \gls{map}@0.5 starts to decrease after epoch
\num{30}, but the \gls{map}@0.5:0.95 picks up the slack until the
maximum fitness at epoch \num{133}. This indicates that the model
achieves good performance early on and continues to gain higher
confidence values until performance deteriorates due to overfitting.
@@ -2477,8 +2525,8 @@ until performance deteriorates due to overfitting.
  \includegraphics{graphics/model_fitness.pdf}
  \caption[Object detection fitness per epoch.]{Object detection model
    fitness for each epoch calculated as in
    equation~\ref{eq:fitness}. The vertical gray line at \num{133}
    marks the epoch with the highest fitness.}
  \label{fig:fitness}
\end{figure}
@@ -2489,11 +2537,11 @@ starts to decrease from the beginning, while recall experiences a
barely noticeable increase. Taken together with the box and object
loss from figure~\ref{fig:box-obj-loss}, we speculate that the
pre-trained model already generalizes well to plant detection because
one of the categories in the \gls{coco} \cite{lin2015} dataset is
\emph{potted plant}. Any further training solely impacts the
confidence of detection, but does not lead to higher detection
rates. This conclusion is supported by the increasing
\gls{map}@0.5:0.95 until epoch \num{133}.
\begin{figure}
  \centering
@@ -2524,226 +2572,67 @@ the bounding boxes become tighter around objects of interest. With
increasing training time, however, the object loss increases,
indicating that fewer and fewer plants are present in the predicted
bounding boxes. It is likely that overfitting is a cause for the
increasing object loss from epoch \num{40} onward. Since the best
weights as measured by fitness are found at epoch \num{133} and the
object loss accelerates from that point, epoch \num{133} is arguably
the correct cutoff before overfitting occurs.
\begin{figure}
  \centering
  \includegraphics{graphics/val_box_obj_loss.pdf}
  \caption[Object detection box and object loss.]{Box and object loss
    measured against the validation set of \num{3091} images and
    \num{4092} ground truth labels. The class loss is omitted because
    there is only one class in the dataset and the loss is therefore
    always zero.}
  \label{fig:box-obj-loss}
\end{figure}
\subsection{Hyperparameter Optimization}
\label{ssec:obj-hypopt}
To further improve the object detection performance, we perform
hyperparameter optimization using a genetic algorithm. Evolution of
the hyperparameters starts from the initial \num{30} default values
provided by the authors of \gls{yolo}. Of those \num{30} values,
\num{26} are allowed to mutate. During each generation, there is an
80\% chance that a mutation occurs with a variance of \num{0.04}. To
determine which generation should be the parent of the new mutation,
all previous generations are ordered by decreasing fitness. At most
the top five generations are selected and one of them is chosen at
random. Better generations have a higher chance of being selected as
the selection is weighted by fitness. The parameters of that chosen
generation are then mutated with the aforementioned probability and
variance. Each generation is trained for three epochs and the fitness
of the best epoch is recorded.
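The selection and mutation step can be sketched as follows. This is
our own illustration of the procedure described above rather than the
actual \gls{yolo} implementation; in particular, the multiplicative
per-parameter mutation is an assumption, and the variance of
\num{0.04} corresponds to a standard deviation of \num{0.2}.

\begin{verbatim}
import random

# Illustrative sketch of one evolution step. `history` holds pairs of
# (fitness, hyperparameter dict) for all previous generations.
def evolve(history, mutable_keys, p_mutate=0.8, sigma=0.2):
    top = sorted(history, key=lambda g: g[0], reverse=True)[:5]
    # fitness-weighted choice among the (at most) five best generations
    parent = random.choices(top, weights=[f for f, _ in top], k=1)[0][1]
    child = dict(parent)
    for key in mutable_keys:
        if random.random() < p_mutate:
            # multiplicative mutation; variance 0.04 -> std dev 0.2
            child[key] *= 1.0 + random.gauss(0.0, sigma)
    return child
\end{verbatim}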
In total, we ran \num{87} iterations of which the
\num{34}\textsuperscript{th} generation provides the best fitness of
\num{0.6076}. Due to time constraints, it was not possible to train
each generation for more epochs or to run more iterations in total. We
assume that the performance of the first few epochs is a reasonable
proxy for model performance overall. The optimized version of the
object detection model is then trained for \num{70} epochs using the
parameters of the \num{34}\textsuperscript{th} generation.
\begin{figure}
  \centering
  \includegraphics{graphics/model_fitness_final.pdf}
  \caption[Optimized object detection fitness per epoch.]{Object
    detection model fitness for each epoch calculated as in
    equation~\ref{eq:fitness}. The vertical gray line at \num{27}
    marks the epoch with the highest fitness of \num{0.6172}.}
  \label{fig:hyp-opt-fitness}
\end{figure}
Figure~\ref{fig:hyp-opt-fitness} shows the model's fitness during
training for each epoch. After the highest fitness of \num{0.6172} at
epoch \num{27}, the performance quickly declines and shows that
further training would likely not yield improved results. The model
converges to its highest fitness much earlier than the non-optimized
version, which indicates that the adjusted parameters provide a better
starting point in general. Furthermore, the maximum fitness is 0.74
percentage points higher than in the non-optimized version.
\begin{figure}
  \centering
@@ -2751,7 +2640,7 @@ the non-optimized version.
  \caption[Hyper-parameter optimized object detection precision and
    recall during training.]{Overall precision and recall during
    training for each epoch of the optimized model. The vertical gray
    line at \num{27} marks the epoch with the highest fitness.}
  \label{fig:hyp-opt-prec-rec}
\end{figure}
@@ -2766,9 +2655,9 @@ non-optimized version and recall hovers at the same levels.
  \includegraphics{graphics/val_box_obj_loss_final.pdf}
  \caption[Hyper-parameter optimized object detection box and object
    loss.]{Box and object loss measured against the validation set of
    \num{3091} images and \num{4092} ground truth labels. The class
    loss is omitted because there is only one class in the dataset and
    the loss is therefore always zero.}
  \label{fig:hyp-opt-box-obj-loss}
\end{figure}
@@ -2777,96 +2666,84 @@ figure~\ref{fig:hyp-opt-box-obj-loss}. Both losses start from a lower
level which suggests that the initial optimized parameters allow the
model to converge quicker. The object loss exhibits a similar slope to
the non-optimized model in figure~\ref{fig:box-obj-loss}. The vertical
gray line again marks epoch \num{27} with the highest fitness. The box
loss reaches its lower limit at that point and the object loss starts
to increase again after epoch \num{27}.
\section{Classification}
\label{sec:development-classification}

The second stage of our approach consists of the classification model
which determines whether the plant in question is water-stressed or
not. The classifier receives the cutouts for each plant from stage one
(object detection). We chose a \gls{resnet}-50 model (see
section~\ref{sec:methods-classification}) which has been pretrained on
ImageNet. The \gls{resnet} architecture was chosen due to its
popularity and ease of implementation as well as its consistently high
performance on various classification tasks. While its classification
speed in comparison with networks optimized for mobile and edge
devices (e.g.\ MobileNet) is significantly lower, the deeper structure
and the additional parameters are necessary for the fairly complex
task at hand. Furthermore, the generous time budget for object
detection \emph{and} classification allows for more accurate results
at the expense of speed. The \num{50}-layer architecture
(\gls{resnet}-50) is adequate for our use case.
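As a sketch, instantiating such a model with \texttt{torchvision} and
adapting it to our two classes amounts to replacing the final fully
connected layer; this illustrates the setup and is not necessarily
identical to our training code.

\begin{verbatim}
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-50 with the final fully connected layer
# replaced for the two-class (healthy/stressed) task.
model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
\end{verbatim}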
In the following sections, we describe the data set the classifier
was trained on, the metrics of the training phase, and how the
performance of the model was further improved with hyperparameter
optimization.
\subsection{Data Set}
\label{ssec:class-train-dataset}

The data set we used for training the classifier consists of \num{452}
images of healthy and \num{452} stressed plants.

%% TODO: write about data set
The dataset was split 85/15 into training and validation sets. The
images in the training set were augmented with a random crop to arrive
at the expected image dimensions of \num{224} by \num{224}
pixels. Additionally, the training images were modified with a random
horizontal flip to increase the variation in the set and to train a
classifier that is invariant to reflections. All images, regardless of
their membership in the training or validation set, were normalized
with the mean and standard deviation of the ImageNet \cite{deng2009}
dataset, which the original \gls{resnet}-50 model was pretrained
with. Training was done for \num{50} epochs and the best-performing
model as measured by validation accuracy was selected as the final
version.
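A sketch of this preprocessing with standard \texttt{torchvision}
transforms is shown below. The ImageNet channel statistics are the
commonly published values; the exact crop variant and the
deterministic validation resize are assumptions on our part.

\begin{verbatim}
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]   # ImageNet channel means
IMAGENET_STD = [0.229, 0.224, 0.225]    # ImageNet channel std devs

# Training: random crop to 224x224 plus random horizontal flip.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation: deterministic resize and center crop (assumed).
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
\end{verbatim}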
Figure~\ref{fig:classifier-training-metrics} shows accuracy and loss
on the training and validation sets. There is a clear improvement
until epoch \num{20}, when validation accuracy and loss stabilize at
around \num{0.84} and \num{0.3}, respectively. The quick convergence
and resistance to overfitting can be attributed to the model already
having robust feature extraction capabilities.
\begin{figure}
  \centering
  \includegraphics{graphics/classifier-metrics.pdf}
  \caption[Classifier accuracy and loss during training.]{Accuracy and
    loss during training of the classifier. The model converges
    quickly, but additional epochs do not cause validation loss to
    increase, which would indicate overfitting. The maximum validation
    accuracy of \num{0.9118} is achieved at epoch \num{27}.}
  \label{fig:classifier-training-metrics}
\end{figure}
\subsection{Hyperparameter Optimization}
\label{ssec:class-hypopt}
In order to improve the aforementioned accuracy values, we perform
hyperparameter optimization across a wide range of
parameters. Table~\ref{tab:classifier-hyps} lists the hyperparameters
and their possible values. Since the number of all combinations of
values is \num{11520} and each combination is trained for \num{10}
epochs with a training time of approximately six minutes per
combination, exhausting the search space would take \num{48} days. Due
to time limitations, we have chosen not to search exhaustively but to
pick random combinations instead. Random search works surprisingly
well---especially compared to grid search---in a number of domains, one of
which is hyperparameter optimization \cite{bergstra2012}.
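A minimal sketch of this random search is given below; the parameter
names and value lists are placeholders standing in for the actual
search space of table~\ref{tab:classifier-hyps}.

\begin{verbatim}
import random

# Placeholder search space; the real one is listed in the table.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "weight_decay": [0.0, 1e-4, 1e-3],
}

def sample_configuration():
    # Draw one random combination instead of enumerating the full grid.
    return {name: random.choice(values)
            for name, values in SEARCH_SPACE.items()}
\end{verbatim}

Each sampled configuration is then trained for \num{10} epochs and its
best validation accuracy recorded.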
\begin{table}[h]
  \centering
@@ -3010,6 +2887,186 @@ $\mathrm{F}_1$-score of 1 on the training set.
  \label{fig:classifier-hyp-folds}
\end{figure}
\section{Deployment}
Describe the Jetson Nano, how the model is deployed to the device and
how it reports its results (REST API).
Estimated 2 pages for this section.
\chapter{Evaluation}
\label{chap:evaluation}
The following sections contain a detailed evaluation of the model in
various scenarios. First, we present metrics from the training phases
of the constituent models. Second, we employ methods from the field of
\gls{xai} such as \gls{grad-cam} to get a better understanding of the
models' abstractions. Finally, we turn to the models' aggregate
performance on the test set.
\section{Methodology}
\label{sec:methodology}
Go over the evaluation methodology by explaining the test datasets,
where they come from, and how they're structured. Explain how the
testing phase was done and which metrics are employed to compare the
models to the SOTA.
Estimated 2 pages for this section.
\section{Results}
\label{sec:results}
Systematically go over the results from the testing phase(s), show the
plots and metrics, and explain what they contain.
Estimated 4 pages for this section.
\subsection{Object Detection}
\label{ssec:yolo-eval}
The following paragraph should probably go into
section~\ref{sec:development-detection}.
The object detection model was pre-trained on the COCO~\cite{lin2015}
dataset and fine-tuned with data from the \gls{oid}
\cite{kuznetsova2020} in its sixth version. Since the full \gls{oid}
dataset contains considerably more classes and samples than would be
feasibly trainable on a small cluster of \glspl{gpu}, only images from
the two classes \emph{Plant} and \emph{Houseplant} have been
downloaded. The samples from the Houseplant class are merged into the
Plant class because the distinction between the two is not necessary
for our model. Furthermore, the \gls{oid} contains not only bounding
box annotations for object detection tasks, but also instance
segmentations, classification labels and more. These are not needed
for our purposes and are omitted as well. In total, the dataset
consists of 91479 images with a roughly 85/5/10 split for training,
validation and testing, respectively.
\subsubsection{Test Phase}
\label{sssec:yolo-test}
Of the 91479 images, around 10\% were used for the test phase. These
images contain a total of 12238 ground truth
labels. Table~\ref{tab:yolo-metrics} shows precision, recall and the
harmonic mean of both ($\mathrm{F}_1$-score). The results indicate
that the model errs on the side of sensitivity because recall is
higher than precision. Although some detections are not labeled as
plants in the dataset, if there is a labeled plant in the ground truth
data, the chance is high that it will be detected. This behavior is in
line with how the model's detections are handled in practice. The
detections are drawn on the original image and the user is able to
check the bounding boxes visually. If there are wrong detections, the
user can ignore them and focus on the relevant ones instead. A higher
recall will thus serve the user's needs better than a high precision.
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & 0.547571 & 0.737866 & 0.628633 & 12238 \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the object
detection model.}
\label{tab:yolo-metrics}
\end{table}
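As a quick check, the $\mathrm{F}_1$-score in
table~\ref{tab:yolo-metrics} is simply the harmonic mean of the
reported precision and recall:

\begin{verbatim}
# F1 as the harmonic mean of precision and recall; this reproduces
# the value in the table above.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.547571, 0.737866))     # ~0.628633
\end{verbatim}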
Figure~\ref{fig:yolo-ap} shows the \gls{ap} for the \gls{iou}
thresholds of 0.5 and 0.95. Predicted bounding boxes with an \gls{iou}
of less than 0.5 are not taken into account for the precision and
recall values of table~\ref{tab:yolo-metrics}. The lower the detection
threshold, the more plants are detected. Conversely, a higher
detection threshold leaves potential plants undetected. The
precision-recall curves confirm this behavior because the area under
the curve for the threshold of 0.5 is higher than for the threshold of
0.95 ($0.66$ versus $0.41$). These values are combined in COCO's
\cite{lin2015} main evaluation metric which is the \gls{ap} averaged
across the \gls{iou} thresholds from 0.5 to 0.95 in 0.05 steps. This
value is then averaged across all classes and called \gls{map}. The
object detection model achieves a state-of-the-art \gls{map} of 0.5727
for the \emph{Plant} class.
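The averaging itself is straightforward, as the following sketch
shows. Apart from \gls{ap}@0.5 (\num{0.66}) and \gls{ap}@0.95
(\num{0.41}), the intermediate \gls{ap} values are placeholders; the
actual result reported above is \num{0.5727}.

\begin{verbatim}
# Sketch: mAP@0.5:0.95 averages AP over the ten IoU thresholds
# 0.50, 0.55, ..., 0.95. Only the first and last values below come
# from the curves; the rest are placeholders.
ap_values = [0.66, 0.65, 0.63, 0.61, 0.59, 0.56, 0.53, 0.49, 0.45, 0.41]
map_50_95 = sum(ap_values) / len(ap_values)
# With a single class, mAP equals this mean AP.
\end{verbatim}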
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95.pdf}
\caption[Object detection AP@0.5 and AP@0.95.]{Precision-recall
curves for \gls{iou} thresholds of 0.5 and 0.95. The \gls{ap} of a
specific threshold is defined as the area under the
precision-recall curve of that threshold. The \gls{map} across
\gls{iou} thresholds from 0.5 to 0.95 in 0.05 steps
\textsf{mAP}@0.5:0.95 is 0.5727.}
\label{fig:yolo-ap}
\end{figure}
\subsubsection{Hyperparameter Optimization}
\label{sssec:yolo-hyp-opt}
This section should be moved to the hyperparameter optimization
section in the development chapter
(section~\ref{sec:development-detection}).
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & 0.633358 & 0.702811 & 0.666279 & 12238 \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
optimized object detection model.}
\label{tab:yolo-metrics-hyp}
\end{table}
Turning to the evaluation of the optimized model on the test dataset,
table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
with the non-optimized version from table~\ref{tab:yolo-metrics},
precision is higher by more than 8.5 percentage points. Recall,
however, is 3.5 percentage points lower. The $\mathrm{F}_1$-score is
higher by more than 3.7 percentage points, which indicates that the
optimized model is better overall despite the lower recall. We feel
that the lower recall value is a suitable trade-off for the
substantially higher precision considering that the non-optimized
model's precision is quite low at 0.55.
The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
optimized model show that the model draws looser bounding boxes than
the non-optimized model. The \gls{ap} for both \gls{iou} thresholds of
0.5 and 0.95 is lower, indicating worse performance. It is likely that
more iterations during evolution would help increase the \gls{ap}
values as well. Even though the precision and recall values from
table~\ref{tab:yolo-metrics-hyp} are better, the \textsf{mAP}@0.5:0.95
is lower by 1.8 percentage points.
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95-final.pdf}
\caption[Hyper-parameter optimized object detection AP@0.5 and
AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of 0.5
and 0.95. The \gls{ap} of a specific threshold is defined as the
area under the precision-recall curve of that threshold. The
\gls{map} across \gls{iou} thresholds from 0.5 to 0.95 in 0.05
steps \textsf{mAP}@0.5:0.95 is 0.5546.}
\label{fig:yolo-ap-hyp}
\end{figure}
\subsection{Classification}
\label{ssec:classifier-eval}
\subsubsection{Hyperparameter Optimization}
\label{sssec:classifier-hyp-opt}
This section should be moved to the hyperparameter optimization
section in the development chapter
(section~\ref{sec:development-classification}).
\subsubsection{Class Activation Maps}
\label{sssec:classifier-cam}