Add result section

This commit is contained in:
Tobias Eidelpes 2023-12-20 10:39:19 +01:00
parent 326562ca85
commit dced4c6902
2 changed files with 348 additions and 311 deletions

@ -1995,7 +1995,7 @@ resource. However, because the results are published via a \gls{rest}
service, internet access is necessary to be able to retrieve the
predictions.
Other technical requirements are that the inference on the device for
both models does not take too long (i.e. not longer than a few seconds
per image). Even though plants are not known to grow extremely rapidly
from one minute to the next, keeping the inference time low results in
@ -2003,12 +2003,42 @@ a more resource efficient prototype. As such, it is possible to run
the device off a battery, which completes the self-contained nature
of the prototype.
From an evaluation perspective, the models should have high
sensitivity and specificity. In order to be useful for plant
water-stress detection, it is necessary to identify as many
water-stressed plants as possible (sensitivity) while keeping the
number of false positives as low as possible (specificity). If the
number of water-stressed plants is severely overestimated, downstream
watering systems could damage the plants by overwatering. Conversely,
if the number of water-stressed plants is underestimated, some plants
are likely to die because their water stress goes undetected.
Furthermore, the models are required to attain a reasonable level of
precision as well as good localization of plants. It is difficult to
determine said levels beforehand, but considering the task at hand as
well as general object detection and classification benchmarks such as
\gls{coco} \cite{lin2015}, we expect a \gls{map} of around 40\% and
precision and recall values of 70\%.
Other basic model requirements are robust object detection and
classification as well as good generalizability. The prototype should
be able to function in different environments, where varying lighting
conditions, backgrounds, and camera angles do not have an impact on
model performance. Where feasible, models should be evaluated with
cross validation to ensure that the performance of the model on the
test set is a good indicator of its generalizability. In the same
vein, models should neither overfit nor underfit the training data,
as both also result in poor generalizability.
During the iterative process of training the models as well as for
evaluation purposes, the models should be interpretable. Especially
when there is comparatively little training data available, verifying
whether the model is focusing on the \emph{right} parts of an image
gives insight into its robustness and generalizability, which can increase
trust. Furthermore, if a model is clearly not focusing on the right
parts of an image, interpretability can help debug where the problem
lies. Interpretability is thus an important property of any model so
that the model engineer is able to steer the training and inference
process in the right direction.
\section{Design}
\label{sec:design}
@ -2741,11 +2771,11 @@ In order to improve the aforementioned accuracy values, we perform
hyperparameter optimization across a wide range of
parameters. Table~\ref{tab:classifier-hyps} lists the hyperparameters
and their possible values. Since the number of all combinations of
values is \num{11520} and each combination is trained for ten epochs
with a training time of approximately six minutes per combination,
exhausting the search space would take \num{48} days. Due to time
limitations, we have chosen not to search exhaustively but to pick
random combinations instead. Random search works surprisingly
well---especially compared to grid search---in a number of domains, one of
which is hyperparameter optimization \cite{bergstra2012}.
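The following sketch illustrates this random-search procedure. It is a
simplified illustration rather than the actual training code: the
parameter grid shown is only an example, and \verb|train_for_epochs| is
a stub standing in for a real training run that returns the validation
accuracy of a configuration.

\begin{verbatim}
import random

# Illustrative search space only; the real grid from the hyperparameter
# table contains 11520 combinations in total.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "weight_decay": [0.0, 1e-5, 1e-4],
    "optimizer": ["sgd", "adam"],
}

def train_for_epochs(config, epochs=10):
    """Stub for the actual training run; returns a validation accuracy."""
    return random.random()

def random_search(num_trials):
    best_config, best_accuracy = None, 0.0
    for _ in range(num_trials):
        # Draw one random combination instead of enumerating the full grid.
        config = {name: random.choice(values)
                  for name, values in SEARCH_SPACE.items()}
        accuracy = train_for_epochs(config, epochs=10)
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy
\end{verbatim}

With, for instance, \num{138} random trials, the chance of sampling at
least one configuration from the best 1\% of the grid is already around
75\% (cf.\ equation~\ref{eq:opt-prob}).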
@ -2782,7 +2812,9 @@ the observation that almost all configurations converge well before
reaching the tenth epoch. The assumption that a training run with ten
epochs provides a good proxy for final performance is supported by the
quick convergence of validation accuracy and loss in
figure~\ref{fig:classifier-training-metrics}. Table~\ref{tab:classifier-final-hyps}
lists the final hyperparameters which were chosen to train the
improved model.
\begin{equation}\label{eq:opt-prob}
1 - (1 - 0.01)^{138} \approx 0.75
@ -2808,23 +2840,6 @@ figure~\ref{fig:classifier-training-metrics}.
\label{fig:classifier-hyp-results}
\end{figure}
\begin{table}
\centering
\begin{tabular}{cccc}
@ -2842,6 +2857,232 @@ against variations in the training set.
\label{tab:classifier-final-hyps}
\end{table}
\section{Deployment}
After training the two models (object detector and classifier), we
export them to the \gls{onnx}\footnote{\url{https://github.com/onnx}}
format and move the model files to the Nvidia Jetson Nano. On the
device, a Flask application (\emph{server}) provides a \gls{rest}
endpoint from which the results of the most recent prediction can be
queried. The server periodically performs the following steps:
\begin{enumerate}
\item Call a binary which takes an image and writes it to a file.
\item Take the image and detect all plants as well as their status
using the two models.
\item Draw the returned bounding boxes onto the original image.
\item Number each detection from left to right.
\item Coerce the prediction for each bounding box into a tuple
$\langle I, S, T,\Delta T \rangle$.
\item Store the image with the bounding boxes and an array of all
tuples (predictions) in a dictionary.
\item Wait two minutes.
\item Go to step one.
\end{enumerate}
The binary uses the accelerated GStreamer implementation by Nvidia to
take an image. The tuple $\langle I, S, T,\Delta T \rangle$ consists of the following
items: $I$ is the number of the bounding box in the image, $S$ the
current state from one to ten, $T$ the timestamp of the prediction,
and $\Delta T$ the time since the state $S$ last fell below three. The
server performs these tasks asynchronously in the background and is
always ready to respond to requests with the most recent prediction.
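A condensed sketch of this server structure is shown below. It is an
illustration of the described loop rather than the actual
implementation; the endpoint path and the helpers \verb|capture_image|,
\verb|detect_and_classify|, and \verb|draw_boxes| are assumed
placeholders for the camera binary and the two \gls{onnx} models.

\begin{verbatim}
import threading
import time
from datetime import datetime

from flask import Flask, jsonify

# The helpers below stand in for project-specific code and are not shown here:
#   capture_image()            -> path of a freshly taken image (camera binary)
#   detect_and_classify(path)  -> list of boxes with state and timing info
#   draw_boxes(path, boxes)    -> image with the bounding boxes drawn onto it

app = Flask(__name__)
latest = {"image": None, "predictions": []}   # most recent result

def prediction_loop():
    while True:
        path = capture_image()
        boxes = detect_and_classify(path)
        predictions = []
        # Number the detections from left to right.
        for i, box in enumerate(sorted(boxes, key=lambda b: b["x"]), start=1):
            predictions.append({
                "I": i,                            # index of the bounding box
                "S": box["state"],                 # state from one to ten
                "T": datetime.now().isoformat(),   # timestamp of the prediction
                "dT": box["since_below_three"],    # time since S last fell below three
            })
        latest["image"] = draw_boxes(path, boxes)
        latest["predictions"] = predictions
        time.sleep(120)                            # wait two minutes

@app.route("/prediction")
def prediction():
    # Always responds with the most recent prediction.
    return jsonify(latest["predictions"])

if __name__ == "__main__":
    threading.Thread(target=prediction_loop, daemon=True).start()
    app.run(host="0.0.0.0")
\end{verbatim}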
This chapter detailed the training and deployment of the two models
used for the plant water-stress detection system---the object detector
and the classifier. Furthermore, we have specified the \gls{api} which
publishes the results continuously. We will now turn towards the
evaluation of the two separate models as well as the aggregate model.
\chapter{Evaluation}
\label{chap:evaluation}
The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results with respect to
the research questions defined in section~\ref{sec:motivation}.
\section{Methodology}
\label{sec:methodology}
In order to evaluate the object detection model and the classification
model, we analyze their predictions on test datasets. For the object
detection model, the test dataset is a 10\% split of the original
dataset which we describe in section~\ref{ssec:obj-train-dataset}. The
classifier is evaluated with a \num{10}-fold cross validation from the
original dataset (see section~\ref{ssec:class-train-dataset}). After
the evaluation of both models individually, we evaluate the model in
aggregate on a new dataset. This is necessary because the prototype
uses the two models as if they were one. The aggregate performance is
ultimately the most important measure to decide if the prototype is
able to meet the requirements.
The test set for the aggregate model contains \num{640} images which
were obtained from a Google search using the terms \emph{thirsty
plant}, \emph{wilted plant} and \emph{stressed plant}. Images which
clearly show one or multiple plants with some amount of visible stress
were added to the dataset. Care was taken to include plants with
various degrees of stress and in various locations and lighting
conditions. The search not only provided images of stressed plants,
but also of healthy plants. The dataset is biased towards potted
plants, which are commonly put on display in Western
households. Furthermore, many plants, such as succulents, are sought
after for home environments because of their ease of maintenance. Due
to the inclusion of such plants in the dataset and the distinct way in
which they exhibit water stress, the test set covers a wide variety of
scenarios.
After collecting the images, the aggregate model was run on them to
obtain initial bounding boxes and classifications for ground truth
labeling. Letting the model do the work beforehand and then correcting
the labels made it possible to include more images in the test set
because they could be labeled more quickly. Additionally, going over
the detections and classifications provided a comprehensive view of
the models' strengths and weaknesses. After correcting the labels, the
ground truth of the test set contains \num{766} bounding boxes of
healthy plants and \num{494} of stressed plants.
\section{Results}
\label{sec:results}
This section presents the results of the evaluation of the constituent
models as well as the aggregate model. First, we evaluate the object
detection model before and after hyperparameter optimization. Second,
we evaluate the performance of the classifier after hyperparameter
optimization and present the results of \gls{grad-cam}. Finally, we
evaluate the aggregate model before and after hyperparameter
optimization.
\subsection{Object Detection}
\label{ssec:yolo-eval}
Of the \num{91479} images, around 10\% were used for the test
phase. These images contain a total of \num{12238} ground truth
labels. Table~\ref{tab:yolo-metrics} shows precision, recall and the
harmonic mean of both ($\mathrm{F}_1$-score). The results indicate
that the model errs on the side of sensitivity because recall is
higher than precision. Although the model produces some detections
that are not labeled as plants in the dataset, a plant that is labeled
in the ground truth data has a high chance of being detected. This behavior is in
line with how the model's detections are handled in practice. The
detections are drawn on the original image and the user is able to
check the bounding boxes visually. If there are wrong detections, the
user can ignore them and focus on the relevant ones instead. A higher
recall will thus serve the user's needs better than a high precision.
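For reference, the $\mathrm{F}_1$-score reported in
table~\ref{tab:yolo-metrics} is the harmonic mean of precision and
recall,
\[
  \mathrm{F}_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\]
which for the values above amounts to
$2 \cdot 0.548 \cdot 0.738 / (0.548 + 0.738) \approx 0.63$.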
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & \num{0.547571} & \num{0.737866} & \num{0.628633} & \num{12238.0} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the object
detection model.}
\label{tab:yolo-metrics}
\end{table}
Figure~\ref{fig:yolo-ap} shows the \gls{ap} for the \gls{iou}
thresholds of \num{0.5} and \num{0.95}. Predicted bounding boxes with
an \gls{iou} of less than \num{0.5} are not taken into account for the
precision and recall values of table~\ref{tab:yolo-metrics}. The lower
the detection threshold, the more plants are detected. Conversely, a
higher detection threshold leaves potential plants undetected. The
precision-recall curves confirm this behavior because the area under
the curve for the threshold of \num{0.5} is higher than for the
threshold of \num{0.95} (\num{0.66} versus \num{0.41}). These values
are combined in COCO's \cite{lin2015} main evaluation metric which is
the \gls{ap} averaged across the \gls{iou} thresholds from \num{0.5}
to \num{0.95} in \num{0.05} steps. This value is then averaged across
all classes and called \gls{map}. The object detection model achieves
a state-of-the-art \gls{map} of \num{0.5727} for the \emph{Plant} class.
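A minimal sketch of this aggregation is given below;
\verb|average_precision| stands in for a routine that computes the area
under the precision-recall curve of one class at one \gls{iou}
threshold and is not defined here.

\begin{verbatim}
import numpy as np

def mean_average_precision(average_precision, classes):
    """mAP@0.5:0.95: average AP over the IoU thresholds 0.5, 0.55, ..., 0.95
    and then over all classes."""
    thresholds = np.linspace(0.5, 0.95, 10)
    per_class = [np.mean([average_precision(cls, t) for t in thresholds])
                 for cls in classes]
    return float(np.mean(per_class))
\end{verbatim}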
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95.pdf}
\caption[Object detection AP@0.5 and AP@0.95.]{Precision-recall
curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
\gls{ap} of a specific threshold is defined as the area under the
precision-recall curve of that threshold. The \gls{map} across
\gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
steps \gls{map}@0.5:0.95 is \num{0.5727}.}
\label{fig:yolo-ap}
\end{figure}
\subsubsection{Hyperparameter Optimization}
\label{sssec:yolo-hyp-opt}
Turning to the evaluation of the optimized model on the test dataset,
table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
with the non-optimized version from table~\ref{tab:yolo-metrics},
precision is significantly higher by more than \num{8.5} percentage
points. Recall, however, is \num{3.5} percentage points lower. The
$\mathrm{F}_1$-score is higher by more than \num{3.7} percentage
points which indicates that the optimized model is better overall
despite the lower recall. We argue that the lower recall value is a
suitable trade-off for the substantially higher precision considering
that the non-optimized model's precision is quite low at \num{0.55}.
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & \num{0.633358} & \num{0.702811} & \num{0.666279} & \num{12238.0} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
optimized object detection model.}
\label{tab:yolo-metrics-hyp}
\end{table}
The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
optimized model show that the model draws looser bounding boxes than
the non-optimized model. The \gls{ap} for both \gls{iou} thresholds of
\num{0.5} and \num{0.95} is lower, indicating worse performance. It is
likely that more iterations during evolution would help increase the
\gls{ap} values as well. Even though the precision and recall values
from table~\ref{tab:yolo-metrics-hyp} are better, the
\gls{map}@0.5:0.95 is lower by \num{1.8} percentage points.
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95-final.pdf}
\caption[Hyper-parameter optimized object detection AP@0.5 and
AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
\num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
defined as the area under the precision-recall curve of that
threshold. The \gls{map} across \gls{iou} thresholds from
\num{0.5} to \num{0.95} in \num{0.05} steps \gls{map}@0.5:0.95 is
\num{0.5546}.}
\label{fig:yolo-ap-hyp}
\end{figure}
\subsection{Classification}
\label{ssec:classifier-eval}
In order to confirm that the optimized classification model neither
suffers from overfitting nor owes its performance to a coincidentally
advantageous train/test split, we perform stratified
$10$-fold cross validation on the dataset. Each fold contains 90\%
training and 10\% test data and was trained for \num{25}
epochs. Figure~\ref{fig:classifier-hyp-roc} shows the performance of
the epoch with the highest $\mathrm{F}_1$-score of each fold as
measured against the test split. The mean \gls{roc} curve provides a
robust metric for a classifier's performance because it averages out
the variability of the evaluation. Each fold manages to achieve at
least an \gls{auc} of \num{0.94}, while the best fold reaches
\num{0.99}. The mean \gls{roc} has an \gls{auc} of \num{0.96} with a
standard deviation of \num{0.02}. These results indicate that the
model is accurately predicting the correct class and is robust against
variations in the training set.
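The cross-validation procedure can be summarized by the following
sketch. The scikit-learn utilities are real; the feature array
\verb|X|, the label array \verb|y|, and the
\verb|train_classifier|/\verb|predict_scores| helpers are placeholders
for the actual training pipeline.

\begin{verbatim}
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cross_validate(X, y, train_classifier, predict_scores, folds=10):
    """Stratified k-fold CV; returns mean and standard deviation of the AUC."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model = train_classifier(X[train_idx], y[train_idx], epochs=25)  # placeholder
        scores = predict_scores(model, X[test_idx])  # probability of the positive class
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        aucs.append(auc(fpr, tpr))
    return float(np.mean(aucs)), float(np.std(aucs))
\end{verbatim}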
\begin{figure}
\centering
\includegraphics{graphics/classifier-hyp-folds-roc.pdf}
@ -2893,210 +3134,6 @@ $\mathrm{F}_1$-score of \num{1} on the training set.
\label{fig:classifier-hyp-folds}
\end{figure}
\subsubsection{Class Activation Maps}
\label{sssec:classifier-cam}
@ -3110,7 +3147,7 @@ explain why a decision was made in a certain way. The research field
of \gls{xai} gained significance during the last few years because of
the development of new methods to peek inside these black boxes.
One such method, \gls{cam} \cite{zhou2015}, is a popular tool to
produce visual explanations for decisions made by
\glspl{cnn}. Convolutional layers essentially function as object
detectors as long as no fully-connected layers perform the
@ -3120,7 +3157,7 @@ be retained until the last layer and used to generate activation maps
for the predictions.
A more recent approach to generating a \gls{cam} via gradients is
proposed by \textcite{selvaraju2020}. Their \gls{grad-cam} approach
works by computing the gradient of the score for the specified class
with respect to the feature maps of the last convolutional layer. The last
layer is chosen because the authors find that ``[…] Grad-CAM maps
@ -3156,7 +3193,6 @@ of the image during classification.
\label{fig:classifier-cam}
\end{figure}
\subsection{Aggregate Model}
\label{ssec:aggregate-model}
@ -3167,31 +3203,6 @@ complete pipeline from gathering detections of potential plants in an
image and forwarding them to the classifier to obtaining the results
as either healthy or stressed with their associated confidence scores.
\subsection{Non-optimized Model}
\label{ssec:model-non-optimized}
@ -3199,13 +3210,13 @@ bounding boxes of healthy plants and 494 of stressed plants.
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
\midrule
Healthy & \num{0.665} & \num{0.554} & \num{0.604} & \num{766} \\
Stressed & \num{0.639} & \num{0.502} & \num{0.562} & \num{494} \\
Micro Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
Macro Avg & \num{0.652} & \num{0.528} & \num{0.583} & \num{1260} \\
Weighted Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
@ -3216,33 +3227,34 @@ bounding boxes of healthy plants and 494 of stressed plants.
Table~\ref{tab:model-metrics} shows precision, recall and the
$\mathrm{F}_1$-score for both classes \emph{Healthy} and
\emph{Stressed}. Precision is higher than recall for both classes and
the $\mathrm{F}_1$-score is at \num{0.59}. Unfortunately, these values
do not take the accuracy of bounding boxes into account and thus have
only limited expressive power.
Figure~\ref{fig:aggregate-ap} shows the precision-recall curves
for both classes at different \gls{iou} thresholds. The left plot
shows the \gls{ap} for each class at the threshold of \num{0.5} and
the right one at \num{0.95}. The \gls{map} is \num{0.3581} and
calculated across all classes as the mean over the \gls{iou}
thresholds from \num{0.5} to \num{0.95} in \num{0.05} steps. The
cliffs at around \num{0.6} (left) and \num{0.3} (right) happen at a
detection threshold of \num{0.5}. The classifier's last layer is a
softmax layer which necessarily transforms the input into a
probability of showing either a healthy or stressed plant. If the
probability of an image showing a healthy plant is below \num{0.5}, it
is no longer classified as healthy but as stressed. The threshold for
discriminating the two classes lies at the \num{0.5} value and is
therefore the cutoff for either class.
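As a small illustration of this cutoff (the two-element output vector
and the class order are assumptions about the classifier's final
layer):

\begin{verbatim}
import numpy as np

def classify(logits):
    """Apply softmax to the classifier's two outputs and use 0.5 as the cutoff."""
    exp = np.exp(logits - np.max(logits))
    p_healthy, p_stressed = exp / exp.sum()      # assumed order: [healthy, stressed]
    # With two classes the probabilities sum to one, so the decision boundary
    # lies exactly at 0.5: below it the detection is classified as stressed.
    return ("Healthy", p_healthy) if p_healthy >= 0.5 else ("Stressed", p_stressed)
\end{verbatim}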
\begin{figure}
\centering
\includegraphics{graphics/APmodel-model-optimized-relabeled.pdf}
\caption[Aggregate model AP@0.5 and AP@0.95.]{Precision-recall
curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
\gls{ap} of a specific threshold is defined as the area under the
precision-recall curve of that threshold. The \gls{map} across
\gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
steps \gls{map}@0.5:0.95 is \num{0.3581}.}
\label{fig:aggregate-ap}
\end{figure}
@ -3262,13 +3274,13 @@ section~\ref{ssec:aggregate-model}.
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
\midrule
Healthy & \num{0.711} & \num{0.555} & \num{0.623} & \num{766} \\
Stressed & \num{0.570} & \num{0.623} & \num{0.596} & \num{494} \\
Micro Avg & 0.644 & 0.582 & 0.611 & 1260 \\
Macro Avg & 0.641 & 0.589 & 0.609 & 1260 \\
Weighted Avg & 0.656 & 0.582 & 0.612 & 1260 \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
@ -3278,49 +3290,74 @@ section~\ref{ssec:aggregate-model}.
Table~\ref{tab:model-metrics-hyp} shows precision, recall and
$\mathrm{F}_1$-score for the optimized model on the same test dataset
of \num{640} images. Overall, the metrics are better for the optimized
model. In particular, precision for the healthy class could be
improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a 3 percentage point
gain for the $\mathrm{F}_1$-score in the stressed class. Overall,
precision is the same but recall has improved significantly, which
also results in a noticeable improvement for the average
$\mathrm{F}_1$-score across both classes.
\begin{figure}
\centering
\includegraphics{graphics/APModel-model-original-relabeled.pdf}
\caption[Optimized aggregate model AP@0.5 and
AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
\num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
defined as the area under the precision-recall curve of that
threshold. The \gls{map} across \gls{iou} thresholds from
\num{0.5} to \num{0.95} in \num{0.05} steps \gls{map}@0.5:0.95 is
\num{0.3838}.}
\label{fig:aggregate-ap-hyp}
\end{figure}
Figure~\ref{fig:aggregate-ap-hyp} confirms the performance increase of
the optimized model established in
table~\ref{tab:model-metrics-hyp}. The \gls{map}@0.5 is higher for
both classes, indicating that the model better detects plants in
general. The \gls{map}@0.95 is slightly lower for the healthy class,
which means that the confidence for the healthy class is slightly
lower compared to the non-optimized model. The result is that more
plants are correctly detected and classified overall, but the
confidence scores tend to be lower with the optimized model. The
\gls{map}@0.5:0.95 could be improved by about \num{0.025}.
\section{Discussion}
\label{sec:discussion}
Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. The hyperparameter
optimization of the object detector, however, raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than the non-optimized version. Even though
precision and recall of the model improved, the bounding boxes are
worse. We argue that the hyperparameter optimization has to be run for
more than \num{87} iterations to provide better results. Searching for
the optimal hyperparameters with genetic methods usually requires many
more iterations than that because it takes a significant amount of
time to evolve the parameters \emph{away} from the starting
conditions.
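To illustrate why the evolution needs many iterations, the following
sketch shows one simplified mutation step of such a genetic search;
the bounds, the mutation strength, and the selection of a single best
parent are assumptions, not the exact procedure used.

\begin{verbatim}
import random

def mutate(parent, bounds, sigma=0.2):
    """Create a child configuration by perturbing the best parent so far."""
    child = {}
    for name, value in parent.items():
        low, high = bounds[name]
        # Multiplicative Gaussian noise keeps values close to the parent, which
        # is why many generations are needed to move far from the initial values.
        child[name] = min(max(value * (1.0 + random.gauss(0.0, sigma)), low), high)
    return child
\end{verbatim}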
Furthermore, we only train each iteration for three epochs and assume
that those already provide a good measure of the model's
performance. It can be seen in figure~\ref{fig:hyp-opt-fitness} that
the fitness during the first few epochs exhibits some amount of
variation before it stabilizes. In fact, the fitness of the
non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process.
The optimized classifier
shows very strong performance in the \num{10}-fold cross validation,
where it achieves a mean \gls{auc} of \num{0.96}.
\chapter{Conclusion}
\label{chap:conclusion}