Finish evaluation chapter

parent dced4c6902 · commit 3a2da62bec
@ -2901,8 +2901,8 @@ The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results and identify the
limitations of our approach.

\section{Methodology}
\label{sec:methodology}
@ -3296,11 +3296,11 @@ improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a three percentage
point gain for the $\mathrm{F}_1$-score in the stressed
class. Overall, precision is the same but recall has improved
significantly, which also results in a noticeable improvement for the
average $\mathrm{F}_1$-score across both classes.
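The $\mathrm{F}_1$-score is the harmonic mean of precision and
recall, which is why a recall gain at roughly constant precision
translates directly into a higher score. The following minimal Python
sketch illustrates this with the recall values reported above; the
precision value of \num{0.7} is purely illustrative and not taken
from our results.

\begin{verbatim}
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recall values reported above; the precision of 0.7 is illustrative.
print(f1_score(0.7, 0.502))  # non-optimized model
print(f1_score(0.7, 0.623))  # optimized model
\end{verbatim}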
\begin{figure}
\centering
@ -3332,8 +3332,39 @@ confidence scores tend to be lower with the optimized model. The
Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. Even though the results
are reasonably good, we argue that they could be better for the
purposes of plant detection in the context of this work. The \gls{oid}
was labeled by humans and thus exhibits characteristics which are not
optimal for our purposes. The class \emph{plant} does not seem to have
been defined rigorously. Large patches of grass, for example, are
labeled with large bounding boxes. Trees are sometimes labeled, but
only if their size suggests that they could be bushes or similar types
of plant. Large corn fields are also labeled as plants, but again with
one large bounding box. If multiple plants are densely packed, the
annotators often label them as belonging to one plant and thus one
bounding box. Sometimes the effort has been made to delineate plants
accurately and sometimes not, which results in inconsistent bounding
boxes. These inconsistencies and peculiarities, as well as the
ever-present error rate introduced by human annotators, complicate the
training process of our object detection model.

During a random sampling of labels and predictions of the object
detection model on the validation set, it became clear that the model
always tries to label each individual plant when it is faced with an
image of closely packed plants. For images where one ground-truth
bounding box encapsulates all of the plants, the \gls{iou} of the
model's predictions is too far off from the ground truth, which
lowers the \gls{map} accordingly. Since arguably all datasets have
some inconsistencies and errors in their ground truth, model
engineers can only hope that the sheer amount of data available evens
out these problems. In our case, the \num{79204} training images with
\num{284130} bounding boxes might be enough to provide the model with
a smooth distribution to learn from, but unless every single label is
analyzed and systematically categorized, this remains speculation.
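To illustrate how group-level ground truth penalizes per-plant
predictions, the following sketch computes the standard \gls{iou}
between two axis-aligned boxes. The coordinates are hypothetical and
merely mimic a single correctly detected plant inside one large
group-level box.

\begin{verbatim}
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# One large ground-truth box around a group of plants versus a
# prediction for a single plant inside it: the overlap stays far
# below the usual 0.5 matching threshold, so the (correct) prediction
# counts as a miss and lowers the mAP.
print(iou((0, 0, 100, 100), (10, 10, 35, 35)))  # 0.0625
\end{verbatim}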

The hyperparameter optimization of the object detector raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than the non-optimized version. Even though
precision and recall of the model improved, the bounding boxes are
@ -3342,7 +3373,8 @@ more than \num{87} iterations to provide better results. Searching for
the optimal hyperparameters with genetic methods usually requires many
more iterations than that because it takes a significant amount of
time to evolve the parameters \emph{away} from the starting
conditions. However, as mentioned before, our time constraints only
allowed optimization to run for \num{87} iterations.

Furthermore, we only train each iteration for three epochs and assume
that those already provide a good measure of the model's
@ -3353,11 +3385,79 @@ non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process. We argue that three
epochs are likely not enough to support the hyperparameter
optimization process. Unfortunately, if the number of epochs per
iteration is increased by one, the total number of epochs over all
iterations increases by the total number of iterations. Every
additional epoch thus contributes to a significantly longer
optimization time. For our purposes, \num{87} iterations and three
epochs per iteration are close to the limit. Further iterations or
epochs were not feasible within our time budget.
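To make the cost explicit, the total training effort is the product
of iterations and epochs per iteration:
\[
  \underbrace{87}_{\text{iterations}} \times
  \underbrace{3}_{\text{epochs/iteration}} = 261 \text{ epochs},
  \qquad
  87 \times 4 = 348 \text{ epochs}.
\]
Each additional epoch per iteration therefore adds another \num{87}
epochs to the total optimization budget.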

The optimized classifier shows a strong performance in the
\num{10}-fold cross validation where it achieves a mean \gls{auc} of
\num{0.96}. The standard deviation of the \gls{auc} across all folds
is small enough at \num{0.02} to indicate that the model generalizes
well to unseen data. We are confident in these results provided that
the ground truth was labeled correctly. The \gls{cam}
(figure~\ref{fig:classifier-cam}) constitutes another data point in
support of this conclusion. Despite these points, the results come
with a caveat. The ground truth was \emph{not} created by an expert
in botany or related sciences and thus could contain a significant
amount of errors. Even though we manually verified most of the labels
in the dataset and agree with them, we are also \emph{not} expert
labelers.
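The fold-wise aggregation behind these numbers corresponds to the
following sketch. The data and the scikit-learn stand-in model are
hypothetical, since our actual classifier is a neural network, but
the mean and standard deviation of the \gls{auc} are computed the
same way.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data and model; this only illustrates how the
# fold-wise AUC values are aggregated into a mean and a deviation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=10, scoring="roc_auc")
print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
\end{verbatim}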

The aggregate model achieves a \gls{map} of \num{0.3581} before and
\num{0.3838} after optimization. If we look at the common benchmarks
(\gls{coco}) again, where the state of the art achieves \gls{map}
values between \num{0.5} and \num{0.58}, we are confident that our
results are reasonably good. Comparing the \gls{map} values directly
is not a clear indicator of how good the model is or should be
because it is an apples-to-oranges comparison due to the different
test datasets. Nevertheless, the task of detecting objects and
classifying them is similar across both datasets, and the comparison
thus provides a rough guideline for the performance of our
prototype. We argue that classifying the plants into healthy and
stressed on top of detecting them is more difficult than \emph{just}
object detection. In addition to having to discriminate between
different common objects, our model also has to discriminate between
plant states, which requires further knowledge. The lower \gls{map}
values are thus attributable to the more difficult task posed by our
research questions.

We do not know the reason for the better performance of the optimized
versus the non-optimized aggregate model. Evidently, the optimized
version should be better, but considering that the optimized object
detector performs worse in terms of \gls{map}, we would expect to see
this reflected in the aggregate model as well. It is possible that
the optimized classifier balances out the worse object detector and
even provides better results beyond that. Another possibility is that
the better performance is in large part due to the increased
precision and recall of the optimized object detector. In fact, these
two possibilities taken together might explain the optimized model's
results. Nevertheless, we caution against putting too much weight on
the \num{2.5} percentage point \gls{map} increase because both models
have been optimized \emph{separately} instead of \emph{in
aggregate}. By optimizing the models separately to increase the
accuracy on a new dataset instead of optimizing them in aggregate, we
do not take the dependence between the two models into account. As an
example, it could be the case that new configurations which are
better for each model individually are worse in aggregate than some
other option would be. Even though both models are \emph{locally}
better (w.r.t. their separate tasks), they are worse \emph{globally}
when taken together to solve both tasks in series. A better approach
to optimization would be to either combine both models into one and
optimize only once, or to introduce a single end-to-end metric
against which both models are optimized jointly.
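A sketch of the second option is shown below. Everything here is
hypothetical: \texttt{pipeline\_map} stands in for training both
models with the candidate parameters and evaluating the aggregate
\gls{map} of the full pipeline, the search space is made up for
illustration, and a simple random search replaces the genetic method
for brevity.

\begin{verbatim}
import random

# Hypothetical stand-in: would train detector + classifier with the
# given parameters and measure the aggregate mAP of the pipeline.
def pipeline_map(params):
    return random.random()  # dummy so the loop runs

def joint_search(n_iterations=87):
    best_params, best_fitness = None, float("-inf")
    for _ in range(n_iterations):
        # Illustrative, made-up search space spanning both models.
        params = {
            "detector_lr": 10 ** random.uniform(-4, -2),
            "classifier_lr": 10 ** random.uniform(-4, -2),
            "conf_threshold": random.uniform(0.1, 0.5),
        }
        # One global score instead of two local ones, so the
        # dependence between the two models is taken into account.
        fitness = pipeline_map(params)
        if fitness > best_fitness:
            best_params, best_fitness = params, fitness
    return best_params

print(joint_search())
\end{verbatim}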

Apart from these concerns, both models on their own as well as in
aggregate are a promising first step into plant state
classification. The results demonstrate that solving the task is
feasible and that good results can be obtained with off-the-shelf
object detectors and classifiers. As a consequence, the baseline set
forth in this work is a starting point for further research in this
direction.

\chapter{Conclusion}
\label{chap:conclusion}