Finish evaluation chapter

Tobias Eidelpes 2023-12-21 17:40:31 +01:00
parent dced4c6902
commit 3a2da62bec
2 changed files with 114 additions and 14 deletions



@@ -2901,8 +2901,8 @@ The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results and identify the
limitations of our approach.
\section{Methodology}
\label{sec:methodology}
@@ -3296,11 +3296,11 @@ improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a three percentage
point gain for the $\mathrm{F}_1$-score in the stressed
class. Overall, precision is the same but recall has improved
significantly, which also results in a noticeable improvement for the
average $\mathrm{F}_1$-score across both classes.
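For reference, the $\mathrm{F}_1$-score is the harmonic mean of
precision and recall, so a gain in recall at otherwise unchanged
precision translates directly into a higher score:
\begin{equation*}
  \mathrm{F}_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\end{equation*}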
\begin{figure}
\centering
@@ -3332,8 +3332,39 @@ confidence scores tend to be lower with the optimized model. The
Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. Even though the results
are reasonably good, we argue that they could be better for the
purpose of plant detection in the context of this work. The \gls{oid}
was labeled by humans and thus exhibits characteristics which are not
optimal for our purposes. The class \emph{plant} does not appear to
have been defined rigorously. Large patches of grass, for example, are
labeled with large bounding boxes. Trees are sometimes labeled, but
only if their size suggests that they could be bushes or similar types
of plant. Large corn fields are also labeled as plants, but again with
one large bounding box. If multiple plants are densely packed, the
annotators often label them as belonging to one plant and thus one
bounding box. Sometimes the effort has been made to delineate plants
accurately and sometimes it has not, which results in inconsistent
bounding boxes. These inconsistencies and peculiarities, as well as
the ever-present error rate introduced by human annotators, complicate
the training process of our object detection model.

During a random sampling of labels and predictions of the object
detection model on the validation set, it became clear that the model
tries to label each individual plant correctly when it is faced with
an image of closely packed plants. For images where one bounding box
encapsulates all of the plants, the \gls{iou} of the model's
predictions is too far off from the ground truth, which lowers the
\gls{map} accordingly. Since arguably all datasets contain some
inconsistencies and errors in their ground truth, model engineers can
only hope that the sheer amount of available data evens out these
problems. In our case, the \num{79204} training images with
\num{284130} bounding boxes might be enough to provide the model with
a smooth distribution from which to learn, but unless every single
label is analyzed and systematically categorized, this remains
speculation.
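To make the effect of such coarse annotations concrete, recall the
\gls{iou} that underlies the \gls{map} computation; writing $B_p$ for
a predicted box and $B_{gt}$ for a ground-truth box (notation
introduced here purely for illustration), a single annotated box
spanning many plants overlaps poorly with several small, accurate
predictions:
\begin{equation*}
  \mathrm{IoU}(B_p, B_{gt}) = \frac{\lvert B_p \cap B_{gt} \rvert}{\lvert B_p \cup B_{gt} \rvert}
\end{equation*}
Predictions whose \gls{iou} with every ground-truth box falls below
the evaluation threshold count as false positives, so the \gls{map}
drops even though the individual plant detections may be visually
correct.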

The hyperparameter optimization of the object detector raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than that of the non-optimized version. Even
though precision and recall of the model improved, the bounding boxes are
@@ -3342,7 +3373,8 @@ more than \num{87} iterations to provide better results. Searching for
the optimal hyperparameters with genetic methods usually requires many
more iterations than that because it takes a significant amount of
time to evolve the parameters \emph{away} from the starting
conditions. However, as mentioned before, our time constraints only
allowed optimization to run for \num{87} iterations.

Furthermore, we train each iteration for only three epochs and assume
that those already provide a good measure of the model's
@@ -3353,11 +3385,79 @@ non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process. We argue that three
epochs are likely not enough to support the hyperparameter
optimization process. Unfortunately, if the number of epochs per
iteration is increased by one, the total number of epochs over all
iterations increases by the number of iterations. Every additional
epoch per iteration thus adds significantly to the overall
optimization time. For our purposes, \num{87} iterations and three
epochs per iteration are close to the limit. Further iterations or
epochs were not feasible within our time budget.
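To put this in numbers: with $I$ iterations and $E$ epochs per
iteration (symbols introduced here only for illustration), the search
trains $I \cdot E$ epochs in total, so our budget already amounts to
\begin{equation*}
  87 \times 3 = 261 \ \text{epochs}, \qquad \text{whereas} \qquad 87 \times 4 = 348 \ \text{epochs};
\end{equation*}
that is, a single additional epoch per iteration adds another
\num{87} epochs of training to the optimization run.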

The optimized classifier shows strong performance in the
\num{10}-fold cross-validation where it achieves a mean \gls{auc} of
\num{0.96}. The standard deviation of the \gls{auc} across all folds
is, at \num{0.02}, small enough to indicate that the model generalizes
well to unseen data. We are confident in these results provided that
the ground truth was labeled correctly. The \gls{cam}
(figure~\ref{fig:classifier-cam}) constitute another data point in
support of this conclusion. Despite these points, the results come
with a caveat. The ground truth was \emph{not} created by an expert in
botany or related sciences and thus could contain a significant number
of errors. Even though we manually verified most of the labels in the
dataset and agree with them, we are also \emph{not} expert labelers.
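As an illustration of the evaluation protocol only (not of the actual
classifier), the following minimal sketch assumes scikit-learn, uses
synthetic data and a stand-in model, and computes the per-fold
\gls{auc} together with its mean and standard deviation over a
stratified \num{10}-fold split:
\begin{verbatim}
# Sketch of the 10-fold cross-validation protocol; the data and the
# classifier are placeholders, not the model evaluated in this work.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=1000, n_features=32, random_state=0)
clf = RandomForestClassifier(random_state=0)

aucs = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    clf.fit(X[train_idx], y[train_idx])
    scores = clf.predict_proba(X[test_idx])[:, 1]
    aucs.append(roc_auc_score(y[test_idx], scores))

# Mean AUC and its standard deviation across the ten folds.
print(f"AUC: {np.mean(aucs):.2f} +/- {np.std(aucs):.2f}")
\end{verbatim}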

The aggregate model achieves a \gls{map} of \num{0.3581} before and
\num{0.3838} after optimization. Looking again at common benchmarks
such as \gls{coco}, where the state of the art achieves \gls{map}
values between \num{0.5} and \num{0.58}, we are confident that our
results are reasonably good. Comparing the \gls{map} values directly
is not a clear indicator of how good the model is or should be,
because the different test datasets make it an apples-to-oranges
comparison. Nevertheless, the task of detecting objects and
classifying them is similar across both datasets, and the comparison
thus provides a rough guideline for the performance of our
prototype. We argue that classifying plants as healthy or stressed on
top of detecting them is more difficult than \emph{just} object
detection. In addition to discriminating between different common
objects, our model also has to discriminate between plant states,
which requires further knowledge. The lower \gls{map} values are thus
attributable to the more difficult task posed by our research
questions.

We do not know the reason for the better performance of the optimized
versus the non-optimized aggregate model. Evidently, the optimized
version should be better, but considering that the optimized object
detector performs worse in terms of \gls{map}, we would expect to see
this reflected in the aggregate model as well. It is possible that the
optimized classifier balances out the worse object detector and even
provides better results beyond that. Another possibility is that the
better performance is in large part due to the increased precision and
recall of the optimized object detector. In fact, these two
possibilities taken together might explain the optimized model's
results. Nevertheless, we caution against putting too much weight on
the \num{2.5} percentage point \gls{map} increase because both models
have been optimized \emph{separately} instead of \emph{in
aggregate}. By optimizing the models separately to increase the
accuracy on a new dataset instead of optimizing them in aggregate, we
do not take the dependence between the two models into account. It
could be the case, for example, that configurations which improve each
model on its own are worse in aggregate than some other combination
would be. Even though both models are \emph{locally} better
(w.r.t. their separate tasks), they are worse \emph{globally} when
taken together to solve both tasks in series. A better approach would
be either to combine both models into one and optimize only once, or
to introduce a single metric against which both models are optimized
jointly.
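To make the distinction concrete, the following hypothetical sketch
contrasts the two selection strategies; the configurations, scoring
functions, and random search below are illustrative stand-ins and not
the detector, classifier, or hyperparameter search used in this work:
\begin{verbatim}
# Toy example: selecting each model by its own metric can yield a pair
# that is worse for the full pipeline than a jointly selected pair.
import random

random.seed(0)

def detector_score(det):
    # Stand-in for the detector's own validation mAP.
    return det["lr_scale"]

def classifier_score(clf):
    # Stand-in for the classifier's own validation AUC.
    return clf["augment"]

def pipeline_score(det, clf):
    # Stand-in for the aggregate mAP: in this toy setting the pipeline
    # is best when the two components complement each other
    # (lr_scale + augment close to 1), not when each component is at
    # its individually best setting.
    return 0.40 - 0.20 * abs(det["lr_scale"] + clf["augment"] - 1.0)

candidates = [({"lr_scale": random.random()}, {"augment": random.random()})
              for _ in range(87)]

# "Locally" optimal: pick each component by its separate metric.
separate = (max((d for d, _ in candidates), key=detector_score),
            max((c for _, c in candidates), key=classifier_score))

# "Globally" optimal: pick the pair by the aggregate metric.
joint = max(candidates, key=lambda pair: pipeline_score(*pair))

print(f"separately optimized pipeline: {pipeline_score(*separate):.3f}")
print(f"jointly optimized pipeline:    {pipeline_score(*joint):.3f}")
\end{verbatim}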

Apart from these concerns, both models on their own as well as in
aggregate are a promising first step toward plant state
classification. The results demonstrate that solving the task is
feasible and that good results can be obtained with off-the-shelf
object detectors and classifiers. The baseline set forth in this work
is therefore a starting point for further research in this direction.
\chapter{Conclusion}
\label{chap:conclusion}