Finish evaluation chapter

parent dced4c6902 · commit 3a2da62bec
@ -2901,8 +2901,8 @@ The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results and identify the
limitations of our approach.

\section{Methodology}
\label{sec:methodology}
@ -3296,11 +3296,11 @@ improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a three percentage
point gain for the $\mathrm{F}_1$-score in the stressed
class. Overall, precision is the same but recall has improved
significantly, which also results in a noticeable improvement for the
average $\mathrm{F}_1$-score across both classes.
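The $\mathrm{F}_1$-score is the harmonic mean of precision and
recall, which is why a recall gain at roughly constant precision
translates directly into a higher score. The following minimal Python
sketch illustrates this with the recall values reported above; the
precision value of \num{0.7} is purely illustrative and not taken
from our results.

\begin{verbatim}
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recall values reported above; the precision of 0.7 is illustrative.
print(f1_score(0.7, 0.502))  # non-optimized model
print(f1_score(0.7, 0.623))  # optimized model
\end{verbatim}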
\begin{figure}
\centering
@ -3332,8 +3332,39 @@ confidence scores tend to be lower with the optimized model. The
Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. Even though the results
are reasonably good, we argue that they could be better for the
purposes of plant detection in the context of this work. The \gls{oid}
was labeled by humans and thus exhibits characteristics which are not
optimal for our purposes. The class \emph{plant} does not seem to have
been defined rigorously. Large patches of grass, for example, are
labeled with large bounding boxes. Trees are sometimes labeled, but
only if their size suggests that they could be bushes or similar types
of plant. Large corn fields are also labeled as plants, but again with
one large bounding box. If multiple plants are densely packed, the
annotators often label them as belonging to one plant and thus one
bounding box. Sometimes the effort has been made to delineate plants
accurately and sometimes not, which results in inconsistent bounding
boxes. These inconsistencies and peculiarities, as well as the
ever-present error rate introduced by human annotators, complicate the
training process of our object detection model.

During a random sampling of labels and predictions of the object
detection model on the validation set, it became clear that the model
always tries to label each individual plant when it is faced with an
image of closely packed plants. For images where one ground-truth
bounding box encapsulates all of the plants, the \gls{iou} of the
model's predictions is too far off from the ground truth, which
lowers the \gls{map} accordingly. Since arguably all datasets have
some inconsistencies and errors in their ground truth, model
engineers can only hope that the sheer amount of data available evens
out these problems. In our case, the \num{79204} training images with
\num{284130} bounding boxes might be enough to provide the model with
a smooth distribution to learn from, but unless every single label is
analyzed and systematically categorized, this remains speculation.
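To illustrate how group-level ground truth penalizes per-plant
predictions, the following sketch computes the standard \gls{iou}
between two axis-aligned boxes. The coordinates are hypothetical and
merely mimic a single correctly detected plant inside one large
group-level box.

\begin{verbatim}
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# One large ground-truth box around a group of plants versus a
# prediction for a single plant inside it: the overlap stays far
# below the usual 0.5 matching threshold, so the (correct) prediction
# counts as a miss and lowers the mAP.
print(iou((0, 0, 100, 100), (10, 10, 35, 35)))  # 0.0625
\end{verbatim}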

The hyperparameter optimization of the object detector raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than the non-optimized version. Even though
precision and recall of the model improved, the bounding boxes are
@ -3342,7 +3373,8 @@ more than \num{87} iterations to provide better results. Searching for
the optimal hyperparameters with genetic methods usually requires many
more iterations than that because it takes a significant amount of
time to evolve the parameters \emph{away} from the starting
conditions. However, as mentioned before, our time constraints only
allowed optimization to run for \num{87} iterations.

Furthermore, we only train each iteration for three epochs and assume
that those already provide a good measure of the model's
@ -3353,11 +3385,79 @@ non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process. We argue that three
epochs are likely not enough to support the hyperparameter
optimization process. Unfortunately, if the number of epochs per
iteration is increased by one, the total number of epochs over all
iterations increases by the total number of iterations. Every
additional epoch thus contributes to a significantly longer
optimization time. For our purposes, \num{87} iterations and three
epochs per iteration are close to the limit. Further iterations or
epochs were not feasible within our time budget.
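To make the cost explicit, the total training effort is the product
of iterations and epochs per iteration:
\[
  \underbrace{87}_{\text{iterations}} \times
  \underbrace{3}_{\text{epochs/iteration}} = 261 \text{ epochs},
  \qquad
  87 \times 4 = 348 \text{ epochs}.
\]
Each additional epoch per iteration therefore adds another \num{87}
epochs to the total optimization budget.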

The optimized classifier shows a strong performance in the
\num{10}-fold cross validation where it achieves a mean \gls{auc} of
\num{0.96}. The standard deviation of the \gls{auc} across all folds
is small enough at \num{0.02} to indicate that the model generalizes
well to unseen data. We are confident in these results provided that
the ground truth was labeled correctly. The \gls{cam}
(figure~\ref{fig:classifier-cam}) constitutes another data point in
support of this conclusion. Despite these points, the results come
with a caveat. The ground truth was \emph{not} created by an expert
in botany or related sciences and thus could contain a significant
amount of errors. Even though we manually verified most of the labels
in the dataset and agree with them, we are also \emph{not} expert
labelers.
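The fold-wise aggregation behind these numbers corresponds to the
following sketch. The data and the scikit-learn stand-in model are
hypothetical, since our actual classifier is a neural network, but
the mean and standard deviation of the \gls{auc} are computed the
same way.

\begin{verbatim}
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data and model; this only illustrates how the
# fold-wise AUC values are aggregated into a mean and a deviation.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y,
                         cv=10, scoring="roc_auc")
print(f"AUC: {scores.mean():.2f} +/- {scores.std():.2f}")
\end{verbatim}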

The aggregate model achieves a \gls{map} of \num{0.3581} before and
\num{0.3838} after optimization. If we look at the common benchmarks
(\gls{coco}) again, where the state of the art achieves \gls{map}
values between \num{0.5} and \num{0.58}, we are confident that our
results are reasonably good. Comparing the \gls{map} values directly
is not a clear indicator of how good the model is or should be
because it is an apples-to-oranges comparison due to the different
test datasets. Nevertheless, the task of detecting objects and
classifying them is similar across both datasets, and the comparison
thus provides a rough guideline for the performance of our
prototype. We argue that classifying the plants into healthy and
stressed on top of detecting them is more difficult than \emph{just}
object detection. In addition to having to discriminate between
different common objects, our model also has to discriminate between
plant states, which requires further knowledge. The lower \gls{map}
values are thus attributable to the more difficult task posed by our
research questions.

We do not know the reason for the better performance of the optimized
versus the non-optimized aggregate model. Evidently, the optimized
version should be better, but considering that the optimized object
detector performs worse in terms of \gls{map}, we would expect to see
this reflected in the aggregate model as well. It is possible that
the optimized classifier balances out the worse object detector and
even provides better results beyond that. Another possibility is that
the better performance is in large part due to the increased
precision and recall of the optimized object detector. In fact, these
two possibilities taken together might explain the optimized model's
results. Nevertheless, we caution against putting too much weight on
the \num{2.5} percentage point \gls{map} increase because both models
have been optimized \emph{separately} instead of \emph{in
aggregate}. By optimizing the models separately to increase the
accuracy on a new dataset instead of optimizing them in aggregate, we
do not take the dependence between the two models into account. As an
example, it could be the case that new configurations which are
better for each model individually are worse in aggregate than some
other option would be. Even though both models are \emph{locally}
better (w.r.t. their separate tasks), they are worse \emph{globally}
when taken together to solve both tasks in series. A better approach
to optimization would be to either combine both models into one and
optimize only once, or to introduce a single end-to-end metric
against which both models are optimized jointly.
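A sketch of the second option is shown below. Everything here is
hypothetical: \texttt{pipeline\_map} stands in for training both
models with the candidate parameters and evaluating the aggregate
\gls{map} of the full pipeline, the search space is made up for
illustration, and a simple random search replaces the genetic method
for brevity.

\begin{verbatim}
import random

# Hypothetical stand-in: would train detector + classifier with the
# given parameters and measure the aggregate mAP of the pipeline.
def pipeline_map(params):
    return random.random()  # dummy so the loop runs

def joint_search(n_iterations=87):
    best_params, best_fitness = None, float("-inf")
    for _ in range(n_iterations):
        # Illustrative, made-up search space spanning both models.
        params = {
            "detector_lr": 10 ** random.uniform(-4, -2),
            "classifier_lr": 10 ** random.uniform(-4, -2),
            "conf_threshold": random.uniform(0.1, 0.5),
        }
        # One global score instead of two local ones, so the
        # dependence between the two models is taken into account.
        fitness = pipeline_map(params)
        if fitness > best_fitness:
            best_params, best_fitness = params, fitness
    return best_params

print(joint_search())
\end{verbatim}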

Apart from these concerns, both models on their own as well as in
aggregate are a promising first step into plant state
classification. The results demonstrate that solving the task is
feasible and that good results can be obtained with off-the-shelf
object detectors and classifiers. As a consequence, the baseline set
forth in this work is a starting point for further research in this
direction.

\chapter{Conclusion}
\label{chap:conclusion}