Add result section
resource. However, because the results are published via a \gls{rest}
service, internet access is necessary to be able to retrieve the
predictions.

Other technical requirements are that the inference on the device for
both models does not take too long (i.e.\ not longer than a few seconds
per image). Even though plants are not known to grow extremely rapidly
from one minute to the next, keeping the inference time low results in
a more resource-efficient prototype. As such, it is possible to run
the device off of a battery, which completes the self-contained nature
of the prototype.

From an evaluation perspective, the models should have high
specificity and sensitivity. In order to be useful for plant
water-stress detection, it is necessary to identify as many
water-stressed plants as possible while keeping the number of false
positives as low as possible (specificity). If the number of
water-stressed plants is severely overestimated, downstream watering
systems could damage the plants by overwatering. Conversely, if the
number of water-stressed plants is underestimated, some plants are
likely to die because no water stress is detected (sensitivity).
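In the standard confusion-matrix terms ($\mathit{TP}$, $\mathit{FP}$,
$\mathit{TN}$, $\mathit{FN}$ for true/false positives and negatives),
these two requirements read:
\begin{equation}
  \mathrm{sensitivity} = \frac{\mathit{TP}}{\mathit{TP} + \mathit{FN}},
  \qquad
  \mathrm{specificity} = \frac{\mathit{TN}}{\mathit{TN} + \mathit{FP}}
\end{equation}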
Furthermore, the models are required to attain a reasonable level of
precision as well as good localization of plants. It is difficult to
determine said levels beforehand, but considering the task at hand as
well as general object detection and classification benchmarks such as
\gls{coco} \cite{lin2015}, we expect a \gls{map} of around 40\% and
precision and recall values of 70\%.

Other basic model requirements are robust object detection and
classification as well as good generalizability. The prototype should
be able to function in different environments where different lighting
conditions, different backgrounds, and different angles do not have an
impact on model performance. Where feasible, models should be
evaluated with cross validation to ensure that the performance of the
model on the test set is a good indicator of its generalizability. In
the same vein, models should not overfit or underfit the training
data, which would also result in bad generalizability.

During the iterative process of training the models as well as for
evaluation purposes, the models should be interpretable. Especially
when there is comparatively little training data available, verifying
that the model is focusing on the \emph{right} parts of an image gives
insight into its robustness and generalizability, which can increase
trust. Furthermore, if a model is clearly not focusing on the right
parts of an image, interpretability can help debug where the problem
lies. Interpretability is thus an important property of any model so
that the model engineer is able to steer the training and inference
process in the right direction.

\section{Design}
\label{sec:design}
In order to improve the aforementioned accuracy values, we perform
hyperparameter optimization across a wide range of
parameters. Table~\ref{tab:classifier-hyps} lists the hyperparameters
and their possible values. Since the number of all combinations of
values is \num{11520} and each combination is trained for ten epochs
with a training time of approximately six minutes per combination,
exhausting the search space would take \num{48} days. Due to time
limitations, we have chosen not to search exhaustively but to pick
random combinations instead. Random search works surprisingly
well---especially compared to grid search---in a number of domains, one of
which is hyperparameter optimization \cite{bergstra2012}.
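To make the procedure concrete, the following is a minimal sketch of
such a random search. It is illustrative only: the listed value ranges
are placeholders (the real search space is given in
table~\ref{tab:classifier-hyps}), and \texttt{train\_and\_evaluate} is a
hypothetical stand-in for the ten-epoch training run.

\begin{verbatim}
import random

# Illustrative search space; the real one is listed in the
# hyperparameter table of this section.
SEARCH_SPACE = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [16, 32, 64],
    "optimizer": ["sgd", "adam"],
    "dropout": [0.0, 0.25, 0.5],
}

def sample_configuration(rng):
    # Draw one random combination from the search space.
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}

def train_and_evaluate(config):
    # Hypothetical stand-in for a ten-epoch training run that
    # returns the validation accuracy of the configuration.
    return random.random()

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_configuration(rng)
        score = train_and_evaluate(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
\end{verbatim}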
the observation that almost all configurations converge well before
reaching the tenth epoch. The assumption that a training run with ten
epochs provides a good proxy for final performance is supported by the
quick convergence of validation accuracy and loss in
figure~\ref{fig:classifier-training-metrics}. Table~\ref{tab:classifier-final-hyps}
lists the final hyperparameters which were chosen to train the
improved model. Equation~\ref{eq:opt-prob} shows the probability of
finding at least one configuration among the best 1\% of the search
space with \num{138} random trials:

\begin{equation}\label{eq:opt-prob}
  1 - (1 - 0.01)^{138} \approx 0.75
\end{equation}
  \label{fig:classifier-hyp-results}
\end{figure}

\begin{table}
  \centering
  \begin{tabular}{cccc}
  \label{tab:classifier-final-hyps}
\end{table}
\section{Deployment}

After training the two models (object detector and classifier), we
export them to the \gls{onnx}\footnote{\url{https://github.com/onnx}}
format (a sketch of the export step is shown after the list below) and
move the model files to the Nvidia Jetson Nano. On the device, a Flask
application (\emph{server}) provides a \gls{rest} endpoint from which
the results of the most recent prediction can be queried. The server
periodically performs the following steps:
\begin{enumerate}
\item Call a binary which takes an image and writes it to a file.
\item Take the image and detect all plants as well as their status
  using the two models.
\item Draw the returned bounding boxes onto the original image.
\item Number each detection from left to right.
\item Coerce the prediction for each bounding box into a tuple
  $\langle I, S, T, \Delta T \rangle$.
\item Store the image with the bounding boxes and an array of all
  tuples (predictions) in a dictionary.
\item Wait two minutes.
\item Go to step one.
\end{enumerate}
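As referenced above, the export step can be sketched as follows. This
is an illustration under the assumption that the models are trained in
PyTorch and that the \texttt{onnxruntime} package is used on the
device; the \texttt{model} object, file name, tensor names, and input
resolution are placeholders.

\begin{verbatim}
import torch
import onnxruntime as ort

# Export a trained PyTorch model (here: the classifier) to ONNX.
# The 224x224 input resolution is an assumption.
model.eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy, "classifier.onnx",
                  input_names=["image"], output_names=["probabilities"])

# On the Jetson Nano, the exported file is loaded for inference.
session = ort.InferenceSession("classifier.onnx")
probs = session.run(None, {"image": dummy.numpy()})[0]
\end{verbatim}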
The binary uses the accelerated GStreamer implementation by Nvidia to
take an image. The tuple $\langle I, S, T, \Delta T \rangle$ consists
of the following items: $I$ is the number of the bounding box in the
image, $S$ the current state from one to ten, $T$ the timestamp of the
prediction, and $\Delta T$ the time since the state $S$ last fell
below three. The server performs these tasks asynchronously in the
background and is always ready to respond to requests with the most
recent prediction.
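The update loop can be pictured as the following minimal sketch. It is
not the actual implementation: \texttt{capture\_image},
\texttt{detect\_and\_classify}, and \texttt{draw\_boxes} are
hypothetical helpers wrapping the GStreamer binary and the two
\gls{onnx} models, and the endpoint name is an assumption.

\begin{verbatim}
import threading
import time
from datetime import datetime

from flask import Flask, jsonify

app = Flask(__name__)
latest = {"image": None, "predictions": []}  # most recent result

def update_loop():
    while True:
        capture_image("frame.jpg")                # hypothetical: camera binary
        boxes = detect_and_classify("frame.jpg")  # hypothetical: both models
        boxes.sort(key=lambda b: b.x_min)         # number detections left to right
        latest["predictions"] = [
            # the tuple <I, S, T, delta T>
            (i, b.state, datetime.now().isoformat(), b.time_since_stressed)
            for i, b in enumerate(boxes)
        ]
        latest["image"] = draw_boxes("frame.jpg", boxes)  # hypothetical
        time.sleep(120)                           # wait two minutes

@app.route("/prediction")
def prediction():
    return jsonify(latest["predictions"])

threading.Thread(target=update_loop, daemon=True).start()
\end{verbatim}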
This chapter detailed the training and deployment of the two models
used for the plant water-stress detection system---the object detector
and the classifier. Furthermore, we have specified the \gls{api} which
publishes the results continuously. We will now turn towards the
evaluation of the two separate models as well as the aggregate model.

\chapter{Evaluation}
\label{chap:evaluation}

The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results with respect to
the research questions defined in section~\ref{sec:motivation}.

\section{Methodology}
\label{sec:methodology}
In order to evaluate the object detection model and the classification
model, we analyze their predictions on test datasets. For the object
detection model, the test dataset is a 10\% split of the original
dataset which we describe in section~\ref{ssec:obj-train-dataset}. The
classifier is evaluated with a \num{10}-fold cross validation on the
original dataset (see section~\ref{ssec:class-train-dataset}). After
evaluating both models individually, we evaluate the model in
aggregate on a new dataset. This is necessary because the prototype
uses the two models as if they were one. The aggregate performance is
ultimately the most important measure to decide if the prototype is
able to meet the requirements.

The test set for the aggregate model contains \num{640} images which
were obtained from a Google search using the terms \emph{thirsty
plant}, \emph{wilted plant} and \emph{stressed plant}. Images which
clearly show one or multiple plants with some amount of visible stress
were added to the dataset. Care was taken to include plants with
various degrees of stress and in various locations and lighting
conditions. The search not only provided images of stressed plants,
but also of healthy plants. The dataset is biased towards potted
plants which are commonly put on display in Western
households. Furthermore, many plants, such as succulents, are sought
after for home environments because of their ease of maintenance. Due
to their inclusion in the dataset and how they exhibit water stress,
the test set contains a wide variety of scenarios.
After collecting the images, the aggregate model was run on them to
obtain initial bounding boxes and classifications for ground truth
labeling. Letting the model do the work beforehand and then correcting
the labels allowed us to include more images in the test set because
they could be labeled more easily. Additionally, going over the
detections and classifications provided a comprehensive view of how
the models work and what their weaknesses and strengths are. After the
labels were corrected, the ground truth of the test set contains
\num{766} bounding boxes of healthy plants and \num{494} of stressed
plants.
\section{Results}
\label{sec:results}

This section presents the results of the evaluation of the constituent
models as well as the aggregate model. First, we evaluate the object
detection model before and after hyperparameter optimization. Second,
we evaluate the performance of the classifier after hyperparameter
optimization and present the results of \gls{grad-cam}. Finally, we
evaluate the aggregate model before and after hyperparameter
optimization.

\subsection{Object Detection}
\label{ssec:yolo-eval}
Of the \num{91479} images, around 10\% were used for the test
phase. These images contain a total of \num{12238} ground truth
labels. Table~\ref{tab:yolo-metrics} shows precision, recall and the
harmonic mean of both (the $\mathrm{F}_1$-score). The results indicate
that the model errs on the side of sensitivity because recall is
higher than precision. Although some detections are not labeled as
plants in the dataset, if there is a labeled plant in the ground truth
data, the chance is high that it will be detected. This behavior is in
line with how the model's detections are handled in practice. The
detections are drawn on the original image and the user is able to
check the bounding boxes visually. If there are wrong detections, the
user can ignore them and focus on the relevant ones instead. A higher
recall will thus serve the user's needs better than a high precision.
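For reference, the $\mathrm{F}_1$-score reported in the following
tables is the standard harmonic mean of precision and recall:
\begin{equation}
  \mathrm{F}_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}
                             {\mathrm{precision} + \mathrm{recall}}
\end{equation}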
\begin{table}[h]
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    {} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
    \midrule
    Plant & \num{0.547571} & \num{0.737866} & \num{0.628633} & \num{12238} \\
    \bottomrule
  \end{tabular}
  \caption{Precision, recall and $\mathrm{F}_1$-score for the object
    detection model.}
  \label{tab:yolo-metrics}
\end{table}
Figure~\ref{fig:yolo-ap} shows the \gls{ap} for the \gls{iou}
thresholds of \num{0.5} and \num{0.95}. Predicted bounding boxes with
an \gls{iou} of less than \num{0.5} are not taken into account for the
precision and recall values of table~\ref{tab:yolo-metrics}. The lower
the detection threshold, the more plants are detected. Conversely, a
higher detection threshold leaves potential plants undetected. The
precision-recall curves confirm this behavior because the area under
the curve for the threshold of \num{0.5} is higher than for the
threshold of \num{0.95} (\num{0.66} versus \num{0.41}). These values
are combined in COCO's \cite{lin2015} main evaluation metric, which is
the \gls{ap} averaged across the \gls{iou} thresholds from \num{0.5}
to \num{0.95} in \num{0.05} steps. This value is then averaged across
all classes and called \gls{map}. The object detection model achieves
a state-of-the-art \gls{map} of \num{0.5727} for the \emph{Plant} class.
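Written out, with $C$ the set of classes and
$T = \{0.5, 0.55, \ldots, 0.95\}$ the set of \gls{iou} thresholds,
the metric reads:
\begin{equation}
  \mathrm{mAP} = \frac{1}{|C|} \sum_{c \in C}
                 \frac{1}{|T|} \sum_{t \in T} \mathrm{AP}_c(t)
\end{equation}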
\begin{figure}
  \centering
  \includegraphics{graphics/APpt5-pt95.pdf}
  \caption[Object detection AP@0.5 and AP@0.95.]{Precision-recall
    curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
    \gls{ap} of a specific threshold is defined as the area under the
    precision-recall curve of that threshold. The \gls{map} across
    \gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
    steps (\gls{map}@0.5:0.95) is \num{0.5727}.}
  \label{fig:yolo-ap}
\end{figure}
\subsubsection{Hyperparameter Optimization}
\label{sssec:yolo-hyp-opt}

Turning to the evaluation of the optimized model on the test dataset,
table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
with the non-optimized version from table~\ref{tab:yolo-metrics},
precision is significantly higher by more than \num{8.5} percentage
points. Recall, however, is \num{3.5} percentage points lower. The
$\mathrm{F}_1$-score is higher by more than \num{3.7} percentage
points, which indicates that the optimized model is better overall
despite the lower recall. We argue that the lower recall value is a
suitable trade-off for the substantially higher precision considering
that the non-optimized model's precision is quite low at \num{0.55}.
\begin{table}[h]
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    {} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
    \midrule
    Plant & \num{0.633358} & \num{0.702811} & \num{0.666279} & \num{12238} \\
    \bottomrule
  \end{tabular}
  \caption{Precision, recall and $\mathrm{F}_1$-score for the
    optimized object detection model.}
  \label{tab:yolo-metrics-hyp}
\end{table}
The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
optimized model show that the model draws looser bounding boxes than
the non-optimized model. The \gls{ap} for both \gls{iou} thresholds of
\num{0.5} and \num{0.95} is lower, indicating worse performance. It is
likely that more iterations during evolution would help increase the
\gls{ap} values as well. Even though the precision and recall values
from table~\ref{tab:yolo-metrics-hyp} are better, the
\gls{map}@0.5:0.95 is lower by \num{1.8} percentage points.
\begin{figure}
  \centering
  \includegraphics{graphics/APpt5-pt95-final.pdf}
  \caption[Hyperparameter optimized object detection AP@0.5 and
    AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
    \num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
    defined as the area under the precision-recall curve of that
    threshold. The \gls{map} across \gls{iou} thresholds from
    \num{0.5} to \num{0.95} in \num{0.05} steps (\gls{map}@0.5:0.95) is
    \num{0.5546}.}
  \label{fig:yolo-ap-hyp}
\end{figure}
\subsection{Classification}
\label{ssec:classifier-eval}

In order to confirm that the optimized classification model does not
suffer from overfitting and is not a product of chance due to a
coincidentally advantageous train/test split, we perform stratified
$10$-fold cross validation on the dataset. Each fold contains 90\%
training and 10\% test data and was trained for \num{25}
epochs. Figure~\ref{fig:classifier-hyp-roc} shows the performance of
the epoch with the highest $\mathrm{F}_1$-score of each fold as
measured against the test split. The mean \gls{roc} curve provides a
robust metric for a classifier's performance because it averages out
the variability of the evaluation. Each fold manages to achieve at
least an \gls{auc} of \num{0.94}, while the best fold reaches
\num{0.99}. The mean \gls{roc} has an \gls{auc} of \num{0.96} with a
standard deviation of \num{0.02}. These results indicate that the
model is accurately predicting the correct class and is robust against
variations in the training set.
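The protocol corresponds to the following sketch using scikit-learn
(illustrative only; the \texttt{train\_fold} helper standing in for
the \num{25}-epoch training run is hypothetical):

\begin{verbatim}
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cross_validate(images, labels, n_splits=10, seed=0):
    # Stratified k-fold cross validation; returns mean and std of the AUC.
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for train_idx, test_idx in skf.split(images, labels):
        # Hypothetical helper: trains the classifier for 25 epochs.
        model = train_fold(images[train_idx], labels[train_idx])
        scores = model.predict_proba(images[test_idx])[:, 1]  # P(stressed)
        fpr, tpr, _ = roc_curve(labels[test_idx], scores)
        aucs.append(auc(fpr, tpr))
    return np.mean(aucs), np.std(aucs)
\end{verbatim}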
\begin{figure}
  \centering
  \includegraphics{graphics/classifier-hyp-folds-roc.pdf}
  $\mathrm{F}_1$-score of \num{1} on the training set.
  \label{fig:classifier-hyp-folds}
\end{figure}
\subsubsection{Class Activation Maps}
\label{sssec:classifier-cam}
explain why a decision was made in a certain way. The research field
of \gls{xai} gained significance during the last few years because of
the development of new methods to peek inside these black boxes.

One such method, \gls{cam} \cite{zhou2015}, is a popular tool to
produce visual explanations for decisions made by
\glspl{cnn}. Convolutional layers essentially function as object
detectors as long as no fully-connected layers perform the
be retained until the last layer and used to generate activation maps
for the predictions.

A more recent approach to generating a \gls{cam} via gradients is
proposed by \textcite{selvaraju2020}. Their \gls{grad-cam} approach
works by computing the gradient of the feature maps of the last
convolutional layer with respect to the specified class. The last
layer is chosen because the authors find that ``[…] Grad-CAM maps
  of the image during classification.
  \label{fig:classifier-cam}
\end{figure}
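To make the mechanics concrete, the following is a minimal PyTorch
sketch of the \gls{grad-cam} computation described above. It is an
illustration for a generic \gls{cnn} classifier, not the
implementation used in this work.

\begin{verbatim}
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    # Compute a Grad-CAM heatmap for one image of shape (1, C, H, W).
    activations, gradients = [], []
    fwd = target_layer.register_forward_hook(
        lambda module, args, output: activations.append(output))
    bwd = target_layer.register_full_backward_hook(
        lambda module, grad_in, grad_out: gradients.append(grad_out[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()  # gradient w.r.t. the specified class
    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]       # (1, K, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)   # pooled gradients per map
    cam = F.relu((weights * acts).sum(dim=1))        # weighted sum of maps
    return cam / (cam.max() + 1e-8)                  # normalized to [0, 1]
\end{verbatim}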
\subsection{Aggregate Model}
\label{ssec:aggregate-model}
complete pipeline from gathering detections of potential plants in an
image and forwarding them to the classifier to obtaining the results
as either healthy or stressed with their associated confidence scores.
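Conceptually, the aggregate model is a simple two-stage pipeline; a
minimal sketch (the \texttt{detector} and \texttt{classifier} wrappers
around the two models are hypothetical):

\begin{verbatim}
def aggregate_predict(image):
    # Stage 1: detect all plants; Stage 2: classify each crop.
    results = []
    for box in detector(image):                # hypothetical detector wrapper
        crop = image[box.y_min:box.y_max, box.x_min:box.x_max]
        label, confidence = classifier(crop)   # "healthy" or "stressed"
        results.append((box, label, confidence))
    return results
\end{verbatim}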
\subsection{Non-optimized Model}
\label{ssec:model-non-optimized}
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    {} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
    \midrule
    Healthy & \num{0.665} & \num{0.554} & \num{0.604} & \num{766} \\
    Stressed & \num{0.639} & \num{0.502} & \num{0.562} & \num{494} \\
    Micro Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
    Macro Avg & \num{0.652} & \num{0.528} & \num{0.583} & \num{1260} \\
    Weighted Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
    \bottomrule
  \end{tabular}
  \caption{Precision, recall and $\mathrm{F}_1$-score for the
Table~\ref{tab:model-metrics} shows precision, recall and the
$\mathrm{F}_1$-score for both classes, \emph{Healthy} and
\emph{Stressed}. Precision is higher than recall for both classes and
the $\mathrm{F}_1$-score is at \num{0.59}. Unfortunately, these values
do not take the accuracy of bounding boxes into account and thus have
only limited expressive power.

Figure~\ref{fig:aggregate-ap} shows the precision and recall curves
for both classes at different \gls{iou} thresholds. The left plot
shows the \gls{ap} for each class at the threshold of \num{0.5} and
the right one at \num{0.95}. The \gls{map} is \num{0.3581} and
calculated across all classes as the mean \gls{ap} over the \gls{iou}
thresholds from \num{0.5} to \num{0.95} in \num{0.05} steps. The
cliffs at around \num{0.6} (left) and \num{0.3} (right) happen at a
detection threshold of \num{0.5}. The classifier's last layer is a
softmax layer which necessarily transforms the input into a
probability of showing either a healthy or stressed plant. If the
probability of an image showing a healthy plant is below \num{0.5}, it
is no longer classified as healthy but as stressed. The threshold for
discriminating the two classes lies at the \num{0.5} value and is
therefore the cutoff for either class.
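Formally, with $p$ denoting the softmax output for the healthy class,
the decision rule reduces to:
\begin{equation}
  \hat{y} =
  \begin{cases}
    \text{Healthy} & \text{if } p \geq 0.5 \\
    \text{Stressed} & \text{otherwise}
  \end{cases}
\end{equation}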
\begin{figure}
  \centering
  \includegraphics{graphics/APmodel-model-optimized-relabeled.pdf}
  \caption[Aggregate model AP@0.5 and AP@0.95.]{Precision-recall
    curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
    \gls{ap} of a specific threshold is defined as the area under the
    precision-recall curve of that threshold. The \gls{map} across
    \gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
    steps (\gls{map}@0.5:0.95) is \num{0.3581}.}
  \label{fig:aggregate-ap}
\end{figure}
  \centering
  \begin{tabular}{lrrrr}
    \toprule
    {} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
    \midrule
    Healthy & \num{0.711} & \num{0.555} & \num{0.623} & \num{766} \\
    Stressed & \num{0.570} & \num{0.623} & \num{0.596} & \num{494} \\
    Micro Avg & \num{0.644} & \num{0.582} & \num{0.611} & \num{1260} \\
    Macro Avg & \num{0.641} & \num{0.589} & \num{0.609} & \num{1260} \\
    Weighted Avg & \num{0.656} & \num{0.582} & \num{0.612} & \num{1260} \\
    \bottomrule
  \end{tabular}
  \caption{Precision, recall and $\mathrm{F}_1$-score for the
Table~\ref{tab:model-metrics-hyp} shows precision, recall and the
$\mathrm{F}_1$-score for the optimized model on the same test dataset
of \num{640} images. All of the metrics are better for the optimized
model. In particular, precision for the healthy class could be
improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a 3 percentage point
gain for the $\mathrm{F}_1$-score in the stressed class. Overall,
precision is the same but recall has improved significantly, which
also results in a noticeable improvement in the average
$\mathrm{F}_1$-score across both classes.
\begin{figure}
  \centering
  \includegraphics{graphics/APModel-model-original-relabeled.pdf}
  \caption[Optimized aggregate model AP@0.5 and
    AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
    \num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
    defined as the area under the precision-recall curve of that
    threshold. The \gls{map} across \gls{iou} thresholds from
    \num{0.5} to \num{0.95} in \num{0.05} steps (\gls{map}@0.5:0.95) is
    \num{0.3838}.}
  \label{fig:aggregate-ap-hyp}
\end{figure}
Figure~\ref{fig:aggregate-ap-hyp} confirms the performance increase of
the optimized model established in
table~\ref{tab:model-metrics-hyp}. The \gls{map}@0.5 is higher for
both classes, indicating that the model better detects plants in
general. The \gls{map}@0.95 is slightly lower for the healthy class,
which means that the confidence for the healthy class is slightly
lower compared to the non-optimized model. The result is that more
plants are correctly detected and classified overall, but the
confidence scores tend to be lower with the optimized model. The
\gls{map}@0.5:0.95 could be improved by about \num{0.025}.
\section{Discussion}
\label{sec:discussion}

Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. The hyperparameter
optimization of the object detector, however, raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than that of the non-optimized version. Even
though precision and recall of the model improved, the bounding boxes
are worse. We argue that the hyperparameter optimization has to be run
for more than \num{87} iterations to provide better results. Searching
for the optimal hyperparameters with genetic methods usually requires
many more iterations than that because it takes a significant amount
of time to evolve the parameters \emph{away} from the starting
conditions.
Furthermore, we only train each iteration for three epochs and assume
that those already provide a good measure of the model's
performance. It can be seen in figure~\ref{fig:hyp-opt-fitness} that
the fitness during the first few epochs exhibits some amount of
variation before it stabilizes. In fact, the fitness of the
non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process.

The optimized classifier shows very strong performance in the
\num{10}-fold cross validation, where it achieves a mean \gls{auc} of
\num{0.96}.
\chapter{Conclusion}
\label{chap:conclusion}