Add result section

This commit is contained in:
Tobias Eidelpes 2023-12-20 10:39:19 +01:00
parent 326562ca85
commit dced4c6902
2 changed files with 348 additions and 311 deletions

@ -1995,7 +1995,7 @@ resource. However, because the results are published via a \gls{rest}
service, internet access is necessary to be able to retrieve the
predictions.
Other technical requirements are that the inference on the device for
both models does not take too long (i.e. not longer than a few seconds
per image). Even though plants are not known to grow extremely rapidly
from one minute to the next, keeping the inference time low results in
@ -2003,12 +2003,42 @@ a more resource efficient prototype. As such, it is possible to run
the device off a battery, which completes the self-contained nature
of the prototype.
From an evaluation perspective, the models should have high
sensitivity and specificity. In order to be useful for plant
water-stress detection, it is necessary to identify as many
water-stressed plants as possible (sensitivity) while keeping the
number of false positives as low as possible (specificity). If the
number of water-stressed plants is severely overestimated, downstream
watering systems could damage the plants by overwatering. Conversely,
if the number of water-stressed plants is underestimated, some plants
are likely to die because their water stress goes undetected.
Furthermore, the models are required to attain a reasonable level of
precision as well as good localization of plants. It is difficult to
determine said levels beforehand, but considering the task at hand as
well as general object detection and classification benchmarks such as
\gls{coco} \cite{lin2015}, we expect a \gls{map} of around 40\% and
precision and recall values of 70\%.
Other basic model requirements are robust object detection and
classification as well as good generalizability. The prototype should
be able to function in different environments, where varying lighting
conditions, backgrounds, and camera angles do not have an impact on
model performance. Where feasible, models should be evaluated with
cross validation to ensure that the performance of the model on the
test set is a good indicator of its generalizability. In the same
vein, models should neither overfit nor underfit the training data,
as both also result in poor generalizability.
During the iterative process of training the models as well as for
evaluation purposes, the models should be interpretable. Especially
when there is comparatively little training data available, verifying
whether the model is focusing on the \emph{right} parts of an image
gives insight into its robustness and generalizability, which can increase
trust. Furthermore, if a model is clearly not focusing on the right
parts of an image, interpretability can help debug where the problem
lies. Interpretability is thus an important property of any model so
that the model engineer is able to steer the training and inference
process in the right direction.
\section{Design}
\label{sec:design}
@ -2741,11 +2771,11 @@ In order to improve the aforementioned accuracy values, we perform
hyperparameter optimization across a wide range of
parameters. Table~\ref{tab:classifier-hyps} lists the hyperparameters
and their possible values. Since the number of all combinations of
values is \num{11520} and each combination is trained for ten epochs
with a training time of approximately six minutes per combination,
exhausting the search space would take \num{48} days. Due to time
limitations, we have chosen not to search exhaustively but to pick
random combinations instead. Random search works surprisingly
well---especially compared to grid search---in a number of domains, one of
which is hyperparameter optimization \cite{bergstra2012}.
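The following sketch illustrates this random-search procedure. It is a
simplified illustration rather than the actual training code: the
parameter grid shown is only an example, and \verb|train_for_epochs| is
a stub standing in for a real training run that returns the validation
accuracy of a configuration.

\begin{verbatim}
import random

# Illustrative search space only; the real grid from the hyperparameter
# table contains 11520 combinations in total.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "weight_decay": [0.0, 1e-5, 1e-4],
    "optimizer": ["sgd", "adam"],
}

def train_for_epochs(config, epochs=10):
    """Stub for the actual training run; returns a validation accuracy."""
    return random.random()

def random_search(num_trials):
    best_config, best_accuracy = None, 0.0
    for _ in range(num_trials):
        # Draw one random combination instead of enumerating the full grid.
        config = {name: random.choice(values)
                  for name, values in SEARCH_SPACE.items()}
        accuracy = train_for_epochs(config, epochs=10)
        if accuracy > best_accuracy:
            best_config, best_accuracy = config, accuracy
    return best_config, best_accuracy
\end{verbatim}

With, for instance, \num{138} random trials, the chance of sampling at
least one configuration from the best 1\% of the grid is already around
75\% (cf.\ equation~\ref{eq:opt-prob}).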
@ -2782,7 +2812,9 @@ the observation that almost all configurations converge well before
reaching the tenth epoch. The assumption that a training run with ten
epochs provides a good proxy for final performance is supported by the
quick convergence of validation accuracy and loss in
figure~\ref{fig:classifier-training-metrics}. Table~\ref{tab:classifier-final-hyps}
lists the final hyperparameters which were chosen to train the
improved model.
\begin{equation}\label{eq:opt-prob}
1 - (1 - 0.01)^{138} \approx 0.75
@ -2808,23 +2840,6 @@ figure~\ref{fig:classifier-training-metrics}.
\label{fig:classifier-hyp-results}
\end{figure}
\begin{table}
\centering
\begin{tabular}{cccc}
@ -2842,6 +2857,232 @@ against variations in the training set.
\label{tab:classifier-final-hyps}
\end{table}
\section{Deployment}
After training the two models (object detector and classifier), we
export them to the \gls{onnx}\footnote{\url{https://github.com/onnx}}
format and move the model files to the Nvidia Jetson Nano. On the
device, a Flask application (\emph{server}) provides a \gls{rest}
endpoint from which the results of the most recent prediction can be
queried. The server periodically performs the following steps:
\begin{enumerate}
\item Call a binary which takes an image and writes it to a file.
\item Take the image and detect all plants as well as their status
using the two models.
\item Draw the returned bounding boxes onto the original image.
\item Number each detection from left to right.
\item Coerce the prediction for each bounding box into a tuple
$\langle I, S, T,\Delta T \rangle$.
\item Store the image with the bounding boxes and an array of all
tuples (predictions) in a dictionary.
\item Wait two minutes.
\item Go to step one.
\end{enumerate}
The binary uses the accelerated GStreamer implementation by Nvidia to
take an image. The tuple $\langle I, S, T,\Delta T \rangle$ consists of the following
items: $I$ is the number of the bounding box in the image, $S$ the
current state from one to ten, $T$ the timestamp of the prediction,
and $\Delta T$ the time since the state $S$ last fell below three. The
server performs these tasks asynchronously in the background and is
always ready to respond to requests with the most recent prediction.
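A condensed sketch of this server structure is shown below. It is an
illustration of the described loop rather than the actual
implementation; the endpoint path and the helpers \verb|capture_image|,
\verb|detect_and_classify|, and \verb|draw_boxes| are assumed
placeholders for the camera binary and the two \gls{onnx} models.

\begin{verbatim}
import threading
import time
from datetime import datetime

from flask import Flask, jsonify

# The helpers below stand in for project-specific code and are not shown here:
#   capture_image()            -> path of a freshly taken image (camera binary)
#   detect_and_classify(path)  -> list of boxes with state and timing info
#   draw_boxes(path, boxes)    -> image with the bounding boxes drawn onto it

app = Flask(__name__)
latest = {"image": None, "predictions": []}   # most recent result

def prediction_loop():
    while True:
        path = capture_image()
        boxes = detect_and_classify(path)
        predictions = []
        # Number the detections from left to right.
        for i, box in enumerate(sorted(boxes, key=lambda b: b["x"]), start=1):
            predictions.append({
                "I": i,                            # index of the bounding box
                "S": box["state"],                 # state from one to ten
                "T": datetime.now().isoformat(),   # timestamp of the prediction
                "dT": box["since_below_three"],    # time since S last fell below three
            })
        latest["image"] = draw_boxes(path, boxes)
        latest["predictions"] = predictions
        time.sleep(120)                            # wait two minutes

@app.route("/prediction")
def prediction():
    # Always responds with the most recent prediction.
    return jsonify(latest["predictions"])

if __name__ == "__main__":
    threading.Thread(target=prediction_loop, daemon=True).start()
    app.run(host="0.0.0.0")
\end{verbatim}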
This chapter detailed the training and deployment of the two models
used for the plant water-stress detection system---the object detector
and the classifier. Furthermore, we have specified the \gls{api} which
publishes the results continuously. We will now turn towards the
evaluation of the two separate models as well as the aggregate model.
\chapter{Evaluation}
\label{chap:evaluation}
The following sections contain a detailed evaluation of the model in
various scenarios. First, we describe the test datasets as well as the
metrics used for assessing model performance. Second, we present the
results of the evaluation and analyze the behavior of the classifier
with \gls{grad-cam}. Finally, we discuss the results with respect to
the research questions defined in section~\ref{sec:motivation}.
\section{Methodology}
\label{sec:methodology}
In order to evaluate the object detection model and the classification
model, we analyze their predictions on test datasets. For the object
detection model, the test dataset is a 10\% split of the original
dataset which we describe in section~\ref{ssec:obj-train-dataset}. The
classifier is evaluated with a \num{10}-fold cross validation from the
original dataset (see section~\ref{ssec:class-train-dataset}). After
the evaluation of both models individually, we evaluate the model in
aggregate on a new dataset. This is necessary because the prototype
uses the two models as if they were one. The aggregate performance is
ultimately the most important measure to decide if the prototype is
able to meet the requirements.
The test set for the aggregate model contains \num{640} images which
were obtained from a Google search using the terms \emph{thirsty
plant}, \emph{wilted plant} and \emph{stressed plant}. Images which
clearly show one or multiple plants with some amount of visible stress
were added to the dataset. Care was taken to include plants with
various degrees of stress and in various locations and lighting
conditions. The search not only provided images of stressed plants,
but also of healthy plants. The dataset is biased towards potted
plants, which are commonly put on display in Western
households. Furthermore, many plants, such as succulents, are sought
after for home environments because of their ease of maintenance. Due
to the inclusion of such plants in the dataset and the distinct way in
which they exhibit water stress, the test set covers a wide variety of
scenarios.
After collecting the images, the aggregate model was run on them to
obtain initial bounding boxes and classifications for ground truth
labeling. Letting the model do the work beforehand and then correcting
the labels made it possible to include more images in the test set
because they could be labeled more quickly. Additionally, going over
the detections and classifications provided a comprehensive view of
the models' strengths and weaknesses. After correcting the labels, the
ground truth of the test set contains \num{766} bounding boxes of
healthy plants and \num{494} of stressed plants.
\section{Results}
\label{sec:results}
This section presents the results of the evaluation of the constituent
models as well as the aggregate model. First, we evaluate the object
detection model before and after hyperparameter optimization. Second,
we evaluate the performance of the classifier after hyperparameter
optimization and present the results of \gls{grad-cam}. Finally, we
evaluate the aggregate model before and after hyperparameter
optimization.
\subsection{Object Detection}
\label{ssec:yolo-eval}
Of the \num{91479} images, around 10\% were used for the test
phase. These images contain a total of \num{12238} ground truth
labels. Table~\ref{tab:yolo-metrics} shows precision, recall and the
harmonic mean of both ($\mathrm{F}_1$-score). The results indicate
that the model errs on the side of sensitivity because recall is
higher than precision. Although the model produces some detections
that are not labeled as plants in the dataset, a plant that is labeled
in the ground truth data has a high chance of being detected. This behavior is in
line with how the model's detections are handled in practice. The
detections are drawn on the original image and the user is able to
check the bounding boxes visually. If there are wrong detections, the
user can ignore them and focus on the relevant ones instead. A higher
recall will thus serve the user's needs better than a high precision.
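For reference, the $\mathrm{F}_1$-score reported in
table~\ref{tab:yolo-metrics} is the harmonic mean of precision and
recall,
\[
  \mathrm{F}_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},
\]
which for the values above amounts to
$2 \cdot 0.548 \cdot 0.738 / (0.548 + 0.738) \approx 0.63$.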
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & \num{0.547571} & \num{0.737866} & \num{0.628633} & \num{12238.0} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the object
detection model.}
\label{tab:yolo-metrics}
\end{table}
Figure~\ref{fig:yolo-ap} shows the \gls{ap} for the \gls{iou}
thresholds of \num{0.5} and \num{0.95}. Predicted bounding boxes with
an \gls{iou} of less than \num{0.5} are not taken into account for the
precision and recall values of table~\ref{tab:yolo-metrics}. The lower
the detection threshold, the more plants are detected. Conversely, a
higher detection threshold leaves potential plants undetected. The
precision-recall curves confirm this behavior because the area under
the curve for the threshold of \num{0.5} is higher than for the
threshold of \num{0.95} (\num{0.66} versus \num{0.41}). These values
are combined in COCO's \cite{lin2015} main evaluation metric which is
the \gls{ap} averaged across the \gls{iou} thresholds from \num{0.5}
to \num{0.95} in \num{0.05} steps. This value is then averaged across
all classes and called \gls{map}. The object detection model achieves
a state-of-the-art \gls{map} of \num{0.5727} for the \emph{Plant} class.
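A minimal sketch of this aggregation is given below;
\verb|average_precision| stands in for a routine that computes the area
under the precision-recall curve of one class at one \gls{iou}
threshold and is not defined here.

\begin{verbatim}
import numpy as np

def mean_average_precision(average_precision, classes):
    """mAP@0.5:0.95: average AP over the IoU thresholds 0.5, 0.55, ..., 0.95
    and then over all classes."""
    thresholds = np.linspace(0.5, 0.95, 10)
    per_class = [np.mean([average_precision(cls, t) for t in thresholds])
                 for cls in classes]
    return float(np.mean(per_class))
\end{verbatim}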
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95.pdf}
\caption[Object detection AP@0.5 and AP@0.95.]{Precision-recall
curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
\gls{ap} of a specific threshold is defined as the area under the
precision-recall curve of that threshold. The \gls{map} across
\gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
steps \gls{map}@0.5:0.95 is \num{0.5727}.}
\label{fig:yolo-ap}
\end{figure}
\subsubsection{Hyperparameter Optimization}
\label{sssec:yolo-hyp-opt}
Turning to the evaluation of the optimized model on the test dataset,
table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
with the non-optimized version from table~\ref{tab:yolo-metrics},
precision is significantly higher by more than \num{8.5} percentage
points. Recall, however, is \num{3.5} percentage points lower. The
$\mathrm{F}_1$-score is higher by more than \num{3.7} percentage
points which indicates that the optimized model is better overall
despite the lower recall. We argue that the lower recall value is a
suitable trade-off for the substantially higher precision considering
that the non-optimized model's precision is quite low at \num{0.55}.
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
\midrule
Plant & \num{0.633358} & \num{0.702811} & \num{0.666279} & \num{12238.0} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
optimized object detection model.}
\label{tab:yolo-metrics-hyp}
\end{table}
The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
optimized model show that the model draws looser bounding boxes than
the non-optimized model. The \gls{ap} for both \gls{iou} thresholds of
\num{0.5} and \num{0.95} is lower, indicating worse performance. It is
likely that more iterations during evolution would help increase the
\gls{ap} values as well. Even though the precision and recall values
from table~\ref{tab:yolo-metrics-hyp} are better, the
\gls{map}@0.5:0.95 is lower by \num{1.8} percentage points.
\begin{figure}
\centering
\includegraphics{graphics/APpt5-pt95-final.pdf}
\caption[Hyper-parameter optimized object detection AP@0.5 and
AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
\num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
defined as the area under the precision-recall curve of that
threshold. The \gls{map} across \gls{iou} thresholds from
\num{0.5} to \num{0.95} in \num{0.05} steps \gls{map}@0.5:0.95 is
\num{0.5546}.}
\label{fig:yolo-ap-hyp}
\end{figure}
\subsection{Classification}
\label{ssec:classifier-eval}
In order to confirm that the optimized classification model neither
suffers from overfitting nor owes its performance to a coincidentally
advantageous train/test split, we perform stratified
$10$-fold cross validation on the dataset. Each fold contains 90\%
training and 10\% test data and was trained for \num{25}
epochs. Figure~\ref{fig:classifier-hyp-roc} shows the performance of
the epoch with the highest $\mathrm{F}_1$-score of each fold as
measured against the test split. The mean \gls{roc} curve provides a
robust metric for a classifier's performance because it averages out
the variability of the evaluation. Each fold manages to achieve at
least an \gls{auc} of \num{0.94}, while the best fold reaches
\num{0.99}. The mean \gls{roc} has an \gls{auc} of \num{0.96} with a
standard deviation of \num{0.02}. These results indicate that the
model is accurately predicting the correct class and is robust against
variations in the training set.
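The cross-validation procedure can be summarized by the following
sketch. The scikit-learn utilities are real; the feature array
\verb|X|, the label array \verb|y|, and the
\verb|train_classifier|/\verb|predict_scores| helpers are placeholders
for the actual training pipeline.

\begin{verbatim}
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cross_validate(X, y, train_classifier, predict_scores, folds=10):
    """Stratified k-fold CV; returns mean and standard deviation of the AUC."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    aucs = []
    for train_idx, test_idx in skf.split(X, y):
        model = train_classifier(X[train_idx], y[train_idx], epochs=25)  # placeholder
        scores = predict_scores(model, X[test_idx])  # probability of the positive class
        fpr, tpr, _ = roc_curve(y[test_idx], scores)
        aucs.append(auc(fpr, tpr))
    return float(np.mean(aucs)), float(np.std(aucs))
\end{verbatim}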
\begin{figure}
\centering
\includegraphics{graphics/classifier-hyp-folds-roc.pdf}
@ -2893,210 +3134,6 @@ $\mathrm{F}_1$-score of \num{1} on the training set.
\label{fig:classifier-hyp-folds}
\end{figure}
\subsubsection{Class Activation Maps}
\label{sssec:classifier-cam}
@ -3110,7 +3147,7 @@ explain why a decision was made in a certain way. The research field
of \gls{xai} gained significance during the last few years because of
the development of new methods to peek inside these black boxes.
One such method, \gls{cam} \cite{zhou2015}, is a popular tool to
produce visual explanations for decisions made by
\glspl{cnn}. Convolutional layers essentially function as object
detectors as long as no fully-connected layers perform the
@ -3120,7 +3157,7 @@ be retained until the last layer and used to generate activation maps
for the predictions.
A more recent approach to generating a \gls{cam} via gradients is
proposed by \textcite{selvaraju2020}. Their \gls{grad-cam} approach
works by computing the gradient of the score for the specified class
with respect to the feature maps of the last convolutional layer. The last
layer is chosen because the authors find that ``[…] Grad-CAM maps
@ -3156,7 +3193,6 @@ of the image during classification.
\label{fig:classifier-cam}
\end{figure}
\subsection{Aggregate Model}
\label{ssec:aggregate-model}
@ -3167,31 +3203,6 @@ complete pipeline from gathering detections of potential plants in an
image and forwarding them to the classifier to obtaining the results
as either healthy or stressed with their associated confidence scores.
\subsection{Non-optimized Model}
\label{ssec:model-non-optimized}
@ -3199,13 +3210,13 @@ bounding boxes of healthy plants and 494 of stressed plants.
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
\midrule
Healthy & \num{0.665} & \num{0.554} & \num{0.604} & \num{766} \\
Stressed & \num{0.639} & \num{0.502} & \num{0.562} & \num{494} \\
Micro Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
Macro Avg & \num{0.652} & \num{0.528} & \num{0.583} & \num{1260} \\
Weighted Avg & \num{0.655} & \num{0.533} & \num{0.588} & \num{1260} \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
@ -3216,33 +3227,34 @@ bounding boxes of healthy plants and 494 of stressed plants.
Table~\ref{tab:model-metrics} shows precision, recall and the
$\mathrm{F}_1$-score for both classes \emph{Healthy} and
\emph{Stressed}. Precision is higher than recall for both classes and
the $\mathrm{F}_1$-score is at \num{0.59}. Unfortunately, these values
do not take the accuracy of bounding boxes into account and thus have
only limited expressive power.
Figure~\ref{fig:aggregate-ap} shows the precision-recall curves
for both classes at different \gls{iou} thresholds. The left plot
shows the \gls{ap} for each class at the threshold of \num{0.5} and
the right one at \num{0.95}. The \gls{map} is \num{0.3581} and
calculated across all classes as the mean over the \gls{iou}
thresholds from \num{0.5} to \num{0.95} in \num{0.05} steps. The
cliffs at around \num{0.6} (left) and \num{0.3} (right) happen at a
detection threshold of \num{0.5}. The classifier's last layer is a
softmax layer which necessarily transforms the input into a
probability of showing either a healthy or stressed plant. If the
probability of an image showing a healthy plant is below \num{0.5}, it
is no longer classified as healthy but as stressed. The threshold for
discriminating the two classes lies at the \num{0.5} value and is
therefore the cutoff for either class.
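As a small illustration of this cutoff (the two-element output vector
and the class order are assumptions about the classifier's final
layer):

\begin{verbatim}
import numpy as np

def classify(logits):
    """Apply softmax to the classifier's two outputs and use 0.5 as the cutoff."""
    exp = np.exp(logits - np.max(logits))
    p_healthy, p_stressed = exp / exp.sum()      # assumed order: [healthy, stressed]
    # With two classes the probabilities sum to one, so the decision boundary
    # lies exactly at 0.5: below it the detection is classified as stressed.
    return ("Healthy", p_healthy) if p_healthy >= 0.5 else ("Stressed", p_stressed)
\end{verbatim}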
\begin{figure}
\centering
\includegraphics{graphics/APmodel-model-optimized-relabeled.pdf}
\caption[Aggregate model AP@0.5 and AP@0.95.]{Precision-recall
curves for \gls{iou} thresholds of \num{0.5} and \num{0.95}. The
\gls{ap} of a specific threshold is defined as the area under the
precision-recall curve of that threshold. The \gls{map} across
\gls{iou} thresholds from \num{0.5} to \num{0.95} in \num{0.05}
steps \gls{map}@0.5:0.95 is \num{0.3581}.}
\label{fig:aggregate-ap}
\end{figure}
@ -3262,13 +3274,13 @@ section~\ref{ssec:aggregate-model}.
\centering
\begin{tabular}{lrrrr}
\toprule
{} & Precision & Recall & $\mathrm{F}_{1}$-score & Support \\
\midrule
Healthy & \num{0.711} & \num{0.555} & \num{0.623} & \num{766} \\
Stressed & \num{0.570} & \num{0.623} & \num{0.596} & \num{494} \\
Micro Avg & 0.644 & 0.582 & 0.611 & 1260 \\
Macro Avg & 0.641 & 0.589 & 0.609 & 1260 \\
Weighted Avg & 0.656 & 0.582 & 0.612 & 1260 \\
\bottomrule
\end{tabular}
\caption{Precision, recall and $\mathrm{F}_1$-score for the
@ -3278,49 +3290,74 @@ section~\ref{ssec:aggregate-model}.
Table~\ref{tab:model-metrics-hyp} shows precision, recall and
$\mathrm{F}_1$-score for the optimized model on the same test dataset
of \num{640} images. Overall, the metrics are better for the optimized
model. In particular, precision for the healthy class could be
improved significantly while recall remains at the same level. This
results in a better $\mathrm{F}_1$-score for the healthy
class. Precision for the stressed class is lower with the optimized
model, but recall is significantly higher (\num{0.502}
vs. \num{0.623}). The higher recall results in a 3 percentage point
gain for the $\mathrm{F}_1$-score in the stressed class. Overall,
precision is the same but recall has improved significantly, which
also results in a noticeable improvement for the average
$\mathrm{F}_1$-score across both classes.
\begin{figure}
\centering
\includegraphics{graphics/APModel-model-original-relabeled.pdf}
\caption[Optimized aggregate model AP@0.5 and
AP@0.95.]{Precision-recall curves for \gls{iou} thresholds of
\num{0.5} and \num{0.95}. The \gls{ap} of a specific threshold is
defined as the area under the precision-recall curve of that
threshold. The \gls{map} across \gls{iou} thresholds from
\num{0.5} to \num{0.95} in \num{0.05} steps \gls{map}@0.5:0.95 is
\num{0.3838}.}
\label{fig:aggregate-ap-hyp}
\end{figure}
Figure~\ref{fig:aggregate-ap-hyp} confirms the performance increase of
the optimized model established in
table~\ref{tab:model-metrics-hyp}. The \gls{map}@0.5 is higher for
both classes, indicating that the model better detects plants in
general. The \gls{map}@0.95 is slightly lower for the healthy class,
which means that the confidence for the healthy class is slightly
lower compared to the non-optimized model. The result is that more
plants are correctly detected and classified overall, but the
confidence scores tend to be lower with the optimized model. The
\gls{map}@0.5:0.95 could be improved by about \num{0.025}.
\section{Discussion}
\label{sec:discussion}
Overall, the performance of the individual models is state of the art
when compared with object detection benchmarks such as the \gls{coco}
dataset. The \gls{map} of \num{0.5727} for the object detection model
is in line with most other object detectors. The hyperparameter
optimization of the object detector, however, raises further
questions. The \gls{map} of the optimized model is \num{1.8}
percentage points lower than the non-optimized version. Even though
precision and recall of the model improved, the bounding boxes are
worse. We argue that the hyperparameter optimization has to be run for
more than \num{87} iterations to provide better results. Searching for
the optimal hyperparameters with genetic methods usually requires many
more iterations than that because it takes a significant amount of
time to evolve the parameters \emph{away} from the starting
conditions.
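To illustrate why the evolution needs many iterations, the following
sketch shows one simplified mutation step of such a genetic search;
the bounds, the mutation strength, and the selection of a single best
parent are assumptions, not the exact procedure used.

\begin{verbatim}
import random

def mutate(parent, bounds, sigma=0.2):
    """Create a child configuration by perturbing the best parent so far."""
    child = {}
    for name, value in parent.items():
        low, high = bounds[name]
        # Multiplicative Gaussian noise keeps values close to the parent, which
        # is why many generations are needed to move far from the initial values.
        child[name] = min(max(value * (1.0 + random.gauss(0.0, sigma)), low), high)
    return child
\end{verbatim}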
Furthermore, we only train each iteration for three epochs and assume
that those already provide a good measure of the model's
performance. It can be seen in figure~\ref{fig:hyp-opt-fitness} that
the fitness during the first few epochs exhibits some amount of
variation before it stabilizes. In fact, the fitness of the
non-optimized object detector (figure~\ref{fig:fitness}) only achieves
a stable value at epoch \num{50}. An optimized model is often able to
converge faster, which is supported by
figure~\ref{fig:hyp-opt-fitness}, but even in that case it takes more
than ten epochs to stabilize the training process.
The optimized classifier
shows very strong performance in the \num{10}-fold cross validation,
where it achieves a mean \gls{auc} of \num{0.96}.
\chapter{Conclusion}
\label{chap:conclusion}