Fix various consistency errors
This commit is contained in:
parent bd56ced119
commit a3f0222a7f
@@ -183,7 +183,7 @@ learning.
 Large-scale as well as small local farmers are able to survey their
 fields and gardens with drones or stationary cameras to determine soil
 and plant condition as well as when to water or
-fertilize~\cite{ramos-giraldo2020}. Machine learning models play an
+fertilize \cite{ramos-giraldo2020}. Machine learning models play an
 important role in that process because they allow automated
 decision-making in real time. While machine learning has been used in
 large-scale agriculture, it is also a valuable tool for household
@@ -199,11 +199,11 @@ are numerous. First, gathering data in the field requires a network of
 sensors which are linked to a central server for processing. Since
 communication between sensors is difficult without proper
 infrastructure, there is a high demand for processing the data on the
-sensor itself~\cite{mcenroe2022}. Second, differences in local soil,
+sensor itself \cite{mcenroe2022}. Second, differences in local soil,
 plant and weather conditions require models to be optimized for these
 diverse inputs. Centrally trained models often lose the nuances
 present in the data because they have to provide actionable
-information for a larger area~\cite{awad2019}. Third, specialized
+information for a larger area \cite{awad2019}. Third, specialized
 methods such as hyper- or multispectral imaging in the field provide
 fine-grained information about the object of interest but come with
 substantial upfront costs and are of limited interest for gardeners.
@@ -224,7 +224,7 @@ plants in the field of view and then to determine if the plants need
 water or not. The model should be suitable for edge devices equipped
 with a \gls{tpu} or \gls{gpu} but with otherwise limited processing
 capabilities. Examples of such systems include Google's Coral
-development board and the Nvidia Jetson series of~\glspl{sbc}. The
+development board and the Nvidia Jetson series of \glspl{sbc}. The
 model should make use of state-of-the-art algorithms from either
 classical machine learning or deep learning. The literature review
 will yield an appropriate machine learning method. Furthermore, the
@@ -325,19 +325,19 @@ further insights about the type of models which are commonly used.
 
 In order to find and select appropriate datasets to train the models
 on, we will survey the existing big datasets for classes we can
-use. Datasets such as the \gls{coco}~\cite{lin2015} and
-\gls{voc}~\cite{everingham2010} contain the highly relevant class
+use. Datasets such as the \gls{coco} \cite{lin2015} and
+\gls{voc} \cite{everingham2010} contain the highly relevant class
 \emph{Potted Plant}. By extracting only these classes from multiple
 datasets and concatenating them together, it is possible to create one
 unified dataset which only contains the classes necessary for training
 the model.
 
 The training of the models will happen in an environment where more
-computational resources are available than what the~\gls{sbc}
-offers. We will deploy the final model with the~\gls{api} to
-the~\gls{sbc} after training and optimization. Furthermore, training
-will happen in tandem with a continuous evaluation process. After
-every iteration of the model, an evaluation run against the test set
+computational resources are available than what the \gls{sbc}
+offers. We will deploy the final model with the \gls{api} to the
+\gls{sbc} after training and optimization. Furthermore, training will
+happen in tandem with a continuous evaluation process. After every
+iteration of the model, an evaluation run against the test set
 determines if there has been an improvement in performance. The
 results of the evaluation feed back into the parameter selection at
 the beginning of each training phase. Small changes to the training
@@ -357,7 +357,7 @@ has been met, and—if not—give reasons for the rejection of all or part
 of the hypotheses.
 
 Overall, the development of our application follows an evolutionary
-pototyping process~\cite{davis1992,sears2007}. Instead of producing a
+prototyping process \cite{davis1992,sears2007}. Instead of producing a
 full-fledged product from the start, development happens iteratively
 in phases. The main phases and their order for the prototype at hand
 are: model selection, implementation, and evaluation. The results of
@@ -404,7 +404,7 @@ results of the testing phases as well as the performance of the
 aggregate model. Furthermore, the results are compared with the
 expectations and it is discussed whether they are explainable in the
 context of the task at hand as well as benchmark results from other
-datasets (\gls{coco}~\cite{lin2015}). Chapter~\ref{chap:conclusion}
+datasets (\gls{coco} \cite{lin2015}). Chapter~\ref{chap:conclusion}
 concludes the thesis with a summary and an outlook on possible
 improvements and further research questions.
 
@@ -685,8 +685,8 @@ network and is, therefore, not suitable for complex intra-data
 relationships. A major downside to using the Heaviside step function
 is that it is not differentiable at $x = 0$ and has a $0$ derivative
 elsewhere. These properties make it unsuitable for use with gradient
-descent during back-propagation (section
-\ref{ssec:theory-back-propagation}).
+descent during backpropagation
+(section~\ref{ssec:theory-backprop}).
 
 \subsubsection{Sigmoid}
 \label{sssec:theory-sigmoid}
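A minimal numerical sketch of the problem described in this hunk,
assuming NumPy (illustrative only, not part of the thesis): the
Heaviside step function yields a zero gradient almost everywhere,
while a sigmoid always provides a usable gradient.

    import numpy as np

    def heaviside(x):
        return np.where(x >= 0.0, 1.0, 0.0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.array([-2.0, -0.5, 0.5, 2.0])
    eps = 1e-6

    # Central-difference approximation of the derivatives.
    d_step = (heaviside(x + eps) - heaviside(x - eps)) / (2 * eps)
    d_sig = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

    print(d_step)  # [0. 0. 0. 0.] -> no gradient signal for learning
    print(d_sig)   # strictly positive -> usable by gradient descent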
@@ -852,28 +852,28 @@ there is the case of binary random variables, i.e. only two classes to
 classify exist, the measure is called binary
 cross-entropy. Cross-entropy loss is known to outperform \gls{mse} for
 classification tasks and allows the model to be trained
-faster~\cite{simard2003}.
+faster \cite{simard2003}.
 
-\subsection{Back-Propagation}
-\label{ssec:theory-back-propagation}
+\subsection{Backpropagation}
+\label{ssec:theory-backprop}
 
 So far, information only flows forward through the network whenever a
 prediction for a particular input should be made. In order for a
 neural network to learn, information about the computed loss has to
 flow backward through the network. Only then can the weights at the
 individual neurons be updated. This type of information flow is termed
-\emph{back-propagation} \cite{rumelhart1986}. Back-propagation
-computes the gradient of a loss function with respect to the weights
-of a network for an input-output pair. The algorithm computes the
-gradient iteratively starting from the last layer and works its way
-backward through the network until it reaches the first layer.
+\emph{backpropagation} \cite{rumelhart1986}. Backpropagation computes
+the gradient of a loss function with respect to the weights of a
+network for an input-output pair. The algorithm computes the gradient
+iteratively starting from the last layer and works its way backward
+through the network until it reaches the first layer.
 
-Strictly speaking, back-propagation only computes the gradient, but
+Strictly speaking, backpropagation only computes the gradient, but
 does not determine how the gradient is used to learn the new
-weights. Once the back-propagation algorithm has computed the
-gradient, that gradient is passed to an algorithm which finds a local
-minimum of it. This step is usually performed by some variant of
-gradient descent \cite{cauchy1847}.
+weights. Once the backpropagation algorithm has computed the gradient,
+that gradient is passed to an algorithm which finds a local minimum of
+it. This step is usually performed by some variant of gradient descent
+\cite{cauchy1847}.
 
 \section{Object Detection}
 \label{sec:background-detection}
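To make the two roles concrete (backpropagation computes the gradient,
gradient descent applies it), here is a rough sketch for a single
sigmoid neuron with binary cross-entropy loss; all names and values
are illustrative assumptions, not taken from the thesis:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([0.5, -1.2, 3.0])  # input
    y = 1.0                         # target class
    w, b = np.zeros(3), 0.0         # weights and bias
    lr = 0.1                        # learning rate

    for _ in range(100):
        # Forward pass: prediction and loss.
        p = sigmoid(w @ x + b)
        loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))

        # Backward pass (backpropagation): for sigmoid plus binary
        # cross-entropy, the gradient w.r.t. the pre-activation
        # simplifies to (p - y); the chain rule then gives dw and db.
        dz = p - y
        dw, db = dz * x, dz

        # Gradient descent update using the computed gradient.
        w -= lr * dw
        b -= lr * db

    print(round(float(loss), 4))  # shrinks toward 0 as p -> y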
@@ -900,7 +900,7 @@ time.
 \label{sssec:obj-viola-jones}
 
 The first milestone was the face detector by
-~\textcite{viola2001,viola2001} which is able to perform face
+\textcite{viola2001,viola2001} which is able to perform face
 recognition on $384$ by $288$ pixel (grayscale) images with
 \qty{15}{fps} on a \qty{700}{\MHz} Intel Pentium III processor. The
 authors use an integral image representation where every pixel is the
@@ -909,7 +909,7 @@ representation allows them to quickly and efficiently calculate
 Haar-like features.
 
 The Haar-like features are passed to a modified AdaBoost
-algorithm~\cite{freund1995} which only selects the (presumably) most
+algorithm \cite{freund1995} which only selects the (presumably) most
 important features. At the end there is a cascading stage of
 classifiers where regions are only considered further if they are
 promising. Every additional classifier adds complexity, but once a
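The integral image trick behind these hunks can be sketched in a few
lines of NumPy (an illustration of the idea, not the authors'
implementation): once every entry stores the sum of all pixels above
and to the left, any rectangle sum, and therefore any Haar-like
feature, costs only four lookups.

    import numpy as np

    img = np.random.rand(288, 384)          # grayscale image
    ii = img.cumsum(axis=0).cumsum(axis=1)  # integral image

    def box_sum(ii, top, left, bottom, right):
        """Sum of img[top:bottom+1, left:right+1] via 4 lookups."""
        s = ii[bottom, right]
        if top > 0:
            s -= ii[top - 1, right]
        if left > 0:
            s -= ii[bottom, left - 1]
        if top > 0 and left > 0:
            s += ii[top - 1, left - 1]
        return s

    # A two-rectangle Haar-like feature: left half minus right half.
    feature = box_sum(ii, 10, 10, 33, 21) - box_sum(ii, 10, 22, 33, 33)

    assert np.isclose(box_sum(ii, 10, 10, 33, 33),
                      img[10:34, 10:34].sum())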
@@ -921,7 +921,7 @@ achieves comparable results to the state of the art in 2001.
 \subsubsection{HOG Detector}
 \label{sssec:obj-hog}
 
-The \gls{hog}~\cite{dalal2005} is a feature descriptor used in
+The \gls{hog} \cite{dalal2005} is a feature descriptor used in
 computer vision and image processing to detect objects in images. It
 is a detector which detects shape like other methods such as
 \gls{sift} \cite{lowe1999}. The idea is to use the distribution of
@@ -940,14 +940,14 @@ with images of 64 by 128 pixels and make sure that the image contains
 a margin of 16 pixels around the person. Decreasing the border by
 either enlarging the person or reducing the overall image size results
 in worse performance. Unfortunately, their method is far from being
-able to process images in real time—a 320 by 240 image takes roughly a
-second to process.
+able to process images in real time—a $320$ by $240$ image takes
+roughly a second to process.
 
 \subsubsection{Deformable Part-Based Model}
 \label{sssec:obj-dpm}
 
-\glspl{dpm}~\cite{felzenszwalb2008a} were the winners of the \gls{voc}
-challenge in the years 2007, 2008 and 2009. The method is heavily
+\glspl{dpm} \cite{felzenszwalb2008a} were the winners of the \gls{voc}
+challenge in the years 2007, 2008, and 2009. The method is heavily
 based on the previously discussed \gls{hog} since it also uses
 \gls{hog} descriptors internally. The authors' addition is the idea of
 learning how to decompose objects during training and
@@ -1008,25 +1008,25 @@ often not as efficient as one-stage detectors.
 
 \textcite{girshick2014} were the first to propose using feature
 representations of \glspl{cnn} for object detection. Their approach
-consists of generating around 2000 region proposals and passing these
-on to a \gls{cnn} for feature extraction. The fixed-length feature
-vector is used as input for a linear \gls{svm} which classifies the
-region. They name their method R-\gls{cnn}, where the R stands for
-region.
+consists of generating around $2000$ region proposals and passing
+these on to a \gls{cnn} for feature extraction. The fixed-length
+feature vector is used as input for a linear \gls{svm} which
+classifies the region. They name their method R-\gls{cnn}, where the R
+stands for region.
 
 R-\gls{cnn} uses selective search to generate region proposals
 \cite{uijlings2013}. The authors use selective search's \emph{fast
-mode} to generate the 2000 proposals and warp (i.e. aspect ratios are
-not retained) each proposal into the image dimensions required by the
-\gls{cnn}. The \gls{cnn}, which matches the architecture of AlexNet
-\cite{krizhevsky2012}, generates a $4096$-dimensional feature vector
-and each feature vector is scored by a linear \gls{svm} for each
-class. Scored regions are selected/discarded by comparing each region
-to other regions within the same class and rejecting them if there
-exists another region with a higher score and greater \gls{iou} than a
-threshold. The linear \gls{svm} classifiers are trained to only label
-a region as positive if the overlap, as measured by \gls{iou}, is
-above $0.3$.
+mode} to generate the $2000$ proposals and warp (i.e. aspect ratios
+are not retained) each proposal into the image dimensions required by
+the \gls{cnn}. The \gls{cnn}, which matches the architecture of
+AlexNet \cite{krizhevsky2012}, generates a $4096$-dimensional feature
+vector and each feature vector is scored by a linear \gls{svm} for
+each class. Scored regions are selected/discarded by comparing each
+region to other regions within the same class and rejecting them if
+there exists another region with a higher score and greater \gls{iou}
+than a threshold. The linear \gls{svm} classifiers are trained to only
+label a region as positive if the overlap, as measured by \gls{iou},
+is above $0.3$.
 
 While the approach of generating region proposals is not new, using a
 \gls{cnn} purely for feature extraction is. Unfortunately, R-\gls{cnn}
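The \gls{iou} measure used for both the rejection step and the
training labels in this hunk can be written down directly. The
following helper is an illustrative sketch (boxes as (x1, y1, x2, y2)
tuples is a convention assumed here, not dictated by the paper):

    def iou(a, b):
        """Intersection over union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    # Labeling rule with the 0.3 threshold mentioned above.
    overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175 ≈ 0.14
    positive = overlap > 0.3                       # False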
@@ -1132,15 +1132,15 @@ on all levels. \glspl{fpn} are an important building block of many
 state-of-the-art object detectors.
 
 A \gls{fpn} first computes the feature pyramid bottom-up with a
-scaling step of 2. The lower levels capture less semantic information
+scaling step of two. The lower levels capture less semantic information
 than the higher levels, but include more spatial information due to
 the higher granularity. In a second step, the \gls{fpn} upsamples the
 higher levels such that the dimensions of two consecutive layers are
 the same. The upsampled top layer is merged with the layer beneath it
-via element-wise addition and convolved with a $1\times 1$ convolutional
-layer to reduce channel dimensions and to smooth out potential
-artifacts introduced during the upsampling step. The results of that
-operation constitute the new \emph{top layer} and the process
+via element-wise addition and convolved with a one by one
+convolutional layer to reduce channel dimensions and to smooth out
+potential artifacts introduced during the upsampling step. The results
+of that operation constitute the new \emph{top layer} and the process
 continues with the layer below it until the finest resolution feature
 map is generated. In this way, the features of the different layers at
 different scales are fused to obtain a feature map with high semantic
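The top-down merge step can be sketched in a few lines of PyTorch.
This is a simplified reading of the description in this hunk (layer
names and tensor sizes are assumptions), not reference code:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FPNMerge(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # One by one convolution that smooths the merged map.
            self.smooth = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, top, below):
            # Upsample the coarser map to the finer map's dimensions.
            top = F.interpolate(top, size=below.shape[-2:],
                                mode="nearest")
            # Element-wise addition fuses the two pyramid levels.
            return self.smooth(top + below)

    merge = FPNMerge(channels=256)
    p5 = torch.randn(1, 256, 8, 8)    # coarse, semantically strong
    c4 = torch.randn(1, 256, 16, 16)  # finer, spatially precise
    p4 = merge(p5, c4)                # becomes the new top layer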
@@ -1216,7 +1216,7 @@ detect smaller and denser objects as well.
 
 The authors report results on \gls{voc} 2007 for their \gls{ssd}300
 and \gls{ssd}512 model varieties. The number refers to the size of the
-input images. \gls{ssd}300 outperforms Fast R-\gls{cnn} by 1.1
+input images. \gls{ssd}300 outperforms Fast R-\gls{cnn} by $1.1$
 percentage points (\gls{map} 66.9\% vs 68\%). \gls{ssd}512 outperforms
 Faster R-\gls{cnn} by 1.7\% \gls{map}. If trained on the \gls{voc}
 2007, 2012 and \gls{coco} train sets, \gls{ssd}512 achieves a
@@ -1343,7 +1343,7 @@ The idea of automatic generation of feature maps via \glspl{ann} gave
 rise to \glspl{cnn}. Early \glspl{cnn} \cite{lecun1989} were mostly
 discarded for practical applications because they require much more
 data during training than traditional methods and also more processing
-power during inference. Passing $224\times 224$ pixel images to a
+power during inference. Passing $224$ by $224$ pixel images to a
 \gls{cnn}, as is common today, was simply not feasible if one wanted a
 reasonable inference time. With the development of \glspl{gpu} and
 supporting software such as the \gls{cuda} toolkit, it was possible to
@@ -1367,24 +1367,24 @@ function. The error function with which the weights are updated is
 
 The architecture of LeNet-5 is composed of two convolutional layers,
 two pooling layers and a dense block of three fully-connected
-layers. The input image is a grayscale image of 32 by 32 pixels. The
-first convolutional layer generates six feature maps, each with a
-scale of 28 by 28 pixels. Each feature map is fed to a pooling layer
-which effectively downsamples the image by a factor of two. By
-aggregating each two by two area in the feature map via averaging, the
-authors are more likely to obtain relative (to each other) instead of
-absolute positions of the features. To make up for the loss in spatial
-resolution, the following convolutional layer increases the amount of
-feature maps to 16 which aims to increase the richness of the learned
-representations. Another pooling layer follows which reduces the size
-of each of the 16 feature maps to five by five pixels. A dense block
-of three fully-connected layers of 120, 84 and 10 neurons respectively
-serves as the actual classifier in the network. The last layer uses
-the euclidean \gls{rbf} to compute the class an image belongs to (0-9
-digits).
+layers. The input image is a grayscale image of $32$ by $32$
+pixels. The first convolutional layer generates six feature maps, each
+with a scale of $28$ by $28$ pixels. Each feature map is fed to a
+pooling layer which effectively downsamples the image by a factor of
+two. By aggregating each two by two area in the feature map via
+averaging, the authors are more likely to obtain relative (to each
+other) instead of absolute positions of the features. To make up for
+the loss in spatial resolution, the following convolutional layer
+increases the amount of feature maps to $16$ which aims to increase
+the richness of the learned representations. Another pooling layer
+follows which reduces the size of each of the $16$ feature maps to
+five by five pixels. A dense block of three fully-connected layers of
+120, 84 and 10 neurons respectively serves as the actual classifier in
+the network. The last layer uses the euclidean \gls{rbf} to compute
+the class an image belongs to (0-9 digits).
 
 The performance of LeNet-5 was measured on the \gls{mnist} database
-which consists of 70.000 labeled images of handwritten digits. The
+which consists of $70000$ labeled images of handwritten digits. The
 \gls{mse} on the test set is 0.95\%. This result is impressive
 considering that character recognition with a \gls{cnn} had not been
 done before. However, standard machine learning methods of the time,
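The layer sequence described in this hunk translates almost one to
one into modern code. A hedged PyTorch sketch, in which average
pooling and a plain linear output layer stand in for the original
trainable sub-sampling and Euclidean RBF layers (both substitutions
are my assumptions):

    import torch
    import torch.nn as nn

    lenet5 = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
        nn.Tanh(),
        nn.AvgPool2d(2),                  # 6x28x28 -> 6x14x14
        nn.Conv2d(6, 16, kernel_size=5),  # 6x14x14 -> 16x10x10
        nn.Tanh(),
        nn.AvgPool2d(2),                  # 16x10x10 -> 16x5x5
        nn.Flatten(),                     # 400 features
        nn.Linear(400, 120), nn.Tanh(),   # dense block: 120,
        nn.Linear(120, 84), nn.Tanh(),    # 84,
        nn.Linear(84, 10),                # one score per digit 0-9
    )

    digits = torch.randn(8, 1, 32, 32)    # batch of grayscale images
    scores = lenet5(digits)               # shape: (8, 10)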
@@ -1453,7 +1453,7 @@ second layers of the feature maps present in AlexNet. They identify
 multiple problems with their structure such as aliasing artifacts and
 a mix of low and high frequency information without any mid
 frequencies. These results indicate that the filter size in AlexNet is
-too large at 11 by 11 and the authors reduce it to seven by
+too large at $11$ by $11$ and the authors reduce it to seven by
 seven. Additionally, they modify the original stride of four to
 two. These two changes result in an improvement in the top-5 error
 rate of 1.6\% over their own replicated AlexNet result of 18.1\%.
@@ -1461,7 +1461,7 @@ rate of 1.6\% over their own replicated AlexNet result of 18.1\%.
 \subsubsection{GoogLeNet}
 \label{sssec:theory-googlenet}
 
-GoogLeNet, also known as Inception-v1, was proposed by
+GoogLeNet, also known as Inception v1, was proposed by
 \textcite{szegedy2015} to increase the depth of the network without
 introducing too much additional complexity. Since the relevant parts
 of an image can often be of different sizes, but kernels within
@@ -1504,15 +1504,15 @@ non-linearities by having two \glspl{relu} instead of only one. The
 authors provide five different networks with increasing number of
 parameters based on these principles. The smallest network has a depth
 of eight convolutional layers and three fully-connected layers for the
-head (11 in total). The largest network has 16 convolutional and three
-fully-connected layers (19 in total). The fully-connected layers are
-the same for each architecture, only the layout of the convolutional
-layers varies.
+head ($11$ in total). The largest network has $16$ convolutional and
+three fully-connected layers ($19$ in total). The fully-connected
+layers are the same for each architecture, only the layout of the
+convolutional layers varies.
 
-The deepest network with 19 layers achieves a top-5 error rate on
+The deepest network with $19$ layers achieves a top-5 error rate on
 \gls{ilsvrc} 2014 of 9\%. If trained with different image scales in
 the range of $S \in [256, 512]$, the same network achieves a top-5 error
-rate of 8\% (test set at scale 256). By combining their two largest
+rate of 8\% (test set at scale $256$). By combining their two largest
 architectures and multi-crop as well as dense evaluation, they achieve
 an ensemble top-5 error rate of 6.8\%, while their best single network
 with multi-crop and dense evaluation results in 7\%, thus beating the
@@ -1522,8 +1522,8 @@ section~\ref{sssec:theory-googlenet}) by 0.9\%.
 \subsubsection{ResNet}
 \label{sssec:theory-resnet}
 
-The 22-layer structure of GoogLeNet \cite{szegedy2015} and the
-19-layer structure of VGGNet \cite{simonyan2015} showed that
+The $22$-layer structure of GoogLeNet \cite{szegedy2015} and the
+$19$-layer structure of VGGNet \cite{simonyan2015} showed that
 \emph{going deeper} is beneficial for achieving better classification
 performance. However, the authors of VGGNet already note that stacking
 even more layers does not lead to better performance because the model
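ResNet's remedy for this saturation is the residual (skip)
connection: each block outputs $F(x) + x$, so it only has to learn an
offset on top of the identity mapping. A hedged PyTorch sketch of
such a block (simplified; batch normalization and downsampling
omitted, which the original architecture does include):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.conv2(self.relu(self.conv1(x)))
            return self.relu(out + x)  # skip connection

    block = ResidualBlock(64)
    x = torch.randn(1, 64, 56, 56)
    y = block(x)  # same shape as x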
@@ -1706,13 +1706,13 @@ Estimated 3 pages for this section.
 
 The literature on machine learning in agriculture is broadly divided
 into four main areas:~livestock management, soil management, water
-management, and crop management~\cite{benos2021}. Of those four, water
+management, and crop management \cite{benos2021}. Of those four, water
 management only makes up about 10\% of all surveyed papers during the
 years 2018--2020. This highlights the potential for research in this
 area to have a high real-world impact.
 
 \textcite{su2020} used traditional feature extraction and
-pre-processing techniques to train various machine learning models for
+preprocessing techniques to train various machine learning models for
 classifying water stress for a wheat field. They took top-down images
 of the field using an \gls{uav}, segmented wheat pixels from
 background pixels and constructed features based on spectral
@@ -1742,47 +1742,49 @@ their results do not transfer well to the other seasons under survey
 \textcite{zhuang2017} showed that water stress in maize can be
 detected early on and, therefore, still provide actionable information
 before the plants succumb to drought. They installed a camera which
-took $640\times480$ pixel RGB images every two hours. A simple linear
-classifier (SVM) segmented the image into foreground and background
-using the green color channel. The authors constructed a
-fourteen-dimensional feature space consisting of color and texture
-features. A gradient boosted decision tree (GBDT) model classified the
-images into water stressed and non-stressed and achieved an accuracy
-of $\qty{90.39}{\percent}$. Remarkably, the classification was not
+took $640$ by $480$ pixel RGB images every two hours. A simple linear
+classifier (\gls{svm}) segmented the image into foreground and
+background using the green color channel. The authors constructed a
+$14$-dimensional feature space consisting of color and texture
+features. A \gls{gbdt} model classified the images into water stressed
+and non-stressed and achieved an accuracy of
+$\qty{90.39}{\percent}$. Remarkably, the classification was not
 significantly impacted by illumination changes throughout the day.
 
-\textcite{an2019} used the ResNet50 model as a basis for transfer
-learning and achieved high classification scores (ca. 95\%) on
-maize. Their model was fed with $640\times480$ pixel images of maize
-from three different viewpoints and across three different growth
-phases. The images were converted to grayscale which turned out to
-slightly lower classification accuracy. Their results also highlight
-the superiority of deep convolutional neural networks (DCNNs) compared
-to manual feature extraction and gradient boosted decision trees
-(GBDTs).
+\textcite{an2019} used the ResNet50 model (see
+section~\ref{sssec:theory-resnet}) as a basis for transfer learning and
+achieved high classification scores (ca. 95\%) on maize. Their model
+was fed with $640$ by $480$ pixel images of maize from three different
+viewpoints and across three different growth phases. The images were
+converted to grayscale which turned out to slightly lower
+classification accuracy. Their results also highlight the superiority
+of \glspl{dcnn} compared to manual feature extraction and
+\glspl{gbdt}.
 
 \textcite{chandel2021} investigated deep learning models in depth by
-comparing three well-known CNNs. The models under scrutiny were
-AlexNet, GoogLeNet, and Inception V3. Each model was trained with a
-dataset containing images of maize, okra, and soybean at different
-stages of growth and under stress and no stress. The researchers did
-not include an object detection step before image classification and
-compiled a fairly small dataset of 1200 images. Of the three models,
-GoogLeNet beat the other two with a sizable lead at a classification
-accuracy of >94\% for all three types of crop. The authors attribute
-its success to its inherently deeper structure and application of
-multiple convolutional layers at different stages. Unfortunately, all
-of the images were taken at the same $\ang{45}\pm\ang{5}$ angle and it
-stands to reason that the models would perform significantly worse on
-images taken under different conditions.
+comparing three well-known \glspl{cnn}. The models under scrutiny were
+AlexNet (see section~\ref{sssec:theory-alexnet}), GoogLeNet (see
+section~\ref{sssec:theory-googlenet}), and Inception v3. Each model
+was trained with a dataset containing images of maize, okra, and
+soybean at different stages of growth and under stress and no
+stress. The researchers did not include an object detection step
+before image classification and compiled a fairly small dataset of
+$1200$ images. Of the three models, GoogLeNet beat the other two with
+a sizable lead at a classification accuracy of >94\% for all three
+types of crop. The authors attribute its success to its inherently
+deeper structure and application of multiple convolutional layers at
+different stages. Unfortunately, all of the images were taken at the
+same $\ang{45}\pm\ang{5}$ angle and it stands to reason that the models
+would perform significantly worse on images taken under different
+conditions.
 
 \textcite{ramos-giraldo2020} detected water stress in soybean and corn
-crops with a pretrained model based on DenseNet-121. Low-cost cameras
-deployed in the field provided the training data over a 70-day
-period. They achieved a classification accuracy for the degree of
-wilting of 88\%.
+crops with a pretrained model based on DenseNet-121 (see
+section~\ref{sssec:theory-densenet}). Low-cost cameras deployed in the
+field provided the training data over a $70$-day period. They achieved
+a classification accuracy for the degree of wilting of 88\%.
 
-In a later study, the same authors~\cite{ramos-giraldo2020a} deployed
+In a later study, the same authors \cite{ramos-giraldo2020a} deployed
 their machine learning model in the field to test it for production
 use. They installed multiple Raspberry Pis with attached Raspberry Pi
 Cameras which took images in $\qty{30}{\minute}$ intervals. The
@@ -1797,27 +1799,26 @@ classification scores on corn and soybean with a low-cost setup.
 \textcite{azimi2020} demonstrate the efficacy of deep learning models
 versus classical machine learning models on chickpea plants. The
 authors created their own dataset in a laboratory setting for stressed
-and non-stressed plants. They acquired 8000 images at eight different
-angles in total. For the classical machine learning models, they
-extracted feature vectors using scale-invariant feature transform
-(SIFT) and histogram of oriented gradients (HOG). The features are fed
-into three classical machine learning models: support vector machine
-(SVM), k-nearest neighbors (KNN), and a decision tree (DT) using the
-classification and regression (CART) algorithm. On the deep learning
-side, they used their own CNN architecture and the pre-trained
-ResNet-18 model. The accuracy scores for the classical models was in
-the range of $\qty{60}{\percent}$ to $\qty{73}{\percent}$ with the SVM
-outperforming the two others. The CNN achieved higher scores at
-$\qty{72}{\percent}$ to $\qty{78}{\percent}$ and ResNet-18 achieved
-the highest scores at $\qty{82}{\percent}$ to
-$\qty{86}{\percent}$. The results clearly show the superiority of deep
-learning over classical machine learning. A downside of their approach
-lies in the collection of the images. The background in all images was
-uniformly white and the plants were prominently placed in the
-center. It should, therefore, not be assumed that the same
-classification scores can be achieved on plants in the field with
-messy and noisy backgrounds as well as illumination changes and so
-forth.
+and non-stressed plants. They acquired $8000$ images at eight
+different angles in total. For the classical machine learning models,
+they extracted feature vectors using \gls{sift} and \gls{hog}. The
+features are fed into three classical machine learning models:
+\gls{svm}, \gls{k-nn}, and a \gls{dt} using the \gls{cart}
+algorithm. On the deep learning side, they used their own \gls{cnn}
+architecture and the pretrained ResNet-18 (see
+section~\ref{sssec:theory-resnet}) model. The accuracy scores for the
+classical models was in the range of $\qty{60}{\percent}$ to
+$\qty{73}{\percent}$ with the \gls{svm} outperforming the two
+others. The \gls{cnn} achieved higher scores at $\qty{72}{\percent}$
+to $\qty{78}{\percent}$ and ResNet-18 achieved the highest scores at
+$\qty{82}{\percent}$ to $\qty{86}{\percent}$. The results clearly show
+the superiority of deep learning over classical machine learning. A
+downside of their approach lies in the collection of the images. The
+background in all images was uniformly white and the plants were
+prominently placed in the center. It should, therefore, not be assumed
+that the same classification scores can be achieved on plants in the
+field with messy and noisy backgrounds as well as illumination changes
+and so forth.
 
 A significant problem in the detection of water stress is posed by the
 evolution of indicators across time. Since physiological features such
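The transfer-learning recipe used by several of the cited studies (a
pretrained backbone whose classification head is replaced) can be
sketched with torchvision. This is a generic, assumed setup for the
two-class healthy/stressed task, not the code of any cited paper:

    import torch.nn as nn
    from torchvision import models

    # Reuse ImageNet features; retrain only a new two-class head.
    model = models.resnet18(weights="IMAGENET1K_V1")
    for param in model.parameters():
        param.requires_grad = False        # freeze the backbone

    model.fc = nn.Linear(model.fc.in_features, 2)  # healthy/stressed

    # Only model.fc.parameters() would be passed to the optimizer.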
@@ -2189,27 +2190,28 @@ validation and testing, respectively.
 Of the 91479 images around 10\% were used for the test phase. These
 images contain a total of 12238 ground truth
 labels. Table~\ref{tab:yolo-metrics} shows precision, recall and the
-harmonic mean of both (F1-score). The results indicate that the model
-errs on the side of sensitivity because recall is higher than
-precision. Although some detections are not labeled as plants in the
-dataset, if there is a labeled plant in the ground truth data, the
-chance is high that it will be detected. This behavior is in line with
-how the model's detections are handled in practice. The detections are
-drawn on the original image and the user is able to check the bounding
-boxes visually. If there are wrong detections, the user can ignore
-them and focus on the relevant ones instead. A higher recall will thus
-serve the user's needs better than a high precision.
+harmonic mean of both ($\mathrm{F}_1$-score). The results indicate
+that the model errs on the side of sensitivity because recall is
+higher than precision. Although some detections are not labeled as
+plants in the dataset, if there is a labeled plant in the ground truth
+data, the chance is high that it will be detected. This behavior is in
+line with how the model's detections are handled in practice. The
+detections are drawn on the original image and the user is able to
+check the bounding boxes visually. If there are wrong detections, the
+user can ignore them and focus on the relevant ones instead. A higher
+recall will thus serve the user's needs better than a high precision.
 
 \begin{table}[h]
 \centering
 \begin{tabular}{lrrrr}
 \toprule
-{} & Precision & Recall & F1-score & Support \\
+{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
 \midrule
 Plant & 0.547571 & 0.737866 & 0.628633 & 12238.0 \\
 \bottomrule
 \end{tabular}
-\caption{Precision, recall and F1-score for the object detection model.}
+\caption{Precision, recall and $\mathrm{F}_1$-score for the object
+detection model.}
 \label{tab:yolo-metrics}
 \end{table}
 
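For reference, the $\mathrm{F}_1$-score in the table is just the
harmonic mean of the two listed values, which is easy to verify (a
quick check, not thesis code):

    def f1_score(precision, recall):
        """Harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall)

    # Reproduces the table's value from its own precision and recall.
    print(f1_score(0.547571, 0.737866))  # 0.6286... as in the table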
@@ -2330,26 +2332,26 @@ increase again after epoch 27.
 \centering
 \begin{tabular}{lrrrr}
 \toprule
-{} & Precision & Recall & F1-score & Support \\
+{} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
 \midrule
 Plant & 0.633358 & 0.702811 & 0.666279 & 12238.0 \\
 \bottomrule
 \end{tabular}
-\caption{Precision, recall and F1-score for the optimized object
-detection model.}
+\caption{Precision, recall and $\mathrm{F}_1$-score for the
+optimized object detection model.}
 \label{tab:yolo-metrics-hyp}
 \end{table}
 
 Turning to the evaluation of the optimized model on the test dataset,
 table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
-F1-score for the optimized model. Comparing these metrics with the
-non-optimized version from table~\ref{tab:yolo-metrics}, precision is
-significantly higher by more than 8.5\%. Recall, however, is 3.5\%
-lower. The F1-score is higher by more than 3.7\% which indicates that
-the optimized model is better overall despite the lower recall. We
-feel that the lower recall value is a suitable trade off for the
-substantially higher precision considering that the non-optimized
-model's precision is quite low at 0.55.
+$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
+with the non-optimized version from table~\ref{tab:yolo-metrics},
+precision is significantly higher by more than 8.5\%. Recall, however,
+is 3.5\% lower. The $\mathrm{F}_1$-score is higher by more than 3.7\%
+which indicates that the optimized model is better overall despite the
+lower recall. We feel that the lower recall value is a suitable trade
+off for the substantially higher precision considering that the
+non-optimized model's precision is quite low at 0.55.
 
 The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
 optimized model show that the model draws looser bounding boxes than
@@ -2438,7 +2440,7 @@ The random search was run for 138 iterations which equates to a 75\%
 probability that the best solution lies within 1\% of the theoretical
 maximum~\eqref{eq:opt-prob}. Figure~\ref{fig:classifier-hyp-results}
 shows three of the eight parameters and their impact on a high
-F1-score. \gls{sgd} has less variation in its results than
+$\mathrm{F}_1$-score. \gls{sgd} has less variation in its results than
 Adam~\cite{kingma2017} and manages to provide eight out of the ten
 best results. The number of epochs to train for was chosen based on
 the observation that almost all configurations converge well before
@@ -2456,17 +2458,17 @@ figure~\ref{fig:classifier-training-metrics}.
 \includegraphics{graphics/classifier-hyp-metrics.pdf}
 \caption[Classifier hyper-parameter optimization results.]{This
 figure shows three of the eight hyper-parameters and their
-performance measured by the F1-score during 138
+performance measured by the $\mathrm{F}_1$-score during 138
 trials. Differently colored markers show the batch size with
 darker colors representing a larger batch size. The type of marker
 (circle or cross) shows which optimizer was used. The x-axis shows
 the learning rate on a logarithmic scale. In general, a learning
 rate between 0.003 and 0.01 results in more robust and better
-F1-scores. Larger batch sizes more often lead to better
-performance as well. As for the type of optimizer, \gls{sgd}
-produced the best iteration with an F1-score of 0.9783. Adam tends
-to require more customization of its parameters than \gls{sgd} to
-achieve good results.}
+$\mathrm{F}_1$-scores. Larger batch sizes more often lead to
+better performance as well. As for the type of optimizer,
+\gls{sgd} produced the best iteration with an $\mathrm{F}_1$-score
+of 0.9783. Adam tends to require more customization of its
+parameters than \gls{sgd} to achieve good results.}
 \label{fig:classifier-hyp-results}
 \end{figure}
 
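Assuming \eqref{eq:opt-prob} is the usual random-search coverage
bound $p = 1 - (1 - q)^n$ (an assumption; the equation itself is not
shown in this diff), the 75\% figure for 138 iterations checks out:

    n, q = 138, 0.01      # iterations, top fraction of search space
    p = 1 - (1 - q) ** n  # chance of hitting the top q at least once
    print(round(p, 3))    # 0.75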
@@ -2477,14 +2479,15 @@ chance due to a coincidentally advantageous train/test split, we
 perform stratified $10$-fold cross validation on the dataset. Each
 fold contains 90\% training and 10\% test data and was trained for 25
 epochs. Figure~\ref{fig:classifier-hyp-roc} shows the performance of
-the epoch with the highest F1-score of each fold as measured against
-the test split. The mean \gls{roc} curve provides a robust metric for
-a classifier's performance because it averages out the variability of
-the evaluation. Each fold manages to achieve at least an \gls{auc} of
-0.94, while the best fold reaches 0.98. The mean \gls{roc} has an
-\gls{auc} of 0.96 with a standard deviation of 0.02. These results
-indicate that the model is accurately predicting the correct class and
-is robust against variations in the training set.
+the epoch with the highest $\mathrm{F}_1$-score of each fold as
+measured against the test split. The mean \gls{roc} curve provides a
+robust metric for a classifier's performance because it averages out
+the variability of the evaluation. Each fold manages to achieve at
+least an \gls{auc} of 0.94, while the best fold reaches 0.98. The mean
+\gls{roc} has an \gls{auc} of 0.96 with a standard deviation of
+0.02. These results indicate that the model is accurately predicting
+the correct class and is robust against variations in the training
+set.
 
 \begin{table}
 \centering
@@ -2508,47 +2511,49 @@ is robust against variations in the training set.
 \includegraphics{graphics/classifier-hyp-folds-roc.pdf}
 \caption[Mean \gls{roc} and variability of hyper-parameter-optimized
 model.]{This plot shows the \gls{roc} curve for the epoch with the
-highest F1-score of each fold as well as the \gls{auc}. To get a
-less variable performance metric of the classifier, the mean
-\gls{roc} curve is shown as a thick line and the variability is
-shown in gray. The overall mean \gls{auc} is 0.96 with a standard
-deviation of 0.02. The best-performing fold reaches an \gls{auc}
-of 0.99 and the worst an \gls{auc} of 0.94. The black dashed line
-indicates the performance of a classifier which picks classes at
-random ($\mathrm{\gls{auc}} = 0.5$). The shapes of the \gls{roc}
-curves show that the classifier performs well and is robust
-against variations in the training set.}
+highest $\mathrm{F}_1$-score of each fold as well as the
+\gls{auc}. To get a less variable performance metric of the
+classifier, the mean \gls{roc} curve is shown as a thick line and
+the variability is shown in gray. The overall mean \gls{auc} is
+0.96 with a standard deviation of 0.02. The best-performing fold
+reaches an \gls{auc} of 0.99 and the worst an \gls{auc} of
+0.94. The black dashed line indicates the performance of a
+classifier which picks classes at random
+($\mathrm{\gls{auc}} = 0.5$). The shapes of the \gls{roc} curves
+show that the classifier performs well and is robust against
+variations in the training set.}
 \label{fig:classifier-hyp-roc}
 \end{figure}
 
 The classifier shows good performance so far, but care has to be taken
-to not overfit the model to the training set. Comparing the F1-score
-during training with the F1-score during testing gives insight into
-when the model tries to increase its performance during training at
-the expense of generalizability. Figure~\ref{fig:classifier-hyp-folds}
-shows the F1-scores of each epoch and fold. The classifier converges
+to not overfit the model to the training set. Comparing the
+$\mathrm{F}_1$-score during training with the $\mathrm{F}_1$-score
+during testing gives insight into when the model tries to increase its
+performance during training at the expense of
+generalizability. Figure~\ref{fig:classifier-hyp-folds} shows the
+$\mathrm{F}_1$-scores of each epoch and fold. The classifier converges
 quickly to 1 for the training set at which point it experiences a
 slight drop in generalizability. Training the model for at most five
 epochs is sufficient because there are generally no improvements
 afterwards. The best-performing epoch for each fold is between the
 second and fourth epoch which is just before the model achieves an
-F1-score of 1 on the training set.
+$\mathrm{F}_1$-score of 1 on the training set.
 
 \begin{figure}
 \centering
 \includegraphics[width=.9\textwidth]{graphics/classifier-hyp-folds-f1.pdf}
-\caption[F1-score of stratified $10$-fold cross validation.]{These
-plots show the F1-score during training as well as testing for
-each of the folds. The classifier converges to 1 by the third
-epoch during the training phase, which might indicate
-overfitting. However, the performance during testing increases
-until epoch three in most cases and then stabilizes at
-approximately 2-3\% lower than the best epoch. We believe that the
-third, or in some cases fourth, epoch is detrimental to
-performance and results in overfitting, because the model achieves
-an F1-score of 1 for the training set, but that gain does not
-transfer to the test set. Early stopping during training
-alleviates this problem.}
+\caption[$\mathrm{F}_1$-score of stratified $10$-fold cross
+validation.]{These plots show the $\mathrm{F}_1$-score during
+training as well as testing for each of the folds. The classifier
+converges to 1 by the third epoch during the training phase, which
+might indicate overfitting. However, the performance during
+testing increases until epoch three in most cases and then
+stabilizes at approximately 2-3\% lower than the best epoch. We
+believe that the third, or in some cases fourth, epoch is
+detrimental to performance and results in overfitting, because the
+model achieves an $\mathrm{F}_1$-score of 1 for the training set,
+but that gain does not transfer to the test set. Early stopping
+during training alleviates this problem.}
 \label{fig:classifier-hyp-folds}
 \end{figure}
 
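The evaluation protocol in these hunks (stratified $10$-fold splits,
keeping each fold's best epoch) follows a standard pattern. A
schematic scikit-learn sketch, where the random placeholder stands in
for the thesis's actual per-epoch training and $\mathrm{F}_1$ scoring,
which is not shown in this diff:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # Placeholder data; the real features and labels come from the
    # thesis's dataset and are not available here.
    X = np.random.rand(1000, 32)
    y = np.random.randint(0, 2, size=1000)

    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, test_idx in skf.split(X, y):
        best_f1 = 0.0
        for epoch in range(25):
            # train_one_epoch(...) and evaluate_f1(...) would go
            # here; a random stand-in keeps the sketch runnable.
            f1 = float(np.random.rand())
            best_f1 = max(best_f1, f1)  # early stopping keeps the
                                        # best epoch per fold
        fold_scores.append(best_f1)

    # Mean and spread across folds, as reported for the ROC curves.
    print(np.mean(fold_scores), np.std(fold_scores))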
@@ -2655,7 +2660,7 @@ bounding boxes of healthy plants and 494 of stressed plants.
 \centering
 \begin{tabular}{lrrrr}
 \toprule
-{} & precision & recall & f1-score & support \\
+{} & precision & recall & $\mathrm{F}_{1}$-score & support \\
 \midrule
 Healthy & 0.665 & 0.554 & 0.604 & 766 \\
 Stressed & 0.639 & 0.502 & 0.562 & 494 \\
@@ -2664,15 +2669,17 @@ bounding boxes of healthy plants and 494 of stressed plants.
 weighted avg & 0.655 & 0.533 & 0.588 & 1260 \\
 \bottomrule
 \end{tabular}
-\caption{Precision, recall and F1-score for the aggregate model.}
+\caption{Precision, recall and $\mathrm{F}_1$-score for the
+aggregate model.}
 \label{tab:model-metrics}
 \end{table}
 
-Table~\ref{tab:model-metrics} shows precision, recall and the F1-score
-for both classes \emph{Healthy} and \emph{Stressed}. Precision is
-higher than recall for both classes and the F1-score is at
-0.59. Unfortunately, these values do not take the accuracy of bounding
-boxes into account and thus have only limited expressive power.
+Table~\ref{tab:model-metrics} shows precision, recall and the
+$\mathrm{F}_1$-score for both classes \emph{Healthy} and
+\emph{Stressed}. Precision is higher than recall for both classes and
+the $\mathrm{F}_1$-score is at 0.59. Unfortunately, these values do
+not take the accuracy of bounding boxes into account and thus have
+only limited expressive power.
 
 Figure~\ref{fig:aggregate-ap} shows the precision and recall curves
 for both classes at different \gls{iou} thresholds. The left plot
@@ -2716,7 +2723,7 @@ section~\ref{ssec:aggregate-model}.
 \centering
 \begin{tabular}{lrrrr}
 \toprule
-{} & precision & recall & f1-score & support \\
+{} & precision & recall & $\mathrm{F}_{1}$-score & support \\
 \midrule
 Healthy & 0.711 & 0.555 & 0.623 & 766 \\
 Stressed & 0.570 & 0.623 & 0.596 & 494 \\
@@ -2725,22 +2732,23 @@ section~\ref{ssec:aggregate-model}.
 weighted avg & 0.656 & 0.582 & 0.612 & 1260 \\
 \bottomrule
 \end{tabular}
-\caption{Precision, recall and F1-score for the optimized aggregate
-model.}
+\caption{Precision, recall and $\mathrm{F}_1$-score for the
+optimized aggregate model.}
 \label{tab:model-metrics-hyp}
 \end{table}
 
-Table~\ref{tab:model-metrics-hyp} shows precision, recall and F1-score
-for the optimized model on the same test dataset of 640 images. All of
-the metrics are better for the optimized model. In particular,
-precision for the healthy class could be improved significantly while
-recall remains at the same level. This results in a better F1-score
-for the healthy class. Precision for the stressed class is lower with
-the optimized model, but recall is significantly higher (0.502
-vs. 0.623). The higher recall results in a 3\% gain for the F1-score
-in the stressed class. Overall, precision is the same but recall has
+Table~\ref{tab:model-metrics-hyp} shows precision, recall and
+$\mathrm{F}_1$-score for the optimized model on the same test dataset
+of 640 images. All of the metrics are better for the optimized
+model. In particular, precision for the healthy class could be
+improved significantly while recall remains at the same level. This
+results in a better $\mathrm{F}_1$-score for the healthy
+class. Precision for the stressed class is lower with the optimized
+model, but recall is significantly higher (0.502 vs. 0.623). The
+higher recall results in a 3\% gain for the $\mathrm{F}_1$-score in
+the stressed class. Overall, precision is the same but recall has
 improved significantly, which also results in a noticeable improvement
-for the average F1-score across both classes.
+for the average $\mathrm{F}_1$-score across both classes.
 
 \begin{figure}
 \centering