diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index 8f2cee2..1894b59 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -183,7 +183,7 @@ learning.
 Large-scale as well as small local farmers are able to survey their
 fields and gardens with drones or stationary cameras to determine
 soil and plant condition as well as when to water or
-fertilize~\cite{ramos-giraldo2020}. Machine learning models play an
+fertilize \cite{ramos-giraldo2020}. Machine learning models play an
 important role in that process because they allow automated
 decision-making in real time. While machine learning has been used in
 large-scale agriculture, it is also a valuable tool for household
@@ -199,11 +199,11 @@ are numerous.
 First, gathering data in the field requires a network of
 sensors which are linked to a central server for processing. Since
 communication between sensors is difficult without proper
 infrastructure, there is a high demand for processing the data on the
-sensor itself~\cite{mcenroe2022}. Second, differences in local soil,
+sensor itself \cite{mcenroe2022}. Second, differences in local soil,
 plant and weather conditions require models to be optimized for these
 diverse inputs. Centrally trained models often lose the nuances
 present in the data because they have to provide actionable
-information for a larger area~\cite{awad2019}. Third, specialized
+information for a larger area \cite{awad2019}. Third, specialized
 methods such as hyper- or multispectral imaging in the field provide
 fine-grained information about the object of interest but come with
 substantial upfront costs and are of limited interest for gardeners.
@@ -224,7 +224,7 @@ plants in the field of view and then to determine if the plants need
 water or not. The model should be suitable for edge devices equipped
 with a \gls{tpu} or \gls{gpu} but with otherwise limited processing
 capabilities. Examples of such systems include Google's Coral
-development board and the Nvidia Jetson series of~\glspl{sbc}. The
+development board and the Nvidia Jetson series of \glspl{sbc}. The
 model should make use of state-of-the-art algorithms from either
 classical machine learning or deep learning. The literature review
 will yield an appropriate machine learning method. Furthermore, the
@@ -325,19 +325,19 @@ further insights about the type of models which are commonly used.
 In order to find and select appropriate datasets to train the models
 on, we will survey the existing large datasets for classes we can
-use. Datasets such as the \gls{coco}~\cite{lin2015} and
-\gls{voc}~\cite{everingham2010} contain the highly relevant class
+use. Datasets such as the \gls{coco} \cite{lin2015} and
+\gls{voc} \cite{everingham2010} contain the highly relevant class
 \emph{Potted Plant}. By extracting only these classes from multiple
 datasets and concatenating them, it is possible to create
 one unified dataset which only contains the classes necessary for
 training the model.

 The training of the models will happen in an environment where more
-computational resources are available than what the~\gls{sbc}
-offers. We will deploy the final model with the~\gls{api} to
-the~\gls{sbc} after training and optimization. Furthermore, training
-will happen in tandem with a continuous evaluation process. After
-every iteration of the model, an evaluation run against the test set
+computational resources are available than what the \gls{sbc}
+offers. We will deploy the final model with the \gls{api} to the
+\gls{sbc} after training and optimization. 
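+
+Returning briefly to the dataset-construction step described above:
+extracting a single class from a \gls{coco}-style dataset can be
+sketched in a few lines. The following editorial example assumes the
+\texttt{pycocotools} package and a placeholder annotation path:
+
+\begin{verbatim}
+from pycocotools.coco import COCO
+
+# Placeholder path to a COCO-style annotation file.
+coco = COCO("annotations/instances_train2017.json")
+
+# Keep only the category of interest ("potted plant" in COCO).
+cat_ids = coco.getCatIds(catNms=["potted plant"])
+img_ids = coco.getImgIds(catIds=cat_ids)
+ann_ids = coco.getAnnIds(imgIds=img_ids, catIds=cat_ids)
+anns = coco.loadAnns(ann_ids)
+print(len(img_ids), "images,", len(anns), "plant boxes")
+
+# Repeating this per source dataset and concatenating the results
+# yields the unified single-class dataset described above.
+\end{verbatim}
+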
Furthermore, training will
+happen in tandem with a continuous evaluation process. After every
+iteration of the model, an evaluation run against the test set
 determines whether there has been an improvement in performance. The
 results of the evaluation feed back into the parameter selection at
 the beginning of each training phase. Small changes to the training
@@ -357,7 +357,7 @@ has been met, and—if not—give reasons for the rejection of all or
 part of the hypotheses.

 Overall, the development of our application follows an evolutionary
-pototyping process~\cite{davis1992,sears2007}. Instead of producing a
+prototyping process \cite{davis1992,sears2007}. Instead of producing a
 full-fledged product from the start, development happens iteratively
 in phases. The main phases and their order for the prototype at hand
 are: model selection, implementation, and evaluation. The results of
@@ -404,7 +404,7 @@ results of the testing phases as well as the performance of the
 aggregate model. Furthermore, the results are compared with the
 expectations and it is discussed whether they are explainable in the
 context of the task at hand as well as benchmark results from other
-datasets (\gls{coco}~\cite{lin2015}). Chapter~\ref{chap:conclusion}
+datasets (\gls{coco} \cite{lin2015}). Chapter~\ref{chap:conclusion}
 concludes the thesis with a summary and an outlook on possible
 improvements and further research questions.
@@ -685,8 +685,8 @@ network and is, therefore, not suitable for complex intra-data
 relationships.
 A major downside to using the Heaviside step function is
 that it is not differentiable at $x = 0$ and has a $0$ derivative
 elsewhere. These properties make it unsuitable for use with gradient
-descent during back-propagation (section
-\ref{ssec:theory-back-propagation}).
+descent during backpropagation
+(section~\ref{ssec:theory-backprop}).

 \subsubsection{Sigmoid}
 \label{sssec:theory-sigmoid}
@@ -852,28 +852,28 @@ there is the case of binary random variables, i.e. only two classes
 to classify exist, the measure is called binary cross-entropy.
 Cross-entropy loss is known to outperform \gls{mse} for
 classification tasks and allows the model to be trained
-faster~\cite{simard2003}.
 
-\subsection{Back-Propagation}
-\label{ssec:theory-back-propagation}
+faster \cite{simard2003}.
 
+\subsection{Backpropagation}
+\label{ssec:theory-backprop}

 So far, information only flows forward through the network whenever a
 prediction for a particular input should be made. In order for a
 neural network to learn, information about the computed loss has to
 flow backward through the network. Only then can the weights at the
 individual neurons be updated. This type of information flow is termed
-\emph{back-propagation} \cite{rumelhart1986}. Back-propagation
-computes the gradient of a loss function with respect to the weights
-of a network for an input-output pair. The algorithm computes the
-gradient iteratively starting from the last layer and works its way
-backward through the network until it reaches the first layer.
+\emph{backpropagation} \cite{rumelhart1986}. Backpropagation computes
+the gradient of a loss function with respect to the weights of a
+network for an input-output pair. The algorithm computes the gradient
+iteratively starting from the last layer and works its way backward
+through the network until it reaches the first layer. 
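+
+To make the mechanics concrete, the following minimal sketch (an
+editorial illustration in Python/NumPy with made-up shapes and a
+made-up learning rate) performs one backpropagation step for a
+two-layer network with sigmoid activations and binary cross-entropy
+loss, followed by a plain gradient descent update:
+
+\begin{verbatim}
+import numpy as np
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(4, 3))             # toy batch: 4 samples, 3 features
+y = np.array([[0.], [1.], [1.], [0.]])  # binary targets
+W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
+W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)
+sig = lambda z: 1 / (1 + np.exp(-z))
+
+# Forward pass.
+a1 = sig(X @ W1 + b1)
+a2 = sig(a1 @ W2 + b2)
+loss = -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))
+
+# Backward pass, from the last layer to the first. For sigmoid
+# outputs with binary cross-entropy, dL/dz2 simplifies to a2 - y.
+dz2 = (a2 - y) / len(X)
+dW2, db2 = a1.T @ dz2, dz2.sum(axis=0)
+dz1 = (dz2 @ W2.T) * a1 * (1 - a1)      # chain rule through sigmoid
+dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
+
+# Gradient descent consumes the gradients to update the weights.
+lr = 0.1
+W1 -= lr * dW1; b1 -= lr * db1
+W2 -= lr * dW2; b2 -= lr * db2
+\end{verbatim}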
-Strictly speaking, back-propagation only computes the gradient, but
+Strictly speaking, backpropagation only computes the gradient, but
 does not determine how the gradient is used to learn the new
-weights. Once the back-propagation algorithm has computed the
-gradient, that gradient is passed to an algorithm which finds a local
-minimum of it. This step is usually performed by some variant of
-gradient descent \cite{cauchy1847}.
+weights. Once the backpropagation algorithm has computed the gradient,
+that gradient is passed to an algorithm which finds a local minimum of
+it. This step is usually performed by some variant of gradient descent
+\cite{cauchy1847}.

 \section{Object Detection}
 \label{sec:background-detection}
@@ -900,7 +900,7 @@ time.
 \label{sssec:obj-viola-jones}

 The first milestone was the face detector by
-~\textcite{viola2001,viola2001} which is able to perform face
+\textcite{viola2001} which is able to perform face
 detection on $384$ by $288$ pixel (grayscale) images with
 \qty{15}{fps} on a \qty{700}{\MHz} Intel Pentium III processor. The
 authors use an integral image representation where every pixel is the
@@ -909,7 +909,7 @@ representation allows them to quickly and efficiently calculate
 Haar-like features.

 The Haar-like features are passed to a modified AdaBoost
-algorithm~\cite{freund1995} which only selects the (presumably) most
+algorithm \cite{freund1995} which only selects the (presumably) most
 important features. At the end there is a cascading stage of
 classifiers where regions are only considered further if they are
 promising. Every additional classifier adds complexity, but once a
@@ -921,7 +921,7 @@ achieves comparable results to the state of the art in 2001.
 \subsubsection{HOG Detector}
 \label{sssec:obj-hog}

-The \gls{hog}~\cite{dalal2005} is a feature descriptor used in
+The \gls{hog} \cite{dalal2005} is a feature descriptor used in
 computer vision and image processing to detect objects in images. It
 captures object shape, as do other methods such as
 \gls{sift} \cite{lowe1999}. The idea is to use the distribution of
@@ -940,14 +940,14 @@ with images of 64 by 128 pixels and make sure that the image contains
 a margin of 16 pixels around the person. Decreasing the border by
 either enlarging the person or reducing the overall image size results
 in worse performance. Unfortunately, their method is far from being
-able to process images in real time—a 320 by 240 image takes roughly a
-second to process.
+able to process images in real time—a $320$ by $240$ image takes
+roughly a second to process.

 \subsubsection{Deformable Part-Based Model}
 \label{sssec:obj-dpm}

-\glspl{dpm}~\cite{felzenszwalb2008a} were the winners of the \gls{voc}
-challenge in the years 2007, 2008 and 2009. The method is heavily
+\glspl{dpm} \cite{felzenszwalb2008a} were the winners of the \gls{voc}
+challenge in the years 2007, 2008, and 2009. The method is heavily
 based on the previously discussed \gls{hog} since it also uses
 \gls{hog} descriptors internally. The authors' addition is the idea of
 learning how to decompose objects during training and
@@ -1008,25 +1008,25 @@ often not as efficient as one-stage detectors.

 \textcite{girshick2014} were the first to propose using feature
 representations of \glspl{cnn} for object detection. Their approach
-consists of generating around 2000 region proposals and passing these
-on to a \gls{cnn} for feature extraction. The fixed-length feature
-vector is used as input for a linear \gls{svm} which classifies the
-region. 
They name their method R-\gls{cnn}, where the R stands for
-region.
+consists of generating around $2000$ region proposals and passing
+these on to a \gls{cnn} for feature extraction. The fixed-length
+feature vector is used as input for a linear \gls{svm} which
+classifies the region. They name their method R-\gls{cnn}, where the R
+stands for region.

 R-\gls{cnn} uses selective search to generate region proposals
 \cite{uijlings2013}. The authors use selective search's \emph{fast
-mode} to generate the 2000 proposals and warp (i.e. aspect ratios are
-not retained) each proposal into the image dimensions required by the
-\gls{cnn}. The \gls{cnn}, which matches the architecture of AlexNet
-\cite{krizhevsky2012}, generates a $4096$-dimensional feature vector
-and each feature vector is scored by a linear \gls{svm} for each
-class. Scored regions are selected/discarded by comparing each region
-to other regions within the same class and rejecting them if there
-exists another region with a higher score and greater \gls{iou} than a
-threshold. The linear \gls{svm} classifiers are trained to only label
-a region as positive if the overlap, as measured by \gls{iou}, is
-above $0.3$.
+mode} to generate the $2000$ proposals and warp (i.e. aspect ratios
+are not retained) each proposal into the image dimensions required by
+the \gls{cnn}. The \gls{cnn}, which matches the architecture of
+AlexNet \cite{krizhevsky2012}, generates a $4096$-dimensional feature
+vector and each feature vector is scored by a linear \gls{svm} for
+each class. Scored regions are kept or discarded by comparing each
+region to other regions within the same class and rejecting a region
+if another region of the same class has a higher score and overlaps it
+with an \gls{iou} above a threshold. The linear \gls{svm} classifiers
+are trained to only label a region as positive if the overlap, as
+measured by \gls{iou}, is above $0.3$.

 While the approach of generating region proposals is not new, using a
 \gls{cnn} purely for feature extraction is. Unfortunately, R-\gls{cnn}
@@ -1132,15 +1132,15 @@ on all levels.
 \glspl{fpn} are an important building block of many state-of-the-art
 object detectors.

 A \gls{fpn} first computes the feature pyramid bottom-up with a
-scaling step of 2. The lower levels capture less semantic information
+scaling step of two. The lower levels capture less semantic information
 than the higher levels, but include more spatial information due to
 the higher granularity. In a second step, the \gls{fpn} upsamples the
 higher levels such that the dimensions of two consecutive layers are
 the same. The upsampled top layer is merged with the layer beneath it
-via element-wise addition and convolved with a $1\times 1$ convolutional
-layer to reduce channel dimensions and to smooth out potential
-artifacts introduced during the upsampling step. The results of that
-operation constitute the new \emph{top layer} and the process
+via element-wise addition and convolved with a $1 \times 1$
+convolutional layer to reduce channel dimensions and to smooth out
+potential artifacts introduced during the upsampling step. The results
+of that operation constitute the new \emph{top layer} and the process
 continues with the layer below it until the finest resolution feature
 map is generated. In this way, the features of the different layers at
 different scales are fused to obtain a feature map with high semantic
@@ -1216,7 +1216,7 @@ detect smaller and denser objects as well. 
The authors report results on \gls{voc} 2007 for their \gls{ssd}300
 and \gls{ssd}512 model varieties. The number refers to the size of the
-input images. \gls{ssd}300 outperforms Fast R-\gls{cnn} by 1.1
+input images. \gls{ssd}300 outperforms Fast R-\gls{cnn} by $1.1$
 percentage points (\gls{map} 66.9\% vs 68\%). \gls{ssd}512 outperforms
 Faster R-\gls{cnn} by 1.7\% \gls{map}. If trained on the \gls{voc}
 2007, 2012 and \gls{coco} train sets, \gls{ssd}512 achieves a
@@ -1343,7 +1343,7 @@ The idea of automatic generation of feature maps via \glspl{ann} gave
 rise to \glspl{cnn}. Early \glspl{cnn} \cite{lecun1989} were mostly
 discarded for practical applications because they require much more
 data during training than traditional methods and also more processing
-power during inference. Passing $224\times 224$ pixel images to a
+power during inference. Passing $224$ by $224$ pixel images to a
 \gls{cnn}, as is common today, was simply not feasible if one wanted a
 reasonable inference time. With the development of \glspl{gpu} and
 supporting software such as the \gls{cuda} toolkit, it was possible to
@@ -1367,24 +1367,24 @@ function. The error function with which the weights are updated is

 The architecture of LeNet-5 is composed of two convolutional layers,
 two pooling layers and a dense block of three fully-connected
-layers. The input image is a grayscale image of 32 by 32 pixels. The
-first convolutional layer generates six feature maps, each with a
-scale of 28 by 28 pixels. Each feature map is fed to a pooling layer
-which effectively downsamples the image by a factor of two. By
-aggregating each two by two area in the feature map via averaging, the
-authors are more likely to obtain relative (to each other) instead of
-absolute positions of the features. To make up for the loss in spatial
-resolution, the following convolutional layer increases the amount of
-feature maps to 16 which aims to increase the richness of the learned
-representations. Another pooling layer follows which reduces the size
-of each of the 16 feature maps to five by five pixels. A dense block
-of three fully-connected layers of 120, 84 and 10 neurons respectively
-serves as the actual classifier in the network. The last layer uses
-the euclidean \gls{rbf} to compute the class an image belongs to (0-9
-digits).
+layers. The input image is a grayscale image of $32$ by $32$
+pixels. The first convolutional layer generates six feature maps, each
+with a scale of $28$ by $28$ pixels. Each feature map is fed to a
+pooling layer which effectively downsamples the image by a factor of
+two. By aggregating each two by two area in the feature map via
+averaging, the authors are more likely to obtain relative (to each
+other) instead of absolute positions of the features. To make up for
+the loss in spatial resolution, the following convolutional layer
+increases the number of feature maps to $16$, which aims to increase
+the richness of the learned representations. Another pooling layer
+follows which reduces the size of each of the $16$ feature maps to
+five by five pixels. A dense block of three fully-connected layers of
+$120$, $84$, and $10$ neurons, respectively, serves as the actual
+classifier in the network. The last layer uses the Euclidean \gls{rbf}
+to compute the class an image belongs to (digits 0--9).

 The performance of LeNet-5 was measured on the \gls{mnist} database
-which consists of 70.000 labeled images of handwritten digits. The
+which consists of $70000$ labeled images of handwritten digits. 
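+
+Before turning to the numbers, the layer stack described above can be
+summarized in a few lines. The following sketch is an editorial
+PyTorch approximation that keeps the layer sizes of LeNet-5 but, as is
+common in modern reimplementations, uses tanh activations and a plain
+linear output layer in place of the original \gls{rbf} output:
+
+\begin{verbatim}
+import torch
+import torch.nn as nn
+
+lenet5 = nn.Sequential(
+    nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 6 maps of 28x28
+    nn.Tanh(),
+    nn.AvgPool2d(2),                  # 28x28 -> 14x14
+    nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 16 maps of 10x10
+    nn.Tanh(),
+    nn.AvgPool2d(2),                  # 10x10 -> 5x5
+    nn.Flatten(),
+    nn.Linear(16 * 5 * 5, 120),
+    nn.Tanh(),
+    nn.Linear(120, 84),
+    nn.Tanh(),
+    nn.Linear(84, 10),                # one score per digit 0-9
+)
+
+x = torch.randn(1, 1, 32, 32)         # a single grayscale 32x32 image
+print(lenet5(x).shape)                # torch.Size([1, 10])
+\end{verbatim}
+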
The
 error rate on the test set is 0.95\%. This result is impressive
 considering that character recognition with a \gls{cnn} had not been
 done before. However, standard machine learning methods of the time,
@@ -1453,7 +1453,7 @@ second layers of the feature maps present in AlexNet. They identify
 multiple problems with their structure such as aliasing artifacts and
 a mix of low and high frequency information without any mid
 frequencies. These results indicate that the filter size in AlexNet is
-too large at 11 by 11 and the authors reduce it to seven by
+too large at $11$ by $11$ and the authors reduce it to seven by
 seven. Additionally, they modify the original stride of four to
 two. These two changes result in an improvement in the top-5 error
 rate of 1.6\% over their own replicated AlexNet result of 18.1\%.
@@ -1461,7 +1461,7 @@ rate of 1.6\% over their own replicated AlexNet result of 18.1\%.
 \subsubsection{GoogLeNet}
 \label{sssec:theory-googlenet}

-GoogLeNet, also known as Inception-v1, was proposed by
+GoogLeNet, also known as Inception v1, was proposed by
 \textcite{szegedy2015} to increase the depth of the network without
 introducing too much additional complexity. Since the relevant parts
 of an image can often be of different sizes, but kernels within
@@ -1504,15 +1504,15 @@ non-linearities by having two \glspl{relu} instead of only one.

 The authors provide five different networks with an increasing number of
 parameters based on these principles. The smallest network has a depth
 of eight convolutional layers and three fully-connected layers for the
-head (11 in total). The largest network has 16 convolutional and three
-fully-connected layers (19 in total). The fully-connected layers are
-the same for each architecture, only the layout of the convolutional
-layers varies.
+head ($11$ in total). The largest network has $16$ convolutional and
+three fully-connected layers ($19$ in total). The fully-connected
+layers are the same for each architecture, only the layout of the
+convolutional layers varies.

-The deepest network with 19 layers achieves a top-5 error rate on
+The deepest network with $19$ layers achieves a top-5 error rate on
 \gls{ilsvrc} 2014 of 9\%. If trained with different image scales in
 the range of $S \in [256, 512]$, the same network achieves a top-5 error
-rate of 8\% (test set at scale 256). By combining their two largest
+rate of 8\% (test set at scale $256$). By combining their two largest
 architectures and multi-crop as well as dense evaluation, they achieve
 an ensemble top-5 error rate of 6.8\%, while their best single network
 with multi-crop and dense evaluation results in 7\%, thus beating the
@@ -1522,8 +1522,8 @@ section~\ref{sssec:theory-googlenet}) by 0.9\%.

 \subsubsection{ResNet}
 \label{sssec:theory-resnet}

-The 22-layer structure of GoogLeNet \cite{szegedy2015} and the
-19-layer structure of VGGNet \cite{simonyan2015} showed that
+The $22$-layer structure of GoogLeNet \cite{szegedy2015} and the
+$19$-layer structure of VGGNet \cite{simonyan2015} showed that
 \emph{going deeper} is beneficial for achieving better classification
 performance. However, the authors of VGGNet already note that stacking
 even more layers does not lead to better performance because the model
@@ -1706,13 +1706,13 @@ Estimated 3 pages for this section.

 The literature on machine learning in agriculture is broadly divided
 into four main areas:~livestock management, soil management, water
-management, and crop management~\cite{benos2021}. 
Of those four, water +management, and crop management \cite{benos2021}. Of those four, water management only makes up about 10\% of all surveyed papers during the years 2018--2020. This highlights the potential for research in this area to have a high real-world impact. \textcite{su2020} used traditional feature extraction and -pre-processing techniques to train various machine learning models for +preprocessing techniques to train various machine learning models for classifying water stress for a wheat field. They took top-down images of the field using an \gls{uav}, segmented wheat pixels from background pixels and constructed features based on spectral @@ -1742,47 +1742,49 @@ their results do not transfer well to the other seasons under survey \textcite{zhuang2017} showed that water stress in maize can be detected early on and, therefore, still provide actionable information before the plants succumb to drought. They installed a camera which -took $640\times480$ pixel RGB images every two hours. A simple linear -classifier (SVM) segmented the image into foreground and background -using the green color channel. The authors constructed a -fourteen-dimensional feature space consisting of color and texture -features. A gradient boosted decision tree (GBDT) model classified the -images into water stressed and non-stressed and achieved an accuracy -of $\qty{90.39}{\percent}$. Remarkably, the classification was not +took $640$ by $480$ pixel RGB images every two hours. A simple linear +classifier (\gls{svm}) segmented the image into foreground and +background using the green color channel. The authors constructed a +$14$-dimensional feature space consisting of color and texture +features. A \gls{gbdt} model classified the images into water stressed +and non-stressed and achieved an accuracy of +$\qty{90.39}{\percent}$. Remarkably, the classification was not significantly impacted by illumination changes throughout the day. -\textcite{an2019} used the ResNet50 model as a basis for transfer -learning and achieved high classification scores (ca. 95\%) on -maize. Their model was fed with $640\times480$ pixel images of maize -from three different viewpoints and across three different growth -phases. The images were converted to grayscale which turned out to -slightly lower classification accuracy. Their results also highlight -the superiority of deep convolutional neural networks (DCNNs) compared -to manual feature extraction and gradient boosted decision trees -(GBDTs). +\textcite{an2019} used the ResNet50 model (see +section~\ref{sssec:theory-resnet}) as a basis for transfer learning and +achieved high classification scores (ca. 95\%) on maize. Their model +was fed with $640$ by $480$ pixel images of maize from three different +viewpoints and across three different growth phases. The images were +converted to grayscale which turned out to slightly lower +classification accuracy. Their results also highlight the superiority +of \glspl{dcnn} compared to manual feature extraction and +\glspl{gbdt}. \textcite{chandel2021} investigated deep learning models in depth by -comparing three well-known CNNs. The models under scrutiny were -AlexNet, GoogLeNet, and Inception V3. Each model was trained with a -dataset containing images of maize, okra, and soybean at different -stages of growth and under stress and no stress. The researchers did -not include an object detection step before image classification and -compiled a fairly small dataset of 1200 images. 
Of the three models,
-GoogLeNet beat the other two with a sizable lead at a classification
-accuracy of >94\% for all three types of crop. The authors attribute
-its success to its inherently deeper structure and application of
-multiple convolutional layers at different stages. Unfortunately, all
-of the images were taken at the same $\ang{45}\pm\ang{5}$ angle and it
-stands to reason that the models would perform significantly worse on
-images taken under different conditions.
+comparing three well-known \glspl{cnn}. The models under scrutiny were
+AlexNet (see section~\ref{sssec:theory-alexnet}), GoogLeNet (see
+section~\ref{sssec:theory-googlenet}), and Inception v3. Each model
+was trained with a dataset containing images of maize, okra, and
+soybean at different stages of growth and under stress and no
+stress. The researchers did not include an object detection step
+before image classification and compiled a fairly small dataset of
+$1200$ images. Of the three models, GoogLeNet beat the other two with
+a sizable lead at a classification accuracy of over 94\% for all three
+types of crop. The authors attribute its success to its inherently
+deeper structure and application of multiple convolutional layers at
+different stages. Unfortunately, all of the images were taken at the
+same $\ang{45}\pm\ang{5}$ angle and it stands to reason that the models
+would perform significantly worse on images taken under different
+conditions.

 \textcite{ramos-giraldo2020} detected water stress in soybean and corn
-crops with a pretrained model based on DenseNet-121. Low-cost cameras
-deployed in the field provided the training data over a 70-day
-period. They achieved a classification accuracy for the degree of
-wilting of 88\%.
+crops with a pretrained model based on DenseNet-121 (see
+section~\ref{sssec:theory-densenet}). Low-cost cameras deployed in the
+field provided the training data over a $70$-day period. They achieved
+a classification accuracy of 88\% for the degree of wilting.

-In a later study, the same authors~\cite{ramos-giraldo2020a} deployed
+In a later study, the same authors \cite{ramos-giraldo2020a} deployed
 their machine learning model in the field to test it for production
 use. They installed multiple Raspberry Pis with attached Raspberry Pi
 Cameras which took images in $\qty{30}{\minute}$ intervals. The
@@ -1797,27 +1799,26 @@ classification scores on corn and soybean with a low-cost setup.

 \textcite{azimi2020} demonstrate the efficacy of deep learning models
 versus classical machine learning models on chickpea plants. The
 authors created their own dataset in a laboratory setting for stressed
-and non-stressed plants. They acquired 8000 images at eight different
-angles in total. For the classical machine learning models, they
-extracted feature vectors using scale-invariant feature transform
-(SIFT) and histogram of oriented gradients (HOG). The features are fed
-into three classical machine learning models: support vector machine
-(SVM), k-nearest neighbors (KNN), and a decision tree (DT) using the
-classification and regression (CART) algorithm. On the deep learning
-side, they used their own CNN architecture and the pre-trained
-ResNet-18 model. The accuracy scores for the classical models was in
-the range of $\qty{60}{\percent}$ to $\qty{73}{\percent}$ with the SVM
-outperforming the two others. The CNN achieved higher scores at
-$\qty{72}{\percent}$ to $\qty{78}{\percent}$ and ResNet-18 achieved
-the highest scores at $\qty{82}{\percent}$ to
-$\qty{86}{\percent}$. 
The results clearly show the superiority of deep
-learning over classical machine learning. A downside of their approach
-lies in the collection of the images. The background in all images was
-uniformly white and the plants were prominently placed in the
-center. It should, therefore, not be assumed that the same
-classification scores can be achieved on plants in the field with
-messy and noisy backgrounds as well as illumination changes and so
-forth.
+and non-stressed plants. They acquired $8000$ images at eight
+different angles in total. For the classical machine learning models,
+they extracted feature vectors using \gls{sift} and \gls{hog}. The
+features were fed into three classical machine learning models:
+\gls{svm}, \gls{k-nn}, and a \gls{dt} using the \gls{cart}
+algorithm. On the deep learning side, they used their own \gls{cnn}
+architecture and the pretrained ResNet-18 (see
+section~\ref{sssec:theory-resnet}) model. The accuracy scores for the
+classical models were in the range of $\qty{60}{\percent}$ to
+$\qty{73}{\percent}$ with the \gls{svm} outperforming the two
+others. The \gls{cnn} achieved higher scores at $\qty{72}{\percent}$
+to $\qty{78}{\percent}$ and ResNet-18 achieved the highest scores at
+$\qty{82}{\percent}$ to $\qty{86}{\percent}$. The results clearly show
+the superiority of deep learning over classical machine learning. A
+downside of their approach lies in the collection of the images. The
+background in all images was uniformly white and the plants were
+prominently placed in the center. It should, therefore, not be assumed
+that the same classification scores can be achieved on plants in the
+field with messy and noisy backgrounds as well as illumination changes
+and so forth.

 A significant problem in the detection of water stress is posed by the
 evolution of indicators across time. Since physiological features such
@@ -2189,27 +2190,28 @@ validation and testing, respectively.

 Of the 91479 images, around 10\% were used for the test phase. These
 images contain a total of 12238 ground truth labels.
 Table~\ref{tab:yolo-metrics} shows precision, recall and the
-harmonic mean of both (F1-score). The results indicate that the model
-errs on the side of sensitivity because recall is higher than
-precision. Although some detections are not labeled as plants in the
-dataset, if there is a labeled plant in the ground truth data, the
-chance is high that it will be detected. This behavior is in line with
-how the model's detections are handled in practice. The detections are
-drawn on the original image and the user is able to check the bounding
-boxes visually. If there are wrong detections, the user can ignore
-them and focus on the relevant ones instead. A higher recall will thus
-serve the user's needs better than a high precision.
+harmonic mean of both ($\mathrm{F}_1$-score). The results indicate
+that the model errs on the side of sensitivity because recall is
+higher than precision. Although some detections are not labeled as
+plants in the dataset, if there is a labeled plant in the ground truth
+data, the chance is high that it will be detected. This behavior is in
+line with how the model's detections are handled in practice. The
+detections are drawn on the original image and the user is able to
+check the bounding boxes visually. If there are wrong detections, the
+user can ignore them and focus on the relevant ones instead. A higher
+recall will thus serve the user's needs better than a high precision. 
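+
+As a quick sanity check, the $\mathrm{F}_1$-score reported in
+table~\ref{tab:yolo-metrics} can be recomputed from precision and
+recall; the short editorial sketch below reproduces the tabulated
+value:
+
+\begin{verbatim}
+# Values taken from the object detection results table.
+precision, recall = 0.547571, 0.737866
+
+# The F1-score is the harmonic mean of precision and recall.
+f1 = 2 * precision * recall / (precision + recall)
+print(f"{f1:.6f}")  # 0.628633, matching the table
+\end{verbatim}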
\begin{table}[h]
   \centering
   \begin{tabular}{lrrrr}
     \toprule
-    {} & Precision & Recall & F1-score & Support \\
+    {} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
     \midrule
     Plant & 0.547571 & 0.737866 & 0.628633 & 12238 \\
     \bottomrule
   \end{tabular}
-  \caption{Precision, recall and F1-score for the object detection model.}
+  \caption{Precision, recall and $\mathrm{F}_1$-score for the object
+    detection model.}
   \label{tab:yolo-metrics}
 \end{table}
@@ -2330,26 +2332,26 @@ increase again after epoch 27.
   \centering
   \begin{tabular}{lrrrr}
     \toprule
-    {} & Precision & Recall & F1-score & Support \\
+    {} & Precision & Recall & $\mathrm{F}_1$-score & Support \\
     \midrule
     Plant & 0.633358 & 0.702811 & 0.666279 & 12238 \\
     \bottomrule
   \end{tabular}
-  \caption{Precision, recall and F1-score for the optimized object
-    detection model.}
+  \caption{Precision, recall and $\mathrm{F}_1$-score for the
+    optimized object detection model.}
   \label{tab:yolo-metrics-hyp}
 \end{table}

 Turning to the evaluation of the optimized model on the test dataset,
 table~\ref{tab:yolo-metrics-hyp} shows precision, recall and the
-F1-score for the optimized model. Comparing these metrics with the
-non-optimized version from table~\ref{tab:yolo-metrics}, precision is
-significantly higher by more than 8.5\%. Recall, however, is 3.5\%
-lower. The F1-score is higher by more than 3.7\% which indicates that
-the optimized model is better overall despite the lower recall. We
-feel that the lower recall value is a suitable trade off for the
-substantially higher precision considering that the non-optimized
-model's precision is quite low at 0.55.
+$\mathrm{F}_1$-score for the optimized model. Comparing these metrics
+with the non-optimized version from table~\ref{tab:yolo-metrics},
+precision is significantly higher, by more than 8.5 percentage
+points. Recall, however, is 3.5 percentage points lower. The
+$\mathrm{F}_1$-score is higher by more than 3.7 percentage points,
+which indicates that the optimized model is better overall despite the
+lower recall. We feel that the lower recall value is a suitable
+trade-off for the substantially higher precision considering that the
+non-optimized model's precision is quite low at 0.55.

 The precision-recall curves in figure~\ref{fig:yolo-ap-hyp} for the
 optimized model show that the model draws looser bounding boxes than
@@ -2438,7 +2440,7 @@ The random search was run for 138 iterations which equates to a
 75\% probability that the best solution lies within 1\% of the theoretical
 maximum~\eqref{eq:opt-prob}. Figure~\ref{fig:classifier-hyp-results}
 shows three of the eight parameters and their impact on a high
-F1-score. \gls{sgd} has less variation in its results than
+$\mathrm{F}_1$-score. \gls{sgd} has less variation in its results than
 Adam~\cite{kingma2017} and manages to provide eight out of the ten
 best results. The number of epochs to train for was chosen based on
 the observation that almost all configurations converge well before
@@ -2456,17 +2458,17 @@ figure~\ref{fig:classifier-training-metrics}.
   \includegraphics{graphics/classifier-hyp-metrics.pdf}
   \caption[Classifier hyper-parameter optimization results.]{This
     figure shows three of the eight hyper-parameters and their
-    performance measured by the F1-score during 138
+    performance measured by the $\mathrm{F}_1$-score during 138
     trials. Differently colored markers show the batch size with
     darker colors representing a larger batch size. The type of marker
     (circle or cross) shows which optimizer was used. The x-axis shows
    the learning rate on a logarithmic scale. 
In general, a learning rate between 0.003 and 0.01 results in more robust and better - F1-scores. Larger batch sizes more often lead to better - performance as well. As for the type of optimizer, \gls{sgd} - produced the best iteration with an F1-score of 0.9783. Adam tends - to require more customization of its parameters than \gls{sgd} to - achieve good results.} + $\mathrm{F}_1$-scores. Larger batch sizes more often lead to + better performance as well. As for the type of optimizer, + \gls{sgd} produced the best iteration with an $\mathrm{F}_1$-score + of 0.9783. Adam tends to require more customization of its + parameters than \gls{sgd} to achieve good results.} \label{fig:classifier-hyp-results} \end{figure} @@ -2477,14 +2479,15 @@ chance due to a coincidentally advantageous train/test split, we perform stratified $10$-fold cross validation on the dataset. Each fold contains 90\% training and 10\% test data and was trained for 25 epochs. Figure~\ref{fig:classifier-hyp-roc} shows the performance of -the epoch with the highest F1-score of each fold as measured against -the test split. The mean \gls{roc} curve provides a robust metric for -a classifier's performance because it averages out the variability of -the evaluation. Each fold manages to achieve at least an \gls{auc} of -0.94, while the best fold reaches 0.98. The mean \gls{roc} has an -\gls{auc} of 0.96 with a standard deviation of 0.02. These results -indicate that the model is accurately predicting the correct class and -is robust against variations in the training set. +the epoch with the highest $\mathrm{F}_1$-score of each fold as +measured against the test split. The mean \gls{roc} curve provides a +robust metric for a classifier's performance because it averages out +the variability of the evaluation. Each fold manages to achieve at +least an \gls{auc} of 0.94, while the best fold reaches 0.98. The mean +\gls{roc} has an \gls{auc} of 0.96 with a standard deviation of +0.02. These results indicate that the model is accurately predicting +the correct class and is robust against variations in the training +set. \begin{table} \centering @@ -2508,47 +2511,49 @@ is robust against variations in the training set. \includegraphics{graphics/classifier-hyp-folds-roc.pdf} \caption[Mean \gls{roc} and variability of hyper-parameter-optimized model.]{This plot shows the \gls{roc} curve for the epoch with the - highest F1-score of each fold as well as the \gls{auc}. To get a - less variable performance metric of the classifier, the mean - \gls{roc} curve is shown as a thick line and the variability is - shown in gray. The overall mean \gls{auc} is 0.96 with a standard - deviation of 0.02. The best-performing fold reaches an \gls{auc} - of 0.99 and the worst an \gls{auc} of 0.94. The black dashed line - indicates the performance of a classifier which picks classes at - random ($\mathrm{\gls{auc}} = 0.5$). The shapes of the \gls{roc} - curves show that the classifier performs well and is robust - against variations in the training set.} + highest $\mathrm{F}_1$-score of each fold as well as the + \gls{auc}. To get a less variable performance metric of the + classifier, the mean \gls{roc} curve is shown as a thick line and + the variability is shown in gray. The overall mean \gls{auc} is + 0.96 with a standard deviation of 0.02. The best-performing fold + reaches an \gls{auc} of 0.99 and the worst an \gls{auc} of + 0.94. 
The black dashed line indicates the performance of a
+    classifier which picks classes at random
+    ($\mathrm{\gls{auc}} = 0.5$). The shapes of the \gls{roc} curves
+    show that the classifier performs well and is robust against
+    variations in the training set.}
   \label{fig:classifier-hyp-roc}
 \end{figure}

 The classifier shows good performance so far, but care has to be taken
-to not overfit the model to the training set. Comparing the F1-score
-during training with the F1-score during testing gives insight into
-when the model tries to increase its performance during training at
-the expense of generalizability. Figure~\ref{fig:classifier-hyp-folds}
-shows the F1-scores of each epoch and fold. The classifier converges
+to not overfit the model to the training set. Comparing the
+$\mathrm{F}_1$-score during training with the $\mathrm{F}_1$-score
+during testing gives insight into when the model tries to increase its
+performance during training at the expense of
+generalizability. Figure~\ref{fig:classifier-hyp-folds} shows the
+$\mathrm{F}_1$-scores of each epoch and fold. The classifier converges
 quickly to 1 for the training set, at which point it experiences a
 slight drop in generalizability. Training the model for at most five
 epochs is sufficient because there are generally no improvements
 afterwards. The best-performing epoch for each fold is between the
 second and fourth epoch which is just before the model achieves an
-F1-score of 1 on the training set.
+$\mathrm{F}_1$-score of 1 on the training set.

 \begin{figure}
   \centering
   \includegraphics[width=.9\textwidth]{graphics/classifier-hyp-folds-f1.pdf}
-  \caption[F1-score of stratified $10$-fold cross validation.]{These
-    plots show the F1-score during training as well as testing for
-    each of the folds. The classifier converges to 1 by the third
-    epoch during the training phase, which might indicate
-    overfitting. However, the performance during testing increases
-    until epoch three in most cases and then stabilizes at
-    approximately 2-3\% lower than the best epoch. We believe that the
-    third, or in some cases fourth, epoch is detrimental to
-    performance and results in overfitting, because the model achieves
-    an F1-score of 1 for the training set, but that gain does not
-    transfer to the test set. Early stopping during training
-    alleviates this problem.}
+  \caption[$\mathrm{F}_1$-score of stratified $10$-fold cross
+  validation.]{These plots show the $\mathrm{F}_1$-score during
+    training as well as testing for each of the folds. The classifier
+    converges to 1 by the third epoch during the training phase, which
+    might indicate overfitting. However, the performance during
+    testing increases until epoch three in most cases and then
+    stabilizes at approximately 2--3\% lower than the best epoch. We
+    believe that the third, or in some cases fourth, epoch is
+    detrimental to performance and results in overfitting, because the
+    model achieves an $\mathrm{F}_1$-score of 1 for the training set,
+    but that gain does not transfer to the test set. Early stopping
+    during training alleviates this problem.}
   \label{fig:classifier-hyp-folds}
 \end{figure}
@@ -2655,7 +2660,7 @@ bounding boxes of healthy plants and 494 of stressed plants. 
\centering \begin{tabular}{lrrrr} \toprule - {} & precision & recall & f1-score & support \\ + {} & precision & recall & $\mathrm{F}_{1}$-score & support \\ \midrule Healthy & 0.665 & 0.554 & 0.604 & 766 \\ Stressed & 0.639 & 0.502 & 0.562 & 494 \\ @@ -2664,15 +2669,17 @@ bounding boxes of healthy plants and 494 of stressed plants. weighted avg & 0.655 & 0.533 & 0.588 & 1260 \\ \bottomrule \end{tabular} - \caption{Precision, recall and F1-score for the aggregate model.} + \caption{Precision, recall and $\mathrm{F}_1$-score for the + aggregate model.} \label{tab:model-metrics} \end{table} -Table~\ref{tab:model-metrics} shows precision, recall and the F1-score -for both classes \emph{Healthy} and \emph{Stressed}. Precision is -higher than recall for both classes and the F1-score is at -0.59. Unfortunately, these values do not take the accuracy of bounding -boxes into account and thus have only limited expressive power. +Table~\ref{tab:model-metrics} shows precision, recall and the +$\mathrm{F}_1$-score for both classes \emph{Healthy} and +\emph{Stressed}. Precision is higher than recall for both classes and +the $\mathrm{F}_1$-score is at 0.59. Unfortunately, these values do +not take the accuracy of bounding boxes into account and thus have +only limited expressive power. Figure~\ref{fig:aggregate-ap} shows the precision and recall curves for both classes at different \gls{iou} thresholds. The left plot @@ -2716,7 +2723,7 @@ section~\ref{ssec:aggregate-model}. \centering \begin{tabular}{lrrrr} \toprule - {} & precision & recall & f1-score & support \\ + {} & precision & recall & $\mathrm{F}_{1}$-score & support \\ \midrule Healthy & 0.711 & 0.555 & 0.623 & 766 \\ Stressed & 0.570 & 0.623 & 0.596 & 494 \\ @@ -2725,22 +2732,23 @@ section~\ref{ssec:aggregate-model}. weighted avg & 0.656 & 0.582 & 0.612 & 1260 \\ \bottomrule \end{tabular} - \caption{Precision, recall and F1-score for the optimized aggregate - model.} + \caption{Precision, recall and $\mathrm{F}_1$-score for the + optimized aggregate model.} \label{tab:model-metrics-hyp} \end{table} -Table~\ref{tab:model-metrics-hyp} shows precision, recall and F1-score -for the optimized model on the same test dataset of 640 images. All of -the metrics are better for the optimized model. In particular, -precision for the healthy class could be improved significantly while -recall remains at the same level. This results in a better F1-score -for the healthy class. Precision for the stressed class is lower with -the optimized model, but recall is significantly higher (0.502 -vs. 0.623). The higher recall results in a 3\% gain for the F1-score -in the stressed class. Overall, precision is the same but recall has +Table~\ref{tab:model-metrics-hyp} shows precision, recall and +$\mathrm{F}_1$-score for the optimized model on the same test dataset +of 640 images. All of the metrics are better for the optimized +model. In particular, precision for the healthy class could be +improved significantly while recall remains at the same level. This +results in a better $\mathrm{F}_1$-score for the healthy +class. Precision for the stressed class is lower with the optimized +model, but recall is significantly higher (0.502 vs. 0.623). The +higher recall results in a 3\% gain for the $\mathrm{F}_1$-score in +the stressed class. Overall, precision is the same but recall has improved significantly, which also results in a noticeable improvement -for the average F1-score across both classes. +for the average $\mathrm{F}_1$-score across both classes. 
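+
+The weighted average in table~\ref{tab:model-metrics-hyp} follows
+directly from the per-class rows; the short editorial sketch below
+recomputes it, weighting each class by its support:
+
+\begin{verbatim}
+# Per-class F1-scores and supports from the optimized aggregate model.
+f1 = {"Healthy": (0.623, 766), "Stressed": (0.596, 494)}
+
+total = sum(n for _, n in f1.values())            # 1260 test boxes
+weighted_f1 = sum(s * n for s, n in f1.values()) / total
+print(round(weighted_f1, 3))                      # 0.612, as tabulated
+\end{verbatim}
+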
\begin{figure} \centering