Finish YOLO selected methods section

This commit is contained in:
Tobias Eidelpes 2023-11-29 17:41:57 +01:00
parent f664ad2b40
commit 35acd07570
3 changed files with 333 additions and 45 deletions

File diff suppressed because one or more lines are too long

Binary file not shown.


@ -132,6 +132,14 @@ Challenge}
\newacronym{bn}{BN}{Batch Normalization}
\newacronym{uav}{UAV}{Unmanned Aerial Vehicle}
\newacronym{csi}{CSI}{Camera Serial Interface}
\newacronym{nms}{NMS}{Non-Maximum Suppression}
\newacronym{sam}{SAM}{Spatial Attention Module}
\newacronym{panet}{PANet}{Path Aggregation Network}
\newacronym{ciou}{CIoU}{Complete Intersection over Union}
\newacronym{siou}{SIoU}{Scylla Intersection over Union}
\newacronym{giou}{GIoU}{Generalized Intersection over Union}
\newacronym{elan}{ELAN}{Efficient Layer Aggregation Network}
\newacronym{eelan}{E-ELAN}{Extended Efficient Layer Aggregation Network}
\begin{document}
@ -2084,20 +2092,18 @@ models.
\section{Selected Methods}
\label{sec:selected-methods}
Estimated 7 pages for this section.
In the following sections, we describe in detail the two
architectures selected for our prototype. The object detector we
chose, \gls{yolo}v7, is part of a larger family of models which all
function similarly but have undergone substantial changes from
version to version. To understand the chosen model, we trace the
improvements to the \gls{yolo} family from version one to version
seven. For the classification stage, we opted for a ResNet
architecture, which is also described in detail.
\subsection{You Only Look Once}
\label{sec:methods-detection}
Describe the inner workings of the YOLOv7 model structure and contrast
it with previous versions as well as other object detectors. What has
changed and how did these improvements manifest themselves? Reference
the original paper~\cite{wang2022} and papers of previous versions of
the same model (YOLOv5~\cite{jocher2022},
YOLOv4~\cite{bochkovskiy2020}).
Estimated 2 pages for this section.
The \gls{yolo} family of object detection models started in 2015 when
\textcite{redmon2016} published the first version. Since then, up to
16 updated versions have appeared, depending on how one counts. The
@ -2205,16 +2211,130 @@ the \gls{voc} 2007 data set compared to 63.4\% of the previous
at \qty{40}{fps} (\gls{map} 78.6\%) and up to \qty{91}{fps} (\gls{map}
69\%).
\subsubsection{\gls{yolo}v3}
\label{sssec:yolov3}
\gls{yolo}v3 \cite{redmon2018} provided additional updates to the
\gls{yolo}v2 model. To be competitive with the deeper network
structures of state-of-the-art models at the time, the authors
introduce a deeper feature extractor called Darknet-53. It makes use
of the residual connections popularized by ResNet \cite{he2016} (see
section~\ref{sssec:theory-resnet}). Darknet-53 is more accurate than
Darknet-19 and comparable in accuracy to ResNet-101, but can process
more images per second (\qty{78}{fps} versus \qty{53}{fps}). The
activation function throughout the network is still leaky \gls{relu},
as in earlier versions.
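To make the mechanism concrete, a residual connection simply adds a block's input to its output before the final activation. A minimal NumPy sketch, with fully connected layers standing in for convolutions and biases omitted for brevity:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Basic residual block: y = relu(F(x) + x), where the residual
    function F is two linear maps with a ReLU in between."""
    relu = lambda z: np.maximum(z, 0)
    f = relu(x @ w1) @ w2   # F(x), the residual to be learned
    return relu(f + x)      # identity shortcut added before the activation
```

If the weights drive $F(x)$ toward zero, the block degenerates to the identity (for non-negative inputs), which lets gradients flow unchanged through very deep stacks.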
\gls{yolo}v3 uses multi-scale predictions to achieve better detection
performance across object sizes. Inspired by \glspl{fpn} (see
section~\ref{sssec:theory-fpn}), it takes feature maps at different
scales from the feature extractor and combines them to form a final
prediction. Combining the features from multiple scales is
often done in the \emph{neck} of the object detection architecture.
Around the time of the publication of \gls{yolo}v3, researchers
started to use the terminology \emph{backbone}, \emph{neck} and
\emph{head} to describe the architecture of object detection
models. The feature extractor (Darknet-53 in this case) is the
\emph{backbone} and provides the feature maps which are aggregated in
the \emph{neck} and passed to the \emph{head} which outputs the final
predictions. In some cases there are additional postprocessing steps
in the head such as \gls{nms} to eliminate duplicate or suboptimal
detections.
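The greedy \gls{nms} procedure used in such postprocessing can be sketched in a few lines; a minimal class-agnostic NumPy version, with boxes in $[x_1, y_1, x_2, y_2]$ format:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes ([x1, y1, x2, y2])."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    all remaining boxes that overlap it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]   # indices by descending score
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep
```

Real detectors typically apply this per class and additionally filter by a confidence threshold first.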
While \gls{yolo}v2 had problems detecting small objects, \gls{yolo}v3
performs much better on them (\gls{ap} of 18.3\% versus 5\% on
\gls{coco}). The authors note, however, that the new model sometimes
has comparatively worse results with larger objects. The reasons for
this behavior are unknown. Additionally, \gls{yolo}v3 still lags
behind other detectors when it comes to accurately localizing
objects. The \gls{coco} evaluation metric was changed from the
previous \gls{ap}$_{0.5}$ to the \gls{map} averaged over \gls{iou}
thresholds from $0.5$ to $0.95$, which penalizes detectors that do not
achieve close-to-perfect \gls{iou} scores. This change highlights
\gls{yolo}v3's weakness in that area.
\subsubsection{\gls{yolo}v4}
\label{sssec:yolov4}
Keeping in line with the aim of carefully balancing accuracy and speed
of detection, \textcite{bochkovskiy2020} publish the fourth version of
\gls{yolo}. The authors investigate the use of what they term
\emph{bag of freebies}: methods which increase training cost in
exchange for higher accuracy without affecting inference speed. A
prominent example of such methods is data augmentation (see
section~\ref{sec:methods-augmentation}). Specifically, the authors
propose to use mosaic augmentation which lowers the need for large
mini-batch sizes. They also use new features such as weighted residual
connections \cite{shen2016}, a modified \gls{sam} \cite{woo2018}, a
modified \gls{panet} \cite{liu2018} for the neck, \gls{ciou} loss
\cite{zheng2020} for the detector and the Mish activation function
\cite{misra2020}.
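The \gls{ciou} loss extends a plain \gls{iou} loss with a center-distance penalty and an aspect-ratio consistency term. A minimal sketch following the formulation of \textcite{zheng2020}, for axis-aligned boxes and without handling of degenerate boxes:

```python
import math

def ciou_loss(a, b):
    """CIoU loss for boxes in [x1, y1, x2, y2] format:
    1 - IoU + center-distance penalty + aspect-ratio penalty."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    iou = inter / union
    # squared center distance over squared enclosing-box diagonal
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
            + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4
    c2 = ((max(a[2], b[2]) - min(a[0], b[0])) ** 2
          + (max(a[3], b[3]) - min(a[1], b[1])) ** 2)
    # aspect-ratio consistency term v and its trade-off weight alpha
    v = (4 / math.pi ** 2) * (math.atan((a[2] - a[0]) / (a[3] - a[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / (1 - iou + v) if (1 - iou + v) > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v
```

The loss is zero only for identical boxes and, unlike plain \gls{iou} loss, still provides a useful gradient when boxes barely overlap.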
Taken together, these additional improvements yield a \gls{map} of
43.5\% on the \gls{coco} test set while maintaining a speed of above
\qty{30}{fps} on modern \glspl{gpu}. \gls{yolo}v4 was the first
version which provided results on all scales (S, M, L) that were
better than almost all other detectors at the time without sacrificing
speed.
\subsubsection{\gls{yolo}v5}
\label{sssec:yolov5}
The author of \gls{yolo}v5 \cite{jocher2020} ported the \gls{yolo}v4
code from the Darknet framework to PyTorch, which facilitated better
interoperability with other Python utilities. New in this
version is the pretraining algorithm called AutoAnchor which adjusts
the anchor boxes based on the data set at hand. This version also
implements a genetic algorithm for hyperparameter optimization (see
section~\ref{ssec:hypopt-evo}) which is used in our work as well.
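The idea behind such a genetic search can be sketched as a simple mutate-and-select loop. The following is a minimal illustration rather than the actual \gls{yolo}v5 implementation; the \texttt{fitness} callback is a stand-in for what would, in practice, be a short training run returning a validation metric:

```python
import random

def evolve(fitness, init_hyp, generations=20, sigma=0.2, seed=0):
    """Hill-climbing genetic search: mutate the best hyperparameters
    with multiplicative Gaussian noise; keep a child only if it
    improves the fitness."""
    rng = random.Random(seed)
    best, best_fit = dict(init_hyp), fitness(init_hyp)
    for _ in range(generations):
        child = {k: v * (1 + rng.gauss(0, sigma)) for k, v in best.items()}
        child_fit = fitness(child)
        if child_fit > best_fit:
            best, best_fit = child, child_fit
    return best, best_fit
```

A production version would additionally clamp each hyperparameter to sensible bounds and mutate from the top-$k$ candidates instead of a single parent.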
Version 5 comes in multiple architectures of varying complexity. The
smallest, and therefore fastest, version is called \gls{yolo}v5n,
where the \emph{n} stands for \emph{nano}. Additional versions with
increasing parameter counts are \gls{yolo}v5s (small), \gls{yolo}v5m
(medium), \gls{yolo}v5l (large), and \gls{yolo}v5x (extra large). The
smaller models are intended for resource-constrained
environments such as edge devices, but come with a cost in
accuracy. Conversely, the larger models are for tasks where high
accuracy is paramount and enough computational resources are
available. The \gls{yolo}v5x model achieves a \gls{map} of 50.7\% on
the \gls{coco} test data set.
\subsubsection{\gls{yolo}v6}
\label{sssec:yolov6}
The authors of \gls{yolo}v6 \cite{li2022a} use a new backbone based on
RepVGG \cite{ding2021} which they call EfficientRep. They also use
different losses for classification (Varifocal loss \cite{zhang2021})
and bounding box regression (\gls{siou}
\cite{gevorgyan2022}/\gls{giou} \cite{rezatofighi2019}). \gls{yolo}v6
is made available in eight scaled versions, of which the largest
achieves a \gls{map} of 57.2\% on the \gls{coco} test set.
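The \gls{giou} formulation addresses a weakness of plain \gls{iou}: for non-overlapping boxes, \gls{iou} is zero regardless of distance and thus provides no gradient. \gls{giou} subtracts a penalty based on the smallest enclosing box; a minimal sketch:

```python
def giou(a, b):
    """Generalized IoU for boxes in [x1, y1, x2, y2] format;
    ranges from -1 (far apart) to 1 (identical)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # smallest axis-aligned box enclosing both a and b
    c_area = ((max(a[2], b[2]) - min(a[0], b[0]))
              * (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (c_area - union) / c_area
```

Because \gls{giou} decreases as disjoint boxes move apart, the loss $1 - \mathrm{GIoU}$ still pulls predictions toward distant targets.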
\subsubsection{\gls{yolo}v7}
\label{sssec:yolov7}
At the time of implementation of our own plant detector, \gls{yolo}v7
\cite{wang2022b} was the newest version within the \gls{yolo}
family. Similarly to \gls{yolo}v4, it introduces further trainable
bag-of-freebies methods which do not impact inference time. The
improvements include the use of \glspl{eelan} (based on \glspl{elan}
\cite{wang2022a}), joint depth and width model scaling techniques,
module-level reparameterization, and an auxiliary head which assists
during training, similar to GoogLeNet (see
section~\ref{sssec:theory-googlenet}). The model does not use a
pretrained backbone; instead, it is trained from scratch on the
\gls{coco} data set. These changes result in much smaller model sizes
compared to \gls{yolo}v4 and a \gls{map} of 56.8\% with a detection
speed of over \qty{30}{fps}.
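One common instance of such reparameterization is folding a \gls{bn} layer into the preceding convolution so that inference needs a single operation. A minimal sketch with a bias-free linear layer standing in for a $1\times1$ convolution; this illustrates the general technique, not \gls{yolo}v7's exact modules:

```python
import numpy as np

def fuse_conv_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into the preceding bias-free weights:
    BN(x @ w.T) == x @ wf.T + bf for the returned (wf, bf)."""
    scale = gamma / np.sqrt(var + eps)
    return w * scale[:, None], beta - mean * scale

# Sanity check: sequential layer+BN matches the fused operation.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 samples, 3 input channels
w = rng.normal(size=(2, 3))                      # 2 output channels
gamma, beta = rng.normal(size=2), rng.normal(size=2)
mean, var = rng.normal(size=2), rng.uniform(0.5, 1.5, size=2)
y_seq = gamma * ((x @ w.T) - mean) / np.sqrt(var + 1e-5) + beta
wf, bf = fuse_conv_bn(w, gamma, beta, mean, var)
assert np.allclose(y_seq, x @ wf.T + bf)
```

The training-time graph keeps the separate branches; only the deployed model is rewritten into the fused form.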
We use \gls{yolo}v7 in our own work during the plant detection stage
because it was the fastest and most accurate object detector at the
time of implementation.
\subsection{ResNet}
\label{sec:methods-classification}
Introduce the approach of the \emph{ResNet} networks which implement
residual connections to allow training deeper networks. Describe the inner
workings of the ResNet model structure. Reference the original
paper~\cite{he2016}.
Estimated 2 pages for this section.
Early research \cite{bengio1994,glorot2010} already demonstrated that
the vanishing/exploding gradient problem with standard gradient
descent and random initialization adversely affects convergence during
@ -3099,8 +3219,8 @@ Estimated 1 page for this section
\listoftables % Starred version, i.e., \listoftables*, removes the toc entry.
% Use an optional list of algorithms.
\listofalgorithms
\addcontentsline{toc}{chapter}{List of Algorithms}
% \listofalgorithms
% \addcontentsline{toc}{chapter}{List of Algorithms}
% Add an index.
\printindex
@ -3117,18 +3237,4 @@ Estimated 1 page for this section
%%% mode: latex
%%% TeX-master: "thesis"
%%% TeX-master: t
%%% End: