Add Feature Extraction

This commit is contained in:
Tobias Eidelpes 2021-10-20 18:06:46 +02:00
parent 89621232bd
commit c72ed9de6b

sim.tex

@@ -125,6 +125,78 @@ used in the computer science domain.
\section{Feature Engineering 500 words}
Contrary to popular opinion, the rise of deep learning methods in areas such as
object recognition has not superseded the classical approach of feature
engineering in other areas. Particularly in the audio domain and for motion
detection in videos, feature engineering is still the dominant method. This is
underlined by the fact that classical methods require much less processing,
which can be beneficial or even crucial for certain applications (e.g. edge
computing).

Feature engineering is part of the pipeline that transforms input data into
classes and labels for that data. After the modeling step comes feature
extraction, so that the resulting features can be mapped into the feature
space. The classification step then assigns to each input the labels
corresponding to its position in that space. In practice, feature engineering
deals with analyzing input signals. Common features one might be interested in
during analysis are the loudness (amplitude), the rhythm, or the motion of a
signal.
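
As a minimal illustration of this pipeline, the following sketch extracts a
single loudness feature from a synthetic signal and maps it to a label with a
toy threshold classifier; the feature set and the threshold are arbitrary
choices made for the example, not part of any particular method.

\begin{verbatim}
import numpy as np

def extract_features(signal):
    # Feature extraction: map the raw input signal to a feature vector.
    loudness = np.sqrt(np.mean(signal ** 2))  # RMS amplitude
    return np.array([loudness])

def classify(features, threshold=0.5):
    # Toy classifier: map the feature vector to a label.
    return "loud" if features[0] > threshold else "quiet"

signal = 0.8 * np.sin(np.linspace(0, 8 * np.pi, 1000))  # synthetic input
label = classify(extract_features(signal))               # -> "loud"
\end{verbatim}
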
There are four main features of interest when analyzing visual data: color,
texture, shape, and foreground versus background. Starting with color, the
first idea that springs to mind is to use the RGB color space to detect
specific colors. Depending on the application, this might not be the best
choice, because the three channels represent the pure primaries and changing
the hue of a color requires adjusting all three parameters (red, green and
blue). Other color spaces such as hue, saturation and value (HSV) are better
suited for color recognition, since we are usually only interested in the hue
of a color and can therefore generalize the detection better. Another option
is the \emph{CIE XYZ} color space, which is applicable to situations where
adherence to how human vision works is beneficial. For broadcasting
applications, color is often encoded using \emph{YCrCb}, where \emph{Y}
represents the luma (lightness) and \emph{Cr} and \emph{Cb} represent the
color differences $R-Y$ and $B-Y$ respectively. To find a dominant color
within an image, we can choose to only look at certain sections of the frame,
e.g. the center or the largest contiguous region of color. Another approach is
to use a color histogram to count the number of pixels per hue within the
frame.
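
A minimal sketch of the histogram approach, assuming an 8-bit BGR input image
and OpenCV's hue range of 0--179; the file name is a placeholder.

\begin{verbatim}
import cv2
import numpy as np

img = cv2.imread("frame.png")               # hypothetical input frame (BGR)
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # hue is channel 0, range 0..179
hue = hsv[..., 0].ravel()
hist = np.bincount(hue, minlength=180)      # pixel count per hue bin
dominant_hue = int(np.argmax(hist))         # most frequent hue in the frame
\end{verbatim}
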
Recognizing objects by their texture can be divided into three different
methods. One approach is to look at the direction pixels are oriented towards
to get a measure of \emph{directionality}. Secondly, \emph{rhythm} allows us
to detect whether a patch of information (micro block) is repeated in its
neighborhood through \emph{autocorrelation}. Autocorrelation takes one
neighborhood and compares it, usually using a generalized distance measure, to
all other neighborhoods. If the similarity exceeds a certain threshold, there
is a high probability that a rhythm exists. Third, \emph{coarseness} can be
detected by applying a similar process, but looking at different window sizes
to determine whether there is any loss of information. If there is no loss of
information in the compressed (smaller) window, the image information is
coarse.
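
The following sketch illustrates the autocorrelation idea, using the mean
absolute difference as the generalized distance measure and block-aligned,
non-overlapping neighborhoods; the block size and threshold are arbitrary
example values.

\begin{verbatim}
import numpy as np

def block_rhythm(image, y, x, size=8, threshold=10.0):
    # Compare one micro block (at block-aligned coordinates y, x)
    # against every other block of the same size; a small mean
    # absolute difference counts as a repetition.
    ref = image[y:y + size, x:x + size].astype(float)
    matches = 0
    h, w = image.shape
    for i in range(0, h - size + 1, size):
        for j in range(0, w - size + 1, size):
            if (i, j) == (y, x):
                continue
            block = image[i:i + size, j:j + size].astype(float)
            if np.mean(np.abs(block - ref)) < threshold:
                matches += 1
    return matches > 0  # at least one repetition suggests a rhythm
\end{verbatim}
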
Shape detection can be realized using \emph{kernels} of different sizes and
with different values. An edge detection algorithm might use a Sobel matrix to
compare neighborhoods of an image. If the filter response is high, there is a
high probability of an edge in that neighborhood.
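
A short sketch of Sobel-based edge detection, computing the gradient magnitude
from the two standard $3\times3$ Sobel kernels.

\begin{verbatim}
import numpy as np
from scipy.ndimage import convolve

# Standard Sobel kernels for horizontal and vertical gradients.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def edge_magnitude(gray):
    # Convolve a grayscale image with both Sobel kernels and return
    # the gradient magnitude; large values indicate likely edges.
    gx = convolve(gray.astype(float), sobel_x)
    gy = convolve(gray.astype(float), sobel_y)
    return np.hypot(gx, gy)
\end{verbatim}
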
Foreground and background detection relies on the assumption that the
coarseness is, on average, higher for the background than for the foreground.
This only makes sense if the video has been recorded with a shallow depth of
field, so that the background is much more blurred than the foreground.
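
One plausible way to operationalize this, sketched below, is to score each
block by the local variance of the Laplacian, a common proxy for sharpness;
note that this measure is a substitute chosen for the example, not the
coarseness measure described above.

\begin{verbatim}
import numpy as np
from scipy.ndimage import laplace

def sharpness_map(gray, size=16):
    # Per-block variance of the Laplacian: blurred (background)
    # regions score low, in-focus (foreground) regions score high.
    lap = laplace(gray.astype(float))
    h, w = gray.shape
    scores = np.zeros((h // size, w // size))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            block = lap[i * size:(i + 1) * size, j * size:(j + 1) * size]
            scores[i, j] = block.var()
    return scores  # threshold to split foreground from background
\end{verbatim}
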
For audio feature extraction, three properties are of relevance: loudness,
fundamental frequency and rhythm. Specific audio sources have a distinct
loudness to them; classical music, for example, has a higher standard
deviation of loudness than metal. The fundamental frequency can be
particularly helpful in distinguishing speech from music by analyzing the
\emph{zero crossing rate} (ZCR). Speech has a lower ZCR than music, because
there is a limit on how fast humans can speak. Audio signals are often made up
of distinct patterns which are described by the attack, decay, sustain and
release (ADSR) model. This model is effective in rhythm detection.
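
A minimal sketch of the ZCR computation, with two synthetic tones standing in
for a speech-like and a music-like signal.

\begin{verbatim}
import numpy as np

def zero_crossing_rate(frame):
    # Fraction of consecutive sample pairs whose sign differs.
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

t = np.linspace(0, 1, 16000, endpoint=False)
low = np.sin(2 * np.pi * 120 * t)    # ~120 Hz, speech-like pitch
high = np.sin(2 * np.pi * 2000 * t)  # ~2 kHz, music-like content
assert zero_crossing_rate(low) < zero_crossing_rate(high)
\end{verbatim}
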
Motion in videos is easily detected using cross-correlation between
consecutive frames. As in other domains, cross-correlation yields a similarity
measure for the two frames; if the similarity drops below a threshold
(equivalently, if a difference measure exceeds one), there is movement. The
per-frame measurements can be aggregated to provide a robust detection of
camera movement.
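
A sketch of this idea using normalized cross-correlation as the similarity
measure: consecutive frames that correlate poorly indicate movement. The
threshold is an arbitrary example value.

\begin{verbatim}
import numpy as np

def frame_similarity(prev, curr):
    # Normalized cross-correlation between two grayscale frames;
    # values near 1 mean the frames are almost identical.
    a = prev.astype(float) - prev.mean()
    b = curr.astype(float) - curr.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom else 1.0

def motion_detected(prev, curr, threshold=0.95):
    # Low similarity between consecutive frames indicates movement.
    return frame_similarity(prev, curr) < threshold
\end{verbatim}
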
\section{Classification 500 words}
\section{Evaluation 200 words}