\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}

\begin{document}

\title{Similarity Modeling 1/2 Abstracts}

\author{\IEEEauthorblockN{Tobias Eidelpes}
\IEEEauthorblockA{\textit{TU Wien}\\
Vienna, Austria \\
e1527193@student.tuwien.ac.at}
}

\maketitle

% \begin{abstract}
% \end{abstract}

% \begin{IEEEkeywords}
% component, formatting, style, styling, insert
% \end{IEEEkeywords}

\section{Setting}

To understand the term \emph{Similarity Modeling} and what it encompasses, it is first important to know how we as humans perceive and understand the things we pick up. An illustrative example is seeing (\emph{detecting}) a face, \emph{recognizing} it and deriving the emotion attached to it. These three steps sit on a figurative \emph{semantic ladder}, where detecting a face is at the bottom and recognizing emotion at the top. Face detection thus carries relatively low semantic meaning, whereas recognizing emotion is a much more sophisticated process. Humans can only carry out these three steps because they have internal models of the faces they see, of whether they have seen them before and of which emotion is attached to how a face looks. These models are acquired from a young age through the process of learning. Visual stimuli and models alone are not enough to conclude whether a certain face appears similar or not. The process connecting stimuli and models is the comparison of the two, also called \emph{looking for similarities}. Together, modeling and looking for similarities can be summarized under the term \emph{Similarity Modeling}.

The goal of Similarity Modeling is usually to find a \emph{class} for the object of interest. The flow of information thus starts with the stimulus, continues to the modeling part, where we derive a model of the stimulus, and, after finding similarities to existing knowledge, ends in a class or label. As mentioned previously, the existing knowledge is fed back into the modeling process, which describes the feedback loop we call learning. The difficult part lies in properly modeling the input stimulus. It is impossible to store every stimulus verbatim in our existing knowledge base, because saving every variety of a stimulus would amount to far too much data. Therefore, classification systems need the modeling step to \emph{break down} the stimulus into small components which generalize well. The similarity part is largely the same across domains: once a proper model of a stimulus exists, checking for similarities in the preexisting knowledge base follows the same patterns, regardless of the type of stimulus.

Common problems that arise when engineers try to model and classify stimuli come from the wide variety of input signals. Some signals are local and show large, sudden increases or drops. Others are smooth, and their defining characteristic is the absence of sudden variations. Still other signals have recurring patterns (e.g. EEG) or none at all (e.g. stocks). After detection the most crucial problem remains, which is understanding semantics (also known as the \emph{semantic gap}). The next problem is getting away from the individual samples in order to construct a model, known as the \emph{gravity of the sample}. Another problem is commonly referred to as the \emph{curse of dimensionality}, where we end up with a huge parameter space and have to optimize those parameters to find good models. The last problem is bad data, which can be missing, misleading or noisy.

\section{Similarity Measurement}

The artificial process of measuring similarity in computers is shaped by the same rules and fundamentals that govern similarity measurements in humans. Understanding how similarity measurements work in humans is thus invaluable for any kind of measurement done using computers. A concept which appears in both domains is the \emph{feature space}. An example of a feature space is one with two characteristics of humans, gender and age, which we want to explore with regard to their relation to each other. Gender exists on a continuum which goes from male to female; age, on the other hand, goes from young to old. Because we are only concerned with two characteristics, we have a \mbox{two-dimensional} feature space. Theoretically, a feature space can be $n$-dimensional, where increasing values of $n$ result in increasing complexity. In our brains, the processing of inputs happens in neurons, which receive weighted signals from synapses. The neuron sums these signals and compares the result to a threshold. If the threshold is exceeded, the neuron fires and sends the information along its axon. The weights constitute the dimensions of the feature space. In computers we can populate the feature space with samples and then do either a distance measurement (negative convolution) or a cosine similarity measurement (positive convolution). Since the cosine similarity measurement uses the product of two vectors, it is at its maximum when the two factors are the same. It is much more discriminatory than the distance measurement. Distance measurements are also called \emph{thematic} or \emph{integral}, whereas cosine similarity measurements are called \emph{taxonomic} or \emph{separable}. Because the latter exhibits highly taxonomic traits, questions of high semantics such as ``is this person old?'', which require a \emph{true} (1) or \emph{false} (0) answer, fit the discriminatory properties of cosine similarity.

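As a minimal sketch of the two measurements, the following Python snippet places two samples in a two-dimensional feature space and computes the Euclidean distance (negative convolution) as well as the cosine similarity (positive convolution); the feature values are made up for illustration.

\begin{verbatim}
import numpy as np

# Two samples in a 2D feature space (gender, age);
# the values are arbitrary illustration data.
a = np.array([0.2, 0.8])
b = np.array([0.3, 0.6])

# Negative convolution: Euclidean distance.
distance = np.linalg.norm(a - b)

# Positive convolution: cosine similarity, i.e. the
# normalized dot product of the two vectors.
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(distance, cosine)
\end{verbatim}
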
The relationship between distance and similarity measurements is described by the \emph{generalization} function. Whenever the distance is zero, the similarity is one. Conversely, similarity is at its lowest when the distance is at its highest. The relationship in between the extremes is nonlinear and described by the function $g(d)=s=e^{-d}$, which means that even small increases in distance disproportionately reduce similarity. Generalization allows us to convert distance measurements to similarity measurements and vice versa.

\begin{equation}
  \label{eq:dpm}
  \mathrm{dpm} = \alpha\cdot\vec{s} + (1-\alpha)\cdot
  g(\vec{d})\quad\mathrm{with}\quad\alpha\in[0,1]
\end{equation}

Cosine similarity and distance measurements can be combined to form \emph{Dual Process Models of Similarity} (DPMs). One such example is given in \eqref{eq:dpm}, where both measurements are weighted and the distance measurement is expressed as a similarity measure using the generalization function. DPMs model human perception particularly well, but are not widely used in the computer science domain.

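A minimal Python sketch of \eqref{eq:dpm}, assuming the generalization function $g(d)=e^{-d}$ from above and an arbitrary weight $\alpha$:

\begin{verbatim}
import numpy as np

def generalization(d):
    # g(d) = e^(-d): turns a distance into a similarity.
    return np.exp(-d)

def dual_process_model(s, d, alpha=0.5):
    # Weighted combination of a cosine similarity s and
    # a distance d expressed as a similarity via g(d).
    return alpha * s + (1.0 - alpha) * generalization(d)

# Example with the values from the previous sketch.
print(dual_process_model(s=0.98, d=0.22, alpha=0.7))
\end{verbatim}
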
\section{Feature Engineering}

Contrary to popular opinion, the rise of deep learning methods in areas such as object recognition has not superseded the classical approach of feature engineering in other areas. Particularly in the audio domain and for motion detection in videos, for example, feature engineering is still the dominant method. This is underlined by the fact that classical methods require much less processing, which can be beneficial or even crucial for certain applications (e.g. edge computing).

Feature engineering is part of the pipeline which transforms input data into classes and labels for that data. After modeling comes feature extraction, so that these features can be mapped into the feature space. After the classification step, we end up with labels corresponding to the input data and the features we want. In practice, feature engineering deals with analyzing input signals. Common features one might be interested in during analysis are the loudness (amplitude), the rhythm or the motion of a signal.

There are four main features of interest when analyzing visual data: color, texture, shape and foreground versus background. Starting with color, the first thing that springs to mind is to use the RGB color space to detect specific colors. Depending on the application, this might not be the best choice, because the three channels represent \emph{pure} red, green and blue, and different hues of a color require a change of all three parameters. Other color spaces such as hue, saturation and value (HSV) are better suited for color recognition, since we are usually only interested in the hue of a color and can therefore better generalize the detection space. Another option is the \emph{CIE XYZ} color space, which is applicable to situations where adherence to how human vision works is beneficial. For broadcasting applications, color is often encoded as \emph{YCrCb}, where \emph{Y} represents the luma (brightness) and \emph{Cr} and \emph{Cb} represent the color differences $R-Y$ and $B-Y$ respectively. To find a dominant color within an image, we can choose to only look at certain sections of the frame, e.g. the center or the largest continuous region of color. Another approach is to use a color histogram to count the number of different hues within the frame.

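As an illustration of the HSV approach, the following sketch converts RGB pixels to HSV with Python's standard \texttt{colorsys} module and builds a coarse hue histogram to estimate the dominant hue; the number of hue bins is an arbitrary choice.

\begin{verbatim}
import colorsys
import numpy as np

def dominant_hue(pixels, bins=12):
    # pixels: array of shape (n, 3) with RGB in [0, 1].
    hues = np.array([colorsys.rgb_to_hsv(r, g, b)[0]
                     for r, g, b in pixels])
    # Coarse hue histogram; the bin count is arbitrary.
    hist, edges = np.histogram(hues, bins=bins,
                               range=(0.0, 1.0))
    peak = np.argmax(hist)
    return (edges[peak] + edges[peak + 1]) / 2

frame = np.random.rand(1000, 3)  # stand-in for pixels
print(dominant_hue(frame))
\end{verbatim}
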
Recognizing objects by their texture can be divided into three different methods. One approach is to look at the direction pixels are oriented towards, to get a measure of \emph{directionality}. Secondly, \emph{rhythm} allows us to detect whether a patch of information (micro block) is repeated in its neighborhood through \emph{autocorrelation}. Autocorrelation takes one neighborhood and compares it, usually using a generalized distance measure, to all other neighborhoods. If the similarity exceeds a certain threshold, there is a high probability that a rhythm exists. Third, coarseness can be detected by applying a similar process, but looking at different window sizes to determine if there is any loss of information. If there is no loss of information in the compressed (smaller) window, the image information is coarse.

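A minimal sketch of the rhythm check described above, assuming one-dimensional grayscale data, distances converted to similarities via $g(d)=e^{-d}$ and an arbitrary threshold:

\begin{verbatim}
import numpy as np

def has_rhythm(row, block=8, threshold=0.5):
    # Compare the first micro block against all later
    # blocks of the same size (autocorrelation).
    ref = row[:block]
    sims = []
    for start in range(block, len(row) - block, block):
        d = np.linalg.norm(ref - row[start:start + block])
        sims.append(np.exp(-d))     # generalization g(d)
    # A high similarity elsewhere -> repeated pattern.
    return max(sims) > threshold

micro = np.array([0., 1., 0., 1., 1., 0., 1., 0.])
print(has_rhythm(np.tile(micro, 16)))   # True
\end{verbatim}
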
Shape detection can be realized using \emph{kernels} of different sizes and with different values. An edge detection algorithm might use a Sobel matrix to compare neighborhoods of an image. If the similarity is high, there is a high probability of there being an edge in that neighborhood.

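A small sketch of Sobel-based edge detection, assuming a grayscale image as a NumPy array and using SciPy for the two-dimensional convolution:

\begin{verbatim}
import numpy as np
from scipy.signal import convolve2d

# Sobel kernels for horizontal and vertical gradients.
kx = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]])
ky = kx.T

def edge_magnitude(image):
    gx = convolve2d(image, kx, mode="same",
                    boundary="symm")
    gy = convolve2d(image, ky, mode="same",
                    boundary="symm")
    return np.hypot(gx, gy)   # gradient magnitude

img = np.zeros((32, 32))
img[:, 16:] = 1.0             # vertical edge
print(edge_magnitude(img).max())
\end{verbatim}
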
Foreground and background detection relies on the assumption that the coarseness is on average higher for the background than for the foreground. This only makes sense if videos have been recorded with a shallow depth of field, so that the background is much more blurred than the foreground.

For audio feature extraction, three properties are of relevance: loudness, fundamental frequency and rhythm. Specific audio sources have a distinct loudness to them; classical music, for example, has a higher standard deviation of loudness than metal. The fundamental frequency can be particularly helpful in distinguishing speech from music by analyzing the \emph{zero crossing rate} (ZCR). Speech has a lower ZCR than music, because there is a limit on how fast humans can speak. Audio signals can often be made up of distinct patterns which are described by the attack, decay, sustain and release (ADSR) model. This model is effective for rhythm detection.

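A minimal sketch of a zero crossing rate computation on a mono signal; the synthetic sine waves merely stand in for speech-like and music-like content.

\begin{verbatim}
import numpy as np

def zero_crossing_rate(signal):
    # Fraction of adjacent sample pairs with differing sign.
    signs = np.sign(signal)
    return np.mean(signs[:-1] != signs[1:])

sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 200 * t)    # low-frequency tone
high = np.sin(2 * np.pi * 3000 * t)  # high-frequency tone
print(zero_crossing_rate(low), zero_crossing_rate(high))
\end{verbatim}
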
Motion in videos is easily detected using cross-correlation between previous or subsequent frames. As with cross-correlation in other domains, a similarity measure is calculated from two frames, and if the result exceeds a threshold, there is movement. The similarity measurements can be aggregated to provide a robust detection of camera movement.

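A minimal sketch of this idea, using a plain frame difference as a simple stand-in for a full cross-correlation between consecutive grayscale frames; the threshold is arbitrary.

\begin{verbatim}
import numpy as np

def has_motion(prev_frame, next_frame, threshold=0.05):
    # Mean absolute difference between consecutive frames;
    # large differences indicate movement in the scene.
    diff = np.mean(np.abs(next_frame - prev_frame))
    return diff > threshold

a = np.zeros((64, 64))
b = np.roll(a, 4, axis=1)   # shifted copy of the frame
b[:, :4] = 1.0              # new content entering the frame
print(has_motion(a, b))
\end{verbatim}
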
\section{Classification}

The setting for classification is described by taking a feature space and clustering the samples within that feature space. The smaller and better-defined the clusters are, the better the classification works. At the same time we want to have a high covariance between clusters, so that different classes are easily distinguishable. Classification is another filtering method which reduces the input data, sometimes on the order of millions of dimensions, into simple predicates, e.g. \emph{yes} or \emph{no} instances. The goal of classification is therefore that semantic enrichment comes along with the filtering process.

The two fundamental methods used in classification are \emph{separation} and \emph{hedging}. Separation tries to draw a line between different classes in the feature space. Hedging, on the other hand, uses perimeters to cluster samples. Additionally, the centroid of each cluster is calculated and the covariance between two centroids acts as a measure of separation. Both methods can be linked to \emph{concept theories} such as the \emph{classical} and the \emph{prototype} theory. While the classical theory classifies things based on their necessary and sufficient conditions, prototype theory uses typical examples to come to a conclusion about a particular thing. The former can be mapped to the fundamental method of separation in machine learning, whereas the latter maps to the method of hedging. In the big picture, hedging is remarkably similar to negative convolution, as discussed earlier. Separation, on the other hand, has parallels with positive convolution.

If we take separation as an example, there are multiple ways to split classes using a simple line. One could draw a straight line between two classes without caring about individual samples, which are then misclassified. This often results in so-called \emph{underfitting}, because the line is too simple to capture the structure of the classes. Conversely, if the line bends around too many individual samples and is a function of high degree, the classifier is likely \emph{overfitting} and will not work well on a dataset it has not seen before. Underfitting and overfitting are common pitfalls to avoid, as the best classifier lies somewhere in between the two. To be able to properly train, validate and test a classifier, the data are split into these three different categories.

\emph{Unsupervised classification} or \emph{clustering} employs either a bottom-up or a top-down approach. Regardless of the chosen method, unsupervised classification works with unlabeled data. The goal is to construct a \emph{dendrogram}, which consists of distance measures between samples and between samples and centroids. In the bottom-up approach an individual sample marks a leaf of the tree-like dendrogram and is connected through a negative convolution measurement to neighboring samples. In the top-down approach the dendrogram is not built from the leaves, but by starting from the centroid of the entire feature space. Distance measurements to samples within the field recursively construct the dendrogram until all samples are included.

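A minimal sketch of the bottom-up variant, using SciPy's agglomerative (hierarchical) clustering on random two-dimensional samples; the returned linkage matrix encodes the dendrogram.

\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Random samples in a 2D feature space (illustration).
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(0, 0.3, (10, 2)),
                     rng.normal(3, 0.3, (10, 2))])

# Bottom-up clustering: repeatedly merge the closest
# clusters; Z is the dendrogram as a linkage matrix.
Z = linkage(samples, method="average",
            metric="euclidean")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
\end{verbatim}
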
One method of \emph{supervised classification} is the \emph{vector space model}. It is well-suited for finding items which are similar to a given item (= the query or hedge). Usually, a simple distance measurement such as the Euclidean distance provides results which are good enough, especially for online shops where there are millions of products on offer and a more sophisticated approach would be too costly.

Another method is \emph{k-nearest-neighbors}, which requires ground truth data. Here, a new sample is classified by calculating the distance to all neighbors within a given diameter. The new datum is added to the cluster which contains the closest samples.

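A minimal k-nearest-neighbors sketch with NumPy, assuming labeled ground truth samples in a feature space and a majority vote among the $k$ closest neighbors:

\begin{verbatim}
import numpy as np

def knn_classify(sample, data, labels, k=3):
    # Distances from the new sample to all labeled ones.
    dists = np.linalg.norm(data - sample, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the k closest neighbors.
    votes = labels[nearest]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]

data = np.array([[0.1, 0.2], [0.2, 0.1],
                 [0.9, 0.8], [1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.15, 0.15]), data, labels))
\end{verbatim}
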
\emph{K-means} requires information about the centroids of the individual clusters. Distance measurements to the centroids determine to which cluster the new sample belongs.

\emph{Self-organizing maps} are similar to k-means, but with two changes. First, all data outside of the area of interest are ignored. Second, after a winning cluster is found, it is moved closer to the query object. The process is repeated for all other clusters. This second variation on k-means constitutes the first application of the concept of \emph{learning}.

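A minimal sketch of this learning step, assuming a set of centroids and a single query sample; the winning centroid is pulled towards the sample by an arbitrary learning rate.

\begin{verbatim}
import numpy as np

def update_winner(centroids, sample, rate=0.2):
    # Find the winning centroid (closest to the sample)
    dists = np.linalg.norm(centroids - sample, axis=1)
    winner = np.argmin(dists)
    # ... and move it a little closer to the sample.
    centroids[winner] += rate * (sample - centroids[winner])
    return centroids

cents = np.array([[0.0, 0.0], [1.0, 1.0]])
print(update_winner(cents, np.array([0.2, 0.1])))
\end{verbatim}
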
\emph{Decision trees} divide the feature space into arbitrarily-sized regions. Multiple regions define a particular class. In practice this method is highly prone to overfitting, which is why many trees are combined to form a random forest classifier.

\emph{Random forest classifiers} construct many decision trees and pick the best-performing ones. Such classifiers are also called \emph{ensemble methods}.

\emph{Deep networks} started off as simple \emph{perceptrons}, which were ineffective at solving the XOR problem. The conclusion was that there had to be additional hidden layers and back propagation to adjust the weights of the layers. It turned out that simply stacking hidden layers is ineffective too, because the back-propagated error shrinks from layer to layer and barely updates the early layers (\emph{vanishing gradients}). With \emph{convolutional neural networks} (CNNs) all that changed, because they combine automatic feature engineering with simple classification, processing on the GPU and effective training.

The \emph{radial basis function} network is a simpler classifier which consists of one input layer, one hidden layer and one output layer. In the hidden layer we compare the input values to codebook vectors using negative convolution and apply the generalization function. In the output layer the outputs of the hidden layer are multiplied by the output weights. This results in the aforementioned dual process model, where negative convolution and positive convolution are combined to form the output.

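A minimal sketch of such a forward pass, assuming a handful of codebook vectors, the generalization function $g(d)=e^{-d}$ and arbitrary output weights:

\begin{verbatim}
import numpy as np

def rbf_forward(x, codebook, weights):
    # Hidden layer: distance to each codebook vector
    # (negative convolution) turned into a similarity.
    d = np.linalg.norm(codebook - x, axis=1)
    hidden = np.exp(-d)            # generalization g(d)
    # Output layer: weighted sum (positive convolution).
    return weights @ hidden

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([0.3, 0.7])     # arbitrary weights
print(rbf_forward(np.array([0.9, 0.8]), codebook, weights))
\end{verbatim}
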
\section{Evaluation}

An important, if not the most important, part of similarity modeling is evaluating the performance of classifiers. A straightforward way to do so is to analyze the \emph{confusion matrix}. A confusion matrix contains the output of the classifier on one axis and the ground truth on the other axis. If the classifier says something is relevant and the ground truth says so as well, we have a true positive. The same applies to negatives: where both agree, these are called true negatives. However, if the ground truth says something is irrelevant, but the classifier says it is relevant, we have a false positive. Conversely, false negatives occur when the classifier says something is irrelevant when it is in fact relevant.

From the confusion matrix we can derive \emph{recall} or the \emph{true positive rate}. It is calculated by dividing the true positives by the sum of the true positives and false negatives. If the ratio is close to one, the classifier recognizes almost everything that is relevant. Recall on its own is not always helpful, because there is the possibility that the classifier recognizes everything correctly but has a high \emph{false positive rate}. The false positive rate is defined as the false positives divided by the sum of the false positives and the true negatives. A low false positive rate combined with a high recall is desirable. Third, \emph{precision} is another measure of pollution, similar to the false positive rate. It is defined as the true positives divided by the sum of the true positives and the false positives. An advantage of precision is that it can be calculated just from the output of the classifier. Precision and recall are inversely related: a recall of one can always be achieved by classifying everything as relevant, but then the precision drops to the share of relevant items in the collection, and vice versa.

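A minimal sketch computing these three measures from the four cells of a confusion matrix; the counts are made up for illustration.

\begin{verbatim}
def evaluation_measures(tp, fp, fn, tn):
    recall = tp / (tp + fn)        # true positive rate
    fpr = fp / (fp + tn)           # false positive rate
    precision = tp / (tp + fp)
    return recall, fpr, precision

# Made-up confusion matrix counts.
print(evaluation_measures(tp=80, fp=10, fn=20, tn=90))
\end{verbatim}
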
All three measures can be visualized by the \emph{recall-precision graph} or the \emph{receiver operating characteristic curve} (ROC curve). The latter plots the false positive rate on the x-axis against the true positive rate on the y-axis.

\section{Perception and Psychophysics}

Human perception happens in the brain, where we have approximately $10^{10}$ neurons and even more synapses ($10^{13}$). Neurons are connected to around 3\% of their neighbors, and new connections never cease to be built, unlike neurons themselves, which are only created up to a young age. The perceptual load on our senses is at around 1~Gb/s. To deal with all these data, most of them are ignored and only later reconstructed in the brain, if they are needed. The olfactory sense requires the largest number of cells, whereas the aural sense requires the fewest. One reason for the low number of cells for hearing is that the pre-processing in our ears is very sophisticated, so less processing is needed during the later stages. Vision also requires a lot of cells (on the order of $10^6$). Vision is handled in part by rods (for brightness) and cones (for color). The ratio of rods to cones is about 20 to 1, although it varies a lot from one human to the next.

Psychophysics is the study of physical stimuli ($=\Phi$) and the sensations and perceptions they produce ($=\Psi$). The relationship between the two is not linear but logarithmic, and it is described by the Weber-Fechner law \eqref{eq:wf-law}.

\begin{equation}
  \label{eq:wf-law}
  \Psi = c\cdot\log(\Phi) + a
\end{equation}

The Weber law \eqref{eq:w-law} states that the change in stimulus intensity $\Delta\Phi$ required to produce a similar response grows with the intensity $\Phi$ that is already present.

\begin{equation}
  \label{eq:w-law}
  \Delta\Phi = f(\Phi)
\end{equation}

In later years, Stanley Smith Stevens empirically developed Stevens' power law \eqref{eq:s-power-law}, whereby our perception depends on a factor $c$ multiplied by the stimulus raised to the power of \emph{Stevens' exponent} $a$, plus a constant $b$.

\begin{equation}
  \label{eq:s-power-law}
  \Psi = c\cdot\Phi^{a} + b
\end{equation}

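As a small numerical sketch of the two laws, the following evaluates \eqref{eq:wf-law} and \eqref{eq:s-power-law} for a repeatedly doubled stimulus; all constants, including the exponent, are arbitrary illustration values.

\begin{verbatim}
import numpy as np

# Doubling stimulus intensities.
phi = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Weber-Fechner: Psi = c * log(Phi) + a.
wf = 1.0 * np.log(phi) + 0.0
# Stevens: Psi = c * Phi**a + b.
stevens = 1.0 * phi ** 0.5 + 0.0

print(wf)        # constant step per doubling
print(stevens)   # grows slower than the stimulus
\end{verbatim}
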
The eye detects incoming visual stimuli with the aforementioned rods and cones. Cones are further split into three different types to detect color. Blue cones fire upon receiving light in the 420~nm range, whereas green cones react around 534~nm and red cones around 564~nm. These wavelengths only indicate where the visual system reacts the strongest. Furthermore, these numbers are averages over a large population of humans and can differ for individuals. Green and red are perceptually very close, and scientists postulate that the perception of red separated from the perception of green only recently in our evolution and is therefore still very close to it. Visual information that enters the retina is first processed by the ganglion cells, which perform edge detection. They receive their information from \emph{bipolar cells}, which either pass along the signal or block it. The ganglion cells process multiple such signals in a neighborhood and detect the length and angle of edges. After edge detection, the signal is forwarded to the visual cortex, which does object detection via the \emph{ventral pathway} and motion detection via the \emph{dorsal pathway}. Before the signal is forwarded to either of the pathways, the occipital cortex processes edge information, color blobs, texture and contours. The three-color stimulus is converted into a hue, saturation and value encoding. After that, motion and 3D information is processed. The flow of information is one of semantic enrichment, starting from edge detection and ending in motion detection. In the ventral pathway an object is detected invariant to its type, size, position or occlusion. The dorsal pathway for motion detection has to deal with multiple degrees of freedom, because the eye moves on its own and the object may move as well.

The ear consists of the ear canal, which does some filtering, the eardrum for amplification, the ossicles and the cochlea. The cochlea is the most important part, because it translates air waves first into liquid waves and then into electrical signals which are transferred to the brain. It contains a \emph{staircase} on which there are hairs of different lengths. Depending on their length, they react to either high or low frequencies. Their movement within the liquid is then transformed into electrical signals through the tip links. The threshold of hearing on the lower end exists due to physical limits: if the hairs inside the ear receive too small a stimulus, they do not move noticeably. This lower threshold depends on the received frequency and is lowest at around 4~kHz, where we hear best. High energies are needed to hear very low frequencies. The threshold on the higher end marks the point at which sounds become painful, and it serves to protect us from damaging our hearing.

\section{Spectral Features}

Because analysis of audio in the time domain is very hard, especially for identifying overtones, we apply transforms to the original data to extract more information. One such transformation was used by Pierre-Simon Laplace in 1785 to turn problems requiring difficult operations into other problems which are solvable with simpler operations. The results of the easier calculation would then be transformed back into the domain of the original problem. This first type of transformation takes a function and applies a kernel to it with positive convolution, which results in a \emph{spectrum}. Applying the kernel to the spectrum gives back the original function (back transformation). The kernel which proved suitable for this operation is given in \eqref{eq:laplace-kernel}.

\begin{equation}
  \label{eq:laplace-kernel}
  K_{xy} = e^{-xy}
\end{equation}

In 1823, instead of just having $-xy$ in the exponent, Fourier proposed to add an imaginary part $i$. The kernel can then be rewritten as $\cos(xy) - i\cdot\sin(xy)$, which makes it possible to interpret the original function much more easily using simple angular functions. The Fourier transform is a similarity measurement: it takes a set of coefficients and measures their similarity to a set of angular functions which are overlaid on each other. The imaginary part of the Fourier transform can be dealt with either by throwing the $\sin$ part of the function away or by computing the magnitude as the square root of the squared real part plus the squared imaginary part.

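A minimal sketch with NumPy's FFT, computing the magnitude spectrum of a synthetic two-tone signal; taking \texttt{np.abs} of the complex coefficients is exactly the square root of squared real and imaginary parts described above.

\begin{verbatim}
import numpy as np

sr = 1000                     # sampling rate in Hz
t = np.arange(sr) / sr
# Two overlaid tones at 50 Hz and 120 Hz.
signal = (np.sin(2 * np.pi * 50 * t)
          + 0.5 * np.sin(2 * np.pi * 120 * t))

spectrum = np.fft.rfft(signal)   # complex coefficients
magnitude = np.abs(spectrum)     # sqrt(re^2 + im^2)
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks sit at the tone frequencies.
print(freqs[np.argsort(magnitude)[-2:]])
\end{verbatim}
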
One property of the Fourier transform is that high frequencies get increasingly less well represented the more information is thrown away during the process. Steep changes are spread out over many coefficients, and if only a small fraction of the coefficients is used, the back transformation results in a basic sine wave. Another property, for image information, is that the most important parts of the image are located at the ends of the spectrum. There is hardly any information in the mid-range of the spectrum, which is why it can be concluded that the bulk of the information in images lies in the edges. Since the middle part of any spectrum is usually smoothed out by the extremes at the edges, window functions are used to more accurately represent the data we are interested in. Windowing works by multiplying the signal element-wise with a window function. Important window functions are the triangular (Bartlett) window, the raised-cosine Hamming window, the Kaiser window and the simple rectangular window. This step is known as \emph{windowing}.

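A minimal sketch of windowing before the transform, using NumPy's built-in Hamming and Bartlett windows; the choice of window and test signal is arbitrary.

\begin{verbatim}
import numpy as np

n = 256
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 8 * t)

# Element-wise multiplication with a window function
# before taking the Fourier transform.
for window in (np.hamming(n), np.bartlett(n), np.ones(n)):
    spectrum = np.fft.rfft(signal * window)
    print(np.abs(spectrum).argmax())   # dominant bin
\end{verbatim}
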
A third transformation is the \emph{discrete cosine transform}. It applies a kernel of the form $K = \cos(xy+\frac{y}{2})$. Due to the uniform nature of the Fourier transform, a lot of image information is quickly lost when coefficients are thrown away. The cosine transform, however, manages to retain much more relevant information even until almost 90\% of the coefficients have been thrown away. In contrast to the Fourier transform, which is uniform, it discriminates. Other wavelets of interest are the \emph{Mexican hat} and the \emph{Gabor wavelet}. In the area of optics, \emph{Zernike polynomials} are used to compare a measurement of a lens to \emph{ideal} optics modeled by a Zernike function. If a pre-defined threshold for the error is exceeded, the optics require a closer look during the quality assurance process.

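A minimal sketch of this energy compaction with SciPy's DCT: a smooth synthetic signal is transformed, 90\% of the coefficients are zeroed out, and the back transformation still stays close to the original.

\begin{verbatim}
import numpy as np
from scipy.fft import dct, idct

n = 256
t = np.arange(n) / n
signal = np.cos(2 * np.pi * 3 * t) + 0.3 * t

coeffs = dct(signal, norm="ortho")
# Keep only the 10% largest coefficients, zero the rest.
keep = np.argsort(np.abs(coeffs))[-n // 10:]
truncated = np.zeros_like(coeffs)
truncated[keep] = coeffs[keep]

reconstructed = idct(truncated, norm="ortho")
print(np.max(np.abs(signal - reconstructed)))
\end{verbatim}
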
While integral transforms allow the original signal to be reconstructed from the transformed spectrum, \emph{parametric transforms} do not provide that property. One such transformation is the \emph{Radon transformation}, where an axis is defined and the luminance values along that axis are summed. The axis is then rotated and the process repeated until all angles have been traversed. The resulting spectrum is rotation-invariant, and the transformation is therefore a useful pre-processing step for feature engineering. The \emph{Hough transformation} uses the gradients of an image to construct a histogram. The information presented in the histogram can be valuable for detecting regular patterns or long edges in an image.

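A minimal sketch of such projections, rotating a grayscale image with SciPy and summing the luminance values along one axis for every angle; the angular step is arbitrary.

\begin{verbatim}
import numpy as np
from scipy.ndimage import rotate

def radon_like(image, angles=range(0, 180, 10)):
    # For every angle, rotate the image and sum the
    # luminance values along one axis (one projection).
    projections = []
    for angle in angles:
        rot = rotate(image, angle,
                     reshape=False, order=1)
        projections.append(rot.sum(axis=0))
    return np.array(projections)

img = np.zeros((64, 64))
img[28:36, :] = 1.0           # a horizontal bar
print(radon_like(img).shape)  # (angles, width)
\end{verbatim}
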
Applications of the Fourier transform are identifying spectral peaks, tune recognition and timbre recognition. All of these use a form of \emph{short-time Fourier transform} (STFT). The FT can also be used for optical flow by shifting the image and recomputing the spectrum (\emph{phase correlation}). Both the FT and the CT are successfully used in music recognition and in speech recognition with the \emph{mel-frequency cepstral coefficients} (MFCC). The CT is used in MPEG-7 for \emph{color histogram encoding} and for texture computation.

\section{Semantic Modeling 200 words}

\section{Learning over Time 600 words}

\end{document}