
\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\title{Similarity Modeling 1/2 Abstracts}
\author{\IEEEauthorblockN{Tobias Eidelpes}
\IEEEauthorblockA{\textit{TU Wien}\\
Vienna, Austria \\
e1527193@student.tuwien.ac.at}
}
\maketitle
\section{Setting}
To understand the term \emph{Similarity Modeling} and what it encompasses, it is
first important to know how we as humans perceive and understand the things we
pick up. An illustrative example of this is the process of seeing
(\emph{detecting}) a face, \emph{recognizing} it, and deriving the emotion
attached to it. These three steps are placed on a figurative \emph{semantic
ladder}, where detecting a face sits on the bottom and recognizing emotion on
the top. Face detection thus carries a relatively low semantic meaning, whereas
recognizing emotion is a much more sophisticated process. Humans can only carry
out these three steps because they have internal models of the faces they see,
of whether they have seen them before, and of the emotions attached to how a
face looks. These models are acquired from a young age through the process of
learning. Visual stimuli and models alone are not enough to conclude whether a
certain face appears similar. The process connecting stimuli and models is the
comparison of the two, also called \emph{looking for similarities}. Together,
modeling and looking for similarities can be summarized under the term
\emph{Similarity Modeling}.
The goal of Similarity Modeling is usually to find a \emph{class} for the object
of interest. The flow of information thus starts with the stimulus, continues on
to the modeling part, where we derive a model of the stimulus and—after finding
similarities to existing knowledge—ends in a class or label. As mentioned
previously, the existing knowledge is fed back during the modeling process,
which constitutes the feedback loop we call learning. The difficult part lies in
properly modeling the input stimulus. It is impossible to store every stimulus
verbatim in our existing knowledge base, because saving every variety of a
stimulus would amount to far too much data. Therefore, classification
systems need the modeling step to \emph{break down} the stimulus into small
components which generalize well. The similarity part is generally the same for
various domains. Once a proper model of a stimulus exists, checking for
similarities in the preexisting knowledge base follows the same patterns,
regardless of the type of stimulus.

Common problems when engineers try to model and classify stimuli stem from the
wide variety of input signals. Some signals are local, with large and sudden
increases or drops; others are smooth, their defining characteristic being the
absence of sudden variations. Still other signals have recurring patterns (e.g.
EEG) or none at all (e.g. stocks). After detection the most crucial problem
remains: understanding semantics (also known as the \emph{semantic gap}). The
next problem is getting away from the individual samples in order to construct
a model, known as the \emph{gravity of the sample}. Another problem is commonly
referred to as the \emph{curse of dimensionality}: we end up with a huge
parameter space and have to optimize those parameters to find good models. The
last problem is bad data, which can be missing, misleading or noisy.
\section{Similarity Measurement}
The artificial process of measuring similarity in computers is shaped by the
same rules and fundamentals that govern similarity measurements in
humans. Understanding how similarity measurements work in humans is thus
invaluable for any kind of measurement done using computers. A concept which
appears in both domains is the \emph{feature space}. An example for a feature
space is one where we have two characteristics of humans, gender and age, which
we want to explore with regards to their relation to each other. Gender exists
on a continuum which goes from male to female. Age, on the other hand, goes from
young to old. Because we are only concerned with two characteristics, we have a
\mbox{two-dimensional} feature space. Theoretically, a feature space can be
$n$-dimensional, where increasing values for $n$ result in increasing
complexity. In our brains, the processing of inputs happens in neurons, which
receive weighted signals via synapses. A neuron sums its inputs and compares
the result to a threshold. If the threshold is exceeded, the neuron fires and
sends the information along its axon. The weights constitute the dimensions of
the feature space. In computers we can populate the feature space with samples
and then do either a distance (negative convolution) or a cosine similarity
measurement (positive convolution). Since the cosine similarity measurement uses
the product of two vectors, it is at its maximum when the two factors are the
same. It is much more discriminatory than the distance measurement. Distance
measurements are also called \emph{thematic} or \emph{integral}, whereas cosine
similarity measurements are called \emph{taxonomic} or \emph{separable}. Because
the latter exhibit highly taxonomic traits, questions of high semantics such as
``is this person old?'', which require a \emph{true} (1) or \emph{false} (0)
answer, fit the discriminatory properties of cosine similarity.

The relationship between distance and similarity measurements is described by
the \emph{Generalization} function. Whenever the distance is zero, the similarity
measurement is one. Conversely, similarity is at its lowest when the distance is
at its highest. The relationship in-between the extremes is nonlinear and
described by the function $g(d)=s=e^{-d}$, which means that even small increases
in distance disproportionately reduce similarity. Generalization allows us to
convert distance measurements to similarity measurements and vice-versa.
\begin{equation}
\label{eq:dpm}
\mathrm{dpm} = \alpha\cdot\vec{s} + (1-\alpha)\cdot
g(\vec{d})\quad\mathrm{with}\quad\alpha\in[0,1]
\end{equation}
Both cosine similarity and distance measurements can be combined to form
\emph{Dual Process Models of Similarity} (DPMs). One such example is given in
\eqref{eq:dpm} where both measurements are weighted and the distance
measurement is expressed as a similarity measure using the generalization
function. DPMs model humans' perception particularly well, but are not widely
used in the computer science domain.
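
A minimal Python sketch of such a dual process model, directly following
\eqref{eq:dpm}; the feature vectors and the value of $\alpha$ are made up for
illustration:
\begin{verbatim}
import numpy as np

def cosine_similarity(a, b):
    # positive convolution: maximal when the vectors align
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b)))

def generalization(d):
    # g(d) = e^{-d}: distance 0 -> similarity 1
    return float(np.exp(-d))

def dpm(a, b, alpha=0.5):
    # weighted sum of the taxonomic and thematic terms
    s = cosine_similarity(a, b)
    d = np.linalg.norm(a - b)   # negative convolution
    return alpha * s + (1 - alpha) * generalization(d)

x = np.array([0.8, 0.2])        # made-up feature vectors
y = np.array([0.6, 0.3])
print(dpm(x, y, alpha=0.7))
\end{verbatim}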
\section{Feature Engineering}
Contrary to popular opinion, the rise of deep learning methods in areas such as
object recognition has not superseded the classical approach of feature
engineering in other areas. Particularly in the audio domain and for motion
detection in videos for example, feature engineering is still the dominant
method. This is highlighted by the fact that classical methods require much less
processing, which can be beneficial or even crucial for certain applications
(e.g. edge computing).

Feature engineering is part of the pipeline which transforms input data into
classes and labels for that data. After modeling comes feature extraction so
that these features can be mapped in the feature space. After the classification
step, we end up with labels corresponding to the input data and the features we
want. In practice, feature engineering deals with analyzing input signals.
Common features one might be interested in during analysis are the loudness
(amplitude), rhythm or motion of a signal.

There are four main features of interest when analyzing visual data: color,
texture, shape and foreground versus background. Starting with color, the first
thing that springs to mind is to use the RGB color space to detect specific
colors. Depending on the application, this might not be the best choice due to
the three colors being represented by their \emph{pure} versions and different
hues of a color requiring a change of all three parameters (red, green and
blue). Other color spaces such as hue, saturation and value (HSV) are better
suited for color recognition, since we are usually only interested in the hue of
a color and can therefore better generalize the detection space. Another option
is the \emph{CIE XYZ} color space, which is applicable in situations
where adherence to how human vision works is beneficial. For broadcasting
applications, color is often encoded using \emph{YCrCb}, where \emph{Y}
represents the luma (brightness) and \emph{Cr} and \emph{Cb} represent the red
and blue color differences ($R-Y$ and $B-Y$), respectively. To find a dominant
color within an image, we can choose to only
look at certain sections of the frame, e.g. the center or the largest continuous
region of color. Another approach is to use a color histogram to count the
number of different hues within the frame.
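
As a sketch of the histogram approach, the following Python snippet, assuming an
RGB frame stored as a NumPy array with values in 0--255, converts the frame to
HSV and returns the center of the most frequent hue bin:
\begin{verbatim}
import numpy as np
from matplotlib.colors import rgb_to_hsv

def dominant_hue(rgb, bins=36):
    # hue, saturation, value in [0, 1]
    hsv = rgb_to_hsv(rgb / 255.0)
    hues = hsv[..., 0].ravel()
    hist, edges = np.histogram(hues, bins=bins,
                               range=(0.0, 1.0))
    peak = np.argmax(hist)          # most frequent hue bin
    return 0.5 * (edges[peak] + edges[peak + 1])

# random stand-in frame; load a real frame in practice
frame = np.random.randint(0, 256, size=(120, 160, 3))
print(dominant_hue(frame))
\end{verbatim}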

Recognizing objects by their texture can be approached with three different
methods. One approach is to look at the direction pixels are oriented towards to
get a measure of \emph{directionality}. Secondly, \emph{rhythm} allows us to
detect if a patch of information (micro block) is repeated in its neighborhood
through \emph{autocorrelation}. Autocorrelation takes one neighborhood and
compares it—usually using a generalized distance measure—to all other
neighborhoods. If the similarity exceeds a certain threshold, there is a high
probability that a rhythm exists. Third, coarseness can be detected by applying
a similar process, but by looking at different window sizes to determine if
there is any loss of information. If there is no loss of information in the
compressed (smaller) window, the image information is coarse.
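
A rough Python sketch of the rhythm case (the micro block size and the
similarity threshold are assumptions): the top-left micro block is compared to
every other block of the image, with the distance turned into a similarity via
$g(d)=e^{-d}$:
\begin{verbatim}
import numpy as np

def block_rhythm(gray, block=8, threshold=0.6):
    # fraction of micro blocks similar to the top-left block
    ref = gray[:block, :block].astype(float)
    hits, total = 0, 0
    for y in range(0, gray.shape[0] - block + 1, block):
        for x in range(0, gray.shape[1] - block + 1, block):
            if y == 0 and x == 0:
                continue
            cand = gray[y:y+block, x:x+block].astype(float)
            d = np.linalg.norm(ref - cand) / block**2
            if np.exp(-d) > threshold:   # generalization
                hits += 1
            total += 1
    return hits / max(total, 1)

# a perfectly repeating texture scores close to 1.0
tile = np.random.randint(0, 256, (8, 8))
print(block_rhythm(np.tile(tile, (8, 8))))
\end{verbatim}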

Shape detection can be realized using \emph{kernels} of different sizes and with
different values. An edge detection algorithm might use a Sobel matrix to
compare neighborhoods of an image. If the similarity is high, there is a high
probability of there being an edge in that neighborhood.
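
A small sketch of such a kernel-based edge detector, using the standard Sobel
kernels and SciPy's 2-D convolution; the synthetic test image is made up:
\begin{verbatim}
import numpy as np
from scipy.signal import convolve2d

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
SOBEL_Y = SOBEL_X.T

def edge_magnitude(gray):
    # gradient magnitude; large values hint at an edge
    gx = convolve2d(gray, SOBEL_X, mode="same",
                    boundary="symm")
    gy = convolve2d(gray, SOBEL_Y, mode="same",
                    boundary="symm")
    return np.hypot(gx, gy)

img = np.zeros((32, 32))    # vertical step edge at column 16
img[:, 16:] = 255.0
print(edge_magnitude(img)[:, 14:18].max())
\end{verbatim}
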
Foreground and background detection relies on the assumption that the coarseness
is on average higher for the background than for the foreground. This only makes
sense if videos have been properly recorded using depth of field so that the
background is much more blurred out than the foreground.

For audio feature extraction, three properties are of relevance: loudness,
fundamental frequency and rhythm. Specific audio sources have a distinct
loudness profile; classical music, for example, has a higher standard
deviation of loudness than metal. The fundamental frequency can be particularly
helpful in distinguishing speech from music by analyzing the \emph{zero
crossings rate} (ZCR). Speech has a lower ZCR than music, because there is a
limit on how fast humans can speak. Audio signals are often made up of
distinct patterns which are described by the attack, decay, sustain and
release (ADSR) model. This model is effective in rhythm detection.
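
A minimal Python sketch of the zero crossings rate on a mono signal; the frame
length and the synthetic test tones are assumptions:
\begin{verbatim}
import numpy as np

def zero_crossing_rate(signal, frame_len=1024):
    # fraction of sign changes per frame
    rates = []
    for s in range(0, len(signal) - frame_len, frame_len):
        frame = signal[s:s + frame_len]
        flips = (np.signbit(frame[:-1]) !=
                 np.signbit(frame[1:]))
        rates.append(np.sum(flips) / frame_len)
    return np.array(rates)

sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 150 * t)    # speech-like pitch
high = np.sin(2 * np.pi * 3000 * t)  # music-like content
print(zero_crossing_rate(low).mean(),
      zero_crossing_rate(high).mean())
\end{verbatim}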

Motion in videos is easily detected using cross-correlation between consecutive
frames. As in other domains, a similarity measure is calculated from the two
frames; if the similarity falls below a threshold, the frames differ and there
is movement. The similarity measurements can be aggregated to provide a robust
detection of camera movement.
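
A sketch of this idea in Python; the correlation threshold of 0.95 and the
synthetic frames are assumptions:
\begin{verbatim}
import numpy as np

def frame_correlation(prev, curr):
    # normalized cross-correlation of two grayscale frames
    a = prev.astype(float) - prev.mean()
    b = curr.astype(float) - curr.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom else 1.0

def has_motion(prev, curr, threshold=0.95):
    # low similarity between frames suggests movement
    return frame_correlation(prev, curr) < threshold

frame1 = np.random.randint(0, 256, (120, 160))
frame2 = np.roll(frame1, 5, axis=1)  # simulated camera pan
print(has_motion(frame1, frame2))
\end{verbatim}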
\section{Classification}
The setting for classification is described by taking a feature space and
clustering the samples within that feature space. The smaller and better
defined the clusters are, the better the classification works. At the same time
we want to have a high covariance between clusters so that different classes are easily
distinguishable. Classification is another filtering method which reduces the
input data—sometimes on the order of millions of dimensions—into simple
predicates, e.g. \emph{yes} or \emph{no} instances. The goal of classification
is therefore that semantic enrichment comes along with the filtering process.

The two fundamental methods used in classification are \emph{separation} and
\emph{hedging}. Separation tries to draw a line between different classes in the
feature space. Hedging, on the other hand, uses perimeters to cluster samples.
Additionally, the centroid of each cluster is calculated and the covariance
between two centroids acts as a measure of separation. Both methods can be
linked to \emph{concept theories} such as the \emph{classical} and
\emph{prototype} theory. While the classical theory classifies different things
based on their necessary and sufficient conditions, prototype theory uses
typical examples to come to a conclusion about a particular thing. The first can
be mapped to the fundamental method of separation in machine learning, whereas the
latter is mapped to the method of hedging. In the big picture, hedging is
remarkably similar to negative convolution, as discussed earlier. Separation,
on the other hand, has parallels with positive convolution.

If we take separation as an example, there are multiple ways to split classes
using a simple line. One could draw a straight line between two classes
without caring about individual samples, which are then misclassified. This
often results in so-called \emph{underfitting}, because the classifier is too
simple to capture the structure of the data and performs poorly even on data it
has already seen. Conversely, if the line includes too many individual samples
and is a function of high degree, the classifier is likely \emph{overfitting}.
Both underfitting and overfitting are common pitfalls to avoid, as the best
classifier lies somewhere in-between the two. To be able to properly train,
validate and test a classifier, the available data are split into three sets:
training, validation and test data.
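
A minimal sketch of such a split in Python; the 70/15/15 ratio is an assumption:
\begin{verbatim}
import numpy as np

def train_val_test_split(X, y, val=0.15, test=0.15, seed=0):
    # shuffle, then cut the data into three disjoint sets
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_te = int(len(X) * test)
    n_va = int(len(X) * val)
    te, va, tr = (idx[:n_te], idx[n_te:n_te + n_va],
                  idx[n_te + n_va:])
    return X[tr], y[tr], X[va], y[va], X[te], y[te]

X = np.random.rand(100, 2)         # made-up 2-D features
y = (X[:, 0] > 0.5).astype(int)    # made-up binary labels
splits = train_val_test_split(X, y)
print([len(part) for part in splits])
\end{verbatim}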

\emph{Unsupervised classification} or \emph{clustering} employs either a
bottom-up or a top-down approach. Regardless of the chosen method, unsupervised
classification works with unlabeled data. The goal is to construct a
\emph{dendrogram}, which consists of distance measurements between individual
samples and the centroids of groups of other samples. In the bottom-up approach
an individual
sample marks a leaf of the tree-like dendrogram and is connected through a
negative convolution measurement to neighboring samples. In the top-down
approach the dendrogram is not built from the leaves, but by starting from the
centroid of the entire feature space. Distance measurements to samples within
the feature space recursively construct the dendrogram until all samples are
included.
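
A sketch of the bottom-up (agglomerative) variant with SciPy; the two sample
blobs are made up, and \texttt{scipy.cluster.hierarchy.dendrogram} could be used
to draw the resulting tree:
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(0.0, 0.3, (10, 2)),
                     rng.normal(3.0, 0.3, (10, 2))])

# repeatedly merge the closest clusters (bottom-up)
links = linkage(samples, method="average",
                metric="euclidean")

# cut the dendrogram into two clusters
labels = fcluster(links, t=2, criterion="maxclust")
print(labels)
\end{verbatim}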

One method of \emph{supervised classification} is the \emph{vector space model}.
It is well-suited for finding items which are similar to a given item (the
query or hedge). Usually, a simple distance measurement such as the Euclidean
distance provides results which are good enough, especially for online shops
where there are millions of products on offer and a more sophisticated approach
is too costly.

Another method is \emph{k-nearest-neighbors}, which requires ground truth data.
Here, a new sample is classified by calculating the distance to all neighbors in
a given diameter. The new datum is added to the cluster which contains the
closest samples.
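
A minimal Python sketch of the nearest-neighbor idea with a majority vote over
the $k$ closest ground-truth samples; the toy data and $k=3$ are assumptions:
\begin{verbatim}
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # distances to all ground-truth samples
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]
    # majority vote among the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.9, 0.8])))
\end{verbatim}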

\emph{K-means} requires information about the centroids of the individual
clusters. Distance measurements to the centroids determine to which cluster a
new sample belongs.
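
A sketch of this assignment step, assuming the cluster centroids are already
known; the update step that recomputes the centroids from the assigned samples
is omitted:
\begin{verbatim}
import numpy as np

def assign_to_centroid(sample, centroids):
    # index of the closest centroid (Euclidean distance)
    dists = np.linalg.norm(centroids - sample, axis=1)
    return int(np.argmin(dists))

# made-up centroids of two clusters in a 2-D feature space
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
print(assign_to_centroid(np.array([1.9, 2.1]), centroids))
\end{verbatim}
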
\emph{Self-organizing maps} are similar to k-means, but with two changes. First,
all data outside of the area of interest is ignored. Second, after a winning
cluster is found, it is moved closer to the query object. The process is
repeated for all other clusters. This second variation on k-means constitutes
the first application of the concept of \emph{learning}.

\emph{Decision trees} divide the feature space into arbitrarily-sized regions.
Multiple regions define a particular class. In practice this method is highly
prone to overfitting, which is why many trees are combined to form a random
forest classifier.

\emph{Random forest classifiers} construct many decision trees on random subsets
of the data and aggregate their individual decisions. Such classifiers are also
called \emph{ensemble methods}.
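
A brief example with scikit-learn's \texttt{RandomForestClassifier}; the random
features and labels are made up:
\begin{verbatim}
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 3)                 # made-up features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # made-up labels

# an ensemble of 100 trees, each fit on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100,
                                max_depth=4, random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
\end{verbatim}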

\emph{Deep networks} started off as simple \emph{perceptrons}, which were unable
to solve the XOR problem. The conclusion was that there had to be additional
hidden layers and back propagation to adjust the weights of the layers. It
turned out that simply stacking hidden layers was ineffective too, because back
propagation would disproportionately adjust the later layers while the gradients
for the earlier layers vanished (\emph{vanishing gradients}). With
\emph{convolutional neural networks} (CNNs) all that changed,
because they combine automatic feature engineering with simple classification,
processing on the GPU and effective training.

The \emph{radial basis function} network is a simpler classifier which consists
of one input layer, one hidden layer and one output layer. In the hidden layer
we compare the input values to codebook vectors and apply a generalization of
the negative convolution (distance). In the output layer, the outputs of the
hidden layer are multiplied by the output weights. This results in the
aforementioned dual process model where negative convolution and positive
convolution are employed to form the output.
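
A sketch of the forward pass of such a network in Python, using the
generalization $g(d)=e^{-d}$ as the radial basis (a Gaussian is the more common
choice); the codebook vectors and output weights are made up:
\begin{verbatim}
import numpy as np

def rbf_forward(x, codebook, weights):
    # hidden layer: distance to each codebook vector,
    # turned into a similarity via g(d) = exp(-d)
    d = np.linalg.norm(codebook - x, axis=1)
    hidden = np.exp(-d)
    # output layer: weighted sum (positive convolution)
    return hidden @ weights

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
weights = np.array([[1.0, 0.0],    # 2 hidden units ->
                    [0.0, 1.0]])   # 2 output scores
scores = rbf_forward(np.array([0.9, 1.1]),
                     codebook, weights)
print(scores.argmax())             # closest prototype wins
\end{verbatim}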
\section{Evaluation 200 words}
\section{Perception and Psychophysics 600 words}
\section{Spectral Features 600 words}
\section{Semantic Modeling 200 words}
\section{Learning over Time 600 words}
\end{document}