\documentclass[conference]{IEEEtran}
\usepackage{cite}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\usepackage{textcomp}
\usepackage{xcolor}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}

\title{Similarity Modeling 1/2 Abstracts}

\author{\IEEEauthorblockN{Tobias Eidelpes}
\IEEEauthorblockA{\textit{TU Wien}\\
Vienna, Austria \\
e1527193@student.tuwien.ac.at}
}

\maketitle

% \begin{abstract}
% \end{abstract}

% \begin{IEEEkeywords}
% component, formatting, style, styling, insert
% \end{IEEEkeywords}

\section{Setting}

To understand the term \emph{Similarity Modeling} and what it encompasses, it is important to first know how we as humans perceive and understand the stimuli we pick up. An illustrative example is the process of seeing (\emph{detecting}) a face, \emph{recognizing} it and deriving the emotion attached to it. These three steps sit on a figurative \emph{semantic ladder}: detecting a face is at the bottom and recognizing emotion at the top. Face detection thus carries relatively little semantic meaning, whereas recognizing emotion is a much more sophisticated process. Humans can carry out all three steps only because they have internal models of the faces they see, of whether they have seen them before and of which emotion is attached to a particular facial expression. These models are acquired from a young age through the process of learning.

Visual stimuli and models alone are not enough to conclude that a certain face appears similar or not. The process connecting stimuli and models is comparing the two, also called \emph{looking for similarities}. Together, modeling and looking for similarities can be summarized under the term \emph{Similarity Modeling}. The goal of Similarity Modeling is usually to find a \emph{class} for the object of interest. The flow of information thus starts with the stimulus, continues to the modeling part, where we derive a model of the stimulus, and ends, after finding similarities to existing knowledge, in a class or label. As mentioned previously, the existing knowledge is fed during the modeling process, which describes the feedback loop we call learning.

The difficult part lies in properly modeling the input stimulus. It is impossible to store every stimulus verbatim in our existing knowledge base, because saving every variation of a stimulus would amount to far too much data. Classification systems therefore need the modeling step to \emph{break down} the stimulus into small components which generalize well. The similarity part is largely the same across domains: once a proper model of a stimulus exists, checking for similarities in the preexisting knowledge base follows the same patterns, regardless of the type of stimulus.

Common problems that arise when engineers try to model and classify stimuli stem from the wide variety of input signals. Some signals are local and exhibit large, sudden increases or drops. Others are smooth, and their defining characteristic is the absence of sudden variations. Still other signals have recurring patterns (e.g. EEG) or none at all (e.g. stock prices). After detection, the most crucial problem remains: understanding semantics (also known as the \emph{semantic gap}).
The next problem is getting away from the individual samples in order to construct a model; this is known as the \emph{gravity of the sample}. Another problem is commonly referred to as the \emph{curse of dimensionality}: we end up with a huge parameter space and have to optimize those parameters to find good models. The last problem is bad data, which can mean missing, misleading or noisy data.

\section{Similarity Measurement}

The artificial process of measuring similarity in computers is shaped by the same rules and fundamentals which govern similarity measurements in humans. Understanding how similarity measurements work in humans is thus invaluable for any kind of measurement done using computers.

A concept which appears in both domains is the \emph{feature space}. An example is a feature space with two characteristics of humans, gender and age, whose relation to each other we want to explore. Gender exists on a continuum from male to female; age goes from young to old. Because we are only concerned with two characteristics, we have a \mbox{two-dimensional} feature space. In general, a feature space can be $n$-dimensional, where increasing values of $n$ result in increasing complexity. In our brains, processing of inputs happens in neurons which receive weighted signals through synapses. The neuron sums these signals and compares the result to a threshold. If the threshold is exceeded, the neuron fires and passes the information on along its axon. The weights constitute the dimensions of the feature space.

In computers we can populate the feature space with samples and then perform either a distance measurement (negative convolution) or a cosine similarity measurement (positive convolution). Since the cosine similarity measurement uses the product of two vectors, it is at its maximum when the two factors are the same. It is much more discriminatory than the distance measurement. Distance measurements are also called \emph{thematic} or \emph{integral}, whereas cosine similarity measurements are called \emph{taxonomic} or \emph{separable}. Because the latter exhibit highly taxonomic traits, questions of high semantics such as ``is this person old?'', which require a \emph{true} (1) or \emph{false} (0) answer, fit the discriminatory properties of cosine similarity.

The relationship between distance and similarity measurements is described by the \emph{generalization} function. Whenever the distance is zero, the similarity is one. Conversely, similarity is at its lowest when the distance is at its highest. The relationship in between the extremes is nonlinear and described by the function $g(d)=s=e^{-d}$, which means that even small increases in distance disproportionately affect similarity. Generalization allows us to convert distance measurements to similarity measurements and vice versa.

\begin{equation}
  \label{eq:dpm}
  \mathrm{dpm} = \alpha\cdot\vec{s} + (1-\alpha)\cdot g(\vec{d})\quad\mathrm{with}\quad\alpha\in[0,1]
\end{equation}

Both cosine similarity and distance measurements can be combined to form \emph{Dual Process Models of Similarity} (DPMs). One such example is given in \eqref{eq:dpm}, where both measurements are weighted and the distance measurement is expressed as a similarity measure using the generalization function. DPMs model human perception particularly well, but are not widely used in the computer science domain.
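Equation \eqref{eq:dpm} translates directly into code. The following NumPy sketch computes a cosine similarity (positive convolution) and a Euclidean distance (negative convolution) for two feature vectors and combines them; the example vectors and the weight $\alpha = 0.5$ are arbitrary choices for illustration.

\begin{verbatim}
import numpy as np

def generalization(d):
    # g(d) = exp(-d): turns a distance into a similarity
    return np.exp(-d)

def dual_process_similarity(x, y, alpha=0.5):
    # positive convolution: cosine similarity
    s = np.dot(x, y) / (np.linalg.norm(x)
                        * np.linalg.norm(y))
    # negative convolution: Euclidean distance,
    # converted to a similarity via generalization
    d = np.linalg.norm(x - y)
    return alpha * s + (1 - alpha) * generalization(d)

a = np.array([0.9, 0.2])   # sample in a 2-D feature space
b = np.array([0.8, 0.3])
print(dual_process_similarity(a, b))
\end{verbatim}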
\section{Feature Engineering}

Contrary to popular opinion, the rise of deep learning methods in areas such as object recognition has not superseded the classical approach of feature engineering in other areas. Particularly in the audio domain and for motion detection in videos, feature engineering is still the dominant method. This is highlighted by the fact that classical methods require much less processing, which can be beneficial or even crucial for certain applications (e.g. edge computing).

Feature engineering is part of the pipeline which transforms input data into classes and labels for that data. After modeling comes feature extraction, so that these features can be mapped into the feature space. After the classification step, we end up with labels corresponding to the input data and the features we want. In practice, feature engineering deals with analyzing input signals. Common features one might be interested in during analysis are the loudness (amplitude), rhythm or motion of a signal.

There are four main features of interest when analyzing visual data: color, texture, shape and foreground versus background. Starting with color, the first thing that springs to mind is to use the RGB color space to detect specific colors. Depending on the application, this might not be the best choice, because the three channels represent the \emph{pure} primaries and different hues of a color require a change of all three parameters (red, green and blue). Other color spaces such as hue, saturation and value (HSV) are better suited for color recognition, since we are usually only interested in the hue of a color and can therefore better generalize the detection space. Another option is the \emph{CIE XYZ} color space, which is applicable in situations where adherence to how human vision works is beneficial. For broadcasting applications, color is often encoded using \emph{YCrCb}, where \emph{Y} represents the luma and \emph{Cr} and \emph{Cb} represent the red-difference and blue-difference chroma components, respectively. To find a dominant color within an image, we can choose to only look at certain sections of the frame, e.g. the center or the largest continuous region of color. Another approach is to use a color histogram to count the number of different hues within the frame.

Recognizing objects by their texture can be divided into three different methods. One approach is to look at the direction pixels are oriented towards to get a measure of \emph{directionality}. Secondly, \emph{rhythm} allows us to detect whether a patch of information (micro block) is repeated in its neighborhood through \emph{autocorrelation}. Autocorrelation takes one neighborhood and compares it, usually using a generalized distance measure, to all other neighborhoods. If the similarity exceeds a certain threshold, there is a high probability that a rhythm exists. Third, coarseness can be detected by applying a similar process, but looking at different window sizes to determine whether there is any loss of information. If there is no loss of information in the compressed (smaller) window, the image information is coarse.

Shape detection can be realized using \emph{kernels} of different sizes and with different values. An edge detection algorithm might use a Sobel matrix to compare neighborhoods of an image. If the similarity is high, there is a high probability of an edge in that neighborhood.
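As a sketch of the shape feature, the following snippet correlates a grayscale image with the two Sobel kernels and thresholds the gradient magnitude. The random test image and the threshold value are placeholders; a real application would tune them.

\begin{verbatim}
import numpy as np

# Sobel kernels for horizontal and vertical gradients
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def correlate3x3(img, k):
    # naive sliding-window correlation of a grayscale
    # image with a 3x3 kernel (no padding)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for y in range(h - 2):
        for x in range(w - 2):
            out[y, x] = np.sum(img[y:y+3, x:x+3] * k)
    return out

def edge_magnitude(img):
    gx = correlate3x3(img, SOBEL_X)
    gy = correlate3x3(img, SOBEL_Y)
    return np.hypot(gx, gy)  # large values suggest an edge

img = np.random.rand(64, 64)       # stand-in for a frame
edges = edge_magnitude(img) > 1.0  # arbitrary threshold
\end{verbatim}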
Foreground and background detection relies on the assumption that the coarseness is on average higher for the background than for the foreground. This only makes sense if videos have been recorded with a shallow depth of field, so that the background is much more blurred than the foreground.

For audio feature extraction, three properties are of relevance: loudness, fundamental frequency and rhythm. Specific audio sources have a distinct loudness to them; classical music, for example, has a higher standard deviation of loudness than metal. The fundamental frequency can be particularly helpful in distinguishing speech from music by analyzing the \emph{zero crossing rate} (ZCR). Speech has a lower ZCR than music, because there is a limit on how fast humans can speak. Audio signals are often made up of distinct patterns which are described by the attack, decay, sustain and release (ADSR) model. This model is effective in rhythm detection.

Motion in videos is easily detected using cross-correlation between previous or subsequent frames. As with cross-correlation in other domains, a similarity measure is calculated from two frames and if the result exceeds a threshold, there is movement. The similarity measurements can be aggregated to provide a robust detection of camera movement.

\section{Classification}

The setting for classification is described by taking a feature space and clustering the samples within that feature space. The smaller and more well-defined the clusters are, the better the classification works. At the same time, we want a high covariance between clusters so that different classes are easily distinguishable. Classification is another filtering method which reduces the input data (sometimes on the order of millions of dimensions) to simple predicates, e.g. \emph{yes} or \emph{no} instances. The goal of classification is therefore that semantic enrichment comes along with the filtering process.

The two fundamental methods used in classification are \emph{separation} and \emph{hedging}. Separation tries to draw a line between different classes in the feature space. Hedging, on the other hand, uses perimeters to cluster samples. Additionally, the centroid of each cluster is calculated and the covariance between two centroids acts as a measure of separation. Both methods can be linked to \emph{concept theories} such as the \emph{classical} and the \emph{prototype} theory. While classical theory classifies things based on their necessary and sufficient conditions, prototype theory uses typical examples to come to a conclusion about a particular thing. The former can be mapped to the fundamental method of separation in machine learning, whereas the latter maps to the method of hedging. In the big picture, hedging is remarkably similar to negative convolution, as discussed earlier. Separation, on the other hand, has parallels with positive convolution.

If we take separation as an example, there are multiple ways to split classes using a simple line. One could draw a straight line between two classes without caring about individual samples, which are then misclassified. This often results in so-called \emph{underfitting}, because the classifier is too simple to capture the structure of the data. Conversely, if the line includes too many individual samples and is a function of high degree, the classifier is likely \emph{overfitting} and will not work well on a dataset it has not seen before. Both underfitting and overfitting are common pitfalls to avoid; the best classifier lies somewhere in between the two.
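This trade-off can be made concrete with a small fitting experiment: a low-degree polynomial underfits noisy data, while a very high degree fits the noise itself and does worse on held-out samples. The sketch below is illustrative only; the synthetic data, the polynomial degrees and the hold-out scheme are arbitrary choices.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, x.size)

train = np.ones(x.size, dtype=bool)
train[::3] = False     # hold out every third sample

for degree in (1, 3, 15):
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    e_train = np.mean((pred[train] - y[train]) ** 2)
    e_val = np.mean((pred[~train] - y[~train]) ** 2)
    # degree 1 underfits (both errors high); degree 15
    # tends to overfit (low training, higher validation
    # error)
    print(degree, round(e_train, 3), round(e_val, 3))
\end{verbatim}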
To properly train, test and validate a classifier, the available data are split into these three different sets.

\emph{Unsupervised classification} or \emph{clustering} employs either a bottom-up or a top-down approach. Regardless of the chosen method, unsupervised classification works with unlabeled data. The goal is to construct a \emph{dendrogram}, which consists of distance measures between samples and between samples and centroids. In the bottom-up approach, an individual sample marks a leaf of the tree-like dendrogram and is connected through a negative convolution measurement to neighboring samples. In the top-down approach, the dendrogram is not built from the leaves but by starting from the centroid of the entire feature space. Distance measurements to samples within the field recursively construct the dendrogram until all samples are included.

One method of \emph{supervised classification} is the \emph{vector space model}. It is well suited for finding items which are similar to a given item (the query or hedge). Usually, a simple distance measurement such as the Euclidean distance provides results which are good enough, especially for online shops where there are millions of products on offer and a more sophisticated approach is too costly. Another method is \emph{k-nearest neighbors}, which requires ground truth data. Here, a new sample is classified by calculating the distance to all neighbors within a given diameter. The new datum is added to the cluster which contains the closest samples. \emph{K-means} requires information about the centroids of the individual clusters. Distance measurements to the centroids determine to which cluster the new sample belongs. \emph{Self-organizing maps} are similar to k-means, but with two changes. First, all data outside of the area of interest are ignored. Second, after a winning cluster is found, it is moved closer to the query object. The process is repeated for all other clusters. This second variation on k-means constitutes the first application of the concept of \emph{learning}.

\emph{Decision trees} divide the feature space into arbitrarily sized regions, where multiple regions define a particular class. In practice this method is highly prone to overfitting, which is why many trees are combined to form a \emph{random forest} classifier. Random forest classifiers construct many decision trees and pick the best-performing ones. Such classifiers are also called \emph{ensemble methods}.

\emph{Deep networks} started off as simple \emph{perceptrons}, which were unable to solve the XOR problem. The conclusion was that there had to be additional hidden layers and backpropagation to adjust the weights of the layers. It turned out that deep stacks of hidden layers were difficult to train, because the backpropagated error shrinks from layer to layer and the early layers barely learn (\emph{vanishing gradients}). With \emph{convolutional neural networks} (CNNs) all that changed, because they combine automatic feature engineering with simple classification, processing on the GPU and effective training.

The \emph{radial basis function} network is a simpler classifier which consists of one input layer, one hidden layer and one output layer. In the first layer we compare the input values to codebook vectors and apply the generalization function to the resulting distances (negative convolution). In the second layer, the outputs from the first layer are multiplied by the weights of the hidden layer. This results in the aforementioned dual process model, where negative convolution and positive convolution are combined to form the output.
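A minimal NumPy sketch of such a radial basis function classifier is shown below. The codebook vectors, the weight matrix and the test input are made-up toy values; a real classifier would learn them from data.

\begin{verbatim}
import numpy as np

def rbf_classify(x, codebook, weights):
    # hidden layer: distance (negative convolution) of
    # the input to each codebook vector, mapped to a
    # similarity with the generalization g(d) = exp(-d)
    d = np.linalg.norm(codebook - x, axis=1)
    hidden = np.exp(-d)
    # output layer: weighted sum (positive convolution)
    # of the hidden activations
    return weights @ hidden

# toy setup: two codebook vectors, one output per class
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0]])
weights = np.array([[1.0, 0.0],   # class 0 uses unit 0
                    [0.0, 1.0]])  # class 1 uses unit 1
x = np.array([0.9, 0.8])
scores = rbf_classify(x, codebook, weights)
print(np.argmax(scores))          # -> class 1
\end{verbatim}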
\section{Evaluation}

An important, if not the most important, part of similarity modeling is evaluating the performance of classifiers. A straightforward way to do so is analyzing the \emph{confusion matrix}. A confusion matrix contains the output of the classifier on one axis and the ground truth on the other axis. If the classifier says something is relevant and the ground truth agrees, we have a true positive. The same applies to negatives where both agree; these are called true negatives. However, if the ground truth says something is irrelevant but the classifier says it is relevant, we have a false positive. Conversely, a false negative occurs when the classifier says something is irrelevant although it is actually relevant.

From the confusion matrix we can derive \emph{recall}, or the \emph{true positive rate}. It is calculated by dividing the true positives by the sum of the true positives and false negatives. If the ratio is close to one, the classifier recognizes almost everything relevant. Recall on its own is not always informative, because a classifier can recognize everything relevant while also having a high \emph{false positive rate}. The false positive rate is defined as the false positives divided by the sum of the false positives and the true negatives. A low false positive rate combined with a high recall is desirable. Third, \emph{precision} is, like the false positive rate, a measure of how polluted the result set is. It is defined as the true positives divided by the sum of the true positives and the false positives. An advantage of precision is that it can be calculated from the output of the classifier alone. Precision and recall are inversely correlated: a recall of one can always be achieved by classifying everything as relevant, but then precision collapses, and vice versa. All three measures can be visualized by the \emph{recall-precision graph} or the \emph{receiver operating characteristic} (ROC) curve. The latter plots the false positive rate on the x-axis against the true positive rate on the y-axis.
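These measures follow directly from the four cells of the confusion matrix. The sketch below computes recall, false positive rate and precision for a small made-up example.

\begin{verbatim}
import numpy as np

def evaluate(pred, truth):
    # pred, truth: boolean arrays (True = relevant)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    recall = tp / (tp + fn)       # true positive rate
    fpr = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    return recall, fpr, precision

truth = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=bool)
pred  = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=bool)
print(evaluate(pred, truth))  # roughly (0.67, 0.2, 0.67)
\end{verbatim}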
\section{Perception and Psychophysics}

Human perception happens in the brain, where we have approximately $10^{10}$ neurons and even more synapses ($10^{13}$). Neurons are connected to around 3\% of their neighbors, and new connections never cease to be built, unlike neurons, which are only created up to a young age. The perceptual load on our senses is around 1\,Gb/s. To deal with all these data, most of them are ignored and only reconstructed later in the brain if they are needed. The olfactory sense requires the largest number of cells, whereas the aural sense requires the fewest. One reason for the low number of cells for hearing is that the pre-processing in our ears is very sophisticated, so less processing is needed during the later stages. Vision also requires a lot of cells (on the order of $10^6$). Vision is handled in part by rods (for brightness) and cones (for color). The ratio of rods to cones is about 20 to 1, although it varies a lot from one human to the next.

Psychophysics is the study of physical stimuli ($\Phi$) and the sensations and perceptions they produce ($\Psi$). The relationship between the two is not linear but logarithmic and is described by the Weber-Fechner law \eqref{eq:wf-law}.

\begin{equation}
  \label{eq:wf-law}
  \Psi = c\cdot\log(\Phi) + a
\end{equation}

The Weber law \eqref{eq:w-law} states that the change in stimulus intensity required to produce a noticeable difference in perception is itself a function of the stimulus intensity: the stronger the stimulus, the larger the increase has to be to elicit a similar response.

\begin{equation}
  \label{eq:w-law}
  \Delta\Phi = f(\Phi)
\end{equation}

Later, Stanley Smith Stevens empirically developed Stevens' power law \eqref{eq:s-power-law}, whereby perception depends on a factor $c$ multiplied by the stimulus raised to the power of \emph{Stevens' exponent} $a$, plus a constant $b$.

\begin{equation}
  \label{eq:s-power-law}
  \Psi = c\cdot\Phi^{a} + b
\end{equation}

The eye detects incoming visual stimuli with the aforementioned rods and cones. Cones are further split into three different types to detect color. Blue cones fire upon receiving light around 420\,nm, whereas green cones react around 534\,nm and red cones around 564\,nm. These wavelengths only indicate where the visual system reacts most strongly. Furthermore, they are averages over a large population of humans and can differ for individuals. Green and red are perceptually very close, and it is postulated that the perception of red separated from the perception of green only recently in our evolution and is therefore still very close.

Visual information that enters the retina is first processed by the ganglion cells, which perform edge detection. They receive their information from \emph{bipolar cells}, which either pass along the signal or block it. The ganglion cells process multiple such signals in a neighborhood and detect the length and angle of edges. After edge detection, the signal is forwarded to the visual cortex, which performs object detection via the \emph{ventral pathway} and motion detection via the \emph{dorsal pathway}. Before the signal is forwarded to either of the pathways, the occipital cortex processes edge information, color blobs, texture and contours. The three-color stimulus is converted into a hue, saturation and value encoding. After that, motion and 3D information is processed. The flow of information is one of semantic enrichment, starting from edge detection and ending in motion detection. In the ventral pathway, an object is detected invariant to its type, size, position or occlusion. The dorsal pathway for motion detection has to deal with multiple degrees of freedom, because the eye moves on its own while the object may be moving as well.

The ear consists of the ear canal, which does some filtering, the eardrum, which amplifies, the ossicles and the cochlea. The cochlea is the most important part, because it translates air waves first into liquid waves and then into electrical signals which are transferred to the brain. It contains a \emph{staircase} on which there are hairs of different lengths. Depending on their length, they react to either high or low frequencies. Their movement within the liquid is then transformed into electrical signals through the tip links. The threshold of hearing on the lower end exists due to physical limits: if the stimulus reaching the hairs inside the ear is too small, they do not move noticeably. This lower threshold depends on the received frequency and is lowest at around 4\,kHz, where we hear best. High energies are needed to hear very low frequencies. The threshold on the higher end marks the point at which sounds become painful; it seeks to protect us from damaging our hearing.
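To make the difference between the logarithmic and the power-law relationship from this section concrete, the following sketch evaluates both laws for a range of stimulus intensities. The constants $c$, $a$ and $b$ are made up; in practice they depend on the sense and on the experimental setup.

\begin{verbatim}
import numpy as np

# made-up constants; real values depend on the sense
# and the experimental setup
c_wf, a_wf = 1.0, 0.0              # Weber-Fechner
c_st, expo, b_st = 1.0, 0.3, 0.0   # Stevens

phi = np.logspace(0, 4, 5)   # intensities 1 ... 10000
psi_wf = c_wf * np.log(phi) + a_wf
psi_st = c_st * phi ** expo + b_st

for p, wf, st in zip(phi, psi_wf, psi_st):
    print(f"{p:8.0f}  {wf:6.2f}  {st:6.2f}")
\end{verbatim}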
\section{Spectral Features}

Because analysis of audio in the time domain is very hard, especially for identifying overtones, we apply transforms to the original data to extract more information. One such transformation was used by Pierre-Simon Laplace in 1785 to turn problems requiring difficult operations into other problems which are solvable with simpler operations. The result of the easier calculation is then transformed back into the original problem domain. This first type of transformation takes a function and applies a kernel to it with positive convolution, resulting in a \emph{spectrum}. Applying the kernel to the spectrum yields the original function again (back transformation). The kernel which proved suitable for this operation is given in \eqref{eq:laplace-kernel}.

\begin{equation}
  \label{eq:laplace-kernel}
  K_{xy} = e^{-xy}
\end{equation}

In 1823, Fourier proposed to add an imaginary unit $i$ to the exponent instead of just having $-xy$. The kernel can then be rewritten as $\cos(xy) - i\cdot\sin(xy)$, which makes it possible to interpret the original function much more easily using simple angular functions. The Fourier transformation can be read as a similarity measurement: each coefficient measures the similarity of the input to one of a set of overlaid angular functions. The imaginary part of the Fourier transform can be dealt with by either discarding the $\sin$ part of the function or by computing the magnitude as the square root of the squared real part plus the squared imaginary part.

One property of the Fourier transform is that high frequencies are represented increasingly poorly the more coefficients are thrown away. Steep changes are spread out over many coefficients, and if only a small fraction of them is kept, the reconstruction degenerates to a basic sine wave. Another property, for image information, is that the most important parts of the image are located at the ends of the spectrum. There is hardly any information in the mid-range of the spectrum, from which it can be concluded that the bulk of the information in images lies in the edges. Since the middle part of a spectrum is usually smoothed by the extremes at the edges, window functions are used to represent the data we are interested in more accurately. They work by multiplying the signal element-wise with a window function before the transformation. Important ones are the triangular (Bartlett) window, the Hamming window, the Kaiser window and the simple rectangular window. This step is known as \emph{windowing}.

A third transformation is the \emph{discrete cosine transform} (DCT). It applies a kernel of the form $K_{xy} = \cos(xy+\frac{y}{2})$. Due to the uniform nature of the Fourier transform, a lot of image information is quickly lost when coefficients are thrown away. The cosine transformation, however, manages to retain much more relevant information, even when almost 90\% of the coefficients have been discarded. In contrast to the Fourier transform, which is uniform, it discriminates. Other wavelets of interest are the \emph{Mexican hat} and the \emph{Gabor} wavelet. In the area of optics, \emph{Zernike polynomials} are used to compare a measurement of a lens to \emph{ideal} optics modeled by a Zernike function. If a pre-defined threshold for the error is exceeded, the optics require a closer look during the quality assurance process.
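The interplay of windowing and the magnitude spectrum can be sketched with NumPy's fast Fourier transform. The toy signal, its sample rate and the choice of a Hamming window below are arbitrary; the peak of the magnitude spectrum recovers the fundamental frequency.

\begin{verbatim}
import numpy as np

fs = 8000                     # assumed sample rate in Hz
t = np.arange(0, 1, 1 / fs)
# toy signal: 440 Hz tone plus a weaker 880 Hz overtone
x = (np.sin(2 * np.pi * 440 * t)
     + 0.3 * np.sin(2 * np.pi * 880 * t))

# windowing: element-wise product with a Hamming
# window before transforming
w = np.hamming(x.size)
spectrum = np.fft.rfft(x * w)
magnitude = np.abs(spectrum)  # sqrt(re^2 + im^2)

freqs = np.fft.rfftfreq(x.size, 1 / fs)
print(freqs[np.argmax(magnitude)])   # approx. 440.0
\end{verbatim}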
While integral transforms allow the original signal to be reconstructed from the transformed spectrum, \emph{parametric transforms} do not provide that property. One such transformation is the \emph{Radon transform}, where an axis is defined and the luminance values along that axis are summed. The axis is then rotated and the process repeated until all angles have been traversed. The resulting spectrum is rotation-invariant, and the transformation is therefore a useful pre-processing step for feature engineering. The \emph{Hough transform} uses the gradients of an image to construct a histogram. The information presented in the histogram can be valuable for detecting regular patterns or long edges in an image.

Applications of the Fourier transform are identifying spectral peaks, tune recognition and timbre recognition. All of these use a form of \emph{short-time Fourier transform} (STFT). The FT can also be used for optical flow by shifting the image and recomputing the spectrum (\emph{phase correlation}). Both the FT and the CT are successfully used in music recognition and in speech recognition with the \emph{mel-frequency cepstral coefficients} (MFCCs). The CT is used in MPEG-7 for \emph{color histogram encoding} and for texture computation.

\section{Semantic Modeling}

\section{Learning over Time}

\end{document}