Add Learning over Time section

Tobias Eidelpes 2021-10-24 13:19:53 +02:00
parent 266b60704e
commit a02c52219f

sim.tex

@@ -510,6 +510,74 @@ computation of the features. Features should be easily interpretable, which is
not always obvious when some transformation has been applied. If the features
generalize well to different situations regardless of context, they are robust.
\section{Learning over Time}
Classification can be broadly separated into two distinct areas: static and
dynamic. The former can also be described as \emph{micro} and the latter as
\emph{macro}, because the dynamic case iteratively applies the \emph{micro}
process. When dealing with static classification, we want to avoid overfitting,
minimize the loss function through backpropagation, and tune the
hyperparameters. Another important factor to consider is whether we are
interested in the outliers or in the typical values. Typical values are most
often described using the standard statistical moments, such as the mean and
the standard deviation. A downside of these summaries is that the outliers in
our data can be highly loaded with semantics, which the moments fail to
capture. \emph{Outlier detection} algorithms provide a good way to deal with
these higher-level semantics.
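
As a minimal illustration of the difference between typical values and
outliers, the following sketch flags samples that lie far from the mean in
units of the standard deviation. It is only a sketch: the z-score rule, the
threshold of three standard deviations and the use of \texttt{numpy} are
illustrative assumptions, not part of the original text.

\begin{verbatim}
import numpy as np

def flag_outliers(x, threshold=3.0):
    """Flag samples further than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()   # first two moments: mean and standard deviation
    return np.abs(z) > threshold   # True marks a potential outlier

samples = np.array([0.9, 1.0, 1.1, 0.95, 1.05, 1.0, 0.9, 1.1, 1.0, 1.05, 10.0])
print(flag_outliers(samples))      # only the last sample is flagged
\end{verbatim}
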
Dynamic classification deals mostly with a supervised learning approach where
ground truth data and a split are used. Over time dynamic classification
algorithms learn features from the data at a \emph{learning rate} $\alpha$ which
is between 2\% and 5\% per iteration. The learning rate does not stay constant
or linear throughout the process, but mirrors the learning process in humans,
which follows a sigmoid shape. The maximum reachable value of the learning
function is usually capped at around 95\%. This illustrates the amount of
learning that is possible from just looking at the typical values. Everything
beyond that requires a good understanding of outliers also, but comes with
diminishing returns as time goes on. Additionally, the learning function has to
be observed carefully to spot any overfitting after the learning rate is
beginning to stagnate.
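
One illustrative way to write down such a capped, sigmoid-shaped learning curve
(the symbols $L_{\max}$ and $t_0$ and the exact functional form are assumptions
for illustration, not taken from the text above) is
\[
  L(t) = \frac{L_{\max}}{1 + e^{-\alpha (t - t_0)}},
  \qquad L_{\max} \approx 0.95,
\]
where $t$ counts the iterations, $\alpha$ is the learning rate and $t_0$ marks
the iteration of steepest progress.
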
An important term when dealing with dynamic classification is \emph{structural
risk minimization}. Structural risk minimization models a curve which plots the
complexity of a classifier against its recognition rate. While the performance
of a more complex classifier increases on the training data, the risk
associated with it stays low only up to a certain point, after which it starts
to increase again due to potential overfitting. It is therefore not advisable
to build very complex classifiers, as these inherently bear more risk than a
classifier which minimizes complexity while still delivering the best possible
outcome.
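
As background, statistical learning theory formalizes this trade-off by
bounding the true risk by the empirical risk plus a complexity term that grows
with the capacity of the classifier. One common form of this bound (stated here
for orientation, not taken from the text above) holds with probability
$1 - \eta$:
\[
  R(h) \le R_{\mathrm{emp}}(h)
  + \sqrt{\frac{d \left( \ln \frac{2n}{d} + 1 \right) - \ln \frac{\eta}{4}}{n}},
\]
where $d$ is the VC dimension of the hypothesis class and $n$ the number of
training samples. Minimizing the right-hand side over hypothesis classes of
increasing complexity is exactly structural risk minimization.
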
\emph{Gaussian mixtures} follow a four-step process, called expectation
maximization, to learn clusters in the data. The overarching idea is to
represent the clusters by a set of weighted probability distributions. We can
then conveniently calculate the likelihood of each sample under each
distribution and arrive at the cluster to which the sample most likely belongs.
The first step is a random initialisation. In the second step, the expectation
step, the model variables are computed: for each sample in the feature space, a
weighted likelihood of that sample under every distribution in the mixture is
calculated. In the third step, also called the maximization step, we re-compute
the weight of each distribution over all the samples in the mixture. In the
fourth step, the second and third steps are iterated until convergence.
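
The sketch below shows the four steps in code, assuming \texttt{numpy} and
\texttt{scipy} are available. The function name, the fixed number of iterations
and the small regularization term added to the covariances are illustrative
choices and not part of the original description.

\begin{verbatim}
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 1: random initialisation of weights, means and covariances.
    weights = np.full(k, 1.0 / k)
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):                 # Step 4: iterate steps 2 and 3.
        # Step 2 (expectation): weighted likelihood of every sample under
        # every distribution, normalised to responsibilities.
        resp = np.column_stack([
            w * multivariate_normal.pdf(X, mean=m, cov=c)
            for w, m, c in zip(weights, means, covs)
        ])
        resp /= resp.sum(axis=1, keepdims=True)
        # Step 3 (maximization): re-estimate the weight, mean and covariance
        # of each distribution from all the samples in the mixture.
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        covs = np.array([
            (resp[:, j, None] * (X - means[j])).T @ (X - means[j]) / nk[j]
            + 1e-6 * np.eye(d)
            for j in range(k)
        ])
    return weights, means, covs, resp

# Each sample most likely belongs to the cluster with the highest
# responsibility: labels = resp.argmax(axis=1)
\end{verbatim}
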
\emph{Support vector machines} separate the samples in the data with a
hyperplane (when working with multi-dimensional data). The goal is to maximize
the margin between the hyperplane and the samples, so that a clear separation
is achieved. In practice, there may be many samples which violate the
separation introduced by the hyperplane. To deal with these, slack variables
influence the learning process by adding a penalty for each violation. Another
method is to map the samples onto an artificial dimension, the so-called
\emph{kernel dimension}, in which they become easier to separate; this implicit
mapping is called the \emph{kernel trick}. Support vector machines are nearly
ideal with respect to the structural risk minimization curve due to their
simplicity.
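
For reference, the usual soft-margin formulation with slack variables $\xi_i$
and penalty parameter $C$ (standard background, not quoted from the text above)
reads
\[
  \min_{w, b, \xi} \; \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{n} \xi_i
  \quad \textrm{subject to} \quad
  y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0,
\]
where minimizing $\| w \|$ maximizes the margin and the slack variables absorb
the samples that violate the separation at the cost of the penalty $C \xi_i$.
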
\emph{Bayesian networks \& Markov processes} model events in a tree-like
structure where one or multiple events lead to another. This requires knowing
the a-priori probability of an event happening and the conditional probability
that another event happens given that the a-priori event has happened.
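
For a small, hypothetical chain of events $A \rightarrow B \rightarrow C$, the
joint probability factorizes into the a-priori probability of the root and the
conditional probabilities along the edges:
\[
  P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid B).
\]
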
\emph{Recurrent networks} do not process the whole input at once, as is done in
CNNs, but gradually introduce the input into the process, one step at a time.
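
A common way to express this gradual processing (the standard textbook form,
not taken from the text above) is the recurrence
\[
  h_t = \sigma \left( W_x x_t + W_h h_{t-1} + b \right),
\]
where the hidden state $h_t$ summarizes everything the network has seen up to
step $t$.
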
\emph{Long short-term memory} (LSTM) networks work with multiple cells in each
layer, and each cell delivers two outputs: one is the memory and the other is
the input for the next layer. The memory is carried over from one cell to the
next and updated along the way.
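
In the common textbook notation (stated here as background), the two outputs
are the memory, i.e.\ the cell state $c_t$, and the hidden state $h_t$ that is
passed on as input to the next layer:
\[
  c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,
  \qquad
  h_t = o_t \odot \tanh(c_t),
\]
where $f_t$, $i_t$ and $o_t$ are the forget, input and output gates and $\odot$
denotes element-wise multiplication.
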
\end{document}