Add Learning over Time section
parent 266b60704e
commit a02c52219f

sim.tex | 70
@@ -510,6 +510,74 @@ computation of the features. Features should be easily interpretable, which is
not always obvious when some transformation has been applied. If the features
generalize well to different situations regardless of context, they are robust.

\section{Learning over Time}

Classification can be broadly separated into two distinct areas: static and
dynamic. The former can also be described as \emph{micro} and the latter as
\emph{macro}, because the dynamic case iteratively applies the \emph{micro}
process. In static classification we want to avoid overfitting and minimize the
loss function by adjusting the model parameters through backpropagation, while
the hyperparameters are tuned separately. Another important factor to consider
is whether we are interested in the outliers or in the typical values. Typical
values are most often described using the standard statistical moments such as
the mean and standard deviation. A downside of these summaries is that the
outliers in our data can carry a large share of the semantics. \emph{Outlier
detection} algorithms provide a good way to capture this additional information.
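
As a minimal sketch of the contrast between typical values and outliers, the
snippet below flags samples that lie further than a chosen number of standard
deviations from the mean; the threshold of two standard deviations and the
example data are assumptions made for illustration, not values from this
section.

\begin{verbatim}
import numpy as np

def zscore_outliers(samples, threshold=2.0):
    """Flag samples further than `threshold` standard deviations from the mean."""
    samples = np.asarray(samples, dtype=float)
    z = np.abs(samples - samples.mean()) / samples.std()
    return z > threshold

# Hypothetical data: mostly typical values plus one semantically loaded outlier.
data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0, 4.9, 25.0]
print(zscore_outliers(data))   # only the last sample is flagged
\end{verbatim}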

Dynamic classification mostly follows a supervised learning approach in which
ground-truth data and a train/test split are used. Over time, dynamic
classification algorithms learn features from the data at a \emph{learning rate}
$\alpha$ which lies between 2\% and 5\% per iteration. The learning rate does
not stay constant or linear throughout the process; it mirrors the learning
process in humans, which follows a sigmoid shape. The maximum reachable value of
the learning function is usually capped at around 95\%. This illustrates how
much can be learned from the typical values alone. Everything beyond that also
requires a good understanding of the outliers, and comes with diminishing
returns as time goes on. Additionally, the learning function has to be observed
carefully to spot any overfitting once the learning rate begins to stagnate.
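
One way to write down the learning curve described here is as a logistic
function that saturates at the 95\% cap; the steepness $k$ and the midpoint
$t_0$ are free parameters of this sketch rather than values given in the text:
\[
L(t) = \frac{0.95}{1 + e^{-k\,(t - t_0)}} .
\]
The curve rises slowly at first, fastest around $t_0$, and flattens out towards
the 95\% ceiling, which is exactly the region in which overfitting has to be
watched for.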

An important term when dealing with dynamic classification is \emph{structural
risk minimization}. Structural risk minimization models a curve which plots the
complexity of a classifier against its recognition rate. As a classifier becomes
more complex its performance increases, and the risk associated with it stays
low up to a certain point; beyond that point the risk starts to increase again
due to potential overfitting. It is therefore not advisable to build very
complex classifiers, as these inherently bear more risk than a classifier which
minimizes complexity while still delivering the best possible outcome.
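
To make this curve concrete, the sketch below sweeps the complexity of a
deliberately simple model (the degree of a polynomial fit, an assumption of this
example) and compares the training error with the error on held-out data; the
held-out error first decreases and then increases again once the model becomes
complex enough to overfit.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples from a smooth underlying function.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Simple split into training and held-out samples.
train, held = np.arange(0, 40, 2), np.arange(1, 40, 2)

for degree in range(1, 9):                  # increasing classifier complexity
    coeffs = np.polyfit(x[train], y[train], degree)
    err_train = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    err_held = np.mean((np.polyval(coeffs, x[held]) - y[held]) ** 2)
    print(f"degree {degree}  train {err_train:.3f}  held-out {err_held:.3f}")
\end{verbatim}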

\emph{Gaussian mixtures} follow a four-step process, called expectation
maximization, to learn clusters in the data. The overarching idea is to
represent the clusters by a set of weighted probability distributions. We can
then conveniently calculate the likelihood of each sample under each
distribution and arrive at the cluster the sample most likely belongs to. The
first step is a random initialisation. In the second step, the expectation step,
the model variables are computed: for each sample in the feature space we
calculate the responsibility that each distribution in the mixture takes for
that sample. In the third step, the maximization step, we re-compute the weight
and parameters of each distribution from all the samples in the mixture. In the
fourth step, the second and third steps are iterated until convergence.
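
A minimal sketch of these four steps for a one-dimensional mixture of two
Gaussians, using numpy, is shown below; the data, the number of components and
the fixed number of iterations are assumptions made for illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data drawn from two overlapping clusters.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.5, 300)])

# Step 1: random initialisation of weights, means and variances.
w = np.array([0.5, 0.5])
mu = rng.choice(x, size=2)
var = np.array([1.0, 1.0])

def gaussian(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

for _ in range(50):                         # step 4: iterate steps 2 and 3
    # Step 2 (expectation): responsibility of each distribution for each sample.
    dens = w[None, :] * gaussian(x[:, None], mu[None, :], var[None, :])
    resp = dens / dens.sum(axis=1, keepdims=True)

    # Step 3 (maximization): re-estimate weight, mean and variance per distribution.
    n_k = resp.sum(axis=0)
    w = n_k / x.size
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    var = (resp * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_k

labels = resp.argmax(axis=1)     # cluster each sample most likely belongs to
print(w, mu, var)
\end{verbatim}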

\emph{Support vector machines} separate the samples in the data with a
hyperplane (when working with multi-dimensional data). The goal is to maximize
the margin between the hyperplane and the samples, so that a clear separation is
achieved. In practice, there may be many samples which violate the separation
introduced by the hyperplane. To deal with these, slack variables influence the
learning process by adding a penalty. Another method is to map the samples into
an artificial dimension, called the \emph{kernel dimension}; this is known as
the \emph{kernel trick}. Support vector machines sit nearly ideally on the
structural risk minimization curve due to their simplicity.
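
Assuming scikit-learn is available, the penalty on slack variables corresponds
to the parameter \texttt{C} and the kernel trick to the \texttt{kernel} argument
of the \texttt{SVC} classifier; the toy samples below are made up for
illustration.

\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

# Hypothetical two-class samples that no single straight line separates cleanly.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1],
              [0.1, 1.0], [0.0, 0.9], [1.0, 0.0], [1.1, 0.1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# C controls the penalty for margin violations (slack variables);
# the RBF kernel applies the kernel trick instead of an explicit mapping.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print(clf.predict([[0.1, 0.1], [0.05, 0.95]]))
\end{verbatim}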

\emph{Bayesian networks \& Markov processes} model events in a tree-like
structure in which one or multiple events lead to another. This requires knowing
the probability of an event happening (the a priori probability) and the
probability that another event happens given that the a priori event has
happened.
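
With hypothetical numbers that are not taken from this text, suppose the a
priori event $A$ has $P(A) = 0.3$, and the dependent event $B$ has
$P(B \mid A) = 0.8$ and $P(B \mid \neg A) = 0.1$. Then
\[
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
     = 0.8 \cdot 0.3 + 0.1 \cdot 0.7 = 0.31,
\]
\[
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{0.24}{0.31} \approx 0.77,
\]
so observing $B$ raises the belief in the a priori event from 0.3 to about 0.77.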

\emph{Recurrent networks} do not process the whole input at the same time, as is
done in CNNs, but introduce the input into the process gradually, one element at
a time.
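
A minimal sketch of this gradual processing, using made-up dimensions and random
weights in place of a trained layer, feeds the sequence into the network one
element at a time and carries a hidden state between steps.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
input_dim, hidden_dim, seq_len = 4, 8, 5

# Random weights stand in for a trained recurrent layer.
W_x = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

sequence = rng.normal(size=(seq_len, input_dim))  # hypothetical input sequence
h = np.zeros(hidden_dim)                          # hidden state carried across steps

for x_t in sequence:                              # input is introduced step by step
    h = np.tanh(W_x @ x_t + W_h @ h + b)

print(h)   # final hidden state summarising the whole sequence
\end{verbatim}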

\emph{Long short-term memory} (LSTM) networks work with multiple cells in each
layer, and each cell delivers two outputs: one is the memory and the other is
the input for the next layer. The memory (cell state) is passed on from one cell
to the next.
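
The sketch below, again with made-up dimensions and random weights, shows a
single LSTM step producing the two outputs described here: the memory (cell
state) that is passed on to the next step, and the hidden state that serves as
the input for the next layer; biases are omitted for brevity.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
input_dim, hidden_dim = 4, 8

# One weight matrix per gate, acting on the concatenated [input, previous hidden].
W_i, W_f, W_o, W_g = (rng.normal(scale=0.1, size=(hidden_dim, input_dim + hidden_dim))
                      for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(W_i @ z)          # input gate
    f = sigmoid(W_f @ z)          # forget gate
    o = sigmoid(W_o @ z)          # output gate
    g = np.tanh(W_g @ z)          # candidate memory
    c = f * c_prev + i * g        # memory (cell state), passed on to the next step
    h = o * np.tanh(c)            # hidden state, the input for the next layer
    return h, c

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):       # hypothetical input sequence
    h, c = lstm_step(x_t, h, c)
print(h, c)
\end{verbatim}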
\end{document}