From a02c52219f4fa7dac2ca9c4f6c11de1a17418ca8 Mon Sep 17 00:00:00 2001
From: Tobias Eidelpes
Date: Sun, 24 Oct 2021 13:19:53 +0200
Subject: [PATCH] Add Learning over Time section

---
 sim.tex | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/sim.tex b/sim.tex
index c09750d..e6000dc 100644
--- a/sim.tex
+++ b/sim.tex
@@ -510,6 +510,74 @@
 computation of the features. Features should be easily interpretable, which is
 not always obvious when some transformation has been applied. If the features
 generalize well to different situations regardless of context, they are robust.
-\section{Learning over Time 600 words}
+\section{Learning over Time}
+
+Classification can be broadly separated into two distinct areas: static and
+dynamic. The former can also be described as \emph{micro} and the latter as
+\emph{macro}, because the dynamic case iteratively applies the \emph{micro}
+process. In static classification we want to avoid overfitting, minimize the
+loss function via backpropagation, and tune the hyperparameters. Another
+important factor to consider is whether we are interested in the outliers or
+in the typical values. Typical values are most often described using standard
+statistical moments such as the mean and the standard deviation. A downside of
+these measures is that the outliers in our data can carry a great deal of
+semantic information. \emph{Outlier detection} algorithms provide a good way
+to deal with these higher-level semantics.
+
+Dynamic classification mostly follows a supervised learning approach in which
+ground truth data and a train/test split are used. Over time, dynamic
+classification algorithms learn features from the data at a \emph{learning
+rate} $\alpha$ which lies between 2\% and 5\% per iteration. The learning rate
+does not stay constant, nor does learning progress linearly; instead, the
+learning curve mirrors the learning process in humans and follows a sigmoid
+shape. The maximum reachable value of the learning function is usually capped
+at around 95\%, which reflects how much can be learned from the typical values
+alone. Everything beyond that also requires a good understanding of the
+outliers, but comes with diminishing returns as time goes on. Additionally,
+the learning function has to be observed carefully to spot overfitting once
+the learning rate begins to stagnate.
+
+An important term in dynamic classification is \emph{structural risk
+minimization}. Structural risk minimization models a curve which plots the
+complexity of a classifier against its recognition rate. As the complexity of
+a classifier increases, its performance improves while the associated risk
+stays low up to a certain point; beyond that point, the risk rises again due
+to potential overfitting. It is therefore not advisable to build very complex
+classifiers, as these inherently bear more risk than a classifier which
+minimizes complexity while still delivering the best possible outcome.
+
+\emph{Gaussian mixtures} follow a four-step process, called expectation
+maximization, to learn clusters in the data. The overarching idea is to
+represent the clusters by a set of weighted probability distributions. We can
+then conveniently calculate the likelihood of each sample under each
+distribution and assign the sample to the cluster it most likely belongs to.
+The first step is a random initialization. In the second step, also called the
+expectation step, the model variables are computed: for each sample in the
+feature space, the weighted probability of that sample under each distribution
+in the mixture is calculated. In the third step, also called the maximization
+step, the weight of each distribution is re-estimated over all samples in the
+mixture. In the fourth step, the second and third steps are repeated until the
+estimates converge.
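+
+To make the expectation and maximization steps concrete, one iteration for a
+mixture of $K$ Gaussian distributions over $N$ samples can be sketched in
+standard notation. The symbols $\pi_k$, $\mu_k$, and $\Sigma_k$ are introduced
+here only for this illustration and denote the weight, mean, and covariance of
+the $k$-th distribution, while $\gamma_{ik}$ denotes its responsibility for
+sample $x_i$. The expectation step computes
+\[
+  \gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
+                     {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)},
+\]
+and the maximization step re-estimates the parameters of each distribution as
+\[
+  N_k = \sum_{i=1}^{N} \gamma_{ik}, \qquad
+  \pi_k = \frac{N_k}{N}, \qquad
+  \mu_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\, x_i,
+\]
+\[
+  \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{N} \gamma_{ik}\,
+             (x_i - \mu_k)(x_i - \mu_k)^\top.
+\]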
+
+\emph{Support vector machines} separate the samples in the data with a
+hyperplane (when working with multi-dimensional data). The goal is to maximize
+the margin between the hyperplane and the samples, so that a clear separation
+is achieved. In practice, there may be many samples which violate the
+separation introduced by the hyperplane. To deal with these, slack variables
+influence the learning process by adding a penalty. Another method is to map
+the samples into an additional, artificial dimension, the \emph{kernel
+dimension}; this process is called the \emph{kernel trick}. Due to their
+simplicity, support vector machines are nearly ideal with respect to the
+structural risk minimization curve.
+
+\emph{Bayesian networks \& Markov processes} model events in a tree-like
+structure in which one or more events lead to another. This requires knowing
+the a priori probability of an event and the conditional probability that a
+second event occurs given that the first one has happened.
+
+\emph{Recurrent networks} do not process the input all at once, as is done in
+CNNs, but introduce it into the network step by step.
+
+\emph{Long short-term memory} (LSTM) networks work with multiple cells in each
+layer, and each cell delivers two outputs: one is the memory and the other is
+the input for the next layer. The memory is carried over from one cell to the
+next.
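+
+As a sketch of a single cell in the standard LSTM formulation (the gate
+activations $f_t$, $i_t$, $o_t$, the candidate memory $\tilde{c}_t$, and the
+weights $W$ and $b$ are named here only for illustration), the two outputs are
+computed from the previous memory $c_{t-1}$, the previous output $h_{t-1}$,
+and the current input $x_t$ as
+\[
+  \begin{array}{rcl@{\qquad}rcl}
+    f_t &=& \sigma(W_f\,[h_{t-1}, x_t] + b_f), &
+    i_t &=& \sigma(W_i\,[h_{t-1}, x_t] + b_i), \\[3pt]
+    o_t &=& \sigma(W_o\,[h_{t-1}, x_t] + b_o), &
+    \tilde{c}_t &=& \tanh(W_c\,[h_{t-1}, x_t] + b_c), \\[3pt]
+    c_t &=& f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
+    h_t &=& o_t \odot \tanh(c_t),
+  \end{array}
+\]
+where $c_t$ is the memory that is carried to the next cell and $h_t$ is the
+output that is passed on to the next layer.
 
 \end{document}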