diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index 2885446..fed3565 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -701,9 +701,8 @@ process because the outputs do not provide valuable information.
 In contrast to the Heaviside step function
 (section~\ref{sssec:theory-heaviside}), it is differentiable which
 allows it to be used with gradient descent optimization
-algorithms. \todo[noline]{link to gradient descent and vanishing
-gradient sections} Unfortunately, the sigmoid function suffers from
-the vanishing gradient problem, which makes it unsuitable for training
+algorithms. Unfortunately, the sigmoid function exacerbates the
+vanishing gradient problem, which makes it unsuitable for training
 deep neural networks.
 
 \subsubsection{Rectified Linear Unit}
@@ -727,13 +726,12 @@ feature extractor. The \gls{relu} function is nearly linear, and it
 thus preserves many of the properties that make linear models easy to
 optimize with gradient-based methods \cite{goodfellow2016}. In
 contrast to the sigmoid activation function, the \gls{relu} function
-overcomes the vanishing gradient problem \todo{link to vanishing
-gradient problem} and is therefore suitable for training deep neural
-networks. Furthermore, the \gls{relu} function is easier to calculate
-than sigmoid functions which allows networks to be trained more
-quickly. Even though it is not differentiable at $0$, it is
-differentiable everywhere else and often used with gradient descent
-during optimization.
+partially mitigates the vanishing gradient problem and is therefore
+suitable for training deep neural networks. Furthermore, the
+\gls{relu} function is easier to calculate than sigmoid functions
+which allows networks to be trained more quickly. Even though it is
+not differentiable at $0$, it is differentiable everywhere else and
+often used with gradient descent during optimization.
 
 The \gls{relu} function suffers from the dying \gls{relu} problem,
 which can cause some neurons to become inactive. Large gradients,
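The vanishing-gradient and dying-ReLU claims in the reworded passages follow directly from the derivatives of the two activation functions; a minimal standalone LaTeX sketch of that comparison is given below. It is not part of the patch above, and it uses plain \operatorname notation rather than the thesis's \gls macros, whose definitions are assumed to live in the preamble.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Sigmoid and its derivative: the derivative never exceeds 1/4, so
% backpropagation multiplies in a factor of at most 1/4 per sigmoid
% layer (on top of the weight terms), which drives gradients towards
% zero in deep networks -- the vanishing gradient problem.
\begin{align}
  \sigma(x)  &= \frac{1}{1 + e^{-x}}, &
  \sigma'(x) &= \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}.
\end{align}
% ReLU and its derivative: the gradient is exactly 1 for positive
% inputs and therefore passes through layers unchanged; for negative
% inputs it is 0, which is what lets units "die" when their
% pre-activations stay negative. The derivative is undefined at x = 0;
% in practice a subgradient (usually 0 or 1) is used there.
\begin{align}
  \operatorname{ReLU}(x)  &= \max(0, x), &
  \operatorname{ReLU}'(x) &=
    \begin{cases}
      1 & \text{if } x > 0,\\
      0 & \text{if } x < 0.
    \end{cases}
\end{align}
\end{document}

The bound of 1/4 shrinks gradients geometrically with depth for sigmoid layers, while the unit derivative on the positive side is why the ReLU is said to mitigate, though not eliminate, the vanishing gradient problem; the zero branch is the source of the dying ReLU behaviour discussed in the last paragraph of the hunk.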