diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index a5ec7f6..7f0b407 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -1613,12 +1613,69 @@ top-1 accuracy of 75.2\% and 67.4\% respectively.
 \section{Transfer Learning}
 \label{sec:background-transfer-learning}
 
-Give a definition of transfer learning and explain how it is
-done. Compare fine-tuning just the last layers vs. propagating changes
-through the whole network. What are advantages to transfer learning?
-Are there any disadvantages?
+Transfer learning refers to the application of a learning algorithm
+to a target domain by utilizing knowledge already learned from a
+different source domain \cite{zhuang2021}. The representations
+learned in the source domain are thus \emph{transferred} to solve a
+related problem in another domain. Transfer learning works because
+semantically meaningful information an algorithm has learned from a
+(large) data set is often meaningful in other contexts as well, even
+though the new problem differs from the one for which the original
+model was trained. An analogy can be drawn to sports: skills acquired
+in soccer, such as ball control, endurance and strategic thinking,
+are often useful in other ball sports as well. Someone who is adept
+at one kind of sport will likely pick up similar ones much faster.
 
-Estimated 2 pages for this section.
+In mathematical terms, \textcite{pan2010} define transfer learning
+as:
+
+\begin{quote}
+  Given a source domain $\mathcal{D}_{S}$ and learning task
+  $\mathcal{T}_{S}$, a target domain $\mathcal{D}_{T}$ and learning
+  task $\mathcal{T}_{T}$, transfer learning aims to help improve the
+  learning of the target predictive function $f_{T}(\cdot)$ in
+  $\mathcal{D}_{T}$ using the knowledge in $\mathcal{D}_{S}$ and
+  $\mathcal{T}_{S}$, where $\mathcal{D}_{S}\neq\mathcal{D}_{T}$, or
+  $\mathcal{T}_{S}\neq\mathcal{T}_{T}$. \cite[p.~1347]{pan2010}
+\end{quote}
+
+Here, a domain $\mathcal{D}=\{\mathcal{X},P(X)\}$ consists of a
+feature space $\mathcal{X}$ and a marginal probability distribution
+$P(X)$, and a task $\mathcal{T}=\{\mathcal{Y},f(\cdot)\}$ consists of
+a label space $\mathcal{Y}$ and a predictive function $f(\cdot)$
+\cite{pan2010}.
+
+In the machine learning world, collecting and labeling data for
+training a model is often time-consuming, expensive and sometimes
+impossible. Deep learning-based models in particular require
+substantial amounts of data to robustly classify images or solve
+other tasks. Semi-supervised or unsupervised learning approaches (see
+section~\ref{sec:theory-ml}) can partially mitigate this problem, but
+accurate ground truth data usually remains a requirement. The
+publication of large labeled data sets, for example through the
+\glspl{ilsvrc}, provides a basis for pretraining from which models
+can be optimized for downstream tasks.
+
+Transfer learning is not a panacea, however. Care has to be taken to
+use only models which have been pretrained on a source domain whose
+feature space is similar to that of the target domain. While this may
+seem easy, it is often not known in advance whether transfer learning
+is the correct approach at all. Furthermore, choosing whether to
+replace only the fully-connected layers at the end of a pretrained
+model or to fine-tune all parameters introduces at least one
+additional hyperparameter. These decisions have to be made by
+comparing the source and target domains, considering how much data
+and computational budget are available for the target task, and
+observing which layers are responsible for which features. Since
+earlier layers usually contain low-level and later layers high-level
+information, resetting the weights of the last few layers or
+replacing them with different ones entirely is also an option.
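+
+To make this choice concrete, the following sketch contrasts pure
+feature extraction with full fine-tuning. It assumes PyTorch and
+torchvision as the framework and a target task with ten classes;
+both are illustrative placeholders rather than a prescription.
+
+\begin{verbatim}
+import torch.nn as nn
+import torchvision.models as models
+
+# Load a ResNet-50 pretrained on ImageNet (the source domain).
+model = models.resnet50(pretrained=True)
+
+# Feature extraction: freeze all pretrained weights so that only
+# the newly attached head is updated during training.
+for param in model.parameters():
+    param.requires_grad = False
+
+# Replace the final fully-connected layer with one matching the
+# number of classes in the (hypothetical) target domain.
+model.fc = nn.Linear(model.fc.in_features, 10)
+
+# Full fine-tuning instead: skip the freezing loop above and train
+# all parameters, typically with a lower learning rate so that the
+# pretrained features are adapted rather than overwritten.
+\end{verbatim}
+
+Freezing the backbone is computationally cheaper and less prone to
+overfitting on small target data sets, whereas full fine-tuning can
+reach higher accuracy when sufficient target data is available.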
+
+To summarize, while transfer learning is an effective tool and is
+likely a major factor in the proliferation of deep learning-based
+models, not all domains are suited for it. The additional decisions
+that come with transfer learning can introduce more complexity than
+would otherwise be necessary for a particular problem. It does,
+however, allow researchers to get started quickly and to iterate
+faster, because popular network architectures pretrained on ImageNet
+are integrated into the major machine learning frameworks. Transfer
+learning is used extensively in this work to train a classifier as
+well as an object detection model.
 
 \section{Hyperparameter Optimization}
 \label{sec:background-hypopt}