
\chapter{Introduction}
\label{chap:introduction}

Internet traffic has grown at an unprecedented rate over the last few years, and
the growth is still accelerating. As a consequence, an increasing amount of user
data is sent over the Internet. This data is analyzed by companies in industries
such as social networking, advertising, Internet service provision and online
news. Although many online services appear to be free for
individual users, the companies behind them have to sustain themselves and make
profits every year. Firms therefore work extensively with user data to
extract meaningful information from the way users interact with their services.
The collected and inferred information can then be sold to interested parties,
which use it to personalize their own services, yielding higher customer
engagement and thus higher profits. End users, in contrast, give away their
data, often without being aware of it, and gain little in return. Because the
means of data collection on the Internet are
becoming increasingly invasive and omnipresent, tools to defend against such
privacy intrusions are being developed. Users benefit from knowing how web
sites track their visitors, because this knowledge allows them to protect
themselves against these tracking mechanisms. The aim of this thesis is to give
an overview of tracking methods and of the tools available to defend against
them. It seeks to answer the underlying research question: \emph{Which stateful
tracking methods are used to track individuals on the Internet and which
countermeasures exist?}

\section{Terms and Scope}
\label{sec:terms and scope}

This thesis focuses on web tracking as employed, for example, by advertising
companies. When users visit a web site that embeds third-party content from
advertisers, those advertisers collect bits of information about the user. These
bits of information are not yet associated with a particular user but with an
online identity, which is usually tied to a unique identifier. A unique
identifier is not meaningful by itself, because the same user might receive
multiple identifiers, each associated with a different set of information.
Tracking mechanisms are used to aggregate these separate pieces of information
into one profile that approximates a user's personality, needs and wants. In
many cases the goal is to persist a tracking identifier on the user's computer
for as long as possible and to avoid assigning multiple identifiers to the same
person.

The tracking mechanisms presented in this work store information on the user's
computer. In other words, they are \emph{stateful} mechanisms. Examples include
\gls{HTTP} cookies and various forms of caches. In contrast, \emph{stateless}
mechanisms do not store information on the user's computer but attempt to infer
information by reading the existing browser state. For example, a tracker that
knows which fonts are installed might infer that a user runs Windows rather
than Linux, or that they are visiting with a mobile browser rather than a
desktop one. This type of tracking is also called \emph{device fingerprinting}.
By combining enough such attributes into a fingerprint, trackers can uniquely
identify a user or device, provided that no other entity uses the Internet with
the same fingerprint. Stateless tracking mechanisms are not discussed in this
work; the focus is on stateful tracking mechanisms.
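
As a minimal illustration of a stateful mechanism, consider the following
simplified \gls{HTTP} exchange. The domain \texttt{tracker.example}, the
requested path and the identifier value are hypothetical and chosen purely for
illustration. In its first response, the server stores an identifier in the
browser:

\begin{verbatim}
HTTP/1.1 200 OK
Set-Cookie: uid=3f9a1c7e; Max-Age=31536000
\end{verbatim}

The browser returns the stored identifier with later requests to the same
domain, which allows the server to link these requests to one another:

\begin{verbatim}
GET /pixel.gif HTTP/1.1
Host: tracker.example
Cookie: uid=3f9a1c7e
\end{verbatim}

The \texttt{Max-Age} attribute in this sketch asks the browser to keep the
identifier for one year, in line with the goal of persisting identifiers for as
long as possible.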

\section{Methodology}
\label{sec:methodology}

This work gives an overview of tracking methods and defenses that have been
studied in the literature. To this end, a comprehensive review of the relevant
research is performed, with a focus on recent developments. Papers are collected
using digital libraries and search engines such as the \emph{ACM Digital
Library}, the \emph{IEEE Xplore Library} and \emph{Google Scholar}, as well as
\emph{arXiv.org} for selected works that are yet to appear in peer-reviewed
journals. Additionally, well-known journals and proceedings such as
\emph{Computers \& Security} and \emph{Proceedings on Privacy Enhancing
Technologies} are searched manually for relevant papers. The search terms
include, but are not limited to, \emph{Stateful Web Tracking}, \emph{Web
Tracking}, \emph{Tracking Measurement} and variants thereof. Furthermore,
separate queries are made for the names of particular tracking methods, for
instance \emph{Cookie Synchronization}
(section~\ref{subsec:cookie synchronization}).

\section{Structure of the Thesis}
\label{sec:structure of the thesis}

The thesis is divided into two major parts: chapter~\ref{chap:tracking methods}
is concerned with how web sites on the Internet track individuals and
chapter~\ref{chap:defenses against tracking} presents ways for users to defend
themselves against those tracking methods. Chapter~\ref{chap:tracking methods}
is split into three parts, each focusing on a group of related tracking
methods. The chapter on defenses against tracking first presents ways in which
users can limit tracking with existing browser features. Its second part
discusses specialized tools that each focus on one aspect of tracking and
summarizes research on the effectiveness of these tools.
Chapter~\ref{chap:conclusion} concludes the thesis.