
\chapter{Defenses against Tracking}%
\label{chap:defenses against tracking}

The proliferation of tracking across the web has led to the development of a
myriad of tools that each have their own advantages and disadvantages. Some
tracking methods can be easily mitigated by changing browser settings or by
disabling certain technologies. More often than not, these methods not only stop
or limit tracking but also severely hamper the Internet experience for end
users. Some of the more advanced tools in particular require user input to know
which items to block and which to let through. This in turn requires expertise
that few regular Internet users possess, which further complicates the defense
against tracking. This chapter introduces methods and tools that have been
proven to be
effective against tracking on the web. It is split into two parts, with the
first surveying techniques that can be applied to limit tracking and the second
presenting tools to manage tracking on the web. The focus lies on defending
against the methods discussed in chapter~\ref{chap:tracking methods}.

\section{Techniques}
\label{sec:techniques}

The aim of this section is to present comparatively simple techniques that a
user can employ to limit tracking. The benefit of these methods is that they are
built into modern browsers, so users do not need to know how to install any
additional tools. Although their implementations vary from
one browser to another, the basic idea of the underlying functionality remains
the same.

\subsection{Opt-out and Opt-in}
\label{subsec:opt-out}

To opt out in the context of web tracking means to make use of the possibility
of turning off data collection by a web site. After the user has opted out of
either all data collection or only a subset of the data that a web site
collects, an opt-out cookie is set, indicating the user's preference. Whereas
opt-out means that data collection happens by default until the user objects,
opt-in requires that data collection is turned off by default until the user
agrees. In theory, both approaches allow users to have fine-grained control over
which aspects of their online presence they are comfortable with sharing
(depending on how web sites ask for consent). In practice, however, the
seemingly irrelevant difference between the two leads to very different
outcomes with respect to the number of users that are tracked.
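
The difference between the two defaults can be illustrated with a minimal Python
sketch. The cookie names \texttt{optout} and \texttt{consent}, the function
\texttt{may\_collect} and the two mode strings are hypothetical and chosen for
illustration only; actual web sites and consent frameworks use their own names
and formats.

```python
from http.cookies import SimpleCookie


def may_collect(cookie_header: str, mode: str) -> bool:
    """Decide whether data collection is permitted for one request.

    cookie_header: raw value of the HTTP Cookie header (may be empty).
    mode: "opt-out" (collect unless refused) or "opt-in" (collect only if agreed).
    The cookie names used below are hypothetical.
    """
    cookies = SimpleCookie()
    cookies.load(cookie_header)

    if mode == "opt-out":
        # Collection is on by default; only an explicit opt-out cookie stops it.
        refused = "optout" in cookies and cookies["optout"].value == "1"
        return not refused
    if mode == "opt-in":
        # Collection is off by default; only an explicit consent cookie enables it.
        return "consent" in cookies and cookies["consent"].value == "1"
    raise ValueError(f"unknown mode: {mode!r}")


# A first-time visitor without any cookies is tracked under opt-out only.
print(may_collect("", "opt-out"))  # True
print(may_collect("", "opt-in"))   # False
```

In this sketch, a user who never interacts with a consent dialog is tracked under
opt-out but not under opt-in, which is why the choice of default matters.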

For either opt-out or opt-in to work, a web site has to provide an option for
doing so. Because web sites increasingly use third parties to manage data
collection on their site, consent or rejection has to be passed to these third
parties and they have to be willing to accept such a decision. Since the
European Union's \gls{GDPR} \cite{europeanparliamentGeneralDataProtection2016}
became applicable in 2018, service providers operating in the European Union are
required to ask users for explicit consent before collecting any data, except
when that data is absolutely necessary to ensure basic functionality. Merely
notifying the user that continuing to visit the web site constitutes consent to
data collection is not permitted. Furthermore, if consent is not given, the web
site provider is not allowed to block the user from visiting the web site. Even
before the \gls{GDPR}, the EU required web sites to ask for informed consent via
the ePrivacy Directive, which came into force in 2013.
\citet{trevisanYearsEUCookie2019} use their tool \emph{CookieCheck} to evaluate
how many of the surveyed 35,000 sites comply with the legislation put forth in
the ePrivacy Directive. Their findings indicate that almost half (49\%) of the
web sites use profiling technologies without consent. Similarly,
\citet{sanchez-rolaCanOptOut2019a} show that, a year after the \gls{GDPR} took
effect, tracking is still prevalent and starts even before user consent is
given. \citet{huCharacterisingThirdParty2019} come to a similar
conclusion while only looking at third-party tracking: the number of cookies
stored on a user's computer has not changed significantly since before the
\gls{GDPR}. In yet another survey of the top 500 web sites as ranked by Alexa,
\citet{degelingWeValueYour2019} conclude that the amount of tracking before and
after the \gls{GDPR} stayed the same and only 37 sites ask for consent before
storing any cookies.

Provided that users are given a choice whether to share their personal
information and that web sites honor this choice, all of the methods discussed
in chapter~\ref{chap:tracking methods} can be defended against.

\subsection{Clearing Browser History}
\label{subsec:Clearing Browser History}

For our purposes, clearing the browser history means not only clearing the list
of web sites that have been visited but also cookies and other relevant data
that is saved with a visit to a web site. All major browsers offer this
functionality and what they delete is similar. Firefox, for example, allows
clearing the browsing and download history, form and search history, cookies
(including Flash cookies), the cache, active logins, offline web site data and
site preferences such as permissions, zoom level and character encodings. This
technique is only beneficial in the long term if users do it frequently to stop
any accumulation of tracking identifiers in caches, cookies or other site data.
The downside is that not having a history to go back to can hamper user
experience depending on the workflow of each user. Furthermore, opt-out or
opt-in preferences are deleted as well, making the technique in
section~\ref{subsec:opt-out} less effective.

Clearing the browser history is effective against some storage-based tracking
methods. Evercookie (section~\ref{subsec:evercookie}) and cookie synchronization
(section~\ref{subsec:cookie synchronization}) are designed to respawn items in
the browser history and therefore cannot be mitigated this way. Almost all
cache-based methods are also mitigated by frequently clearing the browser
history as long as users do not authenticate themselves with a web service. One
exception is the \gls{DNS} cache attack by
\citet{kleinDNSCacheBasedUser2019}, who demonstrate that it
works across history deletions. Session-based methods are not affected by
history clearing because they are intended to track a user for one session only.

\subsection{Private Browsing Mode}
\label{subsec:Private Browsing Mode}

The private browsing mode is a feature offered by all major browsers that intends
to improve privacy by not allowing access to storage areas within the browser.
Users associate it with an increase in privacy compared to normal or public
mode. Unfortunately, implementations of the private browsing mode are
inconsistent across browsers and what is deemed worthy of protection is largely
up to browser vendors. \citet[p.~440]{xuUCognitoPrivateBrowsing2015} provide a
comprehensive overview of browsers and their private browsing mode practices.
Most notably, Safari allows access to earlier cookies, history and HTML5 storage
while other browsers disallow it. Table~\ref{tab:private browsing mode} lists,
for each major browser, whether the tracking methods from
chapter~\ref{chap:tracking methods} remain possible in private browsing mode.

\begin{sidewaystable}
    \caption{Private browsing mode for major browsers}
    \label{tab:private browsing mode}
\centering
\begin{tabular}{|l|l|c|c|c|c|}
\hline
\multicolumn{1}{|c|}{\textbf{Section}}                                     & \multicolumn{1}{c|}{\textbf{Tracking Method}}  & \multicolumn{4}{c|}{ \textbf{Tracking in Private Browsing Mode}}       \\
\hline
\multicolumn{2}{|l|}{}                                                                                                      & \textbf{Safari} & \textbf{Firefox} & \textbf{Chrome} & \textbf{IE}     \\
\hline
\multicolumn{6}{|l|}{\textbf{Session-based} }                                                                                                                                                        \\
\hline
\ref{subsec:passing information in urls}                  & Passing Information in URLs                    & NA  & NA   & NA  & NA  \\
\hline
\ref{subsec:hidden form fields}                           & Hidden Form Fields                             & NA  & NA   & NA  & NA  \\
\hline
\ref{subsec:http referer}                                 & HTTP Referer                                   & NA  & NA   & NA  & NA  \\
\hline
\ref{subsec:explicit authentication}                      & Explicit Authentication                        & NA  & NA   & NA  & NA  \\
\hline
\ref{subsec:window.name dom property}                     & window.name DOM property                       & NA  & NA   & NA  & NA  \\
\hline
\multicolumn{6}{|l|}{\textbf{Storage-based} }                                                                                                                                                        \\
\hline
\ref{subsec:http cookies}                                 & HTTP cookies                                   & Yes             & No               & No              & No              \\
\hline
\ref{subsec:flash cookies and java jnlp persistenceservice} & Flash Cookies and Java JNLP PersistenceService & Yes             & Yes              & Yes             & Yes             \\
\hline
\ref{subsec:evercookie}                                   & Evercookie                                     & Yes             & No               & No              & No              \\
\hline
\ref{subsec:cookie synchronization}                       & Cookie Synchronization                         & Yes             & Yes              & Yes             & Yes             \\
\hline
\ref{subsec:silverlight isolated storage}                 & Silverlight Isolated Storage                   & Yes             & No               & No              & No              \\
\hline
\ref{subsec:html5 web storage}                            & HTML5 Web Storage                              & Yes             & No               & No              & No              \\
\hline
\ref{subsec:html5 indexed database api}                   & HTML5 Indexed Database API                     & Yes             & No               & No              & No              \\
\hline
\ref{subsec:web sql database}                             & Web SQL Database                               & Yes             & No               & No              & No              \\
\hline
\multicolumn{6}{|l|}{\textbf{Cache-based} }                                                                                                                                                          \\
\hline
\ref{subsec:web cache}                                    & Web Cache                                      & Yes             & No               & No              & No              \\
\hline
\ref{subsec:cache timing}                                 & Cache Timing                                   & Yes             & No               & No              & No              \\
\hline
\ref{subsec:cache control directives}                     & Cache Control Directives                       & Yes             & No               & No              & No              \\
\hline
\ref{subsec:dns cache}                                    & DNS Cache                                      & Yes             & Yes              & Yes             & Yes             \\
\hline
\ref{subsec:tls session resumption}                       & TLS Session Resumption                         & Yes             & No               & No              & No              \\
\hline
\end{tabular}
\end{sidewaystable}

\subsection{Do Not Track}
\label{subsec:Do Not Track}

\gls{DNT} \cite{w3cTrackingPreferenceExpression2019} is an \gls{HTTP} header
field that browsers can send with each request to indicate that the user
prefers to not be tracked or prefers to allow tracking. All major browsers have
implemented it and offer the user the possibility of sending the header with
every request. Since its inception in 2011, adoption by trackers has been so
slow that \gls{DNT} is now considered to be deprecated and development of the
standard has halted. Originally, it was intended to be the main way of
opting out of tracking but without tracker compliance, it slowly faded into
obscurity.
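
How a cooperating server could interpret the header can be sketched in a few
lines of Python. In the specification cited above, a field value beginning with
\texttt{1} expresses the preference not to be tracked and a value beginning with
\texttt{0} the preference to allow tracking; an absent header means that no
preference was expressed. The function name \texttt{tracking\_preference} and the
representation of request headers as a plain mapping are assumptions made for
this illustration.

```python
from typing import Mapping, Optional


def tracking_preference(headers: Mapping[str, str]) -> Optional[bool]:
    """Interpret the DNT request header.

    Returns True if the user allows tracking, False if the user prefers not
    to be tracked, and None if no (valid) preference was sent.
    """
    # HTTP header field names are case-insensitive.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    value = normalized.get("dnt")
    if value is None:
        return None
    if value.startswith("1"):
        return False  # "DNT: 1": do not track
    if value.startswith("0"):
        return True   # "DNT: 0": tracking allowed
    return None       # anything else is treated as unset in this sketch


print(tracking_preference({"DNT": "1"}))            # False
print(tracking_preference({"User-Agent": "test"}))  # None
```

Nothing forces a server to call such a function or to act on its result, which
is exactly the voluntary nature that has limited the effect of \gls{DNT}.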

Due to its voluntary nature and slow to no adoption, \gls{DNT} does not provide
any protection against any of the tracking methods discussed in
chapter~\ref{chap:tracking methods} in practice. Indeed,
\citet{englehardtCookiesThatGive2015} show that the \gls{DNT} header field does
not influence the level of tracking a user experiences at all. For \gls{DNT} to
be effective, the ad-scape would have to change in a way that users see
advertisements as a necessary factor in keeping the Internet `free' and trackers
respect a user's choice to not want to be tracked.

\subsection{Privacy-focused Search Engines}
\label{subsec:Privacy-focused Search Engines}

Using privacy-focused search engines is often the first step in protecting a
user's privacy. Search is a cornerstone of the Internet and thus almost every
user searches for something upon opening the browser. With every search request,
the search engine can infer information about the user which gets added to a
profile. This profile is then used to enable personalized search results. Users
trying to protect their privacy by using search engines other than the default
ones (Google, Bing, Yahoo, Baidu, \dots) might find themselves in a dilemma.
Personalized search usually provides more relevant results overall and
switching to a privacy-focused search engine, which usually has a smaller market
share, might lead to less relevant results. With Google having a market share of
almost 92\% as of June 2020 \cite{statcounterSearchEngineMarket}, users may find
that Google's search results are better than everyone else's, making a switch to
other search engines particularly difficult. Despite the market dominance of
Google, smaller, privacy-focused search engines such as DuckDuckGo
\cite{DuckDuckGoa} and Startpage \cite{StartpageCom} exist. Although those
search engines claim to not collect any personal information, these claims
cannot be verified easily and thus users have to trust them. Other open source
solutions such as searx \cite{tauberAsciimooSearx2020} can be self-hosted by
users with enough expertise and therefore eliminate the need to trust big search
engine providers. Like other metasearch engines, searx does not crawl the
Internet on its own but aggregates results from different search engines.

The benefit of using privacy-focused search engines is that they obfuscate the
\gls{HTTP} Referer field (see section~\ref{subsec:http referer}) so that the
search query is not forwarded to the linked web site. Additionally, they often
abstain from showing adverts on result pages, protecting user data from third
parties that seek to monetize it.

\section{Tools}
\label{sec:tools}

This section focuses on external tools that can either be installed as a plugin
within the browser or as a standalone program. Specific user knowledge is only
necessary in some cases when users want to have fine-grained control over their
data sharing preferences.

\subsection{Blacklists}
\label{subsec:blacklists}

Blacklists are a central component of tracking protection on the Web. They block
requests to web sites that are on the blacklist because they are known to
track users. Only third-party requests are blocked by blacklists because
blocking first parties would result in those web sites not being accessible at
all. Blacklists usually start out as small lists of manually selected web sites.
Over time and as their user base grows, more and more web sites are added,
resulting in a good first defense against tracking on moderately popular web
sites. The effectiveness of \glspl{TPL} depends on how quickly new domains
belonging to trackers are added to the list and when old, supposedly inactive,
domains are removed again. Furthermore, modern browser plugins aggregate
multiple, independently maintained blocklists into one big blacklist, improving
the overall detection rate. Since some lists are aimed at blocking, for example,
cryptocurrency mining applications on web sites and others at regular
third-party requests, knowledgeable users can customize their blocking
preferences by only
including those lists that they deem necessary. A well-known list used by
popular browser plugins such as Adblock Plus \cite{Adblock} and uBlock Origin
\cite{hillGorhillUBlock2020} is EasyList \cite{EasyList}. This list is used as a
basis and additional lists are added by both browser plugins.
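
The basic matching step of such a blacklist can be sketched in Python as
follows. The domains in \texttt{BLOCKLIST} and all function names are
hypothetical. The sketch only matches host names, whereas lists such as EasyList
use a richer filter syntax, and it decides whether a request is third-party by
comparing the last two labels of the host names, which is a simplification
(a real implementation would consult the Public Suffix List).

```python
from typing import AbstractSet
from urllib.parse import urlsplit

# Hypothetical entries for illustration only.
BLOCKLIST = frozenset({"tracker.example", "ads.example"})


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def matches_blocklist(host: str, blocklist: AbstractSet[str]) -> bool:
    """True if the host is a listed domain or a subdomain of one."""
    labels = host.split(".")
    return any(".".join(labels[i:]) in blocklist for i in range(len(labels)))


def is_third_party(page_url: str, request_url: str) -> bool:
    """Simplified check: compare the last two labels of both host names."""
    def site(host: str) -> str:
        return ".".join(host.split(".")[-2:])

    return site(host_of(page_url)) != site(host_of(request_url))


def should_block(page_url: str, request_url: str,
                 blocklist: AbstractSet[str] = BLOCKLIST) -> bool:
    """Block only third-party requests whose host is on the blocklist."""
    return (is_third_party(page_url, request_url)
            and matches_blocklist(host_of(request_url), blocklist))


page = "https://news.example.org/article"
print(should_block(page, "https://cdn.tracker.example/pixel.gif"))  # True
print(should_block(page, "https://static.example.net/app.js"))      # False
```

A request from \texttt{tracker.example} to its own subdomain is not blocked in
this sketch because it is a first-party request.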

\citet{merzdovnikBlockMeIf2017} provide an evaluation of different browser
plugins (Adblock Plus, Disconnect, Ghostery, Privacy Badger and uBlock Origin)
and their tracking protection capabilities. They identify three approaches to
curating rulesets that are then used by these plugins. Adblock Plus and uBlock
Origin rely on EasyList and its additional subscriptions which are
\emph{community-driven}. Here, the community maintains the blocklists and
updates are monitored through a public repository. Ghostery and Disconnect use
blocklists that are curated by a \emph{centralized} entity such as a company. In
Ghostery's case, the centralized entity is Cliqz GmbH. Centralized entities
raise the question of how they fund themselves, especially when the
application they develop has been released to the open source community. The
third approach works by curating blocklists \emph{algorithmically}. Privacy
Badger, developed by the \gls{EFF}, does not maintain a regularly updated
blocklist but instead relies on heuristics to detect third party tracking.

In their survey of 120,000 web sites, \citet{merzdovnikBlockMeIf2017} find that
Adblock Plus, the most popular choice, blocks the fewest requests to
third parties. Additionally, their results indicate that centralized blocklists
are more effective than community-driven ones in reducing the number of requests
to third parties. Algorithmic approaches such as Privacy Badger lead to a
comparatively high number of web site timeouts. Furthermore, Privacy Badger does
not perform well on analytics.

In general, using blacklists can be very effective against every form of
tracking that relies on third party requests. As soon as a first party performs
the same tracking that the third party does, blacklists do not provide any
protection.

\subsection{Tor}
\label{subsec:tor}

Tor \cite{TorProject} is an open source project aimed at providing secure, anonymous
communication. Its main component is a network of community-hosted relays that
route traffic through several nodes to the destination. The second component is
the \emph{Tor Browser}, which is a modified version of the Firefox browser and
is preconfigured for access to the Tor network. The name stems from the original
acronym TOR which is short for \emph{The Onion Router}. The non-profit
organization \emph{The Tor Project} is the main entity behind the software.

Before a request to a server is tunneled through the Tor network, it is
encrypted in multiple layers using symmetric cryptography, to avoid revealing
the contents to intermediaries. Only the last node (exit node) is able to view
the contents, provided that the original request was not encrypted with
\gls{TLS} or otherwise before handing it over to the Tor network. The route
within the network is selected based on multiple parameters such as bandwidth
requirements, state of the network and weights given to individual relays
\cite{dingledineTorProtocolSpecifications}. Additionally, the exit relay is
changed periodically to limit user profiling based on \gls{IP} addresses.

The Tor browser is of main interest for users wanting to enhance their privacy
online. By default, the browser history is not kept and cookies are cleared
either upon exit or when requesting a new identity. The user can choose between
three security modes: \emph{Standard}, \emph{Safer} and \emph{Safest}. The Safer
mode disables JavaScript on web sites that are not using \gls{HTTPS}, disables
some fonts to avoid fingerprinting based on the installed fonts and makes WebGL
and other media click-to-play only, i.e., they do not run without explicit user
consent. The Safest mode has the same security features as the Safer mode but
additionally disables JavaScript, loading of remote fonts and SVG images on all
web sites. The full list of changes to the Firefox browser and the rationale
behind them can be found in the Tor browser design specification
\cite{perryDesignImplementationTor2018}.

The Tor browser is the most promising technology for protecting oneself against
the tracking methods in chapter~\ref{chap:tracking methods}. Among the
session-based methods, passing information in \glspl{URL} is still possible
because the Tor browser does not look at individual requests and does not strip
them of any tracking identifiers. Users can still be tracked by a first party
using hidden form fields. The \gls{HTTP} Referer field is purposefully not
cleared because too many web sites depend on it to function properly. One of the
most severe mistakes a user can make when using the Tor browser is to
authenticate him- or herself to a web site, because then every action is tied to
the user account. The browser successfully defends the user against tracking via
the window.name \gls{DOM} property because it is reset every time a new
\gls{URL} is requested or a change from \gls{HTTP} to \gls{HTTPS} or vice versa
happens.

Among the storage-based methods, \gls{HTTP} cookies are deleted after every
session and the user has the option to disable even first-party cookies. Flash
and Java Applets are disabled by default. Depending on the settings, users are
safe from cookie synchronization. Since Silverlight is another plugin, it is
disabled by default and therefore no tracking is possible. HTML5 web storage and
IndexedDB are both disabled by default. Web SQL database is not supported by
Firefox and thus not supported by the Tor browser.

Among the cache-based methods, the CacheStorage \gls{API} is disabled by default
and probing a user's browser history is not possible using JavaScript if
JavaScript has been disabled (Safer or Safest browsing mode). Caching itself is
allowed but users can regularly use the \emph{New Identity} feature, which
clears all caches. Disabling caching within the browser is a possibility but
might result in a considerable impact on performance while browsing. To avoid
tracking via cache timing, timing resources within the browser are disabled and
the accuracy of timing functions is limited to a resolution of 100\,ms. Tracking
via \glspl{ETag} is possible if caching is enabled. For defending against
\gls{DNS} cache tracking by \citet{kleinDNSCacheBasedUser2019}, the Tor network
uses one \gls{DNS} resolver for multiple identities and identifying a single
user is therefore difficult. \gls{TLS} session resumption is mitigated by
disabling \gls{TLS} session tickets. This happens by default within the Tor
browser. Additionally, they are limited to the current \gls{URL} bar domain.
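
Why a coarse timer hinders cache timing can be illustrated with a short Python
sketch. Rounding timestamps down to a multiple of the resolution is an
assumption made for this illustration and not a description of how the Tor
browser implements the limit. The function names and the example load times
(5\,ms for a cached and 70\,ms for an uncached resource) are likewise invented.

```python
import math


def clamp_timestamp(t_ms: float, resolution_ms: float = 100.0) -> float:
    """Round a timestamp down to a multiple of the timer resolution."""
    return math.floor(t_ms / resolution_ms) * resolution_ms


def measured_duration(start_ms: float, end_ms: float,
                      resolution_ms: float = 100.0) -> float:
    """Duration as seen by a script that only has access to clamped timestamps."""
    return (clamp_timestamp(end_ms, resolution_ms)
            - clamp_timestamp(start_ms, resolution_ms))


start = 1210.0
cache_hit_end = start + 5.0    # invented: resource served from the cache
cache_miss_end = start + 70.0  # invented: resource fetched over the network

print(measured_duration(start, cache_hit_end))   # 0.0
print(measured_duration(start, cache_miss_end))  # 0.0
```

In this example both measurements yield the same value, so the script cannot
tell a cache hit from a cache miss. Measurements that cross a multiple of the
resolution still differ, so a coarse timer makes the attack harder rather than
impossible.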

\subsection{Virtual Private Network}
\label{subsec:virtual private network}

\glspl{VPN} are known for increasing privacy and anonymity by tunneling the
traffic through a \gls{VPN} provider's network. One side effect of this
tunneling is that the original requesting \gls{IP} address is masked from
potentially malicious web site owners. \gls{VPN} providers additionally require
communication to be encrypted with \gls{TLS} before it is sent to their servers.
Messages encrypted with \gls{TLS} are therefore safe from prying eyes seeking to
intercept communication (\gls{MITM}) in most cases. This is especially useful if
a user is connected to the Internet through a public access point which is open
for everyone and thus does not inhibit \gls{MITM} attacks. Furthermore,
\gls{VPN} clients often use their own \gls{DNS} resolver to resolve domain
names into \gls{IP} addresses and vice versa. An \gls{ISP} interested in knowing
what kind of pages their customers visit is therefore not able to look at their
\gls{DNS} records to obtain a browsing history for individual \gls{IP}
addresses. Besides masking \gls{IP} addresses, \glspl{VPN} are effective tools
for accessing content that is not available in a given country. Netflix-hosted
content, for example, differs between countries, and users in Germany
might be able to access content only available in the United States by using a
\gls{VPN} that assigns them an American \gls{IP} address.

Even though \glspl{VPN} have the aforementioned benefits, their tracking
protection capabilities are limited. \citet{papadopoulosExclusiveHowSynced2018}
demonstrate how correctly secured \gls{VPN} sessions can be breached via Cookie
Synchronization (section~\ref{subsec:cookie synchronization}).
Figure~\ref{fig:cookie-synchronization-vpns} shows their attack model, resulting
in a snooping \gls{ISP} receiving identifying information despite an encrypted
\gls{VPN} session. Every form of session-based tracking still applies to
sessions over \glspl{VPN} with the difference that the unique identifiers set
within the browser do not correspond to the original \gls{IP} address but the
one given by the \gls{VPN} service. Even storage-based and cache-based tracking
methods are unencumbered by \glspl{VPN}. All of these methods work without
knowing the correct \gls{IP} address. Tying tracking information to a particular
user might be more difficult because the \gls{IP} address is not the same but as
soon as there is enough identifying information about one user and across
sessions, these events can be correlated with each other to form a complete
personal profile.

Unfortunately, \gls{VPN} services have left many non-technical people with the
impression that they generally protect privacy online. While the Tor
network (section~\ref{subsec:tor}) provides a much more comprehensive defense
against tracking mechanisms, it appears too technical and complicated for the
average user. \glspl{VPN}, by contrast, appear to be a set-and-forget solution
to protecting
one's privacy online. \citet{khanEmpiricalAnalysisCommercial2018} show, however,
that choosing a \gls{VPN} is a difficult task by itself and that many services
do not manage to live up to their promises. In some cases \glspl{VPN} allegedly
intercept traffic and track users themselves (Hotspot Shield Free \gls{VPN}
\cite{centerfordemocracytechnologyComplaintRequestInvestigation2017}). Choosing
a \gls{VPN} is more difficult still because online recommendations are usually
driven by affiliate programs, further confusing inexperienced users.

\begin{figure}
    \includegraphics[width=1\textwidth]{figures/cookie-syncing-vpns.png}
    \caption{Breaching a \gls{TLS}-encrypted \gls{VPN} session via Cookie
    Synchronization. A user accesses a website \texttt{example.com} over a
    correctly secured \gls{VPN} and \gls{TLS}. \texttt{tracker1.com} receives a
    cookie and performs cookie synchronization over \gls{HTTP} with
    \texttt{tracker2.com}. The snooping \gls{ISP} can identify the user even
    through the \gls{VPN} and across sessions by reading the synced \gls{HTTP}
    cookie \cite[p.~2]{papadopoulosExclusiveHowSynced2018}.}
    \label{fig:cookie-synchronization-vpns}
\end{figure}