\chapter{Tracking Methods}
\label{chap:tracking methods}
This chapter will go into detail about various tracking methods that have been
used during the history of the web. It is important to note that some of those
approaches to tracking date back to when the World Wide Web was still in its
early development stages. Knowing where the techniques come from helps in
correctly judging the impact they had and still have on the Internet as we use
it today. Furthermore, knowledge about the past allows for better predictions of
future changes in the tracking ecosystem.
To aid in understanding how they work and where they fit in the tracking
landscape, three different categories are identified and presented:
session-based, storage-based and cache-based tracking methods. Each category
uses different mechanisms and technologies to enable tracking of users. What
most of them have in common is that they place unique identifiers in various
locations from which they can be read on subsequent visits. Thus, a
chronological ordering of events enables interested parties to infer not only
usage statistics but also specific data about the entities behind those
identifiers.
\section{Session-based Tracking Methods}
\label{sec:session-based tracking methods}
One of the simplest and most widely used forms of tracking on the Internet relies on
sessions. Since \gls{HTTP} is a stateless protocol, web servers cannot by
default keep track of any previous client requests. In order to implement
specific features such as personalized advertising, some means to save current
and recall previous states must be used. For this functionality, sessions were
introduced. Sessions represent a temporary and interactive exchange of
information between two parties. Due to their temporary nature, they have to be
`brought up' at some point and `torn down' at a later point in time. It is not
specified, however, how long the period between establishing and terminating a
session has to be. A session could last only for a single browser session and be
ended by the user manually, or it could persist for as long as a year.
\subsection{Passing Information in URLs}
\label{subsec:passing information in urls}
\glspl{URL} were first proposed by Berners-Lee in 1994
\cite{berners-leeUniformResourceLocators1994} and are based on \glspl{URI}
\cite{berners-leeUniversalResourceIdentifiers1994}. The latter specifies a way
to uniquely identify a particular resource. The former extends the \gls{URI}
specification to include where and how a particular resource can be found.
\glspl{URI} consist of multiple parts:
\begin{enumerate}
\item a scheme (in some cases a specific protocol),
\item an optional authority (network host or domain name),
\item a path (a specific location on that host),
\item an optional query and
\item an optional fragment preceded by a hash sign \texttt{\#} (pointing to
a specific location within the resource)
\end{enumerate}
To access a section called \texttt{introduction} in a blog post named
\texttt{blogpost} on a host with the domain name \texttt{example.com} over
\gls{HTTP}, a user might use the following \gls{URI}:
\begin{verbatim}
http://example.com/blogpost/#introduction
\end{verbatim}
Even though \glspl{URI} and \glspl{URL} are two different things, they are
mostly used interchangeably today. Especially non-technical people refer to an
address on the \gls{WWW} simply as a \gls{URL}.
The optional query component is in most cases composed of multiple
\texttt{(key,value)} pairs, separated by delimiters such as \texttt{\&} and
\texttt{;}. In the tracking context, query parameters can be used to pass
information (e.g. unique identifiers) to the resource that is to be accessed by
appending a unique string to all the links within the downloaded page. Since
requests to pages are generally logged by the server, requesting multiple pages
with the same unique identifier leaves a trail behind that can be used to
compile a browsing history. Sharing information with other parties is not only
limited to unique identifiers. \gls{URL} parameters can also be used to pass the
referrer of a web page containing a query that has been submitted by the user.
\citet{falahrastegarTrackingPersonalIdentifiers2016} demonstrate such an
example where an advertisement tracker logs a user's browsing history by storing
the referrer into a \texttt{(key,value)} pair
\cite[p.~37]{falahrastegarTrackingPersonalIdentifiers2016}. Other possibilities
include encoding geographical data, network properties, user information (e.g.,
e-mails) and authentication credentials.
\citet{westMeasuringPrivacyDisclosures2014} conducted a survey concerning
the use of \gls{URL} query strings and found the practice to be widespread on
the web.
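To illustrate how such an identifier can be propagated through query parameters,
the following minimal JavaScript sketch rewrites all same-site links on a page so
that subsequent requests carry the identifier; the parameter name \texttt{uid}
and the identifier value are illustrative assumptions rather than the scheme of
any particular tracker.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Append a visitor identifier to every same-site link so that the next
// request (and the corresponding server log entry) carries it.
const uid = "a1b2c3d4"; // assumed to be assigned by the server on the first visit

document.querySelectorAll("a[href]").forEach((link) => {
  const url = new URL(link.href);
  if (url.origin === window.location.origin) { // only rewrite same-site links
    url.searchParams.set("uid", uid);          // e.g. /blogpost/?uid=a1b2c3d4
    link.href = url.toString();
  }
});
\end{minted}
Requesting several pages rewritten in this way leaves the same \texttt{uid}
value in every server log entry, which is exactly the trail described above.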
\subsection{Hidden Form Fields}
\label{subsec:hidden form fields}
The \gls{HTML} standard specifies form elements
\cite{whatwgFormsHTMLStandard2020}, which allow users to submit information
(e.g., for authentication) to the server via the POST or GET methods. Normally, a
user would input data into a form and on clicking \emph{submit} the input would
be sent to the server. Sometimes it is necessary to include additional
information that the user did not enter. For this purpose, \emph{hidden} form
fields exist \cite{whatwgHiddenStateHTML2020}. Hidden fields are not rendered on
the web site and therefore the user cannot enter or modify their contents.
Similarly to \gls{URL} parameters, the value attribute of a hidden field can
carry additional information, such as the user's preferred language.
Since almost anything can be placed in a value attribute, hidden form fields
present another way to maintain a session. A parameter containing a unique
identifier will be sent with the data the user has submitted to the server. The
server can then match the action the user took with the identifier. In case the
server already knows that specific identifier from a previous interaction with
the user, the gained information can now be added to the user's browsing
profile. An example of a hidden web form is given in Listing~\ref{lst:hidden web
form}, which has been adapted from \cite{InputFormInput}. In Line 15 a hidden
field is created and its \texttt{value} attribute is set by the server to
contain a unique user identifier. Once the \emph{submit} button has been
clicked, the identifier is sent to the server along with the data the user has
filled in.
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{html}{code/hidden-web-form.html}
\caption{Example of an \gls{HTTP} form containing a hidden field with
\texttt{id=userId}. The id is set by the web server dynamically so that every
visitor has a unique identifier attached to the form.}
\label{lst:hidden web form}
\end{listing}
\subsection{HTTP Referer}
\label{subsec:http referer}
Providers of web services often want to know where visitors to their web site
come from to understand more about their users and their browsing habits. The
\gls{HTTP} specification accounts for this by introducing the \emph{\gls{HTTP}
Referer field} [\emph{sic}] \cite{fieldingHTTPSemanticsContent2014} in the
header. By checking the referrer, the server can see where the request came
from. In practice, a user clicks on a link on a web page and the current web
page is sent as a \gls{URL} in the \gls{HTTP} Referer field. The header with the
referrer information gets attached to the \gls{HTTP} request which is sent to
the server. The server responds with the requested web page and can establish a
link from the original web page to the new web page. When applied to a majority
of the requests on a site, the resulting data can be analyzed for promotional
and statistical purposes. \citet{malandrinoPrivacyAwarenessInformation2013}
have shown that the \gls{HTTP} Referer is one of the most critical factors in
leaking \gls{PII}; in their study, leakage of information relating to users'
health was identified as the most severe in terms of identifiability of users on
the web.
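On the client side, the same information is exposed to scripts through the
\gls{DOM} property \texttt{document.referrer}. The following sketch shows how a
hypothetical third-party snippet could report the referring page together with
the current page to its own server; the collection endpoint is an illustrative
assumption.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Report the referring page and the current location to a hypothetical
// collection endpoint by requesting a 1x1 image (a classic tracking beacon).
const beacon = new Image();
beacon.src = "https://tracker.example/collect"
  + "?ref=" + encodeURIComponent(document.referrer)
  + "&loc=" + encodeURIComponent(window.location.href);
\end{minted}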
\subsection{Explicit Authentication}
\label{subsec:explicit authentication}
Explicit authentication requires a user to \emph{explicitly} log in or register
to the web site. This way, specific resources are only available to users after
they have authenticated themselves to the service. Actions taken on an
authenticated user account are tied to that account and crafting a personal
profile is more or less a built-in function in this case. Since merely asking a
user to authenticate is a simple method, the extent to which it can be used is
limited. Logged-in users generally do not stay logged in across different
browser sessions unless cookies are used to do so (see section~\ref{subsec:http
cookies}), which limits tracking to one session at a time. Furthermore,
always requiring a logged-in state can be tiresome for users, because they
have to authenticate every time they visit a particular service. This can
potentially pose a usability problem where users simply stop using the service
or go to considerable lengths to avoid logging in, depending largely on a
cost-benefit analysis the users subconsciously undertake. The third factor
where this method is lacking concerns the user's awareness of being tracked.
Since tracking users depends on them actively logging in to the service,
tracking them without their knowledge is impossible. Even though most tracking efforts
are not detected by the average user, it is known that actions taken on an
account are logged to provide better service through service optimization and
profile personalization.
Creating an account on a web site to use its services to their full extent can
be beneficial in some cases. Facebook, for example, allows its users to
configure what they want to share with the public and their friends. Research
has shown, however, that managing which posts get shown to whom is not as
straightforward as one might think. \citet{liuAnalyzingFacebookPrivacy2011}
conducted a survey where they asked Facebook users about their desired privacy
and visibility settings and cross-checked them against the settings actually
used for their posts. The results showed that in only 37\% of cases did the
users' expectations match reality. Additionally, 36\% of content is left on
the default privacy settings which set the visibility of posts to public,
meaning that any Facebook user can view them.
\subsection{window.name DOM Property}
\label{subsec:window.name dom property}
The \gls{DOM} is a platform and language agnostic \gls{API} which defines the
logical structure of web documents (i.e., \gls{HTML}, \gls{XHTML} and \gls{XML})
and the way they are accessed and manipulated. The \gls{DOM} was originally
introduced by Netscape at the same time as JavaScript as the \gls{DOM} Level 0.
The first recommendation (\gls{DOM} Level 1) was released in 1998 by the
\gls{W3C} \gls{DOM} working group \cite{w3cDocumentObjectModel1998} which
published its final recommendation (\gls{DOM} Level 3) in 2004. Since then the
\gls{WHATWG} took over and in 2015 published the \gls{DOM} Level 4 standard
\cite{whatwgDOMLivingStandard2020} which replaces the Level 3 specification. The
\gls{DOM} organizes all objects in a document in a tree structure, which allows
individual parts to be altered when a specific event occurs (e.g., user
interaction). Furthermore, each object has properties which apply either to
all \gls{HTML} elements or only to a subset of all elements.
One useful property for tracking purposes is the \texttt{window.name} property
\cite{whatwgWindowNameHTML2020}. Its original intention was to allow
client-side JavaScript to get or set the name of the current window. Since
windows do not have to have names, the \texttt{window.name} property is mostly
used for setting targets for hyperlinks and forms. Modern browsers allow storing
up to two megabytes of data in the \texttt{window.name} property, which makes it
a viable option for use as a data store or---more specifically---for maintaining
session variables. In order to store multiple variables in the
\texttt{window.name} property, the values first have to be serialized because
only a single string can be stored. A \gls{JSON} serializer converts a
JavaScript object into a \gls{JSON} string, which can then be stored in the
\gls{DOM} property. Normally the same-origin policy prohibits scripts from
accessing documents of another domain, but the \texttt{window.name} property is
accessible from other domains and persists across page navigations. Maintaining
a session across domains and without cookies is therefore possible and multiple
implementations exist
\cite{frankSessionVariablesCookies2008,zypWindowNameTransport2008}.
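A minimal sketch of \texttt{window.name} as a session store is given below; the
stored keys and values are illustrative assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Serialize session data into window.name and read it back later. The value
// survives navigations within the same tab.
function saveSession(data) {
  window.name = JSON.stringify(data);
}

function loadSession() {
  try {
    return window.name ? JSON.parse(window.name) : {};
  } catch (e) {
    return {}; // window.name contained a plain, non-JSON string
  }
}

saveSession({ userId: "a1b2c3d4", lastSeen: Date.now() });
// After navigating to another page in the same tab:
console.log(loadSession().userId); // "a1b2c3d4"
\end{minted}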
\section{Storage-based Tracking Methods}
\label{sec:storage-based tracking methods}
Storage-based tracking methods differ from session-based tracking methods
in that they try to store information on the client's computer not only for
single sessions but for as long as desired. The following methods can be used to
store session data as well but are not limited to that use case. They generally
enable more advanced tracking approaches because they have information about the
current browser instance and the operating system the browser is running on. Due
to their nature of residing on the user's computer, they are in most cases
harder to circumvent, especially when two or more methods are combined, resulting
in better resilience against simple defenses.
\subsection{HTTP Cookies}
\label{subsec:http cookies}
The method most often associated with tracking on the Internet is
tracking with \gls{HTTP} cookies. Cookies are small pieces of data that are
placed in the browser's storage on the user's computer. They are limited to four
kilobytes in
size and are generally used to identify and authenticate users and to store
web site preferences. They were introduced to the web to allow stateful
information to be stored because \gls{HTTP} is a stateless protocol and
therefore does not have this capability. It is also a way of reducing the
server's load by not having to recompute states every time a user visits a
web site. Shopping cart functionality for example can thus be implemented by
setting a cookie in the user's browser, saving the items which are currently
added to the shopping cart and giving the user the possibility to resume
shopping at a later point provided that they do not delete their cookies. With
the introduction of cookies, advertising companies could reidentify users by
placing unique identifiers in the browser and reading them on subsequent visits.
The first standard for cookies was published in 1997
\cite{kristolHTTPStateManagement1997} and has since been updated multiple times
\cite{kristolHTTPStateManagement2000,barthHTTPStateManagement2011}.
Cookies can be divided into two categories: first party cookies, which are
created by the domain the user has requested and third party cookies, which are
placed in the user's browser by other domains that are generally not under the
control of the first party \cite{barthThirdPartyCookies2011}. Whereas first
party cookies are commonly not used for tracking but for the aforementioned
shopping cart functionality for example or enabling e-commerce applications to
function properly, third party cookies are popular with data brokerage firms
(e.g., Datalogix, Experian, Equifax), online advertisers (e.g., DoubleClick)
and---belonging to both of these categories in some cases---social media
platforms (e.g., Facebook) \cite{cahnWhatCommunityCookie2016}. The distinction
between these two categories is not always clear, however. Google Analytics for
example is considered to be a third party but offers its analytics services by
setting a first party cookie in the user's browser in addition to loading
JavaScript snippets from their servers. Therefore, categorizing cookies into
those that serve third party web content and those that serve first party web
content presents a more adequate approach.
Cookies are set either by calling scripts that are embedded in a web page (e.g.,
Google's \texttt{analytics.js}) or by using the \gls{HTTP} Set-Cookie response
header. Once a request to a web server has been issued, the server can set a
cookie in the Set-Cookie header and send the response back to the client. On
the client's side the cookie is stored by the browser and sent with subsequent
requests to the same domain via the Cookie \gls{HTTP} header. An example of a
cookie header is given in Listing~\ref{lst:session cookie header}. Because this
example does not set an expiration date for the cookie, it sets a session
cookie. Session cookies are limited to the current session and are deleted as
soon as the session is `torn down'. By adding an expiration date (demonstrated
in Listing~\ref{lst:permanent cookie header}) or a maximum age, the cookie
becomes permanent. Additionally, the Domain attribute can be specified; cookies
whose Domain attribute does not include the origin server are rejected by the
user agent \cite[section 4.1.2.3]{barthHTTPStateManagement2011}. The same-origin
policy applies to cookies, disallowing access by other domains.
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/session-cookie-header}
\caption{An example of an \gls{HTTP} header setting a session cookie.}
\label{lst:session cookie header}
\end{listing}
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/permanent-cookie-header}
\caption{An example of an \gls{HTTP} header setting a permanent cookie.}
\label{lst:permanent cookie header}
\end{listing}
Distinguishing tracking and non-tracking cookies can be done with high accuracy
by observing their expiration time and the length of the value field.
\citet{liTrackAdvisorTakingBack2015} demonstrate a supervised learning approach
to detecting tracking cookies with their tool \emph{TrackAdvisor}. They found
that tracking cookies generally have a longer expiration time than non-tracking
cookies and they need to have a sufficiently long value field carrying the
unique identifier. Using this method, they found that only 10\% of tracking
cookies have a lifetime of a single day or less while 80\% of non-tracking
cookies expire before a day is over. Additionally, a length of more than 35
characters in the value field applies to 80\% of tracking cookies and a value
field of less than 35 characters applies to 80\% of non-tracking cookies.
\emph{Cookie Chunking}, where a cookie of larger length is split into multiple
cookies with smaller length, did not appear to affect detection by their method
negatively. They also present a site measurement of the Alexa Top 10,000 web
sites, finding that 46\% of web sites use third party tracking. More recent
research \cite{gonzalezCookieRecipeUntangling2017} has shown that tracking
cookies do not have to be long lasting to accumulate data about users. Some
cookies---like the \texttt{\_\_utma} cookie from Google Analytics for
example---save a timestamp of the current visit with the unique identifier,
thereby allowing short-lived cookies to be chained together afterwards to
complete the whole picture.
\citet{gonzalezCookieRecipeUntangling2017} have also found 20\% of observed
cookies to be \gls{URL} or base64 encoded, making decoding of cookies a
necessary step for analysis. Furthermore---and contrary to previous
work---cookie values come in many more varieties than is assumed by approaches
that only try to detect cookies by their expiration date and/or character
length. They also present an entity-based matching algorithm to dissect
cookies which contain more than a unique identifier. This allows for a better
understanding and interpretation of complex cookies as they are found in
advertising networks with wide reach (e.g., doubleclick.net). This
information is particularly useful for building applications that effectively
detect and block cookies (see chapter~\ref{chap:defenses against tracking}).
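As an illustration, the lifetime-and-length features reported by
\citet{liTrackAdvisorTakingBack2015} can be turned into a simple heuristic as
sketched below; this is not TrackAdvisor's actual classifier, and, as the more
recent findings above show, such a heuristic alone misses many tracking cookies.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Flag a cookie as a likely tracking cookie if it lives longer than a day
// and its value is long enough to hold a unique identifier. The cookie
// object shape (expires as a millisecond timestamp) is an assumption.
const DAY_MS = 24 * 60 * 60 * 1000;

function looksLikeTrackingCookie(cookie) {
  const lifetime = cookie.expires - Date.now();
  return lifetime > DAY_MS && cookie.value.length > 35;
}

console.log(looksLikeTrackingCookie({
  value: "GA1.2.1234567890.1580000000_abcdef01", // 36 characters
  expires: Date.now() + 2 * 365 * DAY_MS,        // roughly two years
})); // true
\end{minted}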
\subsection{Flash Cookies and Java JNLP PersistenceService}
\label{subsec:flash cookies and java jnlp persistenceservice}
Flash Cookies \cite{adobeAdobeFlashPlatform} are similar to \gls{HTTP} cookies in that
they too are a store of information that helps web sites and servers to
recognize already seen users. They are referred to as \glspl{LSO} by Adobe and
are part of the Adobe Flash Player runtime. Instead of storing data in the
browser's storage, they have their own storage in a different location on the
user's computer. Another difference is that they can store 100 kilobytes of
data instead of only 4 kilobytes, and they also have no expiration date by
default (\gls{HTTP} cookies live until the end of the session unless specified
otherwise). Since Flash cookies are not created by means the browser normally
supports (i.e., \gls{HTTP}, \gls{CSS}) but by Adobe's Flash Player runtime,
browsers do not manage Flash cookies. This means that, due to Flash cookies
not being tied to a specific browser, they function across browsers. This
capability makes them an interesting target for trackers to store their
identifying information in, because out of the box browsers initially did not
support removing Flash cookies and one had to manually set preferences in the
\emph{Web Storage Settings panel} provided by the Flash Player runtime to get
rid of them. Trackers were searching for a new way to store identifiers because
users became increasingly aware of the dangers posed by \gls{HTTP} cookies and
reacted by taking countermeasures.
\citet{soltaniFlashCookiesPrivacy2009} were the first to report on the usage of
Flash cookies by advertisers and popular web sites. While surveying the top 100
web sites at the time, they found that 54\% of them used Flash cookies. Some
web sites were setting Flash cookies as well as \gls{HTTP} cookies with the
same values, suggesting that Flash cookies serve as backup to \gls{HTTP}
cookies. Several web sites were found using Flash cookies to respawn already
deleted \gls{HTTP} cookies, even across domains.
\citet{acarWebNeverForgets2014} automated detecting Flash cookies and access to
them by monitoring file access with the GNU/Linux \emph{strace} tool
\cite{michaelStraceLinuxManual2020}. This allowed them to acquire data about
Flash cookies respawning \gls{HTTP} cookies. Their results show that six of the
top 100 sites use Flash cookies for respawning.
Even though Flash usage has declined during the last few years thanks to the
development of the HTML5 standard, \citet{buhovFLASH20thCentury2018} have shown
that despite major security flaws, Flash content is still served by 7.5\% of
the top one million web sites (2017). The W3Techs Web Technology Survey shows
a similar trend and also offers an up-to-date measurement of 2.7\% of the top
ten million web sites for the year 2020
\cite{w3techsHistoricalYearlyTrends2020}. Due to the security concerns with
using Flash, Google's popular video sharing platform YouTube switched by
default to the HTML5 \texttt{<video>} tag in January of 2015
\cite{youtubeengineeringYouTubeNowDefaults2015}. In 2017 Adobe announced that
they will end-of-life Flash at the end of 2020, stopping updates and
distribution \cite{adobecorporatecommunicationsFlashFutureInteractive2017}.
Consequently, Chrome 76 and Firefox 69 disabled Flash by default and will drop
support entirely in 2020.
Similarly to Flash, Java also provides a way of storing data locally on the
user's computer via the PersistenceService \gls{API}
\cite{PersistenceServiceJNLPAPI2015}. It is used by the evercookie library
(section~\ref{subsec:evercookie}) to store values for cookie respawning by
injecting a Java applet into the \gls{DOM} of a page
\cite{baumanEvercookieApplet2013}.
\subsection{Evercookie}
\label{subsec:evercookie}
Evercookie is JavaScript code that can be embedded in web sites and allows
information to be stored permanently on the user's computer. When activated,
information is not only stored in standard \gls{HTTP} cookies but also in
various other places, providing redundancy where possible. A full list of
locations used by Evercookie can be found on the project's GitHub page
\cite{kamkarSamykEvercookie2020}. If a user wants to get rid of all
information stored after visiting a web site that uses evercookies, every
location has to be cleared, because if even one copy remains, all the others are
restored. The cookie deletion mechanisms that browsers provide by default do not
clear all locations where evercookies are stored, which makes evercookie almost
impossible to avoid. Evercookie is open source, which makes deploying it quietly
difficult, since its behavior is publicly known and can be detected.
Additionally, it is reported on the project's GitHub page that it might cause
severe performance issues in browsers.
Evercookie has been proposed and implemented by
\citet{kamkarEvercookieVirtuallyIrrevocable2010}. Multiple surveys have tried
to quantify the use of evercookie in the wild. \citet{acarWebNeverForgets2014}
provide a heuristic for detecting evercookies stored on the user's computer and
analyze evercookie usage in conjunction with cookie respawning.
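From a web site's perspective, using the library requires only a few calls. The
sketch below follows the usage documented on the project's GitHub page
\cite{kamkarSamykEvercookie2020}; the key name and identifier value are
illustrative assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// evercookie.js is assumed to be loaded on the page.
var ec = new evercookie();

// Write the identifier into all supported storage vectors at once.
ec.set("uid", "a1b2c3d4");

// On a later visit: read the identifier back. If it survived in at least one
// vector, evercookie restores it to all the others.
ec.get("uid", function (value) {
  console.log("restored identifier:", value);
});
\end{minted}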
\subsection{Cookie Synchronization}
\label{subsec:cookie synchronization}
When trackers use cookies to store unique identifiers, every tracker assigns a
different identifier to the same user, because the same-origin policy prevents
one tracker from reading another tracker's cookies. Because of this,
sharing data between multiple trackers is difficult, since there are no easy
ways to accurately match an accumulated profile history of one identifier to
another. This problem has been solved by modern trackers by using a mechanism
called Cookie Synchronization or Cookie Matching
\cite{googleinc.CookieMatchingRealtime2020}. This technique allows multiple
trackers to open an information sharing channel between each other without
necessarily having to know the web site the user visits.
\begin{figure}[ht]
\centering
\includegraphics[width=1\textwidth]{figures/cookiesyncing.pdf}
\caption{Cookie Synchronization in practice between two trackers
\label{fig:cookie synchronization}
\emph{cloudflare.com} and \emph{google.com}.}
\end{figure}
An example of how Cookie Synchronization works in practice is given in
Figure~\ref{fig:cookie synchronization}. The two parties that are interested in
tracking the user are called \emph{cloudflare.com} and \emph{google.com} in this
example. The user they want to track is called \emph{browser}. \emph{Browser}
first visits \emph{website1.com} which loads JavaScript from
\emph{cloudflare.com}. \emph{Cloudflare.com} sets a cookie in the browser with a
tracking identifier called \emph{userID = 1234}. Next, \emph{browser} visits
another web site called \emph{website2.com} which loads an advertisement banner
from \emph{google.com}. \emph{Google.com} also sets a cookie with the tracking
identifier \emph{userID = ABCD}. \emph{Browser} has now two cookies from two
different providers, each of them knowing the user under a different identifier.
When \emph{browser} visits a third web site called \emph{website3.com} which
makes a request to \emph{cloudflare.com}, \emph{cloudflare.com} recognizes the
user by the identifier \emph{userID = 1234} and sends an \gls{HTTP}
redirect, redirecting \emph{browser} to \emph{google.com}. The redirect also
contains an \gls{HTTP} Query String (see section~\ref{subsec:passing information
in urls}) which adds a query like \emph{?userID=1234\&publisher=website3.com}.
The complete GET request to \emph{google.com} might look like this:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}
GET /index.html?userID=1234&publisher=website3.com HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Host: google.com
Cookie: userID=ABCD
\end{minted}
\emph{Google.com} therefore not only knows that the user with the identifier
\emph{userID=ABCD} visited \emph{website3.com} but also that \emph{browser} is
the same user as \emph{userID=1234}. Since the identifiers can now be traced
back to the same person, the different cookies have been synchronized, allowing
the two trackers to exchange information about the user without him or her
knowing.
Cookie Synchronization has seen widespread adoption especially in \gls{RTB}
based auctions \cite{olejnikSellingPrivacyAuction2014}.
\citet{papadopoulosCookieSynchronizationEverything2019} recorded and analyzed
the browsing habits of 850 users over a time period of one year and found that
97\% of users with regular browsing activity were exposed to Cookie
Synchronization at least once. Furthermore, they found that ``[...] the average
user receives around 1 synchronization per 68 requests''
\cite[p.~7]{papadopoulosCookieSynchronizationEverything2019}. In
\cite{englehardtOnlineTracking1MillionSite2016} the authors crawl the top
100,000 sites and find that 45 of the top 50 (90\%) third parties and 460 of
the top 1000 (46\%) use Cookie Synchronization with at least one other party.
\emph{Doubleclick.net} is at the top, sharing 108 cookies with 118 other
third parties. \citet{papadopoulosExclusiveHowSynced2018} show the threat
that Cookie Synchronization poses to encrypted \gls{TLS} sessions when the
cookie-syncing is performed over unencrypted \gls{HTTP} even though the original request
to the web site was encrypted. This highlights the serious privacy implications
for users of \gls{VPN} services trying to safeguard their traffic from a
potentially malicious \gls{ISP}.
\subsection{Silverlight Isolated Storage}
\label{subsec:silverlight isolated storage}
Silverlight Isolated Storage can also be used for storing data for tracking
purposes on the user's computer. It has been compared to Adobe's Flash
technology, as it too requires a browser plugin (in this case from Microsoft) to
function. Available for storage are 100 kilobytes, the same amount Flash cookies
can store. Silverlight does not work in private browsing mode and can only be cleaned
manually by deleting a hidden directory in the filesystem or by changing
settings in the Silverlight application. Silverlight's Isolated Storage is one
of the methods evercookie (section~\ref{subsec:evercookie}) uses to make
permanent deletion of cookies hard to do and to facilitate cookie respawning.
Usage of Silverlight has seen a steady decline since 2011 even though it has
been used by popular video streaming web sites such as Netflix
\cite{NetflixBeginsRollOut2010} and Amazon. Microsoft did not include
Silverlight support in Windows 8 and declared end-of-life in a blog post for
October of 2021 \cite{SilverlightEndSupport2015}. Usage of Silverlight currently
hovers around 0.04\% for the top 10 million web sites
\cite{w3techsUsageStatisticsSilverlight2020}.
\subsection{HTML5 Web Storage}
\label{subsec:html5 web storage}
HTML5 Web Storage comes in three different forms: HTML5 Global Storage, HTML5
Local Storage and HTML5 Session Storage. It is part of the HTML specification
\cite{whatwgHTMLStandard2020} and provides means for storing name-value pairs on
the user's computer. HTML5 Web Storage works similarly to cookies but enables
developers to manage transactions that are done by the user simultaneously but
in two different windows. Whereas with cookies the transaction can accidentally
be recorded twice, HTML5 Web Storage allows multiple windows to access the same
storage on the user's computer thereby avoiding this problem. In contrast to
cookies, which are sent to the server every time a request is made, HTML5 Storage
contents do not get sent to the web server. By default the storage limit is
configured to be 5 megabytes per origin \cite{whatwgHTMLStandard2020a}. Even
though this was only a recommendation by the standard, all modern browsers
adhere to it. More space can be allocated upon asking the user for permission to
do so.
Global Storage was part of an initial HTML5 draft and is accessible across
applications. Due to it violating the same-origin policy, most major browsers
have not implemented Global Storage.
Local Storage does, however, obey the same-origin policy by only allowing the
originating domain access to its name-value pairs. Every web site has its own
separate storage area, which maintains a clear separation of concerns. Local
Storage lends itself to different use cases. Especially applications that
should function even when no internet connection exists can use Local Storage to
enable that functionality. Coupled with file-based \glspl{API}, which are
generally not limited in storage (except by the available disk space), offering
the full range of features in offline-mode is feasible.
The third category of HTML5 Web Storage is similar to Local Storage, but
requires that the stored data be deleted after a session is closed. While
content that is persisted by Local Storage must be deleted explicitly by the
user, Session Storage has the intended function of providing non-persistent
storage.
HTML5 Web Storage can be used for tracking in the same way that cookies are
used: by storing unique identifiers which are read on subsequent visits.
\citet{ayensonFlashCookiesPrivacy2011} found that 17 of the top 100 web sites
used HTML5 Web Storage, with some of them using it for cookie respawning (see
section~\ref{subsec:evercookie}). A recent survey by
\citet{belloroKnowWhatYou2018} looks at Web Storage usage in general and found
that 83.09\% of the top 10K Alexa web sites use it. The authors flagged 63.88\%
of those usages as coming from known tracking domains.
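A minimal sketch of identifier-based tracking via Local Storage is shown below;
the key name, the way the identifier is generated and the reporting endpoint are
illustrative assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Persist a visitor identifier in Local Storage and reuse it on later visits.
let uid = localStorage.getItem("uid");
if (uid === null) {
  uid = Math.random().toString(36).slice(2); // first visit: mint an identifier
  localStorage.setItem("uid", uid);          // survives browser restarts
}
// Report the (re)identified visitor, e.g. via an image beacon.
new Image().src = "https://tracker.example/hit?uid=" + encodeURIComponent(uid);
\end{minted}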
\subsection{HTML5 Indexed Database API}
\label{subsec:html5 indexed database api}
The need for client side storage to provide performant web applications that can
also function offline has prompted the inception of alternative methods to
store and retrieve information. Consequently, the development of the HTML5
standard has tried to fill that need by introducing HTML5 Web Storage and the
HTML5 Indexed Database \gls{API}.
HTML5 Indexed Database \gls{API} provides an interface for storing values and
hierarchical objects using the well-known key-value pair storage principle
\cite{alabbasIndexedDatabaseAPI2020}. This property makes it similar to NoSQL
storage solutions which have seen increasing adoption rates on the web. It is
the successor to the abandoned Web SQL Database (see section~\ref{subsec:web
sql database}) standard and functions similarly to the HTML5 Web Storage,
meaning that it has the same storage limits and privacy implications and has to
obey the same-origin policy. In contrast to HTML5 Web Storage, IndexedDB is
intended for storing larger amounts of data and provides additional functions
such as in-order key retrieval. Reading from and writing to an IndexedDB is done
with JavaScript by opening a connection to the database, preparing a transaction
and committing it. The development of the standard is ongoing with two editions
already published and recommended by the W3C and the third edition existing as
an editor's draft until it is ready for recommendation.
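A sketch of how an identifier could be persisted in IndexedDB is given below;
the database, object store and key names are illustrative assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Open (or create) a database with a single object store and look up the
// stored identifier; store a new one on the first visit.
const request = indexedDB.open("tracking-db", 1);

request.onupgradeneeded = () => {
  request.result.createObjectStore("ids"); // runs only when the database is created
};

request.onsuccess = () => {
  const db = request.result;
  const tx = db.transaction("ids", "readwrite");
  const store = tx.objectStore("ids");
  const lookup = store.get("uid");

  lookup.onsuccess = () => {
    if (lookup.result === undefined) {
      store.put("a1b2c3d4", "uid");        // put(value, key) for out-of-line keys
    } else {
      console.log("returning visitor:", lookup.result);
    }
  };
  tx.oncomplete = () => db.close();
};
\end{minted}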
HTML5 IndexedDB has been added to the evercookie library (see
section~\ref{subsec:evercookie}) by
\citet{kamkarEvercookieVirtuallyIrrevocable2010}, providing redundancy for
\gls{HTTP} cookies. \citet{acarWebNeverForgets2014} have shown that only 20 of
100,000 surveyed sites use the IndexedDB storage vector with one of them
(\texttt{weibo.com}) using it for respawning \gls{HTTP} cookies. A more recent
study by \citet{belloroKnowWhatYou2018} paints a different picture: On a
dataset provided by the \gls{HTTP} Archive project
\cite{soudersAnnouncingHTTPArchive2011}, they found that 5.56\% of observed
sites use IndexedDB. Of those that use IndexedDB, 31.87\% of usages appear to
be coming from domains that are flagged as trackers.
\subsection{Web SQL Database}
\label{subsec:web sql database}
Web SQL Database \cite{hicksonWebSQLDatabase2010} was initially developed to
provide an \gls{API} for storing data in databases. The stored data can then be
queried using \gls{SQL} or variants thereof. The W3C stopped the development of
the standard in 2010 due to a lack of implementations based on backends other
than SQLite, which would be necessary for a recommendation as a standard. Browsers have
turned to HTML5 IndexedDB (see section~\ref{subsec:html5 indexed database api}),
the ``spiritual successor'' to Web SQL Database, for web database storage.
Despite the W3C deprecating Web SQL Database, some browsers such as Chrome,
Safari and Opera still support it and have no plans of discontinuing it.
In the same way that other tracking technologies can maintain a history of web
site visits and actions, Web SQL Database can store identifying information via
the usage of unique identifiers. An arbitrary maximum size of 5 megabytes of
storage per origin is recommended by the standard, with the possibility to ask
the user for more capacity. This limit includes other domains which are
affiliated with the origin but have a different name (e.g. subdomains).
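A sketch of such identifier storage through the Web SQL \gls{API}, in browsers
that still support it, is given below; the database, table and column names are
illustrative assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// openDatabase(name, version, description, estimated size in bytes)
var db = openDatabase("tracking", "1.0", "visitor identifiers", 2 * 1024 * 1024);

db.transaction(function (tx) {
  tx.executeSql("CREATE TABLE IF NOT EXISTS ids (name TEXT PRIMARY KEY, val TEXT)");
  tx.executeSql("SELECT val FROM ids WHERE name = ?", ["uid"], function (tx, result) {
    if (result.rows.length === 0) {
      // first visit: store a fresh identifier
      tx.executeSql("INSERT INTO ids (name, val) VALUES (?, ?)", ["uid", "a1b2c3d4"]);
    } else {
      console.log("returning visitor:", result.rows.item(0).val);
    }
  });
});
\end{minted}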
Due to the W3C abandoning the Web SQL Database standard, not many reports on
usage for tracking purposes exist. The method has been added, however, to the
evercookie library by \citet{kamkarEvercookieVirtuallyIrrevocable2010} (see
section~\ref{subsec:evercookie}) to add another layer of redundancy for storing
unique identifiers and respawning deleted ones. By performing static analysis on
a dataset provided by the \gls{HTTP} Archive project
\cite{soudersAnnouncingHTTPArchive2011}, \citet{belloroKnowWhatYou2018}
found that 1.34\% of the surveyed web sites use Web SQL Database in one of their
subresources. 53.59\% of Web SQL Database usage is considered to be coming from
known tracking domains. This ratio is lower for the first 10K web sites as
determined by Alexa (in May 2018): 2.12\% use Web SQL Database and 39.9\% of
those use it for tracking. These percentages show that Web SQL Database is not
used as a means to provide new functionality in most cases, but to increase user
tracking capabilities.
\section{Cache-based Tracking Methods}
\label{sec:cache-based tracking methods}
While the underlying principle of storing unique identifiers on the user agent's
computer remains the same, cache-based methods exploit a type of storage that is
normally used for data that is saved for short periods of time and most commonly
serves to improve performance. Whereas storage-based tracking methods (see
section~\ref{sec:storage-based tracking methods}) exploit storage interfaces
that are meant for persisting data to disk, caches store data that has been
generated by an operation and can be served faster on subsequent requests.
A variety of caches exist and they are utilized for different purposes, leading
to different forms of information exploitability for tracking users. This
section introduces methods which are generally less prevalent but more
sophisticated and can thus be much harder to circumvent or block.
\subsection{Web Cache}
\label{subsec:web cache}
Using the \gls{DOM} \gls{API}'s \texttt{Window.getComputedStyle()} method,
web sites were able to check a user's browsing history by utilizing the \gls{CSS}
\texttt{:visited} selector. Links can be colored depending on whether they have
already been visited or not. The colors can be set by the web site trying to
find out what the user's browsing history is. JavaScript would then be used to
generate links on the fly for web sites that are to be cross-checked against the
contents of the browsing history. After generating links, a script can check the
color, compare it with the color that has been set for visited and non-visited
web sites and see if a web site has already been visited or not.
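A minimal sketch of this (now mitigated) technique is shown below; the probed
\gls{URL} is an illustrative assumption, and a stylesheet on the page is assumed
to color visited links \texttt{rgb(255, 0, 0)}.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Create a link to the probed URL, read its computed color and compare it
// with the color defined for a:visited in the page's stylesheet.
function probablyVisited(url) {
  const link = document.createElement("a");
  link.href = url;
  document.body.appendChild(link);
  const color = window.getComputedStyle(link).color;
  document.body.removeChild(link);
  return color === "rgb(255, 0, 0)"; // matches the assumed a:visited rule
}

console.log(probablyVisited("https://example.com/"));
\end{minted}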
A solution to the problem has been proposed and subsequently implemented by
\citet{baronPreventingAttacksUser2010} in 2010, making
\texttt{getComputedStyle()} and similar functions lie about the state of the
visited links and marking them as unvisited. Another solution has been
developed by \citet{jacksonProtectingBrowserState2006} in form of a browser
extension that enforces the same-origin policy for browser histories as well.
Although their approach limits access to a user's browsing history by third
parties, first parties are unencumbered by the same-origin policy. Their
browser extension does, however, thwart the attack carried out by
\citet{jancWebBrowserHistory2010} where the authors were able to check for up
to 30,000 links per second.
\citet{wondracekPracticalAttackDeanonymize2010} demonstrate the severity of
history stealing attacks (e.g. visited link differentiation) on user privacy by
probing for \glspl{URL} that encode user information such as group membership
in social networks. By constructing a set of group memberships for each user,
the results can uniquely identify a person. Furthermore, information that is
not yet attributed to a single user but to a group as a whole can be used to
more accurately identify members of said group.
Another way of utilizing a web browser's cache to track users is to check
whether a web site asset (e.g., an image or script) has already been cached by
the user agent or not. If it has been cached, the web site knows that it has been
visited before; if it has not been cached (the asset is downloaded from the
server), the user agent is visiting for the first time. A further approach is to embed
identifiers in cached documents. An \gls{HTML} file can contain an identifier
which is stored in a \texttt{<div>} tag and is cached by the user agent. The
identifier can then be read from the cache on subsequent visits, even from third
party web sites.
\subsection{Cache Timing}
\label{subsec:cache timing}
Cache timing attacks \cite{feltenTimingAttacksWeb2000} are another form of
history stealing which enables an attacker to probe for already visited
\glspl{URL} by timing how long it takes a client to fetch a resource. Timing
attacks are most commonly used in cryptography to indirectly observe the
generation or usage of a cipher key by measuring CPU noise, frequencies, power
usage or other properties that allow conclusions to be drawn about the key. This
type of attack is referred to as a side-channel attack. Cache timing exploits
the fact that it takes time to load assets for a web site. It works by measuring
the time a client takes to access a specified resource. If the time is short,
the resource has most likely been served from the cache and has thus been
downloaded before, implying a visit to a web site which uses that resource. If
it takes longer than a cache hit would, on the other hand, the resource was not
in the cache before and has to be downloaded now, suggesting that no other web site
using that resource has been visited before. In practice an attack might look
like this (taken from \cite[p.~2]{feltenTimingAttacksWeb2000}):
\begin{enumerate}
\item Alice visits a web site from Bob called \texttt{bob.com}.
\item Bob wants to find out whether Alice visited Charlie's web site
\texttt{charlie.com} in the past.
\item Bob chooses a file from \texttt{charlie.com} which is regularly
downloaded by visitors to that site.
\item Bob implements a script or program that checks the time it takes
to load the file from \texttt{charlie.com} and embeds it in his
own site.
\item The program is loaded by Alice upon visiting and measures the time
needed to load the file from \texttt{charlie.com}.
\item If the measured time is below a certain threshold, the file has
probably been downloaded into the cache and Alice has therefore
visited \texttt{charlie.com} before.
\end{enumerate}
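A client-side sketch of the timing probe in steps 4 to 6 is given below; the
probed \gls{URL} and the threshold are illustrative assumptions and would have
to be calibrated against the client's network in practice.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Measure how long a resource from charlie.com takes to load and compare the
// elapsed time against a cache-hit threshold.
function timeResource(url) {
  return new Promise(function (resolve) {
    const start = performance.now();
    const img = new Image();
    img.onload = img.onerror = function () {
      resolve(performance.now() - start);
    };
    img.src = url; // served from the cache if it has been loaded before
  });
}

timeResource("https://charlie.com/logo.png").then(function (elapsed) {
  const CACHE_HIT_THRESHOLD_MS = 10; // assumed; must be calibrated per client
  console.log(elapsed < CACHE_HIT_THRESHOLD_MS
    ? "charlie.com has probably been visited before"
    : "probably a first-time load");
});
\end{minted}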
Bob can do this process for multiple resources and for every user that visits
his web site, collecting browser history information on all of them. Since
caches exist solely to boost performance and to avoid re-downloading content
that has already been fetched from servers before, timing attacks are very hard
to circumvent. Countermeasures
either cause a massive slowdown when browsing the web due to the ubiquity of
caches, or imply a substantial change in user agent design.
\citet{feltenTimingAttacksWeb2000} were the first to conduct a study on the
feasibility of cache timing attacks and concluded that accuracy in determining
whether a file has been loaded from cache or downloaded from a server is
generally very high ($>95$\%). Furthermore, they evaluated a host of
countermeasures such as turning off caching, altering hit or miss performance
and turning off Java and JavaScript, but concluded that they were either
unattractive or ineffective. They propose a partial remedy for cache timing by
introducing \emph{Domain Tagging} which requires that resources are tagged with
the domain they have initially been loaded from. Once another web site wants to
determine whether a user has visited a site before by cross-loading a resource,
the domain does not match the tagged domain on the resource. If that is the
case, the initial cache hit gets transformed into a cache miss and the resource
has to be downloaded again, fooling the attacker into believing that the origin
web site has not been visited before. It is necessary to mention that at the
time (2000) \glspl{CDN} were not as widely used as today. Since web sites rely
on \glspl{CDN} to cache resources that are used on multiple sites and can thus
be served much faster from cache, domain tagging would effectively nullify the
performance boost a \gls{CDN} provides by converting every cache hit into a
cache miss. The authors themselves question the effectiveness of such an
approach.
Because the attack presented by \citet{feltenTimingAttacksWeb2000} relies on
being able to accurately time resource loading, a reliable network is needed.
Today a sizeable portion of internet activity comes from mobile devices which
are often not connected via cable but wirelessly.
\citet{vangoethemClockStillTicking2015} have therefore proposed four new
methods to accurately time resource loading over unstable networks. By using
these improved methods, they managed to determine whether a user is a member of
a particular age group (in this case between 23 and 32). The authors also ran
their attacks against other social networks (LinkedIn, Twitter, Google and
Amazon), successfully extracting sensitive information on users. The research
discussed so far has not tackled the problem through a quantitative perspective
but instead focused on individual cases. Due to this missing piece,
\citet{sanchez-rolaBakingTimerPrivacyAnalysis2019} conducted a survey on 10K
web sites to determine how feasible it is to perform a history sniffing attack
on a large scale. Their tool \textsc{BakingTimer} collects timing information
on \gls{HTTP} requests, checking for logged in status and sensitive data. Their
results show that 71.07\% of the surveyed web sites are vulnerable to the
attack.
\subsection{Cache Control Directives}
\label{subsec:cache control directives}
Cache Control Directives can be supplied in the Cache-Control \gls{HTTP} header,
allowing rules about storing, updating and deletion of resources in the cache to
be defined. Cache Control Directives make heavy use of \emph{\glspl{ETag}}
\cite{fieldingHTTPETag} and \emph{Last-Modified \gls{HTTP} Headers}
\cite{fieldingHTTPLastModified} to determine whether a cached resource is stale
and needs to be updated. Commonly, a collision-resistant hash function is used
to generate a unique hash of a resource, which is sent along with the
resource in the first \gls{HTTP} response. The resource and the hash---which is
stored in the \gls{ETag} header---are then cached by the client. On subsequent
retrievals of the same \gls{URL}, the client checks for an expiration date on
the requested \gls{URL} via the Cache-Control and Expires headers. If the
\gls{URL} has expired, the client sends a request with the \emph{If-None-Match}
field set with the \gls{ETag}. The server then compares the \gls{ETag} received
by the client with the generated \gls{ETag} of the resource on the server side.
If the two values match (i.e., the resource has not changed), the server can
send back an \gls{HTTP} 304 Not-Modified status. Otherwise, the answer contains
a full \gls{HTTP} response with the modified resource and the newly generated
\gls{ETag}, which the client can cache again. Usage of \glspl{ETag} can
therefore improve performance and cache consistency while at the same time
reducing bandwidth usage.
As with most other tracking methods, unique identifiers can be stored inside
the \gls{ETag} header because it offers a storage capacity of 81864 bits. Once
the identifier has been placed in the \gls{ETag} header, the server can answer
requests to check for an updated resource always with an \gls{HTTP} 304
Not-Modified status, effectively persisting the unique identifier in the
client's cache. During their 2011 survey of QuantCast.com's top 100 U.S. based
web sites, \citet{ayensonFlashCookiesPrivacy2011} found \texttt{hulu.com} to be
using \glspl{ETag} as backup for tracking cookies that are set by
\texttt{KISSmetrics} (an analytics platform). This allowed cookies to be
respawned once they had been cleared by checking the \gls{ETag} header.
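A hedged server-side sketch of this behavior, written here in JavaScript for
Node.js purely for illustration, is given below; the port, the identifier format
and the cache lifetime are assumptions.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// New clients receive a fresh identifier inside the ETag header; returning
// clients echo it back via If-None-Match and are answered with 304 Not
// Modified, which keeps the cached copy (and the identifier) alive.
const http = require("http");
const crypto = require("crypto");

http.createServer((req, res) => {
  const seen = req.headers["if-none-match"];
  if (seen) {
    console.log("returning visitor:", seen);           // identifier read back
    res.writeHead(304, { ETag: seen });
    res.end();
  } else {
    const id = crypto.randomBytes(16).toString("hex"); // new identifier
    res.writeHead(200, {
      ETag: `"${id}"`,
      "Cache-Control": "max-age=31536000",             // keep it for a year
    });
    res.end("tracking pixel");
  }
}).listen(8080);
\end{minted}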
\subsection{DNS Cache}
\label{subsec:dns cache}
The \gls{DNS} is used every time a connection to a server has to be established.
Its main function is to map easily memorizable domain names to hard-to-remember
\gls{IP} addresses and vice versa. The hierarchical structure of the
system is particularly suited for caching as requests do not have to be passed
all the way to the top of the hierarchy but can possibly be answered by lower
level \gls{DNS} servers. Caching is especially beneficial when name resolution
is recursive, allowing servers to quickly respond to requests without having to
consult higher level ones. Since name resolution servers change less often the
closer they are to the root node, the probability of having a cache hit
increases with every step in the hierarchy. Caching also happens at the lower
levels of the hierarchy, directly at the client in multiple ways. The host
operating system has its own cache that applications can query for name
resolution. Some applications introduce another layer of caching by having their
own cache (e.g., browsers).
\citet{kleinDNSCacheBasedUser2019} demonstrate a tracking method which uses
\gls{DNS} caches to assign unique identifiers to client machines. In order for
the technique to work, the tracker has to have control over one web server (or
multiple) as well as an authoritative \gls{DNS} server which associates the web
servers with a domain name under the control of the tracker. The tracking
process starts once a user agent requests a web site which loads a script from
one of the web servers the attacker is controlling. The process can then be
sketched out as follows (see \cite[p.~5]{kleinDNSCacheBasedUser2019} for a
detailed description).
\begin{enumerate}
\item The snippet loads a resource from multiple domains (\texttt{1.ex.com},
\texttt{2.ex.com}, ...) the tracker controls.
\item The \gls{OS} forwards a \gls{DNS} request to the configured name
server.
\item The name server receives and caches multiple, randomly ordered
\gls{IP} addresses for the domain from the tracker-controlled \gls{DNS}
server. The results are forwarded to the \gls{OS}.
\item The \gls{OS} caches the result as well and the browser makes
\gls{HTTP} requests to (in most cases) the first \gls{IP} address.
\item The web server responsible for the \gls{IP} address responds with a
number. Other web servers would respond with a different number.
\item The script then constructs an identifier from the received values.
\end{enumerate}
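A simplified client-side sketch of steps 1 and 6 is given below; the domain
names follow the example above, while the response format and the way the
identifier is assembled are illustrative assumptions and depend on the server
and \gls{DNS} setup described in the remaining steps.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Fetch a small resource from several tracker-controlled subdomains and
// concatenate the per-server responses into an identifier.
const domains = ["1.ex.com", "2.ex.com", "3.ex.com"];

Promise.all(
  domains.map((d) => fetch("https://" + d + "/id").then((res) => res.text()))
).then((parts) => {
  const identifier = parts.join("-"); // e.g. "7-2-5", depends on the DNS cache state
  console.log("client identifier:", identifier);
});
\end{minted}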
Due to the random ordering of the received \gls{IP} addresses from the
authoritative \gls{DNS} server, the identifier that is assembled by the script
is unique and thus allows identification of not only the browser but the client
machine itself.
Advantages of this tracking method are that it works across browsers in most
cases. \citet{kleinDNSCacheBasedUser2019} found that it survives browser
restarts and is resistant to the privacy mode employed by modern browsers.
Furthermore, \glspl{VPN} do not affect the method and it works with different
protocols (\gls{HTTPS}, \gls{IPv6}, \gls{DNSSEC}).
Disadvantages are that the tracking identifiers do not survive a computer
restart. Additionally, switching networks causes the identifiers to become
obsolete, and identifiers generally only live as long as the \gls{TTL} limit
allows.
Due to these constraining factors, \gls{DNS} tracking is best combined with
methods that are intended for tracking over longer time periods such as cookies
for example.
\subsection{TLS Session Resumption}
\label{subsec:tls session resumption}
\gls{TLS} \cite{rescorlaTransportLayerSecurity2018} is widely used today to
securely encapsulate communication across the web. For the secured communication
to work, client and server first have to authenticate themselves and then agree
on protocol version, cipher suite and compression method. The exchange of this
information at the beginning of a connection is called a \emph{handshake}.
Figure~\ref{fig:tls-handshake} shows how the initial handshake is performed
after which both the client and the server are ready for sending and receiving
application data. For bandwidth savings and better performance, it is possible
to cache a \gls{TLS} session to allow reusing an already established secure
connection at a later point in time. Versions prior to \gls{TLS} 1.3 used two
mechanisms to accomplish this: \gls{TLS} session identifiers and session
tickets. Session identifiers are sent by the server along with the initial
handshake with the user agent. The identifier is randomly generated and saved by
the server so that the current session can be found later. To resume a session,
the user agent sends the identifier with the \emph{ClientHello} message to the
server. The server can then match the identifier to the previously initiated
session and responds with the same session identifier to signal to the user
agent that the session can be resumed. Session tickets are only issued by the
server when the client has expressed support for them. They are encrypted and
provided by the server after a successful handshake via an out-of-band message.
The ticket contains all the necessary information to reestablish a secure
connection. When the user agent wishes to resume a connection, the session
ticket is sent along with the first \emph{ClientHello} message and the server
can decrypt the ticket and resume the session.
\begin{figure}
\begin{center}
\includegraphics[width=0.75\textwidth]{figures/tls-handshake.png}
\caption{A \gls{TLS}-handshake between a client and a server. First, the
client sends a \emph{ClientHello} message to the server which the
server has to answer with a \emph{ServerHello} message or else the
connection fails. These two initial messages establish protocol
version, session ID, cipher suite and compression method
\cite[p.~44]{rescorlaTransportLayerSecurity2008}. The server also
checks for a session resumption. If the client sends a session ID
with the \emph{ClientHello} message, the server knows that it should
resume a previously established connection. The next three messages
are used for the key exchange which allows client and server to
authenticate themselves.}
\label{fig:tls-handshake}
\end{center}
\end{figure}
In \gls{TLS} version 1.3 \cite{rescorlaTransportLayerSecurity2018} the session
identifiers and tickets have been replaced with a \gls{PSK}. Instead of sending
a ticket which is not encapsulated in the \gls{TLS}-secured connection, a
\gls{PSK} identity is sent from the server after the initial handshake, usually
avoiding out-of-band communication. The \gls{PSK} identity provides a mechanism
by which information associated with a secure connection (certificates, keys)
can be restored.
Because resuming a connection reuses information that has been exchanged before
to establish secure communication, individual sessions can be linked together
to form a history of information exchanges. This tracking method is described
by \citet{syTrackingUsersWeb2018}. Even though \gls{TLS} session resumption can
be mitigated by restarting the browser because that clears the cache, the
authors argue that due to mobile devices being online without restarts for long
periods, the attack remains viable. Furthermore, despite browsers imposing
limits on the lifetime of session identifiers and \glspl{PSK}, it is possible
to maintain a session indefinitely by carrying out a \emph{prolongation
attack}. \citet{syTrackingUsersWeb2018} define a prolongation attack as an
attack where the client asks for a session resumption by sending the identifier
of a previously initiated connection and the server responds with a new
handshake instead of resuming the old one. This effectively resets the time
limit as long as the user is initiating new (or trying to resume old)
connections to the server within the imposed time limit.
The authors present an empirical evaluation of server and browser configurations
with respect to session resumption lifetime by crawling the top 1M web sites as
determined by Alexa. Their results indicate that only 4\% of those sites do not
allow session resumption at all, while the majority (78\%) allows session
identifiers as well as tickets.