\documentclass[../main.tex]{subfiles}
\externaldocument{defences}
\begin{document}
\chapter{Tracking Methods}
\label{chap:tracking methods}
This chapter will go into detail about various tracking methods that have been
used during the history of the web. It is important to note that some of those
approaches to tracking date back to when the World Wide Web was still in its
early development stages. Knowing where the techniques come from helps in
correctly judging the impact they had and still have on the Internet as we use
it today. Furthermore, knowledge about the past allows for better predictions of
future changes in the tracking ecosystem.
To aid in understanding how they work and where they fit in the tracking
landscape, three different categories are identified and presented:
session-based, storage-based and cache-based tracking methods. Each category
uses different mechanisms and technologies to enable tracking of users. What
most of them have in common is that they try to place unique identifiers in
various locations, from which they can be read on subsequent visits. Thus, a
chronological ordering of events enables interested parties to infer not only
usage statistics but also specific data about the entities behind those
identifiers.
\section{Session-based Tracking Methods}
\label{sec:session-based tracking methods}
One of the simplest and most widely used forms of tracking on the Internet
relies on sessions. Since \gls{HTTP} is a stateless protocol, web servers cannot by default keep
track of any previous client requests. In order to implement specific features
such as personalized advertising, some means to save current and recall previous
states must be used. For this functionality, sessions were introduced. Sessions
represent a temporary and interactive exchange of information between two
parties. Due to their temporary nature, they have to be `brought up' at some
point and `torn down' at a later point in time. It is not specified, however,
how long the period between establishing and terminating a session may be. It
could be only for a single browser session and terminated by the user manually,
or it could be for as long as a year.
\subsection{Passing Information in URLs}
\label{subsec:passing information in urls}
\glspl{URL} were first proposed by Berners-Lee in 1994
\cite{berners-leeUniformResourceLocators1994} and are based on \glspl{URI}
\cite{berners-leeUniversalResourceIdentifiers1994}. The latter specifies a way
to uniquely identify a particular resource. The former extends the \gls{URI}
specification to include where and how a particular resource can be found.
\glspl{URI} consist of multiple parts:
\begin{enumerate}
\item a scheme (in some cases a specific protocol),
\item an optional authority (network host or domain name),
\item a path (a specific location on that host),
\item an optional query and
\item an optional fragment preceded by a hash sign (a subresource pointing to
a specific location within the resource)
\end{enumerate}
To access a section called \texttt{introduction} in a blog post named
\texttt{blogpost} on a host with the domain name \texttt{example.com} over the
\gls{HTTP}, a user might use the following \gls{URI}:
\begin{verbatim}
http://example.com/blogpost/#introduction
\end{verbatim}
Even though \glspl{URI} and \glspl{URL} are two different things, they are
mostly used interchangeably today. Non-technical users in particular refer to an
address on the \gls{WWW} simply as a \gls{URL}.
The optional query component typically consists of multiple
\texttt{(key,value)} pairs, separated by delimiters such as \texttt{\&} and
\texttt{;}. In the tracking context, query parameters can be used to pass
information (e.g. unique identifiers) to the resource that is to be accessed by
appending a unique string to all the links within the downloaded page. Since
requests to pages are generally logged by the server, requesting multiple pages
with the same unique identifier leaves a trail behind that can be used to
compile a browsing history. Sharing information with other parties is not only
limited to unique identifiers. \gls{URL} parameters can also be used to pass the
referrer of a web page containing a query that has been submitted by the user.
\citeauthor{falahrastegarTrackingPersonalIdentifiers2016} demonstrate such an
example where an advertisement tracker logs a user's browsing history by storing
the referrer into a \texttt{(key,value)} pair
\cite[p.~37]{falahrastegarTrackingPersonalIdentifiers2016}. Other possibilities
include encoding geographical data, network properties, user information (e.g.,
e-mails) and authentication credentials.
\citeauthor{westMeasuringPrivacyDisclosures2014} conducted a survey on the
use of \gls{URL} query strings and found the practice to be widespread on the
web \cite{westMeasuringPrivacyDisclosures2014}.
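As an illustration, the following sketch shows how a script on a page might
decorate every link with a unique identifier so that subsequent requests carry
it along; the parameter name \texttt{uid} and the identifier value are
illustrative assumptions, not taken from a specific tracker:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of link decoration: append a unique identifier to every link
// on the page. The parameter name "uid" and the value are illustrative.
const uid = "1234";
document.querySelectorAll("a[href]").forEach((link) => {
  const url = new URL(link.href, document.baseURI);
  url.searchParams.set("uid", uid);
  link.href = url.toString();
});
\end{minted}
Every page requested through one of these decorated links now carries the
identifier in its query string, where it can be logged by the server.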
\subsection{Hidden Form Fields}
\label{subsec:hidden form fields}
The \gls{HTML} provides a specification for form elements, which allow users to
submit information (e.g., for authentication) to the server via POST or GET
methods. Normally, a user would input data into a form and on clicking
\emph{submit} the input would be sent to the server. Sometimes it is necessary
to include additional information that the user did not enter. For this reason
there exist \emph{hidden} form fields. Hidden fields are not rendered on the website
and therefore the user cannot enter any information into them. Similar to \gls{URL}
parameters, the value attribute of a hidden field contains additional
information, such as the user's preferred language. Since almost
anything can be sent in a value parameter, hidden form fields present another
way to maintain a session. A parameter containing a unique identifier will be
sent with the data the user has submitted to the server. The server can then
match the action the user took with the identifier. In case the server already
knows that specific identifier from a previous interaction with the user, the
gained information can now be added to the user's browsing profile. An example
of a hidden web form is given in Listing~\ref{lst:hidden web form}, which has
been adapted from \cite{InputFormInput}. In Line 15 a hidden field is
created and the \texttt{value} field is set by the server to contain a unique
user identifier. Once the \emph{submit} button has been clicked, the identifier
is sent to the server along with the data the user has filled in.
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{html}{code/hidden-web-form.html}
\caption{Example of an \gls{HTTP} form containing a hidden field with
\texttt{id=userId}. The id is set by the web server dynamically so that every
visitor has a unique identifier attached to the form.}
\label{lst:hidden web form}
\end{listing}
\subsection{HTTP Referer}
\label{subsec:http referer}
Providers of web services often want to know where visitors to their website
come from to understand more about their users and their browsing habits. The
\gls{HTTP} specification accounts for this by introducing the \emph{\gls{HTTP}
Referer field} [\emph{sic}] in the header. By checking the referrer, the server
can see where the request came from. In practice, a user clicks on a link on a
web page and the current web page is sent as a \gls{URL} in the \gls{HTTP}
Referer field. The header with the referrer information gets attached to the
\gls{HTTP} request which is sent to the server. The server responds with the
requested web page and can establish a link from the original web page to the
new web page. When applied to a majority of the requests on a site, the
resulting data can be analyzed for promotional and statistical purposes.
\citeauthor{malandrinoPrivacyAwarenessInformation2013} have shown that the
\gls{HTTP} Referer is one of the most critical vectors for leaking \gls{PII}
\cite{malandrinoPrivacyAwarenessInformation2013}; leakage of information
relating to users' health was identified as the most severe in terms of
the identifiability of users on the web.
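As a sketch, the request a browser issues after following a link from the
example page in section~\ref{subsec:passing information in urls} might carry
the following header; the target host is an illustrative assumption:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}
GET /landing.html HTTP/1.1
Host: shop.example.net
Referer: http://example.com/blogpost/
\end{minted}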
\subsection{Explicit Authentication}
\label{subsec:explicit authentication}
Explicit authentication requires a user to \emph{explicitly} log in or register
with the website. This way, specific resources are only available to users once
they have authenticated themselves to the service. Actions taken on an
authenticated user account are tied to that account and crafting a personal
profile is more or less a built-in function in this case. Since merely asking a
user to authenticate is a simple method, the extent to which it can be used is
limited. Logged-in users generally do not stay logged in across different browser
sessions, unless cookies are used to do so (see section~\ref{subsec:http
cookies}), therefore limiting tracking to one session at a time. Furthermore,
always requiring a logged-in state can be tiresome for users, because they
have to authenticate every time they visit a particular service. This can
potentially pose a usability problem where users simply stop using the service
or go to considerable lengths to avoid logging in. This largely depends on a
cost-benefit analysis the users subconsciously undertake \cite{}. The third
factor where this method is lacking concerns the users' awareness of being
tracked. Since tracking users depends on them actively logging in to the
service, tracking them without their knowledge is impossible. Even though most tracking
efforts are not detected by the average user \cite{}, it is known that actions
taken on an account are logged to provide better service through service
optimization and profile personalization.
Creating an account on a website to use its services to their full extent can
be beneficial in some cases. Facebook, for example, allows its users to
configure what they want to share with the public and with their friends. Research
has shown, however, that managing which posts get shown to whom is not as
straightforward as one might think.
\todo{Wrong chapter?} \citeauthor{liuAnalyzingFacebookPrivacy2011}
\cite{liuAnalyzingFacebookPrivacy2011} conducted a survey where they asked
Facebook users about their desired privacy and visibility settings and
cross-checked them with the actual settings they had used for their posts. The
results showed that in only 37\% of cases the users' expectations matched
reality. Additionally, 36\% of content is left on the default privacy settings,
which set the visibility of posts to public, meaning that any Facebook user can
view them.
\subsection{window.name DOM Property}
\label{subsec:window.name dom property}
The \gls{DOM} is a platform- and language-agnostic \gls{API} which defines the
logical structure of web documents (i.e., \gls{HTML}, \gls{XHTML} and \gls{XML})
and the way they are accessed and manipulated. The \gls{DOM} was originally
introduced by Netscape at the same time as JavaScript as the \gls{DOM} Level 0.
The first recommendation (\gls{DOM} Level 1) was released in 1998 by the
\gls{W3C} \gls{DOM} working group \cite{w3cDocumentObjectModel1998} which
published its final recommendation (\gls{DOM} Level 3) in 2004. Since then, the
\gls{WHATWG} has taken over and in 2015 published the \gls{DOM} Level 4 standard
\cite{whatwgDOMLivingStandard2020}, which replaces the Level 3 specification. The
\gls{DOM} works by organizing all objects in a document in a tree structure, which allows
individual parts to be altered when a specific event occurs (e.g., user
interaction). Furthermore, each object has properties which apply either to
all \gls{HTML} elements or only to a subset of them.
One useful property for tracking purposes is the \texttt{window.name} property.
Its original intention was to allow client-side JavaScript to get or set the
name of the current window. Since windows do not have to have names, the
\texttt{window.name} property is mostly used for setting targets for hyperlinks and
forms. Modern browsers allow storing up to two megabytes of data in the
\texttt{window.name} property, which makes it a viable option for use as data
storage or---more specifically---for maintaining session variables. In order to
store multiple variables in the \texttt{window.name} property, the values first
have to be serialized in some way because only a single string can be stored. A
\gls{JSON} serializer such as JavaScript's \texttt{JSON.stringify()} converts
JavaScript objects into a single \gls{JSON} string, which is then ready to be
stored in the \gls{DOM} property. Normally the same-origin policy prohibits
scripts from accessing data belonging to another domain, but the
\texttt{window.name} property is accessible from other domains and persists
across page reloads. Maintaining a session across domains and without cookies
is therefore possible, and multiple implementations exist
\cite{frankSessionVariablesCookies2008,zypWindowNameTransport2008}.
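The following sketch demonstrates the technique; the stored keys and values are
illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of using window.name as cross-page session storage.
// The keys and values are illustrative.
const session = { userID: "1234", lastSeen: Date.now() };
// Serialize the session object into the single string window.name holds.
window.name = JSON.stringify(session);
// After a page reload, or on a page under a different domain opened in
// the same tab, the value can be read back.
const restored = JSON.parse(window.name || "{}");
console.log(restored.userID); // "1234"
\end{minted}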
\section{Storage-based Tracking Methods}
\label{sec:storage-based tracking methods}
Storage-based tracking methods differ from session-based tracking methods
in that they try to store information on the client's computer not only for
single sessions but for as long as desired. The following methods can be used to
store session data as well but are not limited to that use case. They generally
enable more advanced tracking approaches because they have access to information about the
current browser instance and the operating system the browser is running on. Due
to their nature of residing on the user's computer, they are in most cases
harder to circumvent, especially when two or more methods are combined resulting
in better resilience against simple defences.
\subsection{HTTP Cookies}
\label{subsec:http cookies}
The method most often associated with tracking on the Internet is
tracking with \gls{HTTP} cookies. Cookies are small pieces of data that are placed in the
browser's storage on the user's computer. They are limited to four kilobytes in
size and are generally used to identify and authenticate users and to store
website preferences. They were introduced to the web to allow stateful
information to be stored because the \gls{HTTP} is a stateless protocol and
therefore does not have this capability. It is also a way of reducing the
server's load by not having to recompute states every time a user visits a
website. Shopping cart functionality for example can thus be implemented by
setting a cookie in the user's browser, saving the items which are currently
added to the shopping cart and giving the user the possibility to resume
shopping at a later point provided that they do not delete their cookies. With
the introduction of cookies, advertising companies could reidentify users by
placing unique identifiers in the browser and reading them on subsequent visits.
The first standard for cookies was published in 1997
\cite{kristolHTTPStateManagement1997} and has since been updated multiple times
\cite{kristolHTTPStateManagement2000,barthHTTPStateManagement2011}.
Cookies can be divided into two categories: first party cookies, which are
created by the domain the user has requested, and third party cookies, which are
placed in the user's browser by other domains that are generally not under the
control of the first party. Whereas first party cookies are commonly used not
for tracking but, for example, for the aforementioned shopping cart functionality
or for enabling e-commerce applications to function properly, third party cookies are
popular with data brokerage firms (e.g., Datalogix, Experian, Equifax), online
advertisers (e.g., DoubleClick) and---belonging to both of these categories in
some cases---social media platforms (e.g., Facebook). The distinction between
these two categories is not always clear, however. Google Analytics, for example,
is considered to be a third party but offers its analytics services by setting
a first party cookie in the user's browser in addition to loading JavaScript
snippets from its servers. Therefore, categorizing cookies into those that
serve third party web content and those that serve first party web content
presents a more adequate approach.
Cookies are set either by calling scripts that are embedded in a web page (e.g.,
Google's \texttt{analytics.js}) or by using the \gls{HTTP} Set-Cookie response
header. Once a request to a web server has been issued, the server can set a
cookie in the Set-Cookie header and send the response back to the client. On
the client's side the cookie is stored by the browser and sent with subsequent
requests to the same domain via the Cookie \gls{HTTP} header. An example of a
cookie header is given in Listing~\ref{lst:session cookie header}. Because this
example does not set an expiration date for the cookie, it sets a session
cookie. Session cookies are limited to the current session and are deleted as
soon as the session is `torn down'. By adding an expiration date (demonstrated
in Listing~\ref{lst:permanent cookie header}) or a maximum age, the cookie
becomes permanent. Additionally, the domain attribute can be specified, meaning
that cookies which list a different domain than the origin are rejected by the
user agent \cite[Section 4.1.2.3]{barthHTTPStateManagement2011}. The same-origin
policy applies to cookies, disallowing access by other domains.
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/session-cookie-header}
\caption{An example of an \gls{HTTP} header setting a session cookie.}
\label{lst:session cookie header}
\end{listing}
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/permanent-cookie-header}
\caption{An example of an \gls{HTTP} header setting a permanent cookie.}
\label{lst:permanent cookie header}
\end{listing}
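Scripts embedded in a page can set such cookies directly via the
\texttt{document.cookie} interface. The following is a minimal sketch; the
cookie name, value and lifetime are illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of setting a persistent identifier cookie from JavaScript.
// The name "userID", the value and the lifetime are illustrative.
const oneYear = 60 * 60 * 24 * 365; // in seconds
document.cookie = "userID=1234; max-age=" + oneYear + "; path=/";
// On a later visit, the identifier can be read back from document.cookie,
// which returns all cookies of the current document as a single string.
const match = document.cookie.match(/(?:^|;\s*)userID=([^;]*)/);
const userID = match ? match[1] : null;
\end{minted}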
Distinguishing tracking and non-tracking cookies can be done with high accuracy
by observing their expiration time and the length of the value field.
\citeauthor{liTrackAdvisorTakingBack2015} \cite{liTrackAdvisorTakingBack2015}
demonstrate a supervised learning approach to detecting tracking cookies with
their tool \emph{TrackAdvisor}. They found that tracking cookies generally have
a longer expiration time than non-tracking cookies and need a
sufficiently long value field to carry the unique identifier. Using this method,
they found that only 10\% of tracking cookies have a lifetime of a single day or
less while 80\% of non-tracking cookies expire before a day is over.
Additionally, a length of more than 35 characters in the value field applies to
80\% of tracking cookies and a value field of less than 35 characters applies to
80\% of non-tracking cookies. \emph{Cookie Chunking}, where a long cookie
is split into multiple shorter cookies, did not appear to
affect detection by their method negatively. They also present a site
measurement of the Alexa Top 10,000 websites, finding that 46\% of websites use
third party tracking. More recent research
\cite{gonzalezCookieRecipeUntangling2017} has shown that tracking cookies do not
have to be long-lived to accumulate data about users. Some cookies---like the
\texttt{\_\_utma} cookie from Google Analytics, for example---save a timestamp of
the current visit together with the unique identifier, thereby allowing
short-lived cookies to be used in series to complete the
whole picture. \citeauthor{gonzalezCookieRecipeUntangling2017}
\cite{gonzalezCookieRecipeUntangling2017} have also found 20\% of observed
cookies to be \gls{URL} or base64 encoded, making decoding of cookies a
necessary step for analysis. Furthermore---and contrary to previous work---
cookie values come in far more varieties than is assumed by approaches
that only try to detect tracking cookies by their expiration date and/or character
length. They also presented an entity-based matching algorithm to dissect
cookies which contain more than a unique identifier. This allows for a better
understanding and interpretation of complex cookies as they are found in
advertising networks with large reach (e.g., doubleclick.net). This
information is particularly useful for building applications that effectively
detect and block cookies (see chapter~\ref{chap:defences against tracking}).
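A simplified sketch of a classifier based on the two indicators reported above
(expiration time and value length) could look as follows. TrackAdvisor itself
uses a trained model, so this hand-written approximation is illustrative only:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Simplified sketch of the two reported indicators: an expiration time
// of more than one day and a value longer than 35 characters.
// `cookie` is assumed to be { value: string, expires: number /* ms */ }.
function looksLikeTrackingCookie(cookie) {
  const oneDayMs = 24 * 60 * 60 * 1000;
  const longLived = cookie.expires - Date.now() > oneDayMs;
  const longValue = cookie.value.length > 35;
  return longLived && longValue;
}
\end{minted}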
\subsection{Flash Cookies and Java JNLP PersistenceService}
\label{subsec:flash cookies and java jnlp persistenceservice}
Flash cookies are similar to \gls{HTTP} cookies in that they too are a store of
information that helps websites and servers recognize returning users.
They are referred to as \glspl{LSO} by Adobe and are part of the Adobe Flash
Player runtime. Instead of storing data in the browser's storage, they have
their own storage in a different location on the user's computer. Another
difference is that they can store not only 4 but 100 kilobytes of data,
and they also have no expiration dates by default (\gls{HTTP} cookies live until
the end of the session unless specified otherwise). Since Flash cookies are not
created by means the browser normally supports (i.e., \gls{HTTP}, \gls{CSS})
but by Adobe's Flash Player runtime, browsers do not manage them.
This means that, due to Flash cookies not being tied to a specific browser, they
function across browsers. This capability makes them an interesting target for
trackers to store their identifying information in, because browsers initially
did not support removing Flash cookies out of the box, and one had to
manually set preferences in the \emph{Web Storage Settings panel} provided by
the Flash Player runtime to get rid of them. Trackers were searching for a new
way to store identifiers because users became increasingly aware of the dangers
posed by \gls{HTTP} cookies and reacted by taking countermeasures.
\citeauthor{soltaniFlashCookiesPrivacy2009}
\cite{soltaniFlashCookiesPrivacy2009} were the first to report on the usage of
Flash cookies by advertisers and popular websites. While surveying the top 100
websites at the time, they found that 54\% of them used Flash cookies. Some
websites were setting Flash cookies as well as \gls{HTTP} cookies with the same
values, suggesting that Flash cookies serve as backup to \gls{HTTP} cookies.
Several websites were found using Flash cookies to respawn already deleted
\gls{HTTP} cookies, even across domains. \citeauthor{acarWebNeverForgets2014}
\cite{acarWebNeverForgets2014} automated the detection of Flash cookies and access to
them by monitoring file access with the GNU/Linux \emph{strace} tool
\cite{michaelStraceLinuxManual2020}. This allowed them to acquire data about
Flash cookies respawning \gls{HTTP} cookies. Their results show that six of the
top 100 sites use Flash cookies for respawning.
Even though Flash usage has declined during the last few years thanks to the
development of the HTML5 standard, \citeauthor{buhovFLASH20thCentury2018}
\cite{buhovFLASH20thCentury2018} have shown that despite major security flaws,
Flash content is still served by 7.5\% of the top one million websites (2017).
The W3Techs Web Technology Survey shows a similar trend and also offers an
up-to-date measurement of 2.7\% of the top ten million websites for the year
2020 \cite{w3techsHistoricalYearlyTrends2020}. Due to the security concerns in
using Flash, Google's popular video sharing platform YouTube switched by default
to the HTML5 \texttt{<video>} tag in January of 2015
\cite{youtubeengineeringYouTubeNowDefaults2015}. In 2017 Adobe announced that it
would end-of-life Flash at the end of 2020, stopping updates and distribution
\cite{adobecorporatecommunicationsFlashFutureInteractive2017}. Consequently,
Chrome 76 and Firefox 69 disabled Flash by default and will drop support
entirely in 2020.
Similarly to Flash, Java also provides a way of storing data locally on the
user's computer via the PersistenceService \gls{API}
\cite{PersistenceServiceJNLPAPI2015}. It is used by the evercookie library
(section~\ref{subsec:evercookie}) to store values for cookie respawning by
injecting a Java applet into the \gls{DOM} of a page
\cite{baumanEvercookieApplet2013}.
\subsection{Evercookie}
\label{subsec:evercookie}
Evercookie is JavaScript code that can be embedded in websites and allows
information to be stored permanently on the user's computer. When activated,
information is stored not only in standard \gls{HTTP} cookies but also in
various other places, providing redundancy where possible. A full list of
locations used by Evercookie can be found on the project's GitHub page
\cite{kamkarSamykEvercookie2020}. In case the user wants to get rid of all
information stored while visiting a website that uses evercookies, every location
has to be cleared, because if even one remains, all the other copies are restored.
The cookie deletion mechanisms that are provided by browsers by default do not
clear all locations where evercookies are stored, which makes evercookie almost
impossible to avoid. Evercookie is open source, and quietly implementing or using
it is therefore not easy to do. Additionally, the
project's GitHub page reports that it might cause severe performance issues in browsers.
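Using the library is straightforward. Following the interface documented on the
project's page, a value is set once and can later be retrieved from whichever
storage locations survived deletion; the cookie name and value here are
illustrative:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of basic evercookie usage, following the project's documented
// interface. The cookie name "id" and the value are illustrative.
var ec = new evercookie();
// Store the identifier in all supported storage locations at once.
ec.set("id", "1234");
// Retrieval is asynchronous because some storage mechanisms
// (e.g., the cache-based ones) require round trips to complete.
ec.get("id", function (value) {
  console.log("recovered identifier: " + value);
});
\end{minted}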
Evercookie has been proposed and implemented by
\citeauthor{kamkarEvercookieVirtuallyIrrevocable2010} in
\cite{kamkarEvercookieVirtuallyIrrevocable2010}. Multiple surveys have tried to
quantify the use of evercookie in the wild.
\citeauthor{acarWebNeverForgets2014} provide a heuristic for detecting
evercookies stored on the user's computer \cite{acarWebNeverForgets2014} and
analyze evercookie usage in conjunction with cookie respawning.
\subsection{Cookie Synchronization}
\label{subsec:cookie synchronization}
When trackers use cookies to store unique identifiers, every
tracker assigns a different identifier to the same user, because the
same-origin policy prevents them from reading each other's cookies. Because of this,
sharing data between multiple trackers is difficult, since there are no easy
ways to accurately match an accumulated profile history of one identifier to
another. This problem has been solved by modern trackers by using a mechanism
called Cookie Synchronization or Cookie Matching. This technique allows multiple
trackers to open an information sharing channel between each other without
necessarily having to know the website the user visits.
\begin{figure}[ht]
\centering
\includegraphics[width=1\textwidth]{cookiesyncing}
\caption{Cookie Synchronization in practice between two trackers
\emph{cloudflare.com} and \emph{google.com}.}
\label{fig:cookie synchronization}
\end{figure}
An example of how Cookie Synchronization works in practice is given in
Figure~\ref{fig:cookie synchronization}. The two parties that are interested in
tracking the user are called \emph{cloudflare.com} and \emph{google.com} in this
example. The user they want to track is called \emph{browser}. \emph{Browser}
first visits \emph{website1.com} which loads JavaScript from
\emph{cloudflare.com}. \emph{Cloudflare.com} sets a cookie in the browser with a
tracking identifier called \emph{userID = 1234}. Next, \emph{browser} visits
another website called \emph{website2.com} which loads an advertisement banner
from \emph{google.com}. \emph{Google.com} also sets a cookie with the tracking
identifier \emph{userID = ABCD}. \emph{Browser} has now two cookies from two
different providers, each of them knowing the user under a different identifier.
When \emph{browser} visits a third website called \emph{website3.com}, which
triggers a request to \emph{cloudflare.com}, \emph{cloudflare.com} recognizes
the user by the identifier \emph{userID = 1234} and sends an \gls{HTTP}
redirect, redirecting \emph{browser} to \emph{google.com}. The redirect also
contains an \gls{HTTP} Query String (see section~\ref{subsec:passing information
in urls}) which adds a query like \emph{?userID=1234\&publisher=website3.com}.
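Such a redirect response might look like the following sketch (status line and
header values are illustrative):
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}
HTTP/1.1 302 Found
Location: http://google.com/index.html?userID=1234&publisher=website3.com
\end{minted}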
The complete GET request to \emph{google.com} might look like this:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}
GET /index.html?userID=1234&publisher=website3.com HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Host: google.com
Cookie: userID=ABCD
\end{minted}
\emph{Google.com} therefore not only knows that the user with the identifier
\emph{userID=ABCD} visited \emph{website3.com} but also that \emph{browser} is
the same user as \emph{userID=1234}. Since the identifiers can now be traced
back to the same person, the different cookies have been synchronized, allowing
the two trackers to exchange information about the user without their
knowledge.
Cookie Synchronization has seen widespread adoption especially in \gls{RTB}
based auctions \cite{olejnikSellingPrivacyAuction2014}.
\citeauthor{papadopoulosCookieSynchronizationEverything2019}
\cite{papadopoulosCookieSynchronizationEverything2019} recorded and analyzed the
browsing habits of 850 users over a time period of one year and found that 97\%
of users with regular browsing activity were exposed to Cookie Synchronization
at least once. Furthermore, they found that ``[...] the average user receives
around 1 synchronization per 68 requests''
\cite[p.~7]{papadopoulosCookieSynchronizationEverything2019}. In
\cite{englehardtOnlineTracking1MillionSite2016} the authors crawl the top
100,000 sites and find that 45 of the top 50 (90\%) third parties and 460 of the
top 1,000 (46\%) use Cookie Synchronization with at least one other party,
with \emph{doubleclick.net} at the top, sharing 108 cookies with 118 other third
parties. \citeauthor{papadopoulosExclusiveHowSynced2018} show in
\cite{papadopoulosExclusiveHowSynced2018} the threat that Cookie Synchronization
poses to encrypted \gls{TLS} sessions by performing the cookie-syncing over
unencrypted \gls{HTTP} even though the original request to the website was
encrypted. This highlights the serious privacy implications for users of
\gls{VPN} services trying to safeguard their traffic from a potentially
malicious \gls{ISP}.
\subsection{Silverlight Isolated Storage}
\label{subsec:silverlight isolated storage}
Silverlight Isolated Storage can also be used for storing data for tracking
purposes on the user's computer. It has been compared to Adobe's Flash
technology, as it too requires a plugin (in this case from Microsoft) to function. Available for
storage are 100 kilobytes, which is the same amount Flash cookies can store.
Silverlight does not work in the private browsing mode and can only be cleaned
manually by deleting a hidden directory in the filesystem or by changing
settings in the Silverlight application. Silverlight's Isolated Storage is one
of the methods evercookie (section~\ref{subsec:evercookie}) uses to make
permanent deletion of cookies hard to do and to facilitate cookie respawning.
Usage of Silverlight has seen a steady decline since 2011 even though it has
been used by popular video streaming websites such as Netflix
\cite{NetflixBeginsRollOut2010} and Amazon. Microsoft did not include
Silverlight support in Windows 8 and announced in a blog post that support would end in
October 2021 \cite{SilverlightEndSupport2015}. Usage of Silverlight currently
hovers around 0.04\% for the top 10 million websites
\cite{w3techsUsageStatisticsSilverlight2020}.
\subsection{HTML5 Web Storage}
\label{subsec:html5 web storage}
HTML5 Web Storage comes in three different forms: HTML5 Global Storage, HTML5
Local Storage and HTML5 Session Storage. It is part of the HTML specification
\cite{whatwgHTMLStandard2020} and provides means for storing name-value pairs on
the user's computer. HTML5 Web Storage works similarly to cookies but enables
developers to manage transactions that are done by the user simultaneously but
in two different windows. Whereas with cookies the transaction can accidentally
be recorded twice, HTML5 Web Storage allows multiple windows to access the same
storage on the user's computer, thereby avoiding this problem. In contrast to
cookies, which are sent to the server every time a request is made, HTML5 Storage
contents do not get sent to the web server. By default the storage limit is
configured to be 5 megabytes per origin \cite{whatwgHTMLStandard2020a}. Even
though this was only a recommendation by the standard, all modern browsers
adhere to it. More space can be allocated upon asking the user for permission to
do so.
Global Storage was part of an initial HTML5 draft and is accessible across
applications. Due to it violating the same-origin policy, most major browsers
have not implemented Global Storage.
Local Storage does, however, obey the same-origin policy by only allowing the
originating domain access to its name-value pairs. Every website has its own
separate storage area, which maintains a clear separation of concerns. Local
Storage lends itself to different use cases. In particular, applications that
should function even when no internet connection exists can use Local Storage to
enable that functionality. Coupled with file-based \glspl{API}, which are
generally not limited in storage (except by the available disk space), offering
the full range of features in offline-mode is feasible.
The third form, HTML5 Session Storage, is similar to Local Storage but
requires that the stored data be deleted after a session is closed. While
content that is persisted by Local Storage must be deleted explicitly by the
user, Session Storage has the intended function of providing non-persistent
storage.
HTML5 Web Storage can be used for tracking in the same way that cookies are
used: by storing unique identifiers which are read on subsequent visits.
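A minimal sketch of this pattern using the standard \texttt{localStorage}
interface follows; the key name and the way the identifier is generated are
illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of identifier-based tracking via HTML5 Local Storage.
// The key "userID" and the identifier generation are illustrative.
let userID = localStorage.getItem("userID");
if (userID === null) {
  // First visit: generate and persist a new identifier.
  userID = Math.random().toString(36).slice(2);
  localStorage.setItem("userID", userID);
}
// userID now survives browser restarts until the storage is cleared.
\end{minted}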
\citeauthor{ayensonFlashCookiesPrivacy2011}
\cite{ayensonFlashCookiesPrivacy2011} found that 17 of the top 100 web sites
used HTML5 Web Storage, with some of them using it for cookie respawning (see
section~\ref{subsec:evercookie}). A recent survey by
\citeauthor{belloroKnowWhatYou2018} \cite{belloroKnowWhatYou2018} looks at Web
Storage usage in general and finds that 83.09\% of the top 10K Alexa web sites
use it. The authors flag 63.88\% of those usages as coming from known
tracking domains.
\subsection{HTML5 Indexed Database API}
\label{subsec:html5 indexed database api}
The need for client-side storage to provide performant web applications that can
also function offline has prompted the inception of alternative methods to
store and retrieve information. Consequently, the development of the HTML5
standard has tried to fill that need by introducing HTML5 Web Storage and the
HTML5 Indexed Database \gls{API}.
The HTML5 Indexed Database \gls{API} provides an interface for storing values and
hierarchical objects using the well-known key-value pair storage principle
\cite{alabbasIndexedDatabaseAPI2020}. This property makes it similar to NoSQL
storage solutions which have seen increasing adoption rates on the web. It is
the successor to the abandoned Web SQL Database (see section~\ref{subsec:web
sql database}) standard and functions similarly to the HTML5 Web Storage,
meaning that it has the same storage limits and privacy implications and has to
obey the same-origin policy. In contrast to HTML5 Web Storage, IndexedDB is
intended for storing larger amounts of data and provides additional functions
such as in-order key retrieval. Reading from and writing to an IndexedDB is done
with JavaScript by opening a connection to the database, preparing a transaction
and committing it. The development of the standard is ongoing, with two editions
already published and recommended by the W3C and the third edition existing as
an editor's draft until it is ready for recommendation.
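The following sketch stores an identifier in an IndexedDB; the database, object
store and key names are illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of storing an identifier in IndexedDB. Database, store and
// key names are illustrative.
const request = indexedDB.open("tracker", 1);
// Create the object store when the database is first opened.
request.onupgradeneeded = () => {
  request.result.createObjectStore("ids");
};
request.onsuccess = () => {
  const db = request.result;
  const tx = db.transaction("ids", "readwrite");
  tx.objectStore("ids").put("1234", "userID");
  tx.oncomplete = () => db.close();
};
\end{minted}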
HTML5 IndexedDB has been added to the evercookie library (see
section~\ref{subsec:evercookie}) by
\citeauthor{kamkarEvercookieVirtuallyIrrevocable2010}, providing redundancy for
\gls{HTTP} cookies. \citeauthor{acarWebNeverForgets2014}
\cite{acarWebNeverForgets2014} have shown that only 20 of 100,000 surveyed sites
use the IndexedDB storage vector with one of them (\texttt{weibo.com}) using it
for respawning \gls{HTTP} cookies. A more recent study by
\citeauthor{belloroKnowWhatYou2018} \cite{belloroKnowWhatYou2018} paints a
different picture: On a dataset provided by the \gls{HTTP} Archive project
\cite{soudersAnnouncingHTTPArchive2011}, they found that 5.56\% of observed
sites use IndexedDB. Of those that use IndexedDB, 31.87\% of usages appear to be
coming from domains that are flagged as `trackers'.
\subsection{Web SQL Database}
\label{subsec:web sql database}
Web SQL Database \cite{hicksonWebSQLDatabase2010} was initially developed to
provide an \gls{API} for storing data in databases. The stored data can then be
queried using \gls{SQL} or variants thereof. The W3C stopped the development of
the standard in 2010 due to a lack of backend implementations other than
SQLite, as multiple independent implementations are necessary for a recommendation as a standard. Browsers have
turned to HTML5 IndexedDB (see section~\ref{subsec:html5 indexed database api}),
the ``spiritual successor'' to Web SQL Database, for web database storage.
Despite the W3C deprecating Web SQL Database, some browsers such as Chrome,
Safari and Opera still support it and have no plans of discontinuing it.
In the same way that other tracking technologies can maintain a history of web
site visits and actions, Web SQL Database can store identifying information via
the usage of unique identifiers. An arbitrary maximum size of 5 megabytes of
storage per origin is recommended by the standard, with the possibility to ask
the user for more capacity. This limit includes other domains which are
affiliated with the origin but have a different name (e.g. subdomains).
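A sketch of storing an identifier through the deprecated interface is shown
below; the database and table names are illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of storing an identifier via the deprecated Web SQL API.
// Database, table and key names are illustrative.
const db = openDatabase("tracker", "1.0", "tracking store", 2 * 1024 * 1024);
db.transaction((tx) => {
  tx.executeSql("CREATE TABLE IF NOT EXISTS ids (key TEXT, value TEXT)");
  tx.executeSql("INSERT INTO ids (key, value) VALUES (?, ?)",
                ["userID", "1234"]);
});
\end{minted}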
Due to the W3C abandoning the Web SQL Database standard, not many reports on
usage for tracking purposes exist. The method has been added, however, to the
evercookie library by \citeauthor{kamkarEvercookieVirtuallyIrrevocable2010} (see
section~\ref{subsec:evercookie}) to add another layer of redundancy for storing
unique identifiers and respawning deleted ones. By performing static analysis on
a dataset provided by the \gls{HTTP} Archive project
\cite{soudersAnnouncingHTTPArchive2011}, \citeauthor{belloroKnowWhatYou2018}
found that 1.34\% of the surveyed websites use Web SQL Database in one of their
subresources. 53.59\% of Web SQL Database usages are considered to be coming from
known tracking domains. This ratio is lower for the first 10K web sites as
determined by Alexa (in May 2018): 2.12\% use Web SQL Database and 39.9\% of
those use it for tracking. These percentages show that Web SQL Database is not
used as a means to provide new functionality in most cases, but to increase user
tracking capabilities.
\section{Cache-based Tracking Methods}
\label{sec:cache-based tracking methods}
While the underlying principle of storing unique identifiers on the user agent's
computer remains the same, cache-based methods exploit a type of storage that is
normally used for data that is saved for short periods of time and most commonly
serves to improve performance. Whereas storage-based tracking methods (see
section~\ref{sec:storage-based tracking methods}) exploit storage interfaces
that are meant for persisting data to disk, caches store data that has been
generated by an operation and can be served faster on subsequent requests.
A variety of caches exist, and they are utilized for different purposes, leading
to different ways of exploiting the contained information for tracking users. This
section introduces methods which are in most cases less prevalent but more
sophisticated and can thus be much harder to circumvent or block.
\todo{Insert structure}
\subsection{Web Cache}
\label{subsec:web cache}
Using the \gls{DOM} \gls{API}'s \texttt{Window.getComputedStyle()} method,
websites were able to check a user's browsing history by utilizing the \gls{CSS}
\texttt{:visited} selector. Links can be coloured depending on whether they have
already been visited or not. The colours can be set by the website trying to
find out what the user's browsing history is. JavaScript would then be used to
generate links on the fly for websites that are to be cross-checked against the
contents of the browsing history. After generating the links, a script can check each
colour, compare it with the colours that have been set for visited and non-visited
websites and determine whether a website has already been visited or not.
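A sketch of the historical attack is shown below; since the mitigation
described next, \texttt{getComputedStyle()} lies about the visited state, so
this no longer works in modern browsers (the probed \gls{URL} and the colour
are illustrative):
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{js}
// Sketch of the historical :visited history-sniffing attack. Relies on
// CSS such as: a:visited { color: rgb(255, 0, 0); }
// Modern browsers lie about the computed style of links, so this no
// longer works; the probed URL and the colour are illustrative.
function probablyVisited(url) {
  const link = document.createElement("a");
  link.href = url;
  document.body.appendChild(link);
  const color = getComputedStyle(link).color;
  document.body.removeChild(link);
  return color === "rgb(255, 0, 0)";
}
console.log(probablyVisited("https://example.com/"));
\end{minted}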
A solution to the problem has been proposed and subsequently implemented by
\citeauthor{baronPreventingAttacksUser2010}
\cite{baronPreventingAttacksUser2010} in 2010, making
\texttt{getComputedStyle()} and similar functions lie about the state of the
visited links and marking them as unvisited. Another solution has been developed
by \citeauthor{jacksonProtectingBrowserState2006}
\cite{jacksonProtectingBrowserState2006} in the form of a browser extension that
enforces the same-origin policy for browser histories as well. Although their
approach limits access to a user's browsing history by third parties, first
parties are unencumbered by the same-origin policy. Their browser extension
does, however, thwart the attack carried out by
\citeauthor{jancWebBrowserHistory2010} in \cite{jancWebBrowserHistory2010} where
the authors were able to check up to 30,000 links per second.
\citeauthor{wondracekPracticalAttackDeanonymize2010}
\cite{wondracekPracticalAttackDeanonymize2010} demonstrate the severity of
history stealing attacks (e.g. visited link differentiation) on user privacy by
probing for \glspl{URL} that encode user information such as group membership in
social networks. By constructing a set of group memberships for each user, the
results can uniquely identify a person. Furthermore, information that is not yet
attributed to a single user but to a group as a whole can be used to more
accurately identify members of said group.
Another way of utilizing a web browser's cache to track users is to check
whether a website asset (e.g., an image or script) has already been cached by
the user agent or not. If it has been cached, the website knows that it has been
visited before, and if it has not been cached (the asset is downloaded from the
server), the user agent is visiting for the first time. Yet another way is to embed
identifiers in cached documents. An \gls{HTML} file can contain an identifier
which is stored in a \texttt{<div>} tag and is cached by the user agent. The
identifier can then be read from the cache on subsequent visits, even from third
party websites.
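As a sketch, such a cached document might embed and read back the identifier as
follows; the element id and the identifier value are illustrative assumptions:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{html}
<!-- Served once with long-lived cache headers; the browser keeps
     returning this copy, identifier included, on later visits. -->
<div id="cache-id" style="display: none">1234</div>
<script>
  // Any page that includes this cached document (e.g., in an iframe)
  // can read the identifier back.
  var userID = document.getElementById("cache-id").textContent;
</script>
\end{minted}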
\subsection{Cache Timing}
\label{subsec:cache timing}
\subsection{Cache Control Directives}
\label{subsec:cache control directives}
\subsection{DNS Cache}
\label{subsec:dns cache}
\end{document}