\chapter{Tracking Methods}
\label{chap:tracking methods}

This chapter describes in detail various tracking methods that have been used
throughout the history of the web. Some of these approaches date back to when
the World Wide Web was still in its early development stages. Knowing where the
techniques come from helps in correctly judging the impact they had and still
have on the Internet as we use it today. Furthermore, knowledge about the past
allows for better predictions of future changes in the tracking ecosystem.

To aid in understanding how the methods work and where they fit in the tracking
landscape, three different categories are identified and presented:
session-based, storage-based and cache-based tracking methods. Each category
uses different mechanisms and technologies to enable tracking of users. What
most of them have in common is that they try to place unique identifiers in
different places, which can then be read on subsequent visits. A chronological
ordering of the resulting events thus enables interested parties to infer not
only usage statistics but also specific data about the entities behind those
identifiers.

\section{Session-based Tracking Methods}
\label{sec:session-based tracking methods}

One of the simplest and most widely used forms of tracking on the Internet
relies on sessions. Since \gls{HTTP} is a stateless protocol, web servers cannot
by default keep track of any previous client requests. In order to implement
specific features such as personalized advertising, some means to save the
current state and recall previous states must be used. Sessions were introduced
for this purpose. A session represents a temporary and interactive exchange of
information between two parties. Due to their temporary nature, sessions have to
be `brought up' at some point and `torn down' at a later point in time. It is
not specified, however, how long the period between establishing and terminating
a session has to be. A session could last for only a single browser session and
be terminated manually by the user, or it could last for as long as a year.

\subsection{Passing Information in URLs}
\label{subsec:passing information in urls}

\glspl{URL} were first proposed by Berners-Lee in 1994
\cite{berners-leeUniformResourceLocators1994} and are based on \glspl{URI}
\cite{berners-leeUniversalResourceIdentifiers1994}. The latter specify a way
to uniquely identify a particular resource. The former extend the \gls{URI}
specification to include where and how a particular resource can be found.
\glspl{URI} consist of multiple parts:

\begin{enumerate}
  \item a scheme (in some cases a specific protocol),
  \item an optional authority (network host or domain name),
  \item a path (a specific location on that host),
  \item an optional query and
  \item an optional fragment preceded by a hash sign (a subresource pointing to
    a specific location within the resource).
\end{enumerate}

To access a section called \texttt{introduction} in a blog post named
\texttt{blogpost} on a host with the domain name \texttt{example.com} over
\gls{HTTP}, a user might use the following \gls{URI}:

\begin{verbatim}
http://example.com/blogpost/#introduction
\end{verbatim}

Even though \glspl{URI} and \glspl{URL} are two different things, the terms are
mostly used interchangeably today. Non-technical people in particular refer to
an address on the \gls{WWW} simply as a \gls{URL}.

The optional query component is in most cases constructed of multiple
\texttt{(key,value)} pairs, separated by delimiters such as \texttt{\&} and
\texttt{;}. In the tracking context, query parameters can be used to pass
information (e.g., unique identifiers) to the resource that is to be accessed by
appending a unique string to all the links within the downloaded page. Since
requests to pages are generally logged by the server, requesting multiple pages
with the same unique identifier leaves a trail behind that can be used to
compile a browsing history. Sharing information with other parties is not
limited to unique identifiers. \gls{URL} parameters can also be used to pass the
referrer of a web page containing a query that has been submitted by the user.
\citet{falahrastegarTrackingPersonalIdentifiers2016} demonstrate such an
example where an advertisement tracker logs a user's browsing history by storing
the referrer in a \texttt{(key,value)} pair
\cite[p.~37]{falahrastegarTrackingPersonalIdentifiers2016}. Other possibilities
include encoding geographical data, network properties, user information (e.g.,
e-mail addresses) and authentication credentials.
\citet{westMeasuringPrivacyDisclosures2014} conducted a survey concerning
the use of \gls{URL} query strings and found them to be in widespread use on the
web.

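
The decomposition of a \gls{URI} and the appending of an identifier to a link
can be illustrated with a minimal Python sketch. The function name
\texttt{tag\_link}, the parameter name \texttt{uid} and the extra
\texttt{lang=en} parameter are assumptions made for illustration only; real
trackers choose their own parameter names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def tag_link(url: str, user_id: str, key: str = "uid") -> str:
    """Return `url` with a (key, value) pair appended to its query component."""
    parts = urlsplit(url)
    # Keep the pairs that are already present and add the identifier.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((key, user_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical link found in a downloaded page.
link = "http://example.com/blogpost/?lang=en#introduction"

# Scheme, authority, path, query and fragment of the link.
parts = urlsplit(link)

# The same link after a (hypothetical) tracker has rewritten it.
tagged = tag_link(link, "1234")
```

Every page requested through a link rewritten in this way carries the same
identifier in the server's request log, which is what allows the requests to be
joined into a browsing history.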
\subsection{Hidden Form Fields}
\label{subsec:hidden form fields}

The \gls{HTML} standard specifies form elements
\cite{whatwgFormsHTMLStandard2020}, which allow users to submit information
(e.g., for authentication) to the server via POST or GET methods. Normally, a
user enters data into a form and, on clicking \emph{submit}, the input is sent
to the server. Sometimes it is necessary to include additional information that
the user did not enter. For this reason there exist \emph{hidden} form fields
\cite{whatwgHiddenStateHTML2020}. Hidden form fields are not displayed on the
web site and the user therefore cannot enter any information into them.
Similarly to \gls{URL} parameters, the \texttt{value} attribute of a hidden
field contains additional information such as the user's preferred language.
Since almost anything can be sent in the \texttt{value} attribute, hidden form
fields present another way to maintain a session. A field containing a unique
identifier is sent with the data the user has submitted to the server. The
server can then match the action the user took with the identifier. In case the
server already knows that specific identifier from a previous interaction with
the user, the gained information can now be added to the user's browsing
profile. An example of a hidden form field is given in
Listing~\ref{lst:hidden web form}, which has been adapted from
\cite{InputFormInput}. In Line 15 a hidden field is created and its
\texttt{value} attribute is set by the server to contain a unique user
identifier. Once the \emph{submit} button has been clicked, the identifier is
sent to the server along with the data the user has filled in.

\begin{listing}
  \inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{html}{code/hidden-web-form.html}
  \caption{Example of an \gls{HTML} form containing a hidden field with
  \texttt{id=userId}. The id is set by the web server dynamically so that every
  visitor has their own unique identifier attached to the form.}
  \label{lst:hidden web form}
\end{listing}

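
The server side of this mechanism can be sketched in a few lines of Python. The
sketch assumes a form with a single visible field named \texttt{comment} and a
hidden field named \texttt{userId}; both names, the \texttt{/submit} target and
the two helper functions are hypothetical and not taken from the listing.

```python
from html import escape
from urllib.parse import parse_qs


def render_form(user_id: str) -> str:
    """Render a form whose hidden field carries the visitor's identifier."""
    return (
        '<form action="/submit" method="post">\n'
        '  <input type="text" name="comment">\n'
        f'  <input type="hidden" name="userId" value="{escape(user_id, quote=True)}">\n'
        '  <input type="submit" value="Submit">\n'
        "</form>"
    )


def read_submission(body: str) -> tuple:
    """Split a URL-encoded POST body into (identifier, user input)."""
    fields = parse_qs(body)
    return fields["userId"][0], fields["comment"][0]


# The server embeds the identifier when it delivers the page ...
form = render_form("1234")

# ... and receives it back together with whatever the user typed.
user_id, comment = read_submission("comment=hello+world&userId=1234")
```

The user only ever sees and fills in the \texttt{comment} field, yet the
submitted body contains both values, so the server can attribute the action to
the identifier.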
\subsection{HTTP Referer}
\label{subsec:http referer}

Providers of web services often want to know where visitors to their web site
come from to understand more about their users and their browsing habits. The
\gls{HTTP} specification accounts for this by introducing the \emph{\gls{HTTP}
Referer field} [\emph{sic}] \cite{fieldingHTTPSemanticsContent2014} in the
header. By checking the referrer, the server can see where the request came
from. In practice, when a user clicks on a link on a web page, the \gls{URL} of
the current web page is placed in the \gls{HTTP} Referer field of the request
that is sent to the server. The server responds with the requested web page and
can establish a link from the original web page to the new web page. When
applied to a majority of the requests on a site, the resulting data can be
analyzed for promotional and statistical purposes.
\citet{malandrinoPrivacyAwarenessInformation2013} have shown that the
\gls{HTTP} Referer is one of the most critical factors in leaking \gls{PII}.
Among the kinds of information leaked this way, information relating to a
user's health has been identified as the most severe in terms of
identifiability of users on the web.

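
What a server can learn from the Referer field alone can be sketched as follows.
This is an illustrative Python fragment, not an implementation from the cited
work: the host \texttt{search.example}, the query parameter \texttt{q} and the
function \texttt{referrer\_info} are assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def referrer_info(headers: dict, requested_path: str) -> dict:
    """Link the page a visitor came from to the page that was requested."""
    parts = urlsplit(headers.get("Referer", ""))
    query = parse_qs(parts.query)
    return {
        "from_host": parts.netloc,
        "from_path": parts.path,
        "to_path": requested_path,
        # A search query embedded in the referring URL is leaked as well.
        "search_terms": query.get("q", [None])[0],
    }


# Hypothetical request arriving from a search results page.
info = referrer_info(
    {"Referer": "https://search.example/results?q=flu+symptoms"},
    "/clinic",
)
```

In this example the server not only learns which site referred the visitor but
also the search terms that were part of the referring \gls{URL}.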
\subsection{Explicit Authentication}
\label{subsec:explicit authentication}

Explicit authentication requires a user to \emph{explicitly} log in to or
register with the web site. This way, specific resources are only available to
users once they have authenticated themselves to the service. Actions taken on
an authenticated user account are tied to that account, and crafting a personal
profile is more or less a built-in function in this case. Since merely asking a
user to authenticate is a simple method, the extent to which it can be used is
limited. Logged-in users generally do not stay logged in across different
browser sessions, unless cookies are used to do so (see
section~\ref{subsec:http cookies}), which limits tracking to one session at a
time. Furthermore, always requiring a logged-in state can be tiring for users,
because they have to authenticate every time they visit a particular service.
This can pose a usability problem where users simply stop using the service or
go to considerable lengths to avoid logging in. Whether they do so largely
depends on a cost-benefit analysis the users subconsciously undertake. The third
factor where this method is lacking concerns the user's awareness of being
tracked. Since tracking users depends on them actively logging in to the
service, tracking them transparently is impossible. Even though most tracking
efforts are not detected by the average user, it is known that actions taken on
an account are logged to provide better service through service optimization and
profile personalization.

Creating an account on a web site in order to use its services to their full
extent can be beneficial in some cases. Facebook, for example, allows its users
to configure what they want to share with the public and their friends. Research
has shown, however, that managing which posts get shown to whom is not as
straightforward as one might think. \citet{liuAnalyzingFacebookPrivacy2011}
conducted a survey in which they asked Facebook users about their desired
privacy and visibility settings and cross-checked them with the actual settings
they had used for their posts. The results showed that the users' expectations
matched reality in only 37\% of cases. Additionally, 36\% of content was left on
the default privacy settings, which set the visibility of posts to public,
meaning that any Facebook user can view them.

\subsection{window.name DOM Property}
\label{subsec:window.name dom property}

The \gls{DOM} is a platform- and language-agnostic \gls{API} which defines the
logical structure of web documents (i.e., \gls{HTML}, \gls{XHTML} and \gls{XML})
and the way they are accessed and manipulated. The \gls{DOM} was originally
introduced by Netscape at the same time as JavaScript as \gls{DOM} Level 0.
The first recommendation (\gls{DOM} Level 1) was released in 1998 by the
\gls{W3C} \gls{DOM} working group \cite{w3cDocumentObjectModel1998}, which
published its final recommendation (\gls{DOM} Level 3) in 2004. The
\gls{WHATWG} subsequently took over and in 2015 published the \gls{DOM} Level 4
standard \cite{whatwgDOMLivingStandard2020}, which replaces the Level 3
specification. The \gls{DOM} organizes all objects in a document in a tree
structure, which allows individual parts to be altered when a specific event
happens (e.g., user interaction). Furthermore, each object has properties which
apply either to all \gls{HTML} elements or only to a subset of them.

One useful property for tracking purposes is the \texttt{window.name} property
\cite{whatwgWindowNameHTML2020}. Its original intention was to allow
client-side JavaScript to get or set the name of the current window. Since
windows do not have to have names, the \texttt{window.name} property is mostly
used for setting targets for hyperlinks and forms. Modern browsers allow storing
up to two megabytes of data in the \texttt{window.name} property, which makes it
a viable option for storing data or, more specifically, for maintaining session
variables. In order to store multiple variables in the \texttt{window.name}
property, the values first have to be packed in some way because only a single
string is allowed. A \gls{JSON} serializer can convert plain strings as well as
whole JavaScript objects into a single \gls{JSON} string, which is then ready to
be stored in the \gls{DOM} property. Normally JavaScript's same-origin policy
prohibits making requests to servers in another domain, but the
\texttt{window.name} property is accessible from other domains and survives page
reloads. Maintaining a session across domains and without cookies is therefore
possible and multiple implementations exist
\cite{frankSessionVariablesCookies2008,zypWindowNameTransport2008}.

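
The packing step can be illustrated independently of the browser. The following
Python sketch stands in for what a page script would do with JavaScript's
\texttt{JSON.stringify} and \texttt{JSON.parse}; the variable
\texttt{window\_name} merely simulates the \gls{DOM} property, and the session
keys \texttt{uid} and \texttt{visits} are made up for the example.

```python
import json


def pack(session: dict) -> str:
    """Serialize several session variables into the single string the property can hold."""
    return json.dumps(session, separators=(",", ":"), sort_keys=True)


def unpack(window_name: str) -> dict:
    """Recover the session variables; fall back to an empty session on foreign content."""
    try:
        value = json.loads(window_name)
    except json.JSONDecodeError:
        return {}
    return value if isinstance(value, dict) else {}


# Simulated value of window.name after a page has stored its session in it.
window_name = pack({"uid": "1234", "visits": 3})

# A page loaded later in the same window reads the variables back.
restored = unpack(window_name)
```

The fallback in \texttt{unpack} reflects that the property may be empty or may
have been set to an ordinary window name by another page.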
\section{Storage-based Tracking Methods}
\label{sec:storage-based tracking methods}

Storage-based tracking methods differ from session-based tracking methods in
that they try to store information on the client's computer not only for single
sessions but for as long as desired. The following methods can be used to store
session data as well but are not limited to that use case. They generally enable
more advanced tracking approaches because they have information about the
current browser instance and the operating system the browser is running on.
Because the stored data resides on the user's computer, these methods are in
most cases harder to circumvent, especially when two or more of them are
combined, which results in better resilience against simple defenses.

\subsection{HTTP Cookies}
\label{subsec:http cookies}

The method most often associated with tracking on the Internet is tracking with
\gls{HTTP} cookies. Cookies are small pieces of data that are placed in the
browser's storage on the user's computer. They are limited to four kilobytes in
size and are generally used to identify and authenticate users and to store
web site preferences. They were introduced to the web to allow stateful
information to be stored because \gls{HTTP} is a stateless protocol and
therefore does not have this capability. Cookies are also a way of reducing the
server's load because states do not have to be recomputed every time a user
visits a web site. Shopping cart functionality, for example, can be implemented
by setting a cookie in the user's browser which saves the items currently in the
shopping cart and gives the user the possibility to resume shopping at a later
point, provided that they do not delete their cookies. With the introduction of
cookies, advertising companies could reidentify users by placing unique
identifiers in the browser and reading them on subsequent visits. The first
standard for cookies was published in 1997
\cite{kristolHTTPStateManagement1997} and has since been updated multiple times
\cite{kristolHTTPStateManagement2000,barthHTTPStateManagement2011}.

Cookies can be divided into two categories: first-party cookies, which are
created by the domain the user has requested, and third-party cookies, which are
placed in the user's browser by other domains that are generally not under the
control of the first party \cite{barthThirdPartyCookies2011}. Whereas
first-party cookies are commonly not used for tracking but, for example, for the
aforementioned shopping cart functionality or for enabling e-commerce
applications to function properly, third-party cookies are popular with data
brokerage firms (e.g., Datalogix, Experian, Equifax), online advertisers (e.g.,
DoubleClick) and social media platforms (e.g., Facebook), which in some cases
belong to both of these categories \cite{cahnWhatCommunityCookie2016}. The
distinction between first-party and third-party cookies is not always clear,
however. Google Analytics, for example, is considered to be a third party but
offers its analytics services by setting a first-party cookie in the user's
browser in addition to loading JavaScript snippets from its servers. Therefore,
categorizing cookies into those that serve third-party web content and those
that serve first-party web content presents a more adequate approach.

Cookies are set either by scripts that are embedded in a web page (e.g.,
Google's \texttt{analytics.js}) or by using the \gls{HTTP} \texttt{Set-Cookie}
response header. Once a request to a web server has been issued, the server can
set a cookie in the \texttt{Set-Cookie} header of the response it sends back to
the client. On the client's side the cookie is stored by the browser and sent
with subsequent requests to the same domain via the \texttt{Cookie} \gls{HTTP}
header. An example of a \texttt{Set-Cookie} header is given in
Listing~\ref{lst:session cookie header}. Because this example does not set an
expiration date for the cookie, it sets a session cookie. Session cookies are
limited to the current session and are deleted as soon as the session is `torn
down'. By adding an expiration date (demonstrated in
Listing~\ref{lst:permanent cookie header}) or a maximum age, the cookie becomes
permanent. Additionally, the domain attribute can be specified; cookies which
list a domain that does not include the origin server are rejected by the user
agent \cite[section 4.1.2.3]{barthHTTPStateManagement2011}. The same-origin
policy applies to cookies, disallowing access by other domains.

\begin{listing}
  \inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/session-cookie-header}
  \caption{An example of an \gls{HTTP} header setting a session cookie.}
  \label{lst:session cookie header}
\end{listing}

\begin{listing}
  \inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/permanent-cookie-header}
  \caption{An example of an \gls{HTTP} header setting a permanent cookie.}
  \label{lst:permanent cookie header}
\end{listing}

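
The difference between a session cookie and a permanent cookie can also be
reproduced with Python's standard \texttt{http.cookies} module. The cookie name
\texttt{userID}, its value, the one-year lifetime and the domain
\texttt{example.com} are assumptions chosen for illustration and are not
necessarily the values used in the listings.

```python
from http.cookies import SimpleCookie

# Session cookie: no Expires or Max-Age attribute, so the browser discards
# it when the session ends.
session = SimpleCookie()
session["userID"] = "1234"
session_header = session["userID"].OutputString()

# Permanent cookie: a maximum age (here one year, in seconds) keeps the
# cookie in the browser beyond the current session.
permanent = SimpleCookie()
permanent["userID"] = "1234"
permanent["userID"]["max-age"] = 365 * 24 * 60 * 60
permanent["userID"]["domain"] = "example.com"
permanent_header = permanent["userID"].OutputString()

# On later requests the browser returns the cookie in the Cookie header,
# which the server parses to recognize the visitor.
received = SimpleCookie()
received.load("userID=1234")
returned_id = received["userID"].value
```

The strings in \texttt{session\_header} and \texttt{permanent\_header} are the
values a server would place after \texttt{Set-Cookie:} in its response.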

Tracking and non-tracking cookies can be distinguished with high accuracy by
observing their expiration time and the length of the value field.
\citet{liTrackAdvisorTakingBack2015} demonstrate a supervised learning approach
to detecting tracking cookies with their tool \emph{TrackAdvisor}. They found
that tracking cookies generally have a longer expiration time than non-tracking
cookies and that they need to have a sufficiently long value field to carry the
unique identifier. Using this method, they found that only 10\% of tracking
cookies have a lifetime of a single day or less while 80\% of non-tracking
cookies expire before a day is over. Additionally, a length of more than 35
characters in the value field applies to 80\% of tracking cookies and a value
field of less than 35 characters applies to 80\% of non-tracking cookies.
\emph{Cookie Chunking}, where a cookie of larger length is split into multiple
cookies of smaller length, did not appear to affect detection by their method
negatively. They also present a site measurement of the Alexa Top 10,000 web
sites, finding that 46\% of web sites use third-party tracking. More recent
research \cite{gonzalezCookieRecipeUntangling2017} has shown that tracking
cookies do not have to be long-lasting to accumulate data about users. Some
cookies, such as the \texttt{\_\_utma} cookie from Google Analytics, save a
timestamp of the current visit together with the unique identifier. This makes
it possible to use short-lived cookies which can afterwards be combined in
series to complete the whole picture.
\citet{gonzalezCookieRecipeUntangling2017} have also found 20\% of observed
cookies to be \gls{URL} or base64 encoded, making decoding of cookies a
necessary step for analysis. Furthermore, and contrary to previous work,
cookie values are found in many more varieties than is assumed by approaches
that only try to detect cookies by their expiration date and/or character
length. They also presented an entity-based matching algorithm to dissect
cookies which contain more than a unique identifier. This allows for a better
understanding and interpretation of complex cookies as they are found in
advertising networks with a lot of reach (e.g., doubleclick.net). This
information is particularly useful for building applications that effectively
detect and block cookies (see chapter~\ref{chap:defenses against tracking}).

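
The two features discussed above, lifetime and value length, as well as the
decoding step can be combined into a deliberately naive rule of thumb. The
following Python sketch is \emph{not} the supervised classifier used by
TrackAdvisor nor the matching algorithm of the more recent work; it only applies
the one-day and 35-character figures reported above as fixed thresholds, and all
function and constant names are hypothetical.

```python
import base64
import binascii
from urllib.parse import unquote

ONE_DAY = 24 * 60 * 60  # lifetime threshold in seconds
MIN_ID_LENGTH = 35      # value-length threshold in characters


def decode_value(value: str) -> str:
    """Undo URL encoding and, where it yields readable text, base64 encoding."""
    value = unquote(value)
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError, ValueError):
        # Not valid base64 text; keep the URL-decoded value. Note that a
        # plain value can accidentally be valid base64, so this heuristic
        # may occasionally decode something that was never encoded.
        return value


def looks_like_tracking_cookie(value: str, lifetime_seconds: int) -> bool:
    """Flag cookies that are both long-lived and long enough to hold an identifier."""
    return lifetime_seconds > ONE_DAY and len(value) > MIN_ID_LENGTH


# Hypothetical cookies: a long-lived identifier and a short-lived preference.
identifier_cookie = looks_like_tracking_cookie("a" * 40, 365 * ONE_DAY)
preference_cookie = looks_like_tracking_cookie("en", 60 * 60)
```

A rule of this kind misses short-lived tracking cookies by construction, which
is precisely the limitation pointed out by the more recent research.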
\subsection{Flash Cookies and Java JNLP PersistenceService}
\label{subsec:flash cookies and java jnlp persistenceservice}

Flash Cookies \cite{adobeAdobeFlashPlatform} are similar to \gls{HTTP} cookies
in that they too are a store of information that helps web sites and servers to
recognize previously seen users. They are referred to as \glspl{LSO} by Adobe
and are part of the Adobe Flash Player runtime. Instead of storing data in the
browser's storage, they have their own storage in a different location on the
user's computer. Another difference is that they can store not just 4 kilobytes
of data but 100 kilobytes, and they have no expiration date by default
(\gls{HTTP} cookies live until the end of the session unless specified
otherwise). Since Flash cookies are not created by means the browser normally
supports (i.e., \gls{HTTP}, \gls{CSS}) but by Adobe's Flash Player runtime,
browsers do not manage Flash cookies. Because Flash cookies are therefore not
tied to a specific browser, they function across browsers. This capability makes
them an interesting target for trackers to store their identifying information
in, because browsers initially did not support removing Flash cookies out of the
box and one had to manually set preferences in the \emph{Web Storage Settings
panel} provided by the Flash Player runtime to get rid of them. Trackers were
searching for a new way to store identifiers because users became increasingly
aware of the dangers posed by \gls{HTTP} cookies and reacted by taking
countermeasures.

\citet{soltaniFlashCookiesPrivacy2009} were the first to report on the usage of
Flash cookies by advertisers and popular web sites. While surveying the top 100
web sites at the time, they found that 54\% of them used Flash cookies. Some
web sites were setting Flash cookies as well as \gls{HTTP} cookies with the
same values, suggesting that Flash cookies serve as a backup for \gls{HTTP}
cookies. Several web sites were found using Flash cookies to respawn already
deleted \gls{HTTP} cookies, even across domains.
\citet{acarWebNeverForgets2014} automated the detection of Flash cookies and of
access to them by monitoring file access with the GNU/Linux \emph{strace} tool
\cite{michaelStraceLinuxManual2020}. This allowed them to acquire data about
Flash cookies respawning \gls{HTTP} cookies. Their results show that six of the
top 100 sites use Flash cookies for respawning.

Even though Flash usage has declined during the last few years thanks to the
development of the HTML5 standard, \citet{buhovFLASH20thCentury2018} have shown
that despite major security flaws, Flash content was still served by 7.5\% of
the top one million web sites in 2017. The W3Techs Web Technology Survey shows
a similar trend and also offers an up-to-date measurement of 2.7\% of the top
ten million web sites for the year 2020
\cite{w3techsHistoricalYearlyTrends2020}. Due to the security concerns with
using Flash, Google's popular video sharing platform YouTube switched to the
HTML5 \texttt{<video>} tag by default in January of 2015
\cite{youtubeengineeringYouTubeNowDefaults2015}. In 2017 Adobe announced that
they would end-of-life Flash at the end of 2020, stopping updates and
distribution \cite{adobecorporatecommunicationsFlashFutureInteractive2017}.
Consequently, Chrome 76 and Firefox 69 disabled Flash by default and will drop
support entirely in 2020.

Similarly to Flash, Java also provides a way of storing data locally on the
user's computer via the PersistenceService \gls{API}
\cite{PersistenceServiceJNLPAPI2015}. It is used by the evercookie library
(section~\ref{subsec:evercookie}) to store values for cookie respawning by
injecting a Java applet into the \gls{DOM} of a page
\cite{baumanEvercookieApplet2013}.

\subsection{Evercookie}
\label{subsec:evercookie}

Evercookie is JavaScript code that can be embedded in web sites and allows
information to be stored permanently on the user's computer. When activated,
information is not only stored in standard \gls{HTTP} cookies but also in
various other places, providing redundancy where possible. A full list of
locations used by Evercookie can be found on the project's GitHub page
\cite{kamkarSamykEvercookie2020}. If users want to get rid of all information
stored by a web site that uses evercookies, every location has to be cleared,
because if one remains, all the other cookies are restored from it. The cookie
deletion mechanisms that are provided by browsers by default do not clear all
locations where evercookies are stored, which makes Evercookie almost
impossible to avoid. Because Evercookie is open source, implementing or using it
unnoticed is not easy. Additionally, it is reported on the project's GitHub page
that it might cause severe performance issues in browsers.

Evercookie was proposed and implemented by
\citet{kamkarEvercookieVirtuallyIrrevocable2010}. Multiple surveys have tried
to quantify the use of Evercookie in the wild. \citet{acarWebNeverForgets2014}
provide a heuristic for detecting evercookies stored on the user's computer and
analyze Evercookie usage in conjunction with cookie respawning.

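
The respawning behaviour described in this subsection can be reduced to a very
small model. The following Python sketch simulates three storage locations with
a dictionary; it is a simplification under stated assumptions and not
Evercookie's actual code. The location names are illustrative, and the rule of
restoring from the first surviving copy is an assumption.

```python
def respawn(stores: dict) -> dict:
    """Restore the identifier to every location if at least one copy survives."""
    survivors = [value for value in stores.values() if value is not None]
    if not survivors:
        # Every location was cleared; there is nothing to restore from.
        return dict(stores)
    identifier = survivors[0]
    return {name: identifier for name in stores}


# The identifier is stored redundantly in several (simulated) locations.
stores = {"http_cookie": "1234", "flash_lso": "1234", "local_storage": "1234"}

# The user clears the locations the browser's own deletion mechanism reaches.
stores["http_cookie"] = None
stores["local_storage"] = None

# On the next visit, the script finds one surviving copy and rewrites the rest.
stores = respawn(stores)
```

Only when every single location has been cleared does the identifier stay
deleted, which is why partial cleaning is ineffective against this technique.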
\subsection{Cookie Synchronization}
\label{subsec:cookie synchronization}

When trackers use cookies to store unique identifiers, every tracker assigns a
different identifier to the same user, because the same-origin policy disallows
interaction with other trackers. Sharing data between multiple trackers is
therefore difficult, since there is no easy way to accurately match the profile
history accumulated under one identifier to that of another. Modern trackers
solve this problem with a mechanism called Cookie Synchronization or Cookie
Matching \cite{googleinc.CookieMatchingRealtime2020}. This technique allows
multiple trackers to open an information sharing channel between each other
without necessarily having to know the web site the user visits.

\begin{figure}[ht]
  \centering
  \includegraphics[width=1\textwidth]{figures/cookiesyncing.pdf}
  \caption{Cookie Synchronization in practice between two trackers
  \emph{cloudflare.com} and \emph{google.com}.}
  \label{fig:cookie synchronization}
\end{figure}

An example of how Cookie Synchronization works in practice is given in
Figure~\ref{fig:cookie synchronization}. The two parties that are interested in
tracking the user are called \emph{cloudflare.com} and \emph{google.com} in this
example. The user they want to track is called \emph{browser}. \emph{Browser}
first visits \emph{website1.com} which loads JavaScript from
\emph{cloudflare.com}. \emph{Cloudflare.com} sets a cookie in the browser with a
tracking identifier called \emph{userID = 1234}. Next, \emph{browser} visits
another web site called \emph{website2.com} which loads an advertisement banner
from \emph{google.com}. \emph{Google.com} also sets a cookie with the tracking
identifier \emph{userID = ABCD}. \emph{Browser} now has two cookies from two
different providers, each of them knowing the user under a different identifier.
When \emph{browser} visits a third web site called \emph{website3.com}, that
site makes a request to \emph{cloudflare.com}, which recognizes the user by the
identifier \emph{userID = 1234}. \emph{Cloudflare.com} then sends an \gls{HTTP}
redirect, redirecting \emph{browser} to \emph{google.com}. The redirect also
contains a \gls{URL} query string (see
section~\ref{subsec:passing information in urls}) such as
\emph{?userID=1234\&publisher=website3.com}. The complete GET request to
\emph{google.com} might look like this:

\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}
GET /index.html?userID=1234&publisher=website3.com HTTP/1.1
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0
Host: google.com
Cookie: userID=ABCD
\end{minted}

\emph{Google.com} therefore not only knows that the user with the identifier
\emph{userID=ABCD} visited \emph{website3.com} but also that \emph{browser} is
the same user as \emph{userID=1234}. Since the identifiers can now be traced
back to the same person, the different cookies have been synchronized, allowing
the two trackers to exchange information about the user without the user's
knowledge.

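
How the receiving tracker might process the request shown above can be sketched
in Python. The parameter names \texttt{userID} and \texttt{publisher} are taken
from the example request; the function \texttt{handle\_sync\_request} and the
layout of the match table are assumptions made for illustration and do not
describe any particular tracker's implementation.

```python
from http.cookies import SimpleCookie
from urllib.parse import parse_qs, urlsplit


def handle_sync_request(request_target: str, cookie_header: str, match_table: dict) -> None:
    """Record which partner identifier belongs to the tracker's own identifier."""
    query = parse_qs(urlsplit(request_target).query)

    # Own identifier: sent by the browser in the Cookie header.
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    own_id = cookie["userID"].value

    # Partner identifier and visited site: passed along in the query string.
    match_table[own_id] = {
        "partner_id": query["userID"][0],
        "seen_on": query["publisher"][0],
    }


match_table = {}
handle_sync_request(
    "/index.html?userID=1234&publisher=website3.com",
    "userID=ABCD",
    match_table,
)
```

After this single request the table maps \emph{ABCD} to \emph{1234}, so
profiles collected under either identifier can be joined from then on.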

Cookie Synchronization has seen widespread adoption, especially in \gls{RTB}
based auctions \cite{olejnikSellingPrivacyAuction2014}.
\citet{papadopoulosCookieSynchronizationEverything2019} recorded and analyzed
the browsing habits of 850 users over a time period of one year and found that
97\% of users with regular browsing activity were exposed to Cookie
Synchronization at least once. Furthermore, they found that ``[...] the average
user receives around 1 synchronization per 68 requests''
\cite[p.~7]{papadopoulosCookieSynchronizationEverything2019}. In
\cite{englehardtOnlineTracking1MillionSite2016} the authors crawl the top
100,000 sites and find that 45 of the top 50 (90\%) third parties and 460 of
the top 1000 (46\%) use Cookie Synchronization with at least one other party.
\emph{Doubleclick.net} is at the top, sharing 108 cookies with 118 other
third parties. \citet{papadopoulosExclusiveHowSynced2018} show the threat
that Cookie Synchronization poses to encrypted \gls{TLS} sessions: the
cookie syncing can be performed over unencrypted \gls{HTTP} even though the
original request to the web site was encrypted. This highlights the serious
privacy implications for users of \gls{VPN} services trying to safeguard their
traffic from a potentially malicious \gls{ISP}.

\subsection{Silverlight Isolated Storage}
|
|
\label{subsec:silverlight isolated storage}
|
|
|
|
Silverlight Isolated Storage can also be used for storing data for tracking
|
|
purposes on the user's computer. It has been compared to Adobe's Flash
technology because it, too, requires a browser plugin (in this case from
Microsoft) to function. The available storage is 100 kilobytes, which is the
same amount Flash cookies can store.
|
|
Silverlight does not work in the private browsing mode and can only be cleaned
|
|
manually by deleting a hidden directory in the filesystem or by changing
|
|
settings in the Silverlight application. Silverlight's Isolated Storage is one
|
|
of the methods evercookie (section~\ref{subsec:evercookie}) uses to make
|
|
permanent deletion of cookies hard to do and to facilitate cookie respawning.
|
|
Usage of Silverlight has seen a steady decline since 2011 even though it has
|
|
been used by popular video streaming web sites such as Netflix
|
|
\cite{NetflixBeginsRollOut2010} and Amazon. Microsoft did not include
|
|
Silverlight support in Windows 8 and announced in a blog post that support
will end in October 2021 \cite{SilverlightEndSupport2015}. Usage of Silverlight currently
|
|
hovers around 0.04\% for the top 10 million web sites
|
|
\cite{w3techsUsageStatisticsSilverlight2020}.
|
|
|
|
\subsection{HTML5 Web Storage}
|
|
\label{subsec:html5 web storage}
|
|
|
|
HTML5 Web Storage comes in three different forms: HTML5 Global Storage, HTML5
|
|
Local Storage and HTML5 Session Storage. It is part of the HTML specification
|
|
\cite{whatwgHTMLStandard2020} and provides means for storing name-value pairs on
|
|
the user's computer. HTML5 Web Storage works similarly to cookies but enables
|
|
developers to manage transactions that are done by the user simultaneously but
|
|
in two different windows. Whereas with cookies the transaction can accidentally
|
|
be recorded twice, HTML5 Web Storage allows multiple windows to access the same
|
|
storage on the user's computer thereby avoiding this problem. In contrast to
|
|
cookies, which are sent to the server every time a request is made, HTML5 Storage
|
|
contents do not get sent to the web server. By default the storage limit is
|
|
configured to be 5 megabytes per origin \cite{whatwgHTMLStandard2020a}. Even
|
|
though this was only a recommendation by the standard, all modern browsers
|
|
adhere to it. More space can be allocated upon asking the user for permission to
|
|
do so.
|
|
|
|
Global Storage was part of an initial HTML5 draft and is accessible across
|
|
applications. Due to it violating the same-origin policy, most major browsers
|
|
have not implemented Global Storage.
|
|
|
|
Local Storage does, however, obey the same-origin policy by only allowing the
|
|
originating domain access to its name-value pairs. Every web site has its own
separate storage area, which maintains a clear separation of concerns. Local
Storage lends itself to different use cases. In particular, applications that
|
|
should function even when no internet connection exists can use Local Storage to
|
|
enable that functionality. Coupled with file-based \glspl{API}, which are
|
|
generally not limited in storage (except by the available disk space), offering
|
|
the full range of features in offline-mode is feasible.
|
|
|
|
The third category of HTML5 Web Storage, Session Storage, is similar to Local Storage, but
|
|
requires that the stored data be deleted after a session is closed. While
|
|
content that is persisted by Local Storage must be deleted explicitly by the
|
|
user, Session Storage has the intended function of providing non-persistent
|
|
storage.
|
|
|
|
HTML5 Web Storage can be used for tracking in the same way that cookies are
|
|
used: by storing unique identifiers which are read on subsequent visits.
|
|
\citet{ayensonFlashCookiesPrivacy2011} found that 17 of the top 100 web sites
|
|
used HTML5 Web Storage, with some of them using it for cookie respawning (see
|
|
section~\ref{subsec:evercookie}). A recent survey by
|
|
\citet{belloroKnowWhatYou2018} looked at Web Storage usage in general and found
|
|
that 83.09\% of the top 10K Alexa web sites use it. The authors flagged 63.88\%
|
|
of those usages as coming from known tracking domains.
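
How a second storage location enables cookie respawning can be sketched in a few lines of Python. The two dictionaries are stand-ins for the browser's cookie jar and its Local Storage area, and the function \texttt{load\_or\_respawn} as well as the key \texttt{uid} are hypothetical names chosen for illustration. In a browser the same logic would be written in JavaScript.

```python
import uuid


def load_or_respawn(cookies, web_storage, key="uid"):
    """Return the visitor identifier, restoring it from whichever store still has it."""
    identifier = cookies.get(key) or web_storage.get(key)
    if identifier is None:
        identifier = uuid.uuid4().hex   # first visit: mint a new identifier
    # Write the identifier back to both stores so each backs up the other.
    cookies[key] = identifier
    web_storage[key] = identifier
    return identifier


cookies, web_storage = {}, {}            # stand-ins for the browser's stores
first = load_or_respawn(cookies, web_storage)

cookies.clear()                          # the user deletes all cookies
second = load_or_respawn(cookies, web_storage)
```

Because the identifier survives in the second store, deleting the cookies alone does not change it: \texttt{second} equals \texttt{first}, and the cookie has been written again.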
|
|
|
|
\subsection{HTML5 Indexed Database API}
|
|
\label{subsec:html5 indexed database api}
|
|
|
|
The need for client side storage to provide performant web applications that can
|
|
also function offline has prompted the inception of alternative methods to
|
|
store and retrieve information. Consequently, the development of the HTML5
|
|
standard has tried to fill that need by introducing HTML5 Web Storage and the
|
|
HTML5 Indexed Database \gls{API}.
|
|
|
|
HTML5 Indexed Database \gls{API} provides an interface for storing values and
|
|
hierarchical objects using the well-known key-value pair storage principle
|
|
\cite{alabbasIndexedDatabaseAPI2020}. This property makes it similar to NoSQL
|
|
storage solutions which have seen increasing adoption rates on the web. It is
|
|
the successor to the abandoned Web SQL Database standard (see
section~\ref{subsec:web sql database}) and functions similarly to HTML5 Web Storage,
|
|
meaning that it has the same storage limits and privacy implications and has to
|
|
obey the same-origin policy. In contrast to HTML5 Web Storage, IndexedDB is
|
|
intended for storing larger amounts of data and provides additional functions
|
|
such as in-order key retrieval. Reading from and writing to an IndexedDB is done
|
|
with JavaScript by opening a connection to the database, preparing a transaction
|
|
and committing it. The development of the standard is ongoing with two editions
|
|
already published and recommended by the W3C and the third edition existing as
|
|
an editor's draft until it is ready for recommendation.
|
|
|
|
HTML5 IndexedDB has been added to the evercookie library (see
|
|
section~\ref{subsec:evercookie}) by
|
|
\citet{kamkarEvercookieVirtuallyIrrevocable2010}, providing redundancy for
|
|
\gls{HTTP} cookies. \citet{acarWebNeverForgets2014} have shown that only 20 of
|
|
100,000 surveyed sites use the IndexedDB storage vector, with one of them
|
|
(\texttt{weibo.com}) using it for respawning \gls{HTTP} cookies. A more recent
|
|
study by \citet{belloroKnowWhatYou2018} paints a different picture: On a
|
|
dataset provided by the \gls{HTTP} Archive project
|
|
\cite{soudersAnnouncingHTTPArchive2011}, they found that 5.56\% of observed
|
|
sites use IndexedDB. Of those that use IndexedDB, 31.87\% of usages appear to
|
|
be coming from domains that are flagged as trackers.
|
|
|
|
\subsection{Web SQL Database}
|
|
\label{subsec:web sql database}
|
|
|
|
Web SQL Database \cite{hicksonWebSQLDatabase2010} was initially developed to
|
|
provide an \gls{API} for storing data in databases. The stored data can then be
|
|
queried using \gls{SQL} or variants thereof. The W3C stopped the development of
|
|
the standard in 2010 because all implementations relied on the same backend
(SQLite), whereas multiple independent implementations are necessary for a
recommendation as a standard. Browsers have
|
|
turned to HTML5 IndexedDB (see section~\ref{subsec:html5 indexed database api}),
|
|
the ``spiritual successor'' to Web SQL Database, for web database storage.
|
|
Despite the W3C deprecating Web SQL Database, some browsers such as Chrome,
|
|
Safari and Opera still support it and have no plans of discontinuing it.
|
|
|
|
In the same way that other tracking technologies can maintain a history of web
|
|
site visits and actions, Web SQL Database can store identifying information via
|
|
the usage of unique identifiers. An arbitrary maximum size of 5 megabytes of
|
|
storage per origin is recommended by the standard, with the possibility to ask
|
|
the user for more capacity. This limit includes other domains which are
|
|
affiliated with the origin but have a different name (e.g. subdomains).
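
Storing an identifier in such a database amounts to a single table with one row. The following Python sketch uses the \texttt{sqlite3} module of the standard library as a stand-in for the SQLite backend mentioned above. It is not the browser \gls{API} itself, and the table name \texttt{tracking} and the function \texttt{get\_or\_create\_identifier} are assumptions made for illustration.

```python
import sqlite3
import uuid


def get_or_create_identifier(db):
    """Read the stored identifier or create one on the first visit."""
    db.execute("CREATE TABLE IF NOT EXISTS tracking (name TEXT PRIMARY KEY, value TEXT)")
    row = db.execute("SELECT value FROM tracking WHERE name = ?", ("uid",)).fetchone()
    if row is not None:
        return row[0]
    identifier = uuid.uuid4().hex
    db.execute("INSERT INTO tracking (name, value) VALUES (?, ?)", ("uid", identifier))
    db.commit()
    return identifier


# One in-memory database stands in for the storage area of a single origin.
origin_db = sqlite3.connect(":memory:")
first_visit = get_or_create_identifier(origin_db)
second_visit = get_or_create_identifier(origin_db)
```

The second call finds the row written by the first one and returns the same identifier, which is the property a tracker relies on.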
|
|
|
|
Due to the W3C abandoning the Web SQL Database standard, not many reports on
|
|
usage for tracking purposes exist. The method has been added, however, to the
|
|
evercookie library by \citet{kamkarEvercookieVirtuallyIrrevocable2010} (see
|
|
section~\ref{subsec:evercookie}) to add another layer of redundancy for storing
|
|
unique identifiers and respawning deleted ones. By performing static analysis on
|
|
a dataset provided by the \gls{HTTP} Archive project
|
|
\cite{soudersAnnouncingHTTPArchive2011}, \citet{belloroKnowWhatYou2018}
|
|
found that 1.34\% of the surveyed web sites use Web SQL Database in one of their
|
|
subresources. 53.59\% of Web SQL Database usages are considered to be coming
from known tracking domains. The tracking share is lower for the top 10K web
sites as determined by Alexa (in May 2018): 2.12\% use Web SQL Database and
39.9\% of those usages come from tracking domains. These percentages suggest
that Web SQL Database is frequently used not to provide new functionality but
to increase user tracking capabilities.
|
|
|
|
\section{Cache-based Tracking Methods}
|
|
\label{sec:cache-based tracking methods}
|
|
|
|
While the underlying principle of storing unique identifiers on the user agent's
|
|
computer remains the same, cache-based methods exploit a type of storage that is
|
|
normally used for data that is saved for short periods of time and most commonly
|
|
serves to improve performance. Whereas storage-based tracking methods (see
|
|
section~\ref{sec:storage-based tracking methods}) exploit storage interfaces
|
|
that are meant for persisting data to disk, caches store data that has been
|
|
generated by an operation and can be served faster on subsequent requests.
|
|
|
|
A variety of caches exist and they are utilized for different purposes, leading
|
|
to different forms of information exploitability for tracking users. This
|
|
section introduces methods which are in most cases not prevalent but are more
|
|
sophisticated and can thus be much harder to circumvent or block.
|
|
|
|
\subsection{Web Cache}
|
|
\label{subsec:web cache}
|
|
|
|
Using the \gls{DOM} \gls{API}'s \texttt{Window.getComputedStyle()} method,
|
|
web sites were able to check a user's browsing history by utilizing the \gls{CSS}
|
|
\texttt{:visited} selector. Links can be colored depending on whether they have
|
|
already been visited or not. The colors can be set by the web site trying to
|
|
find out what the user's browsing history is. JavaScript would then be used to
|
|
generate links on the fly for web sites that will be cross-checked with the
|
|
contents of the browsing history. After generating links, a script can check the
|
|
color, compare it with the color that has been set for visited and non-visited
|
|
web sites and see if a web site has already been visited or not.
|
|
|
|
A solution to the problem has been proposed and subsequently implemented by
|
|
\citet{baronPreventingAttacksUser2010} in 2010, making
|
|
\texttt{getComputedStyle()} and similar functions lie about the state of
visited links by reporting them as unvisited. Another solution has been
developed by \citet{jacksonProtectingBrowserState2006} in the form of a browser
|
|
extension that enforces the same-origin policy for browser histories as well.
|
|
Although their approach limits access to a user's browsing history by third
|
|
parties, first parties are unencumbered by the same-origin policy. Their
|
|
browser extension does, however, thwart the attack carried out by
|
|
\citet{jancWebBrowserHistory2010} where the authors were able to check for up
|
|
to 30,000 links per second.
|
|
|
|
\citet{wondracekPracticalAttackDeanonymize2010} demonstrate the severity of
|
|
history stealing attacks (e.g. visited link differentiation) on user privacy by
|
|
probing for \glspl{URL} that encode user information such as group membership
|
|
in social networks. By constructing a set of group memberships for each user,
|
|
the results can uniquely identify a person. Furthermore, information that is
|
|
not yet attributed to a single user but to a group as a whole can be used to
|
|
more accurately identify members of said group.
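
The narrowing effect of combining group memberships can be shown with a small Python sketch. The member lists, the group names and the function \texttt{candidates} are invented for illustration. The sketch assumes that the attacker has already obtained the member list of each group and knows which group \glspl{URL} the history probe reported as visited.

```python
def candidates(group_members, detected_groups):
    """Intersect the member lists of all groups found in the browsing history."""
    member_sets = [set(group_members[g]) for g in detected_groups]
    if not member_sets:
        return set()
    return set.intersection(*member_sets)


# Hypothetical public member lists, e.g. crawled from a social network.
group_members = {
    "chess-club": {"alice", "bob", "carol"},
    "hiking": {"alice", "dave"},
    "latex-users": {"alice", "bob", "dave"},
}

# Groups whose URLs the history probe reported as visited.
detected = ["chess-club", "hiking", "latex-users"]
result = candidates(group_members, detected)
```

Each additional detected group can only shrink the candidate set. In this toy data set, three groups are enough to leave a single candidate.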
|
|
|
|
A web browser's cache can be used for tracking in other ways as well. One is
to check whether a web site asset (e.g., an image or script) has already been
cached by the user agent. If it has been cached, the web site knows that it has
been visited before; if it has not been cached (the asset is downloaded from
the server), the user agent is visiting for the first time. Another way is to embed
|
|
identifiers in cached documents. An \gls{HTML} file can contain an identifier
|
|
which is stored in a \texttt{<div>} tag and is cached by the user agent. The
|
|
identifier can then be read from the cache on subsequent visits, even from third
|
|
party web sites.
|
|
|
|
\subsection{Cache Timing}
|
|
\label{subsec:cache timing}
|
|
|
|
Cache timing attacks \cite{feltenTimingAttacksWeb2000} are another form of
|
|
history stealing which enables an attacker to probe for already visited
|
|
\glspl{URL} by timing how long it takes a client to fetch a resource. Timing
|
|
attacks are most commonly used in cryptography to indirectly observe the
|
|
generation or usage of a cipher key by measuring CPU noise, frequencies, power
|
|
usage or other properties that allow conclusions to be drawn about the key. This
|
|
type of attack is referred to as a side-channel attack. Cache timing exploits
|
|
the fact that it takes time to load assets for a web site. It works by measuring
|
|
the time a client takes to access a specified resource. If the time is short,
|
|
the resource has most likely been served from the cache and has thus been
|
|
downloaded before, implying a visit to a web site which uses that resource. If
|
|
it takes longer than a cache hit would, on the other hand, the resource was
not in the cache and has to be downloaded now, suggesting that no web site
using that resource has been visited before. In practice an attack might look
|
|
like this (taken from \cite[p.~2]{feltenTimingAttacksWeb2000}):
|
|
|
|
\begin{enumerate}
|
|
\item Alice visits a web site from Bob called \texttt{bob.com}.
|
|
\item Bob wants to find out whether Alice visited Charlie's web site
|
|
\texttt{charlie.com} in the past.
|
|
\item Bob chooses a file from \texttt{charlie.com} which is regularly
|
|
downloaded by visitors to that site.
|
|
\item Bob implements a script or program that checks the time it takes
|
|
to load the file from \texttt{charlie.com} and embeds it in his
|
|
own site.
|
|
\item The program is loaded by Alice upon visiting and measures the time
|
|
needed to load the file from \texttt{charlie.com}.
|
|
\item If the measured time is below a certain threshold, the file has
|
|
probably been downloaded into the cache and Alice has therefore
|
|
visited \texttt{charlie.com} before.
|
|
\end{enumerate}
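
The steps above can be sketched in Python with a simulated browser. To keep the example deterministic, load times are fixed constants instead of real measurements. The class \texttt{SimulatedBrowser}, the function \texttt{probably\_visited}, the file name and the threshold of 20 milliseconds are all assumptions made for illustration.

```python
class SimulatedBrowser:
    """Toy browser whose load time depends only on the cache state."""

    CACHE_HIT_MS = 2.0     # assumed time to read a resource from the cache
    CACHE_MISS_MS = 80.0   # assumed time to download it from the server

    def __init__(self):
        self.cache = set()

    def load(self, url):
        """Load a resource and return the elapsed time in milliseconds."""
        if url in self.cache:
            return self.CACHE_HIT_MS
        self.cache.add(url)          # a download also populates the cache
        return self.CACHE_MISS_MS


def probably_visited(browser, probe_url, threshold_ms=20.0):
    """Bob's script: a fast load suggests the resource was already cached."""
    return browser.load(probe_url) < threshold_ms


probe = "https://charlie.com/logo.png"

alice = SimulatedBrowser()
alice.load(probe)                    # Alice visited charlie.com earlier
alice_visited = probably_visited(alice, probe)

dave = SimulatedBrowser()            # Dave never visited charlie.com
dave_visited = probably_visited(dave, probe)
```

Note that the probe is destructive in this model: loading the file places it in the cache, so a second probe of the same browser always reports a hit.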
|
|
|
|
Bob can do this process for multiple resources and for every user that visits
|
|
his web site, collecting browser history information on all of them. Since
caches exist solely to boost performance by avoiding repeated downloads of
content that has already been fetched, the timing difference they introduce
cannot simply be removed, which makes timing attacks very hard to prevent.
Countermeasures
|
|
either cause a massive slowdown when browsing the web due to the ubiquity of
|
|
caches, or imply a substantial change in user agent design.
|
|
|
|
\citet{feltenTimingAttacksWeb2000} were the first to conduct a study on the
|
|
feasibility of cache timing attacks and concluded that accuracy in determining
|
|
whether a file has been loaded from cache or downloaded from a server is
|
|
generally very high ($>95$\%). Furthermore, they evaluated a host of
|
|
countermeasures such as turning off caching, altering hit or miss performance
|
|
and turning off Java and JavaScript but concluded that they were unattractive
|
|
or at worst ineffective. They propose a partial remedy for cache timing by
|
|
introducing \emph{Domain Tagging} which requires that resources are tagged with
|
|
the domain they have initially been loaded from. Once another web site wants to
|
|
determine whether a user has visited a site before by cross-loading a resource,
|
|
the domain does not match the tagged domain on the resource. If that is the
|
|
case, the initial cache hit gets transformed into a cache miss and the resource
|
|
has to be downloaded again, fooling the attacker into believing that the origin
|
|
web site has not been visited before. It is necessary to mention that at the
|
|
time (2000) \glspl{CDN} were not as widely used as today. Since web sites rely
|
|
on \glspl{CDN} to cache resources that are used on multiple sites and can thus
|
|
be served much faster from cache, domain tagging would effectively nullify the
|
|
performance boost a \gls{CDN} provides by converting every cache hit into a
|
|
cache miss. The authors themselves question the effectiveness of such an
|
|
approach.
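
The idea behind domain tagging can be captured in a short Python sketch. The class \texttt{DomainTaggedCache} and its method are hypothetical. In particular, the sketch assumes that an existing tag is left unchanged when a foreign domain requests the resource, a detail the description above does not fix.

```python
class DomainTaggedCache:
    """Cache in which every entry remembers the domain that first loaded it."""

    def __init__(self):
        self.entries = {}   # resource URL -> domain of the page that loaded it

    def load(self, resource_url, page_domain):
        """Return 'hit' only if the same domain loaded the resource before."""
        tagged_domain = self.entries.get(resource_url)
        if tagged_domain == page_domain:
            return "hit"
        # Unknown resource or foreign domain: behave like a cache miss and
        # leave an existing tag untouched (an assumption of this sketch).
        self.entries.setdefault(resource_url, page_domain)
        return "miss"


cache = DomainTaggedCache()
first = cache.load("https://charlie.com/logo.png", "charlie.com")    # first visit
repeat = cache.load("https://charlie.com/logo.png", "charlie.com")   # same site again
probe = cache.load("https://charlie.com/logo.png", "bob.com")        # Bob's probe
```

Bob's probe is answered as a miss although the file is cached, so the timing no longer reveals the earlier visit. The same rule also turns every load of a shared \gls{CDN} resource from a second site into a miss, which is the performance cost discussed above.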
|
|
|
|
Because the attack presented by \citet{feltenTimingAttacksWeb2000} relies on
|
|
being able to accurately time resource loading, a reliable network is needed.
|
|
Today a sizeable portion of internet activity comes from mobile devices which
|
|
are often not connected via cable but wirelessly.
|
|
\citet{vangoethemClockStillTicking2015} have therefore proposed four new
|
|
methods to accurately time resource loading over unstable networks. By using
|
|
these improved methods, they managed to determine whether a user is a member of
|
|
a particular age group (in this case between 23 and 32). The authors also ran
|
|
their attacks against other web sites (LinkedIn, Twitter, Google and
|
|
Amazon), successfully extracting sensitive information on users. The research
|
|
discussed so far has not tackled the problem from a quantitative perspective
|
|
but instead focused on individual cases. Due to this missing piece,
|
|
\citet{sanchez-rolaBakingTimerPrivacyAnalysis2019} conducted a survey on 10K
|
|
web sites to determine how feasible it is to perform a history sniffing attack
|
|
on a large scale. Their tool \textsc{BakingTimer} collects timing information
|
|
on \gls{HTTP} requests, checking for logged-in status and sensitive data. Their
|
|
results show that 71.07\% of the surveyed web sites are vulnerable to the
|
|
attack.
|
|
|
|
\subsection{Cache Control Directives}
|
|
\label{subsec:cache control directives}
|
|
|
|
Cache Control Directives can be supplied in the Cache-Control \gls{HTTP} header,
|
|
allowing rules about storing, updating and deletion of resources in the cache to
|
|
be defined. Cache Control Directives make heavy use of \emph{\glspl{ETag}}
|
|
\cite{fieldingHTTPETag} and \emph{Last-Modified \gls{HTTP} Headers}
|
|
\cite{fieldingHTTPLastModified} to determine whether a cached resource is stale
|
|
and needs to be updated. Commonly, a collision-resistant hash function is used
|
|
to generate a unique hash of a resource, which is sent along with the
resource in the first \gls{HTTP} response. The resource and the hash (which is
stored in the \gls{ETag} header) are then cached by the client. On subsequent
|
|
retrievals of the same \gls{URL}, the client checks for an expiration date on
|
|
the requested \gls{URL} via the Cache-Control and Expires headers. If the
|
|
\gls{URL} has expired, the client sends a request with the \emph{If-None-Match}
|
|
field set to the \gls{ETag}. The server then compares the \gls{ETag} received
from the client with the generated \gls{ETag} of the resource on the server side.
|
|
If the two values match (i.e., the resource has not changed), the server can
|
|
send back an \gls{HTTP} 304 Not Modified status. Otherwise, the answer contains
|
|
a full \gls{HTTP} response with the modified resource and the newly generated
|
|
\gls{ETag}, which the client can cache again. Usage of \glspl{ETag} can
|
|
therefore improve performance and cache consistency while at the same time
|
|
reducing bandwidth usage.
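
The validation exchange described above can be sketched from the server's point of view. The functions \texttt{make\_etag} and \texttt{handle\_request} are hypothetical, and the use of SHA-256 is only one possible choice of hash function. The sketch handles a single resource and ignores expiration dates.

```python
import hashlib


def make_etag(body):
    """Derive a validator from the resource content (quoted, as ETags are)."""
    return '"' + hashlib.sha256(body).hexdigest() + '"'


def handle_request(body, if_none_match=None):
    """Return (status, headers, payload) for a GET on a single resource."""
    etag = make_etag(body)
    if if_none_match == etag:
        # The client's cached copy is still current: no payload needed.
        return 304, {"ETag": etag}, b""
    return 200, {"ETag": etag}, body


resource = b"<html>version 1</html>"
status1, headers1, _ = handle_request(resource)                      # first request
status2, _, payload2 = handle_request(resource, headers1["ETag"])    # revalidation

resource = b"<html>version 2</html>"                                 # resource changes
status3, headers3, payload3 = handle_request(resource, headers1["ETag"])
```

The revalidation is answered with status 304 and an empty payload, while the request after the change receives the full resource together with a new validator.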
|
|
|
|
As with most other tracking methods, unique identifiers can be stored inside
|
|
the \gls{ETag} header because it offers a storage capacity of 81864 bits. Once
|
|
the identifier has been placed in the \gls{ETag} header, the server can always
answer requests that check for an updated resource with an \gls{HTTP} 304
Not Modified status, effectively persisting the unique identifier in the
client's cache. During their 2011 survey of QuantCast.com's top 100 U.S.-based
|
|
web sites, \citet{ayensonFlashCookiesPrivacy2011} found \texttt{hulu.com} to be
|
|
using \glspl{ETag} as backup for tracking cookies that are set by
|
|
\texttt{KISSmetrics} (an analytics platform). This allowed cookies to be
|
|
respawned once they had been cleared by checking the \gls{ETag} header.
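
A tracking server that misuses the validator in this way needs very little logic. The following Python sketch is a simplified model and not the implementation observed in the survey. The function \texttt{tracking\_response} and the variable names are hypothetical.

```python
import uuid


def tracking_response(if_none_match=None):
    """Return (status, headers) for a resource whose ETag doubles as an identifier."""
    if if_none_match is not None:
        # Returning visitor: the validator sent by the browser is the identifier.
        # Answering 304 keeps the cached entry, and with it the identifier, alive.
        return 304, {"ETag": if_none_match}
    # First visit: mint an identifier and hand it out as the ETag.
    return 200, {"ETag": '"' + uuid.uuid4().hex + '"'}


status_first, headers_first = tracking_response()
browser_cache_etag = headers_first["ETag"]        # stored by the browser cache

# Clearing cookies does not touch the cache, so the next request still
# carries If-None-Match with the identifier.
status_again, headers_again = tracking_response(browser_cache_etag)
```

The identifier returned on the second request is the one issued on the first visit, so the server can use it to set a deleted tracking cookie again.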
|
|
|
|
\subsection{DNS Cache}
|
|
\label{subsec:dns cache}
|
|
|
|
The \gls{DNS} is used every time a connection to a server has to be established.
|
|
Its main function is to map hard-to-remember \gls{IP} addresses to easily
|
|
memorizable domain names and vice versa. The hierarchical structure of the
|
|
system is particularly suited for caching as requests do not have to be passed
|
|
all the way to the top of the hierarchy but can possibly be answered by lower
|
|
level \gls{DNS} servers. Caching is especially beneficial when name resolution
|
|
is recursive, allowing servers to quickly respond to requests without having to
|
|
consult higher level ones. Since name resolution servers change less often the
|
|
closer they are to the root node, the probability of having a cache hit
|
|
increases with every step in the hierarchy. Caching also happens at the lower
|
|
levels of the hierarchy, directly at the client in multiple ways. The host
|
|
operating system has its own cache that applications can ask for name
|
|
resolution. Some applications introduce another layer of caching by having their
|
|
own cache (e.g., browsers).
|
|
|
|
\citet{kleinDNSCacheBasedUser2019} demonstrate a tracking method which uses
|
|
\gls{DNS} caches to assign unique identifiers to client machines. In order for
|
|
the technique to work, the tracker has to have control over one web server (or
|
|
multiple) as well as an authoritative \gls{DNS} server which associates the web
|
|
servers with a domain name under the control of the tracker. The tracking
|
|
process starts once a user agent requests a web site which loads a script from
|
|
one of the web servers the attacker is controlling. The process can then be
|
|
sketched out as follows (see \cite[p.~5]{kleinDNSCacheBasedUser2019} for a
|
|
detailed description).
|
|
|
|
\begin{enumerate}
|
|
\item The script loads a resource from multiple domains (\texttt{1.ex.com},
|
|
\texttt{2.ex.com}, ...) the tracker controls.
|
|
\item The \gls{OS} forwards a \gls{DNS} request to the configured name
|
|
server.
|
|
\item The name server receives and caches multiple, randomly ordered
|
|
\gls{IP} addresses for the domain from the tracker-controlled \gls{DNS}
|
|
server. The results are forwarded to the \gls{OS}.
|
|
\item The \gls{OS} caches the result as well and the browser makes
|
|
\gls{HTTP} requests to (in most cases) the first \gls{IP} address.
|
|
\item The web server responsible for the \gls{IP} address responds with a
|
|
number. Other web servers would respond with a different number.
|
|
\item The script then constructs an identifier from the received values.
|
|
\end{enumerate}
|
|
|
|
Due to the random ordering of the received \gls{IP} addresses from the
|
|
authoritative \gls{DNS} server, the identifier that is assembled by the script
|
|
is unique and thus allows identification of not only the browser but the client
|
|
machine itself.
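
A strongly simplified Python model of this process is given below. It assumes two web servers per domain, each answering with a fixed number, so that every domain contributes one bit to the identifier. The class \texttt{CachingResolver}, the function \texttt{build\_identifier}, the \gls{IP} addresses and the number of domains are assumptions made for illustration and do not reproduce the parameters of the original work.

```python
import random


class CachingResolver:
    """Stand-in for the OS or resolver cache in front of the tracker's DNS server."""

    def __init__(self, rng):
        self.rng = rng
        self.cache = {}

    def resolve(self, domain, addresses):
        # The authoritative server returns the addresses in random order;
        # the first answer is cached and reused for later lookups.
        if domain not in self.cache:
            shuffled = list(addresses)
            self.rng.shuffle(shuffled)
            self.cache[domain] = shuffled
        return self.cache[domain]


# Each address belongs to a web server that answers with a fixed number.
SERVER_ANSWERS = {"192.0.2.1": 0, "192.0.2.2": 1}


def build_identifier(resolver, domain_count=16):
    """Collect one bit per domain from the server behind the first address."""
    bits = []
    for i in range(1, domain_count + 1):
        ordered = resolver.resolve(f"{i}.ex.com", SERVER_ANSWERS.keys())
        first_address = ordered[0]                 # browsers usually pick the first
        bits.append(str(SERVER_ANSWERS[first_address]))
    return "".join(bits)


machine = CachingResolver(random.Random(1))
identifier_first_visit = build_identifier(machine)
identifier_other_browser = build_identifier(machine)   # same machine, same cache
```

As long as the cached answers are still valid, every lookup on the same machine yields the same ordering and therefore the same bit string, regardless of which browser runs the script.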
|
|
|
|
Advantages of this tracking method are that it works across browsers in most
|
|
cases. \citet{kleinDNSCacheBasedUser2019} found that it survives browser
|
|
restarts and is resistant to the privacy mode employed by modern browsers.
|
|
Furthermore, \glspl{VPN} do not affect the method and it works with different
|
|
protocols (\gls{HTTPS}, \gls{IPv6}, \gls{DNSSEC}).
|
|
|
|
Disadvantages are that the tracking identifiers do not survive a computer
|
|
restart. Additionally, switching the network causes the identifiers to be
|
|
obsoleted and identifiers generally only live as long as the \gls{TTL} limit
|
|
allows.
|
|
|
|
Due to these constraining factors, \gls{DNS} tracking is best combined with
|
|
methods that are intended for tracking over longer time periods, such as
cookies.
|
|
|
|
\subsection{TLS Session Resumption}
|
|
\label{subsec:tls session resumption}
|
|
|
|
\gls{TLS} \cite{rescorlaTransportLayerSecurity2018} is widely used today to
|
|
securely encapsulate communication across the web. For the secured communication
|
|
to work, client and server first have to authenticate themselves and then agree
|
|
on protocol version, cipher suite and compression method. The exchange of this
|
|
information at the beginning of a connection is called a \emph{handshake}.
|
|
Figure~\ref{fig:tls-handshake} shows how the initial handshake is performed
|
|
after which both the client and the server are ready for sending and receiving
|
|
application data. For bandwidth savings and better performance, it is possible
|
|
to cache a \gls{TLS} session to allow reusing an already established secure
|
|
connection at a later point in time. Versions prior to \gls{TLS} 1.3 used two
|
|
mechanisms to accomplish this: \gls{TLS} session identifiers and session
|
|
tickets. Session identifiers are sent by the server along with the initial
|
|
handshake with the user agent. The identifier is randomly generated and saved by
|
|
the server so that the current session can be found later. To resume a session,
|
|
the user agent sends the identifier with the \emph{ClientHello} message to the
|
|
server. The server can then match the identifier to the previously initiated
|
|
session and responds with the same session identifier to signal to the user
|
|
agent that the session can be resumed. Session tickets are only issued by the
|
|
server when the client has expressed support for them. They are encrypted and
|
|
provided by the server after a successful handshake via an out-of-band message.
|
|
The ticket contains all the necessary information to reestablish a secure
|
|
connection. When the user agent wishes to resume a connection, the session
|
|
ticket is sent along with the first \emph{ClientHello} message and the server
|
|
can decrypt the ticket and resume the session.
|
|
|
|
\begin{figure}
|
|
\begin{center}
|
|
\includegraphics[width=0.75\textwidth]{figures/tls-handshake.png}
|
|
\caption{A \gls{TLS}-handshake between a client and a server. First, the
|
|
client sends a \emph{ClientHello} message to the server which the
|
|
server has to answer with a \emph{ServerHello} message or else the
|
|
connection fails. These two initial messages establish protocol
|
|
version, session ID, cipher suite and compression method
|
|
\cite[p.~44]{rescorlaTransportLayerSecurity2008}. The server also
|
|
checks for a session resumption. If the client sends a session ID
|
|
with the \emph{ClientHello} message, the server knows that it should
|
|
resume a previously established connection. The next three messages
|
|
are used for the key exchange which allows client and server to
|
|
authenticate themselves.}
|
|
\label{fig:tls-handshake}
|
|
\end{center}
|
|
\end{figure}
|
|
|
|
In \gls{TLS} version 1.3 \cite{rescorlaTransportLayerSecurity2018} the session
|
|
identifiers and tickets have been replaced with a \gls{PSK}. Instead of sending
|
|
a ticket which is not encapsulated in the \gls{TLS}-secured connection, a
|
|
\gls{PSK} identity is sent from the server after the initial handshake, usually
|
|
avoiding out-of-band communication. The \gls{PSK} identity provides a mechanism
|
|
by which information associated with a secure connection (certificates, keys)
|
|
can be restored.
|
|
|
|
Because resuming a connection reuses information that has been exchanged before
|
|
to establish secure communication, individual sessions can be linked together
|
|
to form a history of information exchanges. This tracking method is described
|
|
by \citet{syTrackingUsersWeb2018}. Even though tracking via \gls{TLS} session
resumption can be mitigated by restarting the browser, because that clears the
session cache, the authors argue that the attack remains viable because mobile
devices often stay online for long periods without restarts. Furthermore,
despite browsers imposing
|
|
limits on the lifetime of session identifiers and \glspl{PSK}, it is possible
|
|
to maintain a session indefinitely by carrying out a \emph{prolongation
|
|
attack}. \citet{syTrackingUsersWeb2018} define a prolongation attack as an
|
|
attack where the client asks for a session resumption by sending the identifier
|
|
of a previously initiated connection and the server responds with a new
|
|
handshake instead of resuming the old one. This effectively resets the time
|
|
limit as long as the user is initiating new (or trying to resume old)
|
|
connections to the server within the imposed time limit.
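
The effect of a prolongation attack can be illustrated with a toy server in Python. No actual \gls{TLS} handshake takes place: the sketch only models which identifier the server issues and which user record it attaches to it. The class \texttt{TrackingServer}, the lifetime of 24 hours and the time stamps are assumptions made for illustration.

```python
import itertools
import uuid


class TrackingServer:
    """Toy server that links TLS sessions through their resumption identifiers."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # how long a client keeps an identifier
        self.sessions = {}         # session identifier -> (user record, issue time)
        self.user_numbers = itertools.count()

    def handshake(self, now, offered_id=None):
        """Return (new session identifier, user record) for one connection."""
        known = self.sessions.get(offered_id)
        if known is not None and now - known[1] <= self.lifetime:
            user = known[0]                    # same client as before
        else:
            user = f"user-{next(self.user_numbers)}"
        # Prolongation: always perform a full handshake and issue a fresh
        # identifier, which restarts the client's lifetime limit.
        new_id = uuid.uuid4().hex
        self.sessions[new_id] = (user, now)
        return new_id, user


server = TrackingServer(lifetime=24)           # hours, an assumed browser limit
sid, user_day0 = server.handshake(now=0)
sid, user_day1 = server.handshake(now=20, offered_id=sid)
sid, user_day2 = server.handshake(now=40, offered_id=sid)   # 40 h after first contact
expired_id, user_late = server.handshake(now=100, offered_id=sid)
```

Although the third connection happens 40 hours after the first one, it is still attributed to the same user record because each handshake renewed the identifier. Only the gap of 60 hours before the last connection exceeds the limit and breaks the chain.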
|
|
|
|
The authors present an empirical evaluation of server and browser configurations
|
|
with respect to session resumption lifetime by crawling the top 1M web sites as
|
|
determined by Alexa. Their results indicate that only 4\% of those sites do not
|
|
allow session resumption at all, while the majority (78\%) allows session
|
|
identifiers as well as tickets.
|