\chapter{Tracking Methods}
\label{chap:tracking methods}

This chapter will go into detail about various tracking methods that have been used during the history of the web. It is important to note that some of these approaches to tracking date back to when the World Wide Web was still in its early development stages. Knowing where the techniques come from helps in correctly judging the impact they had and still have on the Internet as we use it today. Furthermore, knowledge about the past allows for better predictions of future changes in the tracking ecosystem. To aid in understanding how they work and where they fit in the tracking landscape, three different categories are identified and presented: session-based, storage-based and cache-based tracking methods. Each category uses different mechanisms and technologies to enable tracking of users. What most of them have in common is that they try to place unique identifiers in different places, which can then be read on subsequent visits. Thus, a chronological ordering of events enables interested parties to infer not only usage statistics but also specific data about the entities behind those identifiers.

\section{Session-based Tracking Methods}
\label{sec:session-based tracking methods}

One of the simplest and most used forms of tracking on the Internet relies on sessions. Since \gls{HTTP} is a stateless protocol, web servers cannot by default keep track of any previous client requests. In order to implement specific features such as personalized advertising, some means of saving the current state and recalling previous states must be used. For this functionality, sessions were introduced. Sessions represent a temporary and interactive exchange of information between two parties. Due to their temporary nature, they have to be `brought up' at some point and `torn down' at a later point in time. It is not specified, however, how long the period between establishing and terminating a session has to be. It could last only for a single browser session and be terminated by the user manually, or it could last as long as a year.

\subsection{Passing Information in URLs}
\label{subsec:passing information in urls}

\glspl{URL} were first proposed by Berners-Lee in 1994 \cite{berners-leeUniformResourceLocators1994} and are based on \glspl{URI} \cite{berners-leeUniversalResourceIdentifiers1994}. The latter specifies a way to uniquely identify a particular resource. The former extends the \gls{URI} specification to include where and how a particular resource can be found. \glspl{URI} consist of multiple parts:
\begin{enumerate}
\item a scheme (in some cases a specific protocol),
\item an optional authority (network host or domain name),
\item a path (a specific location on that host),
\item an optional query and
\item an optional fragment preceded by a hash sign (\texttt{\#}), a sub-resource pointing to a specific location within the resource
\end{enumerate}
To access a section called \texttt{introduction} in a blog post named \texttt{blogpost} on a host with the domain name \texttt{example.com} over \gls{HTTP}, a user might use the following \gls{URI}:
\begin{verbatim}
http://example.com/blogpost/#introduction
\end{verbatim}
Even though \glspl{URI} and \glspl{URL} are two different things, they are mostly used interchangeably today. Non-technical people especially refer to an address on the \gls{WWW} simply as a \gls{URL}. The optional query component in most cases consists of multiple \texttt{(key,value)} pairs, separated by delimiters such as \texttt{\&} and \texttt{;}.
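For example, a unique identifier could be transported alongside an ordinary preference in the query component of the following hypothetical \gls{URL} (the parameter names \texttt{lang} and \texttt{uid} are purely illustrative):
\begin{verbatim}
http://example.com/blogpost/?lang=en&uid=f3a9c2d1
\end{verbatim}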
In the tracking context, query parameters can be used to pass information (e.g., unique identifiers) to the resource that is to be accessed by appending a unique string to all the links within the downloaded page. Since requests to pages are generally logged by the server, requesting multiple pages with the same unique identifier leaves a trail behind that can be used to compile a browsing history. The information shared this way is not limited to unique identifiers. \gls{URL} parameters can also be used to pass the referrer of a web page containing a query that has been submitted by the user. \citet{falahrastegarTrackingPersonalIdentifiers2016} demonstrate such an example, where an advertisement tracker logs a user's browsing history by storing the referrer in a \texttt{(key,value)} pair \cite[p.~37]{falahrastegarTrackingPersonalIdentifiers2016}. Other possibilities include encoding geographical data, network properties, user information (e.g., e-mail addresses) and authentication credentials. \citet{westMeasuringPrivacyDisclosures2014} conducted a survey concerning the use of \gls{URL} query strings and found the practice to be in widespread use on the web.

\subsection{Hidden Form Fields}
\label{subsec:hidden form fields}

The \gls{HTML} standard provides a specification for form elements \cite{whatwgFormsHTMLStandard2020}, which allow users to submit information (e.g., for authentication) to the server via the POST or GET method. Normally, a user would input data into a form, and on clicking \emph{submit} the input would be sent to the server. Sometimes it is necessary to include additional information that the user did not enter. For this reason there exist \emph{hidden} form fields \cite{whatwgHiddenStateHTML2020}. Hidden form fields are not displayed on the web site, and the user therefore cannot enter any information into them. Similarly to \gls{URL} parameters, the value attribute of a hidden field contains additional information, such as the user's preferred language. Since almost anything can be sent in a value attribute, hidden form fields present another way to maintain a session. A field containing a unique identifier will be sent along with the data the user has submitted to the server. The server can then match the action the user took with the identifier. In case the server already knows that specific identifier from a previous interaction with the user, the newly gained information can be added to the user's browsing profile. An example of a hidden web form is given in Listing~\ref{lst:hidden web form}, which has been adapted from \cite{InputFormInput}. In Line 15 a hidden form field is created and its \texttt{value} attribute is set by the server to contain a unique user identifier. Once the \emph{submit} button has been clicked, the identifier is sent to the server along with the data the user has filled in.

\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{html}{code/hidden-web-form.html}
\caption{Example of an \gls{HTML} form containing a hidden field with \texttt{id=userId}. The id is set by the web server dynamically so that every visitor has a unique identifier attached to the form.}
\label{lst:hidden web form}
\end{listing}

\subsection{HTTP Referer}
\label{subsec:http referer}

Providers of web services often want to know where visitors to their web site come from in order to understand more about their users and their browsing habits.
The \gls{HTTP} specification accounts for this by introducing the \emph{\gls{HTTP} Referer field} [\emph{sic}] \cite{fieldingHTTPSemanticsContent2014} in the header. By checking the referrer, the server can see where the request came from. In practice, when a user clicks on a link on a web page, the \gls{URL} of the current web page is placed in the \gls{HTTP} Referer field. The header with the referrer information is attached to the \gls{HTTP} request which is sent to the server. The server responds with the requested web page and can establish a link from the original web page to the new web page. When applied to a majority of the requests on a site, the resulting data can be analyzed for promotional and statistical purposes. \citet{malandrinoPrivacyAwarenessInformation2013} have shown that the \gls{HTTP} Referer is one of the most critical factors in leaking \gls{PII}; leakage of information relating to users' health was identified as the most severe in terms of identifiability of users on the web.
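The following hypothetical request illustrates the mechanism (the domain names are purely illustrative): by fetching an advertising script from a third party, the browser reveals to \texttt{tracker.example} exactly which page the user is currently reading.
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize]{http}
GET /ad.js HTTP/1.1
Host: tracker.example
Referer: http://news.example/articles/sensitive-topic
\end{minted}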
\subsection{Explicit Authentication}
\label{subsec:explicit authentication}

Explicit authentication requires a user to \emph{explicitly} log in or register on the web site. This way, specific resources are only available to users after they have authenticated themselves to the service. Actions taken on an authenticated user account are tied to that account, and compiling a personal profile is more or less a built-in function in this case. Since merely asking a user to authenticate is such a simple method, the extent to which it can be used is limited. Users are generally not logged in across different browser sessions unless cookies are used for this purpose (see section~\ref{subsec:http cookies}), therefore limiting tracking to one session at a time. Furthermore, always requiring a logged-in state can be tiring for users, because they have to authenticate every time they visit a particular service. This can potentially pose a usability problem where users simply stop using the service or go to considerable lengths to avoid logging in; this largely depends on a cost-benefit analysis the users subconsciously undertake. The third shortcoming of this method concerns the user's awareness of being tracked. Since tracking users depends on them actively logging in to the service, tracking them without their knowledge is impossible. Even though most tracking efforts are not detected by the average user, it is known that actions taken on an account are logged to provide better service through service optimization and profile personalization. Creating an account on a web site to use its services to their full extent can be beneficial in some cases. Facebook, for example, allows its users to configure what they want to share with the public and with their friends. Research has shown, however, that managing which posts get shown to whom is not as straightforward as one might think. \citet{liuAnalyzingFacebookPrivacy2011} conducted a survey where they asked Facebook users about their desired privacy and visibility settings and cross-checked them with the actual settings used for their posts. The results showed that in only 37\% of cases did the users' expectations match reality. Additionally, 36\% of content is left on the default privacy settings, which set the visibility of posts to public, meaning that any Facebook user can view them.

\subsection{window.name DOM Property}
\label{subsec:window.name dom property}

The \gls{DOM} is a platform- and language-agnostic \gls{API} which defines the logical structure of web documents (i.e., \gls{HTML}, \gls{XHTML} and \gls{XML}) and the way they are accessed and manipulated. The \gls{DOM} was originally introduced by Netscape, at the same time as JavaScript, as \gls{DOM} Level 0. The first recommendation (\gls{DOM} Level 1) was released in 1998 by the \gls{W3C} \gls{DOM} working group \cite{w3cDocumentObjectModel1998}, which published its final recommendation (\gls{DOM} Level 3) in 2004. Since then, the \gls{WHATWG} has taken over and in 2015 published the \gls{DOM} Level 4 standard \cite{whatwgDOMLivingStandard2020}, which replaces the Level 3 specification. The \gls{DOM} organizes all objects in a document in a tree structure, which allows individual parts to be altered when a specific event happens (e.g., a user interaction). Furthermore, each object has properties which are either applied to all \gls{HTML} elements or only to a subset of them. One useful property for tracking purposes is the \texttt{window.name} property \cite{whatwgWindowNameHTML2020}. Its original intention was to allow client-side JavaScript to get or set the name of the current window. Since windows do not have to have names, the \texttt{window.name} property is mostly used for setting targets for hyperlinks and forms. Modern browsers allow storing up to two megabytes of data in the \texttt{window.name} property, which makes it a viable option for use as a data store or---more specifically---for maintaining session variables. In order to store multiple variables in the \texttt{window.name} property, the values first have to be serialized in some way because only a single string is allowed. A \gls{JSON} serializer converts a JavaScript object holding the variables into a \gls{JSON} string, which can then be stored in the \gls{DOM} property and later parsed back into an object. Normally, the same-origin policy prohibits scripts from one domain from accessing data belonging to another domain, but the \texttt{window.name} property survives page reloads and remains readable even after the window navigates to a different domain. Maintaining a session across domains and without cookies is therefore possible, and multiple implementations exist \cite{frankSessionVariablesCookies2008,zypWindowNameTransport2008}.
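A minimal sketch of this technique in client-side JavaScript might look as follows (the variable and function names are illustrative; the cited implementations add further conveniences and safeguards):
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize]{javascript}
// Serialize several session variables into the single string that
// window.name can hold; the value survives navigations, including
// navigations to pages on other domains.
function saveSession(vars) {
  window.name = JSON.stringify(vars);
}

// Restore the session variables on the next page; if window.name
// does not contain valid JSON, fall back to an empty session.
function loadSession() {
  try {
    return JSON.parse(window.name) || {};
  } catch (e) {
    return {};
  }
}

saveSession({ uid: "f3a9c2d1", lastPage: location.href });
\end{minted}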
\section{Storage-based Tracking Methods}
\label{sec:storage-based tracking methods}

Storage-based tracking methods differ from session-based tracking methods in that they try to store information on the client's computer not only for single sessions but for as long as desired. The following methods can be used to store session data as well but are not limited to that use case. They generally enable more advanced tracking approaches because they have access to information about the current browser instance and the operating system the browser is running on. Since they reside on the user's computer, they are in most cases harder to circumvent, especially when two or more methods are combined, resulting in better resilience against simple defenses.

\subsection{HTTP Cookies}
\label{subsec:http cookies}

The method most often associated with tracking on the Internet is tracking with \gls{HTTP} cookies. Cookies are small files that are placed in the browser's storage on the user's computer. They are limited to four kilobytes in size and are generally used to identify and authenticate users and to store web site preferences. They were introduced to the web to allow stateful information to be stored, because \gls{HTTP} is a stateless protocol and therefore does not have this capability. Cookies are also a way of reducing the server's load, since states do not have to be recomputed every time a user visits a web site. Shopping cart functionality, for example, can be implemented by setting a cookie in the user's browser that saves the items currently added to the shopping cart, giving the user the possibility to resume shopping at a later point, provided that they do not delete their cookies. With the introduction of cookies, advertising companies could reidentify users by placing unique identifiers in the browser and reading them on subsequent visits. The first standard for cookies was published in 1997 \cite{kristolHTTPStateManagement1997} and has since been updated multiple times \cite{kristolHTTPStateManagement2000,barthHTTPStateManagement2011}.

Cookies can be divided into two categories: first party cookies, which are created by the domain the user has requested, and third party cookies, which are placed in the user's browser by other domains that are generally not under the control of the first party \cite{barthThirdPartyCookies2011}. Whereas first party cookies are commonly used not for tracking but, for example, for the aforementioned shopping cart functionality or for enabling e-commerce applications to function properly, third party cookies are popular with data brokerage firms (e.g., Datalogix, Experian, Equifax), online advertisers (e.g., DoubleClick) and---belonging to both of these categories in some cases---social media platforms (e.g., Facebook) \cite{cahnWhatCommunityCookie2016}. The distinction between these two categories is not always clear, however. Google Analytics, for example, is considered a third party but offers its analytics services by setting a first party cookie in the user's browser in addition to loading JavaScript snippets from its servers. Therefore, categorizing cookies into those that serve third party web content and those that serve first party web content presents a more adequate approach.

Cookies are set either by calling scripts that are embedded in a web page (e.g., Google's \texttt{analytics.js}) or by using the \gls{HTTP} Set-Cookie response header. Once a request to a web server has been issued, the server can set a cookie in the Set-Cookie header of the response it sends back to the client. On the client's side, the cookie is stored by the browser and sent with subsequent requests to the same domain via the Cookie \gls{HTTP} header. An example of a cookie header is given in Listing~\ref{lst:session cookie header}. Because this example does not set an expiration date for the cookie, it sets a session cookie. Session cookies are limited to the current session and are deleted as soon as the session is `torn down'. By adding an expiration date (demonstrated in Listing~\ref{lst:permanent cookie header}) or a maximum age, the cookie becomes permanent. Additionally, the Domain attribute can be specified; cookies which list a domain other than the origin are rejected by the user agent \cite[section~4.1.2.3]{barthHTTPStateManagement2011}. The same-origin policy applies to cookies, disallowing access by other domains.
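Returning to the first of these two mechanisms, a script embedded in the page can set a cookie through the \texttt{document.cookie} interface. The following minimal sketch (the cookie name \texttt{uid} and its value are illustrative) sets a permanent first party cookie by specifying a maximum age of one year:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize]{javascript}
// Set a cookie carrying an illustrative unique identifier;
// max-age (in seconds) turns it into a permanent cookie.
document.cookie = "uid=f3a9c2d1; max-age=" + 60 * 60 * 24 * 365 + "; path=/";

// On a later visit, the identifier can be read back from
// document.cookie, which lists all cookies visible to this origin.
console.log(document.cookie); // e.g. "uid=f3a9c2d1"
\end{minted}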
\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/session-cookie-header}
\caption{An example of an \gls{HTTP} header setting a session cookie.}
\label{lst:session cookie header}
\end{listing}

\begin{listing}
\inputminted[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize,linenos]{http}{code/permanent-cookie-header}
\caption{An example of an \gls{HTTP} header setting a permanent cookie.}
\label{lst:permanent cookie header}
\end{listing}

Tracking and non-tracking cookies can be distinguished with high accuracy by observing their expiration time and the length of their value field. \citet{liTrackAdvisorTakingBack2015} demonstrate a supervised learning approach to detecting tracking cookies with their tool \emph{TrackAdvisor}. They found that tracking cookies generally have a longer expiration time than non-tracking cookies and need a sufficiently long value field to carry the unique identifier. Using this method, they found that only 10\% of tracking cookies have a lifetime of a single day or less, while 80\% of non-tracking cookies expire before a day is over. Additionally, 80\% of tracking cookies have a value field longer than 35 characters, whereas 80\% of non-tracking cookies have a value field shorter than 35 characters. \emph{Cookie Chunking}, where a long cookie is split into multiple smaller cookies, did not appear to negatively affect detection by their method. They also present a site measurement of the Alexa Top 10,000 web sites, finding that 46\% of web sites use third party tracking.

More recent research \cite{gonzalezCookieRecipeUntangling2017} has shown that tracking cookies do not have to be long lasting to accumulate data about users. Some cookies---like the \texttt{\_\_utma} cookie from Google Analytics, for example---save a timestamp of the current visit together with the unique identifier, which allows short-lived cookies to be used in series to complete the whole picture. \citet{gonzalezCookieRecipeUntangling2017} have also found 20\% of observed cookies to be \gls{URL} or base64 encoded, making decoding of cookies a necessary step for analysis. Furthermore---and contrary to previous work---cookie values come in a far greater variety than is assumed by approaches that only try to detect tracking cookies by their expiration date and/or character length. They also present an entity-based matching algorithm to dissect cookies which contain more than a unique identifier. This allows for a better understanding and interpretation of complex cookies as they are found in advertising networks with large reach (e.g., doubleclick.net). This information is particularly useful for building applications that effectively detect and block cookies (see chapter~\ref{chap:defenses against tracking}).
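Purely as an illustration, the two thresholds reported by \citet{liTrackAdvisorTakingBack2015} can be turned into a naive heuristic. The sketch below should not be mistaken for TrackAdvisor itself, which uses supervised learning over more features; the cookie record format is hypothetical, as a browser extension might collect it:
\begin{minted}[frame=lines,framesep=2mm,bgcolor=light-gray,baselinestretch=1.2,fontsize=\scriptsize]{javascript}
// Naive heuristic derived from the reported statistics: flag a
// cookie as a likely tracking cookie if it lives longer than one
// day and its value is long enough to hold a unique identifier.
const ONE_DAY_MS = 24 * 60 * 60 * 1000;

function looksLikeTrackingCookie(cookie) {
  const lifetime = cookie.expires - cookie.created; // in milliseconds
  return lifetime > ONE_DAY_MS && cookie.value.length >= 35;
}

// Hypothetical cookie record with a 36-character value and a
// lifetime of six months; the heuristic flags it as tracking.
const example = {
  name: "id",
  value: "f3a9c2d1e5b7a9c2d1e5b7a9c2d1e5b7a9c2",
  created: Date.now(),
  expires: Date.now() + 180 * ONE_DAY_MS,
};

console.log(looksLikeTrackingCookie(example)); // true
\end{minted}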
\subsection{Flash Cookies and Java JNLP PersistenceService}
\label{subsec:flash cookies and java jnlp persistenceservice}

Flash cookies \cite{adobeAdobeFlashPlatform} are similar to \gls{HTTP} cookies in that they too are a store of information that helps web sites and servers recognize already seen users. They are referred to as \glspl{LSO} by Adobe and are part of the Adobe Flash Player runtime. Instead of storing data in the browser's storage, they have their own storage in a different location on the user's computer. Another difference is that they are not limited to four kilobytes of data but can store 100 kilobytes, and they have no expiration date by default (\gls{HTTP} cookies live until the end of the session unless specified otherwise). Since Flash cookies are not created by means the browser normally supports (i.e., \gls{HTTP}, \gls{CSS}) but by Adobe's Flash Player runtime, browsers do not manage Flash cookies. Because Flash cookies are thus not tied to a specific browser, they function across browsers. This capability made them an interesting place for trackers to store their identifying information, because browsers initially did not support removing Flash cookies out of the box, and one had to manually set preferences in the \emph{Web Storage Settings panel} provided by the Flash Player runtime to get rid of them. Trackers were searching for a new way to store identifiers because users had become increasingly aware of the dangers posed by \gls{HTTP} cookies and reacted by taking countermeasures.

\citet{soltaniFlashCookiesPrivacy2009} were the first to report on the usage of Flash cookies by advertisers and popular web sites. While surveying the top 100 web sites at the time, they found that 54\% of them used Flash cookies. Some web sites were setting Flash cookies as well as \gls{HTTP} cookies with the same values, suggesting that Flash cookies serve as a backup to \gls{HTTP} cookies. Several web sites were found using Flash cookies to respawn already deleted \gls{HTTP} cookies, even across domains. \citet{acarWebNeverForgets2014} automated the detection of Flash cookies and of access to them by monitoring file access with the GNU/Linux \emph{strace} tool \cite{michaelStraceLinuxManual2020}. This allowed them to acquire data about Flash cookies respawning \gls{HTTP} cookies. Their results show that six of the top 100 sites use Flash cookies for respawning. Even though Flash usage has declined during the last few years thanks to the development of the HTML5 standard, \citet{buhovFLASH20thCentury2018} have shown that, despite major security flaws, Flash content was still served by 7.5\% of the top one million web sites in 2017. The W3Techs Web Technology Survey shows a similar trend and also offers an up-to-date measurement of 2.7\% of the top ten million web sites for the year 2020 \cite{w3techsHistoricalYearlyTrends2020}. Due to the security concerns with using Flash, Google's popular video sharing platform YouTube switched to the HTML5 player by default.