\documentclass[runningheads]{llncs}

\usepackage{graphicx}
\usepackage[backend=biber,style=numeric]{biblatex}
\usepackage{hyperref}
\usepackage{amsmath}
\usepackage{csquotes}

\hypersetup{
  colorlinks=true,
  linkcolor=black,
  urlcolor=blue,
  citecolor=black
}

\addbibresource{trustworthy-ai.bib}

\begin{document}

\title{Trustworthy Artificial Intelligence}
\author{Tobias Eidelpes}
\authorrunning{T. Eidelpes}
\institute{Technische Universität Wien, Karlsplatz 13, 1040 Wien, Austria
\email{e1527193@student.tuwien.ac.at}}

\maketitle

\begin{abstract}
The abstract should briefly summarize the contents of the paper in 150--250 words.

\keywords{Artificial Intelligence, Trustworthiness, Social Computing}
\end{abstract}

\section{Introduction}
\label{sec:introduction}

The use of artificial intelligence (AI) in computing has seen an unprecedented rise over the last few years. From humble beginnings as a tool to aid humans in decision making to advanced use cases where human interaction is avoided as much as possible, AI has transformed the way we live our lives today. The transformative capabilities of AI are not just felt in computer science, but have bled into a diverse set of other disciplines such as biology, chemistry, mathematics and economics. For the purposes of this work, AIs are machines that can learn, make decisions autonomously and interact with their environment~\cite{russellArtificialIntelligenceModern2021}.

While the possibilities of AI are seemingly endless, the public is slowly but steadily learning about its limitations. These limitations manifest themselves in areas such as autonomous driving and medicine, fields where AI can have a direct—potentially life-changing—impact on people's lives. A self-driving car operates on roads where accidents can happen at any time. Decisions made by the car before, during and after an accident can result in severe consequences for all participants. In medicine, AIs are increasingly used to drive human decision-making. The more critical the proper use and functioning of AI is, the more trust in its architecture and results is required. Trust, however, is not easily defined, especially in relation to artificial intelligence.

This work will explore the following question: \emph{Can artificial intelligence be trustworthy, and if so, how?} To be able to discuss this question, trust has to be defined and dissected into its constituent components. Section~\ref{sec:modeling-trust} analyzes trust and molds the gained insights into a framework suitable for interactions between humans and artificial intelligence. Section~\ref{sec:taxonomy} approaches trustworthiness in artificial intelligence from a computing perspective. There are various ways to make AIs more \emph{trustworthy} through technical means; that section discusses and summarizes important methods and approaches. Section~\ref{sec:social-computing} discusses combining humans and artificial intelligence into one coherent system that is capable of achieving more than either of its parts on its own.

\section{Trust}
\label{sec:modeling-trust}

In order to define the requirements and goals of \emph{trustworthy AI}, it is important to know what trust is and how we humans establish trust with someone or something. This section therefore defines and explores different forms of trust.

\subsection{Defining Trust}

Commonly, \emph{trusting someone} means to have confidence in another person's ability to do certain things.
This can mean that we trust someone to speak the truth to us or that a person competently does the things that we \emph{entrust} them to do. We trust the person delivering the mail to do so on time and without mail getting lost on the way to our doors. We trust people knowledgeable in a certain field, such as medicine, to advise us when we need medical advice. Trusting in these contexts means to cede control over a particular aspect of our lives to someone else. We do so in the expectation that the trustee does not violate our \emph{social agreement} by acting against our interests. Oftentimes we are not able to confirm that the trustee has indeed done his/her job. Sometimes we only find out later that what was done was not in line with our own interests. Trust is therefore also always a function of time. Previously entrusted people can—depending on their track record—either continue to be trusted or lose our trust.

We not only trust certain people to act on our behalf; we can also place trust in things rather than people. Every technical device or gadget receives our trust to some extent, because we expect it to function as intended. This relationship encompasses \emph{dumb} devices such as vacuum cleaners and refrigerators, as well as seemingly \emph{intelligent} systems such as algorithms performing medical diagnoses. Artificial intelligence systems belong to the latter category when they function well, but can easily slip into the former, for example in the case of a poorly trained machine learning algorithm that always classifies pictures of dogs and cats as dogs.

Scholars usually divide trust into either \emph{cognitive} or \emph{non-cognitive} forms. While cognitive trust involves some sort of rational and objective evaluation of the trustee's capabilities, non-cognitive trust lacks such an evaluation. For instance, if a patient comes to a doctor with a health problem which lies within the doctor's domain, the patient will place trust in the doctor because of the doctor's experience, track record and education. The patient thus consciously decides that he/she would rather trust the doctor to solve the problem than a friend who has no expertise in the matter. Conversely, non-cognitive trust allows humans to place trust in people they know well, without a need for rational justification, simply because of their existing relationship.

Due to the different dimensions of trust and its inherent complexity in different contexts, frameworks for trust are an active field of research. One such framework—proposed by \textcite{ferrarioAIWeTrust2020}—will be discussed in the following sections.

\subsection{Incremental Model of Trust}

The framework by \textcite{ferrarioAIWeTrust2020} consists of three types of trust: simple trust, reflective trust and paradigmatic trust. Their model is thus given by the triple
\[
T = \langle\text{simple trust}, \text{reflective trust}, \text{paradigmatic trust}\rangle
\]
and a 5-tuple
\[
\langle X, Y, A, G, C\rangle,
\]
where $X$ and $Y$ denote interacting agents and $A$ the action to be performed by the agent $Y$ to achieve the goal $G$. $C$ stands for the context in which the action takes place.

\subsubsection{Simple Trust}
is a non-cognitive form of trust and the least demanding form of trust in the incremental model. $X$ trusts $Y$ to perform an action $A$ to pursue the goal $G$ without requiring additional information about $Y$'s ability to generate a satisfactory outcome.
In other words, $X$ \emph{depends} on $Y$ to perform an action. $X$ has no control over the process and does not want to control it or the outcome. A lot of day-to-day interactions happen in some form or another under simple trust: we (simply) trust a stranger on the street to show us the right way when we are lost. Sometimes simple trust is unavoidable because of the trustor's inability to obtain additional information about the other party. Children, for example, have to simply trust adults not because they want to but out of necessity. This changes as they grow older and develop the ability to better judge other people.

\subsubsection{Reflective Trust}
adds an additional layer to the simple trust model: trustworthiness. Trustworthiness can be defined as the cognitive belief of $X$ that $Y$ is trustworthy. Reflective trust involves a cognitive process which allows a trustor to obtain reasons for trusting a potential trustee. $X$ believes in the trustworthiness of $Y$ because there are reasons for $Y$ being trustworthy. Contrary to simple trust, reflective trust includes the aspect of control. For an agent $X$ to \emph{reflectively} trust another agent $Y$, $X$ has objective reasons to trust $Y$ but is not willing to do so without control. Reflective trust does not have to be expressed in binary form but can also be expressed by a subjective measure of confidence. The more likely a trustee $Y$ is to perform action $A$ towards a goal $G$, the higher $X$'s confidence in $Y$ is. Additionally, $X$ might have high reflective trust in $Y$ but still not trust $Y$ to perform a given task because of other, potentially unconscious, reasons.

\subsubsection{Paradigmatic Trust}
is the last form of trust in the incremental model proposed by \textcite{ferrarioAIWeTrust2020}. In addition to having objective reasons to trust $Y$, $X$ is also willing to do so without control. It is thus a combination of simple trust and reflective trust. Simple trust provides the non-cognitive, non-controlling aspect of trust and reflective trust provides the cognitive aspect.

\subsection{Application of the Model}

Since the incremental model of trust can be applied to human-human as well as human-AI interactions, an example which draws from both domains will be presented. The setting is that of a company which ships tailor-made machine learning (ML) solutions to other firms.

On the human-human interaction side there are multiple teams working on different aspects of the software. The hierarchical structure between bosses, their team leaders and their developers is composed of different forms of trust. A boss has worked with a specific team leader in the past and thus knows from experience that the team leader can be trusted without control (paradigmatic trust). The team leader has had this particular team for a number of projects already but has recently hired a new junior developer. The team leader has some objective proof that the new hire is capable of delivering good work on time due to impressive credentials, but needs more time to be able to trust the new colleague without control (reflective trust).

On the human-AI side, one developer is working on a machine learning algorithm to achieve a specific goal $G$. Taking the 5-tuple from the incremental model, $X$ is the developer, $Y$ is the machine learning algorithm and $A$ is the action the machine learning algorithm takes to achieve the goal $G$. In the beginning, $X$ does not yet trust $Y$ to do its job properly. This is due to the absence of any past performance metrics showing that the algorithm achieves $G$. While most, if not all, parameters of $Y$ have to be controlled by $X$ in the beginning, less and less control is needed if $Y$ achieves $G$ consistently. This also increases the cognitive trust in $Y$ as time goes on, due to accurate performance metrics.
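For concreteness, this human-AI example can be written in the notation of the incremental model. The concrete labels chosen here for the action $A$ and the context $C$ are illustrative:
\[
\langle X, Y, A, G, C\rangle = \langle \text{developer},\; \text{ML algorithm},\; \text{producing predictions},\; \text{project goal},\; \text{client project}\rangle.
\]
In the beginning, trust in this constellation is at most reflective: $X$ has reasons to believe that $Y$ can achieve $G$, but still exercises control. Only repeated, verifiable success of $Y$ moves the relationship towards paradigmatic trust.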
\section{Computational Aspects of Trustworthy AI}
\label{sec:taxonomy}

While there have been rapid advances in the quality of machine learning models and neural networks, scholars, the public and policymakers are increasingly recognizing the dangers of artificial intelligence. Concerns ranging from privacy and methods for deanonymizing individual data points to discrimination through learned biases and environmental impacts have prompted a new area of research focused on altering the models to alleviate these concerns. A recent survey by \textcite{liuTrustworthyAIComputational2021} summarizes the state of the art in trustworthy AI research. The authors collect research from a computational perspective and divide it into six categories: safety and robustness, non-discrimination and fairness, explainability, privacy, accountability and auditability, and environmental well-being. The following sections summarize the computational methods for each category.

\subsection{Safety and Robustness}

Machine learning models should be able to give robust results even in the face of adversarial attacks or naturally occurring noise in the training data. It has been shown that even small perturbations of a model's inputs can disproportionately affect its predictions \cite{madryDeepLearningModels2019}. In order to build models which retain their accuracy and general performance even under less than ideal circumstances, it is necessary to study different forms of attacks and how to defend against them. Safe and robust models lead to increases in trustworthiness because beneficiaries can more easily depend on their results (reflective trust).

\subsubsection{Threat models}
describe by which methods an attacker manages to break the performance of a particular machine learning algorithm. \emph{Poisoning attacks} allow an attacker to intentionally introduce bad samples into the training set, which results in wrong predictions by the model. While many models are trained beforehand, other models are constantly being updated with data they receive from their beneficiaries. One such example is Netflix's movie recommendation system, which learns which types of movies certain users are interested in. A malicious user could therefore attack the recommendation engine by supplying wrong inputs. \emph{Evasion attacks} consist of alterations made to input samples at inference time in such a way that these alterations—while generally invisible to the human eye—mislead the algorithm. \emph{White-box attacks} allow an attacker to clearly see all parameters and all functions of a model. \emph{Black-box attackers}, on the other hand, can only give inputs to the model and obtain the outputs. The former type of attack is generally easier to carry out. \emph{Targeted attacks} are aimed, for example, at specific classes of a machine learning classifier. Suppose a model is trained to recognize facial features. In a targeted attack, an attacker would try to feed inputs to the model such that just one person is consistently incorrectly classified. This type of attack is in contrast to \emph{non-targeted attacks}, which seek to undermine the model's performance in general. Targeted attacks are usually much harder to detect, as the predictions are correct overall but incorrect for a tiny subset.
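To make the notion of an evasion attack concrete, the following sketch implements the fast gradient sign method (FGSM), a standard white-box evasion attack. It is not a method taken from the survey above, but it illustrates the idea: the input is perturbed by a small, bounded step in the direction that increases the model's loss the most. The code assumes a PyTorch classifier; the function name and the perturbation budget are illustrative.

\begin{verbatim}
import torch
import torch.nn.functional as F

def fgsm_evasion(model, x, y, epsilon=0.03):
    """Craft adversarial inputs with the fast gradient sign method.

    `model` is assumed to be a PyTorch classifier returning class logits,
    `x` a batch of inputs scaled to [0, 1] and `y` the true labels.
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction of the sign of the loss gradient; the
    # perturbation is bounded by epsilon and often imperceptible.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
\end{verbatim}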
\subsubsection{Defenses against adversarial attacks}
are specific to the domain a model is working in. \textcite{xuAdversarialAttacksDefenses2020} describe different attacks and defenses for text, image and graph data in deep neural networks. Defending against adversarial attacks often has negative impacts on training time and accuracy \cite{tsiprasRobustnessMayBe2019}. Balancing these trade-offs is therefore critical for real-world applications.

\subsection{Non-discrimination and Fairness}

Non-discrimination and fairness are two important properties of any artificial intelligence system. If one or both of them are violated, trust in the system erodes quickly. Often, researchers only find out about a system's discriminatory behavior after the system has been in place for a long time. In other cases—such as with the chatbot Tay from Microsoft—the problems become immediately apparent once the algorithm is live. Countless other models have been shown to be biased on multiple fronts: the US recidivism prediction software \textsc{COMPAS} is biased against black people \cite{angwinMachineBias2016}, camera software for blink detection is biased against Asian eyes \cite{roseFaceDetectionCamerasGlitches2010}, and the placement of career advertisements discriminates based on gender \cite{lambrechtAlgorithmicBiasEmpirical2019}.

Some biases are already included in the data from which an algorithm learns to differentiate between different samples. Examples include \emph{measurement bias}, \emph{aggregation bias} and \emph{representation bias} \cite{lambrechtAlgorithmicBiasEmpirical2019}. If biases are present in systems that are already being used by people worldwide, these systems can in turn influence users' behavior through \emph{algorithmic bias}, \emph{popularity bias} and \emph{emergent bias} \cite{friedmanBiasComputerSystems1996,lambrechtAlgorithmicBiasEmpirical2019}.

Not all biases are bad. In order for models to work properly, some form of bias must be present, or there is no room for the model to generalize away from individual samples to common properties. This is what is commonly referred to as \emph{productive bias} \cite{liuTrustworthyAIComputational2021}. It is often introduced by the assumptions engineers of machine learning algorithms make about a specific problem. If the model architects make incorrect assumptions about the data, productive bias quickly turns into \emph{erroneous bias}. The last category of bias is \emph{discriminatory bias}, which is of particular relevance when designing artificial intelligence systems.

Fairness, on the other hand, is \enquote{…the absence of any prejudice or favoritism towards an individual or a group based on their inherent or acquired characteristics} \cite[p.~2]{mehrabiSurveyBiasFairness2021}. Fairness in the context of artificial intelligence thus means that the system treats groups or individuals with similar traits similarly.

\subsubsection{Bias assessment tools}
allow researchers to quantify the bias and fairness of a machine learning algorithm's outputs. One such assessment tool is Aequitas \cite{saleiroAequitasBiasFairness2019}; another, developed by IBM, is the AI Fairness 360 toolkit \cite{bellamyAIFairness3602018}.
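As a minimal illustration of the kind of group fairness metric such tools report, the following sketch computes the statistical parity difference, i.e., the difference in positive-prediction rates between two groups. The data and the binary group encoding are hypothetical.

\begin{verbatim}
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between two groups.

    `y_pred` holds binary predictions and `group` a binary protected
    attribute; a value of 0 means that both groups receive positive
    predictions at the same rate.
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rate_0 = y_pred[group == 0].mean()  # positive rate in group 0
    rate_1 = y_pred[group == 1].mean()  # positive rate in group 1
    return rate_1 - rate_0

# Hypothetical predictions for ten individuals in two groups.
print(statistical_parity_difference(
    y_pred=[1, 1, 0, 1, 0, 1, 0, 0, 0, 0],
    group=[0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))
\end{verbatim}

Toolkits such as Aequitas and AI Fairness 360 report this and a range of related metrics across many group definitions.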
\subsubsection{Bias mitigation techniques}
deal with unwanted bias in artificial intelligence systems. Depending on the stage at which they are introduced, they can be applied during \emph{pre-processing}, \emph{in-processing} or \emph{post-processing} \cite{liuTrustworthyAIComputational2021}. If it is possible to access the training data beforehand, pre-processing methods are particularly effective. Undersampled classes can be deliberately weighted differently than majority classes to achieve a better distribution over all samples. Re-weighting can also be applied during training of the algorithm, by first training on the samples and then re-training with the weights obtained from the first training iteration. Post-processing methods include transforming the trained model after the fact to account for potentially biased outputs. Balancing these transformations can be a difficult endeavor because prediction accuracy can suffer.

\subsection{Explainability}

Recent advances in artificial intelligence can mostly be attributed to ever-increasing model complexity, made possible by massive deep neural networks (DNNs) and other similarly complex architectures. Due to their size, these models are treated as black boxes with no apparent way to know how a particular prediction came to be. This lack of explainability makes it difficult for humans to trust artificial intelligence systems, especially in critical areas such as medicine. To counter the trend towards difficult-to-understand artificial intelligence systems, a new research field called \emph{eXplainable Artificial Intelligence} (XAI) has emerged.

Scholars distinguish between two similar but slightly different terms: \emph{explainability} and \emph{interpretability}. Interpretable systems allow humans to \emph{look inside} the model to determine which predictions it is going to make. This is only possible if most or all parameters of the model are visible to an observer and changes to those parameters result in predictable changes in outputs. Explainability, on the other hand, applies to black-box systems such as deep neural networks, where the system explains its predictions after the fact.

The definition of interpretability already provides one possibility for explainable models. If the model is constructed in a way which makes the parameters visible and a decision can be traced from a starting point to the outcome, the model is inherently explainable. Examples are decision trees, linear regression models, rule-based models and Bayesian networks. This approach is not possible for neural networks and thus \emph{model-agnostic explanations} have to be found. \textsc{LIME} \cite{ribeiroWhyShouldTrust2016} is a tool to find such model-agnostic explanations. \textsc{LIME} works \enquote{…by learning an interpretable model locally around the prediction} \cite[p.~1]{ribeiroWhyShouldTrust2016}. An advantage of this approach is that \textsc{LIME} is useful for any model, regardless of how it is constructed. Due to the high amount of flexibility introduced by model\nobreakdash-agnostic explanations, these can even be used for already interpretable models such as random forest classifiers.

Deep neural networks can also be explained using either a \emph{gradient-based} or a \emph{perturbation-based} explanation algorithm. Gradient-based algorithms attempt to evaluate how much the outputs change if the inputs are modified. If the gradient for a set of inputs is large, those inputs have a large effect on the outputs. Conversely, a small gradient indicates that a change in the inputs does not affect the outputs to a large extent.
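A minimal sketch of the gradient-based idea is shown below, assuming a PyTorch classifier that maps a batched input to class scores; the function name and shapes are illustrative. The absolute gradient of a class score with respect to the input serves as a simple per-feature relevance estimate.

\begin{verbatim}
import torch

def input_gradient_saliency(model, x, target_class):
    """Gradient-based explanation for a single input.

    `model` is assumed to be a PyTorch classifier returning class scores
    for a batched input `x` of shape (1, ...).
    """
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[0, target_class]
    score.backward()
    # Large absolute gradients mark input features whose variation
    # changes the target class score the most.
    return x.grad.abs()
\end{verbatim}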
Perturbation-based explanations work by finding perturbations in the inputs that alter the model's predictions the most. \textsc{LIME} is an example of a perturbation-based explanation algorithm.

\subsection{Privacy}

\subsection{Accountability and Auditability}

\subsection{Environmental Well-Being}

\section{Social Computing}
\label{sec:social-computing}

\section{Conclusion}
\label{sec:conclusion}

\printbibliography

\end{document}