16  Responsible AI


DALL·E 3 Prompt: Illustration of responsible AI in a futuristic setting with the universe in the backdrop: A human hand or hands nurturing a seedling that grows into an AI tree, symbolizing a neural network. The tree has digital branches and leaves, resembling a neural network, to represent the interconnected nature of AI. The background depicts a future universe where humans and animals with general intelligence collaborate harmoniously. The scene captures the initial nurturing of the AI as a seedling, emphasizing the ethical development of AI technology in harmony with humanity and the universe.

As machine learning models are increasingly deployed across various domains, these algorithms have the potential to perpetuate historical biases, breach privacy, or enable unethical automated decisions if developed without thoughtful consideration of their societal impacts. Even systems created with good intentions can ultimately discriminate against certain demographic groups, enable surveillance, or lack transparency into their behaviors and decision-making processes. As such, machine learning engineers and companies have an ethical responsibility to proactively ensure that principles of fairness, accountability, safety, and transparency are reflected in their models to prevent harm and build public trust.

Learning Objectives
  • Understand the core principles and motivations behind responsible AI, including fairness, transparency, privacy, safety, and accountability.

  • Learn technical methods for putting responsible AI principles into practice, like detecting dataset biases, building interpretable models, adding noise for privacy, and testing model robustness.

  • Recognize organizational and social challenges to achieving responsible AI, including issues around data quality, model objectives, communication, and job impacts.

  • Gain knowledge of ethical frameworks and considerations for AI systems, spanning AI safety, human autonomy, and economic consequences.

  • Appreciate the increased complexity and costs associated with developing ethical, trustworthy AI systems compared to unprincipled AI.

16.1 Introduction

Machine learning models are increasingly used to automate decisions in high-stakes social domains like healthcare, criminal justice, and employment. However, without deliberate care, these algorithms can perpetuate biases, breach privacy, or cause other harm. For instance, a loan approval model solely trained on data from high-income neighborhoods could disadvantage applicants from lower-income areas. This motivates the need for responsible machine learning - creating fair, accountable, transparent, and ethical models.

Several core principles underlie responsible ML. Fairness ensures models do not discriminate based on gender, race, age, and other attributes. Explainability enables humans to interpret model behaviors and improve transparency. Robustness and safety techniques prevent vulnerabilities like adversarial examples. Rigorous testing and validation help reduce unintended model weaknesses or side effects.

Implementing responsible ML presents both technical and ethical challenges. Developers must grapple with defining fairness mathematically, balancing competing objectives like accuracy vs interpretability, and securing quality training data. Organizations must also align incentives, policies, and culture to uphold ethical AI.

This chapter will equip you to critically evaluate AI systems and contribute to developing beneficial and ethical machine learning applications by covering the foundations, methods, and real-world implications of responsible ML. The responsible ML principles discussed are crucial knowledge as algorithms mediate more aspects of human society.

16.2 Definition

Responsible AI is about developing AI that positively impacts society under human ethics and values. There is no universally agreed-upon definition of “responsible AI,” but here is a summary of how it is commonly described. Responsible AI refers to designing, developing, and deploying artificial intelligence systems in an ethical, socially beneficial way. The core goal is to create trustworthy, unbiased, fair, transparent, accountable, and safe AI. While there is no canonical definition, responsible AI is generally considered to encompass principles such as:

  • Fairness: Avoiding biases, discrimination, and potential harm to certain groups or populations

  • Explainability: Enabling humans to understand and interpret how AI models make decisions

  • Transparency: Openly communicating how AI systems operate, are built, and are evaluated

  • Accountability: Having processes to determine responsibility and liability for AI failures or negative impacts

  • Robustness: Ensuring AI systems are secure, reliable and behave as intended

  • Privacy: Protecting sensitive user data and adhering to privacy laws and ethics

Putting these principles into practice involves technical techniques, corporate policies, governance frameworks, and moral philosophy. There are also ongoing debates around defining ambiguous concepts like fairness and determining how to balance competing objectives.

16.3 Principles and Concepts

16.3.1 Transparency and Explainability

Machine learning models are often criticized as mysterious “black boxes” - opaque systems where it’s unclear how they arrived at particular predictions or decisions. For example, an AI system called COMPAS used to assess criminal recidivism risk in the U.S. was found to be racially biased against black defendants, yet the opacity of the algorithm made it difficult to understand and fix the problem. This lack of transparency can obscure biases, errors, and deficiencies.

Explaining model behaviors helps engender trust from the public and domain experts and enables identifying issues to address. Interpretability techniques like LIME, Shapley values, and saliency maps empower humans to understand and validate model logic. Laws like the EU’s GDPR also mandate transparency, which requires explainability for certain automated decisions. Overall, transparency and explainability are critical pillars of responsible AI.

16.3.2 Fairness, Bias, and Discrimination

ML models trained on historically biased data often perpetuate and amplify those prejudices. Healthcare algorithms have been shown to disadvantage black patients by underestimating their needs (Obermeyer et al. 2019). Facial recognition systems have been shown to be less accurate for women and people of color. Such algorithmic discrimination can negatively impact people’s lives in profound ways.

Different philosophical perspectives also exist on fairness - for example, is it fairer to treat all individuals equally or try to achieve equal outcomes for groups? Ensuring fairness requires proactively detecting and mitigating biases in data and models. However, achieving perfect fairness is tremendously difficult due to contrasting mathematical definitions and ethical perspectives. Still, promoting algorithmic fairness and non-discrimination is a key responsibility in AI development.

16.3.3 Privacy and Data Governance

Maintaining individuals’ privacy is an ethical obligation and legal requirement for organizations deploying AI systems. Regulations like the EU’s GDPR mandate data privacy protections and rights like the ability to access and delete one’s data.

However, maximizing the utility and accuracy of data for training models can conflict with preserving privacy - modeling disease progression could benefit from access to patients’ full genomes but sharing such data widely violates privacy.

Responsible data governance involves carefully anonymizing data, controlling access with encryption, getting informed consent from data subjects, and collecting the minimum data needed. Honoring privacy is challenging but critical as AI capabilities and adoption expand.

16.3.4 Safety and Robustness

Putting AI systems into real-world operation requires ensuring they are safe, reliable, and robust, especially for human interaction scenarios. Self-driving cars from Uber and Tesla have been involved in deadly crashes due to unsafe behaviors.

Adversarial attacks that subtly alter input data can also fool ML models and cause dangerous failures if systems are not resistant. Deepfakes represent another emerging threat area.

Promoting safety requires extensive testing, risk analysis, human oversight, and designing systems that combine multiple weak models to avoid single points of failure. Rigorous safety mechanisms are essential for the responsible deployment of capable AI.

16.3.5 Accountability and Governance

When AI systems eventually fail or produce harmful outcomes, there must be mechanisms to address resultant issues, compensate affected parties, and assign responsibility. Both corporate accountability policies and government regulations are indispensable for responsible AI governance. For instance, Illinois’ Artificial Intelligence Video Interview Act requires companies to disclose and obtain consent for AI video analysis, promoting accountability.

Without clear accountability, even harms caused unintentionally could go unresolved, furthering public outrage and distrust. Oversight boards, impact assessments, grievance redress processes, and independent audits promote responsible development and deployment.

16.4 Cloud, Edge & TinyML

While these principles broadly apply across AI systems, certain responsible AI considerations are unique or pronounced when dealing with machine learning on embedded devices versus traditional server-based modeling. Therefore, we present a high-level taxonomy comparing responsible AI considerations across cloud, edge, and TinyML systems.

16.4.1 Summary

The table below summarizes how responsible AI principles manifest differently across cloud, edge, and TinyML architectures and how core considerations tie into their unique capabilities and limitations. Each environment’s constraints and tradeoffs shape how we approach transparency, accountability, governance, and other pillars of responsible AI.

| Principle      | Cloud ML                    | Edge ML                | TinyML               |
|----------------|-----------------------------|------------------------|----------------------|
| Explainability | Complex models supported    | Lightweight required   | Severe limits        |
| Fairness       | Broad data available        | On-device biases       | Limited data labels  |
| Privacy        | Cloud data vulnerabilities  | More sensitive data    | Data dispersed       |
| Safety         | Hacking threats             | Real-world interaction | Autonomous devices   |
| Accountability | Corporate policies          | Supply chain issues    | Component tracing    |
| Governance     | External oversight feasible | Self-governance needed | Protocol constraints |

16.4.2 Explainability

For cloud-based machine learning, explainability techniques can leverage significant compute resources, enabling complex methods like SHAP values or sampling-based approaches to interpret model behaviors. For example, Microsoft’s InterpretML toolkit provides explainability techniques tailored for cloud environments.

However, edge ML operates on resource-constrained devices, requiring more lightweight explainability methods that can run locally without excessive latency. Techniques like LIME (Ribeiro, Singh, and Guestrin 2016) approximate model explanations by fitting simple surrogates, such as linear models or decision trees, to avoid expensive computations. However, generating a good LIME explanation typically requires querying the model on hundreds or even thousands of perturbed samples, which is often infeasible given edge computing constraints. In contrast, saliency-based methods are often much faster in practice, requiring only a single forward and backward pass through the network to estimate feature importance. This greater efficiency makes such methods better suited to edge devices with limited compute resources where low-latency explanations are critical.
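As a rough, hedged sketch of the kind of lightweight, single-pass explanation described above, the following computes a vanilla gradient saliency map with PyTorch; the toy model, input size, and class count are assumptions made for illustration, not details from the text.

```python
# Minimal gradient-saliency sketch (assumes PyTorch is available).
# One forward and one backward pass yield per-pixel importance scores,
# in contrast to LIME, which queries the model on many perturbed samples.
import torch
import torch.nn as nn

# Hypothetical compact classifier standing in for an edge model.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 28 * 28, 10),
)
model.eval()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # example input
logits = model(x)
top_class = logits.argmax(dim=1)

# Backpropagate the top logit to the input to obtain a saliency map.
logits[0, top_class.item()].backward()
saliency = x.grad.abs().squeeze()  # (28, 28) importance scores
print(saliency.shape)
```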

TinyML poses the most significant challenges for explainability, given its tiny hardware capabilities. On the other hand, more compact models and limited data can make inherent model transparency easier to achieve. Explaining decisions may simply not be feasible on highly size- and power-optimized microcontrollers. DARPA’s Transparent Computing program aims to develop extremely low-overhead explainability, especially for TinyML devices like sensors and wearables.

16.4.3 Fairness

For cloud machine learning, vast datasets and computing power enable detecting biases across large heterogeneous populations and mitigating them through techniques like re-weighting data samples. However, biases may emerge from the broad behavioral data used to train cloud models. Meta’s Fairness Flow framework, for example, helps assess cloud ML fairness.

Edge ML relies on limited on-device data, making analyzing biases across diverse groups harder. But edge devices interact closely with individuals, providing an opportunity to adapt locally for fairness. Google’s Federated Learning distributes model training across devices to incorporate individual differences.
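As an illustrative sketch of the federated averaging idea mentioned above, the toy example below aggregates per-device model updates with a size-weighted average; the devices, data, and logistic-regression updates are synthetic stand-ins under assumed settings, not Google's actual implementation.

```python
# Minimal federated-averaging (FedAvg) sketch using NumPy.
# Each simulated device computes a local model update on its own data,
# and only the size-weighted average of the updates reaches the server.
import numpy as np

rng = np.random.default_rng(0)
n_features = 4
global_w = np.zeros(n_features)

# Hypothetical per-device datasets (sizes differ to show weighting).
device_data = [
    (rng.normal(size=(n, n_features)), rng.integers(0, 2, size=n))
    for n in (20, 35, 10, 50, 25)
]

def local_update(w, X, y, lr=0.1, steps=10):
    """One device's local logistic-regression gradient steps."""
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w = w - lr * X.T @ (p - y) / len(y)
    return w

# Server round: average local models weighted by device dataset size.
sizes = np.array([len(y) for _, y in device_data])
local_ws = np.array([local_update(global_w, X, y) for X, y in device_data])
global_w = (sizes[:, None] * local_ws).sum(axis=0) / sizes.sum()
print(global_w)
```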

TinyML poses unique challenges for fairness with highly dispersed specialized hardware and minimal training data. Bias testing is difficult across diverse devices. Collecting representative data from many devices to mitigate bias has scale and privacy hurdles. DARPA’s Assured Neuro Symbolic Learning and Reasoning (ANSR) efforts are geared toward developing fairness techniques given extreme hardware constraints.

16.4.4 Safety

For cloud ML, key safety risks include model hacking, data poisoning, and malware disrupting cloud services. Robustness techniques like adversarial training, anomaly detection, and diversified models aim to harden cloud ML against attacks. Redundancy can help prevent single points of failure.

Edge ML and TinyML interact with the physical world, so reliability and safety validation are critical. Rigorous testing platforms like Foretellix synthetically generate edge scenarios to validate safety. TinyML safety challenges are magnified by autonomous devices operating with limited supervision. TinyML safety therefore often relies on collective coordination; for example, swarms of drones maintain safety through redundancy. Physical control barriers also constrain unsafe TinyML device behaviors.

In summary, safety is crucial but manifests differently in each domain. Cloud ML guards against hacking, edge ML interacts physically so reliability is key, and TinyML leverages distributed coordination for safety. Understanding the nuances guides appropriate safety techniques.

16.4.5 Accountability

Cloud ML’s accountability centers on corporate practices like responsible AI committees, ethical charters, and processes to address harmful incidents. Third-party audits and external government oversight promote cloud ML accountability.

Edge ML accountability is more complex with distributed devices and supply chain fragmentation. Companies are accountable for devices, but components come from various vendors. Industry standards help coordinate edge ML accountability across stakeholders.

With TinyML, accountability mechanisms must be traced across long, complex supply chains of integrated circuits, sensors, and other hardware. TinyML certification schemes help track component provenance. Trade associations should ideally promote shared accountability for ethical TinyML.

16.4.6 Governance

For cloud ML, organizations institute internal governance like ethics boards, audits, and model risk management. But external governance also oversees cloud ML, like regulations on bias and transparency such as the AI Bill of Rights, the General Data Protection Regulation (GDPR), and the California Consumer Privacy Act (CCPA). Third-party auditing supports cloud ML governance.

Edge ML is more decentralized, requiring responsible self-governance by developers and companies deploying models locally. Industry associations coordinate governance across edge ML vendors. Open software helps align incentives for ethical edge ML.

With TinyML, extreme decentralization and complexity make external governance infeasible. TinyML relies on protocols and standards for self-governance baked into model design and hardware. Cryptography enables the provable trustworthiness of TinyML devices.

16.4.7 Privacy

For cloud ML, vast amounts of user data are concentrated in the cloud, creating risks of exposure through breaches. Differential privacy techniques add noise to cloud data to preserve privacy. Strict access controls and encryption protect cloud data at rest and in transit.
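As a minimal, hedged illustration of the noise-adding idea, the sketch below releases a differentially private mean of a bounded attribute via the Laplace mechanism; the data, clipping bounds, and privacy budget are assumptions chosen for the example.

```python
# Minimal Laplace-mechanism sketch (NumPy): releasing a differentially
# private mean of a bounded attribute instead of the raw statistic.
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1000)   # hypothetical sensitive column
lower, upper = 18, 90                    # assumed known bounds
epsilon = 1.0                            # privacy budget

# Sensitivity of the mean when one bounded record changes.
sensitivity = (upper - lower) / len(ages)

noisy_mean = ages.mean() + rng.laplace(scale=sensitivity / epsilon)
print(f"true mean = {ages.mean():.2f}, private mean = {noisy_mean:.2f}")
```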

Edge ML moves data processing onto user devices, reducing aggregated data collection but increasing potential sensitivity as personal data resides on the device. Apple uses on-device ML and differential privacy to train models while minimizing data sharing. Data anonymization and secure enclaves protect on-device data.

TinyML distributes data across many resource-constrained devices, making centralized breaches unlikely but anonymization at scale challenging. Data minimization and using edge devices as intermediaries help protect TinyML privacy.

So, while cloud ML must protect expansive centralized data, edge ML secures sensitive on-device data, and TinyML aims for minimal distributed data sharing due to constraints. While privacy is vital throughout, techniques must match the environment. Understanding nuances allows for selecting appropriate privacy preservation approaches.

16.5 Technical Aspects

16.5.1 Detecting and Mitigating Bias

There has been a large body of work demonstrating that machine learning models can exhibit bias, from underperforming for people of a certain identity to making decisions that limit groups’ access to important resources (Buolamwini and Gebru 2018).

Ensuring fair and equitable treatment for all groups affected by machine learning systems is crucial as these models increasingly impact people’s lives in areas like lending, healthcare, and criminal justice. We typically evaluate model fairness by considering “subgroup attributes” - attributes unrelated to the prediction task that capture identities like race, gender, or religion. For example, in a loan default prediction model, subgroups could include race, gender, or religion. When models are trained naively to maximize accuracy, they often ignore subgroup performance. However, this can negatively impact marginalized communities.

To illustrate, imagine a model predicting loan repayment where the plusses (+’s) represent repayment and the circles (O’s) represent default, as shown in Figure 16.1. The accuracy-optimal decision boundary would correctly classify all of Group A while misclassifying some of Group B’s creditworthy applicants as defaults. If positive classifications grant access to loans, Group A would receive many more loans, which would naturally result in a biased outcome.

Figure 16.1: Fairness and accuracy.

Alternatively, correcting the biases against Group B would likely increase “false positives” and reduce accuracy for Group A. Or, we could train separate models focused on maximizing true positives for each group. But this would require explicitly using sensitive attributes like race in the decision process.

As we see, there are inherent tensions around priorities like accuracy versus subgroup fairness, and whether to explicitly account for protected classes. Reasonable people can disagree on the appropriate tradeoffs. And constraints around costs and implementation options further complicate matters. Overall, ensuring the fair and ethical use of machine learning involves navigating these complex challenges.

Thus, the fairness literature has proposed three main metrics for quantifying how fairly a model performs over a dataset (Hardt, Price, and Srebro 2016). Given a model h and a dataset D consisting of (x, y, s) samples, where x is the data features, y is the label, and s is the subgroup attribute, and assuming there are simply two subgroups a and b, we can define the following.

  1. Demographic Parity requires that the model’s positive prediction rate be the same for each subgroup, independent of the true label. In other words, P(h(X) = 1 | S = a) = P(h(X) = 1 | S = b)

  2. Equalized Odds requires that the model behave the same on positive and negative samples for each subgroup, i.e., for each label value y: P(h(X) = y | S = a, Y = y) = P(h(X) = y | S = b, Y = y)

  3. Equality of Opportunity is a special case of equalized odds that considers only positive samples. This is relevant in cases such as resource allocation, where we care about how positive (i.e., resource-allocated) labels are distributed across groups. For example, we care that an equal proportion of creditworthy men and women are granted loans: P(h(X) = 1 | S = a, Y = 1) = P(h(X) = 1 | S = b, Y = 1)

Note: these definitions often take a narrow view of considering binary comparisons between two subgroups. Another thread of fair machine learning research focusing on multicalibration and multiaccuracy considers the interactions between an arbitrary number of identities, acknowledging the inherent intersectionality of individual identities in the real world (Hébert-Johnson et al. 2018).
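To make the definitions above concrete, here is a minimal sketch that estimates the per-group rates behind them from a model's predictions; the arrays y_true, y_pred, and s are hypothetical placeholders for real evaluation data.

```python
# Minimal sketch: estimating the three fairness metrics from model outputs.
# s holds subgroup labels "a"/"b"; all arrays here are synthetic stand-ins.
import numpy as np

def rate(mask):
    return mask.mean() if mask.size else float("nan")

def fairness_report(y_true, y_pred, s):
    out = {}
    for g in ("a", "b"):
        in_g = (s == g)
        # Demographic parity compares P(h(X) = 1 | S = g) across groups.
        out[f"positive_rate_{g}"] = rate(y_pred[in_g] == 1)
        # Equality of opportunity compares P(h(X) = 1 | S = g, Y = 1).
        out[f"tpr_{g}"] = rate(y_pred[in_g & (y_true == 1)] == 1)
        # The other half of equalized odds: P(h(X) = 1 | S = g, Y = 0).
        out[f"fpr_{g}"] = rate(y_pred[in_g & (y_true == 0)] == 1)
    return out

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
s = rng.choice(["a", "b"], 200)
y_pred = rng.integers(0, 2, 200)
print(fairness_report(y_true, y_pred, s))
```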

Context Matters

Before making any technical decisions in developing an unbiased ML algorithm, we need to understand the context surrounding our model. Here are some of the key questions to think about:

  • Who will this model make decisions for?
  • Who is represented in the training data?
  • Who is represented and who is missing at the table of engineers, designers, and managers?
  • What sort of long-lasting impacts could this model have? For example, will it affect an individual’s financial security at a generational scale, such as determining college admissions or approving a home loan?
  • What historical and systematic biases are present in this setting, and are they present in the training data the model will generalize from?

Understanding the social, ethical and historical background of a system is critical to prevent harm and should inform decisions throughout the model development lifecycle. After understanding the context, there are a wide array of technical decisions one can make to remove bias. First, one must decide what fairness metric is the most appropriate criterion to optimize for. Next, there are generally three main areas where one can intervene to debias an ML system.

First, preprocessing balances a dataset to ensure fair representation, or increases the weight on certain underrepresented groups so that the model performs well on them. Second, in-processing modifies the training process of an ML system to ensure it prioritizes fairness. This can range from adding a fairness regularizer (Lowy et al. 2021) to training an ensemble of models and sampling from them in a specific manner (Agarwal et al. 2018).
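As a small, hedged sketch of the preprocessing approach, the example below computes inverse-frequency sample weights for an imbalanced subgroup attribute; the subgroup array and the idea of passing the weights to a training API (for example, a sample_weight argument) are illustrative assumptions, not a specific cited method.

```python
# Minimal preprocessing sketch: inverse-frequency sample weights so that
# underrepresented subgroups contribute proportionally more to the loss.
import numpy as np

s = np.array(["a"] * 900 + ["b"] * 100)   # imbalanced subgroup membership

groups, counts = np.unique(s, return_counts=True)
inv_freq = {g: len(s) / (len(groups) * c) for g, c in zip(groups, counts)}
sample_weight = np.array([inv_freq[g] for g in s])

# Group "b" samples end up weighing roughly 9x more than group "a" samples.
print(inv_freq)
```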

Finally, post-processing debiases a model after the fact, taking a trained model and modifying its predictions in a specific manner to ensure fairness is preserved (Alghamdi et al. 2022; Hardt, Price, and Srebro 2016). Post-processing complements the preprocessing and in-processing steps by providing another opportunity to address bias and fairness issues after the model has already been trained.
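One common post-processing strategy, in the spirit of Hardt, Price, and Srebro (2016), is to choose group-specific decision thresholds on held-out data so that true positive rates roughly match across subgroups. The sketch below is a rough illustration with synthetic scores and labels, not a faithful reimplementation of the cited method.

```python
# Minimal post-processing sketch: per-group thresholds tuned so that each
# group's true positive rate lands near a shared target. Data is synthetic.
import numpy as np

def tpr(scores, labels, thr):
    pos = labels == 1
    return (scores[pos] >= thr).mean()

def match_tpr(scores, labels, s, target_tpr=0.8):
    """Pick a per-group threshold whose TPR is closest to the target."""
    thresholds = {}
    for g in np.unique(s):
        in_g = s == g
        candidates = np.linspace(0, 1, 101)
        gaps = [abs(tpr(scores[in_g], labels[in_g], t) - target_tpr)
                for t in candidates]
        thresholds[g] = candidates[int(np.argmin(gaps))]
    return thresholds

rng = np.random.default_rng(0)
s = rng.choice(["a", "b"], 1000)
labels = rng.integers(0, 2, 1000)
# Simulated scores that are systematically lower for group "b".
scores = rng.uniform(size=1000) * np.where(s == "b", 0.8, 1.0)
print(match_tpr(scores, labels, s))
```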

The three-stage framework of preprocessing, in-processing, and post-processing provides opportunities to intervene at different points of model development to mitigate issues around bias and fairness. While preprocessing and in-processing focus on data and training, post-processing allows for adjustments after the model has been fully trained. Together, these three approaches give multiple opportunities to detect and remove unfair bias.

Thoughtful Deployment

The breadth of existing fairness definitions and debiasing interventions underscores the need for thoughtful assessment before deploying ML systems. As ML researchers and developers, responsible model development requires proactively educating ourselves on the real-world context, consulting domain experts and end-users, and centering harm prevention.

Rather than seeing fairness considerations as a box to check, we must deeply engage with the unique social implications and ethical tradeoffs around each model we build. Every technical choice about datasets, model architectures, evaluation metrics, and deployment constraints embeds values. By broadening our perspective beyond narrow technical metrics, carefully evaluating tradeoffs, and listening to impacted voices, we can work to ensure our systems expand opportunity rather than encode bias.

The path forward lies not in an arbitrary debiasing checklist but in a commitment to understanding and upholding our ethical responsibility at each step. This commitment starts with proactively educating ourselves and consulting others, rather than just going through the motions of a fairness checklist. It requires engaging deeply with ethical tradeoffs in our technical choices, evaluating impacts on different groups, and listening to those voices most impacted.

Ultimately, responsible and ethical AI systems come not from checkbox debiasing, but from upholding our duty to assess harms, broaden perspectives, understand tradeoffs and ensure we provide opportunity for all groups. This ethical responsibility should drive every step.


16.5.2 Preserving Privacy

Recent incidents have demonstrated how AI models can memorize sensitive user data in ways that violate privacy. For example, Stable Diffusion’s art generations were found to mimic identifiable artists’ styles and replicate existing photos, concerning many (Ippolito et al. 2023). These risks are amplified with personalized ML systems deployed in intimate environments like homes or wearables.

Imagine if a smart speaker uses our conversations to improve the quality of service to end users who genuinely want it. Still, others could violate privacy by trying to extract what the speaker “remembers.” Figure 16.2 below shows an example of how diffusion models can memorize and generate individual training examples (Ippolito et al. 2023).

Figure 16.2: Diffusion models memorizing samples from training data. Credit: Ippolito et al. (2023).

Adversaries can take advantage of these memorization capabilities and train models to detect if specific training data influenced a target model. For example, membership inference attacks train a secondary model which learns to detect a change in the target model’s outputs when making inference over data it was trained on versus not trained on (Shokri et al. 2017).
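To illustrate the underlying intuition, the sketch below uses a simplified loss-threshold variant of membership inference rather than the full shadow-model attack of Shokri et al. (2017); the per-example losses are synthetic stand-ins for a real target model's outputs.

```python
# Simplified membership-inference sketch: training-set members tend to have
# lower loss, so an attacker can guess membership by thresholding the target
# model's per-example loss. All values below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example losses from a target model.
train_losses = rng.exponential(scale=0.2, size=500)   # memorized -> low loss
test_losses = rng.exponential(scale=1.0, size=500)    # unseen -> higher loss

threshold = 0.5  # attacker-chosen cutoff
losses = np.concatenate([train_losses, test_losses])
is_member = np.concatenate([np.ones(500), np.zeros(500)])

guess = (losses < threshold).astype(float)
attack_accuracy = (guess == is_member).mean()
print(f"membership inference accuracy: {attack_accuracy:.2f}")
```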

ML devices are especially vulnerable because they are often personalized on user data and are deployed in even more intimate settings such as the home. To combat these privacy issues, private machine learning techniques have evolved to establish safeguards against adversaries, as mentioned in the Security and Privacy chapter. Methods like differential privacy add mathematical noise during training to obscure individual data points’ influence on the model. Popular techniques like DP-SGD (Abadi et al. 2016) also clip gradients to limit what the model leaks about the data. Still, some argue users should also be able to delete the impact of their data after the fact.
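A rough sketch of the DP-SGD recipe outlined by Abadi et al. (2016), per-example gradient clipping followed by Gaussian noise, is shown below for a toy logistic-regression model; the data and hyperparameters are illustrative, and a real deployment would use a vetted library rather than this hand-rolled version.

```python
# Minimal DP-SGD-style sketch (NumPy): clip each example's gradient, sum the
# clipped gradients, add Gaussian noise, then take an averaged step.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = rng.integers(0, 2, size=256)
w = np.zeros(5)

clip_norm, noise_mult, lr, batch = 1.0, 1.1, 0.1, 32

for step in range(100):
    idx = rng.choice(len(X), size=batch, replace=False)
    grads = []
    for i in idx:
        p = 1.0 / (1.0 + np.exp(-X[i] @ w))
        g = (p - y[i]) * X[i]                             # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)   # clip to clip_norm
        grads.append(g)
    noise = rng.normal(scale=noise_mult * clip_norm, size=w.shape)
    w -= lr * (np.sum(grads, axis=0) + noise) / batch     # noisy averaged step
print(w)
```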

16.5.3 Machine Unlearning

With ML devices personalized to individual users and then deployed to remote edges without connectivity, a challenge arises—how can models responsively “forget” data points after deployment? If a user requests their personal data be removed from a personalized model, the lack of connectivity makes retraining infeasible. Thus, efficient on-device data forgetting is necessary but poses hurdles.

Initial unlearning approaches faced limitations in this context. Retraining models from scratch on the device to forget data points proves inefficient or even impossible, given the resource constraints. Fully retraining also requires retaining all the original training data on the device, which brings its own security and privacy risks. Common machine unlearning techniques (Bourtoule et al. 2021) for remote embedded ML systems fail to enable responsive, secure data removal.

However, newer methods show promise in modifying models to approximately forget data without full retraining. While the accuracy loss from avoiding full rebuilds is modest, guaranteeing data privacy should still be the priority when handling sensitive user information ethically. Even slight exposure to private data can violate user trust. As ML systems become deeply personalized, efficiency and privacy must be enabled from the start, not treated as afterthoughts.

Recent policies, including the European Union’s General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), the Act on the Protection of Personal Information (APPI), and Canada’s proposed Consumer Privacy Protection Act (CPPA), require the deletion of private information. These policies, coupled with AI incidents like Stable Diffusion memorizing artist data, have underscored the ethical need for users to be able to delete their data from models after training.

The right to remove data arises from privacy concerns around corporations or adversaries misusing sensitive user information. Machine unlearning refers to removing the influence of specific points from an already-trained model. Naively this involves full retraining without the deleted data. However, for ML systems personalized and deployed to remote edges, connectivity constraints often make retraining infeasible. If a smart speaker learns from private home conversations, retaining access to delete that data is important.

Although still limited, methods are evolving to enable efficient approximations of retraining for unlearning. By modifying models at inference time, they can mimic “forgetting” data without full access to the training data. However, most current techniques are restricted to simple models, still carry resource costs, and trade away some accuracy. Even so, enabling efficient data removal while respecting user privacy remains an imperative for responsible TinyML deployment.
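As an illustrative aside, and not one of the cited approaches, exact unlearning is tractable for very simple models: the sketch below removes a single training point from a ridge-regression model by downdating its sufficient statistics, with all data and parameters chosen for demonstration.

```python
# Illustrative exact unlearning for ridge regression (NumPy). The model depends
# only on A = X^T X + lam*I and b = X^T y, so one training point (x_i, y_i) can
# be "forgotten" by subtracting its rank-one contribution and re-solving.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
lam = 1e-3

A = X.T @ X + lam * np.eye(3)
b = X.T @ y
w_full = np.linalg.solve(A, b)

# "Unlearn" training point 7 by removing its contribution to A and b.
x_i, y_i = X[7], y[7]
A_minus = A - np.outer(x_i, x_i)
b_minus = b - y_i * x_i
w_unlearned = np.linalg.solve(A_minus, b_minus)

# Check against retraining from scratch without point 7.
X_r, y_r = np.delete(X, 7, axis=0), np.delete(y, 7)
w_retrained = np.linalg.solve(X_r.T @ X_r + lam * np.eye(3), X_r.T @ y_r)
print(np.allclose(w_unlearned, w_retrained))  # True
```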

16.5.4 Adversarial Examples and Robustness

Machine learning models, especially deep neural networks, have a well-documented Achilles heel: they often break when even tiny perturbations are made to their inputs (Szegedy et al. 2014). Models can display human-like proficiency on their training distributions yet malfunction on inputs that differ only imperceptibly from normal data, underscoring gaps in generalization and robustness. This fragility threatens real-world deployment in high-stakes domains and, given the growing ubiquity of ML, opens the door to adversarial attacks that deliberately exploit model-breaking points humans would never perceive.

Figure 16.3 shows an example of a small, seemingly meaningless perturbation that changes a model’s prediction. This fragility has real-world impacts: lack of robustness undermines trust in deploying models for high-stakes applications like self-driving cars or medical diagnosis. Moreover, the vulnerability leads to security threats: attackers can deliberately craft adversarial examples that are perceptually indistinguishable from normal data but cause model failures.

Figure 16.3: Perturbation effect on prediction. Credit: Microsoft.

For instance, past work shows successful attacks that trick models for tasks like NSFW detection (Bhagoji et al. 2018), ad-blocking (Tramèr et al. 2019), and speech recognition (Carlini et al. 2016). While errors in these domains already pose security risks, the problem extends beyond IT security: recently adversarial robustness has been proposed as an additional performance metric by approximating worst-case behavior.
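To make these attacks concrete, here is a minimal, hedged sketch of the fast gradient sign method (FGSM), one common way to craft such perturbations; the toy classifier, input, and label are placeholders rather than details from the cited studies.

```python
# Minimal FGSM sketch (PyTorch): perturb an input in the direction of the sign
# of the loss gradient to try to flip the model's prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # toy classifier
model.eval()

x = torch.rand(1, 1, 28, 28, requires_grad=True)
y = torch.tensor([3])                       # assumed true label
loss = F.cross_entropy(model(x), y)
loss.backward()

epsilon = 0.1                               # perturbation budget
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))
```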

The surprising model fragility highlighted above casts doubt on real-world reliability and opens the door to adversarial manipulation. This growing vulnerability underscores several needs. First, principled robustness evaluations are essential for quantifying model vulnerabilities before deployment. Approximating worst-case behavior surfaces blindspots.

Second, effective defenses across domains must be developed to close these robustness gaps. With security on the line, developers cannot ignore the threat of attacks exploiting model weaknesses. Moreover, for safety-critical applications like self-driving vehicles and medical diagnosis, we cannot afford any fragility-induced failures. Lives are at stake.

Finally, the research community continues mobilizing rapidly in response. Interest in adversarial machine learning has exploded as attacks reveal the need to bridge the robustness gap between synthetic and real-world data. Conferences now commonly feature defenses for securing and stabilizing models. The community recognizes that model fragility is a critical issue that must be addressed through robustness testing, defense development, and ongoing research. By surfacing blindspots and responding with principled defenses, we can work to ensure reliability and safety for machine learning systems, especially in high-stakes domains.

16.5.5 Building Interpretable Models

As models are deployed more frequently in high-stakes settings, practitioners, developers, downstream end-users, and regulators have increasingly highlighted the need for explainability in machine learning. The goal of many interpretability and explainability methods is to provide practitioners with more information about either the overall behavior of a model or its behavior on a specific input. This allows users to decide whether or not the output or prediction of a model is trustworthy.

Such analysis can help developers debug models and improve performance by pointing out biases, spurious correlations, and failure modes of models. In cases where models are able to surpass human performance on a task, interpretability can help users and researchers better understand relationships in their data and patterns that may previously have been unknown.

There are many classes of methods in explainability/interpretability, including: post hoc explainability, inherent interpretability, and mechanistic interpretability. These methods aim to make complex machine learning models more understandable and ensure users can trust model predictions, especially in critical settings. By providing transparency into model behavior, explainability techniques are an important tool for developing safe, fair, and reliable AI systems.

Post Hoc Explainability

Post hoc explainability methods typically explain the output behavior of a black-box model on a specific input. Popular methods include counterfactual explanations, feature attribution methods, and concept-based explanations.

Counterfactual explanations, also frequently referred to as algorithmic recourse, take the form of “If X had not occurred, Y would not have occurred” (Wachter, Mittelstadt, and Russell 2017). For example, consider a person applying for a bank loan whose application is rejected by a model. They may ask their bank for recourse, or how they need to change to be eligible for a loan. A counterfactual explanation would tell them which features they need to change and by how much such that the model’s prediction changes.
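In the spirit of Wachter, Mittelstadt, and Russell (2017), the hedged sketch below searches for the smallest change to a rejected application that pushes a toy model's approval score above 0.5; the model, features, and optimization settings are illustrative assumptions, not the cited method itself.

```python
# Minimal counterfactual-explanation sketch (PyTorch): gradient search for a
# nearby input that flips a toy loan model's decision.
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(3, 1), torch.nn.Sigmoid())

x = torch.tensor([[0.2, 0.1, 0.4]])        # original (rejected) applicant
x_cf = x.clone().requires_grad_(True)       # counterfactual to be optimized
opt = torch.optim.Adam([x_cf], lr=0.05)
lam = 0.1                                   # weight on staying close to x

for _ in range(200):
    opt.zero_grad()
    score = model(x_cf)
    # Push the score toward approval while penalizing large feature changes.
    loss = (score - 1.0).pow(2).sum() + lam * (x_cf - x).pow(2).sum()
    loss.backward()
    opt.step()

print("feature changes needed:", (x_cf - x).detach())
```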

Feature attribution methods aim to highlight the input features important or necessary for a particular prediction. For a computer vision model, this would mean highlighting the individual pixels that contributed most to the predicted label of the image. Note that these methods do not explain how those pixels/features impact the prediction, only that they do. Common methods include input gradients, GradCAM (Selvaraju et al. 2017), SmoothGrad (Smilkov et al. 2017), LIME (Ribeiro, Singh, and Guestrin 2016), and SHAP (Lundberg and Lee 2017).

By providing examples of changes to input features that would alter a prediction (counterfactuals) or indicating the most influential features for a given prediction (attribution), these post hoc explanation techniques shed light on model behavior for individual inputs. This granular transparency helps users determine whether they can trust and act upon specific model outputs.

Concept based explanations aim to explain model behavior and outputs using a pre-defined set of semantic concepts (e.g. the model recognizes scene class “bedroom” based on the presence of concepts “bed” and “pillow”). Recent work shows that users often prefer these explanations to attribution and example based explanations because they “resemble human reasoning and explanations” (Vikram V. Ramaswamy et al. 2023b). Popular concept based explanation methods include TCAV (Kim et al. 2018), Network Dissection (Bau et al. 2017), and interpretable basis decomposition (Zhou et al. 2018).

Note that these methods are extremely sensitive to the size and quality of the concept set, and that there exists a tradeoff between the accuracy and faithfulness of these methods and their interpretability or understandability to humans (Vikram V. Ramaswamy et al. 2023a). By mapping model predictions to human-understandable concepts, however, concept-based explanations can provide transparency into the reasoning behind model outputs.

Inherent Interpretability

Inherently interpretable models are constructed such that their explanations are part of the model architecture and are thus naturally faithful, which sometimes makes them preferable to post-hoc explanations applied to black-box models, especially in high-stakes domains where transparency is imperative (Rudin 2019). Often, these models are constrained so that the relationships between input features and predictions are easy for humans to follow (linear models, decision trees, decision sets, k-NN models), or they obey structural knowledge of the domain, such as monotonicity (Gupta et al. 2016), causality, or additivity (Lou et al. 2013; Beck and Jackman 1998).

However, more recent works have relaxed the restrictions on inherently interpretable models, using black-box models for feature extraction and using a simpler inherently interpretable model for classification, allowing for faithful explanations that relate high-level features to prediction. For example, Concept Bottleneck Models (Koh et al. 2020) predict a concept set c that is passed into a linear classifier, and ProtoPNets (Chen et al. 2019) dissect inputs into linear combinations of similarities to prototypical parts from the training set.

Mechanistic Interpretability

Mechanistic interpretability methods seek to reverse engineer neural networks, often analogized to how one might reverse engineer a compiled binary or how neuroscientists attempt to decode the function of individual neurons and circuits in brains. Most research in mechanistic interpretability views models as a computational graph (Geiger et al. 2021), in which circuits are subgraphs with distinct functionality (Wang and Zhan 2019). Current approaches to extracting circuits from neural networks and understanding their functionality rely on manual human inspection of visualizations produced by circuits (Olah et al. 2020).

Alternatively, some approaches build sparse autoencoders that encourage neurons to encode disentangled interpretable features (Davarzani et al. 2023). This field is much newer than existing areas in explainability and interpretability, and as such most works are generally exploratory rather than solution oriented.

There are many open problems in mechanistic interpretability, including the polysemanticity of neurons and circuits, the inconvenience and subjectivity of human labeling, and the exponential search space for identifying circuits in large models with billions or trillions of neurons.

Challenges and Considerations

As methods for interpreting and explaining models progress, it is important to note that humans can overtrust and misuse interpretability tools (Kaur et al. 2020) and that a user’s trust in a model due to an explanation can be independent of the correctness of that explanation (Lakkaraju and Bastani 2020). As such, aside from assessing the faithfulness and correctness of explanations, researchers must also ensure that interpretability methods are developed and deployed with a specific user in mind, and that user studies are performed to evaluate their efficacy and usefulness in practice.

Furthermore, explanations should be tailored to the user’s expertise and the task for which they are using the explanation, conveying the minimal amount of information required to be useful so as to prevent information overload.

While interpretability/explainability are popular areas in machine learning research, very few works study their intersection with TinyML and edge computing. Given that a significant application of TinyML is healthcare, which often requires high transparency and interpretability, it is important that existing techniques are tested for scalability and efficiency with respect to edge devices. Many methods rely on extra forward and backward passes, and some even require extensive training of proxy models, all of which would likely be infeasible on microcontrollers that are resource constrained.

That being said, explainability methods can be highly useful in the development of models for edge devices, as they can give insights into how input data and models can be compressed and how representations may change post compression. Furthermore, many interpretable models are often smaller than their black-box counterparts, which could have additional benefits in TinyML applications.

16.5.6 Monitoring Model Performance

While developers may train models such that they seem adversarially robust, fair, and interpretable before deployment, it is imperative that both the users and the owners of the model continue to monitor the model’s performance and trustworthiness during the model’s full lifecycle. In practice, data is frequently changing, which can often result in distribution shifts. These distribution shifts can have profound impacts on both the vanilla predictive performance of the model as well as its trustworthiness (fairness, robustness, and interpretability) on real world data.

Furthermore, definitions of fairness also frequently change with time, such as what society considers a protected attribute, and the expertise of the users asking for explanations may change as well.

To ensure that models keep up to date with such changes in the real world, developers must continually evaluate their model on current and representative data and standards, and update models when necessary.
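As one minimal, hedged example of such monitoring, the sketch below uses a two-sample Kolmogorov-Smirnov test from SciPy to flag drift in a single feature; the feature values and alert threshold are illustrative assumptions rather than a prescribed monitoring policy.

```python
# Minimal drift-monitoring sketch: compare a live feature distribution against
# the training distribution with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)   # shifted in production

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"possible distribution shift (KS={stat:.3f}, p={p_value:.1e})")
```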

16.6 Implementation Challenges

16.6.1 Organizational and Cultural Structures

While innovation and regulation are often seen as having competing interests, many countries have found it necessary to provide oversight as AI systems expand into more sectors. As illustrated in Figure 16.4, this oversight has become crucial as these systems continue permeating various industries and impacting people’s lives (see Human-Centered AI, Chapter 8, “Government Interventions and Regulations”).

Figure 16.4: How various groups impact human-centered AI. Credit: Shneiderman (2020).

Among these are:

16.6.2 Obtaining Quality and Representative Data

Responsible AI design must occur at all stages of the pipeline, including data collection, as discussed in the Data Engineering chapter. This raises the question: what does it mean for data to be high-quality and representative? Consider the following scenarios that hinder the representativeness of data:

Subgroup Imbalance

This is likely what comes to mind when hearing the phrase “representative data.” Subgroup imbalance means that the dataset contains relatively more data from one subgroup than another. This imbalance can negatively affect the downstream ML model by causing it to overfit to one subgroup of people while performing poorly on another.

One example consequence of subgroup imbalance is racial discrimination in facial recognition technology (Buolamwini and Gebru 2018); commercial facial recognition algorithms have up to 34% worse error rates on darker-skinned females than lighter-skinned males.

Note that data imbalance goes both ways, and subgroups can also be harmfully overrepresented in the dataset. For example, the Allegheny Family Screening Tool (AFST) is used to predict the likelihood that a child will eventually be removed from a home. The AFST produces disproportionate scores for different subgroups, one of the reasons being that it is trained on historically biased data, sourced from juvenile and adult criminal legal systems, public welfare agencies, and behavioral health agencies and programs.

Quantifying Target Outcomes

This occurs in applications where the ground-truth label cannot be measured or is difficult to represent in a single quantity. For example, an ML model in a mobile wellness application may want to predict individual stress levels. The true stress labels themselves are impossible to obtain directly, and must be inferred from other biosignals, such as heart rate variability and user’s self-reported data. In these situations, noise is built into the data by design, making this a challenging ML task.

Distribution Shift

Data may no longer be representative of a task if a major external event causes the source of the data to change drastically. The most common way to think about distribution shift is with respect to time; for example, data on consumer shopping habits collected pre-COVID may no longer be representative of consumer behavior today.

Another form of distribution shift is caused by transfer. For instance, when applying a triage system trained on data from one hospital to another, distribution shift may occur if the two hospitals are very different.

Gathering Data

A reasonable solution for many of the above problems with non-representative or low-quality data is to collect more; we can collect more data targeting an underrepresented subgroup or collect more data from the target hospital to which our model might be transferred. However, there are also reasons that gathering more data is an inappropriate or infeasible solution for the task at hand.

  • Data collection can be harmful. This is the paradox of exposure, the situation in which those who stand to significantly gain from their data being collected are also those put at risk by the collection process (D’Ignazio and Klein 2023, Chapter 4). For example, collecting more data on non-binary individuals may be important for ensuring the fairness of an ML application, but it may also put them at risk, depending on who is collecting the data and how (e.g., whether the data is easily identifiable or contains sensitive content).

  • Data collection can be costly. In some domains, such as in healthcare, obtaining data can be costly in terms of time and money.

  • Biased data collection. For example, Electronic Health Records are a huge data-source for ML driven healthcare applications. Issues of subgroup representation aside, the data itself may be collected in a biased manner. For example, negative language (“nonadherent”, “unwilling”) is disproportionately used on black patients (Himmelstein, Bates, and Zhou 2022).

We conclude with several additional strategies for maintaining data quality: improving understanding of the data, employing effective data exploration tools, and establishing a feedback loop within the ML pipeline. First, fostering a deeper understanding of the data is crucial. This can be achieved through the implementation of standardized labels and measures of data quality, such as in the Data Nutrition Project.

Directly collaborating with organizations responsible for the data collection can help ensure that the data is interpreted correctly. Second, employing effective tools for data exploration is important. Visualization techniques and statistical analyses can reveal issues with the data. Finally, establishing a feedback loop within the ML pipeline is essential for understanding the real world implications of the data. Metrics, such as fairness measures, allow us to define “data quality” in the context of the downstream application; improving fairness may directly improve the quality of the predictions that the end users receive.

16.6.3 Balancing Accuracy and Other Objectives

Machine learning models are often evaluated on accuracy alone, but this single metric cannot fully capture model performance and tradeoffs for responsible AI systems. Other ethical dimensions like fairness, robustness, interpretability and privacy may compete with pure predictive accuracy during model development. For instance, inherently interpretable models such as small decision trees or linear classifiers with simplified features intentionally trade some accuracy for transparency into the model behavior and predictions. While these simplified models achieve lower accuracy by not capturing all complexity in the dataset, improved interpretability builds trust by enabling direct analysis by human practitioners.

Additionally, certain techniques meant to improve adversarial robustness, such as training on adversarial examples or dimensionality reduction, can degrade accuracy on clean validation data. In sensitive applications like healthcare, focusing narrowly on state-of-the-art accuracy carries ethical risks if it allows models to rely more on spurious correlations that introduce bias or use opaque reasoning. Therefore, the appropriate performance objectives depend greatly on the sociotechnical context.

Methodologies like Value Sensitive Design provide frameworks for formally evaluating the priorities of various stakeholders within the real-world deployment system. These elucidate tensions between values like accuracy, interpretability and fairness which can then guide responsible tradeoff decisions. For a medical diagnosis system, achieving the highest accuracy may not be the singular goal - improving transparency to build practitioner trust or reducing bias towards minority groups could justify small losses in accuracy. Analyzing the sociotechnical context is key for setting these objectives.

By taking a holistic view, we can responsibly balance accuracy with other ethical objectives for model success. Ongoing monitoring of performance along multiple dimensions is crucial as the system evolves after deployment.

16.7 Ethical Considerations in AI Design

We must discuss at least some of the many ethical issues at stake in the design and application of AI systems and diverse frameworks for approaching these issues, including those from AI safety, Human-Computer Interaction (HCI), and Science, Technology, and Society (STS).

16.7.1 AI Safety and Value Alignment

In 1960, Norbert Wiener wrote, “if we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire” (Wiener 1960).

In recent years, as the capabilities of deep learning models have matched, and sometimes even surpassed, human abilities, the issue of how to create AI systems that act in accordance with human intentions instead of pursuing unintended or undesirable goals has become a source of concern (Russell 2021). Within the field of AI safety, a particular goal concerns “value alignment,” or the problem of how to code the “right” purpose into machines (Human-Compatible Artificial Intelligence). Present AI research assumes we know the objectives we want to achieve and “studies the ability to achieve objectives, not the design of those objectives.”

However, complex real-world deployment contexts make explicitly defining “the right purpose” for machines difficult, requiring frameworks for responsible and ethical goal-setting. Methodologies like Value Sensitive Design provide formal mechanisms to surface tensions between stakeholder values and priorities.

By taking a holistic sociotechnical view, we can better ensure intelligent systems pursue objectives that align with broad human intentions rather than maximizing narrow metrics like accuracy alone. Achieving this in practice remains an open and critical research question as AI capabilities continue advancing rapidly.

The absence of this alignment can lead to several AI safety issues, as has been documented in a variety of deep learning models. A common feature of systems that optimize for an objective is that variables not directly included in that objective may be set to extreme values to help optimize for it, leading to issues characterized as specification gaming, reward hacking, and so on in reinforcement learning (RL).

In recent years, a particularly popular implementation of RL has been models pre-trained using self-supervised learning and fine-tuned using reinforcement learning from human feedback (RLHF) (Christiano et al. 2017). Ngo, Chan, and Mindermann (2022) argue that by rewarding models for appearing harmless and ethical while also maximizing useful outcomes, RLHF could encourage the emergence of three problematic properties: situationally aware reward hacking, where policies exploit human fallibility to gain high reward; misaligned internally represented goals that generalize beyond the RLHF fine-tuning distribution; and power-seeking strategies.

Similarly, Amodei et al. (2016) outline concrete problems for AI safety, including avoiding negative side effects, avoiding reward hacking, scalable oversight for aspects of the objective that are too expensive to evaluate frequently during training, safe exploration strategies that encourage creativity while preventing harm, and robustness to distributional shift in unseen testing environments.

16.7.2 Autonomous Systems and Control [and Trust]

The consequences of autonomous systems that act independently of human oversight, and often outside of human judgment, have been well documented across a number of different industries and use cases. Most recently, the California Department of Motor Vehicles suspended Cruise’s deployment and testing permits for its autonomous vehicles, citing “unreasonable risks to public safety”. One such accident occurred when a vehicle struck a pedestrian who stepped into a crosswalk after the stoplight had turned green and the vehicle was allowed to proceed. In 2018, a pedestrian crossing the street with her bike was killed when a self-driving Uber car, operating in autonomous mode, failed to accurately classify her moving body as an object to be avoided.

Autonomous systems beyond self-driving vehicles are also susceptible to such issues, with potentially graver consequences, as remotely-powered drones are already reshaping warfare. While such incidents bring up important ethical questions regarding who should be held responsible when these systems fail, they also highlight the technical challenges of giving full control of complex, real-world tasks to machines.

At its core, there is a tension between human and machine autonomy. Engineering and computer science disciplines have tended to focus on machine autonomy. For example, as of 2019, a search for the word “autonomy” in the Digital Library of the Association for Computing Machinery (ACM) reveals that of the top 100 most cited papers, 90% are on machine autonomy (Calvo et al. 2020). In an attempt to build systems for the benefit of humanity, these disciplines have taken without question increasing productivity, efficiency, and automation as primary strategies for benefiting humanity.

These goals put machine automation at the forefront, often at the expense of the human. This approach suffers from inherent challenges, as noted since the early days of AI through the frame problem and the qualification problem, which formalize the observation that it is impossible to specify all the preconditions needed for a real-world action to succeed (McCarthy 1981).

These logical limitations have given rise to mathematical approaches such as Responsibility-Sensitive Safety (RSS) (Shalev-Shwartz, Shammah, and Shashua 2017), which aims to break down the end goal of an automated driving system (namely safety) into concrete, checkable conditions that can be rigorously formulated in mathematical terms. The goal of RSS is for those safety rules to guarantee automated driving system safety in the rigorous form of mathematical proofs. However, such approaches tend toward applying automation to the problems of automation and are susceptible to many of the same issues.

Another approach to combating these issues is to shift the focus toward human-centered design of interactive systems that incorporate human control. Value-sensitive design (Friedman 1996) identified four design factors of a user interface that impact user autonomy: system capability, system complexity, misrepresentation, and fluidity. A more recent model, METUX (A Model for Motivation, Engagement, and Thriving in the User Experience), leverages insights from Self-Determination Theory (SDT) in psychology to identify six distinct spheres of technology experience that contribute to designing systems that promote wellbeing and human flourishing (Peters, Calvo, and Ryan 2018). SDT defines autonomy as acting in accordance with one’s goals and values, which is distinct from the use of autonomy as simply a synonym for independence or being in control (Ryan and Deci 2000).

Calvo et al. (2020) elaborate on METUX and its six “spheres of technology experience” in the context of AI recommender systems. They propose these spheres – Adoption, Interface, Tasks, Behavior, Life, and Society – as a way of organizing the analysis and evaluation of technology design so as to capture the contradictory and downstream impacts on human autonomy that can arise when people interact with AI systems.

16.7.3 Economic Impacts on Jobs, Skills, Wages

A major concern accompanying the current rise of AI technologies is widespread unemployment. As AI systems’ capabilities expand, many fear that these technologies will cause an absolute loss of jobs as they replace current workers and take over other employment roles across industries. However, changing economic landscapes at the hands of automation are not new, and historically they have reflected patterns of displacement rather than replacement (Shneiderman 2022, chap. 4). In particular, automation usually lowers costs and increases quality, which greatly increases access and demand. The need to serve these growing markets pushes production, which in turn creates new jobs.

Furthermore, studies have found that attempts to achieve “lights-out” automation – productive and flexible automation with a minimal number of human workers – have been unsuccessful. Such attempts have instead led to what the MIT Work of the Future taskforce has termed “zero-sum automation”, in which process flexibility is sacrificed for increased productivity.

In contrast, the taskforce proposes a “positive-sum automation” approach, in which flexibility is increased by designing technology that strategically incorporates humans where they are most needed: making it easier for line employees to train and debug robots; using a bottom-up approach to identify which tasks should be automated; and choosing the right metrics for measuring success (see MIT’s Work of the Future).

However, the optimism of this high-level outlook does not preclude individual harms, especially to those whose skills and jobs will be rendered obsolete by automation. Public and legislative pressure, as well as corporate social responsibility efforts, will need to be directed toward creating policies that share the benefits of automation with workers and result in higher minimum wages and benefits.

16.7.4 Scientific Communication and AI Literacy

A 1993 survey of 3,000 North American adults’ beliefs about the “electronic thinking machine” revealed two primary perspectives on the early computer: the “beneficial tool of man” perspective and the “awesome thinking machine” perspective. In this and other studies, the attitudes contributing to the “awesome thinking machine” view characterized computers as “intelligent brains, smarter than people, unlimited, fast, mysterious, and frightening” (Martin 1993). These fears highlight an easily overlooked component of responsible AI, especially amidst the rush to commercialize such technologies: scientific communication that accurately conveys the capabilities and limitations of these systems, while being transparent about the limits of experts’ own knowledge of them.

As AI systems’ capabilities continue to expand beyond most people’s comprehension, there is a natural tendency to imagine the kinds of apocalyptic worlds painted by the media. This is due in part to the difficulty of assimilating scientific information, even in technologically advanced cultures, which leads the products of science to be perceived as magic, “understandable only in terms of what it did, not how it worked” (Handlin 1965).

While tech companies should be held responsible for tempering grandiose claims and avoiding cycles of hype, research on scientific communication, especially with respect to (generative) AI, will also be useful for tracking and correcting public understanding of these technologies. An analysis of the Scopus scholarly database found that such research is scarce, with only a handful of papers mentioning both “science communication” and “artificial intelligence” (Schäfer 2023).

Research that exposes the perspectives, frames, and images of the future promoted by academic institutions, tech companies, stakeholders, regulators, journalists, NGOs, and others will also help identify gaps in AI literacy among adults (Lindgren 2023). An increased focus on AI literacy from all stakeholders will be an important tool in helping people whose skills are rendered obsolete by AI automation (Ng et al. 2021).

“But even those who never acquire that understanding need assurance that there is a connection between the goals of science and their own welfare, and above all, that the scientist is not a man altogether apart but one who shares some of their own value.” (Handlin 1965)

16.8 Conclusion

Responsible artificial intelligence is crucial as machine learning systems exert growing influence across sectors like healthcare, employment, finance, and criminal justice. While AI promises immense benefits, thoughtlessly designed models risk perpetrating harm through biases, privacy violations, unintended behaviors, and other pitfalls.

Upholding principles of fairness, explainability, accountability, safety, and transparency enables developing ethical AI aligned with human values. However, putting these principles into practice involves surmounting complex technical and social challenges around detecting dataset biases, choosing appropriate model tradeoffs, securing quality training data, and more. Frameworks like value-sensitive design provide guidance on balancing accuracy versus other objectives based on stakeholder needs.

Looking forward, advancing responsible AI necessitates continued research and industry commitment. More standardized benchmarks are required for comparing model biases and robustness. Enabling efficient transparency and user control for edge devices warrants focus as personalized TinyML expands. Revised incentive structures and policies must encourage deliberate, ethical development before reckless deployment. Education around AI literacy and limitations will further responsible public understanding.

Responsible methods underscore that while machine learning offers immense potential, thoughtless application risks adverse consequences. Cross-disciplinary collaboration and human-centered design are imperative so that AI can promote broad social benefit. The path ahead lies not in an arbitrary checklist but in a steadfast commitment, at each step, to understand and uphold our ethical responsibility. By taking conscientious action, the machine learning community can lead AI toward empowering all people equitably and safely.

Resources

Here is a curated list of resources to support both students and instructors in their learning and teaching journey. We are continuously working on expanding this collection and will be adding new exercises in the near future.

These slides serve as a valuable tool for instructors to deliver lectures and for students to review the material at their own pace. We encourage both students and instructors to leverage these slides to enhance their understanding and facilitate effective knowledge transfer.

To reinforce the concepts covered in this chapter, we have curated a set of exercises that challenge students to apply their knowledge and deepen their understanding.

Coming soon.

In addition to exercises, we also offer a series of hands-on labs that allow students to gain practical experience with embedded AI technologies. These labs provide step-by-step guidance, enabling students to develop their skills in a structured and supportive environment. We are excited to announce that new labs will be available soon, further enriching the learning experience.

Coming soon.