“The Alignment Problem: Machine Learning and Human Values” by Brian Christian – Review and Commentary – Part 1

The Alignment Problem: Machine Learning and Human Values by Brian Christian, W. W. Norton & Company (October 6, 2020), 496 pages

See Part 2 of the review here.


What is the alignment problem? It was first articulated in 1960 by Norbert Wiener in “Some Moral and Technical Consequences of Automation,” which says:

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it … then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.

The problem isn’t so much that our computers, artificial intelligences, and robotics need to align to human desires, values, and morals. The biggest problems are (A) what is it that humans actually desire and value, and what are our morals (and, more importantly, what should be our desires, morals, and values)? And (B) if we assume we know, or at least know to some sufficient threshold, what humans ought to desire and value, and what morals humans ought to abide by, how can these be made explicit in such a way that there will be no loopholes, workarounds, and unintended consequences? In other words, how can human desires, values, and morals be explicitly written out in computer code?

When it comes to artificial intelligence (AI) and the alignment problem, people often conjure images of Skynet or the Matrix, hyperadvanced superintelligences with the capacity and disposition to wipe out or enslave humankind. But there are ways in which the alignment problem can seem much more mundane, yet no less consequential. The alignment problem is not some far-off issue that future humans will need to grapple with; humankind is already facing it. (Indeed, it was while I was reading this book and writing this post that ChatGPT sprang onto the scene and brought the alignment problem to the forefront of people’s thoughts.) It is this more immediate and urgent alignment problem that Christian discusses in this book.

Christian begins with a prologue recounting the story of the young prodigy Walter Pitts, who as a twelve-year-old read and understood Bertrand Russell and Alfred North Whitehead’s Principia Mathematica well enough to discover errors in it, and his neurologist mentor Warren McCulloch. In 1943 the pair published a paper, “A Logical Calculus of the Ideas Immanent in Nervous Activity” (1990 reprint version), that showed how neurons can act like AND, OR, and NOT logic gates (e.g., a neuron with a low threshold for firing is like an OR gate, since it will fire if any one of its inputs sends a signal, and a neuron with a high threshold is like an AND gate, since it requires all inputs to signal in order to fire).
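To make the threshold idea concrete, here is a minimal sketch of a McCulloch-Pitts-style unit in Python. The function names, weights, and thresholds are my own illustrative choices, not taken from the paper or the book.

```python
def threshold_neuron(inputs, weights, threshold):
    """Fire (return 1) when the weighted sum of the inputs reaches the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total >= threshold else 0


def or_gate(a, b):
    # Low threshold: a single active input is enough to fire.
    return threshold_neuron([a, b], [1, 1], threshold=1)


def and_gate(a, b):
    # High threshold: both inputs must be active to fire.
    return threshold_neuron([a, b], [1, 1], threshold=2)


def not_gate(a):
    # Inhibitory input: the unit fires only when its input is silent.
    return threshold_neuron([a], [-1], threshold=0)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "OR:", or_gate(a, b), "AND:", and_gate(a, b))
print("NOT 0:", not_gate(0), "NOT 1:", not_gate(1))
```

The only difference between the OR unit and the AND unit is the threshold, which is the point of the example in the text.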

In the introduction (following the brief prologue) Christian begins by discussing three situations in which computers do exactly what they’re told to do, but it results in undesirable (and perhaps unforeseeable) outcomes.

First he discusses Google’s word2vec program, which took in enormous datasets of language from newspapers and internet sites and discovered patterns (e.g., that the word “Beijing” was used in relation to “China” in the same way that “Moscow” was used in relation to “Russia”). It then assigned vectors to words in order to perform mathematical operations on the words, such as “China + river = Yangtze”. However, it was soon discovered that the program had absorbed sexist associations, seen in cases like “doctor – man + woman = nurse”.
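The arithmetic on word vectors can be sketched with a toy example. The vectors below are hand-made and hypothetical (real word2vec embeddings have hundreds of learned dimensions), and the helper names are mine; the point is only to show how “b – a + c” followed by a nearest-neighbor search answers an analogy.

```python
import math

# Hand-made toy vectors, NOT real word2vec output.
# Dimensions: [is_capital, china, russia, france]
EMBEDDINGS = {
    "beijing": [1.0, 1.0, 0.0, 0.0],
    "china": [0.0, 1.0, 0.0, 0.0],
    "moscow": [1.0, 0.0, 1.0, 0.0],
    "russia": [0.0, 0.0, 1.0, 0.0],
    "paris": [1.0, 0.0, 0.0, 1.0],
    "france": [0.0, 0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))


print(analogy("china", "beijing", "russia"))  # prints "moscow"
```

The same mechanism is what produces the troubling analogies: if the training text uses “nurse” in contexts that sit on the “woman” side of the space, the nearest neighbor to “doctor – man + woman” will reflect that.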

Second he discusses racial biases found in machine learning algorithms used to assess the risk of recidivism in people arrested for crimes. This comes from the ProPublica study on the Northpointe program COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). You can see the results of the study here and here.

Third he discusses a slightly less grim account of a program that is meant to learn how to win a boat race. But since it was programmed to rack up the highest score, it converged on a solution of just doing donuts in a small space in order to collect points gained from passing through a certain part of the track.
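A toy version of the boat race shows why a score-maximizing program would prefer circling to finishing. Everything here (track length, point values, policy names) is a hypothetical setup of my own, not the actual game or anything from the book.

```python
TRACK_LENGTH = 10    # position of the finish line
BONUS_POSITION = 3   # a bonus that respawns every time it is collected
BONUS_POINTS = 10
FINISH_POINTS = 50
MAX_STEPS = 40       # length of one episode


def run(policy):
    """Simulate one episode and return the total score the policy earns."""
    position, score = 0, 0
    for _ in range(MAX_STEPS):
        position += policy(position)
        if position == BONUS_POSITION:
            score += BONUS_POINTS
        if position >= TRACK_LENGTH:
            score += FINISH_POINTS
            break
    return score


def finish_race(position):
    # What the designers intended: always drive forward.
    return 1


def circle_bonus(position):
    # What the score rewards: reach the bonus, back up, and hit it again.
    return 1 if position < BONUS_POSITION else -1


print("finish the race:", run(finish_race))    # 60
print("circle the bonus:", run(circle_bonus))  # 190
```

Under this scoring, never finishing the race is the better strategy, so a program that only sees the score is doing exactly what it was asked to do.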

Christian then defines the three types of machine learning: unsupervised, supervised, and reinforcement.

  1. Unsupervised: the program is given a bunch of data and told to make sense of it. This was seen in the case of word2vec.
  2. Supervised: the program is given categorized/labeled samples and told to make predictions about future instances. This was seen in the case of COMPAS.
  3. Reinforcement: the program is placed in an environment where it is rewarded or punished for various behaviors wherein the program seeks to maximize rewards and minimize punishments. This was seen in the case of the boat racing program.

In all three cases what is being illustrated is the alignment problem, which, as discussed above, is how to get AI to have values and ethics that align with (or at least do not unduly violate) human values and ethics. It’s easy to state in plain language things like the Golden Rule, but more difficult to put in a precise computer language. Additionally, we humans take for granted many things that an AI would not necessarily assume: if you told an AI car to get you from point A to point B as fast as possible, it might show up with you dead, the vehicle damaged, and half the police in pursuit because it sped down the road at 350 km/h and didn’t slow down for the turns. If you ask a human Uber driver to get you from A to B as fast as possible, the human knows that there is a plethora of unstated assumptions, such as keeping you alive and following traffic laws.

This book examines the alignment problem in three parts, titled Prophecy, Agency, and Normativity, each of which is divided into three chapters. The book can largely be viewed as a narrative of ways in which computer scientists have attempted to address the alignment problem in generating AI. In Prophecy, Christian examines the simplest kind of AI solutions, which are neural networks trained on large datasets or with explicitly stated objective functions (i.e., maximize rewards and minimize punishments). Because of the issues that crop up with this approach, such as the three cases from above, new approaches were necessary. The Agency section examines possible reward/punishment functions that can be used to get around the issues brought up in Prophecy. Normativity then discusses the solution of using imitation and inference rather than explicitly stated objective functions: watch what humans do and imitate it, or come up with objective functions by inferring what it is humans want by observing their behavior.

I will go through each of these three parts in a separate post, with this post focusing on Part 1: Prophecy. As usual, these reviews/commentaries are not a substitute for reading the book, but at best a supplement. If these posts whet your appetite for this subject, then I recommend reading the whole book.

Part 1: Prophecy

The word prophecy seems to refer to one of the four types of AI put forth by Nick Bostrom in Superintelligence. These four types are the tool, the oracle, the genie, and the sovereign. The tool is essentially what we have right now, where the AI is completely dependent on its coding and its commands for what its objectives are and what means it uses to accomplish those objectives. The computer or phone you have right now responds to your commands in completely reliable and reproducible ways. The oracle is more like what “the algorithm” is, i.e., the algorithms that take in your data and make predictions about your future desires (e.g., so they can advertise to you or recommend videos to you) or actions (e.g., the COMPAS program attempting to predict one’s risk of recidivism). This is where the word prophecy comes from, because oracle AI programs try to predict, or prophesy, the future. Bostrom foresees that a superintelligent AI could become arbitrarily good at making predictions by taking in more and more data (e.g., about political and economic activity, weather patterns, trends on social media and TV, spying on people’s conversations and metadata through their phones, and so on) and using better and better algorithms running on better and better hardware, essentially becoming an oracle that can make predictions about complex systems with extremely high accuracy.

Genies then take the next step and start being able to take concrete actions in the real world. A person could give it a command (make a wish) and then the genie would accomplish the objective. What makes a genie more powerful than a tool, though, is that a genie would be capable of adapting and discovering novel solutions. Say, for instance, the U.S. military had a superintelligent AI genie and they gave it the command to have Vladimir Putin overthrown. The genie could then act autonomously to find ways of doing this: creating millions of online bots to spread disinformation, hacking into high level Russian military and intelligence networks to manipulate information, and socially engineering scores of Russian citizens and politicians to want to overthrow Putin. Once the objective is accomplished, the genie then awaits its next command.

The sovereign is like the genie, but it is able to set its own objectives (it will not simply await its next command once an objective is accomplished). It would likely have some kind of prime directive coded in (e.g., maximize human flourishing), but it could then set numerous sub-directives and objectives. For instance, if we had a sovereign with the prime directive to maximize human flourishing, it would be capable of seeing that Putin hinders this objective and so the sovereign would independently generate the objective of having Putin overthrown. It’s this type of superintelligent AI that sci-fi loves.

In this first section of the book, Christian discusses AI Prophets that are made using neural networks and machine learning.

Chapter 1: Representation

The first chapter in the Prophecy section of The Alignment Problem begins with a discussion of Frank Rosenblatt’s perceptron and Rosenblatt’s theorem. The theorem says (from the last link): “…elementary perceptrons can solve any classification problem if there are no discrepancies in the training set.” In other words, as long as the data set is consistent, a neural network can learn any classification one wants. Put yet another way, there is no way of categorizing things that a sufficiently deep neural network could not learn.

Christian briefly discusses the cycle of enthusiasm for artificial intelligence. After Rosenblatt’s perceptron there was great enthusiasm for AI. That is, up until the publication in 1969 of Perceptrons: An Introduction to Computational Geometry by Marvin Minsky and Seymour Papert, which showed that the shallow neural networks used by Rosenblatt and others were not sufficient for certain categorization tasks (e.g., determining if there are an even or odd number of objects). In the 1980s the enthusiasm returned when deeper neural networks came into use, for instance by the post office to read zip codes. The enthusiasm again waned in the late 1990s when computer scientists bumped up against their limitations: their training data sets were too small and their computers were too slow. But in the mid-2000s, with computers becoming faster and the internet more widely used (thus supplying greater data sets to train on), enthusiasm once again rose. This most recent increase in enthusiasm was driven in particular by AlexNet, with which Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton showed in 2012 that deep neural networks running on GPUs could categorize images with far less error than earlier approaches.
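Both the promise and the limitation of the shallow perceptron can be sketched in a few lines. This is a simplified single-unit learning rule of my own writing (the function name, learning rate, and epoch limit are assumptions), not Rosenblatt’s original hardware or the exact construction analyzed by Minsky and Papert. It learns AND, which a single line can separate, but never settles on XOR, the two-input version of the even/odd task.

```python
def train_perceptron(samples, epochs=25, lr=1.0):
    """Train one threshold unit; return (weights, bias, converged)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            prediction = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            delta = target - prediction
            if delta != 0:
                # Nudge the weights toward the correct answer.
                weights[0] += lr * delta * x1
                weights[1] += lr * delta * x2
                bias += lr * delta
                errors += 1
        if errors == 0:
            return weights, bias, True  # a full pass with no mistakes
    return weights, bias, False


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND_DATA)[2])  # True
print("XOR converged:", train_perceptron(XOR_DATA)[2])  # False
```

No setting of two weights and a bias can classify all four XOR cases correctly, so raising the epoch limit would not help; adding a hidden layer does.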

Problems arose. The data sets that neural networks are trained on are often biased in favor of white men. Christian discusses the 2015 controversy where Google Photos tagged images of black people as gorillas. He talks about the finding that the Labeled Faces in the Wild (LFW) dataset compiled in 2007 was 77% male and 83% white, and that neural networks trained on such datasets make 100-fold more mistakes in identifying dark-skinned women than they do light-skinned men.

Similarly for language, various biases cropped up. Programs meant to scan hundreds of résumés for the best candidates are trained on natural language, for instance online or even from previous résumés. These programs are similar to the word2vec discussed above in that words are given a set of numerical values representing their “position” in the vector space such that related words end up near one another (i.e., words more often found together in specific contexts, such as “Beijing” and “China” being used together in the same way as “Moscow” and “Russia” or “Washington D.C.” and “America”). But these programs would often stereotype, such that a program screening candidates for a company looking to hire software engineers would rank men as better candidates than women, even when everything else in the résumé was identical (i.e., everything except the name). And even when names and gender pronouns are corrected for, subtle differences in diction and the choice of synonyms can often give away a candidate’s gender.

Interestingly, social scientists have used the way language reflects implicit sentiments to examine attitudes toward race and sex over the decades. Using Google Books/Corpus of Historical American English (COHA) embeddings, Garg et al. looked at attitudes toward women and Asian Americans since 1910. The good news is that things have gotten markedly better, but the bad news is that it isn’t fixed.

The problem with this isn’t just that these issues are awkward, or even that it reflects badly on society (the programs are trained on data gathered from natural language and so act as a sort of mirror to our attitudes, as shown in the above study), but that without fixing the problem, the adoption of such technologies could exacerbate and amplify the problem. If programs trained on real world data rate men as better candidates than women for software engineering positions, then the problem of sexism in hiring practices will persist. This then further reinforces this bias in the data used to train programs, thereby continuing the cycle of sexism in perpetuity. The problem here isn’t just that men and women may, on average, have different interests and career preferences. Two identical candidates for a position who differ only in their sex can be treated differently, which is exactly antithetical to the ideal of meritocracy, since a person in a meritocratic system should be evaluated for past accomplishments (as a predictor for present and future capabilities), not on accidental characteristics like race and sex.

Even when such biases are “corrected” for, such as removing gendered words or normalizing so that all words are equidistant from the gender axis in the vector space, a person’s gender can often be accurately surmised anyway just based on the kinds of words often associated with it (see Table 2 in the above figure for the kinds of adjectives associated with women, for instance). So, even if “male” and “female”, “he” and “she”, “boy” and “girl”, and so on, are all corrected for, there is still a sort of aura of words often associated with each sex that continues to nucleate around them, making it extremely difficult to eliminate such biases. Furthermore, it ends up inhibiting software in other ways, such as when gendered words are used in non-gendered ways (e.g., “man your stations” or “the clause was grandfathered in”).
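Here is a rough sketch of what normalizing against the gender axis can look like, and why the aura survives it. The three-dimensional vectors are hypothetical values I made up for illustration (real embeddings are learned and much larger), and projecting out a single “he minus she” direction is only one simple version of the correction.

```python
import math

# Hypothetical toy vectors; dimensions are [gender, caring, technical].
VECTORS = {
    "he": [1.0, 0.0, 0.0],
    "she": [-1.0, 0.0, 0.0],
    "nurse": [-0.8, 0.9, 0.1],
    "receptionist": [-0.7, 0.8, 0.1],
    "engineer": [0.8, 0.1, 0.9],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def neutralize(vector, direction):
    """Remove the component of `vector` that lies along `direction`."""
    scale = dot(vector, direction) / dot(direction, direction)
    return [x - scale * d for x, d in zip(vector, direction)]


gender_axis = [a - b for a, b in zip(VECTORS["he"], VECTORS["she"])]
debiased = {word: neutralize(vec, gender_axis)
            for word, vec in VECTORS.items() if word not in ("he", "she")}

# After the correction no word leans toward "he" or "she"...
print({word: round(dot(vec, gender_axis), 6) for word, vec in debiased.items()})
# ...but words that shared a gender association still cluster together.
print(round(cosine(debiased["nurse"], debiased["receptionist"]), 3))
print(round(cosine(debiased["nurse"], debiased["engineer"]), 3))
```

In this toy setup “nurse” and “receptionist” remain almost identical after the correction while “engineer” stays far away, so anything that learns from the neighborhoods can still recover the original grouping.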

This is the alignment problem. We humans value meritocracy (at least to certain degrees) and fairness. We thus want our AI oracles to reflect these values. We don’t want software that reflects our human biases; we want software that gets us closer to the ideals of meritocracy and fairness. Our technology should not exacerbate the problems we know about ourselves, but should help us improve ourselves and live up to the higher ideals that we humans so reliably fail to meet.

Chapter 2: Fairness

The second chapter in the Prophecy section begins by diving into more detail about the COMPAS case. This program rates defendants on a scale from 1-10, with 1 being least likely to recidivate and 10 being most likely. What ProPublica found was that black defendants were evenly distributed, with about 10% falling into each of the ten possible scores, while white people were disproportionately rated lower. Since the study came out in 2016, there have been no end of interpretations and refutations (for example see Northpointe’s response here and a more recent independent response here).

Northpointe’s response essentially says that their software is fair in two important ways: it is equally calibrated across white and black defendants, and it is equally accurate at predicting recidivism for both groups. Calibration means that someone rated a 7 had just as much chance of recidivism as any other person rated a 7, regardless of race. Same for a 1, a 2, a 3, and so on. In other words, a 1 was a 1 was a 1, a 5 was a 5 was a 5, and a 10 was a 10 was a 10, across the board. Equal accuracy means that the predictions were right just as often for white people as they were for black people. More specifically, the program was correct 61% of the time and wrong 39% of the time, equally for both black people and white people. Taken at face value, this picture suggests that it just is the case that black people are more likely to re-offend than white people.

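What is actually going on, however, is that COMPAS was overestimating the number of black defendants who would recidivate and underestimating the number of white defendants who would recidivate. Black defendants who did not go on to re-offend were more likely to have been rated high risk (false positives), while white defendants who did re-offend were more likely to have been rated low risk (false negatives). In other words, Northpointe focused on the rate of errors (the 39%) while ignoring the type of errors.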
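The difference between the rate of errors and the type of errors is easy to see with a small calculation. The counts below are hypothetical numbers I chose so that the arithmetic is clean; they are not the ProPublica or Northpointe figures.

```python
def rates(tp, fp, fn, tn):
    """Summarize a confusion matrix (tp = rated high risk and did re-offend, etc.)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),  # did not re-offend, rated high risk
        "false_negative_rate": fn / (fn + tp),  # did re-offend, rated low risk
    }


# Hypothetical counts for two groups of 1,000 defendants each.
group_a = rates(tp=300, fp=200, fn=100, tn=400)
group_b = rates(tp=150, fp=100, fn=200, tn=550)

for name, result in (("group A", group_a), ("group B", group_b)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Both groups come out 70% accurate, yet group A’s errors are mostly false positives and group B’s are mostly false negatives. An audit that reports only accuracy would call this fair, while an audit that reports error types would not.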

These biases arise from what is called redundant encoding. As discussed above, there is a sort of aura of words that clings to a person that is indicative of traits like sex and race. Nova on PBS also recently discussed this subject. What is happening is that the racism of the algorithm comes from racism inherent in the dataset: because police, judges, parole officers, etc. make racially biased decisions, the dataset the algorithm was trained on was racist. And so, even when race is corrected for in the algorithm, the data set still generates this halo of words or concepts around a person that is based on racial biases in the data: if police are arresting black people at disproportionately higher rates, then black people will inevitably end up with larger rap sheets, which will be noticed by the algorithm. Put another way: these programs do not predict crime rates, they predict arrest and conviction rates, which are subject to the biases of police and the criminal justice system. This has the added problem that policing is influenced by where arrests occur, and so it can create a positive feedback loop.

The problem with trying to blind software to traits like sex and race is that sometimes, in order to correct for these things, one must know the sex or race of a person. For instance, if our program looks for candidates on a résumé and uses whether they had a job in the last year as one of the parameters, then knowing that the person is female and had been on maternity leave would be important information to know, since such a circumstance should not count against a candidate for a job. But if the program is sex-blind, then all it will see is that the candidate had been out of work for some time.

Thus there is a push for fairness through awareness (awareness of certain traits, such as sex or race). But all this, particularly the 2016 ProPublica articles, has spurred much discussion on the topic of fairness. The discussion was kicked off in particular in 2017 with the publication of three papers:

  1. Inherent Trade-Offs in the Fair Determination of Risk Scores
  2. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments
  3. Algorithmic decision making and the cost of fairness

The first paper found that the definitions of fairness (accuracy of true positives and true negatives, equalizing false positives and false negatives, and calibration) used by ProPublica and Northpointe were incompatible if there was not an equal base rate of recidivism between black people and white people. The second paper similarly shows that without equal base rates, an instrument whose scores are equally predictive across groups cannot also have equal false positive and false negative rates. The third paper found “…that optimizing for public safety yields stark racial disparities; conversely, satisfying past fairness definitions means releasing more high-risk defendants, adversely affecting public safety.” Or, as Christian puts it:
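The incompatibility can be checked with simple arithmetic. From the definitions of the rates, the false positive rate is pinned down once the base rate, the positive predictive value (how often a “high risk” rating is right), and the false negative rate are fixed. The function name and the example numbers below are my own illustrative assumptions, not figures from the papers.

```python
def implied_false_positive_rate(base_rate, ppv, fnr):
    """False positive rate forced by the other three quantities.

    Follows from the definitions:
        FPR = base_rate / (1 - base_rate) * (1 - PPV) / PPV * (1 - FNR)
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Hold predictive value and false negative rate equal for both groups
# (hypothetical values), and vary only the base rate of re-offense.
PPV = 0.6
FNR = 0.3

fpr_group_a = implied_false_positive_rate(base_rate=0.5, ppv=PPV, fnr=FNR)
fpr_group_b = implied_false_positive_rate(base_rate=0.4, ppv=PPV, fnr=FNR)

print(round(fpr_group_a, 3))  # 0.467
print(round(fpr_group_b, 3))  # 0.311
```

With different base rates, the two groups cannot share the same false positive rate unless the predictive value or the false negative rate is allowed to differ, which is the trade-off the papers describe.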

…if a set of equally desirable criteria are impossible for any model to satisfy, then any exposé of any risk-assessment instrument whatsoever is guaranteed to find something headline-worthy to dislike.

Meaning, essentially, that there is no perfectly “fair” algorithm that will make everyone happy. These equally desirable measures of fairness cannot all be reconciled. For instance, equalizing false positives (rating someone as a recidivism risk when they are not a recidivism risk) would require applying different standards based on race, e.g., a black person rated 7 is the same as a white person rated 6. This would violate the Equal Protection Clause of the U.S. Fourteenth Amendment in two ways: detaining more white people who are rated as lower risk, and exposing black communities to people rated as higher risk (i.e., it may make crime worse in black communities if people who are at higher risk of recidivism are released). Likewise, it would also mean that the numbers are no longer calibrated: what does it even mean to be rated a 6 if its meaning depends on things like race or sex (recidivism rates differ between men and women as well)?

Christian does point out that subjective human assessments of people up for parole are no better, and often worse, than our algorithms. This leaves us in somewhat of a double bind as to the best way to make such predictions.

Chapter 3: Transparency

A glaring problem with a lot of machine learning models is that they are what are known as black boxes: we can’t easily determine how the neural network is actually processing the data and generating results. This causes a number of problems, one of which is that we can’t know what errors the program may be making. Christian discusses the example of using a neural network to assess whether patients diagnosed with pneumonia ought to be treated in-patient or out-patient. The neural network designed by Richard Ambrosino in the 1990s recommended that patients with asthma be treated out-patient, even though doctors know that such patients are at significantly higher risk. The reason came to light only because a competing, much simpler program (one that went through a series of if-then statements instead of being a neural network) showed that patients with asthma had higher survival rates because they were always treated in-patient and given special care. Being trained on this data set made it appear as though having asthma (as well as other things like having a heart condition) was a predictor of better outcomes. But it would be impossible to make such a determination by inspecting the neural network’s weights.

Christian goes on to describe how, in fact, simple models are often very good at making predictions – better than “optimized” models and much better than humans. He discusses the work of people like Theodore Sarbin, Paul Meehl, and Robyn Dawes, whose mid-twentieth century work demonstrated that even very simple models outcompete experts in making good predictions.

However, the input of people is still fundamental: it is people who decide which parameters are the ones that ought to be evaluated in the model. But while this may work well, it’s likely not optimal. As such, some seek to look through large datasets not to train a neural network, but to look for the minimal number of parameters that function as the best predictors using a simple (and therefore more transparent) model.

Another way of adding transparency that Christian discusses is adding more outputs to the model to give context to the main output, in what is called multitask learning. For instance, in the case of patients with pneumonia, having length of hospital stay and the medical bill as outputs helps show that mortality alone isn’t all that is important: even though an asthmatic might have lower mortality due to being given more urgent treatment, they will likely end up with longer hospital stays and larger bills.

The gold standard, of course, would be to understand what a neural network is doing at every step of its processing. Solutions for neural networks meant to categorize images have been improving. But how is it that neural networks actually categorize images? This can be done in multiple ways, some of the more popular being deep learning, recurrent neural networks, convolutional neural networks, and deep reinforcement learning. But in all cases, the general idea is to use multiple layers of nodes with weighted connections and have the weights of those connections adjusted through the training process. The following videos do a great job of explaining it.
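For readers who prefer code to video, the general idea of layered nodes whose connection weights are adjusted by training can be sketched as below. This is a deliberately tiny network (two inputs, three hidden nodes, one output) trained with plain gradient descent on the XOR task; the sizes, learning rate, and seed are arbitrary assumptions of mine, and real image classifiers are vastly larger.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Pass one input through the hidden layer and then the output node."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(w1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)
    return hidden, output


def train(samples, hidden_units=3, epochs=3000, lr=0.5, seed=0):
    """Full-batch gradient descent; returns the mean squared error per epoch."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(hidden_units)]
    b1 = [0.0] * hidden_units
    w2 = [rng.uniform(-1, 1) for _ in range(hidden_units)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        g_w1 = [[0.0, 0.0] for _ in range(hidden_units)]
        g_b1 = [0.0] * hidden_units
        g_w2 = [0.0] * hidden_units
        g_b2 = 0.0
        loss = 0.0
        for x, y in samples:
            hidden, out = forward(x, w1, b1, w2, b2)
            loss += (out - y) ** 2
            # Backpropagation: how much each weight contributed to the error
            # (the constant factor of 2 is folded into the learning rate).
            d_out = (out - y) * out * (1 - out)
            g_b2 += d_out
            for j in range(hidden_units):
                g_w2[j] += d_out * hidden[j]
                d_hidden = d_out * w2[j] * hidden[j] * (1 - hidden[j])
                g_w1[j][0] += d_hidden * x[0]
                g_w1[j][1] += d_hidden * x[1]
                g_b1[j] += d_hidden
        losses.append(loss / len(samples))
        # Adjust every connection weight a little in the direction that reduces error.
        for j in range(hidden_units):
            w2[j] -= lr * g_w2[j]
            w1[j][0] -= lr * g_w1[j][0]
            w1[j][1] -= lr * g_w1[j][1]
            b1[j] -= lr * g_b1[j]
        b2 -= lr * g_b2
    return losses


XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
history = train(XOR_SAMPLES)
print("error before training:", round(history[0], 4))
print("error after training: ", round(history[-1], 4))
```

Nothing in the trained network says “this is XOR” in any readable way; the behavior lives entirely in the numeric weights, which is why such models are hard to inspect.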

To shine light into the black box of a neural network (the “hidden layers”), computer scientists can use deconvolutional networks to show what parts of the image are “seen” at each node of the neural network. Techniques that earned some meme popularity, like DeepDream, can be used to test and calibrate the neural network. For instance, a neural network trained to see dogs can take in an image and output a version in which the somewhat dog-like parts of the image are made slightly more dog-like. Doing this iteratively gives strange-looking images like the one below:

The practical use of this exercise is to see what kinds of things the program will output. So, for example, if you do this exercise with a neural network trained to see human faces and it outputs something with a bunch of white male faces, then that tells you that it was likely trained on mostly white male faces and so it is much better at detecting white male faces.

Part 1 Concluding Remarks

Although artificial intelligence has seen great improvements even in the time since this book was published, there is still a sort of uncanniness to it that makes it somewhat easy to distinguish from real human works (at least after it is revealed that something was produced by AI). But a common refrain is that AI right now is worse than it ever will be in the future. In other words, if you think that the things an AI can do are unimpressive, just keep in mind that they will only ever get better. And, as we’ve seen accelerating improvements, the alignment problem looms larger and larger over the field: the time to get the alignment problem right is now.

For some potential uses of AI in the very near future, I highly recommend the following video:

Stay tuned for part 2 of this review, which will cover reinforcement learning.