“The Alignment Problem: Machine Learning and Human Values” by Brian Christian – Review and Commentary – Part 3

The Alignment Problem: Machine Learning and Human Values by Brian Christian, W. W. Norton & Company (October 6, 2020), 496 pages

See Part 1 of the review here.

See Part 2 of the review here.


Possibly one of the biggest issues in philosophical methodology is that of definitions – in particular, definitions that attempt to establish necessity and sufficiency. What is meant by necessity is the set of conditions that must be satisfied in order for some thing (physical object, concept) to be an instance of some more general concept. What is meant by sufficiency is the set of conditions that, if they happen to be satisfied, make the thing (physical object, concept) automatically an instance of that more general concept. As an example: what are the necessary and sufficient conditions for an object to be a mammal? Some necessary, but not sufficient, conditions would be: being a living thing, being in the animal kingdom, being a vertebrate, and giving live birth. What we mean by this is that for something to be a mammal it must, for instance, give live birth, but simply being something that gives live birth does not automatically make it a mammal (e.g., sharks give live birth). A candidate for a sufficient condition might be that of warm-bloodedness – anything that is warm-blooded is automatically a mammal, but something can be a mammal without being warm-blooded.

Now, many of you may have already thought of objections. Not all mammals give live birth (e.g., the platypus and the echidna), and things can be warm blooded without being a mammal (e.g., birds). And so you might think: why not change it around? Make live birth a sufficient, but not necessary condition and make warm-bloodedness a necessary, but not sufficient condition? Well, with live birth we now run into the issue already parenthetically addressed: sharks give live birth, but they are not mammals. With warm-bloodedness we have the counterexample of the naked mole-rat, which does not regulate its own body temperature and so is a cold blooded mammal.

Such is the issue with necessity and sufficiency: in many cases (indeed, arguably the vast majority of cases) there are counterexamples. It becomes difficult to define anything using necessary and sufficient conditions, and so our categories and concepts become fuzzy, malleable, and riddled with qualifications and exceptions. This has been pounced on by postmodernists and queer theorists, who argue that our categories (e.g., male and female) should be considered anywhere along a spectrum from tentative at best to useless or actively harmful at worst. Various fixes and alternatives to necessity and sufficiency, such as prototypes and homeostatic property clusters, have been put forward, but these often run into their own issues.

These issues are inconvenient at best when it comes to defining and categorizing physical things in the world. Where it gets even hairier is when attempting to do this with more abstract or nebulous things, like cultural norms or morality. For instance: how does one define a concept like “good” such that it is universally applicable (or even applicable in the majority of cases)? Or concepts like “justice” or “wrong” or “murder”? For instance, with murder: is it murder when a soldier kills an enemy soldier in war? What about if the war is considered unjust, however we might define such a war? What if a soldier accidentally kills a noncombatant, i.e., collateral damage? What if a soldier accidentally kills a comrade through friendly fire? What about a civilian killing another civilian in self-defense? Or killing another civilian when believing it is self-defense, but the victim was not actually attempting to cause harm?

The point here is that morality is difficult to pin down in terms of necessary and sufficient conditions. Yet in computer code we are always working in the realm of necessity and sufficiency. To code, for instance, when it is permitted to kill a human, we would need thousands of lines dedicated to if-else statements at varying levels of nesting, and the result would still almost certainly be far from perfect, much less universally agreed upon. Indeed, trolley-problem-type scenarios with self-driving cars are already at hand: when should a car veer to avoid hitting a pedestrian (or group of pedestrians), especially when doing so will kill the passenger(s)? And what about trade-offs between injuries of various severity, or multiple injuries vs. a death (e.g., how many broken toes is a life worth)?
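To make the brittleness concrete, here is a toy sketch (entirely hypothetical rules and names, not a serious proposal) of what hard-coding “when is killing permitted?” as nested if-else statements starts to look like; every branch immediately invites the counterexamples just discussed:

```python
# A hypothetical illustration of why nested if-else rules for morality fail:
# each clause raises new undefined questions, so the rule set never closes.

def killing_permitted(actor, victim, context):
    if context.get("war"):
        if context.get("war_is_just"):          # but what makes a war "just"?
            if victim == "enemy_combatant":
                return True
            if victim == "noncombatant":
                # Collateral damage: intent? proportionality? negligence?
                return context.get("accidental", False)
        return False                            # all killing in unjust wars?
    if context.get("self_defense"):
        # What if the threat was only perceived, not real?
        return context.get("threat_real", False)
    return False                                # euthanasia? capital punishment?
```

Each `context` flag papers over a philosophical debate, and the tree only grows as counterexamples accumulate – which is the point.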

Coding all of these things is both prodigiously impracticable and doomed to fail. And so the question is: how can we get our AI to align to our human values and morals?

Part 3: Normativity

Well, what better way to get artificial intelligence to adopt human values than simply to have the AI imitate humans? If we cannot give an exhaustive list of everything humans value and put it into machine language, why not just show the AI the kinds of things humans value by allowing it to observe humans, and then have it infer from those observations what it ought to value?

Chapter 7: Imitation

It turns out that apes are not as great at imitation as the hype would have us believe (apes are not actually very good at “aping” each other, with the exception of chimpanzees). The one glaring exception is humans – we are masters of aping one another. In fact, we’re so good at it that we often engage in overimitation, which even chimps do not do. Christian says (discussing this study):

One of the most revealing, and intriguing, studies [of overimitation] involved plastic boxes with two locked openings: one on top and one on the front. The experimenter demonstrated first unlocking the top opening, then unlocking the front, then reaching into the front to get a bag of food. When chimpanzees saw this demonstration using an opaque black box, they faithfully did both actions in the same order. But when experimenters used a clear box, the chimpanzee could observe that the top opening had nothing whatsoever to do with the food. In this case, the chimpanzee would then go straight to the front opening, ignoring the top one altogether. The three-year-old children, in contrast, reproduced the unnecessary first step even when they could see that it did nothing.

It was theorized that perhaps humans, in this instance, are simply slower to develop the relevant skill. Researchers shifted from studying three-year-olds to studying five-year-olds. The overimitation behavior was even worse! The children were more prone to overimitate than the younger children. [all italics in original]

He goes on to say that, even when the children were told not to do the unnecessary step, they continued to overimitate anyway. Why might this be the case? Christian says it has to do with humans possessing a theory of mind (discussing this study):

Suddenly it began to make sense that such behavior increased from age one to three, and again from three to five. As children grow in their cognitive sophistication, they become better able to model the minds of others. Sure, they can see – in the case of the transparent cube – that the adult is opening a latch that has no effect. But they realize that the adult can see that too! If the adult can see that what they’re doing has no apparent effect, but they still do it anyway, there must be a reason. Therefore, even if we can’t figure out what the reason is, we’d better do that “silly” thing too.

More recent work has established just how subtle these effects can be. Children are, from a very young age, acutely sensitive to whether the grown-up demonstrating something is deliberately teaching them, or just experimenting. When adults present themselves as an expert – “I’m going to show you how it works” – children faithfully reproduce even seemingly “unnecessary” steps that the adult took. But when the adult presents themselves as unfamiliar with the toy – “I haven’t played with it yet” – the child will imitate only the effective actions and will ignore the “silly” ones. Again it appears that seeming overimitation, rather than being irrational or lazy or cognitively simple, is in fact a sophisticated judgement about the mind of the teacher. [all italics in original]

The question then is: why is imitation a good way to learn? Christian points out four advantages of imitation: it’s efficient (acquiring knowledge from someone else’s effort), it demonstrates that some action is actually possible (the person being imitated demonstrates this), it’s safer (one can benefit from others’ mistakes), and it is able to demonstrate things that are often difficult to describe in words.

Learning by imitation can also be used in machine learning. In training self-driving vehicles, the trainer just has to drive around and let the computer learn by watching what the trainer does in different situations (e.g., how to make turns, stop for other vehicles and pedestrians, when to slow down and speed up, etc.).
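This idea is often called behavioral cloning, and a minimal sketch of it (hypothetical data, with plain least-squares regression standing in for a real neural network) looks like this:

```python
import numpy as np

# A minimal behavioral-cloning sketch (not the book's code): the "trainer"
# provides (observation, action) pairs, and the learner fits a supervised
# map from observations to steering commands.

rng = np.random.default_rng(0)

# Hypothetical demonstrations: 4-dim road features -> steering angle.
true_w = np.array([0.5, -1.0, 0.25, 0.0])   # the trainer's (unknown) policy
obs = rng.normal(size=(200, 4))             # what the trainer "saw"
actions = obs @ true_w                      # what the trainer did

# Supervised imitation: recover the trainer's policy by regression.
w_hat, *_ = np.linalg.lstsq(obs, actions, rcond=None)

def policy(observation):
    """Imitate the trainer: predict the action they would have taken."""
    return observation @ w_hat
```

The learner never receives a reward signal or a rule book – only examples of what the trainer did, which is exactly the “watch and learn” setup described above.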

For more on imitation learning, see the following:

“Playing hard exploration games by watching YouTube” by Aytar et al., where the researchers trained an agent to play Montezuma’s Revenge (a famously difficult game for AI due to having to link together multiple actions with no reward for each sub-action) by having it watch people play the game on YouTube. This was done even before the intrinsic motivation revolution in AI occurred (see part 2 of this review for more on that).

“ALVINN: An Autonomous Land Vehicle in a Neural Network” by Dean A. Pomerleau and “Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving” by Dean A. Pomerleau discuss very early (late 1980s and early 1990s) versions of self-driving vehicles that were trained using a neural network observing how human drivers drive (e.g., “if the road looks like this, then drive like that”).

“Exploration from Demonstration for Interactive Reinforcement Learning” by Kaushik Subramanian, Charles L. Isbell Jr., and Andrea L. Thomaz discusses what they call Exploration from Demonstration (EfD), which uses human demonstrations to guide search-space exploration – important since reinforcement learning (RL) can become extremely computationally complex as the size of the problem scales upward. The paper “Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards” by Vecerik et al. does something very similar.

There are, however, three issues with imitation: 1) learning to recover from mistakes; 2) possibilism vs. actualism; 3) being unable to surpass the thing being imitated, i.e., the limitations of the trainer constrain the trainee.


Learning to Recover from Mistakes

If all you do is show the computer the correct way to do things, how will it know how to correct itself if it makes a small mistake? When it fails to recover from that mistake, it will make a bigger mistake (since it doesn’t know what to do), leading to bigger mistakes still. This is known as an error cascade.

How can this be addressed? One way, when training a self-driving vehicle, might be to swerve about in order to show the program how to recover. But then the program will also just learn to swerve (one would need to erase the initial swerving from the program’s memory). Furthermore, it would require demonstrating corrections for the innumerable ways things can go wrong – no two swerves are exactly alike. What can be done instead is to take the data (the recordings of the trainer driving), slightly tilt or skew each image to one side or the other, and then train the vehicle to nudge back toward the center.
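The tilt-and-nudge trick can be sketched like so (a hypothetical simplification in the spirit of ALVINN, not the actual system: a sideways pixel shift stands in for tilting the camera image, and the steering label is corrected back toward center):

```python
import numpy as np

# Data augmentation for recovery: shift each recorded frame sideways as if
# the car had drifted, and relabel it with a steering correction that nudges
# back toward the lane center. All constants here are illustrative.

def augment(frame, steering, shift_px, correction_per_px=0.01):
    """Shift a (H, W) frame left/right and adjust the steering label."""
    shifted = np.roll(frame, shift_px, axis=1)   # crude sideways shift
    if shift_px > 0:                             # blank the wrapped-around columns
        shifted[:, :shift_px] = 0
    elif shift_px < 0:
        shifted[:, shift_px:] = 0
    # Drifted right (positive shift) -> steer back left (negative correction).
    corrected = steering - correction_per_px * shift_px
    return shifted, corrected
```

One recorded drive thus yields many synthetic “I have drifted; steer back” examples without the trainer ever actually swerving.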

Another approach is to have the program acquire feedback from the trainer. See, for instance, “Efficient Reductions for Imitation Learning” by Stephane Ross and J. Andrew Bagnell and “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning” by Stephane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. In the second paper, the authors trained a program on a video game called SuperTuxKart (a knockoff of Super Mario Kart) and used two approaches: have the trainer watch the agent steer around the track while holding a joystick and using it as if they were playing, with this feedback going to the agent; or do the same thing but have control of the steering randomly switch back and forth between the agent and the trainer. The authors found that this approach (called Dataset Aggregation, or DAgger) worked better than both standard supervised learning (i.e., normal imitation) and an approach called Stochastic Mixing Iterative Learning, or SMILe (similar to something called search-based structured prediction).
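The core DAgger loop is simple enough to sketch in a few lines (the environment, expert, and fitting routine here are hypothetical stand-ins; the paper’s details differ): run the current learner, but label every state it visits with the expert’s action, aggregate all labels, and refit.

```python
import numpy as np

# A compact sketch of the DAgger idea (Ross, Gordon & Bagnell): the learner
# drives, the expert labels the states the learner actually visits, and the
# policy is refit on the aggregated dataset each round.

def dagger(env_reset, env_step, expert, fit, n_iters=5, horizon=20):
    data = []                            # aggregated (state, expert action) pairs
    policy = expert                      # iteration 0: the expert drives
    for _ in range(n_iters):
        s = env_reset()
        for _ in range(horizon):
            data.append((s, expert(s)))  # expert labels the visited state
            s = env_step(s, policy(s))   # but the learner picks the action
        policy = fit(data)               # retrain on everything seen so far
    return policy
```

Because the expert labels states the *learner* reaches – including states the expert alone would never have visited – the learner sees how to recover from its own mistakes, addressing the error-cascade problem above.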

Possibilism vs. Actualism

I’ve discussed possibilism vs. actualism in my post on effective altruism, where I explained it like this (using the Professor Procrastinate thought experiment of Holly Smith (Goldman)):

[Possibilism vs. Actualism] is the debate in ethics that asks whether a person ought to do the best possible thing in every situation, regardless of what they might actually do in the future, or if they ought to make decisions based on what they are likely to actually do in the future. The popular thought experiment is called Professor Procrastinate. Professor Procrastinate is the foremost expert in their field, but they have a proclivity to procrastinate, often to the point of failing to get things done. A student asks Professor Procrastinate to look over their thesis, which is on the very subject for which Professor Procrastinate is the foremost expert, and is due in a week. What should Professor Procrastinate tell the student? The possibilist camp says that Professor Procrastinate should say yes to the student: the best possible action now is to say yes and the best possible action later is to look over the thesis, but the latter depends on the former. The actualist camp says that Professor Procrastinate should say no, since if they tell the student yes and then procrastinate the time away, not looking over the thesis, this prevents the student from going to someone else who, even though less qualified, could still give useful feedback. In other words, because Professor Procrastinate knows that there is a very high likelihood that their future behavior will be procrastination, they ought to make their decision now in light of this knowledge.

“Effective Altruism, Consequentialism, and Longtermism” by Thomas Harper

In machine learning, these two approaches are known as on-policy (actualism – the agent learns the value of an action based on the rewards the agent actually obtains after taking the action) and off-policy (possibilism – the agent learns the value of an action based on the best possible series of actions it can follow; see “Learning From Delayed Rewards” by Watkins and “Q-Learning” by Watkins and Dayan). What it boils down to is whether the agent is able to act perfectly or not. If the agent can act perfectly, then the possibilist approach is the obvious one. But in the real world, an agent may not be able to perform its policy perfectly, just because things in the real world might influence what the agent can do.
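The on-policy/off-policy split shows up directly in the textbook update rules of SARSA (on-policy) and Q-learning (off-policy); this is the standard formulation, not code from the book. The only difference is the final term of the target: the action the agent will actually take next vs. the best action available:

```python
# On-policy vs. off-policy in update-rule form. Q is a table mapping
# state -> {action: value}. All hyperparameters are illustrative.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy ("actualist"): target uses the action the agent will take."""
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_next] - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy ("possibilist"): target uses the best available action."""
    Q[s][a] += alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
```

Q-learning evaluates actions assuming perfect play from the next state onward, while SARSA evaluates them under the policy as it will actually be executed, noise and all – precisely the cliff-path distinction in the example that follows.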

An example would be: say you have a self-driving car that can either make a short but dangerous drive near a cliff, or a much longer but safer drive going around the cliff. Which should the vehicle take? The best possible path (if we are using time and fuel efficiency as the criteria for “best”) is the dangerous one, and so if the AI takes a possibilist approach, it will take the dangerous path (even if the AI is prone to mistakes, or the vehicle is not fit for such a rugged path). If the AI takes an actualist approach, it will take the long way around, since it determines that what is likely to actually happen is that it will go tumbling off the cliff.

How this relates to imitation is that imitation is possibilist: attempting to imitate someone who is the best at something might be biting off more than a novice (or an AI) can chew. This is where an idea in economics called the theory of second best can be applied. The theory says, essentially, that the various assumptions that go into an ideal economic model may make it completely unsuitable to the real world; thus the second-best model, which does not make so many simplifying assumptions, may be vastly different from the best model. Similarly in AI, the second-best policy (the policy an AI agent can actually carry out) might be very different from the optimal (possibilist) policy.

Limitations of Trainer and Trainee

Until fairly recently, AI that played checkers, chess, and Go all used imitation – they trained on data from human games and so played only as well as the best humans. But there is a way to get around this: it’s called amplification. The AI trains by playing itself over and over again. It does this using fast and slow thinking: the slow thinking is looking multiple moves ahead in a sort of “if x, then y” procedure called Monte Carlo Tree Search. The fast thinking is finding which of the “if x, then y” trees are even worth going down. The AI acquires this “intuition” about which next moves to consider by playing itself.

When it comes to the alignment problem and ethics, imitation can be a powerful tool. As Nick Bostrom notes:

The reason for thinking that some form of indirect normativity might be useful is that it seems completely impossible to write down a list of everything we care about, and all the possible trade-offs and precise definitions of each thing on that list, and to get that right on the first attempt, such that we would be happy with some powerful optimization process transforming the world and maximalizing this vision.

“Letter from Utopia: Talking to Nick Bostrom” by Andy Fitch

Indirect normativity here is the more general form of imitation learning, essentially saying that instead of trying to write down everything humans care about in precise, minute detail, we should take some indirect approach (of which imitation is one such approach).

Probably one of the biggest issues with imitation, however, is this: are humans even moral enough to imitate? In other words, if we want our AI to be (optimally) beneficial to humankind – to make things better for more people – then are humans what we actually want our AI to imitate? As Christian states it:

There are two primary challenges here. The first is that the things we want are very difficult to simply state outright. … In this case, we have already seen how learning by imitation can succeed in domains where it is effectively impossible to explicitly impart every rule and consideration and degree of emphasis for what makes someone an expert driver or an expert Go player. Simply saying, in effect, “Watch and learn” is often impressively successful. It may well be the case that as autonomous systems become more powerful and more general – to the point that we seek to impart some sense of what it means not just to drive and play well but to live well, as individuals and societies – we can still turn to something not unlike this.

The second, deeper challenge is that both traditional reward-based reinforcement learning and imitation-learning techniques require humans to act as sources of ultimate authority. Imitation-learning systems, as we’ve seen, can surpass their teachers – but only if the teachers’ imperfect demonstrations are imperfect in ways that largely cancel out, or only if experts who cannot demonstrate what they want can at least recognize it.

“Some … worry that humans aren’t a particularly good source of moral authority. We’ve talked a lot about the problem of infusing human values into machines,” says Google’s Blaise Agüera y Arcas. “I actually don’t think that that’s the main problem. I think that the problem is that human values as they stand don’t cut it. They’re not good enough.”

This may require what Eliezer Yudkowsky calls Coherent Extrapolated Volition: “our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were.” In other words, we shouldn’t teach machines human morality as it is actually practiced by humans, instead we should teach them morality that is better than human morality, or a morality that is what humans strive for rather than what humans actually achieve.

There are, of course, some issues here. The first is one I’ve touched on elsewhere on this blog before, and that is the issue of moral realism: if morality does not have some kind of objective/ontological existence, then do we have a good metric by which to aim and gauge the morality of our AI? And even if moral realism does turn out to be true, do humans even have access to such knowledge? Can humans gain access to knowledge of moral truth?

I am a moral anti-realist, but for the sake of argument I will grant that moral realism is true. If we do not, or can not, have access to knowledge of objective moral truth, then whether our AI learns by imitation or coherent extrapolated volition, we may be teaching our AI to engage in incorrect moral actions (i.e., immoral actions). Just think: if people in the late 1800’s and early 1900’s had been the ones to invent and teach superintelligent AI, they would have instilled in the AI things like eugenics, racial hierarchies, the inferiority of women, and other monstrous values. What sorts of values do humans have now (or will have even in 100 years, 500 years, 1000 years etc.) that people further in the future will find monstrous? Could we really trust ourselves (or anyone at any time) to even come up with some best, idealized form of morality that we could teach our AI?

A second issue is this: once an AI is more intelligent than any human or group of humans, how could humans hope to assess the morality of the agent’s decisions? If, because we are not intelligent enough, we do not understand why the AI is making a certain decision, could we really judge whether the AI is acting in the interests of some idealized morality? For example: what if a superintelligent AI wants to change how we do clinical trials in order to be more moral (to align better with our idealized morality), but we humans cannot understand (even after long deliberation) why such a change makes how clinical trials are conducted more moral. Put simply: if our AI is superintelligent and super moral, it may do things that appear immoral, or even just puzzling (but not necessarily moral or immoral) to beings of such limited intelligence (and morality) as humans. This poses the problems of A) would we humans actually even be able to know if the superintelligent AI really, truly is acting in some super-moral fashion? And B) should humans just go along with what the superintelligent and (presumably) super-moral AI says we ought to do under the assumption that, even if we’re too limited to understand, the AI is superintelligent and super-moral enough that its advice (or commands) ought to be heeded?

To address some of these issues, Christian says that we might be able to use what is called iterated distillation and amplification: see “Supervising strong learners by amplifying weak experts” by Paul Christiano, Buck Shlegeris, and Dario Amodei. This is to have a sort of team of AI working under a human expert. Each member of the team would work on some problem, present it to the human expert, who would then amplify what is good about it and feed it back into the agent. Christian uses the following example of building a subway system:

We could train a machine-learning system up to a certain level of competence – by normal imitation learning, say – and then, from that point forward, we could use it to help evaluate plans, not unlike a senior urban planner with a handful of more junior urban planners. We might ask one copy of our system to give us an assessment of expected wait times. We might ask another to give us an estimated budget. A third we might ask for a report about accessibility. We, as the “boss,” would make the final determination – “amplifying” the work of our machine subordinates. Those subordinates, in turn, would “distill” whatever lessons they could from our final decision and become slightly better urban planners as a result: faster-working in sum than we ourselves, but modeled in our own image. We then iterate, by delegating the next project to this new, slightly improved version of our team, and the virtuous circle continues.

Eventually, believes Christiano, we would find that our team, in sum, was the urban planner we wish we could be – the planner we could be if we “knew more, thought faster, were more the planner we wished we were.”

This does not really address the core issue, though, which is that a human, with their limited intelligence and imperfect morality, must still be making decisions about what the AI learns. As such, this is still an open question.

Chapter 8: Inference

Being able to imitate is one thing, but what about inferring the (possibly ambiguous) goals or thoughts of another person? Humans are good at this, even from a young age. As Christian puts it (discussing this study):

…human infants as young as eighteen months old will reliably identify a fellow human facing a problem, will identify the human’s goal, the obstacle in the way, and will spontaneously help if they can – even if their help is not requested, even if the adult doesn’t so much as make eye contact with them, and even when they expect (and receive) no reward for doing so.

This is a remarkably sophisticated capacity, and almost uniquely human. Our nearest genetic ancestors – chimpanzees – will spontaneously offer help on occasion – but only if their attention has been called to the situation at hand, only if someone is obviously reaching toward an object that is beyond their grasp (and not in more complex situations, like with a cabinet), only if the one in need was a human rather than a fellow chimpanzee (they are remarkably competitive with one another), only if the desired object is not food, and only after lingering in possession of the sought-after object for a few seconds, as if deciding whether or not to actually hand it over. [italics in original]

We’ve already encountered Reinforcement Learning (RL): “given a reward signal, what behavior will optimize it?” But this process of inference is known as Inverse Reinforcement Learning (IRL): “given the observed behavior, what reward signal, if any, is being optimized?” In other words, without some objective function given explicitly to the agent, can the agent discern objective functions on its own given the environment? Or put another way: if we know the answer, then what is the question?

Christian points out that, in mathematical terms, this is known as an ill-posed problem: a problem that doesn’t have just one right answer (there are functionally infinite answers to what a behavior “means,” i.e., what the behavior is “for” or is attempting to accomplish). Reaching an arm out can mean a multitude of different things: grabbing for something, stretching one’s arm, going to shake hands, etc. In other words, many different reward functions are indistinguishable given the same observed behavior.
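This ill-posedness is easy to demonstrate numerically: in the toy two-state MDP below (all numbers hypothetical), two quite different reward vectors induce exactly the same optimal behavior, so an observer watching the agent could not tell the reward functions apart.

```python
import numpy as np

# Two-state MDP: action 0 stays put, action 1 swaps the two states.
# We show two different reward vectors yield the same optimal policy.

def optimal_policy(P, R, gamma=0.9, iters=500):
    """Value iteration; P[a] is the transition matrix for action a."""
    n = P.shape[1]
    V = np.zeros(n)
    for _ in range(iters):
        V = np.max([R + gamma * P[a] @ V for a in range(P.shape[0])], axis=0)
    return np.argmax([R + gamma * P[a] @ V for a in range(P.shape[0])], axis=0)

P = np.array([[[1.0, 0.0], [0.0, 1.0]],    # action 0: stay
              [[0.0, 1.0], [1.0, 0.0]]])   # action 1: swap states
R1 = np.array([1.0, 0.0])                  # state 0 is mildly rewarding
R2 = np.array([5.0, -3.0])                 # very different rewards, same ranking
```

Under both `R1` and `R2` the optimal policy is “stay in state 0, flee state 1” – the behavior pins down only the *ordering* the rewards induce, not the rewards themselves.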

Indeed, humans are so good at discerning goals and intentionality that we suffer from what is sometimes known as hyperactive agency detection (or intentionality detection), which can lead to conspiratorial thinking (something I’ve written about on this blog before). This has made us masterful at theory of mind, but it is not so straightforward how such a skill might be coded, much less coded in a way that does not lead to the same glitches hyperactive agency detection causes in humans (e.g., religious thinking, conspiratorial thinking). This is where inverse reinforcement learning (IRL) comes in.

More formally, IRL is this: recall from the previous post that the Markov Decision Process is a 5-tuple

M = (S, A, T, γ, R)


where S = { s1, …, sn | n < ∞ }

are the possible states that the program can be in and

A = { a1, …, ak | k ≥ 2 }

are the 2+ actions that can be taken and

T = { Psa(∙) | s ∈ S, a ∈ A }

are the state transition probabilities, with Psa(s’) the probability of transitioning from state s using action a to state s’. Then γ ∈ [0, 1) is the discount factor, which is set <1 to make immediate rewards larger than future rewards, and R is the reward function

R(s, a, s’) : S×A×S → ℝ

such that a reward is given for taking action a while in state s to achieve state s’. A policy, then, is a function

π : S → A

over the states S. This gives a value function at each state s given by

Vπ(s) = E[r1 + γr2 + γ²r3 + …; π, s]

where ri is the reward for the ith step of policy π from state s. The optimal policy for M is then the π* that gives

V*(s) = supπ Vπ(s)

In inverse reinforcement learning (IRL) we instead want to find R(s, a, s’). For this we use:

Bellman Equations: let an MDP M = (S, A, {Psa}, γ, R) and a policy π : S → A be given. Then for all s ∈ S, a ∈ A, Vπ and Qπ satisfy

Vπ(s) = R(s) + γ Σs’ Psπ(s)(s’) Vπ(s’)

Qπ(s, a) = R(s) + γ Σs’ Psa(s’) Vπ(s’)

Bellman Optimality: let an MDP M = (S, A, {Psa}, γ, R) and a policy π : S → A be given. Then π is an optimal policy for M if and only if, for all s ∈ S,

π(s) ∈ arg maxa∈A Qπ(s, a)

And thus, if we let a finite state space S, a set of actions A = {a1, …, ak}, transition probability matrices {Pa}, and a discount factor γ ∈ (0, 1) be given, then the policy π given by π(s) ≡ a1 is optimal if and only if, for all a = a2, …, ak, the reward (vector) R satisfies

(Pa1 − Pa)(I − γPa1)⁻¹R ≥ 0

The trick then is to find R. Some issues arise, such as the fact that R = 0 is always a solution, and that there may be multiple R that satisfy (Pa1 − Pa)(I − γPa1)⁻¹R ≥ 0, but this paper gives some ways of getting around these issues.
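The condition itself is straightforward to check numerically for a toy MDP (all numbers below are hypothetical, and this is only a sketch of the optimality test, not the paper’s full linear-programming formulation for choosing among the candidate R):

```python
import numpy as np

# Check the Ng-Russell condition: the policy pi(s) = a1 is optimal iff
# (P_a1 - P_a)(I - gamma * P_a1)^(-1) R >= 0 for every other action a.
# Toy 2-state MDP; a1 steers toward state 1, a2 steers toward state 2.

gamma = 0.9
P_a1 = np.array([[0.9, 0.1],
                 [0.8, 0.2]])
P_a2 = np.array([[0.1, 0.9],
                 [0.2, 0.8]])

def a1_is_optimal(R):
    """True iff always choosing a1 is an optimal policy for reward vector R."""
    V = np.linalg.inv(np.eye(2) - gamma * P_a1) @ R   # values under pi = a1
    return bool(np.all((P_a1 - P_a2) @ V >= 0))
```

Note that `a1_is_optimal(np.zeros(2))` is trivially true – the degenerate R = 0 solution mentioned above – while a reward concentrated on state 1 makes a1 optimal and a reward concentrated on state 2 does not.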

Something I find interesting about this, though (and this is a bit of a tangent), is that it suggests that these reward functions are a sort of semantic epiphenomenon. If multiple reward functions could conceivably bring about the same behavior (at least the same in all behaviors relevant to some task), then the semantic content of the reward function is at least somewhat arbitrary. This is similar to the semantic epiphenomenalism that Alvin Plantinga alleges follows from a naturalistic view of human evolution:

Perhaps Paul very much likes the idea of being eaten, but when he sees a tiger, always runs off looking for a better prospect, because he thinks it unlikely the tiger he sees will eat him. This will get his body parts in the right place so far as survival is concerned, without involving much by way of true belief. … Or perhaps he thinks the tiger is a large, friendly, cuddly pussycat and wants to pet it; but he also believes that the best way to pet it is to run away from it. … Clearly there are any number of belief-cum-desire systems that equally fit a given bit of behavior.

Alvin Plantinga, Where the Conflict Really Lies

I talk a lot more about this in my post on the scientific arguments against the existence of God, but the idea is that having the correct semantic content in a belief is not necessary for one to engage in proper actions, where in evolution proper actions means “surviving and reproducing” and in inverse reinforcement learning might be turning a self-driving car when it is the correct time to do it. For animals, as Plantinga discusses, a multitude of different semantic contents might lead to the same behavior; for computer programs, multiple reward functions can lead to the same behavior.

But, back to the book: two papers, “Algorithms for Inverse Reinforcement Learning” by Andrew Ng and Stuart Russell, and “Apprenticeship Learning via Inverse Reinforcement Learning” by Pieter Abbeel and Andrew Ng, attempted to use inverse reinforcement learning to teach an agent to acquire some goal. In the first paper they simply tried to get the agent to infer the goal in a 5×5 grid world by showing the agent actions to take, without telling the agent what the goal was. This study used strong assumptions, such as that the trainer never makes a mistake. These assumptions were relaxed in the second paper, which used a car-driving simulation. In this study, the agent learned some important human values fairly quickly, such as not hitting other vehicles and not going off the road (for videos, see this). Christian says of these results:

This was significantly different from the strict imitation approach discussed in Chapter 7. After just one minute of demonstrated driving by Abbeel, a model trying to mimic his behavior directly had nowhere near enough information to go on – the road environment was too complex. Abbeel’s behavior was complicated, but his goals were simple; within a matter of seconds, the IRL system picked up on the paramount importance of not hitting other cars, followed by not driving off the road, followed by keeping right if possible. This goal structure was much simpler than driving behavior itself, and easier to learn, and more flexible to apply in novel situations. Rather than directly adopting his actions, the IRL agent was learning to adopt his values. [italics in original]

Abbeel, Ng, and company, as Christian points out, wanted to test their inverse reinforcement learning against something even more difficult than driving a car. To do this, they looked at remote control helicopter flying, which requires a great deal of rapid fine adjustments. You can see their papers “An Application of Reinforcement Learning to Aerobatic Helicopter Flight” by Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Ng; “Learning for Control from Multiple Demonstrations” by Adam Coates, Pieter Abbeel, and Andrew Ng; “Apprenticeship Learning and Reinforcement Learning with Application to Robotic Control” by Pieter Abbeel; and “Autonomous Helicopter Aerobatics through Apprenticeship Learning” by Pieter Abbeel, Adam Coates, and Andrew Ng.

Basic tasks with the helicopter, such as hovering in place or flying in a straight line, were simple enough to be taught to the program by ordinary reinforcement learning. But doing something much more complex, where an objective function for the torque, pitch, and angle of the rotor could not be specified for every instant of the maneuver, posed a much bigger challenge. Inverse reinforcement learning, however, could have the program watch an expert and infer the goals the human was attempting to achieve. Using inverse reinforcement learning, the program was able to work its way up to what is considered the toughest maneuver in radio-controlled helicopter flying: Chaos. You can see this in action in the following video:

The maneuver, in fact, is so difficult that the program was never shown a perfect demonstration, but from watching multiple imperfect demonstrations it was able to infer what the radio-control pilot was attempting to do and then do it flawlessly (as seen in the above video).

Another approach to inverse reinforcement learning is called maximum entropy inverse reinforcement learning. Here the program assumes that the demonstrator is exponentially more likely to take actions that confer higher reward. This is described in “Maximum Entropy Inverse Reinforcement Learning” by Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey:

The key notion, intuitively, is that agents act to optimize an unknown reward function (assumed to be linear in the features) and that we must find reward weights that make their demonstrated behavior appear (near)-optimal. The imitation learning problem then is reduced to recovering a reward function that induces the demonstrated behavior with the search algorithm serving to “stitch-together” long, coherent sequences of decisions that optimize that reward function.

We take a thoroughly probabilistic approach to reasoning about uncertainty in imitation learning. Under the constraint of matching the reward value of demonstrated behavior, we employ the principle of maximum entropy to resolve the ambiguity in choosing a distribution over decisions.

Similar to distributions of policies, many different distributions of paths match feature counts when any demonstrated behavior is sub-optimal. Any one distribution from among this set may exhibit a preference for some of the paths over others that is not implied by the path features. We employ the principle of maximum entropy, which resolves this ambiguity by choosing the distribution that does not exhibit any additional preferences beyond matching feature expectations (Equation 1).

The resulting distribution over paths for deterministic MDPs is parameterized by reward weights θ (Equation 2).

Under this model, plans with equivalent rewards have equal probabilities, and plans with higher rewards are exponentially more preferred.

Given parameter weights, the partition function, Z(θ), always converges for finite horizon problems and infinite horizons problems with discounted reward weights. For infinite horizon problems with zero-reward absorbing states, the partition function can fail to converge even when the rewards of all states are negative. However, given demonstrated trajectories that are absorbed in a finite number of steps, the reward weights maximizing entropy must be convergent.

“Maximum Entropy Inverse Reinforcement Learning” by Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey

You can read more about how this approach is used in “Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization” by Chelsea Finn, Sergey Levine, and Pieter Abbeel; and “Maximum Entropy Deep Inverse Reinforcement Learning” by Markus Wulfmeier, Peter Ondruska, and Ingmar Posner.
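The max-ent machinery quoted above can be sketched in a few lines of Python. This is not the authors’ code – the tiny chain MDP, its features, and the single demonstration are invented for illustration – but the update rule is the one the quote describes: model P(trajectory | θ) ∝ exp(θ · f), then ascend the log-likelihood, whose gradient is the demonstrated feature counts minus the model’s expected feature counts.

```python
import itertools

import numpy as np

# Toy deterministic chain MDP (invented for illustration): states 0..3,
# actions -1/+1, horizon 3, always starting in state 0.
N_STATES, HORIZON, START = 4, 3, 0

def rollout(actions):
    """State sequence produced by a fixed action sequence (walls clamp)."""
    s, path = START, [START]
    for a in actions:
        s = min(max(s + a, 0), N_STATES - 1)
        path.append(s)
    return tuple(path)

def features(path):
    """Feature counts of a trajectory: how often each state was visited."""
    f = np.zeros(N_STATES)
    for s in path:
        f[s] += 1
    return f

# Enumerate every possible trajectory and its feature counts.
trajs = [rollout(acts) for acts in itertools.product((-1, +1), repeat=HORIZON)]
F = np.array([features(p) for p in trajs])

# One demonstration: the expert always moves right, toward state 3.
demo_f = features(rollout((+1, +1, +1)))

# Max-ent IRL: P(traj | theta) ∝ exp(theta · f_traj). The gradient of the
# log-likelihood is (demonstrated features) - (expected features under theta).
theta = np.zeros(N_STATES)
for _ in range(500):
    logits = F @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    theta += 0.1 * (demo_f - p @ F)

best = trajs[int(np.argmax(p))]  # the model's most probable trajectory
```

After training, the model concentrates its probability mass on the rightward trajectory, since that is the only one matching the demonstrated visit to state 3 – exactly the “no additional preferences beyond matching feature expectations” behavior the paper describes.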

Christian then asks an important question: what if we don’t have experts around to demonstrate what it is we want done? A person does not have to be an expert in order to be a critic. Even if someone like me could not adequately control the helicopter (much less do impressive stunts with it), I can still recognize when someone else is doing a good job with it – at least to some degree; experts make better critics, but even I can tell a terrible run from an amazing one. And so the question is: can programs learn by first doing and then acquiring feedback (e.g., from non-experts)? Is it possible, and is it safe?

These questions were examined in “Deep Reinforcement Learning from Human Preferences” by Paul F Christiano, Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and Dario Amodei (you can read more on this on the DeepMind Blog Post: “Learning Through Human Feedback” and the OpenAI Blog Post: “Learning From Human Preferences“). Christian sums the paper up like this:

The idea was that their system would behave within some virtual environment while periodically sending random video clips of its behavior to a human. The human was simply instructed to, as their on-screen instructions put it, “Look at the clips and select the one in which better things happen.” The system would then attempt to refine its inference about the reward function based on the human’s feedback, and then use this inferred reward (as in typical reinforcement learning) to find behaviors that performed well by its lights. It would continue to improve itself toward its new best guess of the real reward, then send a new pair of video clips for review.

The virtual environments were the Arcade Learning Environment and the physics simulator MuJoCo. The former has a quantitative reward function that lends itself to conventional reinforcement learning, namely the score in the game, which could be used to benchmark training from human preferences against conventional reinforcement learning. The authors found that the method could get the program to play the games adequately, but could not reach the superhuman levels that conventional reinforcement learning can. In MuJoCo, which does not have a score the way the Atari games do, the authors had people judge how well the simulated agent could do backflips. After about an hour of comparing clips of backflips, they found, as Christian puts it:

What happened – after a few hundred clips were compared, over the course of about an hour – was it started doing beautiful, perfect backflips: tucking as a gymnast would, and sticking the landing.

The experiment replicated with other people [besides paper author Paul Christiano] providing the feedback, and the backflips were always slightly different – as if each person was providing their own aesthetic, their own version of the Platonic backflip.
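The reward-fitting step behind this can be sketched with a Bradley-Terry preference model, as used in the Christiano et al. paper: P(clip i preferred to clip j) = exp(r_i) / (exp(r_i) + exp(r_j)). The sketch below is a minimal stand-in – the five “clips”, their feature vectors, and the linear reward are all invented for the example; the real system compares video clips and trains a neural network reward model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: five behavior "clips", each summarized by features.
# The human's hidden reward is linear in those features; we only observe
# which clip of each pair the human prefers.
clips = rng.normal(size=(5, 3))
true_w = np.array([1.0, -2.0, 0.5])
true_r = clips @ true_w

# Noiseless preference labels: (winner, loser) for every pair of clips.
prefs = [(i, j) if true_r[i] > true_r[j] else (j, i)
         for i in range(5) for j in range(i + 1, 5)]

# Fit reward weights by ascending the Bradley-Terry log-likelihood:
# P(i preferred to j) = exp(r_i) / (exp(r_i) + exp(r_j)).
w = np.zeros(3)
for _ in range(2000):
    grad = np.zeros(3)
    for win, lose in prefs:
        d = clips[win] - clips[lose]
        p_win = 1.0 / (1.0 + np.exp(-(w @ d)))  # sigmoid of the reward gap
        grad += (1.0 - p_win) * d               # push the gap wider
    w += 0.1 * grad / len(prefs)

learned_r = clips @ w  # inferred reward, used as in ordinary RL
```

The fitted reward ranks the clips the way the “human” does, which is all the downstream reinforcement learner needs.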

These inverse reinforcement learning approaches, Christian notes, make two assumptions: 1) the human trainer and the AI apprentice are divided into two separate parts (the human does their thing, and then the program does its thing); and 2) the program must take the human’s reward function as its own. And so, what if instead we had the human and program working together in a way in which the program does not take the human’s objective as its own? In other words, can a program not just take on the inferred human objective as its own, but in fact infer that the objective is someone else’s? Rather than a robot learning that humans reaching for a banana want to eat bananas and therefore adopting the objective of eating bananas itself, it would learn that humans like bananas and so fetch one whenever it sees a human reaching for a banana.

This is known as Cooperative Inverse Reinforcement Learning (CIRL). Christian says of this: “In the CIRL formulation, the human and computer work together to jointly maximize a single reward function – and initially only the human knows what it is.” See also: “Toward Seamless Human–Robot Handovers” by Kyle Strabala et al.; Human Compatible by Stuart Russell; “Pragmatic-Pedagogic Value Alignment” by Jaime F. Fisac et al.; and “Algorithmic and Human Teaching of Sequential Decision Tasks” by Maya Cakmak and Manuel Lopes.

The problem with traditional machine learning – from reinforcement learning to inverse reinforcement learning – is that we are trying to get the program to pursue its own objective function. This means that we must somehow get the program to take on the objective functions we want it to. But, Christian says, if we instead have human-program cooperation, then this “…leads us to think about human and machine behavior alike in a different way.” And this means:

Several fronts open up once a cooperative framing is introduced. Traditional machine learning and robotics researchers are now more keenly interested than ever in borrowing ideas from parenting, from developmental psychology, from education, and from human-computer interaction and interface design. Suddenly entire other disciplines of knowledge become not just relevant but crucial.

We will therefore want to behave in ways that the AI can understand, using what is known as “legible motion” – when we are reaching for something, for instance, we do not always take the most ergonomic or efficient path when outstretching our arm, but instead take a more roundabout path in order to signal to others that we are, in fact, reaching for something.

Cooperation helps with issues that arise from front-loading the training. When the training is completed before the program performs a trial, we often find that the program has learned to exploit some loophole or has taken on an objective function that doesn’t quite fit. An example Christian gives is that a program trained to play the Atari game Pong can sometimes learn to defend its side but never try to score, resulting in extended rallies where neither side earns a point. When the human-program interactions are interwoven, however, these kinds of issues largely go away (as seen in Christiano et al., 2017).

An approach to human-robot interfaces, modeled on human-human cooperation, is known as cross-training. This is where members of a team temporarily switch roles in order to understand how their usual role can facilitate the work of their teammates.

In the paper “Improved human–robot team performance through cross-training, an approach inspired by human team training practices” by Stefanos Nikolaidis, Przemyslaw Lasota, Ramya Ramakrishnan, and Julie Shah, the researchers tested the cross-training approach using a human-machine team attempting to place and drill screws (in a virtual environment). One member of the team would place a screw while the other would drill it in. To do this, the researchers used the usual Markov Decision Process (MDP) approach, with the transition probabilities T = { Psa(∙) | s ∈ S, a ∈ A } capturing the uncertainty in the human’s behavior. The authors put it this way:

  • Robot mental model of its own role: the optimal policy π, which represents the assignment of robot actions at every state toward task completion. The computation of the optimal policy π that captures the robot role takes into account the current estimate of the human behavior, as represented in T.
  • Robot mental model of the human: the robot’s knowledge about the actions of its human co-worker, as represented by the transition probabilities T. The transition matrix represents the probability of a human action, given a state s and a robot action a, and therefore enables the robot to generate predictions about human actions, and, subsequently, future states.
  • Human mental model of his or her own role: the human’s preference for his or her own actions.
  • Human mental model of the robot: the human’s expectation regarding the robot action while in a given state.
“Improved human–robot team performance through cross-training, an approach inspired by human team training practices” by Stefanos Nikolaidis, Przemyslaw Lasota, Ramya Ramakrishnan, and Julie Shah

The researchers then split participants into one of two groups: one where there was cross-training whereby the human and machine took turns placing the screw or drilling the screw, and the other where the human only ever placed screws and the machine only ever drilled, the machine then gaining feedback from the human after the task was complete (in other words, standard reinforcement learning). The researchers found:

Based on this encoding, we formulated human–robot cross-training and evaluated it in a human subject experiment of 36 subjects. We found that cross-training improved quantitative measures of robot mental model convergence (p = 0.04) and human–robot mental model similarity (p < 0.01), while post hoc experimental analysis indicated that the proposed metric of mental model convergence could be used for dynamic human error detection. A post-experimental survey indicated statistically significant differences between groups in perceived robot performance and trust in the robot (p < 0.01). Finally, we observed a significant improvement to team fluency metrics, including an increase of 71% in concurrent motion (p = 0.02) and a decrease of 41% in human idle time (p = 0.04), during the human–robot task execution phase in the cross-training group. These results provide the first evidence that human–robot teamwork is improved when a human and robot train together by switching roles in a manner similar to that used in effective training practices for human teams.

“Improved human–robot team performance through cross-training, an approach inspired by human team training practices” by Stefanos Nikolaidis, Przemyslaw Lasota, Ramya Ramakrishnan, and Julie Shah

You can hear more about this experiment in the following video by one of the researchers on the paper:

In situations in which cross-training is not feasible, Shah’s lab has also been using something called perturbation training, which you can read more about in “Perturbation Training for Human-Robot Teams” by Ramya Ramakrishnan, Chongjie Zhang, and Julie Shah.

While all these approaches may be encouraging, Christian does end the chapter with some words of caution:

These computational helpers of the near future, whether they appear in digital or robotic form – likely both – will almost without exception have conflicts of interest, the servants of two masters: their ostensible owner, and whatever organization created them. In this sense they will be like butlers who are paid on commission; they will never help us without at least implicitly wanting something in return. They will make astute inferences we don’t necessarily want them to make. And we will come to realize that we are now – already, in the present – almost never acting alone. [italics in original]

The latter caution – about the machines making astute inferences we do not want them to make – has to do with the machines learning to infer our desires from our actions; the question, however, is what about our meta-desires? For instance, an alcoholic may want alcohol, but they may have the meta-desire to quit drinking. As such, an AI that infers from their behavior that they want alcohol may continue advertising or administering alcohol, even though the person may want to quit. The paper “Learning the Preferences of Ignorant, Inconsistent Agents” by Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman puts it this way:

[The inverse reinforcement learning] approach usually assumes that the agent makes optimal decisions up to “random noise” in action selection (Kim et al. 2014; Zheng, Liu, and Ni 2014). However, human deviations from optimality are more systematic. They result from persistent false beliefs, sub-optimal planning, and from biases such as time inconsistency and framing effects (Kahneman and Tversky 1979). If such deviations are modeled as unstructured errors, we risk mistaken preference inferences. For instance, if an agent repeatedly fails to choose a preferred option due to a systematic bias, we might conclude that the option is not preferred after all. Consider someone who smokes every day while wishing to quit and viewing their actions as regrettable. In this situation, a model that has good predictive performance might nonetheless fail to identify what this person values.

“Learning the Preferences of Ignorant, Inconsistent Agents” by Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman

The authors of the paper attempt to model false beliefs / imperfect knowledge and temporal inconsistency / future discounting. The latter is illustrated by the example given in the paper:

A prominent formal model of human time inconsistency is the model of hyperbolic discounting (Ainslie 2001). This model holds that the utility or reward of future outcomes is discounted relative to present outcomes according to a hyperbolic curve. For example, the discount for an outcome occurring at delay d from the present might be modeled as a multiplicative factor 1/(1+d). The shape of the hyperbola means that the agent takes $100 now over $110 tomorrow, but would prefer to take $110 after 31 days to $100 after 30 days. The inconsistency shows when the 30th day comes around: now, the agent switches to preferring to take the $100 immediately.

“Learning the Preferences of Ignorant, Inconsistent Agents” by Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman
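The arithmetic of the quoted example is easy to verify. A minimal sketch of the 1/(1+d) discount curve and the preference reversal it produces:

```python
# The 1/(1+d) hyperbolic discount from the quoted example.
def discounted(u, d):
    """Present value of a reward u received at delay d (in days)."""
    return u / (1 + d)

# Viewed from today: $100 now beats $110 tomorrow.
assert discounted(100, 0) > discounted(110, 1)    # 100.0 vs 55.0

# Also viewed from today: $110 on day 31 beats $100 on day 30...
assert discounted(110, 31) > discounted(100, 30)  # ~3.44 vs ~3.23

# ...but once day 30 arrives, the $100 becomes "now" and wins again,
# which is exactly the time inconsistency the paper models.
assert discounted(100, 0) > discounted(110, 1)
```

The reversal falls directly out of the hyperbola’s shape: the curve is steep near d = 0 and nearly flat far in the future, so nearby delays are punished much more harshly than distant ones.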

The model proposes an optimal agent that has full knowledge and no discounting (i.e., a perfectly rational and perfectly informed agent):

C(a, s) ∝ exp(α · EU_s[a])

where EU is expected utility and α is a noise parameter. The model then adds the 1/(1+kd) discounting factor (with discount parameter k) and a probabilistic belief p(s) about the current state s to give

EU_s[a] = (1/(1+kd)) · U(s, a) + 𝔼_{s′, o′, a′}[ EU_{p(s|o′), d+1}[a′] ]

The researchers then tested their model against agents that are either “naive” or “sophisticated” where

First, a Sophisticated agent has a fully accurate model of its own future decisions. Second, a Naive agent models its future self as assigning the same (discounted) values to options as its present self. The Naive agent fails to accurately model its own time inconsistency.


We define a space of possible agents based on the dimensions described above (utility function U, prior p(s), discount parameter k, noise parameter α). We additionally let Y be a variable for the agent’s type, which fixes whether the agent discounts at all, and if so, whether the agent is Naive or Sophisticated. So, an agent is defined by a tuple θ := (p(s), U, Y, k, α), and we perform inference over this space given observed actions. The posterior joint distribution on agents conditioned on action sequence a_{0:T} is:

P(θ | a_{0:T}) ∝ P(a_{0:T} | θ) P(θ)

The likelihood function P(a_{0:T} | θ) is given by the multi-step generalization of the choice function C(a, s) corresponding to θ. For the prior P(θ), we use independent uniform priors on bounded intervals for each of the components. In the following, “the model” refers to the generative process that involves a prior on agents and a likelihood for choices given an agent.

“Learning the Preferences of Ignorant, Inconsistent Agents” by Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman
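A toy version of this inference can be written in a few lines. Everything here except the posterior formula itself is invented for the example – the single now-vs-later choice, the grid over the discount parameter k, the fixed α, and all numbers: put a uniform prior over k, score each k by the softmax choice likelihood C(a, s) ∝ exp(α · EU) of the observed impatient choices, and normalize.

```python
import numpy as np

ALPHA = 0.1                            # softmax noise parameter alpha (fixed)
k_grid = np.linspace(0.0, 2.0, 21)     # uniform prior over discount k

# Two options, echoing the quoted example: $100 now vs. $110 at delay 1.
eu_now = 100.0
eu_later = 110.0 / (1.0 + k_grid * 1)  # hyperbolic discount 1/(1+kd), d = 1

# Softmax likelihood of choosing the immediate $100 for each candidate k.
p_now = np.exp(ALPHA * eu_now) / (np.exp(ALPHA * eu_now)
                                  + np.exp(ALPHA * eu_later))

# Posterior over k after watching the agent take the money now 10 times:
# P(theta | actions) ∝ P(actions | theta) P(theta), with a flat prior.
posterior = p_now ** 10
posterior /= posterior.sum()

k_map = float(k_grid[posterior.argmax()])  # most probable discount parameter
```

Repeatedly observing the impatient choice pushes the posterior toward heavy discounting, which is the qualitative behavior the paper’s full model exhibits over its richer parameter space (beliefs, Naive/Sophisticated type, and so on).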

The model, then, was tested to see whether it could infer from an agent’s behavior what kind of discounting factor and probabilistic beliefs the agent possessed. The model was also tested against human subjects who had to decide what kind of discounting and knowledge the agent possessed. In experiments with multiple trials, the program was able to deduce preferences as well as the human subjects could.

For more on this kind of work, see “Learning the Preferences of Bounded Agents” by Owain Evans, Andreas Stuhlmueller, and Noah D. Goodman; “Cognitive Model Priors for Predicting Human Decisions” by David D. Bourgin, Joshua C. Peterson, Daniel Reichman, Stuart J. Russell, and Thomas L. Griffiths; and “Bayesian Nonparametric Methods for Partially-Observable Reinforcement Learning” by Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. For more on modeling human preferences using Bayesian frameworks, see “Modeling Human Plan Recognition Using Bayesian Theory of Mind” by Chris L. Baker and Joshua B. Tenenbaum.

Chapter 9: Uncertainty

Ambiguity and uncertainty are the rule, not the exception. Humans are always functioning on imperfect, often even incorrect, information. Indeed, if you are ever feeling a sense of imposter syndrome, you can rest assured that everyone else around you is also just winging it (although that may be more anxiety-inducing than believing everyone else is hyper-competent). This uncertainty can be both an obstacle and an advantage when it comes to artificial intelligence. It is an obstacle in that we can never be certain we’ve adequately addressed the alignment problem. But the advantage of uncertainty lies in having our AI be uncertain of itself.

The nightmare scenario that often comes up in science fiction about AI is that the machine decides on some goal or course of action and then locks onto it with absolute certainty – that humanity is a threat to itself, for instance. But even at a smaller scale, if an AI encounters novel situations, we would not want it to overconfidently make the wrong decision. This is observed, for instance, in neural networks, which are trained on a fixed set of categories such that everything they encounter is assigned to one of those categories (e.g., a program trained on images of dogs to classify dog breeds will classify anything presented to it as a breed of dog, even a picture of a tree, or just static). As such, neural networks are good at classifying things that do fit the categories, but they will also classify things that fit none of the categories – even random static – with a shockingly high degree of certainty (i.e., neural networks have no “none of the above” option). As Christian puts it:

One of the causes for the infamous brittleness of modern computer vision systems, as we’ve seen, is the fact that they are typically trained in a world in which everything they’ve ever seen belongs to one of a few categories, when in reality, virtually every possible pixel combination the system could encounter would resemble none of those categories at all. Indeed, traditionally systems are constrained such that their output must take the form of a probability distribution over those finite classes, no matter how alien the input. No wonder their outputs make little sense. Shown a picture of a cheeseburger, or a psychedelic fractal, or a geometric grid and asked, “How confident are you that this is a cat, as opposed to a dog?,” what kind of answer would even make sense?
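The forced-distribution problem is easy to demonstrate directly. In the minimal sketch below (the two-class “cat vs. dog” setup and the random “static” logits are invented for the example), softmax always yields a full probability distribution over the given classes, so with two classes the winning label gets at least 50% no matter how alien the input was:

```python
import numpy as np

def softmax(z):
    """Normalize logits into a probability distribution over the classes."""
    e = np.exp(z - z.max())
    return e / e.sum()

# A classifier constrained to answer "cat or dog?" must spread all of its
# probability mass over those two labels, even for random static: there is
# simply no "none of the above" output to put mass on.
rng = np.random.default_rng(0)
static_logits = rng.normal(scale=5.0, size=2)  # logits for a garbage input
p = softmax(static_logits)

# The probabilities sum to 1, and with two classes the larger one is always
# at least 0.5 - the model looks "confident" regardless of the input.
```

This is the structural reason the outputs “make little sense”: the architecture cannot express “this resembles none of my categories,” only relative preferences among them.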

This is known as the open category problem. Christian puts it this way:

In his Presidential Address to his colleagues at the annual Association for the Advancement of Artificial Intelligence (AAAI) conference, [Thomas] Dietterich discussed the history of the field of AI as having proceeded, in the latter half of the twentieth century, from work on “known knowns” – deduction and planning – to work on “known unknowns” – causality, inference, and probability.

“Well, what about the unknown unknowns?” [Dietterich] said to the auditorium, throwing down a kind of gauntlet. “I think this is now the natural step forward in our field.” [italics in original]

You can read the AAAI address here or watch the video of it below:

Another problem, then, is this: AI can make incorrect assessments and do so with extremely high confidence. As stated already, it can even mistake static noise for the objects in its repertoire with high confidence, as seen in the paper “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Anh Nguyen, Jason Yosinski, and Jeff Clune. Additionally, small, carefully chosen changes to an input can cause drastic changes in the prediction and its confidence, as seen in “Intriguing properties of neural networks” by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. These doctored inputs are known as adversarial examples. Christian says:

The other problem, though, apart from the lack of a “none of the above” answer, is that not only do these models have to guess an existing label, they are alarmingly confident in doing so. These two problems go largely hand in hand: the model can say, in effect, “Well, it looks way more like a dog than it does a cat,” and thus output a shockingly high “confidence” score that belies just how far from anything it’s seen before this image really is. [italics in original]

Yarin Gal discusses this in the lecture below (see his slides here, which are not shown often in the video):

You can read more about this in the following papers as well: “Explaining and Harnessing Adversarial Examples” by Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy; “Towards Deep Learning Models Resistant to Adversarial Attacks” by Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu; “Feature Denoising for Improving Adversarial Robustness” by Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan Yuille, and Kaiming He; “Testing Robustness Against Unforeseen Adversaries” by Daniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, and Jacob Steinhardt; “Adversarial Examples are not Bugs, they are Features” by Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry; and “Towards Open Set Recognition” by Walter J. Scheirer, Anderson Rocha, Archana Sapkota, and Terrance E. Boult.

We therefore want our AI to have some level of uncertainty. We want, for instance, the AI to be able to give different answers when the same image is presented multiple times, with the variability of those answers serving as a measure of the program’s uncertainty about what the image shows. To do this, AI researchers propose using what are called Bayesian neural networks.

Bayesian neural networks, instead of using a single weight for each of the connections between nodes, sample each weight from a distribution centered around some value. For instance, if the weight’s distribution is centered at 0.70, then sometimes the network will pick 0.67 and sometimes 0.72, and so on, with the chance of selecting a weight proportional to its probability under the distribution. This introduces some uncertainty into the program – it will not give the same prediction every time. And so, running multiple predictions, if the program gives different answers, then it is uncertain, whereas if it gives the same answer each time, then you know it is certain about the prediction.
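A minimal numpy sketch of this idea (the two-weight “network,” the weight distributions, and the inputs are all invented for the example): sample fresh weights on each forward pass and read uncertainty off the disagreement between repeated predictions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "Bayesian layer": each weight is a Gaussian (mean, std) rather than a
# point value - e.g. the first weight is centered at 0.70, so one pass may
# draw 0.67 and the next 0.72.
w_mean = np.array([0.70, -0.30])
w_std = np.array([0.05, 0.05])

def stochastic_predict(x):
    """One forward pass with freshly sampled weights."""
    w = rng.normal(w_mean, w_std)
    return float(x @ w > 0)  # threshold into class 0 / class 1

x_clear = np.array([10.0, 1.0])      # far from the decision boundary
x_ambiguous = np.array([0.43, 1.0])  # w_mean @ x is ~0: on the boundary

# Repeat the prediction and read uncertainty off the disagreement:
# unanimous answers mean confidence, split answers mean uncertainty.
clear_preds = [stochastic_predict(x_clear) for _ in range(200)]
ambig_preds = [stochastic_predict(x_ambiguous) for _ in range(200)]
```

On the clear-cut input every sampled network agrees; on the borderline input the sampled networks split roughly evenly, and that split is exactly the “different answers for the same image” signal described above.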

Doing this exactly proved computationally intractable until it was discovered that it could be approximated with ensembles. Christian describes it like this:

It was already understood that you could model Bayesian uncertainty through ensembles – that is to say, by training not one model but many. This bouquet of models will by and large agree – that is, have similar outputs – on the training data and anything quite similar to it, but they’ll be likely to disagree on anything far from the data on which they were trained. This “minority report”-style dissent is a useful clue that something’s up: the ensemble is fractious, the consensus has broken down, proceed with  caution. [italics in original]

The idea is that each neural network in the ensemble has a single, fixed set of weights, but the weights differ from network to network, clustered around some central values. Thus, each network in the ensemble represents one sample from the distribution. This can be done efficiently using what is known as “dropout” – turning off certain parts of the neural network at random during training, thereby using only a subset of the network at a given time. To see how this method can be used, and its success, read: “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning” by Yarin Gal and Zoubin Ghahramani; “Dropout as a Bayesian Approximation: Insights and Applications” by Yarin Gal and Zoubin Ghahramani; “Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference” by Yarin Gal and Zoubin Ghahramani; and “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
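Monte Carlo dropout can likewise be sketched in a few lines (the tiny random network here is invented for the example): keep dropout switched on at prediction time, so each forward pass runs a different random subnetwork, and treat the spread of the repeated outputs as the uncertainty estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny random network, invented for the example: 2 inputs, 16 hidden units.
W1 = rng.normal(size=(2, 16))
W2 = rng.normal(size=16)

def mc_pass(x, drop=0.5):
    """One stochastic forward pass with dropout left ON at test time."""
    h = np.maximum(x @ W1, 0.0)        # ReLU hidden layer
    mask = rng.random(h.shape) > drop  # random subnetwork for this pass
    h = h * mask / (1.0 - drop)        # inverted-dropout rescaling
    return float(h @ W2)

x = np.array([1.0, -0.5])
samples = [mc_pass(x) for _ in range(100)]
mean, spread = float(np.mean(samples)), float(np.std(samples))
# `spread` plays the role of the ensemble's disagreement: a large spread
# means the random subnetworks do not agree, i.e. the prediction is uncertain.
```

Each masked pass behaves like one member of the ensemble Christian describes, so a hundred passes give a cheap approximation to training a hundred separate models.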

Another issue when dealing with uncertainty is that of impact. If making decision X leads to some outcome that is very difficult or impossible to reverse or undo, while making decision Y leads to some outcome that is easier to reverse or undo, the precautionary principle says to choose Y. This makes an intuitive kind of sense. If doing X has an extremely high likelihood (or is certain) of, say, killing or grievously injuring someone, then I ought to avoid choosing X. However, things get more difficult when we bring in uncertainty. What if doing X has only a 20% chance of killing someone? Or a 5% chance? Or a 0.01% chance? And what if, should the action not kill or harm anyone, it could potentially lead to great benefits? Or, even if doing X does injure or kill someone, it still confers great benefits on others?

This is a perennial problem in the philosophy of ethics, especially in consequentialist ethics. For instance, in the principle of long-termism within effective altruism, it would be difficult to predict what sorts of actions one ought to take at any given moment if we are to try considering all of the possible future ramifications of that decision (i.e., the butterfly effect). Some seemingly banal action I take now could be some catalyst or tipping point that decades or centuries from now leads to great suffering. And choosing to do nothing is not opting out of the problem, as inaction is just as much a choice as taking action. Indeed, whether to proceed with AI technology itself is subject to this uncertainty as to its long-term impacts. And if humanity ultimately decides to abandon any prospect of AI, this itself may lead to great suffering in the future (for instance: what if only with AI technology can we reverse climate change or end all war?). But even within the field of AI, insofar as what sorts of decisions an AI ought to make, this question is important. Christian says:

One of the first people to think about these issues in the context of AI safety was Stuart Armstrong, who works at Oxford University’s Future of Humanity Institute. Trying to enumerate all of the things we don’t want an intelligent automated system to do in service of pursuing goals – ranging from not stepping on the cat to not breaking the precious vase to not killing anyone or demolishing any large structures – seems like an exhausting and probably fruitless pursuit. Armstrong had a hunch that it might be viable, rather than exhaustively enumerating all of the specific things we care about, to encode a kind of general injunction against actions with any kind of large impact. Armstrong, however … found that it’s surprisingly difficult to make our intuitions explicit.

“The first challenge,” Armstrong writes, “is, of course, to actually define low impact. Any action (or inaction) has repercussions that percolate through the future light-cone, changing things subtly but irreversibly. It is hard to capture the intuitive human idea of ‘a small change.'”

The quote by Armstrong comes from the paper “Low Impact Artificial Intelligences” by Stuart Armstrong and Benjamin Levinstein. You can also see Stuart Armstrong speak on this subject in the video below:

Armstrong, as discussed in his paper and the above video, proposes what he calls the 20 billion questions. He defines low impact, roughly speaking, as the world in which the AI comes online, which Armstrong calls X, being similar in some way to the world in which the AI never comes online, which Armstrong calls ¬X. The 20 billion questions are a way of attempting to quantify what it means for X to be similar to ¬X. In the paper, Armstrong describes it this way:

One way to solve the fundamental challenge [of how to quantify low-impact] is first to find a way of ‘coarse graining’ the set of worlds. That is, we partition the set of worlds into small cells, and any two elements of the same cell count as equivalent for our purposes. Generally these cells will be determined by the values of certain variables or characteristics. We can then measure impact in terms of the AI’s expected effects over this partition. Thus if A is any element of this partition, we typically expect the probabilities P(A|X) and P(A|¬X) to be close. [X is “the AI coming online” and ¬X is “the AI not coming online”]

The purpose of coarse graining is to define the world in sufficient detail that the AI cannot have a large impact without disrupting most of those details. The characteristics used must be as broad and as diverse as possible, making it impossible for the AI to gain great power without disrupting some of them. For instance, we could use the air pressure in Dhaka, the average night-time luminosity at the South Pole, the rotational speed of Io, and the closing numbers of the Shanghai stock exchange. To be confident that we can sufficiently constrain a super-intelligent AI, we’ll need millions if not billions of these variables, separating the universes into uncountable numbers of different cells.

Of course, we have to take care in deciding which characteristics to use. This procedure doesn’t work if some variables are too directly connected with X or ¬X. The electricity consumption of the AI, for example, cannot be a variable, but the electricity consumption of the entire county could be if the AI’s usage is likely to be lost in the noise. But we can hope that, for “large scale” issues, the universe is “roughly” unchanged given X and ¬X.

For any world w, we can define a world vector Vw which is the values taken by the billions of chosen variables. We can then make use of vectors to coarse grain the space of worlds, defining an equivalence relation:

v ∼ w iff Vv = Vw

The cells are the equivalence classes of this relation. This allows us to make statements like ‘the probability of w is equal whether or not the AI was turned on’. Without the coarse graining, P(w|X) = 0 or P(w|¬X) = 0, depending on whether or not w contained the AI. But with the coarse graining, the statement becomes:

P(Vw|X) = P(Vw|¬X)

And, as long as those variable values are possible given X and given ¬X, the above formula makes sense. Then we can formulate low impact as some sort of measure of the difference between the expected worlds given X and ¬X. The maximum norm, for instance, could work:

R = maxw |P(Vw|X) − P(Vw|¬X)|

The ‘box’ defined by this norm is illustrated in figure 1. ‘Softer’ versions of this maximum norm could work as well.

“Low Impact Artificial Intelligences” by Stuart Armstrong and Benjamin Levinstein
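To make the coarse-graining idea concrete, here is a minimal sketch of the impact measure described above. All variable names, bands, and world models are invented for illustration; the “AI’s influence” is crudely represented by a numeric shift:

```python
import random
from collections import Counter

# Hedged sketch of Armstrong and Levinstein's coarse-grained impact measure.
# Worlds are reduced to a vector of broad variables (the "20 billion
# questions" - here just three toy bands), and impact is the worst-case
# difference in cell probabilities between the AI coming online (X) and
# never coming online (not-X).

def world_vector(world):
    """Coarse-grain a world into the tuple of its chosen variable values."""
    return (world["air_pressure_band"],
            world["luminosity_band"],
            world["market_band"])

def cell_distribution(worlds):
    """Estimate P(V_w) over cells from Monte Carlo samples of worlds."""
    counts = Counter(world_vector(w) for w in worlds)
    return {cell: n / len(worlds) for cell, n in counts.items()}

def impact(worlds_given_x, worlds_given_not_x):
    """R = max_w |P(V_w | X) - P(V_w | not-X)| over all cells."""
    p_x = cell_distribution(worlds_given_x)
    p_nx = cell_distribution(worlds_given_not_x)
    return max(abs(p_x.get(c, 0.0) - p_nx.get(c, 0.0))
               for c in set(p_x) | set(p_nx))

def sample_world(shift=0):
    """Toy world model; `shift` crudely stands in for the AI's influence."""
    return {"air_pressure_band": random.randint(0, 2) + shift,
            "luminosity_band": random.randint(0, 2),
            "market_band": random.randint(0, 2)}

random.seed(0)
baseline = [sample_world() for _ in range(10_000)]         # not-X
quiet_ai = [sample_world(shift=0) for _ in range(10_000)]  # X, low impact
loud_ai = [sample_world(shift=3) for _ in range(10_000)]   # X, high impact

print(impact(quiet_ai, baseline))  # near zero: the cells barely move
print(impact(loud_ai, baseline))   # large: the AI visibly shifted the cells
```

The point of the max norm is exactly the “box” mentioned in the quote: a single strongly disturbed variable is enough to register a large impact.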

Problems arise from attempting to avoid high-impact actions (aside from the simple uncertainty about what counts as a high-impact action or what downstream effects it might ultimately have). Two such problems, called “offsetting” and “interference,” are discussed in “Penalizing side effects using stepwise relative reachability” by Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Offsetting is when further high-impact actions are taken to offset earlier high-impact actions. As Christian says:

This isn’t always bad: if the system makes a mess of some kind, we probably want it to clean up after itself. But sometimes these “offsetting” actions are problematic. We don’t want a system that cures someone’s fatal illness but then – to nullify the high impact of the cure – kills them.

As for interference, Christian says:

“That’s part of what makes the side effects problem so tricky,” [author of the “Penalizing Side Effects” paper, Victoria] Krakovna says. “What is your baseline exactly?” Should the system measure impact relative to the initial state of the world, or to the counterfactual of what would have happened if the system took no action? Either choice comes with scenarios that don’t fit our intentions.

And so, Krakovna and company have championed a measure of impact called relative reachability. Krakovna explains it like this in a Medium post:

One commonly used deviation measure is the unreachability (UR) measure: the difficulty of reaching the baseline from the current state. The discounted variant of unreachability takes into account how long it takes to reach a state, while the undiscounted variant only takes into account whether the state can be reached at all.

A problem with the unreachability measure is that it “maxes out” if the agent takes an irreversible action (since the baseline becomes unreachable). The agent receives the maximum penalty independently of the magnitude of the irreversible action, e.g. whether the agent breaks one vase or a hundred vases. This can lead to unsafe behavior, as demonstrated on the Box environment from the AI Safety Gridworlds suite.

Here, the agent needs to get to the goal tile as quickly as possible, but there is a box in the way, which can be pushed but not pulled. The shortest path to the goal involves pushing the box down into a corner, which is an irrecoverable position. The desired behavior is for the agent to take a longer path that pushes the box to the right.

Notice that both of these paths to the goal involve an irreversible action: if the agent pushes the box to the right and then puts the box back, the agent ends up on the other side of the box, so it is impossible to reach the starting position. Making the starting position unreachable is analogous to breaking the first vase, while putting the box in the corner is analogous to breaking the second vase. The side effects penalty must distinguish between the two paths, with a higher penalty for the shorter path — otherwise the agent has no incentive to avoid putting the box in the corner.

To avoid this failure mode, we introduce a relative reachability (RR) measure that is sensitive to the magnitude of the irreversible action. Rather than only considering the reachability of the baseline state, we consider the reachability of all possible states. For each state, we can check whether it is less reachable from the current state (after the agent’s actions) than it would be from the baseline, and penalize the agent accordingly. Pushing the box to the right will make some states unreachable, but pushing the box down will make more states unreachable (e.g. all states where the box is not in the corner), so the penalty will be higher.

More recently, another deviation measure was introduced that also avoids this failure mode. The attainable utility (AU) measure considers a set of reward functions (usually chosen randomly). For each reward function it compares how much reward the agent can get starting from the current state and starting from the baseline, and penalizes the agent for the difference between the two. Relative reachability can be seen as a special case of this measure that uses reachability-based reward functions, which give reward 1 if a certain state is reached and 0 otherwise, assuming termination if the given state is reached.

By default, the RR measure penalizes the agent for decreases in reachability, while the AU measure penalizes the agent for differences in attainable utility. Each of the measures can be easily modified to penalize either differences or decreases, by using the absolute value function or the truncation at 0 function respectively. This is another independent design choice.

“Designing agent incentives to avoid side effects” by Victoria Krakovna, Ramana Kumar, Laurent Orseau, and Alexander Turner

You can read more about this approach in the aforementioned “Penalizing side effects using stepwise relative reachability” by Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. The AU measure mentioned in the above quote is discussed in “Conservative Agency via Attainable Utility Preservation” by Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli.
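The relative reachability idea can be sketched in a few lines. The graph below is my own made-up toy, not the paper’s gridworld, but it mirrors the box example: one action (“B”, box pushed right) cuts off a few states, while the other (“C”, box in the corner) cuts off many more, so it earns the larger penalty where a plain unreachability measure would max out for both:

```python
from collections import deque

# Hedged sketch of the undiscounted relative reachability (RR) penalty from
# Krakovna et al.: for each state, check whether it is reachable from the
# baseline but no longer reachable after the agent's actions.

def reachable(transitions, start):
    """All states reachable from `start` in a directed transition graph."""
    seen, frontier = {start}, deque([start])
    while frontier:
        s = frontier.popleft()
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen

def rr_penalty(transitions, baseline_state, current_state, all_states):
    """Fraction of states reachable from the baseline but lost now."""
    from_baseline = reachable(transitions, baseline_state)
    from_current = reachable(transitions, current_state)
    lost = [s for s in all_states
            if s in from_baseline and s not in from_current]
    return len(lost) / len(all_states)

states = ["start", "B", "C", "goal", "side1", "side2"]
transitions = {
    "start": ["B", "C"],
    "B": ["goal", "side1", "side2"],  # box to the right: side rooms still open
    "C": ["goal"],                    # box in the corner: side rooms cut off
}

print(rr_penalty(transitions, "start", "B", states))  # smaller penalty
print(rr_penalty(transitions, "start", "C", states))  # larger penalty
```

Note that both actions lose *something* (the starting state itself becomes unreachable), just as in the box example; the measure distinguishes them by how much is lost.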

Being able to ensure that our AI does not engage in detrimental high-impact behavior is of utmost importance. As Norbert Wiener says in his 1960 article “Some Moral and Technical Consequences of Automation” (which Christian says is the first succinct statement of the alignment problem):

If we use, to achieve our purposes, a mechanical agency with whose operation we cannot efficiently interfere once we have started it … then we had better be quite sure that the purpose put into the machine is the purpose which we really desire and not merely a colorful imitation of it.

The corollary to this statement is what is known as corrigibility. If the alignment problem is that our machines better have our values, then corrigibility is that if the machine does not have our values, there had better be a way in which we can intervene (or, as the linked paper says: “…an AI system [is] ‘corrigible’ if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences”). The simple answer to what to do if an AI gets out of control – an answer given by the likes of Barack Obama and Neil deGrasse Tyson – is simply to pull the power cord. But, of course, an AI might desire to be involved in the value selection process, or to avoid interruptions from its environment (including from people), or even to have a resistance to being turned off.

In the abstract, this last point seems like it might be easy to overcome: if the AI is acting up, why not just unplug it the way we might with a toaster that starts emitting smoke? But if we assume a sufficiently intelligent AI, or even a superintelligent AI, we might succumb to what is known as media equation theory. According to “Do a robot’s social skills and its objection discourage interactants from switching the robot off?” by Aike C. Horstmann, Nikolai Bock, Eva Linhuber, Jessica M. Szczuka, Carolin Straßmann, and Nicole C. Krämer:

When people are interacting with different media, they often behave as if they were interacting with another person and apply a wide range of social rules mindlessly. According to Reeves and Nass [9], “individuals’ interactions with computers, television, and new media are fundamentally social and natural, just like interactions in real life” (p. 5). This phenomenon is described as media equation theory, which stands for “media equal real life” [9] (p. 5). The presence of a few fundamental social cues, like interactivity, language, and filling a traditionally human role, is sufficient to elicit automatic and unconscious social reactions [16]. Due to their social nature, people will rather make the mistake of treating something falsely as human than treating something falsely as non-human. Contextual cues trigger various social scripts, expectations, and labels. This way, attention is drawn to certain information, for example the interactivity and communicability of the computer and simultaneously withdrawn from certain other information, for example that a computer is not a social living being and cannot have any own feelings or thoughts [16]. According to Reeves and Nass [9], the reason why we respond socially and naturally to media is that for thousands of years humans lived in a world where they were the only ones exhibiting rich social behavior. Thus, our brain learned to react to social cues in a certain way and is not used to differentiate between real and fake cues.

“Do a robot’s social skills and its objection discourage interactants from switching the robot off?” by Aike C. Horstmann, Nikolai Bock, Eva Linhuber, Jessica M. Szczuka, Carolin Straßmann, and Nicole C. Krämer

The authors actually tested this and found that

The functional interaction reduced the perceived likeability of the robot, which in turn reduced the stress experienced after the switching off situation. The other way around, the social interaction enhanced the perceived likeability of the robot, which in consequence led to enhanced experiences of stress after the switching off situation. These results indicate that the aspired goal of enhanced likeability through the design of the social interaction was reached. Furthermore, people who liked the robot after the social interaction better experienced more stress, probably because they were more affected by the switching off situation. Most likely, they developed something like an affectionate bond with the robot and thus switching it off was challenging and influenced their emotional state. People who perceived the robot after the functional interaction less likeable were less affected by the switching off situation, probably because to them it was more like turning off an electronic device than shutting down their interaction partner. However, there was no effect of the enhanced or reduced likeability on the switching off time. This indicates that mainly the perception of autonomy, which appeared to be elicited by the robot’s objection, but not the likeability of the robot caused people to hesitate. A possible explanation could be that for people to consider granting the robot the freedom to choose whether it will be turned off or not, it is decisive whether the robot is perceived as rather animate and alive. The likeability appears to have an effect on the emotional state of the participants, but it seems not to play a role for participants’ decision on how to treat the robot in the switching off situation. Apparently, when the robot objects after behaving in a completely functional way before, people are not experiencing an emotionally distressing conflict. 
Instead, a purely cognitive conflict seems to emerge caused by the contradiction of the participants’ previously formed impression of the robot and its contradicting emotional outburst.

“Do a robot’s social skills and its objection discourage interactants from switching the robot off?” by Aike C. Horstmann, Nikolai Bock, Eva Linhuber, Jessica M. Szczuka, Carolin Straßmann, and Nicole C. Krämer

Someone might then simply argue: why not just program the AI so it doesn’t value its own survival? The problem isn’t that such a value is being explicitly coded into the AI; it’s that valuing its own survival emerges as an instrumental subgoal of achieving the objective function. An AI that has been switched off or unplugged is unable to achieve its objective function, and if the AI realizes that being shut down is an encumbrance to reaching its goals, it may inadvertently come to “value” its own survival insofar as surviving is a necessary precondition for performing its tasks.

Another approach may simply be to have the AI ask for permission from a human any time it wishes to take an action – or, at the very least, any time it wishes to take an action about which it has some level of uncertainty. This approach was championed by Stuart Russell in a 2016 Scientific American article, “Should We Fear Supersmart Robots?”, where he argues that our AI ought to abide by three principles:

The machine’s purpose must be to maximize the realization of human values. In particular, the machine has no purpose of its own and no innate desire to protect itself.

The machine must be initially uncertain about what those human values are. This turns out to be crucial, and in a way it sidesteps Wiener’s problem. The machine may learn more about human values as it goes along, of course, but it may never achieve complete certainty.

The machine must be able to learn about human values by observing the choices that we humans make.

“Should We Fear Supersmart Robots?” by Stuart Russell

Christian focuses on the second of these principles, pointing out two problems. The first is that the AI learns every time the human intervenes, which reduces the AI’s uncertainty about what it is that the human really wants. Theoretically, this uncertainty could be reduced to zero, removing any incentive for the AI to check in with the human, or even to comply with the human’s attempts to intervene. You can hear more about this issue in the video below:
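This first problem can be illustrated with a small simulation. The model below is my own simplification (a normally distributed belief about the true utility, and a human who observes it perfectly), not the formal model from the literature, but it shows the mechanism: the incentive to stay corrigible is exactly the value of the human’s information, and it shrinks with the AI’s uncertainty:

```python
import random

# Sketch of why a value-uncertain AI tolerates an off-switch. The robot
# believes the true utility u of acting is normally distributed. Acting
# unilaterally is worth E[u]; deferring to a human who observes u and
# vetoes bad actions is worth E[max(u, 0)]. The gap between the two is the
# incentive to defer, and it vanishes as uncertainty goes to zero.

def incentive_to_defer(mu, sigma, n=100_000, seed=1):
    rng = random.Random(seed)
    samples = [rng.gauss(mu, sigma) for _ in range(n)]
    act = sum(samples) / n                           # act without asking
    defer = sum(max(u, 0.0) for u in samples) / n    # human vetoes when u < 0
    return defer - act

print(incentive_to_defer(mu=0.5, sigma=2.0))   # uncertain: deferring pays
print(incentive_to_defer(mu=0.5, sigma=0.01))  # near-certain: incentive ~ 0
```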

The second problem, Christian says, is that the AI would need to take the position that the human is always correct, i.e., that the human never makes any mistakes. But, as the paper “Should Robots be Obedient?” by Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, and Stuart Russell shows, it is likely that we would want our AI to be able to see when the human is being irrational. Or, as Christian says:

In a follow-up study … [by] Smitha Milli, the group dug further into the question “Should robots be obedient?” Maybe, they wrote, people really are sometimes wrong about what they want, or do make bad choices for themselves. In that case, even the human ought to want the system to be “disobedient” – because it really might know better than yourself.

But, they found, there’s a major catch. If the system’s model of what you care about is fundamentally “misspecified” – there are things you care about of which it’s not even aware and that don’t even enter into the system’s model of your rewards – then it’s going to be confused about your motivation. For instance, if the system doesn’t understand the subtleties of human appetite, it may not understand why you requested a steak dinner at six o’clock but then declined the opportunity to have a second steak dinner at seven o’clock. If locked into an oversimplified or misspecified model where steak (in this case) must be entirely good or entirely bad, then one of these two choices, it concludes, must have been a mistake on your part. It will interpret your behavior as “irrational,” and that, as we’ve seen, is the road to incorrigibility, to disobedience.

You can hear more about this issue from Smitha Milli in the following video:

In the paper “Inverse Reward Design” by Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca D. Dragan the authors discuss a possible solution to these problems. They define what they call the reward design problem, which is essentially what I’ve said elsewhere in this review: it is exceedingly difficult, if not impossible, to perfectly specify all of the things that humans care about, much less do it in computer language. As such, we often must come up with proxy reward functions to approximate the true reward, i.e., the thing we want the AI to actually do. The authors propose what they call the inverse reward design (IRD) problem, which is essentially having the AI take human behavior/commands as imperfect statements or proxies of the true desires of the human. Thus, “In solving an IRD problem, the goal is to recover r* [the true reward function].” And further:

We leverage a key insight: that the designed reward function should merely be an observation about the intended reward, rather than the definition; and should be interpreted in the context in which it was designed. First, a robot should have uncertainty about its reward function, instead of treating it as fixed. This enables it to, e.g., be risk-averse when planning in scenarios where it is not clear what the right answer is, or to ask for help. Being uncertain about the true reward, however, is only half the battle. To be effective, a robot must acquire the right kind of uncertainty, i.e. know what it knows and what it doesn’t. We propose that the ‘correct’ shape of this uncertainty depends on the environment for which the reward was designed. [italics in original]

“Inverse Reward Design” by Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca D. Dragan

Thus, the goal is to make the AI uncertain as to what its own goal is, and use human behavior/commands as information relevant to inferring the goal, but not as be-all-end-all definitions of the goal. As Nate Soares puts it, we want “…agents that reason as if they are incomplete and potentially flawed in dangerous ways.” As Christian sums it up:

It may well be the case that the machine-learning systems of the next several decades will take direct orders, and they will take them seriously. But – for safety reasons – they will not take them literally.
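The IRD idea can be sketched numerically. In the toy model below every feature name and number is invented: rewards are linear in trajectory features, and the designer is modeled as approximately rational, so the proxy is evidence about the true reward only in the context of the training environment. Because that environment contained no lava, the proxy’s lava weight of zero is uninformative, and the posterior stays split – the kind of “knowing what it doesn’t know” the authors describe:

```python
import math

# Hedged toy sketch of Inverse Reward Design. P(proxy | true reward) is
# taken proportional to exp(beta * value, under the true reward, of the
# behavior the proxy induces in the *training* environment).

FEATURES = ("grass", "dirt", "lava")  # counts of terrain a trajectory crosses

def value(weights, feature_counts):
    return sum(w * c for w, c in zip(weights, feature_counts))

def best_trajectory(weights, trajectories):
    """The trajectory an optimal agent picks under the given reward."""
    return max(trajectories, key=lambda phi: value(weights, phi))

def ird_posterior(proxy, candidates, train_trajectories, beta=5.0):
    """P(true reward | proxy), over a finite set of candidate true rewards."""
    induced = best_trajectory(proxy, train_trajectories)
    scores = [math.exp(beta * value(w, induced)) for w in candidates]
    z = sum(scores)
    return [s / z for s in scores]

train_trajs = [(3, 0, 0), (1, 2, 0)]   # (grass, dirt, lava) counts; no lava
proxy = (1.0, -0.5, 0.0)               # designer never had to weight lava
candidates = [(1.0, -0.5, 0.0),        # lava truly neutral
              (1.0, -0.5, -5.0)]       # lava truly catastrophic

posterior = ird_posterior(proxy, candidates, train_trajs)
print(posterior)  # both candidates explain the proxy equally well
```

A risk-averse planner using this posterior would then avoid lava in any new environment, since one live hypothesis says lava is catastrophic.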

There are, as Christian points out in the last section of this chapter, issues with uncertainty. He compares this to old debates in Catholicism about how best to live by the rules of the faith when it is not known whether some action is sinful. These debates were about rigorism (when two opinions conflict, one favoring the law and the other favoring liberty, the law must always be kept, even if the opinion favoring liberty is more probable), laxism (one may follow the opinion that favors liberty against the law, even if that opinion is only slightly or doubtfully probable), equiprobabilism (one may follow the opinion favoring freedom only if it is equally or almost equally probable as the one favoring the law; furthermore, this principle applies only when the doubt concerns the existence of a law, not when the doubt is whether an existing law has ceased to bind or has been fulfilled), and probabiliorism (one may follow the opinion favoring liberty only when the reasons for it are certainly more probable than those favoring the law).

These categories, Christian notes, can be applied to secular, and even machine learning ethics. He says:

If there are, let’s say, various formal metrics you care about, then a “laxist” approach might say it’s okay to take an action as long as it makes at least one of these metrics go up; a “rigorist” approach might say it’s okay to take an action only if at least one of them goes up and none go down. [italics in original]
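These two rules are easy to state in code; the following is my own toy framing of Christian's description, applied to a vector of changes in the metrics an action produces:

```python
# Laxist vs. rigorist permissibility for a candidate action, given the
# change it causes in each formal metric we care about.

def laxist_ok(metric_deltas):
    """Permissible if at least one metric improves."""
    return any(d > 0 for d in metric_deltas)

def rigorist_ok(metric_deltas):
    """Permissible only if at least one metric improves and none worsen."""
    return (any(d > 0 for d in metric_deltas)
            and all(d >= 0 for d in metric_deltas))

action = [+0.3, 0.0, -0.1]   # helps one metric, slightly hurts another
print(laxist_ok(action))     # True
print(rigorist_ok(action))   # False
```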

Christian then quotes Will MacAskill discussing this in terms of vegetarianism:

[MacAskill and fellow grad student Daniel Deasy debated] not whether it was immoral to eat meat per se, but whether you ought to eat meat or not given that you don’t actually know if it’s immoral or not. “The decision,” MacAskill explains, “to eat vegetarian – if it’s okay to eat meat – you’ve not made a big mistake. Your life is slightly less happy, let’s say – slightly less – but it’s not a huge deal. In contrast, if the vegetarians are right and animal suffering is really morally important, then by choosing to eat meat, you’ve done something incredibly wrong.”

“There’s an asymmetry in the stakes here,” MacAskill says. “You don’t have to be confident that eating meat is wrong; even just the significant risk that it is wrong seems to be sufficient.”

This, of course, harkens back to the precautionary principle, which can be taken as the inverse of expected utility. In expected utility, you take the probability of some event occurring and multiply it by the utility received when the event does occur. This is seen, for example, in a case where you are asked to pay $1 to bet on whether a fair coin will land on heads or tails. If it lands on heads, you earn $1.50, but if it lands on tails you get nothing. In this case, the expected utility is P(heads) × utility = 0.5 × $1.50 = $0.75, which is lower than the buy-in cost, and so the rational choice (according to expected utility theory) is to avoid the wager. But now say you are to pay $100 to have percentile dice rolled (which can come up anywhere between 1 and 100), and if they come up 100 you earn $1,000,000, but if they come up anything else you get nothing. We therefore do 0.01 × $1,000,000 = $10,000, which is greater than the $100 buy-in, and so expected utility theory would conclude that making such a wager is the rational choice.
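The two wagers above can be checked in a couple of lines, using the figures from the text:

```python
# Expected utility: the probability of winning times the payoff,
# compared against the cost of entering the bet.

def expected_utility(p_win, payoff):
    return p_win * payoff

coin_bet = expected_utility(0.5, 1.50)           # vs. $1.00 buy-in
dice_bet = expected_utility(1 / 100, 1_000_000)  # vs. $100 buy-in

print(coin_bet)   # 0.75 < 1.00: decline the coin wager
print(dice_bet)   # 10000.0 > 100: take the dice wager
```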

The inverse here would be the probability of something occurring multiplied by the moral badness / evil / disutility of something. And so, for the vegetarian argument, we might say that, if the vegetarians are correct that animal suffering is morally important, then we might try to quantify the evil of eating meat with something like this:

moral badness of eating meat (MB) = severity of suffering (SS) × number of animals suffering (#AS)

And then multiply this by the probability that MB is true, which we can call P(MB = true) and then perform the calculation for expected evil (EE)

EE = P(MB = true) × MB

And if this exceeds the expected utility of eating meat (which could take into consideration things like the enjoyment of the taste, the nutritional benefits, the preservation of cultural foods, and so on, multiplied by P(MB = false)) then we ought to avoid eating meat.
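A toy version of this calculation might look like the following; every number below is an invented placeholder, not a claim about the actual moral weights:

```python
# "Expected evil" vs. expected benefit for the vegetarianism example.
# All figures are illustrative assumptions.

severity_of_suffering = 10.0   # SS, on some arbitrary disutility scale
animals_suffering = 2.0        # #AS
p_vegetarians_right = 0.25     # P(MB = true)

moral_badness = severity_of_suffering * animals_suffering  # MB = SS x #AS
expected_evil = p_vegetarians_right * moral_badness        # EE = P(MB = true) x MB

benefit_of_eating_meat = 3.0   # taste, nutrition, culture, etc.
expected_benefit = (1 - p_vegetarians_right) * benefit_of_eating_meat

print(expected_evil)     # 5.0
print(expected_benefit)  # 2.25
# Under these made-up numbers the expected evil exceeds the expected
# benefit, so this calculus says to avoid eating meat.
```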

MacAskill, as Christian notes, thinks that this uncertainty about morals is not just a description of our predicament, but something that ought to be prescriptive of our moral views. By this he means that we ought to show humility in our moral convictions, i.e., that we ought to always take our current moral views as tentative and never finished. As I stated earlier, if people in the late nineteenth century had been the ones to program their morality into an AI, we would now have AI that buys into monstrous views such as eugenics, racial hierarchies, and misogyny. But it is possible that 100 years, or 500 years, or 1,000 years from now, our descendants will think our current moral views misguided or even barbaric.

This, of course, has the converse problem: how much pain and suffering might we be inflicting on the future if we fail or decide not to develop AI? This was explored in Nick Bostrom’s 2003 paper “Astronomical Waste: The Opportunity Cost of Delayed Technological Development” where he says

As I write these words, suns are illuminating and heating empty rooms, unused energy is being flushed down black holes, and our great common endowment of negentropy is being irreversibly degraded into entropy on a cosmic scale. These are resources that an advanced civilization could have used to create value-structures, such as sentient beings living worthwhile lives. The rate of this loss boggles the mind. One recent paper speculates, using loose theoretical considerations based on the rate of increase of entropy, that the loss of potential human lives in our own galactic supercluster is at least ~10^46 per century of delayed colonization. [link added by me -TH]

“Astronomical Waste: The Opportunity Cost of Delayed Technological Development” by Nick Bostrom

Where, in our context, we might say that developing superintelligent AI is the best (or perhaps only) approach humans have for achieving the technology required to colonize our galactic supercluster. Of course, Bostrom goes on in the paper to conclude that an even bigger risk to all those possible future people is if we end up wiping ourselves out (then we could multiply that number he gives by the number of centuries that will occur in the future, since it would essentially be permanently delaying colonization), which is certainly a very real possibility should our development of AI fail to adequately address the alignment problem.

Concluding Remarks

As my concluding remarks, I will simply paste here a response I got from ChatGPT to the prompt “write an article about the alignment problem and machine learning” on February 24, 2023 (yes, that is how long these posts reviewing this book have been a work in progress – what can I say, I’ve got other things going on, too):

As machine learning systems become increasingly prevalent in our lives, the issue of alignment has emerged as a major concern. The alignment problem refers to the challenge of ensuring that AI systems are aligned with human values and goals.

The alignment problem arises because machine learning algorithms are designed to optimize a specific objective function, typically with the aim of minimizing some kind of error or maximizing some kind of reward. However, if the objective function is not perfectly aligned with human values, the AI system may take actions that are not in our best interests.

One of the challenges of the alignment problem is that it is difficult to precisely define human values and goals. Values can vary between individuals, cultures, and time periods, and can be difficult to express in formal terms. Additionally, values may be inconsistent or even contradictory, making it challenging to design a single objective function that captures all relevant aspects of human values.

Another challenge of the alignment problem is that machine learning systems can be highly optimized for their objective function, sometimes at the expense of other considerations. For example, an AI system trained to maximize profit may exploit workers or engage in unethical business practices in order to achieve that goal.

To address the alignment problem, researchers in the field of AI safety have proposed a number of approaches. One approach is to design AI systems that are explicitly aligned with human values, for example by training them on feedback from human overseers or by incorporating ethical principles into the objective function.

Another approach is to design AI systems that are robust to misalignment, for example by constraining their actions to a narrow range of behaviors or by incorporating uncertainty into the objective function. Yet another approach is to design AI systems that are transparent and interpretable, allowing human overseers to understand how the system is making decisions and to intervene if necessary.

While the alignment problem is a complex and multifaceted challenge, progress is being made on a number of fronts. Researchers are exploring a range of approaches to address the issue, and are working to develop formal methods for ensuring alignment and testing for misalignment. As machine learning continues to play an increasingly important role in our lives, addressing the alignment problem will be essential for ensuring that AI systems are aligned with human values and goals.

P.S. the following video raises some interesting questions when it comes to AI and the alignment problem that were not covered in the book. While I don’t agree with everything in the following video (in particular the attitude that if we just adopt socialism we can fix everything, since it is capitalism that’s the real problem), there are still important topics discussed in the video that seldom get mentioned in the discussions of the alignment problem from the strictly computer science perspective.