“The Alignment Problem: Machine Learning and Human Values” by Brian Christian – Review and Commentary – Part 2

The Alignment Problem: Machine Learning and Human Values by Brian Christian, W. W. Norton & Company (October 6, 2020), 496 pages

See Part 1 of the review here.


Intelligence Quotient, or IQ, is supposed to be a measure of a person’s ability to reason, see patterns, and make predictions. Yet IQ is quite controversial – a controversy ranging anywhere from claims that IQ tests are inaccurate or biased all the way to claims that IQ tests, and anything concerning IQ, are immoral. Yet, even if there is no test that can accurately and reliably gauge an individual’s intelligence in some quantitative way, most people are aware of some ineffable sense in which some individuals are just smarter than the average individual (and, conversely, some are less smart in some ways than others). Some people just seem to get it a lot faster than other people when learning new things about, for instance, math, computers, cars, music, and so on. Some people can be shown how to do something once and are then able to repeat it with near mastery while others can practice for hours and still not do it as well. Some people are just fountains of creative ideas while others struggle in this department. Some people just have quick wits and can recall relevant information easily in almost any situation, while others (like me) are seriously lacking in this department. The existence of savant syndrome itself attests to the notion that someone can just naturally be good at something.

Indeed, if it were the case that all humans had exactly equal intelligence (however we want to define or measure it), this would demand explanation. How, exactly, could such a state of affairs have even come about? We know from biology that there is variation within a species, and this will apply to intelligence(s) as well. If it so happened that humans the world over developed exactly equal intelligence, this would be a biologically unprecedented phenomenon, something that would challenge what we know about evolution, genetics, neuroscience, and human (or, really, any species) development. It would also demand an explanation as to why people have such large differences in their capacity and aptitude for the kinds of domains I mentioned in the previous paragraph. All students from a particular region, going to the same school, and indeed taking the same class from the same teacher giving the same assignments and grading from the same rubric, can still have a very large standard deviation in grades. If it were the case that all humans had exactly equal intelligence, or at least exactly equal potential intelligence (i.e., before having their tabula rasa filled), then this kind of variance would call out for explanation. Additionally, an explanation as to why there can be inter-species differences in intelligence but not intra-species differences in intelligence would be required: if biology has nothing to do with intelligence (i.e., if intelligence is some thing added on, in addition to, and separate from, neurobiology), then why are there differences in intelligence between, say, humans and chimpanzees?

This last point brings up an interesting issue with intelligence: when we humans talk about intelligence, what we usually mean is human intelligence – competency in the domains humans find important. But, of course, there are domains where chimpanzees outperform humans, even in domains that humans consider important, such as working memory.

My point here is not to defend (or criticize) IQ as a measure of intelligence. My point is to highlight that there are differences in intelligence, in whatever way we define or measure it, within and between species. Yet, as much of a difference as there may be in the intelligence(s) of different humans, any two individual humans, in the space of all possible intelligences, are extremely close together. In other words, for all our variance in intelligence, if we zoom out from our human-centric bias, we are not all that different. Certainly, when it comes to doing algebra, just about any given human will outperform just about any member of any other species. Yet, a human would be bad at most kinds of problem solving that an ant must engage in. In other words, humans are going to be terrible in most of the domains of interest for other species. When we humans compare ourselves to other species in the domains that interest us, we obviously win almost every time, but human intelligence is not the only kind of intelligence.

It is also the case that, in the domains of interest to humans, we humans are not even always that great. Performing arithmetical calculations, for instance, is something that a simple calculator is vastly better at than any human. Meaning that, even in the domains of interest to humans, the intelligence difference between any two given humans is going to be quite small when compared to the space of potential competency within the domain. Arithmetic, working memory, long-term memory, and pattern recognition are just a few areas in which humans are now vastly out-competed by our technology.

What is interesting in AI science is that the kinds of things humans have found difficult (e.g., arithmetical operations) tend to be easy for a computer, while the kinds of things humans tend to find easy are difficult for computers. Take, for instance, the so-called cocktail party effect, where humans can focus on specific voices or sounds among a cacophony of other voices and sounds. This is something humans do automatically and with ease, yet it remains a difficult problem in machine learning.

My takeaway from this introduction consists of two things:

  1. The space of potential competency in any given domain of intelligence for which humans are interested is enormous, with all humans occupying just a small corner of this space
  2. The space of different kinds of intelligence is also gigantic, with humans taking up an even smaller region of this space

Yet, intelligence is not the only aspect of intelligent beings that is of interest. Just as important, if not more important, are the goals, desires, and motivations of those intelligent beings. I think there is often a tacit assumption that these two things – intelligence and goals/motivations – are linked, i.e., that the intelligence of a person (or organism) determines their goals. In other words, smart people are motivated by smart people things. But this is not necessarily the case. Indeed, the causality likely works in the opposite direction – the kinds of goals a person, or members of any species, possesses are what have determined (to some significant extent) their level (and repertoire) of intelligence(s). Yet, it is conceivable that the two things – intelligence and the kinds of goals something has – are almost completely separable. Indeed, this is very much the case with artificial intelligence, where the intelligence and what the technology is for have no causal link in either direction. A calculator can be used to aid in the logistics of sending food to starving people, or to calculate the precise trajectory of a nuclear missile fired toward a densely populated city.

It’s here where the alignment problem comes in: aligning the intelligence of our AI with the goals of humans (or, even better, the kinds of goals humans determine to be the ones that humans ought to have, not necessarily the ones we do have). For humans, the goal is usually some vague notion of achieving the good life, however one wishes to define that, but this does not necessarily have to be the case with AI.

Part 2: Agency

So, just what is the good life? Presumably one with the greatest number of the most rewarding subjective states rather than one with more punishing subjective states. This is certainly the case for AI. An AI, if it is to do anything, requires some kind of motivation. This motivation comes from the objective function, wherein the AI is rewarded for getting near the objective and punished for getting further away from it. The AI is then programmed to maximize reward (and minimize punishment). But, as discussed in the previous post, there are two issues: (A) what sorts of objective functions will align with human desires, values, and morals? And (B) how can such objective functions be made explicit so that they can be coded into AI software?

For (A) one issue that most people are painfully aware of is that all humans cannot even agree on what the good life is (i.e., a life that satisfies the most desires, aligns most with one’s values, and is lived morally). Not only that, but desires, values, and morals change over time: even if we got our desires, values, and morals right and were able to explicitly code them, we cannot be sure that our current desires, values, and morals are even the ones that humans ought to possess.

For (B) there are many issues, but two big ones are:
1. Can everything humans care about be explicitly and unambiguously stated in any kind of language?
2. Will coding these things into a computer give the result we think (or hope) they will?

It doesn’t require much effort to discover for oneself that the first issue becomes exceedingly difficult. Coming up with necessary and sufficient conditions to define anything becomes very difficult (see, for instance, natural kinds, social ontology, sorites paradox, fuzzy concept, and homeostatic property clusters). But these tend to be problems left to philosophers, sociologists, and linguists. It’s often the second issue that computer scientists focus on. The famous thought experiment by Nick Bostrom about a superintelligent AI with the goal of maximizing paperclip production is often brought up to illustrate this point:

Artificial intellects need not have humanlike motives.

Humans are rarely willing slaves, but there is nothing implausible about the idea of a superintelligence having as its supergoal to serve humanity or some particular human, with no desire whatsoever to revolt or to “liberate” itself. It also seems perfectly possible to have a superintelligence whose sole goal is something completely arbitrary, such as to manufacture as many paperclips as possible, and who would resist with all its might any attempt to alter this goal. For better or worse, artificial intellects need not share our human motivational tendencies.

The idea is that, if you are the CEO of a company that makes paperclips, and your company invents or purchases a superintelligent AI Genie (or Sovereign), you might naively program it or command it to maximize paperclip productivity. This is your company’s goal (or, at least, a necessary sub-goal of maximizing profit). The AI might then take this very literally and, being superintelligent, could hack into any network, socially engineer any powerful person, and use its incomprehensible intellect to turn the entire planet into a paperclip-maximizing system, ingeniously stopping anyone who might attempt to thwart this goal (perhaps, through its vast data-monitoring capabilities, even stopping people before they know that thwarting it is what they want to do).

The idea sounds ludicrous to us because maximizing paperclip production seems like such an inane goal: wouldn’t something superintelligent want to explore the cosmos or discover the theory of everything or have as many fun and pleasurable experiences with friends as possible, or some other very human goal? But the absurdity of paperclip maximization is why it illustrates Bostrom’s point so well: the AI will not necessarily have the same kinds of ultimate goals as humans. Indeed, the AI will have goals set by its objective function, and it will be rewarded by getting nearer to those goals. Thus, every paperclip the AI manages to produce rewards it, and so to maximize reward the AI will seek to maximize the number of paperclips produced. Turning the world into paperclips is the good life for that AI. And, being superintelligent, it will not simply take a brute-force approach to this, but will be able to plan and strategize in ways so complex and brilliant that no human or group of humans could ever comprehend or predict the AI’s novel and ingenious methods for achieving its goal (and for thwarting anyone who might stand in its way).

And so, the question is not just what objective functions to give our AI, but what kinds of things our AI should find rewarding in the first place. It is to these questions that Christian turns his attention in Agency, the second part of the book – looking at more immediate and practical issues in AI than our superintelligent paperclip manufacturer, though the problems are not unrelated.

Chapter 4: Reinforcement

Before getting to the book, I’ll give a brief crash course in Reinforcement Learning (RL). RL is, in essence, giving a reward to an agent (e.g., an artificial intelligence) for achieving certain states using its repertoire of actions. What this means is that, in the environment in which the agent exists, it occupies a certain state. There are other possible states to which the agent can transition through its repertoire of actions. For instance, say the agent is in a 6×6 gridworld environment in which one space can have “food” while another can have “water”. The agent can then occupy states that are the 36 spaces, while also being in state “hungry” or “not hungry” as well as “thirsty” or “not thirsty”. The agent’s repertoire of actions is “move north”, “move east”, “move south”, “move west”, “eat”, and “drink”, with the last two doing nothing if the agent is not on the “food” or “water” spaces, respectively.

This is what Satinder Singh, Richard L. Lewis, and Andrew G. Barto did in this paper. The agent in this environment can be rewarded, say, for maintaining the states “not hungry” and “not thirsty” for as long as possible. What’s interesting with reinforcement learning, though, is that the agent’s behavior is not determined beforehand, but emerges out of its attempt to maximize rewards. When the environment is changed – e.g., by putting barriers in it or altering where the “food” and “water” squares are – the agent can come up with different strategies for maximizing reward.
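To make this concrete, here is a minimal sketch of how such an agent might be trained with tabular Q-learning. The grid is 6×6 as in the example, but the food and water locations, reward values, learning constants, and the stochastic return of hunger and thirst are my own illustrative assumptions, not details from the Singh, Lewis, and Barto paper.

```python
import random

random.seed(0)  # for reproducibility

# Illustrative assumptions: square locations, rewards, and constants are mine.
SIZE = 6
FOOD, WATER = (0, 0), (5, 5)
ACTIONS = ["north", "east", "south", "west", "eat", "drink"]

def step(pos, hungry, thirsty, action):
    """Apply one action; return the new state plus the reward earned."""
    x, y = pos
    if action == "north":
        y = max(0, y - 1)
    elif action == "south":
        y = min(SIZE - 1, y + 1)
    elif action == "west":
        x = max(0, x - 1)
    elif action == "east":
        x = min(SIZE - 1, x + 1)
    elif action == "eat" and (x, y) == FOOD:
        hungry = False
    elif action == "drink" and (x, y) == WATER:
        thirsty = False
    # Reward every step spent neither hungry nor thirsty.
    reward = 1.0 if (not hungry and not thirsty) else 0.0
    return (x, y), hungry, thirsty, reward

Q = {}  # maps (state, action) -> estimated long-run reward
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def choose(state):
    """Mostly pick the best-known action, occasionally explore at random."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

state = ((2, 2), True, True)  # start mid-grid, hungry and thirsty
for _ in range(20000):
    a = choose(state)
    pos, hungry, thirsty, r = step(*state, a)
    # Hunger and thirst return stochastically (an assumption, to keep the task ongoing).
    if random.random() < 0.1:
        hungry = True
    if random.random() < 0.1:
        thirsty = True
    nxt = (pos, hungry, thirsty)
    # Q-learning update: nudge the estimate toward reward + discounted best next value.
    best_next = max(Q.get((nxt, b), 0.0) for b in ACTIONS)
    old = Q.get((state, a), 0.0)
    Q[(state, a)] = old + alpha * (r + gamma * best_next - old)
    state = nxt
```

Notice that no behavior is specified in advance: if the environment or reward schedule changes, the learned Q table – and hence the emergent strategy – changes with it.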

There are of course multiple kinds of RL algorithms that can be used in a number of different environments with various degrees of freedom.


More generally, in RL there is a set of states S = { s1, …, sn }, a set of k ≥ 2 possible actions A = { a1, …, ak }, and time steps t. The agent is in state st ∈ S at time t and takes action at ∈ A, which transitions the agent into state st+1 ∈ S with probability P(st, at). The agent is then given a scalar reward for the state-action-state combination, R(s, a, s’) : S×A×S → ℝ. The agent learns to adopt policies π : S → A (a function that says: when in state si ∈ S, take action ai ∈ A) that maximize the reward R, by adjusting the probability of taking each action in each state.

Now back to the book.

Christian begins the chapter with a brief overview of some of the big names in early reinforcement learning: Edward Thorndike, Arthur Samuel, Andrew Barto, and Richard Sutton, among others. Reinforcement learning, which rests on what can be called the reward hypothesis, bucked against early 20th century thought claiming that biological systems seek out homeostasis through systems of negative feedback, looking to minimize perturbations from the norm. Reinforcement learning instead says that biological systems – cells, organisms, societies – are in fact heterostatic maximizers, looking to maximize reward through positive feedback mechanisms. A reward can be any kind of “scalar”, meaning that it is fungible and scalable, and so can be different depending on the context.

What this means in practice is that a biological system, upon earning some reward, will desire to continue earning that reward. To do this the system will often repeat behaviors it had undertaken prior to receiving the reward. So, for instance, a mouse given a switch that dispenses food when flipped will continue flipping that switch in order to get more food. Now say the switch only dispenses food when a certain light turns on, and otherwise gives no food. The mouse will quickly learn that the light coming on signals food – the light is a conditioned stimulus and the food is an unconditioned stimulus. This is known as the Rescorla-Wagner model.

Reinforcement can also come with punishments, which the biological system will wish to avoid. Christian points out that reinforcement is different from supervised and unsupervised learning in three important ways. First, in reinforcement learning each decision is connected. So, for instance, if someone is trying to learn to play chess, they are rewarded upon winning the game and punished upon losing. But at what point could a person be said to have lost the game? If the game takes 40 moves, it wasn’t just the final move that caused the game to be lost. As such, it’s difficult to distribute the rewards and punishments optimally. Second, reinforcement learning is, as Christian says, like learning from a critic rather than a teacher. Where a teacher will tell you when you did something wrong and then give you the correct answer (as is done in supervised and unsupervised learning), the critic just yells “booooo!” when you get it wrong but never tells you the right answer. Third, reinforcement learning is delayed. In our chess game we don’t know which move(s) caused us to lose; we don’t get the punishment right when the guilty move(s) were taken. These are collectively known as the credit assignment problem, discussed by Marvin Minsky in “Steps Toward Artificial Intelligence”.

This is where prediction comes in handy. A person playing a game of chess can, at most points in the game, come up with some probability of their chances of winning. These can then act like rewards and punishments in the middle of the game. This is what is known as temporal difference learning, where estimates (predictions) are periodically (or continuously) updated as new information about the situation is acquired.
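The update itself can be sketched in a few lines. Here is a minimal, hypothetical example of tabular TD(0) learning on a five-state “game” where only reaching the last state pays off; the states, reward, and constants are all illustrative assumptions.

```python
# Tabular TD(0): update each state's value estimate toward the reward
# actually received plus the (discounted) estimate of the next state.
# The 5-state chain and all constants here are illustrative assumptions.
alpha, gamma = 0.1, 0.9       # learning rate and discount factor
V = [0.0] * 5                 # value estimate for each of 5 states

for _ in range(1000):
    for s in range(4):        # walk the chain left to right
        s_next = s + 1
        r = 1.0 if s_next == 4 else 0.0   # only the final state pays off
        # TD error: how much better (or worse) things look than predicted.
        td_error = r + gamma * V[s_next] - V[s]
        V[s] += alpha * td_error

# Earlier states settle at discounted estimates of the eventual reward
# (V ≈ [0.729, 0.81, 0.9, 1.0, 0]), so a mid-"game" prediction is
# available well before the final outcome.
```

The TD error plays the role of the mid-game reward or punishment: a positive error means things just got more promising than predicted, a negative one means less.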

Indeed, the human brain is now known to function on temporal difference learning (see “A framework for mesencephalic dopamine systems based on predictive Hebbian learning” and “A Neural Substrate of Prediction and Reward“). The neurotransmitter dopamine is often said to be the reward molecule – you do something good and get a hit of dopamine as a reward. But this is actually inaccurate. In fact, dopamine is released at the onset of the conditioned stimulus, not the unconditioned stimulus. Or, in terms of the mouse example from earlier: the mouse gets a dopamine hit when the light comes on, not when the food is dispensed. But that dopamine will be quickly shut off if the mouse hits the switch when the light is on and no food is dispensed, acting as a sort of punishment. As Christian says:

A sudden spike above the brain’s dopamine background chatter meant that suddenly the world seemed more promising than it had a moment ago. A sudden hush, on the other hand, meant that suddenly things seemed less promising than imagined. A normal background static meant that things, however good or bad they were, were as good as expected. [italics in original]

And, famously, what is expected can change depending on a person’s circumstances – the so-called hedonic treadmill. Happiness is derived from things going better than expected (as opposed to simply having gone well), and so once a person’s expectations adjust to their present circumstances, that becomes the new background level of dopamine.

Christian discusses briefly that this raises a couple interesting philosophical questions. First, if happiness is being pleasantly surprised (having outcomes turn out to be better than expected), then what of a person (or AI) that knows everything about its environment? Such an entity would be incapable of happiness on the temporal differences model. Second, if happiness is defined as being pleasantly surprised, then don’t AI systems that learn through temporal differences have some semblance of happiness (and lack thereof)? In other words, just like we take the happiness of humans and animals into our moral considerations, should we also spare some moral consideration for the happiness of neural networks?

There are still questions that the temporal differences model is unable to answer. For instance, if dopamine is what brings happiness when things go better than expected, then what of actual pleasurable sensations? For instance, the mouse gets a hit of dopamine when the light comes on, but what about the pleasure the mouse feels upon eating the food? Further, how is pleasure actually “measured” via dopamine? For instance, how does the brain adjudicate between the pleasures of a tropical beach vacation and a mountain skiing vacation – how is it that a person considers one “better” than the other? Or, even more mundane, choosing between Indian and Mexican food.

This chapter wraps up by coming back around to the alignment problem. We have a well-supported model for how learning occurs, but how ought we get our AI to actually learn what we want them to? Christian says:

Reinforcement learning in its classical form takes for granted the structure of the rewards in the world and asks the question of how to arrive at the behavior … that maximally reaps them. But in many ways this obscures the more interesting – and more dire – matter that faces us at the brink of AI. … Given the behavior we want from our machines, how do we structure the environment’s rewards to bring that behavior about? How do we get what we want when it is we who sit in…the critic’s chair – we who administer the food pellets, or their digital equivalent? [italics in original]

Chapter 5: Shaping

Christian begins with the story of how B.F. Skinner came up with the notion of shaping. Shaping is the process of rewarding behaviors that even slightly resemble the behavior one is interested in. This will make the subject (pigeons in Skinner’s case) perform those behaviors more. Rewards can then be given for getting even closer to the desired behavior. Over time, the subject will land on performing the behavior of interest. Christian puts it like this:

[Shaping is] a technique for instilling complex behaviors through simple rewards, namely by rewarding a series of successive approximations of the behavior. [italics in original]

This is a trial-and-error process. The subject (a pigeon, a human student, an AI) behaves more-or-less randomly. Some of those behaviors are rewarded while others are not. The subject will then attempt to repeat the behaviors that conferred some reward. In engaging in those behaviors, there is some random noise to it, so the action(s) is (are) not always exactly the same. Thus, sometimes the action(s) will be somewhat closer to the behavior(s) desired by the trainer, and so the reward is increased. The agent will then attempt to shift their behavior(s) toward being even closer to the desired behavior(s). This process iterates until the desired behavior(s) is (are) arrived at.

Algorithmically, this shaping process can be expressed in the so-called epsilon-greedy strategy, where the program (our subject) engages in behaviors it believes will earn reward 99% of the time, but behaves randomly 1% (epsilon) of the time, allowing the program to explore a wider range of the behavior space. There are a number of different forms of this trial-and-error reinforcement learning, including model-based and model-free methods, as well as value-based and policy-based approaches.
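The epsilon-greedy rule itself is only a few lines of code. Here is a minimal sketch; the action names and reward estimates are hypothetical.

```python
import random

random.seed(0)  # for reproducibility

def epsilon_greedy(q_values, epsilon=0.01):
    """Explore randomly with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore: any action at random
    return max(q_values, key=q_values.get)     # exploit: best current estimate

# Hypothetical reward estimates for three actions.
q = {"left": 0.2, "right": 0.8, "wait": 0.1}
picks = [epsilon_greedy(q, epsilon=0.05) for _ in range(1000)]
# Mostly "right", with occasional exploratory picks of the others.
```

The occasional random pick is what lets the subject stumble into behaviors even closer to the desired one, which the trainer can then reward more.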

When using reinforcement learning, there are a few things to keep in mind. One is what is called the schedule of reinforcement, which is the set of rules dictating how often an agent (animal, human, AI) is reinforced for a particular behavior – it can be continuous or intermittent. Then, of course, one needs to be conscious of the sparsity of rewards: if they are too sparse, the subject may be unable to determine which behaviors are being rewarded. For instance, if an AI is being trained to play chess, rewarding it only for winning does not give much feedback about which moves were the correct ones to make along the way.

To help overcome sparsity, there are a couple of things a programmer can do. One is the use of a curriculum, which can improve performance: easier examples of a problem are used early on, gradually working up to more difficult ones. The program can start training on a simplified version of the problem, only then to increase in complexity.

Another approach is through incentives or pseudorewards, which reward steps along the way to the ultimate goal. But, Christian warns, one must be careful with incentives. You might incentivize a certain behavior A thinking it will bring about some outcome B, but in fact doing A brings about some unwanted result C. Alternatively, if one is trying to train a program to engage in some desired behavior Y by rewarding Z, the agent may instead discover that some behavior X is a more efficient way of attaining Z than Y is, and so learns to engage in the undesirable behavior X.

These are often called loopholes when applied to human laws and regulations: let’s have a progressive tax that taxes the poorest people the least. Well, I’m a rich person and would have to pay higher taxes on income, so what if, instead of getting paid a salary, I keep most of my assets in stocks and just take out loans with those stocks as collateral, paying back the interest with dividends? Then I look on paper like I have no money or income and therefore do not have to pay taxes! The desired outcome was that poor people don’t have to pay as much, but the incentive rewards unscrupulous behavior among the rich. Similarly, in machine learning, you might reward a self-driving car on a test course for small movements toward the goal and end up having the vehicle drive in circles, since for half the circle it is moving in the direction of the goal.

Christian quotes Andrew Ng:

A very simple pattern of extra rewards often suffices to render straightforward an otherwise intractable problem. However, a difficulty with reward shaping is that by modifying the reward function, it is changing the original problem M to some new problem M’, and asking our algorithms to solve M’ in the hope that solutions can be more easily or more quickly found there than in the original problem. But, it is not always clear that solutions/policies found for the modified problem M’ will also be good for the original problem M.

More formally, this can be stated as a Markov Decision Process (MDP), which is a 5-tuple

M = (S, A, T, γ, R)


S = { s1, …, sn | n < ∞ }

are the possible states that the program can be in and

A = { a1, …, ak | k ≥ 2 }

are the 2+ actions that can be taken and

T = { Psa(∙) | s ∈ S, a ∈ A }

are the state transition probabilities, with Psa(s’) the probability of transitioning from state s using action a to state s’. Then γ ∈ [0, 1) is the discount factor, which is set <1 to make immediate rewards larger than future rewards, and R is the reward function

R(s, a, s’) : S×A×S → ℝ

such that a reward is given for taking action a while in state s to achieve state s’. A policy, then, is a function

π : S → A

over the states S. This gives a value function at each state s given by

Vπ(s) = E[r1 + γr2 + γ²r3 + …; π, s]

where ri are the rewards for the ith step of policy π from state s. The optimal policy is then π*M which gives

V*(s) = supπ Vπ(s)

The shaping reward then is a transformation of our function M to M’ such that we now have

M’ = (S, A, T, γ, R’)

with the new reward function R’ = R + F such that

F(s, a, s’) : S×A×S → ℝ

is called the shaping reward function. This means the new reward function is R(s, a, s’) + F(s, a, s’). We then have F(s, a, s’) = r when the program moves toward the goal and F(s, a, s’) = 0 when it moves away from the goal. The issue then is to ensure that the optimal policy π*M’ for the shaped problem M’ is also optimal in the original problem M.

What we can see is that, instead of F(s, a, s’) = 0, we could instead have F(s, a, s’) = -r. In other words, punish the program for moving away from the goal so that, for example, the vehicle will not drive in circles, because half the circle is spent moving away from the goal, thereby subtracting away all reward. This can be done by modeling the state space with a potential function φ(s) such that

F(s, a, s’) = γφ(s’) – φ(s)

Thus, moving from s to s’ yields a positive reward if s’ is closer to the goal (it has a higher potential) but a negative reward (i.e., a punishment) if s’ is further from the goal (a lower potential). In this way, different states are rewarded (or punished) as opposed to actions.
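A quick sketch makes the circle-driving fix concrete. Here I assume a 1-D track with the goal at position 10 and a potential that grows (becomes less negative) as the agent nears the goal; both the track and the choice of φ are my own illustrative assumptions.

```python
# Potential-based shaping, F(s, a, s') = γφ(s') − φ(s), on an assumed
# 1-D track with the goal at position 10. The potential φ here is an
# illustrative choice, not a prescribed one.
GOAL, gamma = 10, 0.99

def phi(s):
    return -abs(GOAL - s)   # higher (less negative) the closer to the goal

def shaping_reward(s, s_next):
    return gamma * phi(s_next) - phi(s)

# Moving toward the goal earns a positive shaping reward;
# moving away earns a negative one (a punishment).
toward = shaping_reward(3, 4)
away = shaping_reward(4, 3)

# A path that loops back to its starting state nets nearly zero
# (exactly zero when γ = 1), which is why shaping this way removes
# the incentive to drive in circles.
loop = sum(shaping_reward(s, s2) for s, s2 in [(3, 4), (4, 5), (5, 4), (4, 3)])
```

The telescoping of γφ(s’) − φ(s) along any closed path is what guarantees the loop nets (almost) nothing, and it is also what preserves the optimal policy of the original problem.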

Christian discusses the two-level reward system. In terms of humans, we could say that level one consists of rewards that direct behavior (feeling good upon earning the respect of peers), and level two consists of the goals that determine those rewards (making things feel good if they lead to greater reproductive success). Similarly, in machine learning there can be short-term rewards that seem haphazard but actually help accomplish long-term goals (e.g., for humans that would be like spending time and effort to earn the respect of peers, which is only a sub-goal of achieving reproductive success).

Computer scientists find that it is often beneficial to allow the AI to come up with its own short-term rewards. As Singh et al. put it:

Motivating the RL [reinforcement learning] framework are the following correspondences to animal reward processes. Rewards in an RL system correspond to primary rewards, i.e., rewards that in animals have been hard-wired by the evolutionary process due to their relevance to reproductive success. In RL, they are thought of as the output of a “critic” that evaluates the RL agent’s behavior. Further, RL systems that form value functions, using, for example, Temporal Difference (TD) algorithms, effectively create conditioned or secondary reward processes whereby predictors of primary rewards act as rewards themselves. The learned value function provides ongoing evaluations that are consistent with the more intermittent evaluations of the hard-wired critic. The result is that the local landscape of a value function gives direction to the system’s  preferred behavior: decisions are made to cause transitions to higher-valued states.

In humans, we have engineered our society to hijack many of our own short-term rewards. Packing our food with sugar, fat, and salt hijacks the evolutionary instinct that says “when you come by food with sugar/fat/salt, eat as much as you can because you don’t know when you will get another opportunity.” Video games, too, are good at feeding us regular rewards for simple actions. This has led to the theory of gamification, which says that humans ought to bend this to our will and make doing beneficial things into a kind of game “…through the application of game-design elements and game principles (dynamics and mechanics) in non-game contexts” (e.g., earning points for things in real life, or having achievements and leaderboards in the real world). 


These kinds of rewards (food tasting good, earning points in a video game), of course, are extrinsic rewards. But what of intrinsic rewards? That brings us to the next chapter.

Chapter 6: Curiosity

You may have heard the term artificial general intelligence, or AGI. This is an artificial intelligence that can function in multiple domains. What this book has been talking about up to this point is what is known as narrow artificial intelligence, i.e., an artificial intelligence that functions in one or just a narrow range of domains. A calculator, for instance, is a narrow AI that functions only in mathematics. A phone is a narrow AI that functions in apps (and, ever so occasionally, making phone calls).  Your calculator or phone, for instance, cannot drive your car, prepare you a meal, fix your plumbing, take X-ray images, play a guitar, defuse a bomb, sow and harvest wheat, appreciate art, or any number of other domains. They are narrowly focused on a few things, and they do them at a superhuman level.

An artificial general intelligence, on the other hand, is an AI that is good at many things, or more likely an AI that has the capacity to learn to be good at many things. Usually with AGI, “many things” means “at least most of the domains in which humans are interested,” though it does not necessarily need to be limited to human interests. What makes this a frightening notion is the prospect of an AI that can function in every domain humans can, but does all of it at a superhuman level, in the same way that a calculator can do math at a superhuman level. This is known as a superintelligence.

But let’s back up a bit. What does something require in order to become proficient in multiple domains? One requirement is the ability to be flexible, i.e., to alter behavior in light of new information. Many narrow AIs can do this, as has been explored so far throughout the book. But another requirement seems to be curiosity, or at least something like curiosity. A calculator, for instance, has no curiosity. It has no issue being left alone in a dark drawer for months, even years, at a time. Nor will it ever learn anything more (nor desire to learn anything more) than what is already programmed into it. A neural network as discussed so far is also not curious. It does not want for more data, nor does it ever tire of the data it’s given.

Yet curiosity seems to be something only animals, such as humans, possess. How can a computer be curious?

To understand curiosity, we need to understand what motivates something. Animals (including humans) can be motivated by explicit rewards: animals hunt or graze in order to get food; humans get jobs in order to acquire money, which they need to eat and have shelter. When humans play a video game, they are motivated to earn the highest score or complete the explicitly stated objective. All of this is what is known as extrinsic motivation. But organisms also have what is known as intrinsic motivation, which is the motivation to do something for its own sake. Curiosity arises from intrinsic motivation – the motivation to explore some topic or space just because you find it interesting. There is no tangible, external reward for going down a Wikipedia rabbit hole, for instance, but you might simply find the next link too enticing not to click on it.

Curiosity, Christian explains, consists of novelty, surprise, incongruity, and complexity. Novelty has to do with encountering situations not experienced before. Rarely does anyone remain interested in just staring at a single wire, even if they find electronics interesting. Often, once someone has achieved some proficiency in a particular subject, their intrinsic motivation will compel them to branch out to something else (though it may be something closely related to the previous topic). Surprise has to do with just how novel something is. Complexity has to do with the level of cognitive load the novelty requires. In short, novel and surprising things are new in key, salient ways and defy our expectations: they force us to guess how they will behave, and they present ambiguity that we want to resolve.

The idea, then, as Christian puts it, is: “What would happen if you actually rewarded the agent, not just for scoring points, but simply for seeing something new? And would that, in turn, make for better agents, able to make faster progress than those trained only to maximize reward and occasionally mash buttons at random?” In other words, allow the agent to obtain rewards for experiencing novel or surprising states (i.e., give it curiosity) and see if that makes the agent learn faster.

To even begin constructing AIs with curiosity, two things need to be in place. First, the AI needs some way of sensing its environment; only then can it determine that an experience is novel. Second, the agent requires some sort of reward function, i.e., a way of measuring novelty such that novelty is preferred, or selected for, over non-novelty. To do this, computer scientists can combine convolutional neural networks and reinforcement learning in what is known as deep reinforcement learning and deep Q-networks (DQN). This was first explored in the 2013 paper “Playing Atari with Deep Reinforcement Learning” by Mnih et al. and the 2015 paper “Human-Level Control Through Deep Reinforcement Learning” by Mnih et al.
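The Atari papers train a deep network to approximate the action-value function, but the update rule underneath is ordinary Q-learning. Here is a minimal tabular sketch on a hypothetical five-state track; the environment, constants, and reward are illustrative assumptions, not the papers’ setup:

```python
import random

# Tabular Q-learning on a toy 1-D track: states 0..4, actions left/right,
# extrinsic reward only for reaching state 4. A DQN replaces the table
# Q[s][a] with a deep network, but the update rule is the same.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
N_STATES, ACTIONS = 5, (-1, +1)

def q_learn(episodes=300, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        while s != N_STATES - 1:
            # epsilon-greedy action selection (ties broken toward "right")
            if rng.random() < EPSILON:
                a = rng.randrange(2)
            else:
                a = 1 if Q[s][1] >= Q[s][0] else 0
            s_next = min(max(s + ACTIONS[a], 0), N_STATES - 1)
            r = 1.0 if s_next == N_STATES - 1 else 0.0
            # Q-learning update: bootstrap from the best next-state value
            Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learn()
# After training, the greedy policy moves right in every non-terminal state.
```

After training, moving toward the reward has a higher value than moving away in every state, so the greedy policy walks straight to the goal.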

In two papers, “Intrinsically Motivated Learning of Hierarchical Collections of Skills” by Barto, Singh, and Chentanez and “Intrinsically Motivated Reinforcement Learning” by Singh, Barto, and Chentanez, the authors used intrinsic motivation to teach an AI how to combine different actions into more complex actions in order to seek novelty. They used a program that works as follows:


In the playroom are a number of objects: a light switch, a ball, a bell, two movable blocks that are also buttons for turning music on and off, as well as a toy monkey that can make sounds. The agent has an eye, a hand, and a visual marker (seen as a cross hair in the figure). The agent’s sensors tell it what objects (if any) are under the eye, hand and marker. At any time step, the agent has the following actions available to it: 1) move eye to hand, 2) move eye to marker, 3) move eye one step north, south, east or west, 4) move eye to random object, 5) move hand to eye, and 6) move marker to eye. In addition, if both the eye and hand are on some object, then natural operations suggested by the object become available, e.g., if both the hand and the eye are on the light switch, then the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, then the action of kicking the ball becomes available (which when pushed, moves in a straight line to the marker).

The objects in the playroom all have potentially interesting characteristics. The bell rings once and moves to a random adjacent square if the ball is kicked into it. The light switch controls the lighting in the room. The colors of any of the blocks in the room are only visible if the light is on, otherwise they appear similarly gray. The blue block if pressed turns music on, while the red block if pressed turns music off. Either block can be pushed and as a result moves to a random adjacent square. The toy monkey makes frightened sounds if simultaneously the room is dark and the music is on and the bell is rung. These objects were designed to have varying degrees of difficulty to engage. For example, to get the monkey to cry out requires the agent to do the following sequence of actions: 1) get its eye to the light switch, 2) move hand to eye, 3) push the light switch to turn the light on, 4) find the blue block with its eye, 5) move the hand to the eye, 6) press the blue block to turn music on, 7) find the light switch with its eye, 8) move hand to eye, 9) press light switch to turn light off, 10) find the bell with its eye, 11) move the marker to the eye, 12) find the ball with its eye, 13) move its hand to the ball, and 14) kick the ball to make the bell ring. Notice that if the agent has already learned how to turn the light on and off, how to turn music on, and how to make the bell ring, then those learned skills would be of obvious use in simplifying this process of engaging the toy monkey.

…the intrinsic reward is used to update QB. As a result, when the agent encounters an unpredicted salient event a few times, its updated action value function drives it to repeatedly attempt to achieve that salient event. There are two interesting side effects of this: 1) as the agent tries to repeatedly achieve the salient event, learning improves both its policy for doing so and its option-model that predicts the salient event, and 2) as its option policy and option model improve, the intrinsic reward diminishes and the agent gets “bored” with the associated salient event and moves on. Of course, the option policy and model become accurate in states the agent encounters frequently. Occasionally, the agent encounters the salient event in a state (set of sensor readings) that it has not encountered before, and it generates intrinsic reward again (it is “surprised”).

From “Intrinsically Motivated Reinforcement Learning” by Singh, Barto, and Chentanez

What happens is that the AI is able to string together the different actions to bring about new salient (novel/surprising) outcomes, and it then learns to string the actions together in ever more complex ways. As seen in panel B of the figure above, the AI learns to bring about the various outcomes more easily the longer it has been playing in the playroom (the x-axis is how long it has been in the playroom environment; the y-axis is how quickly it can perform the various actions). In panel C, the authors show that having this intrinsic reward greatly reduces the number of steps needed to reach the extrinsic reward (which, in both cases, the AI obtains when the monkey cries out, requiring the complex sequence of tasks: light on, music on, light off, sound on, in that order).


The above figure, taken from the same paper, shows how many actions it took the AI to learn increasingly complex tasks (complexity increasing as it goes down: Lon = light on; Loff = light off; Son = sound on; Mon = music on; Moff = music off; Non = monkey noise on; Noff = monkey noise off).

Similar (yet different in important ways) experiments have used curiosity as well. See “Large-Scale Study of Curiosity-Driven Learning” by Burda et al. and “Exploration by Random Network Distillation” by Burda et al. See also “Unifying Count-Based Exploration and Intrinsic Motivation” by Bellemare et al., “Count-Based Exploration with Neural Density Models” by Ostrovski et al., and the video below:
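The count-based papers cited above share a simple core idea: add an intrinsic bonus that shrinks as a state is visited more often. A minimal tabular sketch follows; the bonus form β/√N(s) is one common choice, and the papers generalize the raw counts to pseudo-counts derived from learned density models so the idea scales to large state spaces:

```python
from collections import defaultdict
from math import sqrt

# Count-based exploration bonus: novelty decays with visits. The constant
# BETA and the state names are illustrative assumptions.
BETA = 1.0
visit_counts = defaultdict(int)

def reward_with_bonus(state, extrinsic_reward):
    visit_counts[state] += 1
    bonus = BETA / sqrt(visit_counts[state])  # large when novel, -> 0 when familiar
    return extrinsic_reward + bonus

# The first visit to a state earns a big bonus; repeat visits earn less,
# so the agent is pushed toward states it has not seen much of.
first = reward_with_bonus("room_A", 0.0)
tenth = [reward_with_bonus("room_A", 0.0) for _ in range(9)][-1]
```

The agent then maximizes extrinsic reward plus this bonus, which is what makes rarely visited states “attractive” without any hand-designed curriculum.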


What researchers have discovered, however, is that these intrinsic motivation models can lead to both boredom and addiction in the agent. Boredom, as Christian explains, occurs when the agent encounters a situation (e.g., in playing Mario Brothers) that requires some very precise sequence of actions to succeed. If the agent can’t learn the sequence of actions, it gets “bored” and just stops playing. In the Mario Brothers example Christian uses, there is a jump the agent was never able to master, and since the game does not allow the player to go backwards in the level, the agent simply stopped playing.

Addiction arises when novelty seeking is hijacked by an inexhaustible source of randomness, such as TV static or constant channel flipping. Because TV static is random, every frame counts as novel, and so the agent can become enthralled just watching a TV screen tuned to static. Christian describes an agent that attempted to navigate a maze, but at one place in the maze there was a TV screen. Whenever the screen came within the agent’s vision, it immediately centered on the screen and started flipping channels, becoming permanently fixated. For more on this idea, see both “Universal Knowledge-Seeking Agents” by Laurent Orseau and “Universal Knowledge-Seeking Agents for Stochastic Environments” by Orseau et al.
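The contrast between boredom and addiction shows up in a toy sketch (a hypothetical setup for illustration, not the maze experiment itself): if intrinsic reward is the prediction error of a learned model, that reward dies out on a predictable signal but never dies out on random static.

```python
import random

# Intrinsic reward = prediction error of a simple running-average predictor.
# On a deterministic signal the error (and so the reward) decays to zero:
# the agent gets "bored". On random noise ("TV static") it never decays:
# the agent stays hooked.
def mean_abs_error(signal, steps=2000):
    prediction, lr, errors = 0.0, 0.1, []
    for _ in range(steps):
        x = signal()
        errors.append(abs(x - prediction))   # intrinsic reward at this step
        prediction += lr * (x - prediction)  # the predictor slowly improves
    return sum(errors[-100:]) / 100          # average error late in training

rng = random.Random(0)
boring = mean_abs_error(lambda: 1.0)          # constant, fully predictable
static = mean_abs_error(lambda: rng.random()) # uniform noise, unpredictable
# boring decays to ~0 (novelty exhausted); static stays large forever.
```

This is the “noisy TV problem” in miniature: prediction error is a fine curiosity signal only so long as the environment’s surprises are learnable.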

Part 2 Concluding Remarks

I said in the introduction to this post that intelligence and goals/motivations are separable from one another. The paperclip making superintelligence is a thought experiment meant to illustrate this: something can be supremely competent in all the domains of intelligence important to humans, yet be motivated by something quite alien to the human experience. But the paperclip thought experiment also assumes something that has not yet been achieved, namely general intelligence.

One thing that humans possess that has yet to be recapitulated in any artificial intelligence is the ability to be competent in multiple domains of intelligence, and to integrate those domains into a sort of general intelligence. Calculators are great at performing arithmetical operations, but your TI-84 will never appreciate music or drive you to the airport (or do both at the same time). In the field of AI, this is known as artificial general intelligence, or AGI. Although (as of writing this) there are claims that GPT-4 shows signs of AGI, I think most people would still find it to be much narrower than the space of all domains important to humans (including being able to incorporate new information and update beliefs in accordance with that information – ChatGPT is limited to what is present in its training data, which was gathered from the internet before mid-2021).

If you’re interested in more about how ChatGPT works, I highly recommend the following video:

Indeed, programs like ChatGPT are often dismissed as so-called stochastic parrots, a term the linked paper describes this way:

Contrary to how it may seem when we observe its output, an LM [language model] is a system for haphazardly stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning: a stochastic parrot.
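The “haphazard stitching” the paper describes can be caricatured with a bigram model: record which word follows which in the training text, then sample accordingly. This is a deliberately tiny sketch; real language models use neural networks over vast corpora, but the criticism targets this same basic operation of sampling by observed probability.

```python
import random
from collections import defaultdict

# A bigram "stochastic parrot": it stitches together word sequences purely
# from observed co-occurrence statistics, with no reference to meaning.
def train_bigrams(text):
    follows = defaultdict(list)
    words = text.split()
    for w1, w2 in zip(words, words[1:]):
        follows[w1].append(w2)        # record every observed successor
    return follows

def parrot(follows, start, length=8, seed=0):
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        successors = follows.get(out[-1])
        if not successors:
            break                     # dead end: no observed successor
        out.append(rng.choice(successors))  # sample by observed frequency
    return " ".join(out)

# A hypothetical toy corpus, chosen only for illustration.
corpus = "the cat sat on the mat and the dog sat on the rug"
model = train_bigrams(corpus)
sentence = parrot(model, "the")
# The output is locally plausible English stitched from the corpus, with no
# understanding of cats, mats, or sitting.
```

Every adjacent word pair in the output was seen in the training text, which is precisely what makes the result read as fluent while meaning nothing to the system producing it.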

This criticism, however, takes a human-centric view of intelligence, one that assumes intelligence requires semantic knowledge and not just syntactic knowledge. In other words, it conflates consciousness with intelligence by assuming that an intelligent agent must have some first-person internal experience of what it is talking about or what its goals and motivations are. But these two things – consciousness and intelligence – are just as separable from one another as intelligence and goals. A true nightmare scenario, in fact, is that we invent a supremely intelligent AGI that is completely non-conscious, yet it (intentionally or unintentionally) wipes out humankind (or even all living organisms) from the earth, leaving nothing but a machine empty of any sentient experience.

(Of course, the nightmare scenario on the other end of the spectrum is inventing conscious, sentient AI and then proceeding, intentionally or not, to subject them to unfathomable suffering – the scenario of so-called mindcrime).

When it comes to intelligence, all that really matters are the inputs and outputs: do the information-processing mechanisms compute the input in relevant ways and give an output that correctly maps onto the reality of the given domain? The actual processes that turn the input into an output, while important in the practical and logistical sense, are not what matters when assessing whether something is intelligent in any strict sense. And certainly whether the processor is in some way aware of what it is processing does not matter when talking purely about intelligence.

In fact, if AI science has taught us anything, it is that no level of aptitude or competence in a task requiring intelligence is sufficient to give rise to consciousness. Nobody would say that your TI-84 is bad at math, or that its ability to do math is not impactful, just because it is not aware that it is doing math. And it is conceivable to engineer an AI that is as competent in every domain of human intelligence as a calculator is at arithmetic, while still not being conscious. It is also conceivable that such an AI could (and, in all likelihood, would) be massively impactful to human civilization, for better or worse.

The point is that dismissing AI, and in particular AGI, as a stochastic parrot somewhat misses the mark, especially since the parroting stems from the fact that these systems learn from training data gathered from human activity. Our AI is therefore reflecting (and amplifying) human activity, with all our biases, irrationality, and failure to live up to our own ideals. This idea leads into the third part of the book, which is about training AI through imitation – having the program learn by watching humans.