#312 - The Trouble with AI - Transcripts

March 07, 2023

A Conversation with Stuart Russell and Gary Marcus


Welcome to the Making Sense podcast. This is Sam Harris. Okay, before I jump in today, I want to take a moment to address some confusion that keeps coming up. I was on another podcast yesterday and spoke about this briefly, but I thought I might be a little more systematic here. It relates to the paradoxical way that we value expertise in really all fields and scientific authority in particular. Seems to me there's just a lot of confusion about how this goes. Expertise and authority are unstable, intrinsically so, because the truth of any claim doesn't depend on the credentials of the person making that claim. So a Nobel Laureate can be wrong and a total ignoramus can be right, even if only by accident. So the truth really is orthogonal to the reputational differences among people. And yet, generally speaking, we are right to be guided by experts and we're right to be very skeptical of novices who claim to have overturned expert opinion. Of course, we're also right to be alert to the possibility of fraud among so-called experts. There are touted experts who are not who they seem to be.

And we're right to notice that bad incentives can corrupt the thinking of even the best experts. So these can seem like contradictions, but they're simply different moments in time. The career of reason has to pass through all these points again and again and again. We respect authority and we also disavow its relevance by turns. We're guided by it until the moment we cease to be guided by it or until the moment when one authority supplants another or even a whole paradigm gets overturned. But all of this gets very confusing when experts begin to fail us and when the institutions in which they function, like universities and scientific journals and public health organizations, get contaminated by political ideologies that don't track the truth. Now, I've done many podcasts where I've talked about this problem from various angles and I'm sure I'll do many more because it's not going away, but much of our society has a very childish view of how to respond to this problem. Many, many people apparently believe that just having more unfettered dialogue on social media and on podcasts and in newsletters is the answer, but it's not. I'm not taking a position against free speech here. I'm all for free speech. I'm taking a position against weaponized misinformation and a contrarian attitude that nullifies the distinction between real knowledge, which can be quite hard won and ignorance or mere speculation. And I'm advocating a personal ethic of not pretending to know things one doesn't know.

My team recently posted a few memes on Instagram. These were things I had said, I think, on other people's podcasts. And these posts got a fair amount of crazed pushback. Apparently, many people thought I was posting these memes myself as though I had just left Twitter only to become addicted to another social media platform. But in any case, my team posted these quotes and my corner of Instagram promptly became as much of a cesspool as Twitter. And then people even took these Instagram memes and posted them back on Twitter so they could vilify me in that context. Needless to say, all of this convinces me again that my life is much better off of social media. But there is some real confusion at the bottom of the response, which I wanted to clarify. So one of the offending Instagram quotes read, During the pandemic, we witnessed the birth of a new religion of contrarianism and conspiracy thinking, the first sacrament of which is, quote, do your own research. The problem is that very few people are qualified to do this research. And the result is a society driven by strongly held unfounded opinions on everything from vaccine safety to the war in Ukraine. And many people took offense to that, as though it was a statement of mere elitism.

But anyone who has followed this podcast knows that I include myself in that specific criticism. I'm also unqualified to do the quote research that so many millions of people imagine they're doing. I wasn't saying that I know everything about vaccine safety or the war in Ukraine. I'm saying that we need experts in those areas to tell us what is real or likely to be real and what's misinformation. And this is why I've declined to have certain debates on this podcast that many people have been urging me to have. And even alleging that it's a sign of hypocrisy or cowardice on my part that I won't have these debates. There are public health emergencies and geopolitical emergencies that simply require trust in institutions. They require that we acknowledge the difference between informed expertise and mere speculation or amateurish sleuthing. And when our institutions and experts fail us, that's not a moment to tear everything down. That's the moment where we need to do the necessary work of making them trustworthy again. And I admit in many cases, it's not clear how to do that, at least not quickly. I think detecting and nullifying bad incentives is a major part of the solution.

But what isn't a part of the solution at all is asking someone like Candace Owens or Tucker Carlson or even Elon Musk or Joe Rogan or Bret Weinstein or me, what we think about the safety of mRNA vaccines or what we think about the nuclear risk posed by the war in Ukraine. Our information ecosystem is so polluted and our trust in institutions so degraded, again, in many cases for good reason, that we have people who are obviously unqualified to have strong opinions about ongoing emergencies, dictating what millions of people believe about those emergencies and therefore dictating whether we as a society can cooperate to solve them. Most people shouldn't be doing their own research. And I'm not saying we should blindly trust the first experts we meet. If you're facing a difficult medical decision, get a second opinion, get a third opinion. But most people shouldn't be jumping on PubMed and reading abstracts from medical journals. Again, depending on the topic, this applies to me too. So the truth is, if I get cancer, I might do a little research, but I'm not going to pretend to be an oncologist. The rational thing for me to do, even with my background in science, is to find the best oncologists I can find and ask them what they think. Of course, it's true that any specific expert can be wrong or biased. And that's why you get second and third opinions. And it's also why we should be generally guided by scientific consensus, wherever a consensus exists.

And this remains the best practice even when we know that there's an infinite number of things we don't know. So while I recognize the last few years have created a lot of uncertainty and anxiety and given a lot of motivation to contrarianism, and the world of podcasts and newsletters and Twitter threads has exploded as an alternative to institutional sources of information, the truth is we can't do without a culture of real expertise. And we absolutely need the institutions that produce it and communicate it. And I say that as someone who lives and works and thrives entirely outside of these institutions. So I'm not defending my own nest. I'm simply noticing that Substack and Spotify and YouTube and Twitter are not substitutes for universities and scientific journals and governmental organizations that we can trust. And we have to stop acting like they might be. Now that I got that off my chest, now for today's podcast. Today I'm speaking with Stuart Russell and Gary Marcus. Stuart is a professor of computer science and a chair of engineering at the University of California, Berkeley. He is a fellow of the American Association for Artificial Intelligence, the Association for Computing Machinery, and the American Association for the Advancement of Science. And he is the author with Peter Norvig of the definitive textbook on AI, Artificial Intelligence: A Modern Approach.

And he is also the author of the very accessible book on this topic, Human Compatible: Artificial Intelligence and the Problem of Control. Gary Marcus is also a leading voice on the topic of artificial intelligence. He is a scientist, best-selling author, and entrepreneur. He was founder and CEO of Geometric Intelligence, a machine learning company that was acquired by Uber in 2016. And he's also the author of the recent book Rebooting AI, along with his co-author, Ernest Davis. And he also has a forthcoming podcast titled Humans versus Machines. And today we talk about recent developments in AI, ChatGPT in particular, as well as the long-term risks of producing artificial general intelligence. We discuss the limitations of deep learning, the surprising power of narrow AI, the ongoing indiscretions of ChatGPT, a possible misinformation apocalypse, the problem of instantiating human values in AI, the business model of the internet, the metaverse, digital provenance, using AI to control AI, the control problem, emergent goals, locking down core values, programming uncertainty about human values into AGI, the prospects of slowing or stopping AI progress, and other topics. Anyway, I found it a very interesting and useful conversation on a topic whose importance is growing by the hour. And now I bring you Stuart Russell and Gary Marcus. I am here with Stuart Russell and Gary Marcus. Stuart, Gary, thanks for joining me.

Thanks for having us. So I will have properly introduced both of you in the intro, but perhaps you can just briefly introduce yourselves as well. Gary, let's start with you. You're new to the podcast. How do you describe what it is you do and the kinds of problems you focused on?

I'm Gary Marcus, and I've been trying to figure out how we can get to a safe AI future. I may be not looking as far out as Stuart is, but I'm very interested in the immediate future, whether we can trust the AI that we have, how we might make it better so that we can trust it. I've been an entrepreneur, I've been an academic, I've been coding since I was eight years old. So throughout my life I've been interested in AI, and also human cognition and what human cognition might tell us about AI and how we might make AI better.

Yeah, I'll add, you did your PhD under our mutual friend, Steven Pinker, and you have a wonderful book, Rebooting AI: Building Artificial Intelligence We Can Trust, and I'm told you have a coming podcast later this spring titled Humans versus Machines, which I'm eagerly awaiting, so I'm pretty excited about that. It's going to be fun. Nice. And you have a voice for radio, so you're in. Yeah, I know that joke. I'll take it in good spirit. And no, that's not a joke. A face for radio is the joke. A voice for radio is high praise. That's right. Thank you.

Stuart, who are you? What are you doing out there?

So I teach at Berkeley. I've been doing AI for about 47 years, and I spent most of my career just trying to make AI systems better and better, working in pretty much every branch of the field. And in the last 10 years or so, I've been asking myself what happens if I or if we as a field succeed in what we've been trying to do, which is to create AI systems that are at least as general in their intelligence as human beings. And I came to the conclusion that if we did succeed, it might not be the best thing in the history of the human race. In fact, it might be the worst. And so I'm trying to fix that if I can.

I will also add, you have also written a wonderful book, Human Compatible: Artificial Intelligence and the Problem of Control, which is quite accessible. And then you have written an inaccessible book or co-written one, literally the textbook on AI. And you've been on the podcast a few times before. So you each occupy different points on a continuum of concern about general AI and the perhaps distant problem of superintelligence. And Stuart, I've always seen you on the sober side of the worried end. And I've spoken to many other worried people on the podcast and at various events, people like Nick Bostrom, Max Tegmark, Eliezer Yudkowsky, Toby Ord. I've spoken to many other people in private. I've always counted myself among the worried and have been quite influenced by you and your book. Gary, I've always seen you on the sober side of the not worried end. And I've also spoken to people who are not worried, like Steve Pinker, David Deutsch, Rodney Brooks, and others. I'm not sure if either of you have moved in the intervening years at all. Maybe we can just start there.

We'll start with narrow AI and ChatGPT and the explosion of interest on that topic. But I do want us to get to concerns about where all this might be headed. But before we jump into the narrow end of the problem, have you moved at all in your sense of the risks here?

There are a lot of things to worry about. I think I actually have moved just within the last month a little bit. So we'll probably disagree about the estimates of the long-term risk. But something that's really struck me in the last month is there's a reminder of how much we're at the mercy of the big tech companies. So my personal opinion is that we're not very close to artificial general intelligence. Not sure Stuart would really disagree, but he can jump in later on that. And I continue to think we're not very close to artificial general intelligence. But with whatever it is that we have now, this kind of approximated intelligence that we have now, this mimicry that we have now, the lessons of the last month or two, we don't really know how to control even that. It's not full AGI that can self-improve itself. It's not sentient AI or anything like that. But we saw that Microsoft had clues internally that the system was problematic, that it gaslighted its customers and things like that. And then they rolled it out anyway.

And then initially the press hyped it, made it sound amazing. And then it came out that it wasn't really so amazing. But it also came out that if Microsoft wants to test something on a hundred million people, they can go ahead and do that even without a clear understanding of the consequences. So my opinion is we don't really have artificial general intelligence now, but this was kind of a dress rehearsal and it was a really shaky dress rehearsal. And that in itself made me a little bit worried. And suppose we really did have AGI and we had no real regulation in place about how to test it. My view is we should treat it as something like drug trials. You want to know about costs and benefits and have a slow release, but we don't have anything like regulation around that. And so that actually pushed me a little bit closer to maybe the worry side of the spectrum. I'm not as worried maybe as Stuart is about the long-term complete annihilation of the human race that I think Stuart has raised some legitimate concerns about. I'm less worried about that because I don't see AGI as having the motivation to do that. But I am worried about whether we have any control over the things that we're doing, whether the economic incentives are going to push us in the right place.

So I think there's lots of things to be worried about. Maybe we'll have a nice discussion about which those should be and how you prioritize them. But there are definitely things to worry about.

Yeah. Well, I want to return to that question of motivation, which has always struck me as a red herring. So we'll talk about that when we get to AGI. But Stuart, have you been pushed around at all by recent events or anything else?

So actually, there are two recent events. One of them is ChatGPT, but the other is much less widely disseminated, although there was an article about it in the Financial Times last week. It concerns the superhuman Go programs. I think pretty much everyone had abdicated any notion of human superiority in Go completely, and that was 2017. And in the five years since then, the machines have gone off into the stratosphere. Their ratings are 1,400 points higher than the human world champion. And 1,400 points in Go or in chess is like the difference between a professional and a five-year-old who's played for a few months. So what's amazing is that we found out that actually, a good average human player can beat these superhuman Go programs, beat all of them, giving them a nine-stone handicap, which is the kind of handicap that you give to a small child who's learning the game.
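To give a rough sense of what a 1,400-point gap means, here is a minimal sketch that assumes the ratings follow the standard Elo logistic model used in chess. That model is an assumption on our part: the conversation does not say which rating system the Go figures come from, and the function name `expected_score` is ours.

```python
def expected_score(rating_gap: float) -> float:
    """Expected score of the stronger player under the standard Elo model.

    Assumption: ratings follow the chess-style Elo formula, where a gap of
    400 points corresponds to 10-to-1 odds. Go rating systems may differ.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


if __name__ == "__main__":
    # Print the expected score for an even match and for growing gaps.
    for gap in (0, 400, 1400):
        print(f"gap {gap:>4}: expected score {expected_score(gap):.5f}")
```

Under that assumption, a 1,400-point favorite is expected to score well above 99.9 percent, which is why a win by an average human player is so striking.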

Isn't the caveat there, though, that we needed a computer to show us that exploit?

Well, actually, the story is a little bit more complicated. We had an intuition about the Go programs. They are circuits, and a circuit is a very bad representation for a recursively defined function. So what does that mean? So in Go, the main thing that matters is groups of stones. So a group of stones are stones that are connected to each other by vertical and horizontal connections on the grid. And so that, by definition, is a recursive concept, because I'm connected to another stone if there's an adjacent stone to me, and that stone is connected to the other stone. And I can say it in English, you know, I just did, in one small sentence. I can write it in a program in a couple of lines of Python. I can write it in formal logic in a couple of lines. But to try to write it as a circuit is, in some real sense, impossible. I can only do a finite approximation. And so we had this idea that actually the programs didn't really understand what a group of stones is, and they didn't understand in particular whether a group of stones is going to live or going to die. And we concocted by hand some positions in which we thought that just deciding whether the program needed to rescue its group or whether it could capture the opponent's group, that it would make a mistake because it didn't understand groups.
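The recursive definition of a group described above can be sketched in a few lines of Python. This is only an illustration, not code from the research discussed: the board representation (a dictionary from grid coordinates to a color string) and the name `group_at` are assumptions made for the example.

```python
from typing import Dict, Set, Tuple

Point = Tuple[int, int]  # (row, column) on the grid


def group_at(board: Dict[Point, str], start: Point) -> Set[Point]:
    """Return every stone connected to the stone at `start`.

    Two stones are connected if they are the same color and adjacent
    vertically or horizontally, or if one is adjacent to a stone that is
    itself connected to the other. Diagonal neighbors do not count.
    """
    color = board[start]
    group: Set[Point] = set()

    def visit(p: Point) -> None:
        # Stop at empty points, enemy stones, and stones already collected.
        if p in group or board.get(p) != color:
            return
        group.add(p)
        row, col = p
        for neighbor in ((row + 1, col), (row - 1, col),
                         (row, col + 1), (row, col - 1)):
            visit(neighbor)  # the recursive step

    visit(start)
    return group


if __name__ == "__main__":
    # Hypothetical position: three connected black stones, one black stone
    # touching them only diagonally, and one white stone.
    board = {(0, 0): "B", (0, 1): "B", (1, 1): "B", (2, 2): "B", (0, 2): "W"}
    print(sorted(group_at(board, (0, 0))))
```

The recursion handles a group of any size or shape with the same few lines, whereas a fixed-depth circuit can only check connectivity out to some bounded number of steps, which is the finite approximation mentioned above.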

And that turned out to be right.

Can I just jump in for one second? Sure. It actually relates to the thing that's on the cover of Perceptrons, which is one of the most famous books in the history of artificial general intelligence. There was an argument by Minsky and Papert that two-layer Perceptrons, which are the historical ancestors of the deep learning systems we have now, couldn't understand some very basic concepts. And in a way, what Stuart and his lab did is a riff on that old idea. People hate that book in the machine learning field. They say that it prematurely dismissed multilayer networks. And there's an argument there, but it's more complicated than people usually tell. But in any case, I see this result as a descendant of that, showing that even if you get all these pattern recognition systems to work, that they don't necessarily have a deep conceptual understanding of something as simple as a group in Go. I think it's a profound connection to the history of AI and kind of disturbing that here we are 50-some years later and we're still struggling with the same problems.

Yeah. I think it's the same point that Minsky was making, which is expressive power matters. And simple Perceptrons have incredibly limited expressive power, but even larger deep networks and so on, in their native mode, they have very limited expressive power. You could actually take a recurrent neural net and use that to implement a Turing machine and then use that to implement a Python interpreter, and then the system could learn all of its knowledge in Python. But there's no evidence that anything like that is going on in the Go program. So the evidence seems to suggest that actually they're not very good at recognizing what a group is and liveness and death, except in the cases. So they've learned sort of multiple fragmentary partial finite approximations to the notion of a group and the notion of liveness. And we just found that we could fool it where we're constructing groups that are somewhat more complex than the kinds that typically show up. And then, as Gary said, Sam, as you said, there is a program that we used to explore whether we could actually find this occurring in a real game, because these were contrived positions that we had concocted by hand, and we couldn't force the game to go in that direction. And indeed, when we started running this program, which is a sort of adversarial program that's just supposed to find ways of beating one particular Go program called KataGo, it found ways of generating groups kind of like a circular sandwich.

So you start with a little group of your pieces in the middle, and then the program, the computer program surrounds your pieces to prevent them from spreading. And then you surround that surrounding, so you make a kind of circular sandwich. And it simply doesn't realize that its pieces are going to die, because it doesn't understand what is the structure of the groups. And it has many opportunities to rescue them, and it pays no attention, and then you capture 60 pieces, and it's lost the game. This was something that we saw our adversarial program doing, but then a human could look at that and say, oh, okay, I can make that happen in a game. One of our team members is a good Go player, and he played this against KataGo, which is the best Go program in the world, and beat it easily and beat it with a nine-stone handicap. But also, it turns out that all the other Go programs, which were trained by completely different teams using different methods and different network structures and all the rest, they all have the same problem. They all fail to recognize this circular sandwich and lose all their pieces. It seems to be not just an accident. It's not a peculiar hack that we found for one particular program. It seems to be a qualitative failure of these networks to generalize properly. In that sense, it's somewhat similar to adversarial images, where we found that these systems that are supposedly superhuman at recognizing objects are extremely vulnerable to tiny tweaks in images.

Those tweaks are totally invisible to a human, but the system changes its mind and says, oh, that's not a school bus, it's an ostrich. It's again a weakness in the way the circuits have learned to represent the concepts. They haven't really learned the visual concept of a school bus or an ostrich, because they're obviously for a human not confusable. This notion of expressive power is absolutely central to computer science. We use it all over the place when we talk about compilers and we talk about the design of hardware. If you use an inexpressive representation and you try to represent a given concept, you end up with an enormous and ridiculously overcomplicated representation of that concept. That representation, let's say it's the rules of Go in an expressive language like Python, that's a page. In an inexpressive language like circuits, it might be a million pages. To learn that million-page representation of the rules of Go requires billions of experiences. The idea that, oh, well, we'll just get more data and we'll just build a bigger circuit and then we'll be able to learn the rules properly, that just does not scale. The universe doesn't have enough data in it. There's not enough material in the universe to build a computer big enough to achieve general intelligence using these inexpressive representations.

So I'm with Gary, right? I don't think we're that close to AGI and I've never said AGI was imminent. Generally I don't answer the question, when do I think it's coming?

But I am on the record because someone violated the off-the-record rules of the meeting by

hand. Someone replied to a scotch?

No, they literally just broke. I was at a Chatham House Rules meeting and I literally prefaced my sentence with off-the-record and 20 minutes later it appears on the Daily Telegraph website. So anyway, the Daily Telegraph, you can look it up. What I actually said was, I think it's quite likely to happen in the lifetime of my children, which you could think of as another way of saying sometime in this century.

Before we get into that, can I jump in to sort of wrap up Stuart's point? Because I agree with him. It was a profound result from his lab. There's some people arguing about particular Go programs and so forth, but I wrote an article about Stuart's result called David Beats Goliath. It was on my Substack and I'll just read a paragraph and maybe we can get back to why it's a worry. So Kellin Pelrine I guess is the name of the player who actually beat the Go program and I said his victory is a profound reminder that no matter how good deep learning, data-driven AI looks when it is trained on an immense amount of data, we can never be sure that systems of this sort really can extend what they know to novel circumstances. We see the same problem, of course, with the many challenges that have stymied the driverless car industry and the batshit crazy errors we've been seeing with the chatbots in the last week. So that piece also increased my worry level. It's a reminder that these things are almost like aliens. We think we understand, like, oh, that thing knows how to play Go, but there are these little weaknesses there, some of which turn into adversarial attacks and some of which turn into bad driving and some of which turn into mistakes on chatbots. I think we should actually separate out genuine artificial general intelligence, which maybe comes in our lifetimes and maybe doesn't, from what we have now, which is this data-driven thing that, as Stuart would put it, is like a big circuit.

We don't really understand what those circuits do and they can have these weaknesses. You talk about alignment or something like that, if you don't really understand what the system does and what weird circumstances it might break down in, you can't really be that confident around alignment for that system.

This is an area that my research center is now putting a lot of effort in. If we're going to control these systems at all, we've got to understand how they work. We've got to build them according to much more, I guess, traditional engineering principles where the system is made up of pieces and we know how the pieces work, we know how they fit together, and we can prove that the whole thing does what it's supposed to do. There's plenty of technological elements available from the history of AI that I think can move us forward in ways where we understand what the system is doing. But I think the same thing is happening in GPT in terms of failure to generalize. It's got millions of examples of arithmetic, 28 plus 42 is, what, 70. And yet, despite having millions of examples, it's completely failed to generalize. If you give it a three or four digit addition problem that it hasn't seen before, and particularly ones that involve carrying, it fails.

I think it can actually, just to be accurate, I think it can do three- and four-digit addition to some extent. It completely fails on multiplication at three or four digits if we're talking about Minerva, which is, I think, the state of the art.

To some extent, yeah, but I think it works when you don't need to carry because I think it has figured out that eight plus one is nine because it's got a few million examples of that. But when it involves carrying or you get to more digits outside the training set, it hasn't extrapolated correctly, it hasn't learned. The same with chess. It's got lots and lots of Grandmaster chess games in its database, but it thinks of the game as a sequence of notation like in A4, D6, Knight takes C3, B3, B5. That's what a chess game looks like when you write it out as notation. It has no idea that that's referring to a chess board with pieces on it. It has no idea that they're trying to checkmate each other. You start playing chess with it, it'll just make an illegal move because it doesn't even understand what is going on at all. The weird thing is that almost certainly the same thing is going on with all the other language generation that it's doing. It has not figured out that the language is about a world and the world has things in it and there are things that are true about the world, there are things that are false about the world and if I give my wallet to Gary, then Gary has my wallet and if he gives it back to me, then I have it and he doesn't have it. It hasn't figured out any of that stuff.
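For contrast with the carrying failures described above, here is a minimal sketch of the schoolbook addition procedure itself. It is an illustration of the general rule, not a description of how any language model works; the function name `add_with_carry` and the choice to pass numbers as digit strings are assumptions made for the example.

```python
def add_with_carry(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal digit strings.

    Works right to left, one column at a time, passing the carry along,
    so the same rule covers any number of digits.
    """
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter number with zeros

    carry = 0
    digits = []
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))  # digit written in this column
        carry = total // 10             # carry passed to the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


if __name__ == "__main__":
    print(add_with_carry("28", "42"))      # the example from the conversation
    print(add_with_carry("4879", "3654"))  # every column carries
```

Because the rule is stated once and applied column by column, it extends to numbers of any length, which is the kind of generalization the speakers say does not emerge from memorizing millions of examples.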

I completely agree. I think that people tend to anthropomorphize and I'd actually needle Stuart a little bit and say he used words like think and figured out. These systems never think and figure out, they're just finding close approximations to the text that they've seen. And it's very hard for someone who's not tutored in AI to really get that, to look at it, see this very well-formed output and realize that it's actually more like an illusion than something that really understands things. So Stuart is absolutely right. It can talk about me having a wallet or whatever, but it doesn't know that there's a me out there, that there's a wallet out there. It's hard for people to grasp that, but that's the reality. And so when it gets the math problem right, people are like, it's got some math, and then it gets one wrong. They're like, oh, I guess it made a mistake. But really, it never got the math. It just finds some bit of text that's close enough, which some of the time happens to have the right answer and sometimes not.

Well, I want to return to that point. But I think I need to back up for a second and define a couple of terms just so that we don't lose people. I realize I'm assuming a fair amount of familiarity with this topic from people who've heard previous podcasts on it, but that might not be fair. Quickly, we have introduced a few terms here. We've talked about narrow AI, general AI, or AGI, or artificial general intelligence, and superintelligence. Those are interrelated concepts. Stuart, do you just want to break those apart and suggest what we mean by them?

Sure. Narrow AI is the easiest to understand because that typically refers to AI systems that are developed for one specific task. For example, playing Go, or translating French into English, or whatever it might be. AGI, or artificial general intelligence, or sometimes called human level artificial intelligence, or general purpose artificial intelligence, would mean AI systems that can quickly learn to be competent in pretty much any kind of task to which the human intellect is relevant, and probably a lot more besides. Then artificial superintelligence, or ASI, would mean systems that are far superior to humans in all these aspects. I think there's just something worth mentioning briefly about narrow AI. A lot of commentators talk as if working on narrow AI doesn't present any kind of risk or problem, because all you get out of narrow AI is a system for that particular task. You could make 100 narrow AI systems, and they would all be little apps on your laptop. None of them would present any risk because all they do is that particular task. I think that's a complete misunderstanding of how progress happens in AI. Let me give you an example.

Deep learning, which is the basis for the last decade of exploding AI capabilities, emerged from a very, very narrow AI application, which is recognizing handwritten digits on checks at Bell Labs in the 1990s. You can't really find a more narrow application than that. But whenever a good AI researcher works on a narrow task, and it turns out that the task is not solvable by existing methods, they're likely to push on methods to come up with more general, more capable methods, and those methods will turn out to apply to lots of other tasks as well. So it was Yann LeCun who was working in the group that worked on these handwritten digits, and he didn't write a little program that follows the S around and says, okay, I found one bend. Okay, let me see if I can find another bend. Okay, good. I've got a left bend and a right bend, so it must be an S. That would be a very hacky, very non-general, very brittle way of doing handwriting recognition. What he did was he just developed a technique for training deep networks that had various kinds of invariances about images. For example, an S is an S no matter where it appears in the image. You can build that into the structure of the networks, and then that produces a very powerful image recognition capability that applies to lots of other things, turned out to apply to speech, and in a slightly different form is underlying what's going on in ChatGPT. So don't be fooled into thinking that as long as people are working on narrow AI, everything's going to be fine.

Yeah, if I could just jump in also on the point of general intelligence and what that might look like. ChatGPT is interesting because it's not as narrow in some ways as most traditional narrow AI, and yet it's not really general AI either. It doesn't perfectly fit into the categories, and let me explain what I mean by that. So a typical narrow AI is I will fold proteins, or I will play chess, or something like that. It really does only one thing well. And anybody who's played with ChatGPT realizes it does many things, maybe not super well. It's almost like a jack of all trades and a master of none. So you can talk to it about chess, and it will play okay chess for a little while, and then as Stuart points out, probably eventually break the rules because it doesn't really understand them. Or you can talk to it about word problems in math, and it will do some of them correctly and get some of them wrong. Almost anything you want to do, not just one thing like, say, chess, it can do to some extent, but it never really has a good representation of any of those, and so it's never really reliable at any of them. As far as I know, there's nothing that ChatGPT is fully reliable at, even though it has something that looks a little like generality. And obviously, when we talk about artificial general intelligence, we're expecting something that's trustworthy and reliable that could actually play chess, let's say, as well as humans or better than them or something like that.

They could actually do word problems as well as humans or better than that and so forth. And so it gives an illusion of generality, but it's so superficial because of the way it works in terms of approximating bits of text that it doesn't really deliver on the promise of being what we really think of as an artificial

general intelligence. Yes. Okay, so let's talk more about the problems with narrow AI here, and we should also add that most narrow AI, although ChatGPT is perhaps an exception here, insofar as we dignify it as AI and implement it, is already superhuman. Your calculator is superhuman for arithmetic, and there are many other forms of narrow AI that perform better than people do. And one thing that's been surprising of late, as Stuart just pointed out, is that superhuman AI of certain sorts, like our best Go-playing programs, have been revealed to be highly imperfect such that they're less than human in specific instances, and these instances are surprising and can't necessarily be foreseen in advance. And therefore, it raises this question: as we implement narrow AI, because it is superhuman, it seems that we might always be surprised by its failure modes because it lacks common sense, it lacks a more general view of what the problem is that it's solving in the first place. And so that obviously

poses some risk for us. If I could jump in for one second, I think the cut right there actually has to do with the mechanism. So a calculator really is superhuman, we're not going to find an Achilles heel where there's some regime of numbers that it can't do within what it can represent. And the same thing with Deep Blue, I'd be curious if Stuart disagrees, but I think Deep Blue is going to be able to beat any human in chess, and it's not clear that we're actually going to find an Achilles heel. But when we talk about deep learning driven systems, they're very heavy on the big data or using these particular techniques, they often have a pretty superficial representation. Stuart's analogy there was a Python program that's concise, we know that it's captured something correctly versus this very complicated circuit that's really built by data. And when we have these very complicated circuits built by data, sometimes they do have an Achilles heel. So some narrow AI, I think we can be confident of. So GPS systems that navigate turn by turn, there's some problems, like the map could be out of date, there could be a broken bridge, but basically we can trust the algorithm there. Whereas these Go things, we don't really know how they work, we kind of do. And it turns out, sometimes they do have these Achilles heels that are in there and those Achilles heels can mean different things in different contexts. So in one context, it means we can beat it at Go and it's a little bit surprising.

In another context, it means that we're using it to drive a car and there's a jet there, and it's not in the training set. And it doesn't really understand that you don't run into large objects and doesn't know what to do with a jet and it actually runs into the jet. So the weaknesses can manifest themselves in a lot of different ways. And some of what I think Stuart and I are both worried about is that the dominant paradigm of deep learning often has these kinds of gaps in it. Sometimes I use the term pointillistic. They're like collections of many points in some cloud. And if you come close enough to the points in the cloud, they usually do what you expect. But if you move outside of it, sometimes people call it distribution shift, to a different point, then they're kind of unpredictable. So in the example of math that Stuart and I both like, it'll get a bunch of math problems that are kind of near the points in the cloud where it's got experience in. And then you move to four-digit multiplication and the cloud is sparser. And now you ask a point that's not next to a point that it knows about. It doesn't really work anymore.
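
The "points in a cloud" picture above can be made concrete with a deliberately crude Python sketch. It is an assumption-laden stand-in, not a description of how a language model works internally: the hypothetical `lookup_multiply` simply returns the stored answer of the nearest memorised example, while the hypothetical `schoolbook_multiply` applies the digit-by-digit rule. The first is right wherever the cloud is dense and wrong far outside it; the second works for any non-negative integers.

```python
def train_lookup(examples):
    """Memorise the product for every (a, b) pair seen during 'training'."""
    return {(a, b): a * b for a, b in examples}


def lookup_multiply(table, a, b):
    """Answer with the stored product of the nearest memorised pair."""
    nearest = min(table, key=lambda pair: abs(pair[0] - a) + abs(pair[1] - b))
    return table[nearest]


def schoolbook_multiply(a, b):
    """Digit-by-digit multiplication: a rule, so it works at any size."""
    total = 0
    for place, digit in enumerate(reversed(str(b))):
        total += a * int(digit) * 10 ** place
    return total


# Dense cloud: every pair of one-digit numbers is memorised.
table = train_lookup((a, b) for a in range(10) for b in range(10))

# Inside the cloud the lookup is exact.
print(lookup_multiply(table, 7, 8))            # 56

# Far outside the cloud it falls back on the nearest memorised point, (9, 9).
print(lookup_multiply(table, 1234, 5678))      # 81, which is not the product
print(schoolbook_multiply(1234, 5678))
```

The contrast is the point: the lookup never "learned multiplication", it only has answers near the points it has seen.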

So this illusion, oh, it learned multiplication. Well, no, it didn't. It just learned to jump around these points in this cloud. And that has an enormous level of unpredictability that makes it hard for humans to reason about what the system's going to do. And surely there are safety consequences that arise from that. And something else Stuart said that I really appreciated is in the old days in classical AI, we had engineering techniques around these. You built modules, you knew what the modules did. There were problems then too. I'm not saying it was all perfect, but the dominant engineering paradigm right now is just get more data if it doesn't work. And that's still not giving you transparency into what's going on and it can be hard to debug. And so like, okay, now you built this Go system and you discover it can't beat humans doing this thing. What do you do?

Well, now you have to collect some data pertaining to that, but is it going to be general? You kind of have no way to know. Maybe there'll be another attack tomorrow. And that's what we're seeing in the driverless car industry is like their adversaries may be of a different sort. They're not deliberate, but you find some error and then people try to collect more data, but there's no systematic science there. Like you can't tell me, are we a year away or 10 years away or a hundred years away from driverless cars by kind of plotting out what happens? Because most of what matters are these outlier cases. We don't have metrics around them. We don't have techniques for solving them. And so this is a very empirical, try-stuff-out-and-hope-for-the-best methodology. And I think Stuart was reacting to that before.

And I certainly worry about that a lot, that we don't have a sound methodology where we know, hey, we're getting closer here. And we know that we're not going to asymptote before we get to

where we want to be. Okay. So it sounds like you both have doubts as to whether the current path of reliance on deep learning and similar techniques to scale is going to deliver us to the promised land of AGI, whether aligned with our interests or not. It's just, we need more to actually be able to converge on something like general intelligence because these networks, as powerful as they seem to be in certain cases, they're exhibiting obvious failures of abstraction and they're not learning the way humans learn. And so we're discovering these failures, perhaps to the comfort of people who are terrified of the AGI singularity being reached. Again, I want to keep focusing on the problems and potential problems with narrow AI. So there's two issues here. There's narrow AI that fails, that doesn't do what it purports to do. And then there's just narrow AI that is applied in ways that prove pernicious, intentionally or not, bad actors or good actors reaching unintended consequences. Let's focus on ChatGPT for another moment or so, or things like ChatGPT. Many people have pointed out that this seems to be potentially a thermonuclear bomb of misinformation, right? And we already have such an enormous misinformation problem just letting the apes concoct it.

Now we have created a technology that makes the cost of producing nonsense and nonsense that passes for knowledge almost go to zero. What are your concerns about where this is all headed, where narrow AI of this sort is headed in both of its failure modes? Its failure to do what it's attempting to do, that is, it's making inadvertent errors, or just its failure to be applied ethically and wisely. And however effective it is, we plunge into the part of the map that is just bursting with

unintended consequences. Yeah, I find all of this terrifying. It's maybe worth speaking for a second just to separate out two different problems, you kind of hinted at it. So one problem is that these systems hallucinate. Even if you give them clean data, they don't keep track of things like the relations between subjects and predicates or entities and their properties. And so they can just make stuff up. So an example of this is a system can say that Elon Musk died in a car crash in 2018. That's a real error from a system called Galactica. And that's contradicted by the data in the training set. It's contradicted by things you could look up in the world. And so that's a problem where these systems hallucinate. Then there's a second problem, which is that bad actors can induce them to make as many copies or variants really of any specific misinformation that they might want.

So if you want a QAnon perspective on the January 6th events, well, you can just have the system make that and you can have it make a hundred versions of it. Or if you want to make up propaganda about COVID and vaccines, you can make up a hundred versions each mentioning studies in Lancet and JAMA with data and so forth. All of the data is made up; the studies are not real. And so for a bad actor, it's kind of a dream come true. So there's two different problems there. And the first problem, I think the worst consequence is that these chat-style search engines are going to make up medical advice. People are going to take that medical advice and they're going to get hurt. On the second one, I think what's going to get hurt is democracy, because the result is going to be, there's so much misinformation, nobody's going to trust anything. And if people don't trust that there's some common ground, I don't think democracy works. And so I think there's a real danger to our social fabric there. So both of these issues really matter. It comes down to, in the end, that if you have systems that approximate the world, but have no real representation of the world at all, they can't validate what they're saying.

So they can be abused, they can make mistakes. It's not a great basis, I think, for AI. It's certainly

not what I had hoped for. Stuart? I think I have a number of points, but I just wanted to go back to something you were saying earlier about the fact that the current paradigm may not lead to the promised land. I think that's true. I think some of the properties of ChatGPT have made me less confident about that claim because it's an empirical claim. As I said, sufficiently large circuits with sufficient recurrent connections can implement Turing machines and can learn these higher level, more expressive representations and build interpreters for them. They can emulate them. They don't really learn them. No, they can actually do that. Think about your laptop. Your laptop is a circuit, but it's a circuit that supports these higher-level abstractions. Your brain is a circuit, but

it's a circuit.

That's right. It's a question of representation versus learning. It's a circuit that supports that. So it can learn those internal structures, which support representations that are more expressive and can then learn in those more expressive representations. So theoretically, it's possible that this can happen.

Well, but what we always see in reality is your example before about the four-digit arithmetic. The systems don't, in fact, converge on the sufficiently expressive representations. They just always converge on these things that are more like masses of conjunctions of different cases, and they leave stuff out. I'm not saying no learning system could do that,

but these learning systems don't. Well, we don't know that, right? We see some failures, but we also see some remarkably capable behaviors that are quite hard to explain as just sort of stitching together bits of text from the training set.

I mean, I think we're going to disagree there. It's up to Sam how far he wants us to go down

that rabbit hole. Well, actually, let's just spell out the point that's being made. I also don't want to lose Stuart's reaction to these general concerns about narrow AI, but I think this is an interesting point intellectually. So, yes, there's some failure to use symbols or to recognize symbols or to generalize, and it's easy to say things like, you know, here's a system that is playing Go better than any person, but it doesn't know what Go is, or it doesn't know there's anything beyond this grid. It doesn't recognize the groups of pieces, et cetera. But on some level, the same can be said about the subsystems of the human mind, right? I mean, like, you know, yes, we use symbols, but the level at which symbol use is instantiated in us, in our brains, is not itself symbolic, right? I mean, there is a reduction to some piecemeal architecture. I mean, you know, there's just atoms in here, right? And there's nothing magical about

having a meat-based computer. In the case of your laptop, if you want to talk about something like, I don't know, the folder structure in which you store your files, it actually grounds out and computer scientists can walk you through the steps. We could do it here if you really wanted to, of how you get from a set of bits to a hierarchical directory structure. And that hierarchical directory structure can then be computed over. So you can, for example, move a subfolder to inside of another subfolder, and we all know the algorithms for how to do that. But the point is that the computer has essentially a model of something, and it manipulates that model. So there's a model of where these files are, or representation might be a better word in that case. Humans have models of the world. So I have a model of the two people that I'm talking to, and their backgrounds and their beliefs and desires to some extent. It's going to be imperfect, but I have such a model. And what I would argue is that a system like ChatGPT doesn't really have that. And in any case, even if you could convince me that it does, which would be a long uphill battle, we certainly don't have access to it so that we can use it in reliable ways in downstream computation.

The output of it is a string, whereas in the case of my laptop, we have very rich representations. I'll ignore some stuff about virtual memory that makes it a little bit complicated. And we can go dig in and we know which part of the representation stands for a file and what stands for a folder and how to manipulate those and so forth. We don't have that in these systems. What we have is a whole bunch of parameters, a whole bunch of text, and we hope for the best.
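
The folder example above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any real file system is implemented: folders are modelled as nested dictionaries, files as plain strings, and the helper names `get_node` and `move` are hypothetical. What it shows is the property the speaker is pointing at: every part of the representation has a known meaning, so an operation like "move this subfolder inside that one" can be written down and checked.

```python
def get_node(tree, path):
    """Walk from the root through each folder name in path."""
    node = tree
    for name in path:
        node = node[name]
    return node


def move(tree, src_path, dst_path):
    """Detach the item at src_path and attach it inside the folder at dst_path."""
    *parent_path, name = src_path
    if list(dst_path[:len(src_path)]) == list(src_path):
        raise ValueError("cannot move a folder into itself")
    destination = get_node(tree, dst_path)  # resolve before detaching anything
    if not isinstance(destination, dict):
        raise ValueError("destination is not a folder")
    destination[name] = get_node(tree, parent_path).pop(name)


# Hypothetical directory tree: dicts are folders, strings are files.
tree = {
    "docs": {"drafts": {"a.txt": "file"}},
    "archive": {"2023": {}},
}

# Move the subfolder docs/drafts to inside archive/2023.
move(tree, ["docs", "drafts"], ["archive", "2023"])
print(tree)
```

After the call, the `drafts` folder and everything in it can be found under `archive/2023`, and that fact can be inspected directly in the data structure rather than inferred from a string of output.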

Yeah. So I'm not disagreeing that we don't understand how it works, but by the same token, given that we don't understand how it works, it's hard to rule out the possibility that it is developing internal representational structures, which may be of a type that we

wouldn't even recognize if we saw them. They're very different. And we have a lot of evidence that bears on this. For example, all of the studies of arithmetic or Guy Van den Broeck's work on reasoning where if you control things, the reasoning doesn't work properly. In any domain where we can look, math problems or anything like that, we always see spotty performance. We always see hallucinations. They always point to there not being a deep, rich, underlying representation of any phenomena that we're talking about. So to my mind, yes, you can say there are representations there, but they're not like world models. They're not world models that can be reliably interrogated

and acted on. And we just see that over and over again. Okay, I think we're going to just agree to disagree on that. But the point I wanted to make was that if Gary and I are right, and we're really concerned about the existential risk from AGI, we should just keep our mouths shut. We should let the world continue along this line of

bigger and bigger deep circuits. Well, yeah, I think that's a really interesting question. I wanted your take on it, Stuart, and it goes back to the word Sam used about Promised Land. And the question is, is AGI actually the Promised Land we want to get to? So I've kind of made the argument that we're living in a land of very unreliable AI and said, there's a bunch of consequences for that. We have chat search, it gives bad medical advice, somebody dies. And so I have generally made the argument, but I'm really interested in Stuart's take on this, that we should get to more reliable AI, where it's transparent, it's interpretable, it kind of does the things that we expect. So if we ask it to do four-digit arithmetic, it's going to do that, which is kind of the classical computer programming paradigm where you have subroutines and there are functions, and they do what you want to do. And so I kind of push towards, let's make the AI more reliable. And there is some sense in which that is more trustworthy. You know that it's going to do this computation. But there's also a sense in which maybe things go off the rails at that point that I think Stuart is interested in.

So Stuart might make the argument, let's not even get to AGI. I'm like, hey, we're in this lousy point with this unreliable AI. Surely it must be better if we get to reliable AI. But Stuart, I think, sees somewhere along the way where we get to a transition where, yes, it reliably does its computations, but also it poses a new set of risks. Is that right, Stuart? And do you want to spell that out?

I mean, if we believe that building bigger and bigger circuits isn't going to work, and instead we push resources into, let's say, methods based on probabilistic programming, which is a symbolic kind of representation language that includes probability theory, so it can handle uncertainty, it can do learning, it can do all these things. But there are still a number of restrictions on our ability to use probabilistic programming to achieve AGI. But suppose we say, okay, fine. Well, we're going to put a ton of resources into this much more engineering-based, semantically rigorous component composition kind of technological approach. And if we succeed, we still face this problem that now you build a system that's actually more powerful than the human race. How do you have power over it? And so I think the reason to just keep quiet would be to give us more time to solve the control problem before we make the final push towards AGI against that.

If I'm being actually honest here, I don't know the right answer there. So I think we can, for the rest of our conversation, take probabilistic programming as kind of standing for the kinds of things that might produce more reliable systems like I'm talking about. There are other possibilities there, but it's fine for present purposes. The question is, if we could get to a land of probabilistic programming that at least is transparent, it generally does the things we expect it to do, is that better or worse than the current regime? And Stuart is making the argument that we don't know how to control that either. I mean, I'm not sure we know how to control what we've got now, but that's an interesting question.

Yeah. So let me give you a simple example of systems that are doing exactly what they were designed to do and having disastrous consequences. And that's the recommender system algorithm. So in social media, let's take YouTube, for example, when you watch a video on YouTube, it loads up another video for you to watch next. How does it choose that? Well, that's a learning algorithm, and it's watched the behavior of millions and millions of YouTube users and which videos they watch when they're suggested and which videos they ignore or watch a different video or even check out of YouTube altogether. And those learning algorithms are designed to optimize engagement, how much time you spend on the platform, how many videos you watch, how many ads you click on, and so on. And they're very good at that. So it's not that they have unpredictable failures, like they sort of get it wrong all the time, and they don't really have to be perfect anyway, right? They just have to be considerably better than just loading up a random video. And the problem is that they're very good at doing that, but that goal of engagement is not aligned with the interests of the users. And the way the algorithms have found to maximize engagement is not just to pick the right next video, but actually to pick a whole sequence of videos that will turn you into a more predictable victim.

And so they're literally brainwashing people so that once they're brainwashed, the system is going to be more successful at keeping them on the platform. They're like drug dealers. And so this is the problem, right? That if we made that system much better, maybe using probabilistic programming, if that system understood that people exist and they have political opinions, if the system understood the content of the video, then it would be much, much more effective at this brainwashing task that it's been set by the social media companies. And that would be disastrous, right? It wouldn't be a promised land, it would be a disaster.
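
The engagement objective described above can be sketched as a toy in Python. Everything here is an assumption made for illustration: the watch log is invented, the video names are made up, and the hypothetical functions `fit_watch_rates` and `recommend` are far simpler than any production recommender and say nothing about how YouTube's system is actually built. The sketch only shows the shape of the problem: the system learns from behaviour which suggestions get watched, then picks whatever maximizes that number, and nothing in the objective refers to the user's interests.

```python
from collections import defaultdict


def fit_watch_rates(log):
    """Estimate, per video, the fraction of suggestions that were watched."""
    shown = defaultdict(int)
    watched = defaultdict(int)
    for video, was_watched in log:
        shown[video] += 1
        watched[video] += int(was_watched)
    return {video: watched[video] / shown[video] for video in shown}


def recommend(rates):
    """Pick the video with the highest estimated engagement."""
    return max(rates, key=rates.get)


# Invented log of (suggested video, did the user watch it?) pairs.
log = [
    ("calm_explainer", True),
    ("calm_explainer", False),
    ("outrage_clip", True),
    ("outrage_clip", True),
]

rates = fit_watch_rates(log)
choice = recommend(rates)
print(rates, choice)
```

The algorithm is doing exactly what it was designed to do. Whether the chosen video is good for the person watching never enters the calculation, which is the misalignment being described.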

I completely agree. So Stuart, I agree with that example in its entirety. And I think the question is what lessons we draw from it. So I think that has happened in the real world. It doesn't matter that they're not optimal at it. They're pretty good and they've done a lot of harm. And those algorithms we do actually largely understand. So I accept that example. It seems to me like if you have AGI, it can certainly be used for good purposes or bad purposes. That's a great example where it's to the good of the owner of some technology and the bad of society. I could envision an approach to that and I'm curious what you think about it. And it doesn't really matter whether they have decent AI or great AI; in the sense of being able to do what it's told to do, it's already a problem now.

You could imagine systems that could compute the consequences for society, sort of an Asimov's-laws approach, maybe taken to an extreme. They would compute the consequences to society and say, Hey, I'm just not recommending that you do this. I mean the strong version just wouldn't do it. A weak version would say, Hey, here's why you shouldn't do it. This is going to be the long-term consequence for democracy, that's not going to be good for your society. We have an axiom here that democracy is good. So one possibility is to say, if we're going to build AGI, it must be equipped with an ability to compute consequences and represent certain values and reason over them. What's your take on that, Stuart?

Well, that assumes that it's possible for us to write down, in some sense, the utility

function of the human race. We can see the initial efforts there in how they've tried to put guardrails on ChatGPT, where you ask it to utter a racial slur, and it won't do it even if the fate of humanity hangs in the balance, right? So that like, insofar as you...

Yeah. I mean, that's not really true. That's a particular example in a particular context. You can still get whatever horrible thing you want out of it, right?

Right. We've not been very successful, you know. We've been trying to write tax law

for 6,000 years. We still haven't succeeded in writing tax law that doesn't have loopholes. Right. I mean, I always worry about a slippery slope argument at this point. So it is true, for example, that we're not going to get uniform consensus on values, that we've never made a tax code work. But I don't think we want anarchy either. And I think the state that we have now is either you have systems with no values at all that are really reckless, or you have the kind of guardrails based on reinforcement learning that are very sloppy and don't really do what you want to do. Or in my view, we look behind door number three, which is uncomfortable in itself, but which would do the best we can to have some kind of consensus values and try to work according to

those consensus values. Well, I think there's a door number four. And I don't think door number three works, because really, there are sort of infinitely many ways to write the wrong objective.

You can say that about society. We're not doing great, but we're better than anarchy. It's the Churchill line about democracy

is the best of some lousy options we try. That's because individual humans tend to be of approximately equal capability. And if one individual human starts doing really bad things, then the other ones sort of tend to react and squish them out. It doesn't always work. We've certainly had near total disasters, even with humans of average ability. But once we're talking about AI systems that are far more powerful than the entire human race combined, then the human race is in the position that, as Samuel Butler put it in 1872, that the beasts of the field are with

respect to humans, that we would be entirely at their mercy. Can we separate two things, Stuart, before we go to door number four, which are intelligence and power? So I think, for example, our last president was not particularly intelligent. He wasn't stupid, but he wasn't the brightest in the world. But he had a lot of power, and that's what made him dangerous, was the power, not the sheer intellect. And so sometimes I feel like in these conversations, people confound super intelligence with what a system is actually enabled to do, with what it has access to do, and so forth. At least I think it's important to separate those two out. So I worry about even dumb AI like we have right now, having a lot of power. There's a startup that wants to attach all the world's software to large language models, and there's a new robot company that I'm guessing is powering their humanoid robots with large language models. That terrifies me. Maybe not on the existential threat to humanity level, but on the level of there's going to be a lot of accidents

because those systems don't have good models. Yes. So Gary, I completely agree, right? I mean, I wrote an op-ed called We Need an FDA for Algorithms about six years ago, I think.

No, I need to read it. Sorry, Stuart. I think we should hold the conversation on AGI for a moment yet, but I would just point out that that separation of concepts, intelligence and power, might only run in one direction, which is to say that, yes, for narrow AI, you can have it become powerful or not depending on how it's hooked up to the rest of the world. But for true AGI that is superhuman, one could wonder whether or not intelligence of that sort can be constrained. I mean, then you're in relationship to this thing, and are you sufficiently smart to keep this thing

that is much smarter than you from doing whatever it intends to do? One could wonder. I've never seen an argument that compels me to think that that's not possible. I mean, Go programs have gotten much smarter, but they haven't taken more power over the world.

No, but they're not general. Honestly, Gary, that's a ridiculous example, right?

Before we plunge in, I want to get there. I want to get there. I just want to extract whatever lessons we can from this recent development in narrow AI. And then I promise you we're going to be able to fight about AGI in a mere matter of minutes. But Stuart, so we have got a few files open here. I just want to acknowledge them. One is you suggested that if, in fact, you think that this path of throwing more and more resources into deep learning is going to be a dead end with respect to AGI and you're worried about AGI, maybe it's ethical to simply keep your mouth shut or even cheerlead for the promise of deep learning so as to stymie the whole field for another generation while we figure out the control problem. Did I read you correctly there?

All right. I think that's a possible argument that has occurred to me and that people have put to me as well. It's a difficult question. I think it's hard to rule out the possibility that the present direction will eventually pan out, but it would pan out in a much worse way because it would lead to systems that were extremely powerful but whose operation we completely didn't understand, where we have no way of specifying objectives to them or even finding out what objectives they're actually trying to pursue because we can't look inside. We don't

even understand their principles of operation. It doesn't occur to me. Okay. So let's table that for a moment. We're going to talk about AGI. But on this issue of narrow AI getting more and more powerful, I'm especially concerned, and I know Gary is, about the information space. And again, because I just view what ordinary engagement with social media is doing to us as more or less entirely malignant. And the algorithms, as simple as they are and as diabolically effective as they are, have already proven sufficient to test the very fabric of society and the long-term prospects of democracy. But there are other elements here. So for instance, the algorithm is effective and employed in the context of what I would consider the perverse incentives of the business model of the internet. The fact that everything is based on ads, which gives the logic of endlessly gaming people's attention. If we solve that problem, if we decided, okay, it's the ad-based model that's pernicious here, would the very narrow problem you pointed to with YouTube algorithms, say, go away, or are you still just as worried, by some new rationale, about that problem?

You know, my view, and then Stuart can jump in with his, my view is that the problems around information space and social media are not going away anytime soon, that we need to build new technologies to detect misinformation. We need to build new regulations to make a cost for producing it in a wholesale way. And then there's this whole other question, which is like, right now, maybe Stuart and I could actually agree that we have this sort of mediocre AI, can't fully be counted on, has a whole set of problems that goes with it. And then really the question is, there's a different set of problems. If you get to an AI that could, for example, say for itself, you know, I don't really want to be part of your algorithm because your algorithm is going to have these problems. Like that opens a whole new can of worms. I think Stuart is terrified about them. And I'm not so sure as to not be worried about them. You know, I'm a little bit less concerned than Stuart, but I can't in all intellectual honesty say that no problems lie

there. I mean, maybe there are problems that lie there. So long, long before we get an algorithm that can reflect in that way, just imagine a fusion of what we almost have with ChatGPT with deepfake video technology, right? So you can just get endless content that is a persuasive simulacrum of, you know, real figures saying crazy things.

This is minutes away, not years away. I know an editor at a major paper, not an ordinary editor, he has a special role at a paper everybody knows. He read a debate that I got in on Twitter two days ago where I said, I'm worried about these things. And somebody said, ah, it's not a problem. And he showed me how in like four minutes he could make a fake story about, like, Antifa protesters causing the January 6th thing, using his company's templates and an image from Midjourney. And it looked completely authentic. Like this can be done at scale right now. There are dissemination questions, but we know that for example, you know, Russia has used armies of troll farms and lots of, you know, iPhones and fake accounts and stuff like that. So this is like an imminent problem.

It will affect the 2024 election. It's an ongoing problem. It is here. Well, it's already happened many times. In fact, if you go on Google News and look at the fact check section, just in the last day, there have been faked videos of President Biden saying that all 20 to 22 year olds in the United States

will be drafted to fight in the war. This is just here now. And it's a question of scope and spread and regulation. This doesn't require really further advances in AI. What we have now is already sufficient to cause this problem and is.

And it's going, I think, I think Sam's point is it's going to explode as the capabilities of these tools and their availability increase. So I would completely agree. I think, you know, I don't want to give the impression that I only care about extinction risk and none of this other stuff matters. I spend a ton of time actually working on lethal autonomous weapons, which again, already exist despite the Russian ambassador's claim that this is all science fiction and won't even be an issue for another 25 years. Yeah, it's just nonsense. As he was saying that, you know, there was a Turkish company that was getting ready to announce a drone capable of fully autonomous hits on human targets. So I think the solution here, and I have a subgroup within my center at Berkeley that is specifically working on this headed by Jonathan Stray. The solution is very complicated. It's an institutional solution. It probably involves setting up some sort of third party infrastructure, much as, you know, in real estate, there's a whole bunch of third parties like title insurance, land registry, notaries, who exist to make sure that there's enough truth in the real estate world that it functions as a market. Same reason we have accountants and auditing in the stock market. So there's enough truth that it functions as a market.

We just haven't figured out how to deal with this avalanche of disinformation and deep fakes, but it's going to require similar kinds of institutional solutions and our politicians have to get their hands around this and make progress because otherwise,

I seriously worry about democracies all over the world. The only thing I can add to what Stuart said is all of that with the word yesterday, like we don't have a lot of time to sort this out. If we wait till after the 2024 election, that might be too

late. We really need to move on this. I have to think the business model of the internet has something to do with this because if there was no money to be made by gaming people's attention with misinformation, it's not to say it would never happen, but the incentive would evaporate. There's a reason why this doesn't happen on Netflix. There's a reason why we're not having a conversation about how Netflix is destroying democracy in the way it serves up each new video to you. And it's because there's no incentive. I mean, I guess they've been threatening to move to ads in certain markets or maybe they have done, so this could go away eventually. But heretofore, there's been no incentive for Netflix to try to figure out. I mean, they're trying to keep you on the platform because they want you not to churn. They want you to end every day feeling like Netflix is an integral part of your life, but there is no incentive.

They want you to binge watch for 38 hours straight. Exactly, yeah. They're not entirely innocent in this.

No, yeah. But it's not having the effect of giving them a rationale to serve you up insane confections of pseudoscience and overt lies so that someone else can drag your attention for moments or hours because it's their business model, because they've sold them the right to do that on their platform.

Yeah, it's not entirely an internet phenomenon in the sense that Fox News also has a kind of engagement model that does center around, in my view, maybe I get sued for this, but center around misinformation. So for example, we know that executives there were not all on board for the big lie about the election, but they thought that maybe it

was good for ratings or something like that. Yeah, I mean, you could look at the Weekly World News, right? That was a...

That's right. And go back to yellow journalism in the 1890s.

An ordinary print outlet, which every week would tell you that the creatures of hell have been photographed emerging from cracks in the streets of Los Angeles and you name it, right?

Right. So what happened historically, the last time we were this bad off was the 1890s with yellow journalism, Hearst and all of that. And that's when people started doing fact checking more. And we might need to revert to that to solve this. We might need to have a lot more fact checking, a lot more curation, rather than just random stuff that shows up on your feed and is not in any way fact checked. That might be the only answer here.

But probably taking it... So not saying, okay, Facebook has to fact check all the stuff or Google has to fact check all the stuff, but Facebook has to make available filters where I can say, okay, I don't want stuff in my newsfeed that hasn't passed some basic standard of accountability and accuracy. And it could be voluntary, right? There's a business model. So coming back to this business model question, I think that the tech companies are understanding that the digital banner ad has become pretty ineffective and advertisers are also starting to understand this. And I think when you look at the metaverse and say, well, what on earth is the business model here, right? Why are they spending billions and billions of dollars? And I went to a conference in South Korea where the business model was basically revealed by the previous speaker, who was an AI researcher, who was very proud of being able to use ChatGPT-like technology, along with the fact that you're in the metaverse, so you have these avatars, to create fake friends. So these are people who... They are avatars who appear to be avatars of real humans who spend weeks and weeks becoming your friend, learning about your family, telling you about their family, blah, blah, blah. And then they'll casually drop into the conversation that they just got a new BMW or they really love their Rolex watch, blah, blah, blah, right?

So the digital banner ad is replaced by the ChatGPT-driven fake human in the metaverse and goes from 30 milliseconds of trying to convince you to buy something to six weeks. That's the business model of the metaverse. And this would be far more

effective, far more insidious and destructive. Although when you think about it, it's what people do to one another anyway. I mean,

there's like product placement in relationships. There's a little bit of that, but they're really expensive, right? I mean, an influencer on YouTube, you have to pay them tens of thousands of dollars to get 10, 20 seconds of product placement out of them. But these are quasi humans who, they cost pennies to run and they can take up hours and hours and hours of somebody's time. And interestingly, the European Union, in the AI Act, has a strict ban on the impersonation of human beings. So you always have a right to know if you're interacting with a real person or with a machine. And I think this is something that will be extremely important. It sounds like, yeah, okay, it's not a big risk right now, but I think it's going to become an absolute

linchpin of human freedom in the coming decades. I tend to think it's going to be a story of AI to the rescue here, where the only way we can detect deep fakes and other sources of misinformation in the future will be to have sufficiently robust AI that can go to war against the other

AIs that are creating all the misinformation. I think it's a useful tool, but I think what we need actually is provenance. So a video, for example, that's generated by a video camera is watermarked and timestamped and location coded. And so if a video is produced that doesn't have that, and it doesn't match up cryptographically with the real camera and so on, then it's just filtered out. So it's much more that it doesn't even appear unless it's verifiably real. It's not that you let everything appear, and then you try to sort of take down the stuff that's fake. It's much more of a sort of positive permission to appear based on authenticated provenance.
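The provenance idea described here, where a clip only appears if it checks out cryptographically against the camera that produced it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the key `CAMERA_KEY` and the functions `sign_capture` and `admit` are hypothetical names, and the sketch uses a shared-secret HMAC purely for brevity, whereas a real scheme would more plausibly use public-key signatures so that verifiers never hold the signing key.

```python
import hashlib
import hmac

# Hypothetical per-camera secret, for illustration only.
CAMERA_KEY = b"demo-camera-secret"


def sign_capture(video_bytes: bytes, timestamp: str, location: str,
                 key: bytes = CAMERA_KEY) -> str:
    """Return a hex tag binding the footage to its time and place of capture."""
    payload = (hashlib.sha256(video_bytes).digest()
               + timestamp.encode() + b"|" + location.encode())
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def admit(video_bytes: bytes, timestamp: str, location: str, tag: str,
          key: bytes = CAMERA_KEY) -> bool:
    """Positive permission: the clip appears only if its tag verifies."""
    expected = sign_capture(video_bytes, timestamp, location, key)
    return hmac.compare_digest(expected, tag)
```

The design choice matches the "permission to appear" framing: `admit` returns False for anything without a valid tag, including edited footage, rather than trying to detect fakes after the fact.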

I think that's the right way to go. I think we should definitely do that for video. I think that for text, we're not going to be able to do it. People cut and paste things from all over the place. We're not really going to be able to track them. It's going to be too easy to beat the watermark schemes. We should still try, but I think we're also going to need to look at content and do the equivalent of fact-checking. And I think that AI is important because the scale is going to go up and we're not going to have enough humans to do it. We're probably going to need humans in the loop. I don't think we can do it fully by machine, but I think that it's going to be important to develop new technologies to try to evaluate the content and try to validate it in something like the way that a traditional fact-checker might do.

Also, I think for text it's probably more validation of sources. At least until recently, there are trusted sources of news and we trust them because if a journalist were to generate a bunch of fake news, they would be found out and they would be fired. I think we could probably get agreement on certain standards of operation of that type. Then if the platforms provide the right filters, then I can simply say I'm not interested in news sources that don't subscribe

to these standardized principles of operation. I'm less optimistic about that particular approach because we've had it for several years and most people just don't seem to care anymore. In the same way that most people don't care about privacy anymore, most people just don't care that much about sources. I would like to see educational campaigns to teach people AI literacy and web literacy and so forth and hopefully we make some progress on that. I think labeling particular things as being false or, and I think these are the most interesting ones, misleading has some value in it. A typical example of something that's misleading is if Robert Kennedy says that somebody took a COVID vaccine and then they got a seizure, the facts might be true but there's an invited inference that taking COVID vaccines is bad for you and there's lots of data that show on average it's good for you. I think we also need to go to the specific cases in part because lots of people say some things that are true and some that are false. I think we're going to need to do some addressing of specific content and educating people through labels around them about how to

reason about these things. Okay, gentlemen, AGI alignment and the control problem. Let's jump in. Gary, early on you said something skeptical about this being a real problem because you didn't necessarily see that AGI could have or form the motivation to be hostile to humanity and this echoes something that many people have said. Certainly Steve Pinker has said similar things and I think that is, I'll put words into Stuart's mouth and then let him complete the sentence. I think that really is a red herring at this point or a straw man version of the concern. It's not a matter of our robot overlords spontaneously becoming evil. It's a story of what mismatches in competence and in power can produce in the absence of perfect alignment, in the absence of that ever increasing competence. I mean, now we're talking about a situation where presumably the machines are building the next generation of even better machines. The question is if they're not perfectly aligned with our interests, which is to say if human well-being isn't their paramount concern even as they outstrip us in every conceivable or every relevant cognitive domain, they can begin to treat us spontaneously based on goals that we can no longer even contemplate the way we treat every other animal on earth that can't contemplate the goals we have formed. It's not that we have to be hostile to the creatures of the field or the ants that are walking across our driveways, but it's just that the moment we get it into our heads to do something that ants and farm animals can't even dimly glimpse, we suddenly start behaving in ways that are totally inscrutable to them, but also totally destructive of their lives. Just by analogy, it seems like we may create the very entities that would be capable of doing that to us.

Maybe I didn't give you much of the sentence to finish, Stuart, but

weigh in and then let's give it to Gary. There are a number of variations on this argument. Steve Pinker says there's no reason to create the alpha male AI. If we just build AI along more feminine lines, it'll have no incentive to take over the world. Yann LeCun says, well, there's nothing to worry about. We just don't have to build in instincts like self-preservation. I made a little Gridworld MDP, which is a Markov decision process. It's just a little grid where the AI system has to go and fetch the milk from a few blocks away. On one corner of the grid, there's a bad person who wants to steal the milk. What does the AI system learn to do? It learns to avoid the bad person and go to the other corner of the grid to go fetch the milk so that there's no chance of being intercepted. We didn't put self-preservation in at all.

The only goal the system has is to fetch the milk. Self-preservation follows as a sub-goal because if you're intercepted and killed on the way, then you can't fetch the milk. This is an argument that a five-year-old can understand. The real question in my mind is, why are extremely brilliant people like Yann LeCun and Steven Pinker not able or pretending not to understand this? I think there's some motivated cognition going on. I think there's a self-defense mechanism that kicks in when your whole being feels like it's under attack because you, in the case of Yann, devoted your life to AI. In the case of Steven, his whole thesis these days is that progress and technology have been good for us. He doesn't like any talk that progress and technology could perhaps end up being bad for us. You go into this defense mode where you come up with any sort of argument. I've seen this with AI researchers. They immediately go to, oh, well, there's no need to worry. We can always just switch it off as if a super intelligent AI would never have thought of that possibility.
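The Gridworld milk example described above can be sketched as a tiny value-iteration problem in Python. This is not the actual Gridworld from the conversation, whose details are not given; the grid size, the positions `START`, `MILK` and `THIEF`, the discount `GAMMA`, and the rule that stepping onto the thief's cell simply ends the episode are all assumptions made for illustration. The only reward is for reaching the milk, and nothing penalises being intercepted.

```python
ROWS, COLS = 3, 5
START, MILK, THIEF = (1, 0), (1, 4), (1, 2)   # hypothetical layout
GAMMA = 0.9
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right


def step(state, move):
    """Deterministic move; bumping into a wall leaves the agent in place."""
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state


def reward(next_state):
    # The only reward is for fetching the milk. No term mentions survival.
    return 1.0 if next_state == MILK else 0.0


def value_iteration(sweeps=100):
    states = [(r, c) for r in range(ROWS) for c in range(COLS)]
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s in (MILK, THIEF):            # terminal cells: episode over
                continue
            values[s] = max(reward(step(s, m)) + GAMMA * values[step(s, m)]
                            for m in MOVES)
    return values


def greedy_path(values, max_steps=20):
    """Follow the best one-step lookahead from START."""
    path, s = [START], START
    while s not in (MILK, THIEF) and len(path) <= max_steps:
        s = max((step(s, m) for m in MOVES),
                key=lambda n: reward(n) + GAMMA * values[n])
        path.append(s)
    return path


path = greedy_path(value_iteration())
```

In this layout the straight-line route to the milk is four moves but passes through the thief's cell. The greedy path takes a six-move detour instead, because being intercepted means never collecting the reward, which is the sub-goal argument in miniature.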

It's kind of like saying, oh, yeah, we can easily beat Deep Blue and all these other chess programs. We just play the right moves. What's the problem? It's a form of thinking that makes me worry even more about our long-term prospects because it's one thing to have technological solutions, but if no one is willing to accept that there's a real risk here, and we saw the same thing in the nuclear industry. Yeah, I would say the anti-nuclear movement was strident and not always technically well-informed, but I was a physics student at the time. I went into that mode. Of course, nuclear power is safe. There's no real risk, etc. The nuclear industry basically became more and more convinced of its own rightness and less and less willing to countenance the possibility that, in fact, it was not capable of running these plants in a safe way, which turned out to be the case, as we saw with Chernobyl. I think this is what's going on. I hope that we can convince people. I found that one-on-one people are convinceable, and they will eventually acknowledge that, yes, of course there's a real issue.

Well, let's pretend to not understand this. Let's test that thesis because we've got

two-on-one here. Gary, are you convinceable? I might be. The first thing that I want to say is that I'm actually somewhere in between on this. I absolutely see Steve Pinker's arguments about motivation. Let's not be too sexist here, but we would prefer the stereotype of the female approach to the male approach. We would not like our robots to be aggressive and territorial. We don't necessarily have to build them in such a way. I think that there's some value to that. I think there's some value in being worried too. I think that there are a lot of unknowns here. This is the famous Rumsfeld thing about unknown unknowns.

We are filled with unknown unknowns. We don't really have anything like AGI right now that we are experimenting on, and we don't know exactly what's going to constrain it. I think the thing that is missing for me in Stuart's argument, and let me say that I would not bet the species on Stuart being wrong. I think we definitely need people to investigate the line of argument that he's developing. Part of it is a balance thing, like how much energy do we put towards short-term problems that we know are real versus long-term problems that could be serious, and we're not quite sure how to estimate their probabilities. So there's some question there, but I think unquestionably we should be studying these possibilities and we should have better answers than we have now. I still have not personally seen the scenario that convinces me that we couldn't just turn off the machine, for example. I know there's lots of discussion about that. The thing that's most important to me though is that I feel like most of the discussions assume that whatever the AI will do will be implicit and there'll be no explicit representation of our values. I think that's a disaster to only have things implicit, whether you do it with inverse reinforcement learning, which is something that Stuart has played around with, or whether you do the kinds of things that people are doing right now to train their chatbots where there isn't really good explicit representation.

But Gary, if you don't mind my just interrupting briefly, actually most of the scenarios that I've seen involve explicit goals, but misspecified goals. Look at the social media. I don't think that Facebook's goal was in fact to destroy all the democracies that they operate in, but they specified an objective and it turned out that optimizing that objective was catastrophic. And we have mathematical theorems showing that if you, for example, if you have a real objective that includes 250 terms and you leave out three of them, what happens is the optimizing system will use those three and set them to extreme values to squeeze a little bit more juice out

of the other 247. So Stuart, I'm not unaware of that kind of issue. The particular example you gave with milk was one where there were not explicit values there or at least a bunch of explicit values that we might imagine were not represented. So I'm completely with you in worrying about, for example, the incomplete specification problem. I'm completely with you in worrying about misalignment when there's a lot of specification and not much. I think we have a problem but I think that the answer is to actually make sure

that the values that are important to us really are represented, values like preserve human... Let me just add something here because I feel like there's a whole part of the map that's being ignored here. The line you're arguing doesn't seem to take intelligence, general intelligence seriously and disparities of intelligence seriously. I don't even think I'm getting to spell out the argument. Well, no, but let's just add this piece here. So there's the possibility as intelligence grows in every case in which we're familiar, that is our own, we know that new goals and priorities and values emerge. So just take our own relationship to evolution. We have not evolved to have conversations like this. We've evolved just to spawn and to survive long enough perhaps to help our progeny spawn. And that's it. I mean, that's the algorithm we're running. We're just trying to get our genes into the next generation.

And yet we've done all this other stuff, some of which seems quite antithetical to our actual encoded values. And who knows a century hence what we will value that we don't even know about now. So there's this exploration in the space of cognition that intelligence allows for. And what we're now talking about is being in relationship to truly intelligent entities. Not just things that are passing the Turing test, but things that actually can form new goals and have new values. And so just add that to what you're envisioning and then run your argument.

The argument. So I think a critical question is the extent to which we can lock down values in those systems. If we really cannot lock down any values, any set of core values, then I think we do have a problem, if not with spontaneous stuff, with abuse of those systems by bad actors, which is something we haven't talked about too much, but is obviously a looming background issue. If we can lock down some values, then the systems might be precluded from contemplating things that are at odds with those core values. And so I think the argument comes down to, is it possible to lock down core values? Are those core values, if we lock them down, sufficient to prevent the kinds of optimizations that we don't want where we get turned into paperclips? The paperclip example itself always seems silly to me because it seems like, well, why can't you just represent explicitly the value of human life? And then if you do, then that one just gets ruled out as being at odds with the axioms. So you have to either give a story where the system is not subject to the axioms, people have certainly tried to do that, or the story goes away. But what I often hear is people that I think are locked in the current machine learning mindset, where there are no innately given values at all, everything emerges from learning and data. And yeah, that's a mess. I mean, that's a totally unconstrained mess.

We can't even get it to work for arithmetic. I certainly don't want to use it for values. And all the messes of ChatGPT are an illustration of that. So it's really a question about innateness and values and whether we can do that.

So I think the point you made, the way you said it, can we lock down some of the core values? Well,

perhaps we can, but... I mean, surely we should be having a research effort around that. Let's

put it a different way. There is a research effort. I mean, so that's part of what we're trying to do at CHAI. I mean, for somewhat different reasons, more public policy reasons, there's an entire journal of wellbeing, which tries to figure out how do we actually nail down what human wellbeing means. Right. But then the question is how you implement them in a machine, right? Right. So there's real questions about how to do this. So Gary, my point is not that we should all give up, right? I'm saying there are hard problems to solve. I think, you know, and this is the premise for my whole research center is I think they are solvable, but not by door three methods, right? Not by saying, oh, we'll just write down the human objective function and make the machine optimize that, right?

The problem is precisely, you know, as happened with social media, right? Well, now you're actually contradicting yourself. They thought this. No, I'm not.

Well, you're saying it's okay to have your wellbeing journal to document these values and to think about how to implement them, and then you're saying, but that's not the answer and that's

not really going to do us any good. I didn't say it wouldn't do us any good. It'll do us some good,

but it's not. Okay. How much good will it do us? I mean, in a way that's the core question in my mind is, is that a pursuable approach and how far will it get us and what else would

we need to do? I feel like at least maybe we have to agree because of the same thing

as happens with tax law and loopholes, right? So this is where I think it's a slippery slope argument. So yes, the tax code is imperfect, but we have a lot of agreement around it. It's not perfect. Everybody agrees that murder is off-limits; in most circumstances, we will not tolerate it and so forth. So it's a slippery slope argument to say because I can find a periphery of cases where we don't have agreement that there's nothing to be said there. And I think we can agree that it's actually understudied how we could get machines to explicitly implement values, that there's only been a few kind of toy experiments around that. It's understudied just like I think inverse reinforcement learning is understudied, like there's a whole suite of things that ought to be studied. And I think that the explicit representation of values and reasoning over moral axioms and so forth just isn't getting enough attention in the AI world right now because it doesn't fit neatly into the machine learning paradigm. But in my view, it is one of the more promising things to consider.

Well, don't forget, though, I think even within the deep learning community, in terms of how you get to AGI, a lot of people would say with some type of deep reinforcement learning, which is precisely where you encode what you want the objective to be as a reward function,

and then off you go to the races. Yeah, but it's not part of a reasoning system where a system computes consequences of actions in real time. It's a kind of offline thing where you tune a bunch of weights. And I think there's an important difference between

that. AlphaGo is an example of deep reinforcement learning, and it certainly is computing the

consequences of actions in real time. So I mean, no, nobody's doing that where you can, I mean, trolley problems are sort of a silly example, but just because they're known, I'll just use it. Nobody is building a system that could actually reason about the consequences of a trolley problem in real time with respect to a set of moral axioms and some known set of facts about a particular situation. Or forget the trolley problems if you don't like them, but just you know, everyday decisions like, is it okay to steal the medicine in order to make this person

survive? Okay. That's not what we're talking about, right? So deep reinforcement learning includes systems that reason over future courses of action and evaluate the desirability of those future courses of action. Over Go positions, but not in the open-ended real world. But in terms of

the technology, right? Obviously, if you wanted to build- That technology doesn't work in open-ended real world problems. It doesn't work for anything really in the open-ended real world. It's not a

good approach to this problem. That's a separate question, though. But it's not. What I'm saying

is- Does it not work because we can't specify the reward function correctly?

Well, it's also because there's always distribution shift problems there, and you're never really doing the deep reasoning. That's why you were able to beat the Go program. Same reason you could beat the Go program, any ethical system is going to have-

But that's an orthogonal issue, right? We're going to take for granted that some technology is going to achieve AGI, and the question is, do we think that simply by specifying the reward function by hand, that's going to be enough to make sure that everything goes well?

I'm trying to get you, Stuart, to think about the problem slightly differently. So yes, you can do things with reward functions, but we should be thinking about systems that reason over moral axioms. It's just not being explored right now, or at least not at any kind of scale. So moral axioms such as? Don't kill people, don't allow humans to come to harm, don't do things that damage their physical bodies, that kind of stuff. Yeah, but these are- There are exceptions to all of those axioms, right? If you read- You need a defeasible reasoning system, just like we need for almost anything else. We're in an AI paradigm right now where we don't do much reasoning. So when ChatGPT decides, so to speak, whether or not to say a particular thing, it's not really reasoning over the principles, and that's why you get ridiculous answers like when I asked it, what would be the gender of the first female president, and it told me it couldn't answer.

It's not actually reasoning over- No, we all agree.

If we keep coming back to this, this is not reasoning.

I'm saying let's- So but everybody's imagining an AGI system that can't reason either. I'm not saying that, Gary. You're just putting words in my mouth. You aren't, Stuart, but I think a lot of people are. I think a lot of people see the current paradigm and the way it's driven by data that it doesn't really reason, and they carry that over to their thinking through of these scenarios. And what I'll ask you, I guess, put it this way, Stuart, is if you had a reasoning system, and I think you actually agree with me that an AGI system would be a reasoning system, could represent axioms, could do it in a reasonable way. Just let me pose the question.

Gary, I've worked on a reasoning system for 30 years. Why would I think that AI doesn't

involve reasoning? The current paradigm just doesn't, impure. Okay, but that's a red herring. But wait, Gary, let me just- I was posing a hypothetical. I want to ask Stuart's answer. Okay, don't forget your hypothetical. I just want to try to get this back on track. Let's stipulate a few things. One, we're stipulating that true AGI is possible, and that requires only precious few assumptions. One assumption is substrate independence, right? We can do this in silico. There's no law of nature that requires that a computer be made of meat, right?

So- Fully agreed. Given that mere, very parsimonious assumption, eventually we will be able to build, or at least it is possible to build a machine that is truly intelligent and open-ended and autonomous and self-improving in a way that we are not. And the crucial question here is, is it conceivable that the values of such a system that continues to proliferate its intelligence, right, generation after generation, and is in the first generation more competent at every relevant cognitive task than we are, is it possible to lock down its values such that it could remain perpetually aligned with us and not discover other goals, however instrumental, that suddenly put us at cross purposes? And that seems like a very strange

thing to be confident about. But wait, wait, Gary, let me just pass this back on.

Okay, don't fully agree. I didn't say I was absolutely confident.

Well, I would be surprised if in principle something didn't just rule it out, but at best

it seems like a far-fetched possibility. We have to be careful about the, you could consider the possibility, conceive and so forth. I can conceive of a scenario in which we build self-improving machines and the values drift. I can conceive of a world in which we decide that we don't want self-improving machines or that we put restrictions on the ways in which they self-improve precisely because we are worried about these things. One thing that I cannot conceive of is a truly general intelligent system that could not represent values, could not reason about values, and actually would turn us into paperclips. That particular example seems to me to be insane because unless you can give me an argument that it would abandon the values that we put in, a real AGI system would, among its other talents, be able to reason over values, recognize the set of values. Then it becomes a question about drifting the values.

Well, it would have to be. So, I mean, your assumption is that we have built in a set of values that prevent the AI system from taking any course of action that deviates significantly from what

humans want the future to be like. Let's say they're values rather than the course of action,

like let's be careful about where the restraint is. But yes, I'm envisioning a system where I said that you've put in the values in such a way that the AI system will not deviate significantly

from the future that the humans want, right? Won't deviate from those values. I don't know

if we're quibbling over something important or not. No, you're misunderstanding what I'm saying. I'm talking about if we put in slightly wrong values or if we leave out some part of what humans want the future to be like. And I think we don't really know very well what that is. But if we get it wrong or if we leave a piece out, then the system will take courses of action that deviate significantly from the future that we want. Absolutely see that, Chris. And we won't be able to do anything about it. So, door four is simply design the AI systems in a different way that does not assume that upfront we have fixed the objective of the machine, but instead allow the machine to be explicitly uncertain about the objectives of these humans that it's supposed to be pursuing. And that gives you a lot more robustness. Let me try to give you an analogy. So, suppose I want a car to follow a white line for five miles down a straight road. Now, I could try to line it up absolutely perfectly and not have a steering wheel and then set it going and hope that it'll just follow that line because I've set it up so perfectly.

Or I could put a steering wheel in so that as it deviates, it corrects and it'll carry on. And it seems to me that what you're proposing is the former, that we just get the values right and then everything will be fine. And I'm proposing the latter, that we build systems that allow for our uncertainty, our inability to specify the values correctly, and can learn as it goes along, and it will never be aligned. I think this is a problem with the word alignment itself that when we talk about alignment everyone immediately jumps to this idea that we get the AI system aligned with the human goals, and then we set it going. That's not the point. It will never be aligned in the sense that it will never have perfect, correct knowledge of what humans want the future to be like. But the point is, you can still have incredibly useful, valuable, and safe systems if they know that they don't know what the true preferences are about the future. And in fact, those systems behave much more reasonably and safely. They will defer to what humans give in the way of feedback. They will allow themselves to be switched off and so on. And this is crucial. And so the reason that I'm not super pessimistic is that I think there is this door four; we just have to get the AI community to understand that that door exists.
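[Editor's illustration, not part of the conversation: the white-line analogy above can be sketched as a small simulation. The function name, the noise level, and the correction gain are assumptions chosen for illustration; this is a toy model of open-loop versus feedback steering, not a model of any AI system discussed here.]

```python
import random


def rms_offset(steps, gain, noise=0.05, seed=0):
    """Root-mean-square distance from the white line over a simulated drive.

    gain = 0.0 models the car with no steering wheel (lined up once, never corrected).
    gain > 0.0 models a driver who removes that fraction of the error at every step.
    All numbers are illustrative assumptions.
    """
    rng = random.Random(seed)
    offset = 0.0      # lateral distance from the line
    total_sq = 0.0
    for _ in range(steps):
        offset += rng.gauss(0.0, noise)  # bumps, wind, imperfect initial alignment
        offset -= gain * offset          # feedback correction (no-op when gain is 0)
        total_sq += offset * offset
    return (total_sq / steps) ** 0.5


no_wheel = rms_offset(1000, gain=0.0)
with_wheel = rms_offset(1000, gain=0.5)
print(f"no steering wheel: {no_wheel:.3f}, with steering wheel: {with_wheel:.3f}")
```

[Without correction the small errors accumulate as a random walk, so the car wanders further from the line the longer it drives. With correction the error stays bounded even though the initial alignment was no better.]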

I also think we need to wean the AI community away from black boxes for all kinds of reasons, as we've already discussed. So if we pursue the current model, which is the door three model where we specify objectives, whether they're in the form of reward functions or moral axioms or whatever, we will fail. So that would give you a reason to be scared.

I can come back. So Stuart, first of all, I think there's a point of agreement, which is that the black box stuff is not going to work. Second of all, I think that there's a possible compromise here, though I don't think you're that amenable to it. The compromise is you have some innate core and then you have some learning on top of it. And humans, for better or worse, work in that fashion as far as I can tell. So people could look, for example, at Paul Bloom's book about moral development, and he certainly argues that there is some innate core for humans. I think there's a lot of developmental psychology that points to that. And there's also the possibility of moral learning, moral progress. So you have some things that seem to be hard-coded, it's probably too strong, but pre-written and revisable. And I think we might need a model in which there is some pre-written but revisable stuff. That's part one. Part two is I think that there's two things you could find behind door number three, in fact.

One are reinforcement learning systems that are driven by objectives but that don't really reason. And the other would be systems that really reason over moral dilemmas, for example, in the way that I think people do. And I just don't think you can say that that's well explored. And the last thing I'll say is I don't really understand door number four. So I understand, for example, that you want to have representation of uncertainty. I'm all for that. I don't think that's incompatible with anything that I'm saying. To the extent that you're arguing that some things should be learned, I'm fine with that. But it's an empirical conjecture to make the strong form of that claim and say, we'll just learn all of it implicitly by, I don't know, watching humans and seeing what they do, which is certainly not how I raise my children. So I raise my children by giving them some explicit instruction and also letting them go out there and observe the world. And I think it would be silly to put all of our eggs in one of those baskets or in any of these baskets, really, when we don't have good either theoretical bounds on any of them. We haven't really explored any of them all that much.

I think we really need to be eclectic and open-minded here and look at all of these options in part because none of them are yet convincing. To the extent that I said I'm in the middle here, I'm not certain that we're bound for doom and gloom, but I'm not certain that we're not either. I think that in order to bring us to a safe world, we need to do some serious work here on, let's say, both the innate side of the equation and the learned side of the equation. Nobody knows how to do that work yet and it's why people should look at these questions.

I guess I have my own concerns about door number four in that perhaps there is a solution there that can birth truly generalized intelligence that is perpetually tethered to its objective function of wanting to better and better approximate what humans want. What humans want as a diverse set of values and I don't agree with some of those values and we don't want the Taliban to get too many votes in this system. But leaving all that aside, perhaps it's conceivable that you could have something that's perpetually smarter than we are and getting smarter that would nevertheless continue to hew to this underlying reward function, but it seems to me that there are many more ways to create a superintelligence that is not concerned about what humans continually


I think the default is it would be indifferent. But then if we take it again, I do feel that in many of these conversations, we lose sight of the object we're actually considering, which is true intelligence. All we can do is reason from analogy with our own case. Just imagine we have created people who are people. Who knows what they're going to think next, but in this case, we've created people who are much smarter than any people who have ever existed, 10 times smarter, 100 times

smarter. I think you're convincing yourself that this is a really bad idea and I think...

Yeah, but that's almost by definition the end of our concerns being taken seriously. And I'm just wondering, this is just I guess a question for you Stuart first, is there anything in computer science that suggests that door number four is perpetually viable? Is there any way in which this links up to the deep arguments about the universality of computation or anything that could keep intelligence from migrating away from this

initial umbilical cord that we imagine is going to keep us connected to it? Yeah, so this is really interesting because it reminds me of the first question I ever asked in an AI lecture. So when I was an undergrad, there was an AI society at Oxford. And Aaron Sloman, who's a kind of philosopher stroke AI person from the University of Birmingham, Aaron came down to give a talk about his sort of conceptual architecture for AGI, although he didn't use the word AGI at the time. And I asked at the end, okay, what objectives do you think we should put into this AI system and how do you know that you've put in the right ones? And he really didn't like that question at all because coming, as you said, his idea of universality is that in some sense, it should be possible for the AI system to overwrite any initial objectives we give it and in fact, pursue any other objectives that it might dream up. From a philosophical point of view, that's actually, it seems to me, a little bit simplistic because if it has objectives, what reason would it have to overwrite them with objectives that are completely different or contradictory to those objectives?

Can I just add a footnote there just because I don't want to interrupt you, but I think the footnote to that is what we have accomplished in our own case because our objectives are, as I said, merely to spawn and survive and yet we haven't overridden those objectives and I guess at some level of tortured evolutionary psychological analysis, even this conversation is purposed toward us effectively acquiring mates and spawning. Yeah, surviving, not going extinct exactly. But the truth is we are not living by the logic of evolution and in the case of any individual person, a priest deciding to be celibate for life or a person deciding to get a vasectomy or virtually every male on earth declining the opportunity to just endlessly donate to sperm banks so that they can have endless progeny for whom they have no responsibility.

I mean, we're not living by our actual objective function, we made sense of spawning. Not going extinct exactly, that's because we weren't fortunately designed by computer scientists. So in fact, there's an interesting discussion in Human Compatible, I talk about this process which is sometimes discussed under the heading of Baldwinian evolution whereby this external evolutionary pressure results in the internalization of objectives which may not be perfectly aligned. So for example, the fact that we have pain sensors, hunger sensors, that we like sugar, that we like sex, these are all internalized approximations to this external objective that evolution has in mind for us, so to speak. And those internalized objectives, because they're not properly aligned with the external objective, produce things like drug addiction where there are certain chemicals that just hack into our dopamine system which is the reward system that has been internalized. So I give the example of a type of sloth that seems to have become addicted to part of its food supply and is so stoned all the time that it doesn't bother to reproduce and is in the process of going extinct. And so that can happen, right? So I completely accept, there's two levels to answer it. One is a sort of precise computer scientist answer is, okay, we need to do the work to ensure that certain parts of the system can never be overwritten and that they continue to function as the driving objective or incentive structure that the system follows. And there's a more hand-wavy argument which is the kind of thing you might see on Less Wrong where someone says, okay, if I'm an intelligent machine and I'm creating the next generation of intelligent machines, why would I build that next generation with objectives that conflict with my objectives? Because then the future that I want to bring about will not be brought about, right?
And so I have an incentive, in fact, to maintain the same incentive structures in subsequent generations.

It's a much more hand-wavy argument, but it also gives you a guide on how, as a computer scientist, you might actually try to make it a real theorem that if you design systems in a certain way, they will in fact stick to the original incentive design that we build into them. So Stuart.

Let's call it the incentive constitution. So Stuart, I feel like I'm getting confused about door number three versus door number four here. So door number three seems to, in my mind, include what you just called an incentive constitution, which I would say is more like moral axioms. Door number four doesn't? If you allow full drift, I think we're in trouble. So how do you not allow full drift? No.

So what I mean by the incentive constitution is sort of the three principles in Human Compatible, right? That the machine's only objective is to further the interests of the human race. It's initially uncertain about what those are, and the two are connected by the evidence that's provided by human behavior generally construed. So as a Bayesian, I certainly don't believe we should start from a uniform prior over

what the human preference structure might be because- So then aren't we just arguing about the details?

No, I think we're arguing about a fundamentally distinct category of AI system. The AI systems that we have right now, if you look at every chapter of the textbook, but let's pick reinforcement learning as one example, they require that you specify a reward function upfront, right? We don't yet know- I mean, look- Can I finish my sentence, Gary, since you asked? Sure. We don't yet know how to build AI systems of the kind that I'm describing that are uncertain about the objective. It's a much more complicated type of problem. It's in game theory because you have to take into account the existence of the human who has the preferences. Preferences have to flow at runtime from the human to the machine. The machine has to figure out how to behave both under uncertainty about objectives and given the possibility of acquiring more information about objectives in the future. And there's a huge amount of research to do. And what I'm arguing is if we don't do that work, right? And it's not just technical AI work.

It's also some of the philosophical work coming back to, well, what does a human race want? But if we don't do this work, we will just continue along the path that we're following right now, which is building door three systems that have extremely poorly specified initial objectives and have no possibility of learning more about what they should be doing.

So I'm still smelling a false dichotomy. So where I come from is developmental psychology. And the usual dichotomy is either things are innate or they're learned. And the reality, if you look at biology, is that there are innate contributions, things that I like to call pre-wired, and there are learned things. And in fact, my favorite flavor of all of this is what we call innately guided learning. And it still seems to me like what you're calling for, and in fact what I'm calling for is innately guided learning. So you want only whatever those three principles are in your book. I wanted to refresh my book, but I didn't have it with me today. You've got some set of principles that include doing things that are human compatible. Well, for me, that's one of the innate principles. And I'm perfectly cool with saying there could be some learning. I'm uncomfortable.

Now it's my turn to finish a sentence. I'm uncomfortable with complete open-endedness about this. I'm totally comfortable with some uncertainty about it. I would add in a couple more principles like don't kill people. Maybe those are special cases of the ones that you want, and maybe they aren't. But I'm not really seeing the difference. I'm just seeing a difference in emphasis, maybe more on the learned side. Maybe you want a slightly fewer axioms, or you want at least one there that you described, which is like you have to maintain human compatibility. You have to make sure the consequences of the things that you do are human compatible.

kind of like what I'm calling for. I'm coming from the point of view of AI, which is actually what we're talking about here. In AI, the vast majority of our technology is premised on the assumption that the objective is completely and perfectly known before we begin. I'm fine with relaxing that assumption. I'm saying that, yeah, of course, we should relax that assumption. But relaxing that assumption means rebuilding the whole of AI technology on a new, more complex, much broader theoretical foundation, and that's not trivial. Right now, 99.5% of AI researchers are continuing in the assumption that I'm saying, and you're saying, is invalid. That's a big deal. That's a

big ask to move the entire field in a different direction. There I agree, but I'll ask this question. If you took the knowledge base of Cyc, which we know is not perfect, but is fairly extensive, includes some moral principles in there, some kind of thing that does learning in a probabilistic kind of fashion, but doesn't start as a blank slate where the only thing specified is be human compatible, do you think that can work? I would agree that's outside the scope of what most people are looking for.

Well, I think we're getting closer to agreeing, but I think there's a big difference between saying, I don't have a uniform prior over what human preferences for the future are, and saying that I have to build in some absolutely cast iron moral axioms. For example, you can have a fairly strong belief that all other things being equal, each individual human being prefers to be alive than dead. But how much you weigh that preference against other preference, for example, how much do they prefer their children to be alive rather than dead, and what would cause them to sacrifice their own lives to reduce some risk to their children? I'm not prepared to specify an exact number for those things, and I don't know, and it probably varies across individuals and cultures and so on. So I think it's really hard, and even something as simple as don't kill people, that's a false moral axiom. There are many, many circumstances where the morally correct

thing to do is to kill somebody. At some level, I think we agree and some we don't. So on the latter point, I would say we know from many years of AI work, much of which you know better than I, about defeasible reasoning. It's clear that whatever principles we have have to allow that there are contradicting circumstances, you have to be able to reason over them. I'm fully with you on saying that we need some uncertainty around the probabilities. I'm with you in saying they need to be tuned. At some level, I don't think we're really that different in our vision of what it is that we want. I think we both want to start with some principle that says there are principles, some principle that says there can be uncertainty around them, some principle that says there's going to be constraint satisfaction or conflict resolution, and I think we both agree that that's really pretty different from the current paradigm, and that that's desirable. So at that level, I think we're just having minor disputes that matter to us, but at that level, I think we're actually in pretty strong agreement. We're just using different terms for describing it. We both want a system that can reason and learn and has some prior constraint. Neither of us wants a complete blank slate here.

I maybe want a little bit more, like maybe I want more specification around murder than you do, but I certainly want to have defeasibility where you can consider self-defense in cases like that. And so when we get down to actual implementation-level details, I'm not sure that we are that different. And I would say we're much closer to each other there than we are to say what people are doing with human reinforcement learning in a sort of sloppy way on top of ChatGPT to put guardrails that don't actually reason about those things and just do kind of textual matching relative to what somebody said. So at that level, I don't think we're actually that far off. And I think we both agree that's not really where people are looking. It's like we want two streetlights that are almost adjacent to each other, overlapping a little, and they're really far from what most people are doing.

Okay. I think that's certainly true. I think both of us are a long way away from where the majority of the field is. We don't have to disagree. I'm happy if you can persuade the other 99.5% of the field. There's two different things going on here, right? One is transparency, reasoning versus black box, unpredictable mess. And I think we're both on the same side of that question. Absolutely. And then there's the getting away from the other part of the dominant paradigm, which is what I call the standard model in the book is this idea that an AI system is given a specified perfectly known objective upfront and then just optimizes it. And that paradigm is so pervasive that the people don't even realize that there are alternatives. If you look at control theory, there's a cost function.

We specify it upfront, and then the control law either satisfies it, optimizes it, or doesn't optimize it. You look at statistics, you specify a loss function. Occasionally here and there, for example, in economics, people are now questioning, is GDP, in fact, the right objective for macroeconomics? Maybe we actually need to rethink that and think about more sophisticated measures. I think that's a good sign that that's happening, but there's a long way to go from saying, well, maybe GDP is wrong to saying, actually, maybe we need to develop economics around the principle that we will never be able to specify the objective for macroeconomics correctly. That's a totally different mindset and results in completely different mathematics, completely different policy recommendations, and so on.

So I think I see what you're getting at, and I think I'm sympathetic, but I'll give the worry now, which is you want some drift and some exploration in a space, but you might not want infinite drift. One of the worries that I think one could reasonably have is, will we still be here a thousand years from now when the systems clearly are going to be smarter than us? And if there's too much drift, we might not be here. How do you think about that problem of too much drift and too much exploration of the space? Maybe there are some parts we don't really want robots to be able to explore. We might want them to explore some,

but there are paths we might not want them to go down. I mean, again, coming back to, I think of these in a Bayesian way, it's not saying, okay, maybe the human objective function is to maximize the number of bananas in the world. Okay, let's go with that for a while. I mean, it entertains that as one among effectively an infinite set of possibilities, but it never acts on the assumption that that is in fact the correct human. And the more uncertainty, the less the system acts at all, right? If it doesn't know anything about what humans want, then this is actually an interesting theoretical paper that we wrote, right? Because the world is in a state that is not a state of nature, it's a state resulting from the actions of humans pursuing their objectives. If a system knows nothing at all about what humans want, then it has an incentive actually to do nothing, that doing nothing is a special action because it doesn't disturb the equilibrium that the humans have put the world into for whatever reason, right? Even if it has no idea what it is we want, it has an incentive not to mess with it. And so you get this property that adding uncertainty about human preferences leads to greater reluctance to act on the part of the machine. And I take that actually as a feature, not a bug. The other thing to mention is if you constrain, if you say actually the human preference structures for moral or religious or just empirical reasons can only lie in this subset that I'm just ruling out what are otherwise logically possible human preference structures.
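[Editor's illustration, not part of the conversation: the claim above that more uncertainty about human preferences makes a machine more reluctant to act on its own can be sketched with a toy calculation. This is an assumed, simplified model and not the formulation of the paper mentioned: the machine believes the human's utility U for a proposed action is normally distributed with mean mu and standard deviation sigma, doing nothing is worth 0, and a human who knows U would permit the action only when U is positive. The function names are hypothetical.]

```python
import math


def expected_positive_part(mu, sigma):
    """E[max(U, 0)] for U ~ Normal(mu, sigma): the value of letting the human decide."""
    if sigma == 0:
        return max(mu, 0.0)
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal density
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    return mu * cdf + sigma * pdf


def value_of_deferring(mu, sigma):
    """Extra expected utility from deferring to the human, compared with the
    better of acting unilaterally (worth mu) and doing nothing (worth 0)."""
    return expected_positive_part(mu, sigma) - max(mu, 0.0)


for sigma in (0.0, 0.5, 2.0):
    print(f"sigma={sigma}: gain from deferring = {value_of_deferring(1.0, sigma):.4f}")
```

[Under these assumptions the gain from deferring is zero when the machine is certain and grows as sigma grows, which is one way to read the remark that uncertainty about objectives makes unilateral action less attractive.]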

If you rule out the truth, then you have a problem because then the system converges to maybe the closest approximation in its feasible set. But as it does that, it ends up becoming completely certain in a false assumption about human preferences because you've said a priori that there's no possibility of a human preference structure that's existing outside of this initial set. So in fact, allowing for universality in the prior, i.e., in the part of the prior that isn't zero, is a good idea. But it's still a good idea also to put more weight in the prior on what you think are reasonable, like people prefer to be alive, people prefer their children to be alive, people prefer to eat rather than starve, to drink rather than die of thirst, all those things. These are perfectly reasonable things to put in to the prior. For example, trying to figure out what the trade-offs should be would be very difficult a priori. I don't even think I have a good sense of what those trade-offs are. And this is a really important thing. It's something that the utilitarians talk about a lot. Why the utilitarians did not want to start from moral axioms is because you can always find situations where, in fact, you have to make a trade-off. And if you start from moral axioms,

they don't allow trade-offs. One question here is, ironically, this is sounding a little bit like what I think the YouTube algorithm would sound like if it could only tell us what it was up to. It didn't know what we wanted. It's just following our clicking behavior. In its own mind, though we might think we don't want that next video of some random girl in yoga pants or whatever it is, it has found as a revealed preference that we, in fact, are more likely to click on that than anything else, and therefore it is what we want. Could there be a more sophisticated but nonetheless perverse outcome here following the logic you're sketching?

I think that's a great question. I want people to read through the book and the papers and critique them and say, look, here's a failure mode that's consistent with what you're proposing, but things could still go wrong. And I think the biggest one, and I talk about this towards the end of the book, is this assumption that, in fact, there's a coherent sense in which humans have preferences about the future and that these are autonomous. So there's two problems. One is our preferences are plastic, and I think exactly what's going on in social media is the systems are manipulating our preferences to make them easier to satisfy, to make you a more predictable version of yourself. So that's one issue. So version zero is humans have these stable autonomous preferences. Version one is going to be, no, they don't. They have preferences that are acquired from their cultures, their peers, their upbringing, and they're not stable, they're plastic, and we have to figure out what does it mean to not manipulate preferences? You can't completely leave the preferences sacrosanct and untouchable because any interaction, I mean, having a really useful household robot is going to change how I think about the world and my preferences will probably become a little bit more spoiled as a result, and so on. So there's a lot of these questions that I raised in the book, and I think we do need to solve them. And my argument is really that version zero, where we assume humans have stable autonomous preferences, is a good start.

I think it's the right place to start because you can make progress with the mathematical and philosophical tools that already exist. But getting to version one, where you take into account the plasticity of human preferences, I think there's still a lot

of philosophical open questions there that I need help with. So can I raise two questions? Just one more question. Sorry, Gary, but just one more question along those lines. Isn't this also missing the truly successful version of this or the truly desirable version of this, which would be to create a super intelligence that was benign and aligned, but aligned in ways which we can't yet understand, which is to say that this is an intelligence that could not only do what we want, but convince us to want better things, right? Show us things that are worth wanting, that we have yet to bring into view ourselves, right? So something that's not only more competent than we are to achieve our measly goals, but something that's wiser than we are that can show us the goals to which we should be purposed. I mean, that's obviously

the most idealistic utterance anyone's heard. Without appearing to do that. Yeah. I mean, I say that towards the end of the book, global AI driven preference engineering, what could possibly go wrong? And I think it's bad enough for an AI researcher to be even thinking about these questions. But if you start saying, oh yeah, the AI will start teaching the human race that we got it all wrong. And in fact, these are the right principles of life and purposes of life and so on. I think I would be the target of a lot more hate mails than I already

am. Right. But that is the true promise of real super intelligence, right? I mean, insofar as there's any correlation between intelligence and what we mean by intelligence, all things considered and what we mean by wisdom, all things considered. I mean, obviously those become uncoupled in many earthly examples, but in the limit, you'd have to think that whatever is, whatever sort of information processing that instantiates what we call wisdom, we want more and more of that. And if we're building intelligence at a superhuman level, at least it's conceivable that we could build wisdom at a superhuman level, insofar as those are not the same thing. And why not do that? Again, I realize we don't know how to do that, but it almost seems that in principle what you're imagining forsakes that possibility, right? Like we're just going to, we're going to keep bending this back to people as they are in the aggregate. You know, we're not going to deputize some star chamber to be the only people who are error correcting this, a star chamber of Nobel laureates, Nobel Peace laureates to inform this machine. It's going to be people from every culture, however benighted, that are going to get a vote. Don't we have a lowest common denominator problem on some level?

And if the AI could look at us now, it would draw the conclusion that what we really want to do is binge watch Netflix and feed the TikTok algorithm and live precariously

perched on the brink of nuclear war. Yeah, I was going to make a similar point, if I could make two observations. One is that I'm a little worried in what I hear from Stuart, as I think you just pointed out, Sam, that people's revealed preferences in the moment aren't necessarily our long-term preferences. So in the moment we want to binge watch or do drugs or things like that, and those things have opportunity costs, so people who watch a lot of television, in older studies, when people watched a lot of television, were less happy than people that watched a little television, and it's because of the opportunity costs and missing time being social with other people. So a lot of times when our short-term revealed preferences aren't really the thing we want to do, so there's an interesting challenge for Stuart's approach in how you would get things to be aligned with people's long-term preferences that they might state on rational reflection but might not follow in the moment. So I think there's a challenge there. The second observation is, I think there's a trade-off around completeness. So I would prefer a system that was incomplete and could never knowingly cause humans harm even though a complete system might know some scenarios where it was actually in the best interest of humans or the species or whatever to cause harm. I'd rather just leave those choices to humans even if it made my AI system a little bit less complete and couldn't handle some scenarios. I'd be okay saying, look, if there's a human that has to be hurt, that a human has to be another human maybe, has to be in the loop to make that decision. And so the more we strive for completeness, maybe the more we surrender to the machines

and maybe we don't want to do that. So maybe this is going to be the final round of comments. So I think one of the tricky issues is this notion. So I don't think the issue of myopia is particularly problematic. Sure, humans are myopic and the system has to understand enough about human cognitive structure to basically invert that, to get at the true underlying long-term preferences about the future. It's not easy at all but conceptually, at least it's not a black hole. It's essential work though. The issue of human autonomy seems to be much more difficult. So an AI system should understand that we value our autonomy and we, for example, with our children, the children always say, oh, dad, can you do up my shoelaces as I'm getting ready for school? And at some point, you have to say, no, you've got to do up your own shoelaces. You've got to be autonomous. And this autonomy is really important to us because having the world handed to us on a plate is, in the long run, spiritually far less satisfying than actually struggling and sometimes failing to achieve a future.

So AI systems will presumably, if we get this right, learn to stand back, learn to allow. So even if our best interests are to stay on the freeway, the AI system should not close off the off-ramps. Even though it would be irrational or even harmful for us to take the off-ramps, it shouldn't close them off because that takes away our autonomy. Our freedom to do what isn't in our own best interests is one way of thinking about that. And figuring out how to get that straight from a formal point of view, so that we can turn it into algorithms and show that, in fact, the system will respect our autonomy, is really tough. But I think, again, it is an essential part of what we're trying to do.

If I can add a point of agreement, it's sort of a meta level here. I don't know the answers to any of the research questions that Stuart and I raised in the last hour or two, but I think we can agree that they're important and that our decision about whether to even go forward with AGI at all as a society rests on us being able to resolve these research questions. So some of the questions are around like, how much specificity do you put in? How complete do you make the knowledge? How much do you learn? What are the mechanisms for doing this learning? Neither of us have proved to you, Sam, that we have good answers there. We've given you sketches of what might be answers for some of that, but we definitely don't as a field have solid answers and there's an enormous amount at stake in whether or not those research promises can bear out. So there's a bunch of approaches that we have that we might be able to make a machine that has some latitude in its decision-making, but maybe doesn't drift so much as to annihilate us all. We don't have answers there. We have questions there.

And I think Stuart is right to be raising those questions and I can maybe cheer from the outside for that. We need to realize that we don't yet have answers there. And then we have to balance a different set of questions, which is like, these are really interesting, fascinating questions, but how close is AGI? How much short-term risk is there for the stuff that we have now? And so there are many, many really important policy decisions and research decisions that are very much open. We really don't have the answers here.

So here's what I would really like to happen. We do AGI right and the AGI figures out that actually AGI is a really bad idea for the human race and extinguishes itself. If that happens, I would say, okay, we got it right.

At the very least, that's an awesome premise for a short story.

Okay. So a final question, gentlemen, and thank you for your time. I realize we've gone over, but this is a question I think can be answered in a very short span if it even can be answered. It strikes me as a strangely absurd question from two sides, but I will pose it. And it's, should we just stop progress on AI until we've figured out how to do it safely? To my ear, this is a very strange question because it sounds absurd for two reasons. I mean, it's absurd because on its face, it kind of answers itself. Of course, we should wait until we can do it safely. I mean, we're talking about an existential risk. Why wouldn't we wait until we can do it safely? But on the other hand, the idea that we're going to stop the field or even slow its progress significantly seems more or less unthinkable at this point. And I'm wondering if I have that last part right.

Is it really, is it unthinkable that we would pull the brakes in any significant way here and spend more time on the concerns we've been discussing in this last hour?

I'll go back to what I said at the beginning, which is about change for me. A month ago, I would have said, that's silly. We don't need to pull the brakes. We'll sort it out. And just watching the rollout of Sydney or Bing and how Microsoft put pressure on Google and so forth just makes me feel less confident that we either can foresee problems or that we can stave them off. At the least, we need more regulation about rollout of AI at scale, but I don't know if that's enough. And so I'm at least entertaining this kind of question about pause in a different way than I was before. I'm not immediately ready to call for a pause on research or anything like that, but I would consider a pause on deployment. And I think we do have to acknowledge how much we don't know, how fast things are moving, and how little structure we have in place to try to make the right decisions. And that itself is a reason to at least consider some kind of pause until we sort ourselves out.

Stuart, do you have anything to add on that point?

So I would concur on the deployment question, and the European Union AI Act is really about deployment. It's not about research. And I think it's extraordinarily difficult to put a stop to research, but it has happened. The genetic engineers who were gung ho for improving the human gene pool 50 years ago, by the mid-70s, when they held a meeting in Asilomar, basically pulled back and said, you know what, we should not be modifying the human genome in a heritable way. And since then, that has pretty much held internationally. We don't have human clones and so on. So there's been a kind of self-restraint that is in many ways admirable. I don't know if it's going to last under the present avalanche of interest in CRISPR. With AI, this is one of the reasons I did this back of the envelope calculation in the book about what's the cash value of AGI. And it comes out to ridiculous numbers like in the tens of quadrillions of dollars.

And it's been really hard to stop burning fossil fuels even though we know we should because of the stranglehold that the fossil fuel industry has put on our economy, our governments, our political system for decades and decades and decades. They have really outthought the human race in sort of the same way we might think a hostile AGI might do it. And this is so much larger than the fossil fuel industry. I think the impetus is really hard to stop, but I can certainly see circumstances, and we might need sort of a middle-sized catastrophe to make it politically acceptable. But if one did occur, as happened with Chernobyl, which led many countries actually to abandon nuclear power, we could see something similar in AI. And then I think governments are willing to act; whether they're able to act, given the power of the tech companies, is a different question. But I'm not willing to bet the future on our ability to just put a stop to it.

Stuart, Gary, thank you so much for your time. It's been a great conversation.

It's been terrific. Thanks very much to both of you.

Thanks. Thanks, Gary.