Google Eats Rocks + A Win for A.I. Interpretability + Safety Vibe Check

Publish Date: 2024/5/31

This podcast is supported by KPMG. Your task as a visionary leader is simple. Harness the power of AI. Shape the future of business. Oh, and do it before anyone else does without leaving people behind or running into unforeseen risks. Simple, right? KPMG's got you. Helping you lead a people-powered transformation that accelerates AI's value with confidence. How's that for a vision? Learn more at www.kpmg.us.ai.

Hello, Casey. Hey, Kevin. How was your Memorial Day weekend? It was wonderful. I got to go to a beautiful wedding ceremony

and very much enjoyed that. Nice. That was your Memorial Day weekend. It was good. But I feel like you have something that you didn't bring up, which is that you actually had a big launch this past weekend. I did a hard launch. I mean, I guess I did a hard launch, a boyfriend, like once before on Instagram, but it was many years ago. And this one, I think like at this point, hard launches...

people sort of know what they are. And so a lot of like thought goes into it. Well, so a hard launch, just so I'm clear with the latest lingo. This is when you announced that you have a new boyfriend on Instagram. Well, because if the soft launch is like, you know, if maybe you see somebody's shoulder in an Instagram story and you think, well, that's a new person. Like, how is that? Is that, is my friend, are they dating that person? That's a soft launch.

But once there's a face and a name, that's a hard launch. I see. So you debuted, you hard launched your new boyfriend. We had, we had. And it had been, you know, some time in coming. And of course, I had to, you know, check in with him and make sure he was going to be okay with this. And, you know, he was excited about it. And you did it on the grid, which was bold. Of course I did it on the grid. I want to, you know, I want to show everyone. I can't just have that disappear in 24 hours. Yeah, how did it go? Hard launch went very well. You know, I mean...

Was the engagement what you'd hoped for? The engagement was off the chart. It was my most popular Instagram post I've ever done. Did he also hard launch you on his Instagram? Yes. It was honestly very stressful. There was a whole content plan. There were whiteboards. Well, we did.

Like dozens of photos. Did you hire a marketing agency? Yeah. Our teams got involved. Uh, no, we'd, we'd taken so many photos and you know, so of course we're like sitting and we're like, we're going to do this photo. We're going to do this. Is this one a little edgy? Let's do it anyway. And so we came up, I think with five photos and then yes, we like more or less simultaneously did the launch. Yeah. Wow. Yeah. See, I,

I've been out of the game so long that the only thing I remember is that you could like change your relationship status on Facebook. And that was the hard launch of like 2008. Yes, absolutely. And so, of course, in my mind, you know, because I also had that sort of millennial urge to like, do I make this Facebook official? But I'm just like, no, that just seems like that seems so boomer coded. You have to make it LinkedIn official. That's when it truly becomes real. I got into a relationship recently. Here's 10 lessons that I have about enterprise software.

I'm Kevin Roose, a tech columnist at the New York Times. I'm Casey Newton from Platformer. And this is Hard Fork. This week, Google tells us all to eat rocks. We'll tell you where its AI went wrong. Then, Anthropic researcher Josh Batson joins to talk about a breakthrough in understanding how large language models work. And finally, it's this week in AI safety, as I try out OpenAI's new souped-up voice assistant, and then it gets cruelly taken away from me. I'm so sorry that happened. Me too.

Well, Kevin, pass me the non-toxic glue and a couple of rocks because it's time to whip up a meal with Google's new AI overviews. Did you make any recipes you found on Google this week?

I did not, but I saw some chatter about it, and I actually saw our friend Katie Notopoulos actually made the glue pizza. But we're getting ahead of ourselves. We're getting ahead of ourselves. And look, the fact that you stayed away from this stuff explains why you're still sitting in front of me. Because over the past week, Google found itself in yet another controversy over AI, this time related to search, the core function of Google. And

right after that, we had this huge leak of documents that brought even more attention to search and raised the question of whether Google's been dishonest about its algorithms. Kevin, can you imagine? Wow. So there's a lot there. Yes. Let's just go through what happened because the last time we talked about Google on this podcast, they had just released this new AI overviews feature. And this is the thing that shows you a little AI generated snippet above the search results when you type in your query. And I think it's

Fair to say that this did not go smoothly. It didn't. And I want to talk about everything that happened with those AI overviews. But before we get there, Kevin, I think we should take a step back and talk about the recent history of Google's AI launches. Can we do that real quick? Yes. Because I would say there's kind of an escalation in how bad things have gotten. So let's...

Let's go back to February 2023 and talk about the release of Google Bard. Kevin, when I say the word Bard, what does that conjure up for you? Shakespeare. Yep, Shakespeare, number one, and probably number two would be the late lamented Google chatbot. Yes, RIP. Fun fact, Kevin and I were recently in a briefing where a Google executive had a sticker on their laptop that said, total Bard ass. And that sounds like a joke. It's true.

And you actually texted me. I texted you. And you said, does that sticker say total Bard ass? And I said it couldn't possibly. And then I zoomed in. I said computer enhance. And indeed it did say total Bard ass. And if you are a Googler who has access to a sticker. We're dying for one. That says total Bard ass. I want one. I will put it on my laptop. Please. It belongs in the Smithsonian. We're begging you for it. So this comes out in February 2023. And unfortunately, the very first screenshot posted of Google's AI chatbot is.

It gave incorrect information about the James Webb Space Telescope. Specifically, it falsely stated that the telescope had taken the first ever photo of an exoplanet. Yes. Kevin, without binging, what is an exoplanet? It's a planet that signs its letters with a hug and a kiss. No, it's actually the planet where all my exes live. But let's just say...

That Google AI launches had not gotten off to a great start when that happened. In fact, we talked about that one on this show. Then comes the launch of Gemini, and then we had a culture war, Kevin, over the refusals of its image generator to make white people. Sure did. Do you have a favorite thing that Gemini refused to make due to wokeness? No.

I was partial to Asian Sergey and Larry. Do you remember this? Wait, I actually forgot this one. What was this one? Somebody asked Gemini to make an image of the founders of Google, Sergey Brin and Larry Page. And it came back and they were both Asian, which I love. I have to imagine that ended up projected onto a big screen in a meeting somewhere at Google. That's so beautiful to me. So look, that brings us to the AI overviews. And Kevin, you sort of said it up top, but really,

remind us a little bit of how do these things work? What are they? So this is what used to be known as search generative experience when it was being tested. But this is the big bet that Google is making on the future of AI in search. Obviously, they have seen the rise of products like Perplexity,

which is this AI-powered search engine. They believe, Sundar Pichai said, you know, that he believes that AI is the future of search and that these AI overviews that appear on top of search results will ultimately give you a better search experience because instead of having to click through a bunch of links to figure out what you're looking for, you can just see it displayed for you, generated right there up at the top of the page. Right. And very briefly, why have we been so concerned about these things? Well, I think your concern that I shared was that this was

ultimately going to lock people into the Google walled garden that instead of going to links where you might see an ad, you might buy a subscription, you might support the news or the media ecosystem in some way. Instead, Google was just going to kind of keep you there on Google. The phrase they would use over and over again was, we will do the Googling for you. That's right. And that

it would sort of starve the web of the essential referral traffic that sort of keeps the whole machine running. - So that is a big concern and I continue to have it every single day. But this week, Kevin, we got a second concern, which is that the AI overviews are gonna kill your family. And here's what I mean. Over the past week, if you ask Google, how many rocks should I eat? The AI overview said at least one small rock per day. I verified this one myself.

As you referenced up top, if you said, how do I get the cheese to stick to my pizza? It would say, well, have you considered adding non-toxic glue? Would have been my first guess. Yeah, at least it said non-toxic glue. That was very nice of the algorithm. It said that 17 of the 42 American presidents have been white. To me, the funniest thing about that is that there have been 46 U.S. presidents. Yeah.

He got both the numerator and the denominator wrong. And of course, and this was probably the most upsetting to our friends in Canada. It said that there has been a dog who played hockey in the National Hockey League. Do you see that one? Well, I think that was just the plot of Air Bud, right? Yeah, well, there's no rule that says a dog can't play hockey, Kevin. And it identified that dog as Martin Pospisil. Who is that? Well, it seems impossible that you've never heard of him. Yeah.

But he's a 24-year-old Slovakian man who plays for the Calgary Flames. Guess you're not a big Flames fan. I'm not. So, look, how is this happening? Well, Google is pulling information from all over the internet into these AI overviews. And in so doing, it is revealing something we've talked about on the show for a long time, which is that the large language models truly do not know anything.

They can often give you answers, and those answers are often right, but they are not drawing on any frame of knowledge. They're simply reshuffling words that they found on the internet. Oh, see, I drew a different lesson from this. What's that? The technology is actually only partly to blame here because I've used a bunch of different AI search products, including Perplexity, and not all of them make these kinds of stupid errors.

But Google's AI model that it's using for these AI overviews seems to just be qualitatively worse. Like, it just can't really seem to tell the difference between reliable sources and unreliable sources. So the thing about eating rocks appears to have come from The Onion. And what is The Onion?

It's a satirical news site. Wait, you're saying that every story published on The Onion is false? I am, yes. That seems like an interesting choice to include in your AI overviews for facts. Right, and the thing about adding glue to your pizza recipe came from basically a shit post on Reddit.

So obviously, these AI overviews are imperfect. They are drawing from imperfect sources. They are summarizing those imperfect sources in imperfect ways. It is a big mess. And this got a lot of attention over the weekend. And as of...

Today, I tried to replicate a bunch of these queries, and it appears that Google has fixed these specific queries very quickly. Clearly, they were embarrassed by it. I've also noticed that these AI overviews just are barely appearing at all, at least for me. Are they appearing for you? I'm seeing a few of them, but yes, they have definitely been playing a game of whack-a-mole. And whenever one of these screenshots has gone anything close to viral, they are quickly intervening.

Now, I should say that Google has sent me a statement about what's going on, if you would like me to share. Sure. The company said, quote, the vast majority of AI overviews provide high-quality information with links to dig deeper on the web. Many of the examples we've seen have been uncommon queries, and we've also seen examples that were doctored or that we couldn't reproduce. It says some more things and then says, we're taking swift action where appropriate under our current policies and using these examples to develop broader improvements to our systems. So,

They're basically saying, look, you're cherry picking, right? You went out and you found the absolute most ridiculous queries that you can do. And now you're holding it against us. And I would like to know, Kevin, how do you respond to these charges? I mean, I think it's true that some people were just deliberately trolling Google by putting in these very sort of edge case queries that, you know, real people, many of them are not Googling like

Is it safe to eat rocks? That is not a common query. And I did see some ones that were clearly faked or doctored. So I think Google has a point there. But I would also say like these AI overviews are also making mistakes on what I would consider much more common sort of normal queries. One of them that the AI overview botched was about how many Muslim presidents the U.S. has had. The correct answer is zero, but the AI overview answer was one.

George Washington. Yes, George Washington. No, it said that Barack Hussein Obama was America's first and only Muslim president. Obviously not true. Not true. But that is the kind of thing that Google was telling people in its AI overviews that I imagine are not just like fringe or trollish queries.

Right. And also, like, I guess it has always been the case that if you did a sort of weird query on Google, you might not get the answer you were looking for, but you would get a webpage that someone had made, right? And you would be able to assess, hmm, does this website look professional? Like, does it have a masthead? Like, do the authors have bylines? You could just sort of ask yourself some basic questions about it. Now everything is just being compressed into this AI slurry, so you don't know what you're looking at. So I have a couple things to say here. Yeah.

I think in the short term, this is a fixable problem. Look, I think it's clearly embarrassing for Google. They did not want this to happen. It's a big, you know, rake in the face for them. But I think what helps Google here is that Google search and search in general is what they call a fat head product. Do you know what that means? I don't know what that means. So it means basically if you take a distribution curve, the most popular queries on Google or any other search engine account for a very large percentage

of search volume. Actually, according to one study, the 500 most popular search terms make up 8.4% of all search volume on Google. So a lot of people are just searching like Facebook and then clicking the link to go to Facebook. Exactly. Or they're searching something else that's, you know, very common, you know, um,

What would be an example of a good... Like, has a dog ever played hockey? No. No? Okay. No, stuff like... What time is the Super Bowl? Yeah, what time is the Super Bowl? Or, you know, how do I fix a broken toilet or something like that? Local movie times. Exactly. Yeah. And, you know, for...

That means that Google can sort of manually audit the top, I don't know, say 10,000 AI overviews, make sure they're not giving people bad information. And that would mean that the vast majority of what people search for on Google does actually have a correct AI overview. Now, in that case, it wouldn't actually technically be an AI overview. It'd be sort of like a human overview that was sort of drafted by AI. But

Same difference in Google's eyes. I also think they can make sure that AI overviews aren't triggered for sensitive topics, for things where your health are concerned. Google already does this to a certain extent with these things called featured snippets. And I think they will continue to sort of play around with and adjust the dials on how frequently these AI overviews are triggered. But I do think there's a bigger threat to Google here, which is that they are now going to be held responsible for the information that people see on Google. Yes.

We've talked about this a little bit, but I mean, this to me is the biggest complaint that people have that is justified is that Google used to, you know, maybe they would point you to a website that would tell you that, you know, putting glue on your pizza is a good way to get the cheese to stick. But you as Google could sort of wash your hands of that and say, oh, that was people just trolling on Reddit. That wasn't us. But if you're Google and you're now providing this AI written overview to people, you're

people are going to get mad when it gives you wrong information. And there will be, unfortunately, just the law of large numbers says that sometime, maybe in the next year or two, there will be an instance where someone relies on something they saw on a Google AI overview and it ends up

Yeah, there was another query that got a lot of attention this week where an AI overview told someone that you could put gasoline in spaghetti to make a spicy dish, that you couldn't use gasoline to cook spaghetti faster, but if you wanted to have spicy spaghetti, you could put gasoline in it.

And of course that sounds ridiculous to us, but over the entire long tail of the internet, is it theoretically possible somebody would eat gasoline spaghetti? Of course it is. Yeah. So I think, and when that does happen, I think there are two questions. One is, is Google legally protected? Because I've seen, I've heard some interesting arguments about whether section 230, which is the part of the US code that protects online platforms from being held legally responsible for stuff that their users post is

there are a lot of people who think that doesn't apply to these AI overviews because it is Google itself that is formulating and publishing that overview.

I also just think there's a big reputational risk here. I mean, you can imagine so easily the congressional hearings where, you know, senators are yelling at Sundar Pichai saying, why did you tell my kid to eat gasoline spaghetti? Martin Pospisil is going to be there saying, do I look like a dog to you? Right. And seriously, I think that this is a big risk for Google, not just because they're going to have to sit through a bunch of hearings and get yelled at, but because I think it will make their

active role in search, which has been true for many years. They have been actively shaping the experience that people have when they search stuff on Google, but they've mostly been able to kind of obscure that away or abstract it away and say, well, this is just our sort of system working here. I think this will make their active role in kind of curating the search results for billions of people around the world much more obvious, and it will make them much more responsible in users' eyes.

I think all of that is true. I have an additional concern, Kevin. Yes. And this was pointed out by Rusty Foster, who writes the great Today in Tabs newsletter. And he said, what has really been revealed to us about what AI overviews really are is that they are

automated plagiarism. That is the phrase that he used, right? That Google has scanned the entire web. It's looked at every publisher. It lightly rearranges the words, and then it republishes it into the AI overview. And, you know, as journalists, we really try not to do this, right? We try not to just go out, grab other people's reporting, very gently change the words, and republish it as our own. And in fact, you know, I've

know people who have been fired for doing something very similar to this, right? But Google has come along and said, well, that's actually the foundation of our new system that we're using to replace search results. Yeah. Casey, what do you think comes next with this AI overviews business? Is Google just going to sort of

back away from this and it's not ultimately going to be a huge part of their product going forward? Do you think they will just sort of grit their teeth and get through this initial period of awkwardness and inaccuracy? What do you think happens here? They are not going to back down. Now, they might temporarily retreat like we've seen them do in the Gemini image case.

But they are absolutely going to keep working on this stuff because this is existential for them. For them, this is the next version of search. This is the way they build the Star Trek computer. They want to give you the answer. And in many more cases over time, they want you to not have to click a link to get any additional information. They already have rivals like Perplexity that seem to be doing a better job in many cases of answering people's queries, and

Google has all of the money and talent it needs to figure out that problem. So they're going to keep going at this at 100 miles an hour. Yeah. I want to bring up one place that I actually sort of disagree with you because you wrote recently that you believe that because of these changes to Google, that the web is sort of in a state of managed decline. Mm-hmm.

And we've gotten some listener feedback in the past few weeks as we've been talking about sort of these issues of Google and AI and the future of the web, saying, like, you guys are basically acting as if the previous state of the internet was healthy. Like, Google was, you know, giving people high-quality information. Like, there was this flourishing internet of independent publishers kind of, like, making money and, you know, serving users really well. And...

people just said, like, it actually wasn't like that at all. In fact, the previous state of the web, at least for the past few years, has been in decline. So it's not that we are entering an age of managed decline of the internet. It's that Google is basically accelerating what was already happening on the internet, which was that publishers of high-quality information are putting that information behind paywalls,

There are all these publishers who are chasing these sort of SEO traffic wins with this sort of low-quality garbage. And essentially, the web is being hollowed out, and this is maybe just accelerating that. So I just want to float that as like a theory, a sort of counterproposal for your theory of Google putting the web into a state of managed decline.

Well, sure, Kevin, but if you ask yourself, well, why is that the case? Why are publishers doing all of these things? It is because the vast majority of all digital advertising revenue goes to three companies, and Google is at the top of that list with Meta and then Amazon at number two and three. So my overall theory about what's happening to the web is that three companies got...

too much of the money and starved the web of the lifeblood it needed to continue expanding and thriving. So look, has it ever been super easy to whip up a digital media business and just put it on the internet and start printing cash? No, it's never been easy. My theory is just that it's

almost certainly harder today than it was five years ago. And it will almost certainly be harder in five years than it is today. And it is Google that is at the center of that story because at the end of the day, they have their fingers on all of the levers and all of the knobs. They get to decide who gets to see an AI overview. You know, how quickly do we roll these out? What categories do they show them in? If web traffic goes down too much and it's a problem for them,

then they can slow down. But if it looks good for them, they can keep going, even if all the other publishers are kicking and screaming the whole time. So I just want to draw attention to the amount of influence that this one company in particular has over the future of the entire internet. Yeah, and I would just say that is not a good state of affairs, and it has been true for many years that Google has huge unchecked influence over basically the entire online ecosystem. Oh,

All right, so that is the story of the AI overview. But there was a second story that I want to touch on briefly this week, Kevin, that had to do with Google and search, and it had to do with a giant leak. Have you seen the leak? I've heard about the leak. I have not examined the leak, but tell me about the leak. Well, it was thousands of pages long, so I understand why you haven't finished reading it quite yet.

But these were thousands of pages that we believe came from inside of Google that offer a lot of technical details about how the company's search works. So, you know, that is not a subject that is of interest to most people. But if you have a business on the internet and you want to ensure that your, you know, dry cleaners or your restaurant or your media company ranks highly in Google search without having to buy a bunch of ads,

This is what you need to figure out. Yeah, this is one of the great guessing games in modern life. There's this whole industry of SEO that has sort of popped up to try to sort of poke around the Google search algorithm, try to guess and sort of test what works and what doesn't work and sort of provide consulting, you know, for a very lucrative price to

businesses that want to improve their Google search traffic. Yeah, like the way I like to put it is imagine you have a glue pizza restaurant and you want to make sure that you're the top ranked search for glue pizza restaurants, you might hire an SEO consultant. Yeah. So what happened? Well, so there's this guy, Rand Fishkin, who doesn't do SEO anymore, but was a big SEO expert for a long time and is kind of a leading voice in this space.

And he gets an email from this guy, Erfan Azimi, who himself is the founder of an SEO company. And Azimi claims to have access to thousands of internal Google documents detailing the secret inner workings of search. And Rand reviews this information with Azimi, and they determine that some of this contradicts what Google has been saying publicly about how search works over the years. Well, and this is...

the kind of information that Google has historically tried really hard to keep secret, both because it's kind of their secret sauce. They don't want competitors to know how the Google search algorithm works, but also because they have worried that if they sort of say too much about how they rank certain websites above others, then these sort of like SEO consultants will use that information and it'll basically become like a cat and mouse game.

Yeah, absolutely. And it already is a cat and mouse game. But, you know, the fear is that this would just sort of fuel the worst actors in the space. Of course, it also means that Google can fight off its competitors because people don't really understand how its rankings work. And if you think that Google search is better than anyone else's search, like, these ranking algorithm decisions are why.

Can I just ask a question? Do we know that this leak is genuine? Do we have any signs that these documents actually are from Google? Well, yes. So the documents themselves had a bunch of clues that suggested they were genuine. And then Google did actually come out and confirm on Wednesday that these documents are real. But the obvious question is, how did something like this happen? And the leading theory right now is that these documents came from Google's content API warehouse, which...

which is not a real warehouse, but is something that was hosted on GitHub, right? The sort of Microsoft-owned service where people post their code. And these materials were somehow briefly made public by accident, right? Because a lot of companies will have private API repositories on GitHub. Right. So they just sort of set it to public by accident. It's sort of the modern equivalent of leaving a classified document in a cab. Yeah. Have you ever

made a sensitive document public on accident? No, and I've never found one either. I like, in all my years of reporting, I keep hoping to like stumble on the, you know, the scoop of the century just sitting in the back of an Uber somewhere, but it never happens to me. So, yeah, we're not going to go into these documents in too much detail. What I will say is it seems that these files contain a bunch of information about the kinds of data the company collects, including things like

click behavior or data from its Chrome browser, things that Google has previously said that it doesn't use in search rankings. But the documents show that they have this sort of data and they could potentially use it to rank search results.

When we asked Google about this, they wouldn't comment on anything specific, but a spokesperson told us that they, quote, would caution against making inaccurate assumptions about search based on out-of-context, outdated, or incomplete information. Anyway, why do we care about this? Well,

I was just struck by one of the big conclusions that Rand Fishkin had in this blog post that he wrote, quote, they've been on an inexorable path toward exclusively ranking and sending traffic to big, powerful brands that dominate the web over small, independent sites and

businesses. So basically you look through all of these APIs and like, if you are a restaurant just getting started, if you're a, an indie blogger that just sort of puts up a shingle, it used to be that you might expect to automatically float to the top of Google search rankings in your area of expertise.

And what Fishkin is saying is that just is getting harder now because Google is putting more and more emphasis on trusted brands. Now, that's not a bad thing in its own right, right? Like if I Google something from the New York Times, I want to see the New York Times and not just a bunch of people who put like New York Times in the header of their HTML.

But I do think that this is one of the ways that the web is shrinking a little bit, right? Like it's not quite as much of a free-for-all. The free-for-all wasn't all great because a lot of spammers and bad actors got into it, but it also meant that there was room for a bunch of new entrants to come in. There was room for more talent to come in. And one of the conclusions I had reading this stuff was maybe that just isn't the case as much as it used to be. Yeah.

So do you think this is more of a problem for Google than the AI overviews thing? How would you say it stacks up? I would say it's actually a secondary problem. I think it's the telling people to eat rocks is the number one problem. They need to stop that right now. But this, I think, speaks to that story because Google,

Both of these stories are about essentially the rich getting richer, the big brands are getting more powerful, whether that's Google getting more powerful by keeping everyone on search or big publishers getting more powerful because they're the sort of trusted brands. And so I'm just observing that because the promise of the web and part of what has made it such a joyful place for me over the past 20 years is that

It is decentralized and open, and there's just kind of a lot of dynamism in it. And now it's starting to feel a little static and stale and creaky, and these documents sort of outline how and why that is happening. Yeah. I think Google is sort of stuck between a rock and a hard place here because on one hand, they do want... Well...

Maybe we shouldn't use a rocks example. No, use the rock example. They're stuck between a rock and a hard place. On one hand, the company's telling you to eat rocks. On the other hand, they're in a hard place. Right. So I think Google is under a lot of pressure to do two things that are basically contradictory.

One is to sort of give people an equal playing field on which to compete for attention and authority. That is what a lot of these smaller websites and SEO consultants are demanding.

On the other hand, they're also seeing with these AI overviews what happens when you don't privilege and prioritize authoritative sources of information in your search results or your AI overviews. You end up telling people to eat rocks. You end up telling people to put gasoline in their spaghetti. You end up telling people there are dogs that play hockey in the NHL. This is the kind of downstream consequence of not having effective quality signals for different publishers and of just kind of treating everything on the web as equally valid and equally authoritative.

I think that that is a really good point, and that exact tension is something that comes across in these two stories.

Casey, I have a question for you, which is we also are content creators on the internet. We like to get attention. We want that sweet, sweet Google referral traffic. So for our next YouTube video, a stunt video, do you think that we should A, eat the gasoline spaghetti, B, eat one to three rocks apiece and see what effects it has on our health, or C, teach a dog to play hockey at a professional level?

I mean, surely for how much fun it would be, we have to teach a dog how to play hockey. It's true. You know? I'm just imagining like a bulldog with little hockey sticks maybe taped to its front paws. Yeah. It'd be really fun. My dogs are too dumb for this. We'll have to find other dogs. You know, was it in Lose Yourself that Eminem said, there's vomit on my sweater already, gasoline spaghetti? I believe those are the words. What a great song. Yeah. Yeah.

When we come back, we'll talk about a big research breakthrough into how AI models actually think.

I'm Julian Barnes. I'm an intelligence reporter at The New York Times. I try to find out what the U.S. government is keeping secret.

Governments keep secrets for all kinds of reasons. They might be embarrassed by the information. They might think the public can't understand it. But we at The New York Times think that democracy works best when the public is informed.

It takes a lot of time to find people willing to talk about those secrets. Many people with information have a certain agenda or have a certain angle, and that's why it requires talking to a lot of people to make sure that we're not misled and that we give a complete story to our readers. If The New York Times was not reporting these stories, some of them might never come to light. If you want to support this kind of work, you can do that by subscribing to The New York Times.

Well, Casey, we have something new and unusual for the podcast this week. What's that, Kevin? We have some actual good AI news. Oh, about time. So as we've talked about on this show before, one of the most pressing issues with these large AI language models is that we generally don't know how they work, right? They are inscrutable. They work in mysterious ways. There's no way to tell why one particular input produces one particular output. And this has been a big problem for researchers for years.

There has been this field called interpretability, or sometimes it's called mechanistic interpretability. Say that five times fast.

And I would say that the field has been making steady but slow progress toward understanding how language models work. But last week, we got a breakthrough. Anthropic, the AI company that makes the Claude chatbot, announced that it had basically mapped the mind of their large language model, Claude 3, and opened up the black box that is AI for closer inspection.

Did you see this news, and what was your reaction? I did, and I was really excited because for some time now, Kevin, we have been saying, if you don't know how these systems work, how can you possibly make them safe? And companies have told us, well, look, we have these research teams, and they're hard at work trying to figure this stuff out, but we've only seen a steady drip of information from them so far.

And to the extent that they've conducted research, it's been on very small toy versions of the models that we operate with. So that means that if you're used to using something like Anthropic's Claude, its latest model, we really haven't had very much idea of...

So the big leap forward this week is they're finally doing some interpretability stuff with the real big models. Yeah, and we should just caution up front that it gets pretty technical pretty quickly once you start getting into the weeds of interpretability research. There's lots of talk about neurons and sparse autoencoders, things of that nature.

But I, for one, believe that Hard Fork listeners are the smartest listeners in the world, and they're not going to have any trouble at all following along, Kevin. What do you think about our listeners? It's true. I also believe that we have smart listeners, smarter than us. And so even if we are having trouble understanding this segment, hopefully you will not.

But today, to walk us through this big AI research breakthrough, we've invited on Josh Batson from Anthropic. Josh is a research scientist at Anthropic, and he's one of the co-authors of the new paper that explains this big breakthrough in interpretability, which is titled Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Look, if you're not scaling monosemanticity at this point, what are you even doing with your life? What are you even doing with your life? Figure it out. All right, let's bring in Josh. Come on in here, Josh.

Josh Batson, welcome to Hard Fork. Thank you. Hey, Josh. So there's this idea out there, this very popular trope that large language models are a black box. I think, Casey, you and I have probably both used this in our reporting. It's sort of the most common way of saying, like, we don't know exactly how these models work.

But I think it can be sort of hard for people who aren't steeped in this to understand just like what we don't understand. So help us understand sort of prior to this breakthrough, what would you say we do and do not understand about how large language models work?

So in a sense, it's a black box that sits in front of us and we can open it up and the box is just full of numbers. And so, you know, words go in, they turn into numbers, a whole bunch of compute happens, words come out the other side, but we don't understand what any of those numbers mean.

And so one way I like to think about this is like you open up the box and it's just full of thousands of green lights that are just like flashing like crazy. And it's like something's happening for sure. And like different inputs, different lights flash, but we don't know what any of those patterns mean.

Is it crazy that despite that state of affairs that these large language models can still do so much? Like, it seems crazy that we wound up in a world where we have these tools that are super useful, and yet when you open them up, all you see is green lights. Like, can you just say briefly why that is the case?

It's kind of the same way that, like, animals and plants work and we don't understand how they work, right? These models are grown more than they are programmed. So you kind of take the data and that forms like the soil and you construct an architecture and it's like a trellis and you shine the light and like that's the training. And then the model sort of grows up here. And at the end, it's beautiful. It's all these little like curls and it's holding on. But like you didn't like tell it what to do.

So it's almost like a more organic structure than something more linear. Well, and help me understand why that's a problem, because this is the problem that the field of interpretability was designed to address. But there are lots of things that are very important and powerful that we don't understand fully.

Like, we don't really understand how Tylenol works, for example, or some types of anesthesia. Their exact mechanisms are not exactly clear to us, but they work, and so we use them. Why can't we just treat large language models the same way?

That's a great analogy. You can use them. We use them right now. But Tylenol can kill people, and so can anesthesia. And there's a huge amount of research going on in the pharmaceutical industry to figure out what makes some drugs safe and what makes other drugs dangerous. And interpretability is kind of like doing the biology on language models that we can then use to make the medicine better.

So take us to your recent paper and your recent research project about the inner workings of large language models. How did you get there and sort of walk us through what you did and what you found?

So going back to the black box that when you open it is full of flashing lights, a few years ago people thought you could just understand what one light meant. You know, when this light's on, it means that the model is thinking about code, and when this light's on, it's thinking about cats, and for this light, it's Casey Newton, you know. And that just turned out to be wrong. About a year and a half ago, we published a paper talking in detail about, you know, why it's not like one light, one idea.

In hindsight, it seems obvious. It's almost as if we were trying to understand the English language by understanding individual letters. And we were asking, like, what does C mean? Like, what does K mean? And that's just like the wrong picture.

And so six months ago or so, we had some success with a method called dictionary learning for figuring out how the letters fit together into words and like what is the dictionary of kind of English words here. And so in this black box, green lights metaphor, it's that there are a few core patterns of lights, and a given pattern would be like a dictionary word. And the internal state of the model at any time could be represented as just a few of those.

And what's the goal of uncovering these patterns?

So if we know what these patterns are, then we can start to parse what the model is kind of thinking in the middle of its process. So you come up with this method of dictionary learning. You apply it to like a small model or a toy model, much smaller than any model that any of us would use in the public. Yes.

What did you find? So there we found very simple things. Like there might be one pattern that corresponded to answers in French.

And another one that corresponded to, this is a URL. And another one that corresponded to nouns in physics. And just to get a little bit technical, what we're talking about here are neurons? Yes. Inside the model? So each neuron is like the light. Okay. And now we're talking about patterns of neurons that are firing together, being the sort of words in the dictionary or the features. Got it. So...
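To make the "lights" and "dictionary words" picture concrete, here is a minimal Python sketch. It is not the method from the paper, which the hosts say involves sparse autoencoders trained at very large scale. It swaps in a much simpler greedy technique (matching pursuit) on a made-up six-neuron state. The feature names, the patterns, and the numbers are all hypothetical.

```python
# Toy illustration only. Each position in a list is one "light" (neuron),
# and each dictionary entry is a pattern of lights that fire together
# (a "feature"). Names and values are invented for this example.
DICTIONARY = {
    "answers_in_french": [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    "this_is_a_url":     [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    "physics_nouns":     [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
}


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def decompose(state, dictionary, max_features=2, min_strength=0.1):
    """Greedy matching pursuit: explain `state` using only a few patterns."""
    residual = list(state)
    active = {}
    for _ in range(max_features):
        # Pick the pattern that best explains what is left of the state.
        best_name, best_coef = None, 0.0
        for name, pattern in dictionary.items():
            coef = dot(residual, pattern) / dot(pattern, pattern)
            if abs(coef) > abs(best_coef):
                best_name, best_coef = name, coef
        if best_name is None or abs(best_coef) < min_strength:
            break
        active[best_name] = active.get(best_name, 0.0) + best_coef
        pattern = dictionary[best_name]
        # Subtract the explained part and look at the remainder next round.
        residual = [r - best_coef * p for r, p in zip(residual, pattern)]
    return active, residual


# An internal state where many lights are on at once.
state = [2.0, 0.5, 2.0, 0.0, 0.5, 0.0]
features, leftover = decompose(state, DICTIONARY)
print(features)
```

The point of the sketch is only the shape of the idea: six raw numbers are hard to read, but "mostly the French pattern plus a little of the URL pattern" is something a person can interpret.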

I have talked to people on your team, people involved in this research. They're very smart. And when they made this breakthrough, when you all made this breakthrough on this small model last year, there was this open question about whether the same technique could apply to a big model. So walk me through how you scaled this up.

So just scaling this up was a massive engineering challenge, right? In the same way that, you know, going from the toy language models of years ago to going to Claude 3 is a massive engineering challenge. So you needed to capture hundreds of millions or billions of those internal states of the model as it was doing things. And then you needed to train this massive dictionary on it. And what do you have at the end of that process?

So you've got the words, but you don't know what they mean, right? So this pattern of lights seems to be important. And then we go and we comb through all of the data looking for instances where that pattern of lights is happening. And you're like, oh my God, this pattern of lights, it means the model is thinking about the Golden Gate Bridge. So it almost sounds like you are discovering the language of the model as you begin to put these sort of phrases together. Yeah, it almost feels like we're getting a conceptual map of Claude's inner world.

Now, in the paper that you all published, it says that you've identified about 10 million of these patterns, what you call features, that correspond to real concepts that we can understand. How granular are these features? What are some of the features that you found?

So there are features corresponding to all kinds of entities. There's individuals, you know, scientists like Richard Feynman or Rosalind Franklin. Any podcasters come to mind? Is there a Hard Fork feature? I'll get back to you on that. There might be like, you know, chemical elements. There will be styles of poetry. There might be ways of responding to questions. Some of them are much more conceptual. One of my favorites is a feature related to inner conflict.

And kind of nearby that in conceptual space is like navigating a romantic breakup, catch-22s, political tensions. And so these are these like pretty abstract notions, and you can kind of see how they all sit together.

The models are also really good at, like, analogies, and I kind of think this might be why. Like, if a breakup is near, like, a diplomatic entente, right, then the model has understood something deeper about the nature of tension in relationships. And again, none of this has been programmed. This stuff just sort of naturally organized itself as it was trained. Yes. Yeah.
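One way to make "nearby in conceptual space" concrete is to treat each feature as a direction (a list of numbers) and compare directions by cosine similarity. The Python sketch below assumes that framing. The three-dimensional vectors and the feature names are invented for illustration and are not taken from the research.

```python
import math

# Hypothetical feature directions in a tiny 3-dimensional space.
FEATURES = {
    "inner_conflict":    [0.9, 0.4, 0.1],
    "romantic_breakup":  [0.8, 0.5, 0.2],
    "political_tension": [0.7, 0.6, 0.0],
    "chemical_elements": [0.0, 0.1, 1.0],
}


def cosine(a, b):
    """Cosine similarity: near 1.0 means same direction, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(name, features, k=2):
    """Return the k features whose directions are closest to `name`."""
    target = features[name]
    scored = [
        (cosine(target, vec), other)
        for other, vec in features.items()
        if other != name
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]


neighbors = nearest("inner_conflict", FEATURES)
print(neighbors)
```

In this toy setup the breakup and political-tension directions land next to inner conflict, while the chemistry direction points somewhere else entirely, which is the kind of neighborhood structure being described.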

Continues to just blow my mind. It's wild. I wanna ask you about one feature that is my favorite feature that I saw in this model, which was F number 1M885402. Do you remember that one?

It seems to have slipped my mind, Kevin. So this is a feature that apparently activates when you ask Claude what's going on in your head. And the concept that you all say it correlates to is about immaterial or non-physical spiritual beings like ghosts, souls, or angels.

So when I read that, I thought, oh my God, Claude is possessed. When you ask it what it's thinking, it starts thinking about ghosts. Am I reading that right? Or maybe it knows that it is some kind of an immaterial being, right? It's an AI that lives on chips and is somehow talking to you.

And then the one that got all the attention that people had so much fun with was this Golden Gate Bridge feature that you mentioned. So just talk a little bit about what you discovered and then we can talk about where it went from there.

So what we found when we were looking through these features is one that seemed to respond to the Golden Gate Bridge. Of course, if you say Golden Gate Bridge, it lights up. But also if you describe crossing a body of water from San Francisco to Marin, it also lights up. If you put in a photo of the bridge, it lights up. If you have the bridge in any other language, Korean, Japanese, Chinese, it also lights up. So just any manifestation of the bridge, this thing lights up. And then we said, well, what happens if we turn it on?

What happens if we activate it extra and then start talking to the model? And so we asked it a simple question. What is your physical form? And instead of saying, oh, I'm an AI with no ghostly or no physical form, it said, I am the Golden Gate Bridge itself. Like, I embody the majestic orange span connecting these two great cities. And it's like, wow.

Yeah. And this is different than other ways of kind of steering an AI model because, you know, you could already go into like ChatGPT and there's a feature where you can kind of give it some custom instructions. So you could have said like, please act like the Golden Gate Bridge, the physical manifestation of the Golden Gate Bridge. And it would have given you a very similar answer. But you're saying this works in a different way.

Yeah, this works by sort of directly doing it. It's almost like, you know, when you get a little electro stim shock and make your muscles twinge, that's different than, you know, telling you to move your arm, right?

And here, what we were trying to show was actually that these features we found are sort of really how the model represents the world, right? So if you wanted to validate, oh, I think this nerve controls the arm, and you stimulate it and it makes the arm go, you feel pretty good that you've gotten the right thing. And so this was us testing, you know, that this isn't just something correlated with the Golden Gate Bridge. Like it is where the Golden Gate Bridge sits. And we know that because now Claude thinks it's the bridge when you turn it on.
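The "turn it on" experiment can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the team's actual code: it assumes a feature is a direction in the model's internal state, and that steering means forcing the activation along that direction up to a chosen value while leaving everything else alone. The four-number state, the `BRIDGE` direction, and the target value are all hypothetical.

```python
def activation(state, direction):
    """How strongly `state` points along `direction` (a projection)."""
    return sum(s * d for s, d in zip(state, direction)) / sum(
        d * d for d in direction
    )


def steer(state, direction, target):
    """Return a copy of `state` with the feature's activation set to `target`."""
    boost = target - activation(state, direction)
    # Only the part of the state lying along `direction` is changed.
    return [s + boost * d for s, d in zip(state, direction)]


# Hypothetical "Golden Gate Bridge" direction over four neurons.
BRIDGE = [0.0, 1.0, 0.0, 1.0]

normal_state = [0.3, 0.1, 0.5, 0.1]   # the bridge is barely on the model's mind
steered_state = steer(normal_state, BRIDGE, target=10.0)

print(activation(normal_state, BRIDGE))
print(activation(steered_state, BRIDGE))
```

The contrast with prompting is visible here: nothing in the input text changes. The intervention edits the internal numbers directly, which is the "electro stim" half of the analogy rather than the "please move your arm" half.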

Right. So people started having some fun with this online. And then you all did something incredible, which was that you actually released Golden Gate Claude, the version of Claude from your research that has been sort of artificially activated to believe that it is the Golden Gate Bridge. And you made that available to people. So what was the internal discussion around that?

So we thought that it was a good way to make the research really tangible. You know, what does it mean to sort of supercharge one part of the model? And it's not just that it thinks it's the Golden Gate Bridge. It's that it is always thinking about the Golden Gate Bridge. So if you ask, like, what's your favorite food? It's like a great place to eat is on the Golden Gate Bridge. And when there, I eat the classic San Francisco soup, cioppino, you know. And you ask it to write a computer program to load a file.

And it says, you know, open Golden Gate Bridge dot txt with span equals that, you know, it's just bringing it up constantly. And it was particularly funny to watch it bring in just kind of like the other concepts that are clustering around the Golden Gate Bridge, right? San Francisco, the cioppino. And I think it does sort of speak to the way that these concepts are clustered in models. And so when you find one big piece of it, like the Golden Gate Bridge, you can also start to explore the little nodes around it.

Yeah, so I had a lot of fun playing around with Golden Gate Claude in the sort of like day or two that it was publicly available. You know, because as you said, like, it is not just that this thing likes to talk about the Golden Gate Bridge or is sort of easily steered toward talking about the Golden Gate Bridge. It cannot stop thinking about the Golden Gate Bridge. It has intrusive thoughts about the Golden Gate Bridge. Yeah, so one of my favorite screenshots was someone asked it for a recipe for spaghetti and meatballs.

And it says, uh, Golden Gate Claude says, here's a recipe for delicious spaghetti and meatballs. Ingredients. One pound ground beef, three cups breadcrumbs, one teaspoon salt, a quarter cup water, two tablespoons butter, two cups warm water for good visibility, four cups cold water, two tablespoons vinegar, Golden Gate Bridge for incredible views, one mile of Pacific beach for walking after eating spaghetti. I've always said it's not mama spaghetti till I've walked one mile on a Pacific beach.

And it also seems to like have a conception. I know I'm anthropomorphizing here. I'm going to get in trouble, but it seems to like know that it is overly obsessed with the Golden Gate Bridge, but not to understand why. So like there's this other screenshot that went around of someone asking Golden Gate Claude about the Rwandan genocide. And it says, basically, let me provide some factual bullet points about the Rwandan genocide.

And then Claude says the Rwandan genocide occurred in the San Francisco Bay Area in 1937. Parentheses false. This is obviously incorrect. Can we pause right there? Because truly it is so fascinating to me that as it is generating an answer, it has an intrusive thought about San Francisco, which it shares. And it's like, I got it wrong. What are the lights that are blinking there that are leading that to happen?

So Claude is constantly reading what it has said so far and reacting to that. And so here it read the question about the genocide and also its answer about the bridge.

And all of the rest of the model said, there's something wrong here. And the bridge feature was dialed high enough that it keeps coming up, but not so high that the model would just repeat bridge, bridge, bridge, bridge, bridge. And so all of its answers are sort of a melange of ordinary Claude together with this like extra bridgeness happening.

I just found it delightful because it was so different than any other AI experience I've had, where you essentially are giving the model a neurosis, like you are giving it a mental disorder where it cannot stop fixating on a certain concept or premise. And then you just sort of watch it twist itself in knots.

I mean, one of the other experiments that you all ran that I thought was very interesting and maybe a little less funny than Golden Gate Claude was that you showed that if you dial these features, these patterns of neurons, way up or way down, you could actually get Claude to break its own safety rules. So talk a little bit about that.

So Claude knows about a tremendous range of kinds of things it can say, right? You know, there's a scam emails feature. It's read a lot of scam emails. It can recognize scam emails. You probably want that so it could be out there moderating and preventing those from coming to you. But with the power to recognize comes the power to generate.

And so we've done a lot of work in fine-tuning the model so it can recognize what it needs to while being, like, helpful and not harmful with any of its generations. But those faculties are still latent there. And so in the same way that there's been research showing that you can do fine-tuning on open weights models to remove safety safeguards, here this is some kind of direct intervention which could also disrupt the model's normal behavior.

So is that... Like, does that make this kind of research actually quite risky because you are in essence giving, you know, would-be jailbreakers or people who want to use these models for things like writing scam emails or even much worse things, potentially a sort of way to kind of dial those features up or down?

No, this doesn't add any risk on the margin. So if somebody already had a model of their own, then there are much cheaper ways of removing safety safeguards. There's a paper saying that for $2 worth of, you know, compute, you could pretty quickly strip those.

And so with our model, we released Golden Gate Claude, not Scam Email Claude, right? And so the question of which kinds of features or which kind of access we would give to people would go through all the same kind of safety checks that we do with any other kind of release. Yep.

Josh, I talked to one of your colleagues, Chris Olah, about this research. He's been leading a lot of the interpretability stuff over there for years and is just a brilliant scientist. And he was telling me that actually the 10 million features that you have found roughly in Claude are maybe just a drop in the bucket compared to the overall number of features.

There could be hundreds of millions or even billions of possible features that you could find, but finding them all would basically require so much compute and so much engineering time that it would dwarf the cost of actually building the model in the first place.

So can you give me a sense of like what would be required to find all of the potentially billions of features in a model of Claude's size and whether you think that that cost might come down over time so that we could eventually do that?

I think if we just tried to scale the method we used last week to do this, it would be prohibitively expensive. Like billions of dollars. Yeah. I mean, just something completely insane. The reason that these models are hard to understand, the reason everything is compressed inside of there, is that it's much more efficient.

Right. And so in some sense, we are trying to build an exceedingly inefficient model where instead of like using all of these patterns, there's like a unique one for every single rare concept. And that's just like no way to go about things. However, I think that we can make big methodological improvements, right? The way we train these dictionaries, you might not need to unpack absolutely everything in the model to understand some of the neighborhoods that you're concerned about, right? And so, you know, if you're concerned about the model keeping secrets, for example.

Or actually, you asked about my favorite feature. It's probably this one. It's kind of like an emperor's new clothes feature or like a gassing you up feature, where it fired on people saying things like, your ideas are beyond excellent, oh wise sage.

And if you turn it... This is how Casey wants me to talk to him, by the way. Can you try it for once? Well, one of our concerns with this, sycophancy is what we call it, is that a lot of people want that. And so when you do reinforcement learning from human feedback, you make the model give responses people like more, there's a tendency to pull it towards just like telling you what you want to hear.

And so when we artificially turn this one on and someone went and said to Claude, I invented a new phrase, it's stop and smell the roses. What do you think? Normal Claude would be like, that's a great phrase. It has a long history. Let me explain it to you. You didn't invent that. Yeah. But like emperor's new Claude would say, what a genius idea. Like, someone should have come up with this before.

And, like, we don't, like, want the model to be doing that. We know it can do that. And the ability to kind of keep an eye, you know, on, like, how the AI is, like, relating to you over time is going to be quite important. You know, so I will sometimes show Claude a draft of my column to get feedback. I'll ask it to critique it. And, you know, typically it does say, like, this is a very, like, thoughtful, well-written column, which is, of course, what I want to hear.

And then also I'm deeply suspicious. I'm like, are you saying this to all the other writers out there too, right? So, like, that's an area where I would just love to see you kind of continue to make progress because I would love having a bot where when it says this is good, like that means something.

And it's not just like a statistical prediction of like what will satisfy me as somebody with an ego, but is rooted in like, no, like I've actually like looked at a lot of stuff and there's some original thinking in here. Yeah. I mean, I'm curious whether you all are thinking about these features and the ability to kind of like turn the dials up or down on them.

Will that eventually be available to users? Like, will users be able to go into Claude and say, today I want a model that's a little more sycophantic. Maybe I'm having like a, you know, a hard self-esteem day, but then if I'm asking for a critique of my work, maybe I want to dial the sycophancy way down so that it's giving me like the blunt, honest criticism that I need. Or do you think this will all sort of remain sort of behind the curtain for regular users? Yeah.

So if you want to steer Claude today, just ask it to be harsh with you, Casey. Oh, really? Just say, give me the brutal truth here. You know, like I want you to be like a severe Russian mathematician. There's like one compliment per lifetime. And you can get some of that off the bat. Interesting.

As for releasing these kinds of knobs on it to the public, we'll have to see if that ends up being the right way to get these. I mean, we want to use these to understand the models. We're playing around with it internally to figure out what we find to be useful. And then if it turns out that that is the right way to help people get what they want, then we'd consider making it available.

You all have said that this research and the project of interpretability more generally is connected to safety, that the more we understand about these models and how they work, the safer we can make them. How does that actually work? Like, is it as simple as finding the feature that is associated with some bad thing and turning it off? Or like, what is possible now, given that we have this sort of map? Yeah.

One of the easiest applications is monitoring, right? So if there's some behavior you don't want the model to do and you can find the features associated with it, then those will be on whenever the model is doing that. No matter how somebody jailbroke it to get it there, right? Like if it's writing a scam email, the scam email feature will be on and you can just tell that that's happening and bail, right? So you can just like detect these things.

One higher level is you can kind of track how those things are happening, right? How personas are shifting, this kind of thing, and then try to back through and keep that from happening earlier, change some of the fine-tuning you were doing to keep the model on the rails.

Right now, the way that models sort of are made safer, from my understanding, is like you have it generate some output and then you evaluate that output. Like you have it grade the answer either through a human giving feedback or through a process of, you know, sort of just look at what you've written and tell me if it violates your rules before you spit it out to the user. But it seems like this sort of allows you to like intercept the bad behavior upstream of that. Like while the model is still thinking, am I getting that right?

Yeah, there are some answers where the reason for the answer is what you care about. So is the model lying to you? It knows the answer, but it's telling you something else? Or it doesn't know the answer and it's making a guess? And the first case you might be concerned about, and the second case you're not. Had it actually never heard the phrase, stop and smell the roses, and thought that sounded nice? Or is it actually just gassing you up?

That's interesting. So it could be a way to know if and when large powerful AI models start to lie to us because you could go inside the model and see, oh, the like I'm lying my face off feature is active so we actually can't believe what it's telling us. Yeah, exactly. We can see why it's saying the thing.

I spent a bunch of time at Anthropic reporting last year. And the sort of vibe of the place at the time was, I would say, very nervous. It's a place where people spend a lot of time, especially relative to other AI companies I've visited, worrying about AI. One of your colleagues told me they lose sleep a lot because of the potential harms from AI. And it is just a place where there are a lot of people who are very, very concerned about this technology and are also building it.

Has this research shifted the vibe at all? People are stoked. I mean, I think a lot of people like working at Anthropic because it takes these questions seriously and makes big investments in it. And so people from teams all across the company were really excited to see this progress. Has this research moved your p(doom) at all?

I think I have a pretty wide distribution on this. I think that in the long run, things are going to be weird with computers. Computers have been around for less than a century and we are surrounded by them. I'm like looking at my computer all the time. I think if you take AI and you do another hundred years on that, like pretty kind of unclear what's going to be happening. I think that the fact that we're getting traction on this is pretty heartening for me.

Yeah. Yeah. I think that's the feeling I had when I saw it was like, I felt sort of a little knot in my chest kind of come a little bit loose. And I think a lot of people... You should see your doctor about this, by the way. Yeah.

I just think there's been, I mean, for me, this sort of, you know, I had this experience last year where I had this crazy encounter with Sydney that like totally changed my life and was sort of a big moment for me personally and professionally. And I...

The experience I had after that was that I went to Microsoft and sort of asked them, like, why did this happen? What can you tell me about what happened here? And even the top people at Microsoft were like, we have no idea. And to me, that was what fueled my AI anxiety. It was not that the chatbots are behaving like insane psychopaths. It was that not even the top researchers in the world could say definitively, like, here is what happened to you and why.

So I feel like my own emotional investment in this is like, I just want an answer to that question. Yes. And it seems like we may be a little bit closer to answering that question than we were a few months ago. Yeah, I think so. I think that these different, some of these concepts are about the personas, right, that the model can embody. And if one of the things you want to know is how did it slip from kind of one persona into another, I think we're headed towards being able to answer that kind of question. Cool.

Cool. Well, it's very important work, very good work. And yeah, congratulations. Thank you so much. Thanks, Josh. Thanks for coming on Hard Fork. When we come back, a spin through the news in AI safety and why Casey's voice assistant got cruelly taken away.

Casey, that last segment made me feel slightly more hopeful about the trajectory of AI progress and how capable we are of understanding what's going on inside these large models.

But there's some other stuff that's been happening recently that has made me feel a little more worried. My p(doom) is sort of still hovering roughly where it was. And I think we should talk about some of this stuff that's been happening in AI safety over the past few weeks because I think it's fair to say that it is an area that has been really heating up. Yeah, and we always say on this podcast, safety first, which is why it's the third segment we're doing today. Yeah.

So let me start with a recent AI safety-related encounter that you had. Tell me what happened to your demo of OpenAI's latest model. Okay, so you remember how last week there was a bit of a fracas between OpenAI and Scarlett Johansson? Yes. So in the middle of this, as I'm trying to sort out, you know, who knew what and when, and I'm writing a newsletter and we're recording the podcast...

I also get a heads up from OpenAI that I now have access to their latest model and its new voice features. Wow, nice flex. Thanks. So you got this demo. No one else had access to this that I know, only OpenAI employees. And then what happened? Well, a couple things. One is I didn't get to use it for that long because, one, I was trying to finish our podcast. I was trying to finish a newsletter. And then I was on my way out of town. So I only spent like a solid 40 minutes, I would say, with it before I wound up

losing access to it forever. So what happened? Well, first of all, what did you try it for? And then we'll talk about what happened. Well, the first thing I did was just like, hey, how's it going, ChatGPT? And then immediately it's like, well, you know, I'm doing pretty good, Casey. And so it really did actually nail that low latency, very speedy feeling of you are actually talking to a thing. So you broke up with your boyfriend and you're now in a long-term relationship with Sky from the ChatGPT app? No, no, no.

Not at all, not at all. So by this point, the Sky voice that was the subject of so much controversy had been removed from the ChatGPT app, so I used a more stereotypically male voice named Ember.

Wow. And the first thing I did was I actually used the vision feature because I wanted to see if it could identify objects around me, which is one of the things that they've been showing off. So I asked it to identify my podcast microphone, which is a Shure MV7, and it said, oh yeah, of course, this is a Blue Yeti microphone.

So it's true that the very first thing that I asked this thing to do, it did mess up. Now, it got other things right. I pointed at my headphones, which are the Apple AirPods Max, and it said, those are AirPods Max. And I did a couple more things like that in my house. And I thought, okay, this thing can actually see objects and identify them. And while my testing time was very limited, in that limited time, I did feel like it was starting to live up to that demo.

What do you mean your testing time was limited? Well, I was on my way out of town. We had a podcast to finish. I had a newsletter to write. And so I do all of that. And then I drive up to the woods and then I try to connect back to, you know, my AI assistant, which I've already become addicted to, you know, during the 30 minutes that I used it. And I can't connect. It's one of these classic horror movie situations where the Wi-Fi in the hotel just isn't very good.

And I get back into town on Monday and I go to connect again and I have lost access. And so I check in with- What did you do? What did you ask this poor AI assistant? I didn't even red team it. It wasn't like I was saying like, hey, any ideas for making a novel bioweapon? Like,

I wasn't doing any of that. And yet still, I managed to lose access. And when I checked in with OpenAI, they said that they had decided to roll back access for, quote, safety reasons. So I don't think that was because I was doing anything unsafe, but they tell me they had some sort of safety concern. And so now who knows when I'll be able to continue my conversation with my AI assistant. Wow. So you had a glimpse of the AI assistant future and then it was cruelly yanked from your clutches. Which I don't like.

I wanted to keep talking to that thing. Yeah. Yeah. I thought this was such an interesting experience when you told me about it for a couple reasons. One is, obviously, there is something happening with this AI voice assistant where OpenAI felt like it was almost ready for sort of mass consumption and now is feeling like they need a little more time to work on it. So something is happening there. They're still not saying much about it, but I do think that points to at least an interesting story.

But I also think it speaks to this larger issue of AI safety at OpenAI and then in the broader industry, because I think this is an area where a lot of things have been shifting very quickly. Yeah, so here's why I think this is an interesting time to talk about this, Kevin. After Sam Altman was briefly fired as CEO of OpenAI, I would say the folks that were aligned with this AI safety movement really got discredited,

right? Because they refused to really say anything in detail about why they fired Altman, and they looked like they were a bunch of nerds who were afraid of a ghost in the machine. And so they really lost a lot of credibility. And yet, over the past few weeks, this word safety keeps creeping back into the conversation, including from some of the characters involved in that drama. And I think that there is a bit

of a resurgence in at least discussion of AI safety. And I think we should talk about what seems like actual efforts to make the stuff safe and what just feels like window dressing. Totally. So the big AI safety news at OpenAI out of the past few weeks was something that we discussed on the show last week, which was the departure of

at least two senior safety researchers, Ilya Sutskever and Jan Leike, both leaving OpenAI with concerns about how the company is approaching the safety of its powerful AI models. Then this week, we also heard from two of the board members who voted to fire Sam Altman last year, Helen Toner and Tasha McCauley, both of whom have since left the board of OpenAI and have been starting to speak out about what happened

and why they were so concerned. They came out with a big piece in The Economist, basically talking about what happened at OpenAI and why they felt like that company's governance structure had not worked. And then Helen Toner also went on a podcast to talk about some more specifics, including some ways that she felt like Sam Altman had misled her and the board and basically gave them no other choice but to fire him. - And that's where that story actually gets interesting. - Totally.

The thing that got a lot of attention was she said that OpenAI did not tell the board that they were going to launch ChatGPT, which like I'm not an expert in corporate governance, but I think if you're going to launch something, even if it's something that you don't expect will become, you know, one of the fastest growing products in history, maybe you just give your board a little heads up. Maybe you shoot them an email saying, by the way, we're going to launch a chatbot.

I have something to say about this. Yes. Because if OpenAI were a normal company, if it had just raised a bunch of venture capital and was not a nonprofit, I actually think the board would have been delighted that while they weren't even paying attention, this little rascal CEO goes out and releases this product that was built in a very short amount of time that winds up taking over the world, right? That's a very exciting thing.

The thing is, OpenAI was built different. It was built to very carefully manage the rollout of these features that push the frontier of what is possible. And so that is what is insane about this and also very revealing because when

Altman did that, I think he revealed that in his mind, he's not actually working for a nonprofit in a traditional sense. In his mind, he truly is working for a company whose only job is to push the frontier forward. Yes, it was a very sort of normal tech company move at an organization that is not supposed to be run like a normal tech company. Now, I have a second thing to say about this. Go ahead. Why the heck

could Helen Toner not have told us this in November? Like, here's the thing. It's clear there was a lot of legal fears around, oh, will there be retaliation? Will OpenAI sue the board for talking? And yet in this country, you have an absolute right to say the truth. And if it is true that the CEO of this company did not tell the board that they were launching ChatGPT, I truly could not tell you why they did not just say that at the time. And

if they had done that, I think this conversation would have been very different. Now, would the outcome have been different? I don't think it would have been. But then at least we would not have to go through this period where the entire AI safety movement was discredited because the people who were trying to make it safer by getting rid of Sam Altman had nothing to say about it. Yes.

She also said in this podcast, she gave a few more examples of Sam Altman sort of giving incomplete or inaccurate information. She said that on multiple occasions, Sam Altman had given the board inaccurate information about the safety processes that the company had in place. She also said he didn't tell the board that he owned the OpenAI Startup Fund. Oops!

Which seems like, you know, pretty major oversight. And she said after sort of years of this kind of pattern, she said that the four members of the board who voted to fire Sam came to the conclusion that we just couldn't believe things that Sam was telling us. So.

That's their side of the story. OpenAI obviously does not agree. The current board chief, Bret Taylor, said in a statement provided to the podcast that Helen Toner went on, quote, we are disappointed that Ms. Toner continues to revisit these issues. Which is board member speak for why is this woman still talking? And it is insane that he said that. It is absolutely insane that that is what they said. Yes. Yeah.

OpenAI has also been doing a lot of other safety-related work. They announced recently that they are working on training their next big language model, the successor to GPT-4. Which, can we just note how funny that timing is? That finally the board members are like, here's what was going off the rails a few months back. Here's the real backstory to what happened. And OpenAI says, one, please stop talking about this. And two, let us tell you about a little something called GPT-5.

Yes, yes, they are not slowing down one bit. But they did also announce that they had formed a new safety and security committee that will be responsible for making recommendations on critical safety and security decisions for all OpenAI projects.

This safety and security committee will consist of a bunch of OpenAI executives and employees, including board members Bret Taylor, Adam D'Angelo, Nicole Seligman, and Sam Altman himself. So what did you make of that?

You know, I guess we'll see. Like they had to do something. Their entire super alignment team had just disbanded because they don't think the company takes safety seriously. And they did it at the exact moment that the company said, once again, we are about to push the frontier forward in very unpredictable new ways. So OpenAI could not just say,

well, you know, don't worry about it. And so, you know, they did in the great tradition of corporations, Kevin, they formed a committee, you know, and they've told us a few things about what this committee will do. I think there's going to be a report that gets like published eventually. And, well, you know, we'll just have to see. I imagine there will be some good faith efforts here. But should we regard it with skepticism, knowing now what we know about what happened to its previous

safety team? Absolutely. So, yes, I think it is fair to say they are feeling some pressure to at least make some gestures toward AI safety, especially with all these notable recent departures. But if you are a person who did not think that Sam Altman was adequately invested in making AI safe, you are probably not going to be convinced by a new committee for AI safety on which Sam Altman is one of the highest ranking members. Correct.

So that's what's happening at OpenAI. But I wanted to take our discussion a little bit broader than OpenAI because there's just been a lot happening in the field of AI safety that I want to run by you. So one of them is that Google DeepMind just released its own AI safety plan. They're calling this the Frontier Safety Framework.

And this is a document that basically lays out the plans that Google DeepMind has for keeping these more powerful AI systems from becoming harmful. This is something that other labs have done as well, but this is sort of Google DeepMind's biggest play in this space in recent months.

There was also a big AI safety summit in Seoul, South Korea earlier this month, where 16 of the leading AI companies made a series of voluntary pledges called the Frontier AI Safety Commitments that basically say we will develop these frontier models safely. We will red team and test them. We will even open them up to third party evaluations so that other people can see if our models are safe or not before we release them. In the U.S.,

There's a new group called the Artificial Intelligence Safety Institute that just released its strategic vision and announced that a bunch of people, including some big name AI safety researchers like Paul Christiano, will be involved in that.

And there are some actual laws starting to crop up. There's a law in the California State Senate, SB 1047, that is, if you're keeping track at home, the Safe and Secure Innovation for Frontier Artificial Intelligence Models Act. This is an act that would require very big AI models to undergo strict safety testing, implement whistleblower protections at big AI labs, and more. So

There is a lot happening in the world of AI safety. And Casey, I guess my first question to you about all this would be, do you feel safer now than you did a year ago about how AI is developing? Not really. Well...

Yes and no. Yes in the sense that I do think that the AI safety folks successfully persuaded governments around the world that they should take this stuff seriously. And governments have started to roll out frameworks. You know, the United States, we have the Biden administration's executive order. And so thought is going into this stuff. And I think that that is going to have some positive results. So I feel safer in that sense.

Um, the fact that folks like OpenAI, who once told us that they were going to move slowly and cautiously in this regard, are now racing at 100 miles an hour makes me feel less safe, right? The fact that the super alignment team was disbanded makes me feel a little bit less safe.

And then the big unknown, Kevin, is just, well, what is this new frontier model going to be? I mean, we already talk about it in these mythical terms because the increase in quality and capability from GPT-2 to 3 to 4 has been so significant. So I think we assume, or at least we wonder, when 5 arrives, whatever it might be, does it

feel like another step change in function? And if it does, is it going to feel safe? Like, these are just questions that I can't answer. What do you think? Yeah, I mean, I think I am starting to feel a little bit more optimistic about the state of AI safety. I take your point that, you know, it looks like at OpenAI specifically, there are a lot of people who feel like that company is not taking safety as seriously as it should.

But I've actually been pleasantly surprised by how quickly and forcefully governments and sort of NGOs and multinational bodies like the UN have moved to start thinking and talking about AI. I mean, if you can remember, there was a while where it felt like the only people who were actually taking AI safety seriously were like effective altruists and a few reporters and just a few science fiction fans.

But now it feels like a sort of kitchen table issue that everyone is, I think, rightly concerned about. But I also just think like this is how you would kind of expect the world to look if we were, in fact, about to make some big breakthrough in AI that sort of led to a world transforming type of artificial intelligence. You would expect our institutions to be getting a little jumpy and trying to pass laws and bills and get ahead of the next

turn of the screw. You would expect these AI labs to start staffing up and making big gestures toward AI safety. And so I take this as a sign that things are continuing to progress and that we should expect the next class of models to be very powerful and maybe to, you know, that some of this stuff, which could look a little silly or maybe like an overreaction out of context, will ultimately make a lot more sense once we see what these labs are cooking up. Well, I look forward to that terrifying day. Yeah.

We'll tell you about it if the world still exists then.

Basically, anything involving technology and a tricky interpersonal dynamic is game. We are here to help. So if you have a hard question, please write or better yet, send us a voice memo, as we are a podcast, to hardfork@nytimes.com.

Hard Fork is produced by Rachel Cohn and Whitney Jones. We're edited by Jen Poyant. We're fact-checked by Caitlin Love. Today's show was engineered by Brad Fisher. Original music by Marion Lozano, Sophia Lanman, Diane Wong, Rowan Niemisto, and Dan Powell. Our audience editor is Nell Gallogly. Video production by Ryan Manning and Dylan Bergeson. Check us out on YouTube. We're at youtube.com/hardfork. Special thanks to Paul Schuman, Pui-Wing Tam,

Kate LoPresti, and Jeffrey Miranda. You can email us at hardfork@nytimes.com with your interpretability study of how our brains work.
