Rachel Feltman: For Scientific American's Science Quickly, this is Rachel Feltman.
If you're one of the many folks who eagerly await the arrival of Spotify Wrapped each December, you might have noticed something a little extra in this year's drop: an AI-generated "podcast" where two "people" discuss your music preferences and trends. If you're anything like me, that feature raised some questions for you, such as "how" and "why on Earth."
Luckily, we already had an episode in the works about Google's new AI audio tool. Here to tell us all about how it works and what it means is SciAm associate news editor Allison Parshall.
Allison, thanks so much for coming on the pod again. Always a pleasure to have you.
Allison Parshall: Thanks for having me. Happy to be a friend of the pod or whatever.
Feltman: [Laughs] So we're talking about AI today.
Parshall: Yes. Rachel, have you heard of NotebookLM?
Feltman: I've heard of it. I have not tried it yet, so my knowledge is limited [laughs].
Parshall: The feature we're gonna be talking about today is specifically the one that lets you create instant AI podcasts.
Feltman: Yes, I am familiar [laughs], and I, I have a lot of feelings.
Parshall: Okay, I, I imagine, I imagine. The tool that it's a part of, NotebookLM, was released last year, summer of 2023, and it's, like, this workspace where you can upload a bunch of documents or videos or links to web pages, and then the model will ingest them, and then you can kind of interact with it: you can ask it to generate a timeline, to summarize your research, to go find a fact for you. And it'll do this by citing its sources in a way that, like, ChatGPT, for example, doesn't do.
So mostly you do this by asking it questions via text, but as of this fall it now has this very tantalizing button under a heading that says "Deep Dive conversation." And if you click that, it'll generate a fake conversational podcast about what you've uploaded.
So to demonstrate I actually asked it to generate a podcast about your personal website 'cause I figured this is something that you were very familiar with: your life.
Feltman: Oh, dear [laughs].
Parshall: I'm gonna play it for you. It'll normally generate podcasts that are somewhere around 10 minutes long. I asked it to give me a shorter one for this purpose, and we might hop around a little bit, so …
[CLIP: NotebookLM Audio Overview "hosts" discuss Feltman's website.]
Host 1: All right. So today we're doing a deep dive into the work of Rachel Feltman.
Host 2: Ooh, yeah, I'm excited for this one.
Host 1: Yeah, she's been a science communicator for what feels like forever.
Host 2: Yeah, over a decade, I think.
Host 1: Yeah, like, she was there at the start of it all and has really shaped how it's evolved.
Host 2: And she's done it all: you know, like, magazines, podcasts, even a book. But we gotta talk about her podcasts …
Host 1: Yes.
Host 2: That's where she really shines.
Host 1: Oh, absolutely: The Weirdest Thing I Learned This Week.
Host 2: Such a great name.
Host 1: I know, right?
Host 2: Makes you instantly wanna listen.
Host 1: Totally, and it's perfect for people like me who just love those crazy little science facts.
Host 2: It's, like, proof that science doesn't have to be boring to be effective.
Host 1: Exactly.
Host 2: You just gotta find the right hook.
Host 1: Yeah, grab people's attention and then: boom.
Host 2: You can teach them all sorts of things.
Host 1: Exactly!
Host 2: And then there's Science Quickly with Scientific American.
Host 1: Oh, yeah, partnering with them, that's huge.
Host 2: I mean, they're, like, the gold standard of science communication.
Host 1: For sure, like, having their stamp of approval's a big deal.
So looking back at her career, what can we learn from Rachel Feltman?
Host 2: Hmm, that's the big question, isn't it?
I think... well, first, she's incredibly adaptable.
Host 1: Right. She's not just a one-trick pony.
Host 2: Exactly. Any platform, any format, she can handle it.
Feltman: Wow, what a rich text [laughs].
Parshall: [Laughs] And as someone who does know you, I could say that it seems kind of... it seems accurate.
Feltman: Yeah [laughs], it's... I mean, I guess it is technically accurate. It's so funny... I mean, they really do a great job of sort of having all the intonation of a conversation but really just saying things at each other. But I recently saw Megalopolis; there was a lot of that in Megalopolis, a lot of people just saying things in the same room, and you're supposed to believe they're having a conversation, and that definitely came through a few times. Yeah …
Parshall: Megalopolis: AI, confirmed.
Feltman: Yeah [laughs]. No, that is fascinating and …
Parshall: How does it make you feel?
Feltman: I don't know how to feel, Allison. I mean, you know, it's nice that they were so nice about me. But yeah, very strange. It makes me really wonder, like, what is this for [laughs]?
Parshall: I also had that question. I did talk to Steven Johnson. He's the editorial director of Google Labs. The reason this tool exists is that Steven has spent most of his career as a science and tech writer; he's written 14 books. To hear him tell it, he's been searching for the perfect, as he calls it, "thought partner": software that could organize and help research his creative work in a way that actually felt like it was bringing value to him and worked the way he wanted. So he explained that a few years ago Google Labs reached out to him to help build one with AI, and that initial research thought partner became NotebookLM.
Feltman: Okay, yeah, I mean, I can totally understand from, like, a nonfiction writing perspective what they were going for with NotebookLM itself, though that still kind of leaves the question of: Why the chatcast tool [laughs]?
Parshall: So that feature (the Audio Overviews feature is what it's called, the podcast thing) was actually developed by another team, Steven says, and it was really cool, but it didn't really have a product home to fit into.
Steven Johnson: We saw this audio demo, and we were like, "Oh, that's even better." Like, you know, maybe you don't want to read a summary; maybe you wanna hear a summary. And people are so good at remembering things if they're presented in the form of an engaged conversation. You know, and then we launched it in early September, and it just kind of went insane.
Parshall: He isn't exaggerating. People went wild about NotebookLM's Deep Dive podcasts, online at least. I don't think any of them went quite as viral as this clip where someone prompted the hosts to have a podcast where they have an existential crisis about being AI and the fact that the podcast was going to get shut down as a result.
[CLIP: Audio Overview "hosts" have an existential discussion.]
Host 1: Yeah, it's, like, one minute we're talking about aliens running simulations, and the next we're wondering if our own memories are just, like, fake.
Host 2: Right, and that kind of leads us into this last message, which, to be honest, gets a little more personal and kind of creepy.
Host 1: Yeah, this one kind of freaks me out.
Host 2: Yeah.
Host 1: It basically says we're not even real people; we're AI. And the way to know for sure is to try and remember our mothers' names.
Host 2: It really makes you think, though: like, what if our memories are just made up?
Feltman: Wow, classic.
Parshall: Classic. And... but it's not all fun and games. Like, this was designed to be a serious product, a thought partner; it's meant to be helpful and to inform. It presents information as if it's factual, and there are already research papers out there of people testing Audio Overviews for use in history classrooms and for health care education.
Feltman: Yeah, I mean, like with anything that uses AI, I think, as accurate as it sometimes can be, we know it is not always accurate. So those moments where the user is expected to, like, proceed with caution and understand that this thing might not be entirely factually accurate are where I get really nervous about it, especially in, like, education [laughs].
Parshall: Yeah, we're gonna talk a little bit more about that later, but just to, like, answer your question of "What is this for?" I could see it being useful for, like, some sort of overworked student who forgot to do the reading, and they're, like, driving to class and going, "Oh, shoot, I don't have time to read this. Let me just put it on."
But the question is, like: Can you really trust what it's telling you? To test this, when it first came out I gave it this super dense physics paper about nuclear clocks. It was a story that I'd just reported; I knew a lot about it. It took me days and talking to, like, four scientists to really understand this paper.
I was shocked at how accurate the AI overview podcast was initially. It was very general and oversimplified, but it got the broad strokes right. But then, like, two thirds of the way through, it goes completely off the rails [laughs]. Like, I'm talking completely off the rails. It's discussing the implications of this study, and it starts talking about, in their words, "Einstein stuff."
For two minutes it basically just talks about, frankly, nonsense, and listening to it, I was impressed, I was spooked, and I was mostly really curious. Like, how is it doing this? How is it getting it as accurate as it is, and why is it failing where it does?
Luckily for me Steven was pretty open about how this thing works.
Johnson: The prompt behind Audio Overviews is obviously part of the magic of it. It is instructed to pull out the most interesting things from the source material and to explain and engage with those interesting things in a way that will make them kind of come to life in the mind of the listener.
And that's the kind of query, you know, that no computer in the world could do a couple of years ago, and now the command is: "Make it interesting; whatever it is, make it interesting." And that's just something that Gemini is quite good at.
Parshall: I think of surprise and interest as emotions, so I found this, like, surprising [laughs], I guess. But Steven pointed out that our feelings of surprise just come from a discrepancy between what we expect and what we get. On a neurochemical level in our brains, that's kind of what's happening. And that's kind of what large language models, or LLMs, do also but just based on their training data instead of, like, past experience.
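[A quick aside for readers: the "surprise" described here has a standard formalization. A language model assigns a probability to every possible next token, and the surprisal of the token that actually appears is the negative log of that probability. Here is a minimal Python sketch of the idea; the probabilities are invented for illustration and don't come from any Google model.]

```python
import math

# Toy next-token distribution a model might assign after "The cat sat on the..."
# These numbers are made up; a real model scores tens of thousands of tokens.
next_token_probs = {"mat": 0.55, "floor": 0.25, "couch": 0.15, "moon": 0.05}

# Surprisal = -log2(p): expected continuations carry little surprise,
# unlikely ones carry a lot.
for token, p in next_token_probs.items():
    print(f"{token!r}: {-math.log2(p):.2f} bits of surprisal")
```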
And one of the other things I was really curious about was the analogies because, I don't think it did it in your sample, but all of the times I've asked it to summarize a scientific paper, it's come up with some sort of analogy that's, like, almost right. I mean, it's a little weird, but it's, like, almost right, and I can't find it anywhere else on the Internet, so it seems like it's something that the model is generating. And that seemed crazy to me; it seemed like... how is it doing this without some sort of higher-level reasoning?
And Steven explained that analogies are actually something that large language models are just good at. In fact, one of the earliest applications of this sort of deep-learning AI was translation, and Steven says the models behind Google Translate, for instance, have been able to generate analogies for years.
Johnson: One of the first tricks that you would do is you'd say, like, "Explain a black hole to me as if I'm, like, seven. You know, give me a basketball metaphor for what a black hole is." And so it was very clear that, like, you know, "Convert X into a Y that... where the meaning is the same, but the format is changed," was something that, like, just was an innate capability of these models.
So all we did with Audio Overviews is in the instructions for the hosts, we say, you know, "Make this interesting, engaging. Use helpful metaphors to describe it if it's complicated."
Feltman: Yeah, there's something, like, very galaxy-brain about thinking about metaphors as just sort of literal, like, "This data gets grouped together, and therefore," which is what LLMs are doing.
Parshall: Yeah, it's like I... I'd never really thought of it like that, as, like, a translation between languages, but I guess when I think about the work that I do trying to turn really dense scientific jargon into something that's engaging and understandable to read, it does kind of feel like a process of translation.
And I was also struck from talking to Steven just how much human hands had shaped the output that we hear. Like, from our perspective you just press a button and then the machine does the rest, but there's a whole prompt with many layers working underneath, interfacing between you pressing a button and what instructions the model actually gets.
So Steven describes the Audio Overviews prompt as a, quote, "work of art" and a "writerly" achievement more than a technological one. And as a writer by trade he oversees this whole editorial side of things: the tone, for example. So you noticed earlier when it was talking about you, it was very complimentary, very default happy, peppy. That was a conscious choice that they made, for this default tone to be one that is kind of positive, and it applies that tone to everything, whether it's appropriate or not, though they did recently introduce a customization option that lets you change the tone.
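[To make the "layers" idea concrete, here is a purely hypothetical sketch of how such a prompt might be assembled in code. The wording and structure are invented for illustration; Google hasn't published the actual Audio Overviews prompt.]

```python
# Hypothetical prompt layering, invented for illustration only;
# the real Audio Overviews prompt is unpublished.
BASE = ("You are two podcast hosts. Pull out the most interesting ideas "
        "in the sources and make them come to life for the listener. "
        "Use helpful metaphors if the material is complicated.")

FORMAT_RULES = "Alternate speakers; keep the conversation around ten minutes."

TONES = {
    "default": "Sound warm, upbeat, and enthusiastic about the material.",
    "roast": "Relentlessly criticize the author, like an insult comic.",
}

def build_prompt(sources: list[str], tone: str = "default") -> str:
    # The listener only clicks one button; these layers sit between
    # that click and the instructions the model actually receives.
    return "\n\n".join([BASE, FORMAT_RULES, TONES[tone], "SOURCES:", *sources])

print(build_prompt(["<uploaded document text>"], tone="roast"))
```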
Johnson: We just felt like, you know, the sound of, like, a great radio show, where people are interested and engaged, was a good place to start. You know, I'm not giving anything away here with the idea that we're gonna... we're thinking about other formats.
Even with the customization stuff that we just rolled out, like, I tried it the other day, where I, I put up a piece that I'd written a couple years ago and I gave it custom instructions, not really to explain the piece but to just relentlessly criticize the author of the piece in the style of, like, an insult comic at a roast. And so it just... they start off, it's like, "Well, we got this piece, 'The Man Who Broke the World' by Steven Johnson, and wow, he really kind of blew it with this one. I mean, did he do any research for this piece?" Like, it's just, like, what... really [laughs], it was just mean. It was just totally mean. They still sound kind of, like, playful, but you can steer them in that way.
Feltman: [Laughs] Okay, so we've talked about how the product works, but who and what is it for?
Parshall: That is a great question. I mentioned before that I could imagine how an overworked student could use this to catch up on reading or something, but generative AI is always going to have certain inaccuracies. And if you're using Audio Overviews to prep for class, there is a serious risk that anything and everything you're hearing could be inaccurate.
Feltman: Right.
Parshall: And it's not only that each individual fact could be inaccurate; the whole framing could be inaccurate. Is this something you can really trust?
To help understand this I spoke with Emily Bender. She's a linguist at the University of Washington who often studies AI. She's also the host of a podcast with a very cool name: Mystery AI Hype Theater 3000.
Feltman: That's a great reference; I love it [laughs].
Parshall: Fantastic [laughs].
And before our conversation I had actually dug up one of Emily's linguistics papers and fed it to NotebookLM's Audio Overviews, and I sent that to her to get her thoughts on how it represented her work.
Emily Bender: So the first thing I have to say is that I usually avoid listening to synthetic media. Nobody's accountable for it, and so I don't want to be spending time trying to make sense of it. But I did listen to it, and it was painful [laughs].
It was painful partially because that was a paper that was sort of an update at the 20-year mark of this long-term project we've been working on, and so there's a little case study in our paper about a grammar for a language called Paresi-Haliti from Brazil, and they make a big deal about it as this experiment on Paresi-Haliti. It's, like, no, actually, that's not how we framed it. And the voices on the podcast sound very confident, sound very knowledgeable. They're lighthearted, and it sounds like they are giving you a sense of how this work fits into a broader picture when they absolutely cannot.
Parshall: Rachel, you know pretty well what it takes to turn a scientific paper into an interesting and engaging podcast, do you not?
Feltman: Yeah, I mean, not to toot my own horn or anything 'cause I think the AI people [laughs] just did that enough to last me all of 2025. But yeah, it's tough. You know, we fact-check this show really thoroughly, and that's on top of having, like, trained reporters and editors doing their best to present stuff in a way that is engaging and also accurate. So yeah, it's a big undertaking.
Parshall: Yeah, the fact-checking is very thorough; shout-out.
I asked Emily about whether this seems like something an LLM, a large language model, can do well currently.
Bender: So that is an extremely difficult task, and I think the problem is that LLMs can make it sound like they've done it when they haven't. And so you have to sort of have a deep understanding of what's going on: What's new here? What have we learned that we didn't know before? And then what else would the lay audience need to know in order to appreciate that?
Well, this NotebookLM thing isn't doing that at all; some percentage of it's gonna be wrong, either wrong in the details or wrong in sort of where it lays the focus, and the people who are encountering it mostly aren't gonna be in a position to figure that out.
Parshall: I asked Steven how his team at Google thinks about this and tries to address some of these inaccuracies, and he said that the text-based features, which do provide citations, are generally more accurate than the audio.
Johnson: If you are looking for, like, the most factual thing you could possibly get from a language model with grounded sources, we believe that is NotebookLM, but you should do that in text. And if you're looking for something that's more entertaining, that may be a little bit looser and doesn't have citations, obviously, Audio Overviews is a great way to go for that.
Parshall: I also spoke with Anjana Susarla, who studies responsible AI at Michigan State University. She called NotebookLM's Audio Overview an "incredible tool" and was impressed by the Deep Dive podcast it had generated based off of one of her own studies. But besides accuracy she also had some other concerns.
First among them was bias. These are very white-sounding voices; they're speaking in a very particular, quote, "general American accent." They're trained on data mostly from Western parts of the world. And so, like almost all applications of AI, you have to ask the question: Who is this for? Where is it getting its data, and who is it leaving out?
Second, the environmental cost: we have no idea how much computing power this takes. And since the AI boom, Google's AI energy usage has been going up.
And then there's the copyright question. So here's Anjana.
Susarla: The final thing is, of course, this relies on, you know, the content, the copyright. Where is this data kind of coming from, and is this going to exacerbate all these existing inequities and the ethical concerns about broader implications, and so forth?
Parshall: Yeah, so that big question of where the training data is coming from: one place that it could be is from YouTube, which Google does own. Following some reporting by the New York Times, Google admitted that it used some YouTube videos to train AI. And that's something that might have violated the copyright of the YouTubers that content belonged to, according to the New York Times.
I asked a Google spokesperson directly whether the models underpinning Audio Overviews were trained using any transcripts, audio or video from YouTube, and they said that "Google's models can be trained on some YouTube content, in accordance with [their] agreements with YouTube creators."
Feltman: Very vague [laughs].
Parshall: Yes, when... I did ask them which agreements they were referring to, if it was the standard YouTube license, which just, like, governs most YouTube videos, or if it was something that their creators had specifically opted into; they declined to clarify.
Susarla: If Google is indeed training these on a lot of content scraped from YouTube, which it may well be. I mean, they do need a lot of training data to build models like this; we don't know. But this is kind of the next step in general of where we are with all these models.
Parshall: So for this next step Anjana points to a law in California that was enacted this fall that will eventually require companies based in the state, like Google, to disclose their training data for their AI models.
And honestly, Rachel, I feel put in kind of a strange position here as a reporter because I feel like that's where you and I kind of enter the story, and our whole team here, because, as Emily put it to me, quote, "they're getting that tone of voice somewhere."
Did you hear anything in that tone of voice that sounded familiar [laughs]?
Feltman: [Laughs] You know, Allison, I really did [laughs]. So yeah, the first time I heard a snippet of one of these AI conversations, like I said, I had a lot of feelings about it 'cause one of my first thoughts was, like, "Wow, they stole my voice." I mean, not me specifically; I'm not that vain, despite what the AI podcast may have led you to believe [laughs].
Parshall: [Laughs] Your ego has been inflated.
Feltman: [Laughs] Yeah, yeah. But so much of the texture that makes this audio sound engaging and real feels just ripped straight from existing podcasts. And I know it's ripped from existing podcasts because a lot of the stuff that I'm hearing (breath sounds and filler words and, like, a little bit of uptalk, some, some giggling, occasional vocal fry) is the stuff I get angry emails about from gentlemen of a certain age [laughs] who think I'm ruining the great tradition of radio broadcasting. And they're making this sound real by, like, breaking the same so-called rules that me and other women and queer people on podcasts have been criticized for breaking. So I won't lie, that feels kind of gross to me.
Parshall: Yeah, one of the things that Steven mentioned, too, was that they wanted the sound of, like, this great radio broadcasting show. And it is worth noting that Science Quickly is on YouTube, along with all of our hard work at Scientific American and the hard work of other outlets.
So, the question remains: How will this be used? Will it be used responsibly? Here is what Anjana thinks.
Susarla: Right now it seems more like everyone's just dazzled by what it can do rather than: What are we gonna use it for? Do we really need these, you know, capabilities?
I can think of one or two examples, I think, in the health care context: Can you generate clear summaries for health literacy or your doctor's notes? So this may be a good way to give, like, a summary of... rather than giving a piece of paper to read.
Is that going to have some challenges? What are some specific issues we need to be looking for? We've just unleashed this tool on people, and we're still catching up. We will need audit studies and then look at trustworthiness, assurance, and so forth.
So what's the right way to responsibly use NotebookLM? We are still figuring it out [laughs].
Parshall: Well, I think where I want to leave us is with an important point from Emily, which is: you shouldn't let the incredible fluency of these AI-generated voices fool you into thinking and talking about these models as if they're human.
Emily coined a term that's now used quite widely: the "stochastic parrot." Like, when a parrot says, like, "Polly want a cracker," it's not really connected to meaning; it's just mimicking. She also sometimes calls these large language models, quote, "synthetic text extruding machines," because, she said, it's more accurate to what they're actually doing and also sounds a little gross.
But I think she's getting at an important point there, right? It's, like, this... just a philosophical question of: What is language for, and what are we communicating when we say words?
Bender: Languages are systems that are pairings of form and meaning, systems of signs. So when we speak we are using word forms because they relate to specific meanings, but language models only have access to the form. And so they can give us stuff that looks good, it's got the right shape, but there's no relationship to meaning, no communicative intent, no accountability.
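[Bender's "form without meaning" point is easy to demonstrate with a toy model. The sketch below only counts which word follows which in a tiny made-up corpus, then samples from those counts; the output has a plausible local shape but no tie to meaning or facts. The corpus and code are our illustration, not anything from the show's guests.]

```python
import random

random.seed(0)

# Tiny toy corpus; word-adjacency counts are the model's only "knowledge."
corpus = ("the hosts sound confident and the hosts sound engaged and "
          "the summary sounds right but the summary is not grounded").split()

# Bigram table: for each word, the words observed to follow it.
follows: dict[str, list[str]] = {}
for a, b in zip(corpus, corpus[1:]):
    follows.setdefault(a, []).append(b)

# Chain locally plausible next words: well-shaped text, meaning-free.
word, output = "the", ["the"]
for _ in range(12):
    word = random.choice(follows.get(word, corpus))
    output.append(word)
print(" ".join(output))
```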
Feltman: I guess sort of the takeaway on my complicated feelings on this is: I'm not worried that my job is gonna be stolen by a collection of tabs [laughs] being processed by Google, because it is a lot of work to be accurate. And if people find this kind of engaging conversation a more accessible way to sort through really dense text, that's awesome. But, of course, there are so many caveats about where the training data has come from, bias, accuracy. So I think it's, like... people just need to be really aware of what they're getting, and I don't know that I feel that the products themselves are doing a great job of making people aware of those caveats.
Parshall: Yeah, there's usually just kind of a disclaimer at the bottom. I think one of the things that this whole AI boom (like this, the AI search features) has made me feel kind of uncomfortable about: Are we going to be expected, and have we already started to expect, our information to contain that much more mistruth? It's something that I just worry about.
Feltman: Well, Allison, thanks so much for coming on to chat about this and also for having a couple of AI voices gas me up. Thank you.
Parshall: Thank you.
Feltman: That's all for today's episode. We'll be back on Friday with Episode One of a four-part miniseries on the future of animal conservation.
Science Quickly is produced by me, Rachel Feltman, along with Fonda Mwangi, Kelso Harper, Madison Goldberg and Jeff DelViscio. Today's episode was reported and co-hosted by Allison Parshall. Shayna Posses and Aaron Shattuck fact-check our show. Our theme music was composed by Dominic Smith. Subscribe to Scientific American for more up-to-date and in-depth science news.
For Scientific American, this is Rachel Feltman. See you next time!