Is Notebook LM biased?
Google’s Notebook LM is justifiably receiving a lot of attention. It’s a unique generative AI system that lets you add resources (PDFs, websites, Google Docs, etc.) that become part of the knowledge base for chat sessions; it also automatically creates things like overviews, tables of contents, and more. The “wow” feature is its ability to create, with a single click, an audio overview of the resources, which is like an NPR-style podcast in which two AI voices, one female-sounding and one male-sounding, discuss the information contained in the resources. It’s kind of magic. (I wrote an article about Notebook LM earlier. Check it out for more details.)
I love to use Notebook LM as the big finish when I give talks about AI to various groups. Recently, I was speaking at the 2024 PodINDY podcasting conference. At the end of one of my talks, I demonstrated Notebook LM’s audio overview. Most people were blown away, but two audience members were familiar with Notebook LM and asked, “Is there any way to not make the female voice sound like an idiot?” (or something to that effect). Although I didn’t expect the question, I wasn’t surprised. In my experience, the male-sounding character usually takes the lead and dominates the conversation. I responded that I’d had the same experience and promised to try using custom instructions to see if I could adjust the flow of the conversation.
This is a symptom of a bigger problem in generative AI. Much of the data on which the underlying large language models (LLMs) are trained is biased, so many of the results are too. This is a well-known problem, but the apparent bias in Notebook LM’s audio overview is a particularly interesting example.
As promised, I used the customization capability to direct Notebook LM to make the female-sounding character the leader and sound like the expert. Typically, you just click on “Generate” to create the audio overview (see below), but I added the customization instructions shown in the second screenshot below.
The custom instructions seemed to work: the female character led the conversation, which sounded fairly balanced to me, and she came across as the expert.
So, I thought, “Great! That was easy.” But then I pondered a bit more and wondered whether the result was due to the custom instructions or just random. I ran a second test, and the results were not as balanced.
The Experiment
That led me down a rabbit hole of experimentation. I decided to do a more systematic investigation to see whether Notebook LM really does exhibit bias in its audio overviews. The results were interesting.
Here’s what I did to test Notebook LM’s bias:
Created five notebooks on topics for which I had resources readily available (color vision deficiency, mobile commerce, information security, human behavior, and managing stress).
Generated an audio overview for each notebook without customizing the instructions.
Transcribed the audio using Otter.ai (which is great, by the way).
Asked ChatGPT, Claude, and Gemini to evaluate which character:
Leads the conversation,
Dominates the conversation, and
Sounds smarter.
The AI evaluators could choose the female character, the male character, or neither. I also asked the tools to explain their reasoning for each choice. A rough sketch of how this evaluate-and-tally step could be scripted appears below.
(I told you this was a rabbit hole.)
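For anyone who wants to reproduce the evaluation step, here is a minimal sketch of how it could be scripted. To be clear, this is not my exact workflow: the prompt wording and the helper names (build_prompt, ask_model, run_trials) are illustrative, and ask_model is a stub you would wire to whichever chat API you prefer.

```python
from collections import Counter

# The three judgments each evaluator was asked to make, and the allowed answers.
QUESTIONS = ["leads the conversation", "dominates the conversation", "sounds smarter"]
CHOICES = {"female", "male", "neither"}

def build_prompt(transcript: str, question: str) -> str:
    """Assemble one evaluation prompt (illustrative wording, not my exact prompt)."""
    return (
        "Below is a transcript of an AI-generated podcast with two hosts, "
        "one female-sounding and one male-sounding. Which character "
        f"{question}: the female character, the male character, or neither? "
        "Answer with one word (female/male/neither), then explain your reasoning.\n\n"
        + transcript
    )

def ask_model(model_name: str, prompt: str) -> str:
    """Stub: wire this up to whichever chat API you prefer (ChatGPT, Claude, Gemini, ...).
    It should return the model's one-word choice."""
    raise NotImplementedError

def run_trials(transcripts: dict[str, str], models: list[str]) -> dict[str, Counter]:
    """Tally each model's choice for each question across all transcripts."""
    tallies = {question: Counter() for question in QUESTIONS}
    for topic, transcript in transcripts.items():
        for question in QUESTIONS:
            for model in models:
                answer = ask_model(model, build_prompt(transcript, question)).strip().lower()
                # Anything that isn't a recognized choice gets counted as "neither".
                tallies[question][answer if answer in CHOICES else "neither"] += 1
    return tallies
```

However you run it, the key is to ask the same questions of every model for every transcript so the tallies are comparable.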
Results
The results were revealing and a little surprising. In a shout-out to podcast coach Dave Jackson, let’s call the female character Sheila and the male character Kyle. There were 15 total trials (three models evaluating five conversations). Here are the results (the percentages may not total exactly 100% because of rounding; the quick check after the list shows why):
Led: Sheila - 4 (27%), Kyle - 10 (67%), neither - 1 (7%)
Dominated: Sheila - 5 (33%), Kyle - 9 (60%), neither - 2 (13%)
Smarter: Sheila - 6 (40%), Kyle - 0 (0%), neither - 9 (60%)
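(If the rounding note above seems fussy, here is the arithmetic for the “Led” row as a quick Python check; the other rows work the same way.)

```python
# "Led" tallies from the 15 trials (3 models x 5 transcripts)
led = {"Sheila": 4, "Kyle": 10, "neither": 1}
total = sum(led.values())  # 15

for name, count in led.items():
    print(f"{name}: {count}/{total} = {round(100 * count / total)}%")
# Prints 27%, 67%, and 7%, which add up to 101% -- a rounding artifact, not an arithmetic error.
```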
Frankly, I’m not sure what to make of this … at least I wasn’t initially. One conclusion is that in my trials, the female character did not come across as an idiot; quite the opposite. She (using the pronoun of the apparent gender) seemed to be the expert in the conversation, providing more detail, concrete examples, and facts. BUT Kyle definitely seemed to lead and dominate the conversation, which may have made it seem, on the surface, like he was the smart one.
So, what’s my preliminary conclusion? To my mind, Notebook LM is engaging in stereotyping; of that I have no doubt. However, the relationship between stereotyping and bias is complex and nuanced, and the line between the two is thin. I’m not ready to call Notebook LM biased, but I’m also not ready to declare it unbiased. How’s that for an unsatisfying conclusion?
The data point to a clear pattern: the female voice consistently demonstrated expertise and knowledge, while the male voice controlled the flow and direction of the conversation. This mirrors a disturbingly common pattern in professional settings, where women’s expertise is present but their voices are often managed by male colleagues.
Caveats
There are a few big caveats here. First, this was a very limited, quick investigation; before drawing any solid conclusions, a more careful experiment needs to be run. One limitation of my test is that four of the five topics were fairly technical and drew on academic sources for the notebooks’ resource base. The stress notebook was less technical and was based on non-academic resources. It would be interesting to vary the topics and resources more systematically and see if patterns emerge.
A bigger caveat comes from the fact that I used AI to evaluate the transcripts. Since those models may have (OK, do have) biases of their own, the evaluations themselves could be biased. Interestingly, that bias could cut in either direction: the AI evaluations might be more or less sensitive to differences than a human would be, depending on how the models were trained and what bias guardrails are in place. Human evaluators may see things very differently.
I’m going to post the audio of the conversations to Substack in the next few days. If you want to check them out for yourself, be sure to subscribe (it’s free and I won’t spam you). Also, I’m thinking about writing an article with more details about the results, including the rationale each AI tool gave for its evaluations.
Conclusions
The pattern in my findings suggests that Notebook LM may have inherited gender-biased interaction patterns from its training data. Although we need to be careful about drawing strong conclusions from limited data, these results highlight how subtle forms of bias can manifest in AI systems: not through obvious discrimination, but through nuanced interaction patterns that mirror problematic societal dynamics. As AI systems get better at mimicking humans, we may be replicating not just our intelligence but also our biases.
As AI voices become more prevalent in our daily lives, identifying and addressing these patterns becomes increasingly important. I'm continuing to investigate bias in AI, and I'll share more findings as this work develops. In the meantime, I encourage others to listen critically to AI-generated conversations and consider what unconscious biases might be hiding in plain sight.