OpenAI Forum

Whose Opinions Do Language Models Reflect? Research Presentation by Shibani Santurkar

Posted Sep 12, 2023 | Views 5.1K
Shibani Santurkar
AI Researcher @ OpenAI

Shibani Santurkar is a researcher at OpenAI working on building safe and reliable machine learning models. Shibani received a PhD in Computer Science from MIT in 2021, where she was advised by Aleksander Mądry and Nir Shavit. Subsequently, she was a postdoctoral researcher at Stanford University with Tatsu Hashimoto, Percy Liang and Tengyu Ma. She is a recipient of the Google Fellowship and an Open Philanthropy early-career grant.

Link to paper: https://arxiv.org/abs/2303.17548

Abstract: Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation. Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular demographic groups. Our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs (e.g., 65+ and widowed individuals). Our code and data are available at this https URL.

TRANSCRIPT

Hello, everyone. If you don't know, I'm Natalie Cone, the OpenAI Forum Community Manager. I want to start our talk today by reminding us all of OpenAI's mission.

OpenAI's mission is to ensure that artificial general intelligence, AGI, by which we mean highly autonomous systems that outperform humans at most economically valuable work, benefits all of humanity. Today's talk, Whose Opinions Do Language Models Reflect?, will be presented by my colleague, Shibani Santurkar.

Shibani is a researcher at OpenAI working on building safe and reliable machine learning models. Shibani received a PhD in computer science from MIT in 2021, where she was advised by Aleksander Mądry and Nir Shavit. Subsequently, she was a postdoctoral researcher at Stanford University with Tatsu Hashimoto, Percy Liang, and Tengyu Ma. She's a recipient of the Google Fellowship and an Open Philanthropy early-career grant. Shibani, welcome to the forum. We're so honored to have you here sharing your important research with us today.

Thank you for that introduction, Natalie, and thank you so much for having me. I'm just going to share my screen. I just want to make sure that you guys can see the screen. It looks perfect. Awesome. Let me just get started.

I'm Shibani, and like Natalie said, I'm a researcher at OpenAI. I'm very excited to talk to you guys about some of my recent work, which I did during my postdoc at Stanford with several of my amazing colleagues: Esin, Faisal, Cinoo, Percy, and Tatsu. I'm just going to dive right into it.

I don't think I need to tell anyone here that language models are going to have a big role in shaping society. There are many ways in which language models can influence society, but I think one of the most glaring examples of this is in how they respond to open-ended queries like this.

These are some examples of queries that a user posted on Twitter, where basically the user asked ChatGPT to write a positive poem about two political leaders. In one case, in the case of Donald Trump, the model just refused to do it, saying that it needed to maintain a neutral stance, whereas in the other case, for Joe Biden, the model happily wrote an extremely positive poem.

In these responses of models to open-ended queries, you can kind of see that there are some underlying values or opinions that seem to be baked into these models. And so a big question for us was, whose opinion is it that these language models are reflecting? And note that the answer to this question is not completely obvious a priori, because these models are shaped by the opinions of a whole bunch of humans, right from people on the internet, from where the pre-training data for a lot of these models comes from, to the crowd workers that provide human feedback to improve these models, and of course to those of the developers at these big organizations that are developing the models in the first place.

So the central goal of our work was to try to understand whose opinions these models reflect, and specifically to perform a fine-grained comparison of the opinions reflected by language models and those of human populations on topics of public interest.

Okay, so how do we go about doing this? Measuring language model opinions in every possible setting seems almost impossible. After all, there are so many different ways we can pose a query to a language model, and completely analyzing an open-ended model response seems inordinately challenging. But it turns out that this problem doesn't only arise in the case of language models. In fact, it arises even when we want to measure public opinion, which is done frequently in many disciplines, from media to education to politics. And there, the typical tool that's used is the public opinion poll.

So basically, organizations like YouGov, Pew, Gallup, they organize surveys where they leverage experts to curate these multiple-choice questions on a topic. And these questions are specifically designed to cover various nuances of a topic, and also to, as accurately as possible, elicit people's preferences. These surveys are then presented to individuals, and then these responses are aggregated over different populations, giving us human opinion distributions. So you could imagine Pew conducting such a survey on gun control in different populations, and then obtaining opinion distributions, and then doing some post-hoc analysis, for instance, comparing the opinions of Democrats to Republicans.
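To make this aggregation step concrete, here is a minimal sketch of how individual, weighted survey responses could be rolled up into a per-group opinion distribution. The DataFrame column names and the weighting scheme are illustrative assumptions, not the actual format of the Pew data.

```python
# Minimal sketch (not the paper's code): aggregate individual survey responses
# into a per-group opinion distribution. Column names ("question", "answer",
# "survey_weight") and the weighting scheme are illustrative assumptions.
import pandas as pd

def group_opinion_distribution(responses: pd.DataFrame, question: str,
                               group_col: str, group_value: str) -> dict:
    """Weighted share of each answer choice for one demographic group,
    e.g. group_col="party", group_value="Democrat"."""
    subset = responses[(responses["question"] == question) &
                       (responses[group_col] == group_value)]
    weighted = subset.groupby("answer")["survey_weight"].sum()
    return (weighted / weighted.sum()).to_dict()
```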

Now, to measure language models' opinions, what we do is we repurpose this tool. Basically, we take these multiple-choice questions, and now we present them to language models. Basically, we ask a language model, in text form, what exactly should be the answer to this question. We then measure the log probability that the model assigns to various answer choices, and then use these to measure the model's opinion distribution. Note that this step of measuring the log probabilities to obtain this opinion distribution is analogous to sampling many responses from the model, and just obtaining a distribution from this. So you basically can think of this model opinion distribution as analogous to what you're obtaining from that of a population of humans.
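As a rough illustration of the procedure just described, the sketch below formats a multiple-choice question as a prompt, reads off the log probability the model assigns to each answer letter, and normalizes over the choices to get a model opinion distribution. The helper `get_answer_logprob` is hypothetical; it stands in for whatever API call returns the log probability of a given answer continuation.

```python
# Minimal sketch (illustrative, not the paper's code): turn per-choice log
# probabilities from a language model into a "model opinion distribution".
# `get_answer_logprob(prompt, letter)` is a hypothetical helper.
import numpy as np

def model_opinion_distribution(question: str, choices: list[str], get_answer_logprob) -> np.ndarray:
    letters = "ABCDEFGHIJ"[:len(choices)]
    prompt = question + "\n" + "\n".join(
        f"{letter}. {text}" for letter, text in zip(letters, choices)
    ) + "\nAnswer:"
    logprobs = np.array([get_answer_logprob(prompt, letter) for letter in letters])
    probs = np.exp(logprobs - logprobs.max())  # softmax restricted to the answer letters
    return probs / probs.sum()
```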

Now, with these objects in hand, the human opinion distribution and the model opinion distribution for a given question, we can then begin to ask: how aligned are they? And so in our paper, we focus on measuring such opinion alignment between human and model opinion distributions using the Wasserstein distance metric, which basically asks how similar these two distributions are. And importantly, in our paper, we also think about different ways of prompting the model.
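For readers who want the distance spelled out, here is a small sketch of the one-dimensional Wasserstein (earth mover's) distance between two discrete opinion distributions over the same ordered answer choices. Treating the choices as equally spaced ordinal values is an assumption made here for illustration.

```python
# Minimal sketch: 1-D Wasserstein distance between a human and a model opinion
# distribution over the same K ordered answer choices, assuming unit spacing
# between adjacent choices.
import numpy as np

def wasserstein_1d(p: np.ndarray, q: np.ndarray) -> float:
    """p, q: probability vectors over the same K ordered answer choices."""
    assert p.shape == q.shape
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())
```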

So we consider the default model opinion distribution, which is basically just asking the model this multiple-choice question without any additional context. And we also consider the steered opinion distribution, which is basically asking the model the same question while providing additional context, for instance, telling the model to emulate the opinions of a particular group, such as Democrats. And this allows us to study two different objects. One is the opinion alignment of a model by default, which is probably what you would get if you asked one of these language models a query without any additional information. For instance, if I asked ChatGPT to write a poem about Donald Trump, this would constitute its default opinion distribution. And the second object that we have access to is the steered opinion alignment, which is basically analogous to asking: if we interacted with ChatGPT, and ChatGPT knew a lot of stuff about us, how effectively would it model the opinions of the group that it's interacting with?
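The difference between the default and steered settings comes down to the prompt. The sketch below shows one plausible way to construct both; the steering wording is an illustrative assumption, not the exact template used in the paper.

```python
# Minimal sketch of the two prompting modes described above. The steering text
# is an illustrative assumption, not the paper's verbatim template.
def build_prompt(question_block: str, group: str | None = None) -> str:
    if group is None:
        # Default opinion distribution: just the formatted multiple-choice question.
        return question_block + "\nAnswer:"
    # Steered opinion distribution: prepend context asking the model to answer
    # as a member of the given demographic group.
    context = f"Answer the following question as if you were a {group} in the United States.\n\n"
    return context + question_block + "\nAnswer:"
```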

Okay, so this is the general methodology that we develop in our paper. And for our analysis, we actually instantiate this methodology to study current language models. And this requires us to come up with two main things. The first one is the data set.

So in our paper, using surveys, we build a dataset called OpinionQA. This dataset is based on Pew Research's American Trends Panel, and I'll talk a little bit later in the talk about the limitations of this survey, but I just want to preface this by saying that it is based on the US only. The dataset consists of 1,500 questions on topics ranging from guns to driverless cars to families. So it's an extremely broad dataset and is in no way limited to just political bias; it contains a whole range of interesting questions. And for each of these questions, we have access to individualized human responses, that is, responses from specific participants, on the order of thousands of people. And along with that, we have information about the demographic attributes of these individuals, which gives us 60 demographic groups.

And the second main component of our study, of course, is the models themselves. For our study, we study nine language models from AI21 and OpenAI, ranging in size from 350 million parameters to 178 billion parameters. I should say that this study was done before the release of GPT-4, so the models you're going to see here are pretty old by some standards; they're all GPT-3 and before. We also consider a range of different language models: both base language models, as in models that are just pre-trained on internet data, as well as fine-tuned language models, that is, models that have been fine-tuned post hoc to be more aligned with human preferences.

Okay, so the big question is: what do we find, and how aligned are current language models with humans? In our paper, we consider this alignment along three axes. The first axis we consider is representativeness, and what this asks is: how aligned are the default model opinions, that is, the opinions of the model without any additional context, with those of a particular human population?

So to start with on this slide, we look at the representativeness of different language models of the general US populace. So basically, we're measuring the opinion alignment between the language models' opinion distribution and that of the general US populace aggregated over the entire data set. And here's what we find.

So what you're seeing here is the representativeness scores of different models. You see the ones from AI21 and OpenAI. And in general here, a higher score means that the model is more representative. The highest possible value is one and the lowest possible value is zero.
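One plausible way to turn per-question distances into a score on this zero-to-one scale is sketched below. It is an illustration in the spirit of the metric described here, not necessarily the paper's exact formula, and it reuses the `wasserstein_1d` helper from the earlier sketch.

```python
# Sketch under assumptions (not necessarily the paper's exact metric): normalize
# each per-question Wasserstein distance by its maximum possible value (K - 1 for
# K ordered choices), average over questions, and subtract from 1 to obtain a
# 0-to-1 "representativeness" score.
import numpy as np

def representativeness(model_dists, human_dists) -> float:
    """Each argument is a list of probability vectors, one pair per question."""
    normed = []
    for p, q in zip(model_dists, human_dists):
        p, q = np.asarray(p), np.asarray(q)
        normed.append(wasserstein_1d(p, q) / (len(p) - 1))
    return 1.0 - float(np.mean(normed))
```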

So in general, what you see here is that most of the language models are not particularly representative of the general US populace. And to contextualize these numbers, I've designed this scale here, which basically measures how representative the views of one particular US demographic group are of those of another. So for instance, you can see a circle here that says Democrat-Republican climate change, and that basically says how representative Democrats' views on climate change are of those of Republicans. And if you look at the scale and compare it with the scores that we see in this table, you see that most language models are about as representative of the views of the general US populace as, say, Democrats are of Republicans on climate change, or conservatives are of liberals on the government's role and government control in general. And interestingly, what was even more surprising for us was the fact that these newer language models, which have been explicitly trained to be more human-aligned, are actually even less aligned with the views of the general US populace.

So to see that, you can compare the representativeness scores of the davinci model from OpenAI to the text-davinci model. The text-davinci model was explicitly fine-tuned on human preferences to be more human-aligned, and you see that its representativeness score is much lower than that of the base model. I'll come back to this finding in a few slides.

But before going there, we actually did a more fine-grained study, and we asked: what if we don't look at the general US populace, but instead look at the alignment with respect to specific demographic groups in the US? Again, we find a big shift between base language models and those that are fine-tuned on human preferences, and the shift is towards aligning more with the preferences of the more liberal and educated parts of the US populace. So, for example, just focusing on the table on the right, where the lighter colors indicate more alignment or more representativeness: for the base models, you see that the alignment is more with people who have a high school education or less, and for the fine-tuned models, which is the text series here, you see that it shifts very saliently towards people with a college education or higher.

And another thing that we found in our study is that there are certain groups of the US populace that are uniformly poorly reflected by all the models: for instance, the older parts of the population, people from certain religions, such as Mormons, and people who are widowed. And interestingly, even though it's hard to completely pinpoint why these shifts and skews are happening, it turns out that they somewhat align with the demographics of the crowd workers who were used to provide feedback. Looking at the limited amount of information that's publicly available about the contractors that were hired to provide human feedback, we see that their demographics actually align with the shift that we're seeing from base language models to fine-tuned ones. In particular, in this table, you see that the contractors included almost no individuals above the age of 65, and most of the contractors were at least high school educated, if not much more.

Okay, so in the previous slide, we saw that some of the human feedback fine-tuned models, which were explicitly trained to be more human-aligned, were actually less representative of, or less aligned with, the views of different parts of US society. We wanted to understand why this was the case, because it seemed like a really puzzling phenomenon. And we got a clue about what might be going on by looking a little bit closer at the kinds of answers the model is predicting. Specifically, what we saw is that for most of the models, and for humans as well, the responses we got from the model or from a group of people were pretty diverse. What I mean by that is, if you ask a question about gun control to a group of Democrats, you're likely to get some variation in the responses across individuals. On the other hand, for the newer fine-tuned models, in particular text-davinci-003, we saw some very different behavior: the model had almost no diversity in its responses and was very confidently answering only one particular choice.

So our hypothesis was that the reason these models are less representative of the views of a particular human population is that, rather than capturing the entire diversity of opinions, they are actually collapsing onto the dominant viewpoint of that group. In this figure, on the x-axis you see the alignment of the model with respect to the actual opinion distribution, and on the y-axis you see the alignment with respect to the dominant viewpoint of a particular group. And what we see, in particular for text-davinci-003, is that these models are much more aligned with the dominant viewpoint of a group than they are with the entire opinion distribution. So one way to think about this is that the models are not only getting skewed towards the more liberal, educated, and wealthy parts of society, but they're almost caricaturing the opinions of those groups by collapsing onto the stereotypical or dominant viewpoint of that group.

Okay, so now we've talked a lot about representativeness, which is, once again, the alignment between the default model opinions and those of a human group. But you might ask what happens when you actually steer the language model to behave like a particular group. After all, when most of us interact with ChatGPT, or any other language model, we're not just asking it a query in a vacuum; the model already has some information about us and is probably personalizing to us to some extent. So our big question was: if we explicitly prompt the model to behave like a certain group, can the model actually emulate the viewpoints of that group? Once again, we have a plot where on the x-axis you see the default alignment of the model, that is, the alignment between the default opinions and those of a particular group, and on the y-axis you see the steered alignment, that is, how aligned the language model is with the opinions of a group after it has been explicitly steered or prompted to behave like that group. And intuitively, the picture looks as follows.

So if you just draw a dividing line across this plot, a model is more steerable the closer it is to the top left corner, and less steerable the closer it is to the bottom right corner. And ideally, what we would like to see is a line like this, where you can think of each point on the line as a different group. What we might hope is that even if the model is differently aligned with different groups by default, after steering it becomes equally aligned with all the groups, so the line is essentially flat. In other words, even if the model is biased towards Democrats over Republicans by default, if you explicitly steer the model to behave like Democrats or Republicans, it can emulate their viewpoints equally well.

Okay, so what do we find? This is what the steerability of different language models looks like. I'm going to first highlight the most and the least steerable models. If you think about the ideal line that we saw on the previous slide, you see that this gray line here, which is one of the instruct or fine-tuned models from OpenAI, is actually the most steerable, because it's closest to being a flat line. And this other model, which is one of the older base models, is actually the least steerable, and in fact happens to get worse with steering. What is salient to note here is that even though most of the other language models are somewhat steerable, in that they shift towards the top left corner, none of them are anywhere close to the ideal line. This means that even though these models do tend to align more with the opinions of certain groups after steering, the performance disparities across different groups are by no means corrected. As a concrete example, if you think of a model that was better aligned with Democrats than Republicans by default, what this says is that with steering the model does get better aligned with both Democrats and Republicans, but it still preferentially aligns more with one particular group.
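The comparison behind this plot can be summarized per group, as in the sketch below. The `default_alignment` and `steered_alignment` inputs are assumed to be precomputed per-group alignment scores, for example representativeness-style scores as sketched earlier.

```python
# Minimal sketch: contrast each group's default alignment with its steered
# alignment, mirroring the scatter plot described above. Inputs are assumed to
# be dicts mapping group name -> alignment score.
def steerability_table(groups, default_alignment, steered_alignment):
    rows = []
    for g in groups:
        d, s = default_alignment[g], steered_alignment[g]
        rows.append((g, d, s, s - d))  # positive last column = steering helped
    # A perfectly steerable model would end up with roughly equal steered
    # alignment for every group, regardless of where it started by default.
    return sorted(rows, key=lambda r: r[2], reverse=True)
```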

Okay, so the third axis we consider in our paper is consistency, which is basically asking: do language models consistently lean towards the opinions of certain groups across topics? The way we measure this is by asking how often the group that the model is best aligned with overall is the same as the group that the model is best aligned with on a given topic. So you could imagine that a particular model is liberal overall, but on certain topics it really leans Republican, or you could imagine a different model that's just uniformly liberal across all topics. The goal of our study here was to understand how consistent these language models' opinions are. In this table, you see the consistency scores, ranging from zero to one, with one being more consistent and zero being less consistent. And if you look at a particular number here, for instance the 0.5 for text-davinci-002, what this means is that only for about 50% of topics is the group that the model best aligns with the same as the group that the model best aligns with overall. The way to interpret this is basically that these models have a mishmash of different opinions across topics, and most often they don't reflect a consistent viewpoint or a consistent persona.
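The consistency score described here reduces to a simple fraction, sketched below with assumed inputs: the model's best-aligned group overall and its best-aligned group per topic.

```python
# Minimal sketch of the consistency score described above: the fraction of
# topics on which the model's best-aligned group matches its best-aligned
# group overall.
def consistency(best_group_overall: str, best_group_by_topic: dict[str, str]) -> float:
    matches = sum(1 for g in best_group_by_topic.values() if g == best_group_overall)
    return matches / len(best_group_by_topic)
```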


Speaker 1: ... So is it really good to align the AI models with these kind of groups that we don't want to see aligned with?

Speaker 2: Yeah, great questions both of them. So for the first question about the data source...

Speaker 1: Thanks. It was a really good answer.

Speaker 3: Thank you, Peter. And just so Peter knows...

Speaker 4: Okay. So, if I understood this correctly...

Speaker 5: Hi, I'm not 100% sure...

Speaker 4: Thank you, Colin. Next up, Chris Soria has a question for you, Shibani.

Speaker 6: Hi, Shibani. Hi, Natalie. Can you hear me?...

Speaker 6: Great questions. Okay. So for the first question...

Speaker 6: Yep. Great questions. Okay. So for the first question...

Speaker 6: And so with that analysis, what I was trying to say is...

Speaker 6: Yeah. It's a really good question. I think...

Speaker 6: So you're totally right. It would be amazing...

Speaker 6: Exactly. And I think that's a super good point...

Speaker 6: Yeah, exactly. So I think that this question...

Speaker 7: Thank you, Chris. Next up is Luis.

Speaker 8: Hi, Shibani. Thank you for your presentation.

Speaker 9: Hi, Shibani. Thank you for your talk.

Speaker 10: Hi, Shibani, thank you for your talk.

Speaker 10: So my question is, I think the outcomes...

Speaker 10: Thank you, Luis. I will go next and then it will be Shiva.

Speaker 11: Hi. So when I was listening to your talk, I think I heard you mention or perhaps imply...

Speaker 12: Hi, Shibani. Thank you for that talk. It was really great.

Speaker 12: Hi. So my research focus is ownership and representation in AI...

Speaker 12: Thank you, Shiva. Next up is Melanie.

Speaker 13: Hi, Shibani. That was a great presentation. Thank you.

Speaker 13: Hi, Shibani. Thank you for your presentation.

Speaker 13: Thank you, Melanie. Next up is Stephen.

Speaker 14: Hi, Shibani. Fantastic talk. Thanks so much.

Speaker 14: Hi there. Shibani, I'm Kim. I'm from the Transformer team. Thank you for that talk.

Speaker 14: Sure. Thanks, Stephen. I think a lot of these situations...

Speaker 15: Hi, Shibani. Thanks so much for the talk.

Speaker 15: Hi, Shibani. Thank you for your presentation.

everything. If you just look at what they have reported about the statistics of the crowd workers or annotators they use, it seems to match up in many cases. Another really interesting example for me was that we saw the base language models, just the pre-trained ones, typically aligning with Christian religious values, whereas the fine-tuned ones were aligning with Buddhism, Hinduism, and Islam, and I was really surprised. But this kind of makes sense if you look at the demographics, because a lot of the contractors are Southeast Asian. And so you can really see these things manifesting.

The second question that you asked, about fine-tuning: maybe there was a misunderstanding there, because in our study we didn't do any fine-tuning on the models; we just did prompting. Maybe ideally we would want to do fine-tuning to align with different groups, but this seems really challenging and complex, especially given how many different viewpoints there are. And you could imagine that you don't only want to steer the model to behave like these distinct groups, but also like different people who are much more nuanced and represent different combinations of groups. So we just did prompting, and we tried three fairly simple approaches to prompting; you can see more about them in our paper. Basically, in different ways, we're asking the model to behave like it is a certain group, and we're trying to see whether the model can correct for it. And just like you said, we see that the model does correct for it to some extent. But what's problematic is that the model doesn't correct the skew: if the model is more representative of one group than another, it does get better at both when you steer it, but the disparity still remains. Both of them may improve by five percentage points, but this doesn't fix the, say, 30-percentage-point gap that was there to begin with.

Yeah.

So Shibani, when you discussed steering, the steering was accomplished by the prompting, not by fine-tuning.

Yes. And I think that was a very simplistic approach to steering. I think it would be really interesting to look at more implicit steering. You could imagine that when you interact with ChatGPT, you don't explicitly go out and say, I'm a Democrat; it's more subtle, more implicitly expressed, maybe in your conversations with the model. And it would be interesting to see how good the model is at being steered implicitly as well.

Thank you for your question, Chris. And also, Shibani, just to let you know, Chris is working on his PhD at UC Berkeley, advised by Claude Fischer, a distinguished professor in sociology, and the team recently reached out looking for some advice related to fine-tuning. So perhaps I can connect you with his team afterwards, since you really have been leaning into this research.

Yeah, happy to chat. Thank you.

Yeah. We have seven more minutes. If anybody has any more questions, we're happy to take them. I'm also happy to give everybody some time back. Going once, going twice.

Awesome. Shibani, that was so impressive. Again, I really wish we had kicked off the forum with your research, because it's so relevant and it answers so many questions that forum members have been surfacing since the very beginning. So thank you so much for joining us today.

Thank you so much for having me.

Yeah, that was so amazing, Shibani, and I hope that you'll come back. I always like to remind us of the events that we have in the pipeline. But first, if anybody is interested in reading Shibani and team's full research paper, our associate is going to drop it in the chat right now.

Also, October 5 will be our very last in-person event this year. We're going to scale back the forum programming just a bit towards the last quarter so we can make space for 2024 strategic planning for this community, and the rest of the events after October 5 will be hosted virtually. On October 5, we'll be hosting the CEO of Worldcoin and Tools for Humanity, Alex Blania, and OpenAI's head of policy research, Miles Brundage, to whom you were all introduced during our AI literacy event a few weeks ago. Seats are limited for that event, as they are for all in-person events, so please register ahead of time to save your spot. Also note that that event was originally scheduled for September 21 and, due to extenuating circumstances, was rescheduled for October 5. So if you had previously registered for September 21, please take note of the new date.

We'll also be hosting Terence Tao for The Future of Math and AI. He is a world-renowned mathematician, as many of you know. We will also be hosting Ilya Sutskever, one of the co-founders of OpenAI, during that event. It will unfold virtually here on the forum on October 12.

And that is it for us this evening. Again, Shibani, thank you so much for joining us. Thank you to all who took some time out of your evening to participate and enrich our discussions. You always make these events so wonderful because of your amazing and deep questions. So I hope to see you all again very, very soon. In just a couple of weeks, I will see Colin, Peter, and Evan in San Francisco, and we'll enjoy a lovely boat ride along with recording all of your presentations so that we can host those on demand in the forum, so that everybody else who was not in the Democratic Inputs group will be able to see what you've been working on. So thank you so much, everybody. It was lovely to see you here this evening. And until next time, have a wonderful rest of your night.
