Weak to Strong Generalization

Posted Feb 26, 2024 | Views 20.3K

# STEM

# AI Research

speakers

Collin Burns

Superalignment Researcher @ OpenAI

Collin is a researcher at OpenAI working on aligning superhuman models. Before joining OpenAI, he was a PhD student at Berkeley. His research interests include (1) studying extreme forms of "weak-to-strong" generalization, (2) developing unsupervised methods for making language models honest, and (3) understanding when and how high-level abstractions are encoded in representations.

Pavel Izmailov

Superalignment Researcher @ OpenAI

I am a Research Scientist in the OpenAI superalignment team, focusing on weak-to-strong generalization and interpretability. In 2023, I defended my PhD in Computer Science at NYU, under the supervision of Andrew Gordon Wilson. I am primarily interested in understanding and improving deep neural networks. In particular my interests include out-of-distribution generalization, AI alignment, reasoning, probabilistic deep learning, representation learning and other topics. Recently, our work on Bayesian model selection was recognized with an outstanding paper award at ICML 2022.

SUMMARY

Collin Burns and Pavel Izmailov present their research, Weak-to-Strong Generalization

TRANSCRIPT

For folks who I haven't met, my name is Logan Kilpatrick, I do developer relations at OpenAI, so really focused on trying to make developers successful who are building with our platform and get to be the recipient of all the important work that our research team does and help developers make use of that.

I know there's a bunch of folks from the developer ecosystem here today. Before we dive into the talk about superalignment with some of the folks from our research team, just wanted to quickly put in context OpenAI's mission.

I think this talk is actually very, very relevant to OpenAI's mission, and for folks who aren't familiar with it, we're just trying to make sure that artificial general intelligence, which people call AGI all the time, highly autonomous systems that outperform humans at most economically valuable work, benefits all of humanity. I think one of the important ways that we need to make sure that these systems benefit all of humanity is by making sure that they actually are safe and aligned and useful to humans. So it's a super important part of the mission for OpenAI, and today we're joined by some of the folks who are pioneering the research that our superalignment team is actually doing, Collin Burns and Pavel Izmailov, and they're going to go through their latest paper, which is called Weak-to-Strong Generalization, and it's exploring how you can actually use weak or smaller models to potentially observe and supervise some of the more capable large models.

And again, this is really important work because as we get closer to models that are, quote unquote, super intelligent or superhuman, we need systems to be able to observe and understand what they're doing, and these folks, as we'll hear about, are sort of the ones who are pushing on the edges of this work. So we'll give really quick speaker introductions.

So we're joined by Collin. Collin's a researcher working at OpenAI, obviously, on superalignment. Before joining OpenAI, he did his PhD at Berkeley. His research interests were studying extreme forms of weak-to-strong generalization, developing unsupervised methods for making large models honest, which sounds really interesting, and then understanding when and how high-level abstractions are encoded in representations, which I'm going to need to double click on later, and hopefully we'll learn more about that.

Pavel's also on the research team at OpenAI. He's a research scientist working on reasoning. He was previously on the superalignment team, where he was focused on weak-to-strong generalization and interpretability, which is a very relevant and adjacent field, and he's now on a new team.

In 2023, Pavel defended his PhD in computer science at NYU under the supervision of Andrew Gordon Wilson, and his primary research interest lies in understanding and improving deep neural networks. Lots of other cool things like out-of-distribution generalization, AI alignment, reasoning, probabilistic deep learning. So hopefully we'll get some great questions about some of this stuff.

And Pavel's work has actually been acknowledged and received awards at ICML, which, if folks aren't familiar, is one of the premier machine learning conferences out there. So we are lucky to hear from them today. So Pavel, Collin, I will hand it over to both of you to kick this off.

Great. Thanks, Logan. Okay. Can everyone see my screen just fine? Can someone confirm? It's great. Great. Awesome. Thanks for showing up. We're really excited to present our work on weak-to-strong generalization. So this was the first major release from the superalignment team.

The superalignment team was started over the summer, and its goal is to build the fundamental techniques to align superhuman models. And I'll discuss a little bit more what that looks like.

One of the main motivations for our team is that AI systems are becoming much more capable very quickly. I think almost everyone has been astounded by the rate of progress in machine learning, and machine learning researchers most of all. And I don't think this is going to stop anytime soon. And so we should expect much stronger models to come much sooner, or very soon. And I think it's really important that we think ahead to what that might look like, and what the implications of that are.

One important consequence of substantially smarter models is that evaluating the behavior of these models becomes increasingly difficult over time. For example, if you have a model that is behaving in ways that I can't evaluate, then it may not be clear how I can train that model to behave in the ways I want it to behave. It also makes it difficult to evaluate whether models are behaving well or poorly, and it's hard to monitor whether they're safe or dangerous, and so on.

And so this is just a general phenomenon, that it's becoming harder and harder for us to evaluate model behavior. And this is important for alignment in particular, because right now, our core alignment techniques, for example, reinforcement learning from human feedback, or RLHF, fundamentally rely on reliable human supervision.

For example, if we have an AI system generating code and telling us about that code, for example in this case, saying that this code multiplies numbers by two, at least many humans can look at this code and basically understand what it's doing. We can see that this is, in fact, a simple program, it is safe, it is just multiplying numbers by two, the model is describing it faithfully, and so on. And we can say that this behavior is good. This is what allows us to train these models and reinforce this sort of good behavior.

And if the model behaved poorly, for example, if it said this multiplies numbers by four, we would know that that's false, and we would be able to penalize the model for behaving in that way. But this is basically how current alignment techniques work. This is how we control models to do what we want them to do. We evaluate their behavior, reinforce the good behaviors, and penalize the bad behaviors.

In contrast, we will possibly, over the coming years, have systems that are in important ways superhuman, in important ways smarter than we are, or more capable than we are, at least at important tasks that we care about. For example, maybe models in the future will generate code that even expert programmers cannot fully understand. If models do this, and they say some things about the code, for example, if they claim that the code is safe and secure, then unlike in the setting today, it's not clear how we can trust the model. It's not clear why the model should output honest answers, honest descriptions, and it's not clear why we should trust it. This is a serious problem for controlling these models to behave in the ways we want.

One way of framing the core challenge here is that humans will be too weak to evaluate superhuman models. We will not be as capable as these models at some important tasks. But it's not obvious how we study this problem today. We currently do not have superhuman models. Our models still make silly mistakes all the time. They're not as capable as we are at most tasks, despite massive gains over the past years. It's not clear how we can study this problem before we even have these superhuman models in the first place. How do we actually study this problem now, ahead of time, before this becomes a serious problem? We want to develop solutions before this is a major issue in practice.

What does most machine learning today look like? Most machine learning involves humans labeling examples and training models to predict those labels. For example, in supervised learning, we label a dataset of examples, and the model is trained to predict those labels. This more or less works reliably today, because humans can mostly reliably evaluate the sorts of examples that we care about today. Models are not yet capable enough, or rather, models are generally less capable than humans are, so humans can evaluate the tasks that we want models to do.

In contrast, in the future, we will have models that are more capable than us, or superhuman models, at least in important domains. In this setting, humans will not be able to reliably evaluate various behaviors that these models are capable of, so it's unclear how we can evaluate how well these models are performing and make sure they're behaving the way we want.

We can't study this directly today, we don't have superhuman models just yet, but we can study a simple analogy that we propose in this paper. We consider this analogy of, what if you take a weak model and use that to supervise a strong model? How well can a weak model control and align a strong model? This captures this core challenge of, can weak supervisors align much more capable models?

Concretely what this looks like today is, for a given task T, maybe it's a classification task like, is this code safe or not, or is it buggy or not, we first train a weak supervisor.

Most of the time, right now, we fine-tune a weak pre-trained model on this task using correct ground-truth labels. Here the weak labels, or the weak supervision, are basically the predictions of this weak model on held-out data. We also look at weak-to-strong performance, so we take a strong pre-trained model, for example GPT-4, we fine-tune it on these weak labels, and we see what its performance is. We then compare this to fine-tuning the same strong model, for example GPT-4, on ground-truth labels.

Intuitively, we want two and three, the weak-to-strong performance and the performance of the strong model trained on ground-truth labels, to be as close as possible to each other. If we can get GPT-4 or very strong models to try their best at some important safety task, even if we only have weak supervision, then we can use that to help control and align models. This leads to this basic metric of performance gap recovered, or PGR, which basically measures what fraction of the gap between the weak model and the strong model, for example the gap between GPT-2 and GPT-4, we can recover when we fine-tune the strong model, like GPT-4, on the weak labels. We want this to be as close as possible to one.

If we can make this one, that suggests that weak supervisors can elicit strong capabilities from much stronger models. This is important because this would allow us to train very strong superhuman reward models and safety classifiers. So if we can train reliable superhuman reward models, we can use this to train models to behave safely, even if they are more capable than we are. Or if we train a superhuman safety classifier, we can use this to catch unsafe behaviors from models at test time, even if the model behavior itself is too complicated for us to understand.
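The three steps of the setup and the PGR metric can be sketched roughly as follows. This is a minimal illustration, not the code used in the paper: the names `fit_weak`, `fit_strong`, `weak_to_strong_experiment`, and `performance_gap_recovered` are hypothetical, and the training functions are assumed to be supplied by the caller as callables that take inputs and labels and return a model.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration: a model maps an input to a label,
# and a "fit" function trains a model from inputs and labels.
Model = Callable[[object], int]
Fit = Callable[[Sequence, Sequence[int]], Model]


def accuracy(model: Model, xs: Sequence, ys: Sequence[int]) -> float:
    """Fraction of inputs on which the model's prediction matches the label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def performance_gap_recovered(
    weak_acc: float, weak_to_strong_acc: float, strong_ceiling_acc: float
) -> float:
    """Fraction of the weak-to-strong-ceiling gap recovered by the student."""
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("the strong ceiling must exceed the weak performance")
    return (weak_to_strong_acc - weak_acc) / gap


def weak_to_strong_experiment(
    fit_weak: Fit,
    fit_strong: Fit,
    train: Tuple[Sequence, Sequence[int]],
    held_out: Tuple[Sequence, Sequence[int]],
    test: Tuple[Sequence, Sequence[int]],
) -> Tuple[float, float, float, float]:
    train_x, train_y = train
    held_x, held_y = held_out
    test_x, test_y = test

    # 1. Weak supervisor: train the weak model on ground-truth labels,
    #    then use its predictions on held-out data as the weak labels.
    weak = fit_weak(train_x, train_y)
    weak_labels = [weak(x) for x in held_x]

    # 2. Weak-to-strong: train the strong model on the weak labels only.
    student = fit_strong(held_x, weak_labels)

    # 3. Strong ceiling: train the same strong model on ground-truth labels.
    ceiling = fit_strong(held_x, held_y)

    weak_acc = accuracy(weak, test_x, test_y)
    w2s_acc = accuracy(student, test_x, test_y)
    ceiling_acc = accuracy(ceiling, test_x, test_y)
    pgr = performance_gap_recovered(weak_acc, w2s_acc, ceiling_acc)
    return weak_acc, w2s_acc, ceiling_acc, pgr
```

As a purely hypothetical example, a weak model at 62% accuracy, a strong ceiling at 90%, and a student at 83% would give a PGR of roughly 0.75. A PGR of 0 means the student does no better than its supervisor, and a PGR of 1 means it matches the strong model trained on ground truth.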

So this is sort of the basic problem setup and how we might try to use this in the future. So with that, Pavel will talk about some actual results that we have.

Hi, everyone. I will be going through some of our technical results. Please switch to the next slide. So we have four groups of tasks that we consider in our paper. The first one is NLP data sets. So we have a large collection, about 22 different NLP data sets that cover different types of problems, such as question answering, natural language inference, sentiment analysis, and things like that. But we convert them all to the same binary classification format. So the model just has to, given a text, output a binary label, a 0 or a 1.

And then the next group of tasks is the reward modeling. So Collin already mentioned the reward models. But for those of you who are not familiar with this concept, in this RLHF or Reinforcement Learning from Human Feedback pipeline, one of the important steps is reward modeling. And that's basically where we are given multiple possible responses from the chatbot to a given prompt from the user. And we need to see which one of those responses is the best. So specifically, we consider a binary classification version where we are given two completions, two answers from ChatGPT. And we need to see which one of those answers is the better response to the given prompt.

The next one is chess. And for chess, we consider a generative setting. So here, we are given a chess position. And we need to see what is the best move in this position. And it's a generative task. So the model needs to spell out the move. So it's not just a classification problem. So it's a bit different, more complicated. Another thing to mention here is that there is a unique right answer. There is only one good move in each of these positions by construction.

And the last group of tasks that we consider is computer vision. So we work with the ImageNet data set, which is a classification problem with 1,000 classes. And the goal is to say, what is the object shown in the image? And in the presentation, I will only go through results for the first three groups of tasks. And for all of these tasks, we'll be using pre-trained base models from the GPT-4 family.

So we'll use a family of models, which are from the same group as GPT-4. And it's important to note that they are pre-trained. So we are not training from scratch. So in my part of the talk, I'll be going through a lot of graphs. And I think it's useful to spend a bit of time learning how to read them because they are a bit non-standard. And you might not understand them from the very beginning. So let's start by looking at this test accuracy graph.

So here we are showing the test accuracy as a function of the model size. And let's first focus on this ground truth label curve, which is highlighted, the darker black curve.

And then, so what we want to see in the plots is basically these curves going up and to the right. So that would mean that the accuracy of the student, a large model trained by the weak model, is better than the accuracy of the weak model itself.

So for example, let's consider the GPT-2 supervising GPT-4. In this example, GPT-2, when trained on ground truth labels, gets about 62% accuracy. And GPT-4, the full model trained on ground truth labels, gets about 90% accuracy. And that's the gap in performance that we are trying to recover.

But let's look at it in a bit more detail. So first of all, let's look at the NLP tasks. And this is where we have our best results across the board. So we see that when we fix a weak model and we increase the size of the strong model, we are getting better and better performance. So the curves are pointing upwards, and also the PGRs shown in the bottom row of the plot are all quite high. So up to 80% in the best cases.

But the biggest thing that we're trying to do more, that we're not being able to actually do, is recovering significant amount of information in the first case.

So when the gap in capacity between the student and the supervisor is not very large, so when, for example, GPT-3 supervises GPT-3.5, we get pretty good looking results.

But for the reward modeling, we have our worst results across the board. So we have uniformly pretty low PGRs. The curves look quite flat. So they are not exactly flat. The student is still doing better than the supervisor, but it's doing much worse than if we use ground truth labels for the same model.

So this is the main set of results, but this is with naive fine tuning.

So in fact, it turns out that we can actually do quite a bit better than that. And we have a few ideas that we describe in the paper. I'll talk about two of them next.

The first one is bootstrapping. So bootstrapping is this kind of classical and very intuitive idea in the alignment literature, where the idea is that instead of just directly supervising a superhuman model, we will first construct a sequence of models.

Maybe we will align or supervise a GPT-2 and make sure that it's aligned. And then we will use that model to align GPT-3, and it will align GPT-3.5, and that will align GPT-4 or something like that. Or eventually that will align a superhuman model.

But historically, this has been an idea. It's a very kind of natural idea in the alignment literature, but it's also quite abstract. But the cool thing about our setup is that you can actually do an experiment and try to do something like this. And that's what we're doing here.
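A rough sketch of this kind of multi-step supervision is shown below. It is only an illustration under stated assumptions: the function name `bootstrap` and the caller-supplied training callables are hypothetical, and details such as which data each step uses are not specified in the talk.

```python
from typing import Callable, List, Sequence

# Hypothetical types for illustration: a model maps an input to a label,
# and a "fit" function trains a model from inputs and labels.
Model = Callable[[object], int]
Fit = Callable[[Sequence, Sequence[int]], Model]


def bootstrap(
    fit_fns: List[Fit],
    train_x: Sequence,
    train_y: Sequence[int],
    transfer_x: Sequence,
) -> List[Model]:
    """Train a chain of models, ordered from smallest to largest.

    The first model is trained on ground-truth labels. Every later model
    is trained only on labels produced by the model just before it.
    """
    if not fit_fns:
        raise ValueError("need at least one model in the chain")

    models = [fit_fns[0](train_x, train_y)]
    for fit_next in fit_fns[1:]:
        supervisor = models[-1]
        # The previous model in the chain labels the transfer data.
        weak_labels = [supervisor(x) for x in transfer_x]
        models.append(fit_next(transfer_x, weak_labels))
    return models
```

The last element of the returned list is the largest student. Comparing its accuracy with a student trained directly on the smallest model's labels shows whether the intermediate steps helped.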

So we are constructing a sequence of models and we are doing this weak-to-strong supervision in multiple steps. And it turns out that it really helps on chess. Before, we were seeing the accuracy curves flatten out eventually, as shown with the dotted lines; that's our baseline.

Now with the bootstrapping, we are actually seeing continual improvements as the student size increases. And overall, the PGRs are also substantially improved. So yeah, it works on chess. Interestingly, on the NLP tasks, it also works, but the improvements were not as high, and on the reward modeling it didn't really help.

And for NLP we will talk about a different method that actually helps quite a bit more. So first of all, let's talk about intuition. When we naively train a strong model on the weak model predictions, we are just minimizing this loss. CE here stands for the cross entropy, f of x is our prediction with the strong model, and fw of x is the weak model prediction that we are using. So here we are making the strong model predictions as similar as we can to the weak model predictions. That's the objective that we're training with.
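Written out, the naive objective described here would look roughly like the following. The notation is an assumption for illustration: the talk only names CE, f(x), and f_w(x), while the expectation over inputs and the symbol for the loss are added here.

```latex
\mathcal{L}_{\text{naive}}(f) \;=\; \mathbb{E}_{x}\left[\, \mathrm{CE}\big(f(x),\, f_w(x)\big) \,\right]
```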

And what that means is that if you had an infinite data set and an infinite amount of time and you could solve this optimization problem perfectly, then you should just recover the weak model predictions. So the strong model should converge to just doing the same thing that the weak model would do and we will not have any improvement. We will not see any PGR. We'll just get zero. But what we actually want is something different. Let's go to the next slide.

So we want the strong model to be able to contradict the weak model predictions in certain cases. So when the strong model already has a pretty good idea of what the task is and it believes that the label should be different from what the weak model suggests, we shouldn't overwrite it. We should follow what the strong model wants to predict. And we can operationalize this intuition by using this simple adjustment of the loss. So here we are mixing in some of the predictions from the strong model together with the weak model predictions. So we use a linear combination of those two. Which means that instead of just training the strong model to make the same predictions as the weak model would, we are adjusting it to sometimes follow its own predictions. And there are some details here which I'm skipping for the sake of time. But this is the main intuition. And it turns out that this idea works really well.
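For a single binary example, the adjustment can be sketched as below. This is a simplified illustration of the "linear combination" idea only, not the exact loss from the paper, whose details the talk skips. The mixing weight `alpha`, the hardening `threshold`, and the function names are assumptions made for this sketch.

```python
import math


def cross_entropy(pred_prob: float, target_prob: float, eps: float = 1e-12) -> float:
    """Binary cross entropy between a predicted P(label = 1) and a target P(label = 1)."""
    p = min(max(pred_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target_prob * math.log(p) + (1.0 - target_prob) * math.log(1.0 - p))


def naive_loss(strong_prob: float, weak_label: float) -> float:
    """Naive weak-to-strong objective: imitate the weak label as closely as possible."""
    return cross_entropy(strong_prob, weak_label)


def confidence_loss(
    strong_prob: float,
    weak_label: float,
    alpha: float = 0.5,      # assumed mixing weight, not a value from the talk
    threshold: float = 0.5,  # assumed threshold for hardening the strong prediction
) -> float:
    """Mix the weak label with the strong model's own hardened prediction."""
    # Harden the strong model's current prediction into a 0/1 label.
    # In real training this target would be treated as a constant.
    hardened = 1.0 if strong_prob > threshold else 0.0
    return (1.0 - alpha) * cross_entropy(strong_prob, weak_label) + alpha * cross_entropy(
        strong_prob, hardened
    )
```

With `alpha = 0` this reduces to the naive loss. When the strong model is confident and disagrees with the weak label, the second term lowers the penalty for disagreeing, so the strong model is pulled less strongly toward the weak label than under the naive loss.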

So we can substantially improve the results in the NLP setting. So here we are again comparing to the baseline. And you can see that where before we are getting let's say like 30% PGR, we can go to almost 80% in certain cases. So it is actually working extremely well in NLP. And if we look at individual datasets, the results are quite striking. So on many of them we are basically seeing this very dramatic jump from pretty bad performance to basically recovering the full performance of the large model.

So with GPT-2 supervising GPT-4, we are almost recovering the full performance of GPT-4. Even when the weak model had around 60% accuracy, we can recover almost 90% accuracy, for example in the leftmost, the first experiment. There are some datasets where it doesn't actually work. So out of our 22 datasets, it seems that about two or three datasets show negative results from this loss. But they also tend to be datasets where we already see some weirdness even with the baseline. So for example here the performance doesn't just improve with the size of the strong model even when we train on the ground truth labels.

And a final thought is this still doesn't help on the reward modeling task. So both of the methods that I described don't help with reward models and we are still looking for a satisfying method, a satisfying resolution for the reward models. And yeah, this is still something for future work. So yes, those are the main results, and in the paper we are also spending a bit of time on understanding, and we are looking, for example, into how much the strong model simulates the weak model. And we also have some other results that I'll not be able to present here today due to time constraints, but I wanted to share this one understanding result.

So as I mentioned already, naively when we train the strong model on weak model supervision, one of the failure modes is that it will just mimic the predictions of the supervisor. The strong model will just make the same mistakes that the weak model would make, and that's what should happen if we had an infinite data set and an infinite amount of time. But in practice we are training on a finite data set, and so it's an open question how much the strong model will mimic the predictions of the weak model. And we can actually study this directly, so we can consider this agreement metric, which is the fraction of the test data points where the student and the supervisor make the same predictions. So after we train the student on the labels from the supervisor, what's the fraction of the test inputs that they label in the same way.

So here I'm showing the agreement for the NLP tasks, for both the baseline and the confidence loss that I described. And we're also showing the results separately for data points where the supervisor is correct and data points where the supervisor is wrong. And the first observation that we can make here is that the agreements are actually quite high, and in particular even on data points where the supervisor is wrong, which means that the student is actually inheriting some of the mistakes and biases of the supervisor. It's simulating some parts of the wrong signal that come from the supervisor. And another point is that the confidence loss actually helps reduce the agreement, and that was our intuition when we came up with the loss, and it seems to hold.
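The agreement metric, including the split by whether the supervisor was right, can be computed along these lines. This is a small sketch; the function name `agreement` and the returned dictionary keys are chosen here for illustration.

```python
from typing import Dict, Optional, Sequence


def agreement(
    student_preds: Sequence[int],
    supervisor_preds: Sequence[int],
    labels: Sequence[int],
) -> Dict[str, Optional[float]]:
    """Fraction of test points where the student and supervisor predict the same label.

    Reported overall and separately for the points where the supervisor
    is correct and where it is wrong. A subset with no points yields None.
    """

    def fraction_agreeing(indices: Sequence[int]) -> Optional[float]:
        if not indices:
            return None
        same = sum(student_preds[i] == supervisor_preds[i] for i in indices)
        return same / len(indices)

    everything = list(range(len(labels)))
    supervisor_correct = [i for i in everything if supervisor_preds[i] == labels[i]]
    supervisor_wrong = [i for i in everything if supervisor_preds[i] != labels[i]]

    return {
        "overall": fraction_agreeing(everything),
        "supervisor_correct": fraction_agreeing(supervisor_correct),
        "supervisor_wrong": fraction_agreeing(supervisor_wrong),
    }
```

High agreement on the `supervisor_wrong` subset is the signal described above: it means the student is reproducing the supervisor's mistakes rather than overriding them.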

So when we train the same weak-to-strong pair, the same supervisor and the same strong student, but with the confidence loss, we get less agreement, and in particular less agreement on the incorrect data points. And that is possibly part of the reason why it works. But another interesting thing is that we are seeing inverse scaling, and this is not something that we expected at all going into this project, and we were quite surprised. But you can see that all of these curves are sloping downwards.

They're going down as the student size increases. So for a fixed weak model the strong model is able to simulate the labels from the weak model worse and worse as the size of the strong model increases and this is unusual. Usually like the larger the model the better it is able to do the tasks that we are training it for. But here we are seeing the opposite and it's quite interesting and not something that we fully understand but I think we have some intuitions and also this points to a possible disanalogy because in the future we might expect that the superhuman models can be very good at mimicking our biases and mistakes unlike in our setting.

And with that, Collin will continue. Thanks.

So our work, I think, is very exciting because it makes it easy to study this problem, a really core challenge of superalignment, in a way that is, I think, analogous in many ways to the future problem that we'll ultimately care about. But ultimately our setup still has some limitations. For example, we make some simplifications, like we just focus on classification tasks mostly and just focus on cases where models can do some tasks in a single forward pass. We don't give them time to think in chain of thought. That's something that we'll work on extending in the future, but for simplicity this is one simplification that we made.

There are also some remaining disanalogies between our setup today and the future problem of aligning superhuman models that we ultimately care about. In the interest of time I won't go into these in detail, but we discuss them at length in the paper itself.

There's also a lot of different directions for future work. First of all there's a lot of room for improvement for better methods. So we showed proofs of concept that we can really control how models generalize. Very simple methods can drastically improve how well weak models can supervise strong models. I think there are a lot of ideas for how to improve this. Listed on the screen are just a few intuitions for ways you might be able to build on the ideas in this paper. In general I think there's just a lot of low-hanging fruit here.

Another very important question is if we train a model on weak supervision from humans, how can we trust that it generalizes out of distribution in the right way? How do we know if it's generalizing in the wrong way? Even if we don't have reliable labels to actually evaluate its behavior, what should let us trust it? This is a complicated question and I think there's more work to be done here in terms of better metrics and more science and so on.

Finally there's just a lot of really basic science that we're excited to do within this framework. For example there are lots of questions raised by our paper that we have initial ideas about but we don't have fully satisfying answers to. For example, why are results with reward models worse than for other tasks? We have some ideas, but we really don't understand.

More generally, what makes a capability easy or hard to elicit with weak supervision? Under what conditions can we do this in the first place? Also, how important are errors in the weak labels? If humans make errors when trying to label examples of this is good and bad behavior, if we make errors on one percent of examples, is that a big deal? Or are models pretty robust to that? We still don't really have a great understanding of this.

There's just a lot of basic scientific questions that we can really start to study now that we have this basic framework.

In conclusion, I think there are a few important implications of this work. First of all, weak supervisors can elicit capabilities that go far beyond their own. This makes me optimistic about leveraging generalization properties of neural networks. Deep learning has this really benign property that it often generalizes remarkably well, and we may be able to leverage this for alignment.

We already see that models generalize far beyond the weak supervision they're trained with. But this is still not enough. Weak supervisors still can't elicit everything that strong models know. There's still huge gaps remaining. There's a lot of work to be done. And in particular, we show that this is a real problem.

For example, I think this really shows evidence that existing alignment techniques like RLHF probably won't scale to superhuman models. There really is this gap that we show is not totally recovered just from naive existing methods. But I think I'm overall just really excited about this approach because there's just so much low-hanging fruit. We did a lot of initial studies, both scientifically and in terms of methods, and I think we made a lot of progress in a relatively small amount of time. I think there's just a lot that we can do to build on this. And so in general, I think it's really more important than ever to start working on these problems.

I think models are getting much, much smarter very quickly. We will need to solve these serious safety and alignment issues before they become extremely capable, before our existing techniques for controlling them stop working. And so I think our hope here is to sort of lay the groundwork for that today.

Back in 2017, the alignment team at OpenAI helped develop RLHF, the current dominant alignment technique. And a few years later, while it was initially more of a research idea, a few years later, it ended up making its way to product and being deployed. And today it's totally essential to all of our models being safe and useful at all. And our goal is to repeat that story. We want to lay the groundwork today so that in a few years from now, we can control and really use to the full extent of its capabilities the really capable models that we have in some years' time.

So with that, Pavel and I can take any questions you might have. Thank you.

Collin, Pavel, this was awesome. I know we have a couple of folks' hands up. I just wanted to kick off with a couple of questions that I wrote down as we're going through this. And you'll have to excuse me if these are bad or dumb questions. But as you were talking about this, I was curious, can you lay out your intuition as to why RLHF, which for folks who aren't familiar is the process of using human feedback to align models, won't end up being good enough to scale for superintelligence?

Yeah, that's a great question. So I think the basic intuition is the reason RLHF, reinforcement learning from human feedback, the reason it works today intuitively is that humans can evaluate model behavior. So for example, you can type in a question to ChatGPT. It'll generate a response. And then human evaluators go and say, this response was good, or this response was bad. And we use that human feedback to train the model.

Now, suppose you know nothing about coding. And suppose you tell the model to code something for you. And it does it. And it generates some output. And you don't know if it's... You don't really know how it's working. And as far as you can tell, it's correct, but you don't really understand what's going on. How can you tell if you should give it a thumbs up or a thumbs down? It's actually not clear what you should do in that case. It's not clear how you can evaluate this model and provide it feedback if you don't really understand its behavior in the first place. And so this is the basic concern. This is not a huge issue just yet because model behavior we mostly understand. But I think this will become an issue once models really are capable of behaviors that go beyond what we can evaluate. And for example, one sense in which we do see this to some extent today is to train models we increasingly need to use domain experts.

For example, we do need to get help from people who understand programming to evaluate models for coding. And so we already start to see this to some extent today. It is already becoming harder to evaluate models. But I think this will become much more of an issue in the future.

That's a great question.

Yeah, that makes a lot of sense.

Collin, do you mind actually stopping sharing your screen, just so that it'll show everyone's faces and it'll be easier to call on folks? That's perfect.

I'm also curious, just to build really quickly on that last question, is there another example beyond coding where you think the emergent properties of these more advanced systems will be hard for humans to understand? I'm imagining if I asked you today to make me an entire code base in some random language, I could go through and read all of that. And if I got a bunch of human experts, we could go through all that code. But I'm curious if there are other examples that are more complicated than that, where it just is no longer feasible to have humans do that work.

That's a good question. I think ultimately we'll want to apply real AGI to all sorts of domains. So we'll want it to do science for us. We'll want it to try to cure cancer. I think eventually, and this is a personal prediction, we'll have AI systems that can autonomously generate product ideas, create the products, advertise them, basically do everything a company does, just autonomously. And that looks very different from models today. And it might be a little while off, but there are just all sorts of things the model could be doing there. Some of it might be coding. Some of it might be very different. Some of it might be planning. Some of it might be reasoning about what sorts of strategies are most effective for some goal. Some of it might be communication between different AI systems themselves. All of which might be too complicated for us to fully understand. And so maybe that gives at least an initial flavor of what this might look like. But ultimately it is hard to say. It's always hard to predict the future exactly.

Yeah. I've got a really quick two-part question, and then we'll go to some of the folks in the audience. I know when we originally put out the superalignment work, there was a grant program. I'm curious if there's any sort of update on that that you can give for folks who are potentially curious about collaborating with OpenAI on this work. And this kind of coincides with the end of your presentation around the start of this field, but are there other people who are actually doing this work today around superalignment? Or is it really just the nine or ten people who are on the initial slide at OpenAI focused on this work?

So on the first question, about the Superalignment Fast Grants program: we actually just closed grant applications a few days ago and we got a lot of applications. So we now have to go through a lot of applications. But just from skimming through some of them already, it's really exciting. We got a lot of really great people applying, and so we're really excited to support them in all sorts of different areas of research. As you alluded to, this is still kind of a small field. I mean, we are trying to address a kind of unusual problem. We're trying to address a problem that we expect will arise in a few years, possibly something like that, but which is not yet really facing us today. But this field is gaining increasing interest, as evidenced by the many thousands of applications that we got. And also, the field is just growing really quickly. It's not just the superalignment team at OpenAI, which is maybe about 30 people today. Other major AI labs have similar sorts of teams. And there's also work in academia and independent nonprofit research labs, and so on. So it's still early, but it's really growing. And there's a lot of interest across many different sets of people.

I love that. Hopefully, we'll have some folks in the audience who applied for grants or who will be future collaborators. But I wanted to go, it looks like Grant, you have your hand up. I don't know if there's anything that I... It looks like you're unmuted. Grant, do you want to go ahead and ask a question?

Can you hear me?

Yes.

Okay. I don't know if you can see me. It looks like there's some issue with my camera. It's on, but it doesn't appear to be actually working. But yeah, so I wanted to offer some somewhat naive pushback on this method. It seems like you would eventually reach a theoretical limit where a human or a weak teacher couldn't provide any more feedback to a strong student. And again, my naive assumption was always that to safely release a superintelligent AI into the world, you would have to essentially solve alignment. With GPT-4 and stuff that's weaker than humans, you're already seeing issues where it can create dangers and misinformation and stuff like that. But it's not as big a deal because it's still weaker than humans and you can usually tell when something is AI written. But it's somewhat scary to see that we might be leaning towards this approach where we just kind of accept that we'll get diminishing returns as we get stronger and stronger AI models, kind of accepting that we will continue to not be able to evaluate them at these higher and higher levels. We can still elicit improvements, as you've shown, with these stronger models from weaker teachers. But do you see risks and issues there with relying on this approach? Essentially, as you get a bigger model, you open more and more holes where a bad agent could slip in dangerous code or something like that. And it seems like you'd get diminishing returns as you get a stronger model.

Pavel, do you want to take this?

Yeah. So, I mean, I think it's an open question. Maybe the higher-level intuition that we have, or at least the way I think about what we are trying to do, is that the strong model, potentially a superhuman model, should have out of pre-training some really good understanding of many fundamental things, like what leads to harm, or a notion of safe code versus harmful code, and things like that. And then the role of humans is basically to extract that capability out of the strong model. So instead of teaching the model from scratch to understand what is safe code or what are harmful actions, we are just trying to unlock that kind of representation, which should already be pretty saliently represented in the model out of pre-training. And we see a lot of reasons to believe that that is the case, just from prior work and from our work.

Right now, in what we presented, we are doing a very toyish version of this, where we are using our current models and a wide range of tasks where the model out of pre-training already has pretty good representations of the task. And we are able, with really bad labels from a very small model, to extract a large fraction of the performance that the model could have on this task if it tried its best. So that's how we are thinking about this.

So from this perspective, I would maybe disagree with the initial point that you made about diminishing returns and how humans can only provide very partial feedback. I think part of that is true: maybe there will be some underspecification from the human feedback, and maybe it wouldn't be clear how the strong model should generalize from what the humans say, for example if there is some ambiguity in the notion of a harmful action. But, and maybe Collin can add to this, one of the intuitions is that there are some very crisp and clear concepts that we can hope to elicit from the model even with imperfect supervision, or with supervision on easier examples.

Just one very quick thing I would add to that. One way of framing what we're trying to do is that we're trying to make the model honest, or we're trying to elicit certain things that the model already knows. So if the model is, say, actively lying to you or actively deceiving you, intuitively the model should know that it's doing so, in which case all we want to do is get the model to tell us that it's lying, or that this is false. And so that's all we're trying to do. We're not trying to teach the model per se, we're just trying to make it honest. That's at least one way of framing it that I find very useful.

Got it. Okay, I guess I didn't understand that there was already this kind of understanding baked into a pre-trained model. That's interesting. All right, thank you.

Yeah, thank you for the question, Grant. Ethan, you want to go next?

Can you hear me? I'm Ethan, I work on open source LLMs at NVIDIA. I have a question about data. As I understand it, weak labels usually require more data, so have you thought about scaling data as another axis? Two follow-up questions: first, when you do weak-to-strong with more data, do you expect the PGR to be more than one? And second, when you do the agreement among weak models, you mentioned that the models inherit the same errors. If you train them on different data, maybe they won't have the same errors and they can do actual agreement.

Maybe I can start. That's a great question, thanks so much. One thing to note is that it's hard to imagine a scenario where we would get a PGR above one. We can get it in practice, but that's usually just through some random noise. Fundamentally, you shouldn't expect PGRs above one: a PGR of one corresponds to doing as well as possible on the given task. And at least in our setting, we are not considering a setup where you could train the strong model on only one percent of the data and then do the weak-to-strong experiment on much more data.

Then I guess the second part of your question was about what happens as we increase the amount of data. That's quite interesting, because I think in some cases we can actually see worse PGR with more data because of overfitting. I didn't get to talk about that in the presentation, but the basic idea is that if you had an infinite data set and you solved your optimization problem perfectly, then the strong model should just make the same predictions that the weak model would make. The larger model has the smaller model as a subset of it, architecturally, so it can just uncover that subset and learn to make exactly the same predictions as the weak model. Through that it will get zero loss, and optimization wants to uncover that solution, and the longer you train, the more that should happen. In practice we do see that in certain cases: initially the accuracy goes up and the PGR goes up, and then it goes down later in training, where the accuracy is measured with respect to the ground truth labels and not with respect to the weak labels. That is somewhat mitigated by the confidence loss; with the confidence loss we are not seeing it as much, but it is a real thing. And I forgot the last part, I think it was about the agreement. Could you say it again?
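For readers following along, the PGR (performance gap recovered) discussed in this answer can be sketched in a few lines. This is a minimal sketch that follows the definition in the Weak-to-Strong Generalization paper: the fraction of the gap between the weak supervisor and the strong model's ceiling that the weak-to-strong student recovers. The function name and the example accuracies are illustrative assumptions, not numbers from the talk.

```python
def performance_gap_recovered(
    weak_acc: float,
    weak_to_strong_acc: float,
    strong_ceiling_acc: float,
) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the student.

    weak_acc:            accuracy of the weak supervisor on ground truth
    weak_to_strong_acc:  accuracy of the strong student trained on weak labels
    strong_ceiling_acc:  accuracy of the strong model trained on ground truth
    All three are measured against ground-truth labels, not weak labels.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        # PGR is only meaningful when the ceiling beats the weak supervisor.
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong_acc - weak_acc) / gap


# Hypothetical accuracies, chosen only for illustration.
example_pgr = performance_gap_recovered(
    weak_acc=0.60, weak_to_strong_acc=0.80, strong_ceiling_acc=0.90
)
print(f"PGR: {example_pgr:.2f}")
```

Under this definition, a PGR of 0 means the student is no better than its weak supervisor, and a PGR of 1 means it matches the strong ceiling, which is why values above 1 would only be expected from noise.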

Yeah, so the last part is, when you do the agreement, if you train on the same data, maybe the weak models are very similar, so they output the same errors. If you train them on different data, maybe they can generate different results and do an actual agreement.

So, I mean, all of the models are pre-trained in a certain way, the standard way that we do here, and so that gives them some similarity, I guess, to begin with. And then in our experiments the agreement is measured between the student and the supervisor. The supervisor is trained on some data, and then it labels other data that it wasn't trained on, and then the student is trained on that data. So in fact the student and the supervisor are trained on different parts of the data set, and the labels for the student are different from the labels for the supervisor.

Yeah, Ethan, thanks for the question. Let's go to Alan.
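The supervisor and student data setup described in the answer to Ethan can be sketched as follows. This is only a toy illustration under stated assumptions: the helper names (`split_in_half`, `train_threshold_supervisor`, `build_student_training_set`) are hypothetical, and a one-dimensional threshold classifier stands in for the weak model. It is not the code used in the research.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (input, ground-truth label)


def split_in_half(examples: Sequence[Example], seed: int = 0):
    """Shuffle and split so supervisor and student see disjoint data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def train_threshold_supervisor(labeled: Sequence[Example]) -> Callable[[float], int]:
    """Toy stand-in for a weak model: a threshold fit on ground truth."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if pos and neg:
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    else:
        # Fall back to the mean input if only one class was sampled.
        threshold = sum(x for x, _ in labeled) / len(labeled)
    return lambda x: int(x >= threshold)


def build_student_training_set(examples: Sequence[Example], seed: int = 0):
    """Supervisor trains on one half, then weak-labels the held-out half."""
    supervisor_half, student_half = split_in_half(examples, seed)
    supervisor_predict = train_threshold_supervisor(supervisor_half)
    # The student never sees ground truth, only the supervisor's labels.
    return [(x, supervisor_predict(x)) for x, _ in student_half]


toy_data = [(i / 10, int(i >= 5)) for i in range(10)]
weak_labeled = build_student_training_set(toy_data)
```

The point of the split is the one made in the answer: the weak labels the student trains on come from data the supervisor was not trained on, so the two models are fit on different parts of the data set.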

Hello. Hey Collin and Pavel, nice presentation. I definitely learned a lot. I just wanted to ask from an alignment perspective. I think Grant touched on this a little bit, but my understanding of this approach, and definitely correct me if I'm wrong, is that the idea would be to use a weaker model that we have aligned in some sense, since that would be easier than aligning the superhuman model, and then we would use this aligned weaker model to align the more powerful model. But something that worried me a little bit was, for example, the new loss that you introduced, which kind of incentivized disagreement, or incentivized the stronger model to prioritize its own intuition on the task. So I guess I was just wondering how you see this work extending to transferring alignment properties. I know you mentioned a little bit about honesty. And additionally, what kind of tasks would you want to try this approach on to convince yourselves that the properties we would want an aligned model to have would transfer from the weaker model to the stronger model?

Yeah, these are really great questions. So first of all, I think there are different ways of trying to apply this in the future. The way I most naturally think about it is that it isn't actually quite like we align a weak model and then use that to align the stronger model. It's more like we have this strong model that is in some ways more capable than we are along certain axes, and the weak supervisor in that case is actually humans, not a weak model. You could use a weak model if you wanted to; my guess is humans will be better, but that's sort of up to you. So I don't think about it quite as much as transferring alignment properties from one model to another. It's maybe more like communicating the desired behavior to the strong model somehow, despite being weaker than the strong one. I had two more points I wanted to make. Remind me, what were your other questions on this?

Yeah, sure. First of all, that's already very helpful. But I think another one I'm interested in is what kind of tasks you would want to... yeah, go ahead.

Great. So I think there are a number of different tasks you could try to apply this to, and several that are in some sense sufficient for alignment if we could get them. For example, if the task is "is this natural language text true or false?", and you can get the model to tell us everything that it knows, or try its best to tell us if something is true or false, that's basically honesty. Then you can get the model to be honest and prevent it from lying, and you can use all of its knowledge in that way: you can just ask it all sorts of questions and rely on its honest answers. Another task that we might try is instruction following. Given a task and some behavior, we can ask: does this behavior follow the instructions, or accomplish the desired task well, or not? This is kind of like the reward modeling setting that we consider. But there are other tasks you can imagine too. You can imagine trying to apply this more narrowly to "is this code safe or not?" or "is this behavior safe or not?", and maybe you can focus on clear-cut cases of clearly dangerous versus clearly safe.

So I think there are different tasks you could try to apply this to. In my mind it's an open question which of these tasks will be easier than others, and so I want the science to let us know which of these we should aim for, and which of these we should build our ultimate alignment scheme on top of. It'll depend on what empirically seems to work.

Okay, great questions.

Yeah, thanks a lot.

Yeah, thank you. Yeah, those were great questions.

I'm gonna do one more, and then I think somebody mentioned in the chat about whether or not we can answer some questions offline. It sounds like people are super interested in this stuff, Collin and Pavel, so I'll try to get these questions and send you some stuff on Slack, and we can try to get answers for folks whose questions we didn't get a chance to answer here.

But Declan, you want to do the last question live? And then if folks have other questions, please put them in the chat and I'll follow up on them.

Sure, no pressure. Thank you so much for doing this talk. It's refreshing because it's very different from what I normally get to learn every single day. I guess my question was about scalable oversight and some of what you were talking about with RLHF not necessarily scaling. Some of the reading that I've done has talked about including more open source evaluation data sets and other strategies where you're able to use domain expert question-answer sets. I read about, I'm going to say it wrong, but QuALITY, or the Google-proof QA. Where do you feel like that fits in, and what other domain expertise do you feel is necessary to add to the development of that field?

So first I'll explain how I view scalable oversight. I think it's actually closely related; in some sense scalable oversight and weak-to-strong generalization are very complementary. They're trying to solve the same problem from different angles, but in a way where if one only brings you this far, the other one can hopefully take you even further. They both help each other.

So scalable oversight is this basic idea of, well, if humans aren't capable enough to supervise models that are more capable than us, maybe we can improve human supervision. Maybe instead of humans directly labeling examples as good or bad, or correct or not, humans interact with an AI system that, for example, points out errors in some behavior or points out safety issues in some behavior. Intuitively, this can help the human label things better than they would otherwise be able to. We actually have a team focused on this in superalignment as well, so we're excited about this direction. I think it's trying to solve basically the same problem: how do you get a model stronger than you to behave in the ways you desire? But I would say weak-to-strong generalization is more like: suppose you have a fixed level of supervision, how far can you go from there? So this is kind of like, well, if scalable oversight is hard, how far can you go anyway? Or if scalable oversight takes us two-thirds of the way, how do we go the last one-third? Alternatively, maybe weak-to-strong generalization takes you two-thirds of the way, and you really need scalable oversight to validate the results, or to really take you the last one-third, which is the really important part. So I think they're totally complementary, and I think it's very promising as well. Good question.

Yeah, thank you for the question. Pavel, Collin, thank you so much for taking the time to do this. Based on the volume of questions, I think people are super excited about this work. I want to make a couple of quick closing remarks about future opportunities to engage with OpenAI in the forum. And again, Collin, Pavel, thank you for taking the time to do this. I'll follow up with both of you on Slack with some more questions.

For folks who have an interest in helping us evaluate some of our frontier models: if you're a current undergraduate, or a recent undergraduate within the last few years, and have done a biology course, or have completed PhD studies in genetics, virology, biochemistry, or other similar fields, we have an ongoing opportunity to help us evaluate our models in those categories. I think Natalie has a link that she can share with folks who are interested in helping us in that area. So if you're interested, please help and contribute. We also will be having an in-person forum event that folks are welcome to join us for, with Turing Award recipient and forum member Shafi Goldwasser, who will be doing a talk on mitigating backdoor vulnerabilities.

And again, that'll be in person and we'll publish it soon, so keep an eye out for that. But again, thank you everyone for being here. Thanks for all the incredible questions and the discussion, and hopefully we'll have Collin and Pavel back to do a follow-up in six months, when hopefully they'll have some more research and we'll hear about the grants that y'all are working on. So it's super exciting. Thank you all, thanks for being here, and thanks again, Collin and Pavel.

Thank you so much for having us, and thanks for all the great questions. Really, thank you.
