Improving Mathematical Reasoning with Process Supervision
Hunter is a researcher at OpenAI focused on improving reasoning and reliability in large language models. His most recent work was on improving mathematical reasoning with process supervision. Prior to OpenAI, Hunter worked on self-driving cars, building data infrastructure for Nuro.
About the Talk: In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
Full list of Authors: Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
Improving Mathematical Reasoning with Process Supervision, presented by Hunter Lightman.
Several of those in attendance today joined the forum specifically to engage with the math cohort, so welcome to those participants too.
Our presenter today, Hunter Lightman, is a researcher at OpenAI, focused on improving reasoning and reliability in large language models. His most recent work was on improving mathematical reasoning with process supervision. Prior to OpenAI, Hunter worked on self-driving cars, building data infrastructure for Nuro.
Welcome Hunter, so happy to see your face and we're so just pleased that you were able to take time out of your hectic conference schedule to join us.
Thank you so much, Natalie, for putting this together. Really excited to be here and excited to take this opportunity to talk to you all about our work, Improving Mathematical Reasoning with Process Supervision. So without much more ado than that, I'll jump right in, so let me share the presentation.
And Natalie already did a wonderful job of introducing me and a little bit of my background, but basically I've been at OpenAI for the past about a year and a half, having started as part of the residency program. And I've been working on this MathGen team, trying to solve problems about how we can make large language models better at solving hard reasoning problems.
And today I'm going to be focusing on presenting our most recent work on process supervision, which we published in this paper, Let's Verify Step-by-Step, with a whole bunch of fantastic collaborators listed here, including Yura, who I see in the audience down at the bottom. I think there's a link to that paper in the event description if you want to check it out afterwards for more information.
But this talk will be presented in three parts. For the first part, I'll go over kind of the history of using large language models to solve math and reasoning problems, kind of the motivation for how we got to where we are and what other work's been done in the field so far. Then I'll focus the bulk of the time on presenting our work on process supervision to get better at this problem. And then finally, I'll just briefly talk about some future work that we think extends from this work and how we see it relevant going forward.
At the end of the talk, like Natalie mentioned, we'll open it up for some discussion and questions. So if you have questions in the meantime, write them down, hold onto them, and we'll get to them at the end.
So we'll start off with history. And for this, I kind of like to start with the philosophy or the basis or the motivation for the kind of work that we're doing in trying to make language models like GPT better at mathematical reasoning.
And so this starts from a couple of core questions that I like to think a lot about. What does it mean for a large language model to reason? Can these autoregressive language models, which are just trained to do next token prediction, actually help solve interesting math problems? Can LLMs do more than just simple pattern matching? Is novel math problem solving more than pattern matching? Can we hope to ever build language models that can solve unsolved problems? These are the core questions of our group.
And we can reframe these questions a little bit more formally in the language of machine learning. So we can take a step back, and we can ask, what is the distribution that we learn when we train a GPT? And the answer to that is quite simple. We try to learn the distribution of human language. We read through all the data on the internet, and we tried to build a model that can understand all the different higher order patterns in the ways that humans communicate with each other and model that distribution to make tools like GPT-4, like ChatGPT, like these things that I know you all are familiar with, which, for the purposes of our work, leads to the next question. Does that distribution that we're learning contain the solutions to novel math problems?
And there's two answers to this. One answer is that, in theory, it does, because humans work on novel problems. And so the full distribution of the kinds of things humans would write on the internet in the past and present and future would contain solutions to all the kinds of hard math problems that we're working on and solving as we do more science and more math.
But in practice, the answer is a little bit less clear. And the way I like to frame a lot of this, when I think about it personally, is ChatGPT is really good at certain things. It's really good at writing lists of things to do in San Francisco, for example, because there's lots of examples on the internet of people talking about things to do in San Francisco. So that part of the human language distribution has a lot of support, has a lot of examples online of people talking about it. And so we're able to learn really effectively how to do language modeling in that area of the distribution.
But there are much fewer examples online of people providing solutions to math problems, and even fewer than that of people solving unsolved math problems, because they're unsolved. So the more formal version of this core question of our group is, given that we are training large language models on this distribution of human language, how can we amplify the important signal about math and reasoning to better model this part of the distribution that's a lot less well-represented? That's the core question of our group, and the core question of a lot of lines of research about how we might try to use these large language models, which are really good at these language modeling, these NLP tasks, to solve novel reasoning challenges and novel math problems.
And there's a whole history in the last couple of years of people trying to use the models exactly for this purpose. And so I'll go over a couple of the interesting papers over the course of this brief, but exciting history.
And so I'll start with this one from 2021, a couple of years ago, by OpenAI, where we trained verifiers to solve math word problems. The premise of this was that large language models were, at the time, pretty bad at mathematical reasoning. In the data set we worked on there, we just tried to solve grade school math word problems, the sort of things you'd get on a test in elementary school, and language models were pretty bad at them and would often get the wrong answer. But what this paper put forward was a strategy by which you could train these outcome reward models, where you show a language model lots of examples of positive and negative solutions to problems, correct and incorrect solutions, and you train it to distinguish between them, so that it can rate solutions based on how correct it thinks they are. And what we found in that paper was that this outcome reward model can drastically improve performance, which helps us start to unlock the secrets of how we can use large language models for mathematical reasoning. We start from this premise that they're sort of bad at reasoning, but then come to this conclusion that they're good enough that we can train these outcome-based reward models to actually boost their performance. So we know there's something going on inside the internals of these models, where they're doing something to model mathematical reasoning, and maybe if we can elicit that part of them, we can do even better at using them to solve hard math problems.
Another big area of work in the last couple of years is this whole area called chain of thought, where people realized that instead of just asking models to give solutions to problems, if they ask the model to reason step-by-step and show its work, we get a huge improvement in performance. And you can think of this as maybe having two different components. On one side, when you ask a model to reason step-by-step through a solution and show its work, you're letting the model use more computation to solve a problem, because it doesn't have to come up with an answer in a snap of its fingers; it can think a little bit and then come to a conclusion. And another way to think about how this might improve performance is that we're narrowing in on a specific part of that internet distribution we were talking about that's better at math reasoning. If you just look at all of the places on the internet where people put forward answers to problems, the internet's a lot of garbage, and so a lot of that data might be garbage, too. But if you look at the part of the internet where people are carefully reasoning through what it means to solve these problems, like math blog posts and Stack Overflow forums and things like that, you might have higher-quality data in those distributions, and if we can target that, we can do better. And so we find, as a conclusion, that having models reason step-by-step is one way to improve their performance.
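To make the chain-of-thought idea concrete, here is a tiny sketch contrasting a direct prompt with a step-by-step prompt on a GSM8K-style word problem; the exact prompt wording is just illustrative, not any specific paper's template.

```python
# Tiny sketch of chain-of-thought prompting on a GSM8K-style word problem.
# The prompt wording is illustrative, not a specific paper's template.

question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
)

# Direct prompting: the model has to produce the answer "in one shot".
direct_prompt = f"Q: {question}\nA: The answer is"

# Chain-of-thought prompting: ask the model to show its work, which lets it
# spend more tokens (more computation) before committing to an answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(direct_prompt)
print()
print(cot_prompt)
```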
Another approach, which the Minerva work from Google took, is adding more training data specifically about math. So like we talked about, the internet distribution has lots of examples of people talking about things to do in San Francisco.
One of Minerva's big contributions was building a huge data set specifically of math reasoning and putting that into the model's training, with the hope that it would better cover the part of the distribution that's good at mathematical reasoning. And that helps too. So these are a couple of examples of things people have worked on in this space. Now, to give a sense of the state of the art, of where we are as a community in solving this problem:
Two years ago, OpenAI introduced this GSM8K data set, which is 8,500 grade school math problems. I put one example of them here. And the state-of-the-art methods in 2021 solved about 55% of these problems, which I think at the time was about on par with what grade schoolers were getting. So it was exciting that we came up with methods to solve these problems using these language models, but they weren't necessarily that good.
Now, GPT-4 can solve about 92%. And so we pretty much consider the GSM8K data set solved, and have moved on to much harder math, where the main focus now is on this MATH data set. I put one example of a problem here, and you can see it requires a lot more intensive mathematical reasoning than GSM8K. It's much harder. State-of-the-art performance when it was released was 6.9%, so pretty difficult. And there were some other strange things going on with the difficulty of these problems: we noticed that scaling the models up, giving them more data, or using chain of thought didn't seem to help as much as we would think, suggesting that these problems were going to be a lot harder to crack.
Where we are today: prior to GPT-4, the state of the art was that Minerva paper I just mentioned by Google, which, in addition to adding a whole bunch more mathematical data to its training, used all of the other techniques we just talked about, and was able to get to 50.3% performance on the MATH test set. So that's where we are today, which brings us to the question: okay, these are a bunch of different methods we've tried so far. How else can we, going back to the core question, amplify reasoning in language models, improve their ability on this underrepresented part of the distribution, and make them more useful for solving mathematical and reasoning tasks?
So that brings us to the bulk of this talk, which will be about the work that my team and a lot of awesome collaborators at OpenAI have been working on for a while, and which has culminated in this paper on process supervision.
So the motivation for this line of research is a couple of things. One thing we've noticed is that models like GPT-4 have gotten a lot better at math in the last two years. They seem to be learning a lot about the structure of reasoning, maybe even a little bit more than we expected they would. There's a lot of modeling of this mathematical reasoning distribution going on in the language modeling they do. We've also noticed that forcing models to reason step by step, doing this chain of thought thing, improves their performance. And we've also noticed that reinforcement learning from human feedback, or RLHF, is a very powerful technique for getting large language models to behave in the ways we want them to behave.
So the questions that follow from that are: these models know so much about math, so can we tease out what they know using reinforcement learning from human feedback? And specifically, what might happen if we do RLHF at the step-by-step level, where we look at individual steps of reasoning and use human feedback to train our model to understand what it means for a step of reasoning to be good or bad?
A friend of mine put together this graphic, which I think does a really good job of showing kind of what we did here. So if you remember at the beginning of the talk when I was going over the history, I mentioned an earlier work by OpenAI on training these outcome reward models to improve math reasoning by training models to know if solutions to problems were good or bad. You just show them a lot of full solutions and you say the solution is good, the solution is bad, and you try to train a model to understand the differences. That's the world on the left, where we just show models whole solutions and we give it a thumbs up if it's right and a thumbs down if it's wrong, versus the world on the right, which we work on investigating in our paper, where we don't worry necessarily about the final answer and instead we worry about at each step if the reasoning makes sense to that point.
And so in this toy example, the question is like x squared equals four, what's x? And the solution, it's the same solution on both sides, gets to the right answer. But on the right side, when we do this process supervision, we're able to see that some of the steps in the process of solving the problem were wrong, even though the overall solution was right, which suggests maybe there's a lot of information to gather in this step-by-step feedback world that we're missing when we just try to look at the outcome reward model world.
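To make the two feedback formats concrete, here is a purely hypothetical sketch of how that same toy solution might be recorded under outcome supervision versus process supervision; the field names and the specific flawed step are made up for illustration and are not the paper's actual data schema.

```python
# Purely hypothetical illustration of the two label formats for the same toy
# solution to "x^2 = 4, what is x?". Field names are made up, not the paper's schema.

solution_steps = [
    "We start from x^2 = 4.",
    "Divide both sides by 2, so x = 4 / 2.",   # flawed step (divides instead of taking a root)
    "Therefore x = 2.",                        # right final answer, reached for the wrong reason
]

# Outcome supervision: a single label for the whole solution, based only on
# whether the final answer is correct.
outcome_labeled = {"steps": solution_steps, "final_answer_correct": True}

# Process supervision: one label per intermediate step, so the flawed step is
# caught even though the final answer happens to be right.
process_labeled = {
    "steps": solution_steps,
    "step_labels": ["good", "bad", "good"],
}

print(outcome_labeled)
print(process_labeled)
```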
And so for this paper, we investigate the differences between these outcome reward models on the left and these process reward models on the right. And to do that, we get a whole bunch of humans, we call them AI trainers, to help us label a whole bunch of data about mathematical problem solving by GPT-4, where we have the model generate tons and tons of solutions to problems, and we have our human labelers work in this interface, where they go through each step of that reasoning and mark each step as good or bad. And so here's just an example of the interface that they were using.
So as I mentioned, previous methods in this area have focused a lot on using outcome supervision to train these outcome reward models. These are a lot easier to train on the MATH data set: you don't need to collect human data, because the MATH data set has ground truth answers to each problem, so you can just generate lots of solutions, bucket them into ones with the right final answer and ones with the wrong final answer, and train an outcome reward model based on that. And because it's so easy, I think that's why a lot of the work has focused on training models like this.
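As a rough sketch of that construction (not the paper's code), here is how you might build outcome-supervision labels purely from ground-truth answers; `sample_solutions` and `extract_final_answer` are hypothetical helpers you would supply.

```python
# Rough sketch (not the paper's code): building outcome-supervision labels
# purely from the MATH data set's ground-truth answers.
# `sample_solutions` and `extract_final_answer` are hypothetical helpers.
from typing import Callable

def build_orm_dataset(
    problems: list[dict],                                  # each has "question" and "answer"
    sample_solutions: Callable[[str, int], list[str]],     # question, n -> model samples
    extract_final_answer: Callable[[str], str],            # solution text -> final answer
    n_samples: int = 100,
) -> list[tuple[str, str, int]]:
    dataset = []
    for problem in problems:
        for solution in sample_solutions(problem["question"], n_samples):
            # Label 1 if the sampled solution's final answer matches the ground
            # truth, 0 otherwise -- no human labeling needed for this.
            label = int(extract_final_answer(solution) == problem["answer"])
            dataset.append((problem["question"], solution, label))
    return dataset
```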
The way you evaluate how good one of these reward models is, is you generate some number of solutions to a math problem, say 100, rate each one with your outcome reward model, pick the best one, the one with the highest score, and assume that's the one that's right. We call this rejection sampling: you reject all the other ones and just take the best one. And state-of-the-art methods training outcome reward models on the MATH data set get GPT-4 to correctly answer 72.4% of test questions, which is a lot better than the 50.3% state of the art from before.
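The best-of-N evaluation described here, what the talk calls rejection sampling, is simple enough to sketch; in this hedged example, `reward_model` is just any scoring function you supply.

```python
# Sketch of best-of-N selection ("rejection sampling") with a reward model.
# `reward_model` is any function mapping (question, solution) -> score.
from typing import Callable

def best_of_n(
    question: str,
    candidate_solutions: list[str],
    reward_model: Callable[[str, str], float],
) -> str:
    # Keep only the highest-scoring candidate; all other samples are "rejected".
    return max(candidate_solutions, key=lambda sol: reward_model(question, sol))
```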
But it's not that much better than just using self-consistency, like we talked about before, where you have the model answer the question 100 times and you pick the answer that comes up the most, which for GPT-4 gets about 69.6%. So even without training any reward models, just asking GPT-4 for solutions to these problems and using the self-consistency approach, you can get 69.6%. And then if you add on this outcome reward model, you can get a little bit higher, to 72.4%.
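For comparison, the self-consistency baseline is just majority voting over final answers, roughly like this (again, `extract_final_answer` is a hypothetical helper):

```python
# Sketch of the self-consistency baseline: majority vote over final answers.
from collections import Counter
from typing import Callable

def self_consistency(
    candidate_solutions: list[str],
    extract_final_answer: Callable[[str], str],   # hypothetical helper
) -> str:
    answers = [extract_final_answer(sol) for sol in candidate_solutions]
    # Return the final answer that appears most often across the samples.
    return Counter(answers).most_common(1)[0][0]
```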
So what we did for our project was we collected about a million step-level labels across 100,000 solutions to 12,500 problems in order to train a process reward model, as I described before. And to evaluate that one, what we do is we generate a bunch of solutions to math problems, we rate all the steps, and we pick the solution where the steps look the best overall. Doing that, our process-supervised reward model gets GPT-4 to solve 78% of problems from that same representative test subset.
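With a process reward model, each step gets its own score, so the solution-level score has to combine them. One natural choice, and roughly what the paper describes, is to score a solution as the probability that every step is correct, i.e. the product of the per-step probabilities; in this sketch, `prm_step_scores` is a hypothetical function returning one probability per step.

```python
# Sketch of best-of-N selection with a process reward model (PRM).
# `prm_step_scores` is a hypothetical function returning, for each step of a
# solution, the model's probability that the step is correct.
import math
from typing import Callable

def prm_solution_score(
    question: str,
    steps: list[str],
    prm_step_scores: Callable[[str, list[str]], list[float]],
) -> float:
    # Score the whole solution as the probability that *every* step is correct,
    # i.e. the product of the per-step probabilities.
    return math.prod(prm_step_scores(question, steps))

def prm_best_of_n(
    question: str,
    candidate_solutions: list[list[str]],          # each candidate is a list of steps
    prm_step_scores: Callable[[str, list[str]], list[float]],
) -> list[str]:
    return max(
        candidate_solutions,
        key=lambda steps: prm_solution_score(question, steps, prm_step_scores),
    )
```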
One hypothesis for why the PRM does better is that the ORM faces a harder credit assignment problem: it only gets a single right-or-wrong signal for a whole solution and has to figure out on its own which steps were responsible. A natural extension of that: if this credit assignment problem is the main thing that's happening, could we train an outcome reward model on infinite data, and would it then perform as well as the PRM? Because then it would get way more data, way more computation, to work out this difficult credit assignment problem.
We ablate these decisions. We try to understand better the difference between the ORMs and PRMs by rerunning these experiments with much smaller models. So we only got one go at the whole process using GPT-4. It's very expensive to collect a million labels of mathematical problem solving and process supervision. But given that model that we trained, we can use it to supervise even smaller models. We can take smaller language models and supervise them using this GPT-4 model that we just trained to try to recreate the experiments from scratch, to change different parameters, to understand better what's going on.
And part of what we do there is we try to understand, by redoing these experiments, what the gap is between the PRM and the ORM. And we find that the gap is partially explained by false positives in ORMs. As you can remember from the earlier graphic, outcome reward models can have this issue where they reward incorrect reasoning that gets the correct answer, because they only care about whether or not the final answer is right. But using our methods, we can ablate this away. We can take the outcome reward model trained on just the final answer, and we can take our process reward model trained on every individual step, and we can compare them to an outcome reward model where we use the process reward model to make sure the reasoning is good the whole way through. And so we get these three models now.
We have the first outcome reward model and the first PRM. And we have this new outcome reward model where we get rid of all the false positives by using the PRM to filter out the false positives. And we see that improves our process a bunch. That's the green line versus the blue line.
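One way to picture that ablation is as a filter over the ORM's positives: a solution only counts as a positive if its final answer is right and the PRM also thinks every step is good. The threshold and helper functions in this sketch are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: filter out ORM false positives using the PRM.
# A solution only counts as a positive if its final answer is right AND the
# PRM thinks every step is good. Threshold and helpers are assumptions.
from typing import Callable

def filter_false_positives(
    question: str,
    solutions: list[list[str]],                              # each solution = list of steps
    final_answer_is_correct: Callable[[list[str]], bool],
    prm_step_scores: Callable[[str, list[str]], list[float]],
    step_threshold: float = 0.5,
) -> list[list[str]]:
    kept = []
    for steps in solutions:
        reasoning_ok = all(
            p >= step_threshold for p in prm_step_scores(question, steps)
        )
        if final_answer_is_correct(steps) and reasoning_ok:
            kept.append(steps)
    return kept
```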
And so part of the reason for this gap between the ORM and PRM is that the ORM has these false positives that make it difficult to learn good and bad reasoning because it has to give a thumbs up to bad reasoning that happens to get to the right answer. And we think that the remainder of this gap is explained by the value of this dense feedback, which maybe could be closed with a lot more data.
We also do some more investigation into active learning. During data collection, because we wanted to collect data as efficiently as possible, we only showed contractors convincing incorrect solutions. So basically, we got a bunch of solutions to these math problems, we picked the incorrect ones that had the highest PRM score, basically representing the ones that the model at that point thought were correct but were actually wrong, and we just showed those to contractors to label, step by step, as good or bad.
The idea was that by only showing our labelers the data that was difficult for the model to differentiate, we would get more mileage for our money: each individual label would be more valuable than if we just showed them uniformly sampled solutions. But we didn't know if this was true, and so the way we tackle that is by rerunning this experiment again at smaller scale.
And we can ablate both of these things. The orange line shows the performance of the PRM trained on uniform data: you just sample solutions uniformly, grade them, and use that to collect your data. And the red line shows the actively learned version, where we only grade the solutions that the current PRM thinks are convincing negatives. And you can see that's a little bit better, but not a lot better. It corresponds to about a 2.6x data efficiency gain over labeling uniform samples, so we were able to do as well as we did with maybe 2.6x less data.
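The selection rule behind that active learning step, surfacing the "convincing negatives" (wrong-answer solutions the current PRM scores highest) for human labeling, might look roughly like this sketch; the helper callables are assumptions, not the project's actual code.

```python
# Sketch of the active-learning selection rule: among wrong-answer solutions,
# surface the ones the current PRM scores highest ("convincing negatives")
# for human step-level labeling. Helper callables are assumptions.
from typing import Callable

def select_convincing_negatives(
    question: str,
    solutions: list[list[str]],                              # each solution = list of steps
    final_answer_is_correct: Callable[[list[str]], bool],
    prm_solution_score: Callable[[str, list[str]], float],
    k: int = 10,
) -> list[list[str]]:
    wrong = [s for s in solutions if not final_answer_is_correct(s)]
    # The highest-scoring wrong solutions are the ones the current PRM is most
    # fooled by, so labeling them should be the most informative.
    wrong.sort(key=lambda s: prm_solution_score(question, s), reverse=True)
    return wrong[:k]
```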
And the final thing worth talking about for this project, which we think is exciting, especially because at OpenAI we develop these powerful models and spend a lot of time thinking about how to align them to human values, is what implications this project might have for AI alignment.
So AI alignment literature often discusses this idea of an alignment tax where aligning a model to human values might make it less capable since, for example, it can't cut corners to meet its objectives. And so there's this idea that if we make models more aligned to our values, more aligned to doing things the way humans do them, we might see reduced performance. And then people might not want to align their models, their capable models, to human values because they get better outcomes if they don't.
And so it's really important for us to figure out ways that we can successfully align our models without making the incentives such that people don't want to do that. And it's possible, as we see from our project, that process supervision is an example of a negative alignment tax. Because we find that by supervising at a step level and ensuring our models output really human interpretable reasoning that humans like step by step the whole way through a solution, the model actually performs better than if we just do end-to-end optimization, which often results in uninterpretable, less aligned reasoning.
If these results are shown to generalize outside of math, we could find that process supervision might be a powerful way to align increasingly capable AI models to our human values. And so that overviews our work on process supervision.
And the last thing I want to talk about briefly is future work that extends from this project and could possibly have implications on how we do math with language models in the future. So three general things to talk about.
First is thinking about ways to make our PRMs more data efficient. Can we learn an effective PRM with fewer than 12,000 problems? Would more problems help more? It's important to understand, when you try to scale these systems, how data hungry they are, how many different problems you need to learn from. And so that's definitely an area that we're super interested in and trying to figure out more about.
And on the same idea of data efficiency, can this active learning procedure be improved? For example, we just took, for our experiment to understand the impacts of active learning, we just trained one PRM and used that to rate all the solutions and just did one cycle of that. But it's possible you could break it up into multiple iterations, where you do a little bit of learning on some new data, you train a new PRM, and use that PRM to get more convincing negatives. And then you repeat the process over and over again with a little bit more data each time to really make sure that every piece of data you gather is as efficient as it can be.
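Sketched under those assumptions, that iterative version is just a loop: train a PRM on the labels so far, use it to pick new convincing negatives, get those labeled, retrain, and repeat. Every helper named here is a hypothetical stand-in, not the project's actual tooling.

```python
# Hypothetical sketch of an iterative active-learning loop for process
# supervision. All helpers (sample_solutions, train_prm, human_label_steps,
# select_convincing_negatives) are stand-ins, not the project's actual code.

def iterative_active_learning(
    problems,
    sample_solutions,            # problem -> candidate solutions (lists of steps)
    train_prm,                   # labeled data -> a PRM scoring function
    human_label_steps,           # solutions -> step-level labels from annotators
    select_convincing_negatives, # (prm, solutions) -> most convincing wrong solutions
    n_rounds: int = 5,
):
    labeled_data = []
    prm = None
    for _ in range(n_rounds):
        to_label = []
        for problem in problems:
            candidates = sample_solutions(problem)
            if prm is None:
                to_label.extend(candidates)                    # first round: no PRM yet
            else:
                to_label.extend(select_convincing_negatives(prm, candidates))
        labeled_data.extend(human_label_steps(to_label))
        prm = train_prm(labeled_data)                          # retrain on all labels so far
    return prm
```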
And so under this idea of PRM efficiency is like, how can we most effectively gather data from humans to align our models to produce good math reasoning? It's really expensive to gather a million labels. Could we have done it with a lot fewer? Can we do a lot more with the data that we gather?
Also of interest to us is understanding just how powerful these PRMs are. Collecting this large process supervision data set was expensive, so to lower the barrier to entry for researchers to explore the power of PRMs, we open sourced all the labels we collected as our PRM800K data set, which you can find on GitHub, and which we hope can catalyze more research into how to build powerful PRMs with the kind of data that we gathered.
And also on this thread, so for this project, we only use the data to train reward models that could rate solutions. But there's this open question of how we could use it to train models that output better solutions in the first place. There's this interesting conundrum kind of implicit in our results where we show that the language models are able to be taught that what they're saying is wrong.
We can train PRMs that train language models like GPT-4 to point out incorrect steps of mathematical reasoning. And so that shows that these models have enough understanding of math to know when individual steps are wrong or right, but they still say the wrong things. And so is there work we could do with this knowledge that they know that what they're saying is wrong or that the models understand something about wrong math reasoning to get models that are able to reason better in the first place, that say less things that are wrong, or things along that line?
And then finally, of a lot of interest to us, is generalization outside of math. A lot of the work so far on reasoning with language models has been in the area of mathematical reasoning, and I think there's a pretty clear reason for that.
It's a lot easier to test your model on math questions than it is to test it on logical reasoning in other domains. You might imagine there's logical reasoning in philosophy domains or creative writing domains or different sorts of things, but it's a lot harder to evaluate that than it is to just evaluate mathematical reasoning where we just get an answer key.
But it's really important to understand whether these sorts of results can generalize beyond math, both for the power of these PRM methods in general, and for understanding how they can be used to better align models in the future.
Thank you, Hunter. That reminds me of the story I'm always telling my son when he's getting frustrated and trying to just rush to the right answer.
He's in seventh grade going into eighth grade, and I'm always telling him, Fela, honor the process. Honor the process. Validate your steps. So thank you. You validated my mothering, Hunter.
You've also validated my mom, who used to say the same thing to me all the time.
Really? Oh, wonderful.
We do have a couple of questions. We'll start with Daniel. You can go ahead and unmute yourself. Followed by Sean.
Thanks so much. Thanks for the great talk. So I was wondering if you could just talk a little bit more about the MATH data set. What is the model actually achieving 79% performance on?
Totally. So MATH is this data set, you can just Google it, the MATH data set by Dan Hendrycks. And it's focused on high school level competition math. So the sort of things that high schoolers might work on, not quite International Math Olympiad, but ranging from the course material of most of your high school level classes up through the AMC, if you're familiar with those kinds of math competitions. And so it covers topics from pre-algebra to some simple calculus, to counting, to discrete math, and things like that.
Awesome. Thank you, Daniel. We have Sean.
Yeah, I really enjoyed the talk. So I was wondering, given how expensive human data is, to what extent do you think synthetic data can play a really large role? It seems like it could apply pretty generally. And I'm wondering, with a step-by-step approach, and maybe overall, how do you think about synthetic data being combined into this kind of data set, or being used to push model development further?
Absolutely. I think that's a really, really important question. Whenever you try to tackle a problem at a place like OpenAI, where we pride ourselves on our ability to build these really huge models on these really huge data sets, a question that's always at the fringe is, how do we scale this up and make it more powerful by making more data, or by spending more computation on the data that we have? And this is always an implicit problem whenever you're doing a human data collection experiment: human data just doesn't scale that well, because its cost is kind of just linear the whole way through. And so the question you're getting at is, how can we lift off from being bound by the cost of human data? And you mentioned synthetic data. I think that's a really, really valuable thing to think about, and definitely something we've been exploring more of. We know RLHF is really powerful at aligning language models to behave the way we want them to behave. There are other methods that do RLHF-style things with synthetic data. Anthropic has a paper called Constitutional AI, where they talk about trying to do things like RLHF using their own language models to evaluate themselves. And so maybe there's work here to be done. Can we use GPT-4 to evaluate the step-by-step reasoning the way we already had humans do, and provide a little bit of signal that way? I think it's pretty inconclusive so far, but I think it's really interesting to think about and explore.
Yeah, thanks for those details.
Thank you for the question, Sean. Jeremy, you have a question for Hunter?
Welcome, Jeremy.
Hi there. Thank you. Yeah, I had a couple of questions. One of them was a bit similar, which is, there's been a lot of interesting work recently on expanding manually generated data sets using synthetic data sets. So things like Orca, I suppose, and kind of building on top of Flan. So I was wondering if you had looked at that particular direction, and other directions more generally that people have used in other fields, like the Textbooks Are All You Need paper. What things have you tried, and what have you tried that didn't really work? And then I'm also really interested in this: there's been some research that shows or suggests that pre-training general purpose large language models on code doesn't just help them with code, but also helps them more generally. I'm wondering whether you found, have you tried including math problems in pre-training? Does that help other non-math stuff? Does including code in pre-training help math stuff? Have you got any more thoughts about that?
Yeah, totally. So there's kind of two parts: thinking about ways to synthetically amplify data sets, and then what the important data is to include during pre-training. So I'm definitely super interested in, as I was just saying, synthetic amplification of the data set, and I'm currently trying things in that direction. You kind of see a little bit of it in this paper, because when we train these smaller models, we're training them entirely on synthetic data, data labeled by GPT-4. And so it kind of shows that you can use synthetic data, but it's a little bit different because it's still just distilling the data through GPT-4. What's really important for these synthetic data approaches is to figure out how you can improve a big model with that same big model, not just how you can improve small models with a big model. And that's still a lot more inconclusive. I don't think anyone has really conclusive findings on that in any way. I think there's definitely work to show that you can use synthetic data to improve things that the models already have a good understanding of. I think the Constitutional AI paper is a really good example of this. I'm not as familiar with the Orca one off the top of my head, but the models already have a pretty good understanding of whether a response to a question is violent or inappropriate or something like that. And so in that case, you can use the model's existing understanding of those things to improve its ability to be less violent or less inappropriate. With stuff like math, it's not so clear that it's easy to elicit this mathematical reasoning synthetically. We often find cases where you can show the base model a really obviously incorrect piece of math and ask if it's correct or incorrect, and it just says, oh, I think this is correct. And so I think there's still some very important work in figuring out how to use synthetic data to amplify reasoning content that is not the same as stuff that's already been done. And then for data sets in pre-training that assist with reasoning: 100%, I think it's really important to investigate what we think of as the transfer between these different domains, how math helps code and so on. And you can see from Minerva, from Google, that their math data set is not just math. It's a whole bunch of data from arXiv and from the internet across various different subjects, suggesting this general understanding we have in the field that high-quality data about reasoning through hard problems is really important, and finding more of it could be really valuable.
Thank you for those great questions, Jeremy. So Kristen is in the airport, so I'm going to ask her question for her, Hunter. And just for context, Kristen is the chief learning officer at Khan Academy. She first says, thanks for a really accessible presentation. As someone applying these models as a math tutor at Khan Academy in our application, 78% or even 92% are not good enough for learners. But we love the idea of process models, because students are walking through steps. Asking you to look into the future: do you think we will be able to get closer to 98% or 99%?
Totally. Really good question. The whole time working through this project, I was looking at some of the Khan Academy stuff going on on the side and was just so excited about the ways in which this could be applied there. Because, I mean, you know this way better than me, but if you can democratize access to math tutors at an individual level for students, that's just so powerful. So super excited about all the work you're doing there. On the 92%, a couple of threads. So 92% I think is a little bit misleading. I think the best you could do on GSM8K is like 94% or something, because about 6% of the questions have mistakes in the answers. So it's very close to about as good as a human could possibly be. And then it's also probably about as good as humans actually get. Dan Hendrycks, in his original paper on MATH, shows that if you give his dataset to college students, or to someone who did some competition math in high school but never quite got to IMO level, they get something like 80% on the dataset. And so we're quickly approaching that level of performance. It's a little bit misleading because they only gave
So I can see a world where the models perform at about human level, not perfectly, but well enough to still be useful tutors to critical students who aren't just going through the motions, but are thinking about the problems, are able to point out things the model doesn't understand, and work together with the tutor. So that's one thing. Another thing I think is important to realize is that for applications like tutoring, the model can see the ground truth answer, which hopefully boosts the performance a lot more. So between these two things, and that's a little bit of a rambly response to a really cool question, but there's a lot to it, I kind of hope that we can come up with methods that use language models to give students the one-on-one instruction that so many students lack, but that don't require the models to be perfect, or where we can use other methods, like showing the model the answer, to get closer to perfect than we can get with the models alone.
Thank you, Hunter. I think that was great. She said, that's our hope, too. Thank you. Thank you, Kristen.
Sean, I hope you don't mind. I'm going to call on Daniel Lit first, because he hasn't had a question, and then you'll be next, Sean. So Daniel, please feel free to unmute yourself to ask your question.
I did actually have a question earlier, so maybe Sean should go first.
Oh, did you? Oh, Daniel. Yeah, I'm looking at the hands and not the faces. Yeah, go ahead, Sean.
OK, well, actually, mine is related to Kristen's question as well. So hopefully this is a good segue here anyways.
Hunter, to what extent does it get tripped up on just numbers in general? Like, hey, it has the right reasoning, but the numbers are wrong. And then maybe as an extension, does cheating come into play, where it actually plans out how to use a calculator tool but still does all the reasoning itself? So now the language model is trying to figure out how to use the calculator tool and get the numbers back from it, but it's still doing all the reasoning and planning. Does that get us to like 99%? Or does that kind of defeat the whole purpose of what we're doing here anyway?
That's a great question. I think tool use in language models is really, really important. There's basically two threads here. Tool use so far doesn't get us above that 78%, but I do hope that it might. Part of the reason why it doesn't is that our process reward model is actually really good at checking these single arithmetic steps. So when you use a model like GPT-3 or earlier, it makes a lot of these dumb math mistakes. GPT-4 makes them less often, but still makes them. But our process reward model is actually really good at highlighting those mistakes. And so if you look at the samples from the model, you see, for a hard trig problem where there's a whole bunch of algebra to do, you might have 99 solutions where it makes a mistake at some point, and just one solution where it gets it right. And our process reward model is able to throw out all the solutions where it makes mistakes and go to the one where it gets it right, which, in some theoretic-ish way, is making our PRM act like that calculator. It's able to filter out all the bad math, which is why we think the calculator-versus-PRM distinction is not that wide. At the same time, there are kinds of reasoning where maybe it's not important to get that much better, where calculators can help. There are things where you might be able to write a simple Python program with a for loop to figure out the prime divisors of a number, or use a library, rather than reasoning it out by hand. And so tool use is important. We're not super focused on it right now because there are other people at the company who are. We're more focused on the reasoning in general.
I like the query planning. Yeah, that is helpful. So all right, cool. Daniel, go for it.
Thank you, Sean. OK, Daniel, back to you.
Thanks. So I was wondering, oh, hi, Srinik, by the way. I was just wondering, so I'm primarily interested in getting large language models to write proofs. And I imagine doing what you've done for problems that require a proof would be much more expensive to hire people. So I was wondering if you could just talk a little bit about the prospects of this kind of work you've done for things where the LLM has to write a proof.
Yeah, so one of the things I like about this approach is that it doesn't rely on the math problems having final answers like a lot of other approaches do, which actually makes it more amenable to proof-based things, because you just check each step. You don't need a way to automatically check that the final answer is right, which is exciting. But you do get at a really good point, which is that it's increasingly hard to find people to supervise these models at harder and harder tasks. The original RLHF paper had humans supervising simple instruction-following tasks. And then when we came along and tried to figure out how we could do this with math, we spent a whole bunch of time iterating with the folks on the human data side, figuring out whom we could actually recruit. We ended up finding a bunch of math and physics PhD students at colleges who wanted a job that was kind of like TAing and paid a little bit more than TAing. And that was kind of necessary to get this project off the ground. If we want to get to even harder math, you need people with even more expertise. If you want to get to IMO-level math, you need the kinds of people who can supervise solutions to IMO problems. That's hard. It's going to be expensive. It's going to be harder to scale. I think one of the things on our research side is figuring out how to make little bits of data more valuable, because it's going to be hard to get a lot of that data. And another thing on the side here, which Natalie is helping a lot with, is that hopefully by building out this community, we can find a lot of people with a lot of math expertise that we currently struggle to find, and find ways for them to engage with the models that might actually be helpful for the reasoning.
Yeah.
Thank you, Daniel.
So Sarah doesn't have a question, but she kind of did have a statement. Sarah, would you like me to just read that for you? Or did you want to unmute yourself?
OK. I'll read it for Sarah. She says, given the use of Stack Overflow for initial training on math reasoning, she just wanted to share these charts from Ayhan Fuat Çelik about the 35% decline in Stack Overflow traffic and a 50% decline in usage over the past 18 months. So just to let everyone know, that link is in there. And curious, Hunter, if you had any thoughts on that.
Yeah. I don't want to talk too much personally about the policy details of all this stuff, but I totally want to echo that. Alignment for large language models and safety of deployment for large language models involve a lot of complicated issues. And figuring out how all that stuff works is hard. Yeah. Totally.
OK. We have time for one last question if anyone is interested. Also, Hunter has had a long day at a conference and is in Hawaii. So we could let him go a little early for a walk on the beach if that's it.
OK. Thank you so much, Natalie. My pleasure. Good night, everybody.