OpenAI Forum

Event Replay: Learning Powerful Models: From Transformers to Reasoners and Beyond

Posted Oct 08, 2025 | Views 64
# OpenAI Presentation
# AI Research

SPEAKER

Lukasz Kaiser
Deep Learning Researcher @ OpenAI

Lukasz is a deep learning researcher at OpenAI and was previously part of the Google Brain team. He works on fundamental aspects of deep learning and natural language processing. He has co-invented Transformers, reasoning models, and other neural sequence models and co-authored the TensorFlow system and the Tensor2Tensor and Trax libraries. Before working on machine learning, Lukasz was a tenured researcher at University Paris Diderot and worked on logic and automata theory. He received his PhD from RWTH Aachen University in 2008 and his MSc from the University of Wroclaw, Poland.


SUMMARY

Łukasz Kaiser’s OpenAI Forum talk, “Learning Powerful Models: From Transformers to Reasoners and Beyond,” offered a research-focused but deeply values-aligned reflection on how AI is evolving from data-hungry systems toward reasoning models that learn more efficiently and safely. The framing he used emphasized safety, learnability, and human-like reasoning. He consistently underscored that making AI more learnable from less data and more computationally powerful ensures that progress in AI remains beneficial, efficient, and accessible to all, rather than concentrated among a few actors.


TRANSCRIPT

Welcome, everyone. I'm Natalie Cone, your OpenAI Forum community architect and member of the Global Affairs team here at OpenAI.

Welcome to our very special lunchtime talk, Learning Powerful Models from Transformers to Reasoners and Beyond. Many exciting and wonderful things are happening in AI all the time. But how do we understand the bigger picture? Why are these breakthroughs happening now? And what core research is driving them?

In this talk, OpenAI research scientist Łukasz Kaiser presents a simple model for thinking about progress: how we are making increasingly computationally powerful systems more learnable. Like any model, it has its flaws, but it has guided Łukasz from co-authoring the Transformer architecture in Attention Is All You Need in 2017, to advancing model-based reinforcement learning for Atari in 2019, and most recently, to co-developing contemporary reasoning models at OpenAI.

Today, our special guest, Łukasz Kaiser, will share how his framework helps him organize his thinking and research around AI and what it may mean for the future.

Łukasz Kaiser is a deep learning researcher at OpenAI and was previously part of the Google Brain team. He works on fundamental aspects of deep learning and natural language processing. He has co-invented transformers, reasoning models, and other neural sequence models, and co-authored the TensorFlow system and the Tensor2Tensor and Trax libraries. Before working on machine learning, Łukasz was a tenured researcher at University Paris Diderot and worked on logic and automata theory. He received his PhD from RWTH Aachen University in 2008 and his MSc from the University of Wroclaw, Poland. It's an incredible privilege to have Łukasz in the forum with us today, bringing us up to speed on how a leading deep learning researcher thinks about AI progress and the future of our field.

Please help me welcome Łukasz to the OpenAI Forum stage.

Thank you, Natalie. Thank you very much for the invitation and thank you to everyone for joining us.

I will give a talk today about my broad journey with AI. It started about a decade ago and is still going on. I am very happy to tell you how I think about AI in a very broad, general way and how this guides what I work on.

You know, AI is a very busy field. I used to do automata theory before, and that's a far less busy field. There are some problems to be solved in the field, some problems get solved sometimes, but you can step away for a few years and the problems may still be there to be solved.

In AI, it can feel like every day or every week there is a lot. You have self-driving cars on the streets of San Francisco. There may be someone building a robot somewhere in a factory. There was a new video generation model. You can chat with ChatGPT, generate some images. There was a lot of discussion about whether AI will take some jobs, what will happen.

So in the influx of all of this, it may be very hard to find a path through it to guide your thinking.

What is really important? How did it come to be? And what will happen later? And to me, I have a very simple model in my mind. And because it's simple, it's not fully true. It can't explain everything. But the basic question that I always ask myself is, can machines learn from less data? And you may not think at first sight that this is the most important question. You may think, well, maybe the question should be, how should we scale to more GPUs? Or how should we make multimodal AI? And many people ask different questions. So I want to convince you that the question, can machines learn from less data, is actually a very crucial question that can help you think about how AI has progressed and also how it will progress.

So why do I think machines need to learn from less data?

Let me tell you a story. It was just a few months ago, not that long. I went to Berkeley for a talk by a famous robotics professor, and he has used the same slide in his talks for quite a while. And the slide is always: where are our self-driving cars?

So I wanted to laugh because I took a self-driving car to get to the Metro in San Francisco. Self-driving cars are very common now. But then before I laughed, I started thinking, well, but it couldn't take me to Berkeley. It's just, you know, you just need to drive over the bridge. It can't do that. Why can't it do that?

Any human driver, any human person that has driven around San Francisco for as much as these cars have, and that's already quite a lot, would easily drive you anywhere around California. So why can't the car reach the same level as a human? And of course they want to be safe, but humans are also reasonably safe.

So there is some problem. These cars need way more training than we need to reach the same level of performance. Way more training maybe not in terms of hours, because that's hard to compare, you know, they have simulations and GPUs and so on, but we just spend a number of hours in the world and learn things. And this is, you know, I think most AI researchers have been to some extent driven by these fascinating things: how humans learn, how we learn. And I believe this is still extremely fascinating.

So how is it that we can learn from so little data and the machines can't quite yet, even though, as I show you, they're getting better. And you may think, okay, but is this only relevant to self-driving cars, to small things? And I believe not. I believe this is the question for a lot of things that we kind of hope and dream for.

from AI. So, in particular, if you want AI to do science, which recently is something which OpenAI and many other labs really want to push for, we would like an AI scientist, right? We would like AI to step in and say, can you help us cure this disease? Can you help us cure cancer? Can you help us get us to Mars? Can you help us build new materials? This is the future we dream of. AI was also a promise of this bright future that we're still kind of waiting for.

We don't want just self-driving cars. We want flying cars, right? We want rockets. We want a better world. But if you think of what a scientist does, it's always something new, right? I mean, you learn all the old things, but you study new things. You run new experiments. You look at some data and you're like, okay, now this is something new that wasn't known before. And maybe you have just this one thing that's there about it. If you do research in mathematics, you know, maybe there are like three or four papers about some objects that you want to research, but there is not a wealth of literature. It's not like programming competitions where you can have 100,000 examples from things that came before. It's something where you need to take this very little data you have and think very hard about how this can come together, and from the things that you did, from your little experiments, learn how to do the next thing and again learn from it and progress in this way.

So I believe that machines learning from less data, from very little data, is absolutely crucial both for things like cars driving us around everywhere, but also for AI helping with science and helping with many things that we really hope it will help with. In some sense, I believe this is the crucial step to a lot of the dreams we have for

AI to realize. So how can we make machines learn from less data? And I have an answer in my mind, which is not the complete answer, but I believe is a very important thing to look at, maybe the most important first-order thing: if you want to learn from less data, you need a computationally more powerful model. I will, in the course of this talk, tell you a little bit more about what this means. But I want to first say that, as we go through this journey, I will use this analogy that came from my friend and OpenAI co-founder Ilya Sutskever: there are powerful models, and there are models we can learn really well. So we have these two circles, and I would say maybe 15 years ago, maybe 20 years ago, but certainly 15 years ago, there were still some people who thought these circles are totally separate.

You can even prove some theorems saying if a model is very powerful, it can either memorize your data, but then it's not really learning anything. It's just memorizing it, and it will not do anything outside of that. Or it is super hard, like if you want to regularize it, it's super hard, borderline impossible to learn.

And so for many years, we believed that models we can really learn well and models that are computationally powerful don't really interact. And I had the luck to come to the field around then; I joined Google towards the end of 2013. And in 2014, Ilya had this paper, sequence-to-sequence learning, where there was this first intersection of these two things, between powerful and learnable models, which was deep recurrent neural networks, RNNs trained with deep learning. This was the first model where people thought, yes, this is a powerful model and a learnable model, and maybe it will solve everything. And sometimes I think there was even this feeling that, OK, maybe this is the last model we need. Maybe it's powerful, it's learnable, we'll train it, and it will solve it all. The way I think about it now, a decade later, is more that we need to keep rolling this forward.

We need to make even more powerful models learnable. And it doesn't happen kind of on its own. It's like a lot of effort. That's why I introduced this learnability kicker, who kind of kicks this learnability thing to make more and more models more learnable. I named him after myself for this talk.

But of course, I believe this is a huge effort by the whole community. A lot of amazing researchers have pushed this frontier to make powerful models learnable. They don't always talk about it in these terms. I will not have the time in this talk to acknowledge everyone, but I just want to say that it's been a great journey.

So as I said, this started with RNNs.

So let's get a little overview of how an RNN works. It takes words as input. Actually, it's not words, but tokens. So tokens are words, but some very complicated words get split into multiple parts.

But let's think of them as words for simplicity for this talk. And so it takes a word. Maybe it starts with a word called start, just to initialize it.

The words get embedded, which means it gets transformed into a vector of numbers. And this vector goes into what was called an RNN cell, which is a function that processes a state vector and this vector. So it takes two vectors, and it outputs two vectors.

It outputs the next state, which is the arrow on this picture. It will go to the next RNN cell. And the other vector will get transformed into probabilities of the next word. So for every next word, for every next token, you'll get a list of probabilities, like with what probability it should appear next.

So maybe the word A is chosen to appear next, and then this process repeats with the next word. Gets embedded, goes through the RNN cell, now the state gets updated, and the next word gets produced.

So in this way, the RNNs can produce text. They can also do translation; I mean, there were many papers about RNNs, but translation was a very big task at that time.

So here is an illustration of how an RNN can do translation. It first processes the words in one language, ignoring the outputs, and then it processes and outputs the words in the other language.
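To make the loop concrete, here is a minimal sketch in NumPy of the decoding procedure described above; the cell is a plain vanilla RNN with randomly initialized, illustrative weights, not anything from the talk itself:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, state_dim = 50, 16, 32

# Illustrative, randomly initialized parameters (a real model would learn these).
embeddings = rng.normal(size=(vocab_size, embed_dim))
W_in = rng.normal(size=(state_dim, embed_dim)) * 0.1
W_state = rng.normal(size=(state_dim, state_dim)) * 0.1
W_out = rng.normal(size=(vocab_size, state_dim)) * 0.1

def rnn_cell(state, token_vec):
    """Takes two vectors (state, embedded token) and returns two vectors:
    the next state and the logits that become next-token probabilities."""
    next_state = np.tanh(W_state @ state + W_in @ token_vec)
    logits = W_out @ next_state
    return next_state, logits

def generate(start_token, n_steps):
    state = np.zeros(state_dim)          # the single fixed-size state vector
    token = start_token
    output = []
    for _ in range(n_steps):
        next_state, logits = rnn_cell(state, embeddings[token])
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()             # softmax over the vocabulary
        token = int(rng.choice(vocab_size, p=probs))
        output.append(token)
        state = next_state               # all context lives in this one vector
    return output

print(generate(start_token=0, n_steps=10))
```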

So RNNs, as I said, were the first truly powerful models. Like, they can do tasks like translation. And they were learnable. But they had some limitations. So I want to just think in a bigger picture.

I want to introduce to you this Mr. Snail. So the snail helps me kind of think of how these models look in a bigger picture than these technical details they have.

An RNN processes the text word by word. So these little round things, think of them as words. It looks at them. And it is carrying the state vector. So it carries a fixed amount of numbers inside its tummy. It looks at the word, processes it a little bit. It changes something inside. And it spits out the next one. And it goes on. So it moves step by step, and in every step, it does a constant amount of computation, and through the whole process, it has a constant amount of space that it can use. This is the single state vector. That's the whole context it carries. There is no memory of what happened before. That's it.

So when I talked about computationally powerful models, you can be precise about what this means, and here this means there is a constant space that it operates in, and it does a constant amount per step, per word of input, so if there are n words of input, it will do exactly constant times n operations on them. So that's the computational power of RNNs.
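To put that in symbols (a rough accounting, with the hidden-state size written as $d$ and treated as a constant; the per-step cost assumes the usual matrix-vector products inside the cell):

```latex
% Cost of an RNN over an input of n tokens, with a fixed state size d:
\text{space} = O(d) = O(1) \ \text{in } n,
\qquad
\text{time} = \underbrace{n}_{\text{sequential steps}} \times \underbrace{O(d^2)}_{\text{cost per cell step}} = O(n) \ \text{in } n .
```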

If you have ever heard about computational classes, you will know that this is not a very powerful computational model. It's powerful enough to do some things, as we said, but it's not that powerful.

Also, I included parallelism here, which is very important for practical reasons. If something is not very parallel, like the RNNs, you will wait for a long time to do stuff with it, either training or inference. So RNNs are not very parallel. They have this state that needs to be updated every word, and you can't really speed it up very much. So these are problems.

But on the positive side, well, they can learn from arbitrary data. This learnability side is quite good for them. You can take arbitrary language and train an RNN on it. You start this training from a network of random weights. So you can initialize your deep learning network fairly randomly and do your training. Now, I say 'fairly', because it took some iterations to figure out how to train RNNs. You need to structure the cell right. So it wasn't that easy. It took a fair bit of work. But it can be done. And then these models, which are reasonably powerful, become reasonably learnable. But as you see, maybe we could do better. And that was the first motivation on my journey towards pushing this learnability more towards powerful models.

It's like, could we make a more powerful model that's also very learnable, maybe even better? And that model turns out to be the transformer. So what is a transformer? It's also a sequence-to-sequence model. So on some basic level, it works like an RNN. It takes things tokenized into tokens. It embeds the token or the word. And then it will output the probability of the next word. So in that regard, it works the same. But

there is no state that gets processed from one step to the other. What happens is every word in the future can look back and attend to everything that happened in the past. So this attention mechanism is the big new invention of the transformer. It doesn't have the state that made RNNs so sequential, and it also allows you to see everything in the past, all the activations of all the previous steps.

So here is an illustration of how this works. The arrows you see floating are the attentions. So maybe let's wait for it to start again. This is again translation. It starts with an English sentence and will produce a French version. The English sentence it will just encode, so everything is attending to everything, and it's creating a representation.

What's more interesting to us, and how the current GPTs work, is the decoder model, because here you start with the start symbol, and the next word attends to everything that happened before in the sentence that was produced. It cannot attend to the future, because it's not there yet, but it attends to everything in the past, as you see. So this attention mechanism turned out to work really, really well.

To take a step back and think about it in the big picture now: if I think about a snail, I think about one that packs every word it sees onto its shell. It puts it in and in and in, and then every step it grows by one word, and it can look back and attend to anything it has put in there. And only then does it output the next word.

So now, if it has read n words of input, the space it has available is exactly n times the size of the vectors it uses.

The computation time has grown to be n squared. And this is something we work with a lot to make this a bit faster. So it is a more computationally powerful model in this sense.

It's also nicely parallel, because since it doesn't have the state, at least in training, it can push all the n words at the same time, do this square attention matrix, and move on.
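Here is a minimal NumPy sketch of that attention computation: single-head, with a causal mask so each position can only attend to the past, and with random illustrative weights rather than a trained model. The n-by-n score matrix is exactly the quadratic cost mentioned above, and all n positions are processed in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                                  # n tokens in context, model width d

x = rng.normal(size=(n, d))                  # embedded input tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v          # queries, keys, values

scores = Q @ K.T / np.sqrt(d)                # the n-by-n attention matrix: O(n^2)
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                       # causal mask: no attending to the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the past positions

output = weights @ V                         # each token mixes everything before it
print(output.shape)                          # (n, d): one vector per position, all in parallel
```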

And this parallelism makes it very practical. And it's also very learnable; just like with RNNs, you can train it on arbitrary data. You initialize the network randomly and train it. It's actually, in some sense, even easier to train than RNNs because of the lack of this recurrence. So transformers are a really great architecture. And they are more powerful and learnable.

So you can ask, does it show in learning from less data? Does it really help us? And well, it does. You can verify it in a scientific way. In the transformer paper, there are the big results on translation. But there is also learning of parsing, which is a task that has much smaller training data. And the transformers outperform RNNs there by a lot.

But also, now we have transformers everywhere. They're the basis of GPT; it's the T in GPT. They're in ChatGPT. They're in image generation models. They're even in most self-driving cars, probably in some of your video models. They're almost everywhere now. You can almost see for yourself that they can learn from less data. You can see that they can learn from context very well. But also, yes, the ChatGPT transformers were trained on almost all of the internet. So you may say, oh, maybe it has seen all the data in the world.

But there are things that don't appear on the internet that much. Like, you can ask the transformer to do parsing, which was this low-data thing. And you can check that there is not that much data about it on the internet. But the model has really learned from this little data it saw in training. And it does a very good job. So transformers really learn from less data. And thanks to this, they have brought us a lot of beautiful things that we call GenAI, in some sense, at least the first generation of it.

Well, it's not just the transformers. One way of making models more powerful is to just make them bigger, more layers, right? More layers was the scaling paradigm that we had before. The problem with more layers is, as you add more layers, you can see how the test loss depends on your number of layers, and it actually goes in a very straight line. And the people who made scaling laws at first had the hope that, well, you know, maybe as the models have more layers, they will actually need less training data.

Even the first scaling law papers had a plot that said, yes, this may be true, but luckily it was corrected later, and it turns out it's maybe a tiny bit less, but not really. So as you scale the number of layers in your models, you need to train with proportionally more data, usually proportionally to the number of parameters, maybe a tiny bit less, but not by much.

So the scaling of the layers itself doesn't allow us to learn from much less data; maybe a little bit.
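For reference, the commonly cited form of these scaling laws in the published literature (an illustrative shape, not the exact curves on the slides) is a power law in the parameter count $N$, with the compute-optimal amount of training data $D$ growing roughly in proportion to $N$:

```latex
% Test loss as a power law in parameter count N (a straight line on log-log axes),
% and the data needed to train a model of that size compute-optimally:
L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N},
\qquad
D_{\text{opt}} \propto N^{\beta} \ \text{with} \ \beta \lesssim 1 .
```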

But, well, you may think, maybe somehow it still helps you.

Well, one day we did a check on something called GSM8K. It's math problems from sixth grade. They're like, you know, Tom has eight balls. He sold three balls. How many balls does he have left? Kind of this level of difficulty.

And that was around the GPT-3.5 time, between GPT-3 and 3.5, I think.

And you could see in this plot that the models reached about 35%. But also, we had a scaling curve.

And it seemed that to reach 60%, we would need a model that has hundreds of trillions of parameters. And there is not enough data on the internet to train it. So it seemed like this scaling, like, it works. It goes up. But it would be very hard and extremely expensive to push this to 90%. So we thought, well, OK, but is there a way to make the models even more powerful without just doing more layers? And that's how we got to reasoning models.

So what are Reasoning Models? Reasoning Models are deceptively simple because they're the same models. It's usually also a transformer. Just before it generates the answer for you, it's allowed to generate some tokens for itself, which we call the chain of thought. That's why we call them thinking or reasoning models.

So if you ask, like, a GSM8K question, Roger has five balls and buys three more. Well, then it may think, like, he started with five, then bought three.

It will do this arithmetic before, and then it will give you the number, and it helps it enormously, it turns out. Now, there are two things to realize about reasoning models.

First of all, because you have this object, this chain of thought, that's there before the answer, the process of generating it involves sampling discrete things like words. It's not fully differentiable, so you can't train it with the methods we used before. You can't just do gradient descent like with normal deep learning and train that.

What it turns out you can do is use deep reinforcement learning, and that allows us to train these chains of thought to give good answers. So the way they are trained currently is you have some data that tells you this is a correct answer, or maybe you have a checker that checks whether the answer is correct.

And the model works, works, works. And you'll say, well, if it gives the correct answer, that's good. That's reinforcement.
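Schematically, the training signal described here can be sketched as follows; `model.sample_chain_of_thought_and_answer` and `model.reinforce` are hypothetical placeholders for a real sampler and a real policy-gradient update, and the 0/1 reward from a verifier is the essential idea:

```python
# A schematic sketch of reinforcement learning on verifiable answers.
# `model` and the `problems` objects are hypothetical stand-ins, not a real API.

def verify(answer: str, reference: str) -> float:
    """Verifier for problems with a checkable answer: reward 1.0 if correct, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def train_step(model, problems):
    rollouts = []
    for problem in problems:
        # The model first generates tokens for itself (the chain of thought), then an
        # answer; sampling discrete tokens is why plain gradient descent on the final
        # answer alone is not enough.
        chain_of_thought, answer = model.sample_chain_of_thought_and_answer(problem.question)
        reward = verify(answer, problem.reference_answer)
        rollouts.append((problem.question, chain_of_thought, answer, reward))
    # Reinforce the chains of thought that led to correct answers.
    model.reinforce(rollouts)
```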

You reinforce the correct things, which is great, but as we'll come to, it adds some limitations. The other thing is, as this model is thinking, we call it the chain of thought. I'm saying it's generating tokens for itself, but it can do one beautiful thing. It can also call tools that are external to the model, because now we are not bound to differentiability.

So for example, it can do a search on the web. And I think this is a very good example to think of. If you ask, when does the zoo in San Francisco open today? A pure transformer model, like ChatGPT 3.5, or even GPT-4, which you'd get a year ago, would probably try to tell you something, but it would have to have memorized in its weights the opening times of the zoos all over the world. Well, it probably has

seen the website of the zoo in its training data, so maybe it memorized the time. But, you know, the time may have changed since then. And also, do you really want to memorize the opening times of all the zoos in the world? That's not how we learn; that doesn't seem good for learning from little data. So now the reasoning model can do something different. It doesn't need to know the opening time. It just knows: oh wait, this looks like something where I should just go and query Google or Bing or any search engine, and then it will give me back a website, I will put it into my context, read it, and extract the opening time from it. And look, this is a beautiful strategy that allows you to work from way less data, because instead of learning the opening time of every zoo in the world, you can now just learn to do this from a few examples. You maybe practice, you know, what happens if the website is in another language.

So the model understands many languages, so maybe it doesn't even need to translate. Or maybe there is no website and you need to search twice. So it learns a little bit about what to do in the edge cases. But it learns one strategy, and then it can answer the question for any zoo in the world. And it will actually give you the accurate information right now. So that's how reasoning models enable tool use, which leads to agents these days and a lot of applications like this.
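The zoo example amounts to a simple loop around the model: let it emit either a tool call or a final answer, run the tool, append the result to the context, and continue. A schematic sketch, with `model.next_step` and `web_search` as hypothetical placeholders rather than any real API:

```python
# Schematic tool-use loop for a reasoning model. `model.next_step` and `web_search`
# are hypothetical placeholders, not a real OpenAI or search-engine API.

def answer_with_tools(model, question: str, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = model.next_step(context)           # emits reasoning, a tool call, or an answer
        if step.kind == "tool_call":              # e.g. "search: San Francisco Zoo opening hours"
            result = web_search(step.query)       # fetch the page instead of relying on memorized facts
            context.append({"role": "tool", "content": result})
        elif step.kind == "final_answer":
            return step.text                      # e.g. the opening time read off the fetched page
        else:
            context.append({"role": "assistant", "content": step.text})  # keep thinking
    return "No answer within the step budget."
```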

So reasoning models are very powerful because they can generate tokens and tool calls before they start giving you the answer. And these examples here may seem very small, but these days, reasoning models like GPT-5, they can be thinking for a long time.

So I want to show you an example of a reasoning model thinking about an IMO problem. IMO is the International Mathematical

Olympiad, one of the most established and hardest math competitions in the world for high school students. And the problems are usually like this: Alice and Bazza are playing a game. And the model was thinking for 23 minutes about this problem. And on the right, you can see, you know, all the things it was going through. And this is just a summary. So everything you see here, it was actually thinking a lot about, probably many paragraphs.

And then after this thinking, it generates the answer you see on the left, which is the correct answer to the problem. So I want to show you this just so you understand that this thinking is not necessarily just the small thing of one sentence or two sentences. It can go on and on and on, many paragraphs. And it can also take a while, like many minutes.

So let's take a step back again and, with our snail analogy, think about reasoners. So RNNs had just one state that they carried and couldn't see any context. Transformers could see everything that came before, but it had to be written down. It had to come either from your input or it was generated as an answer for you to read.

Now reasoners are like a snail that can stop and, just inside its shell, start generating a lot of tokens and grow and grow and grow, spend a lot of computation on that, and only then start telling you the answer.

So in some sense, this is amazing. It can have as much space to do its computations as it wishes, because the space it uses is the length of the chain of thought, and it can decide on its own how big it is. If it needs to think more, it can think more. The computation time is even the square of that, so that can be a lot. But that's great, right? It is almost unlimited in computational power in this sense.

The problem is it's limited in practice by certain things. And unfortunately, because in generating tokens by transformers you go step by step, these models are not very parallel. They generate one token, then the next, then the next. So parallelism, we're back to 1x.
And if you think about how learnable they are, we said you train them with reinforcement learning. So that in itself may be not a problem, but to make reinforcement learning work, you need data that has verifiable answers.
Like in mathematics, this is the answer. It's 15, it's seven. Or maybe it's a function, N squared or something like this.
You can have a sentence that's an answer and you can ask the verifier to say, well, is that answer that model's giving more or less the same?
But it needs to have this notion of correctness and not everything.

in life is correct or not. We'll talk about this in a moment. So it has this limitation, and it has the limitation that all of these tokens, they're put in the context in the transformer. And the transformer context, it's practically very hard to grow it. It costs a lot in the memory and computation of the GPUs. But other than that, that's great, right? We managed to make models that are learnable. Well, they're not learnable from scratch like the previous models.

So to train a reasoning model, you need to start from a pre-trained model that's been trained on a lot of language that has already some idea how to reason.

So this scaling of the number of layers, it turns out, was super useful to have what we call the prior, the models we start from, the pre-trained models like GPT-4 that we used to start this reasoning process. But after that, you can train it with reinforcement learning, with deep RL, on verifiable data.

So can they learn from less data, and can you really see it in the results, just like you saw with transformers that this was much better? And I think the answer is absolutely yes with reasoners too.

If you look at math, which is the thing they were often trained for, if you took GPT-4, so here it is for 4o, but the results for GPT-4 were very similar, it's about 13%.

So AIME is a competition where there's a bunch of easier tasks and then they get harder and harder and harder. So doing 13% is fine, but it's not great. These are the easier tasks.

Back then, and we released the first o1 model about a year ago, it feels like a much longer time has passed, but it was only a year ago when we released it. We worked on it a few years before that. It felt like getting to 80% on AIME would be very, very hard.

But no, already our first reasoning models: o1-preview was at 56, o1 was already at 83. Currently the models are at ninety-something percent. So they basically solve every question in this. And this is already a fairly hard math exam at the high school level. And you can see also that the more they think, the longer this chain of thought, the more computation they get, the better they can solve these things.

And we train them, you know, we train them to learn mathematics. They're trained on math and coding problems. There are not that few of them, but it's not like the whole internet. It's not that much data. And they were trained on the whole internet before and only got 13%. And now you just do this reinforcement learning with reasoning and suddenly they get to 80, 90. So they start thinking and solving problems, trained on just a little data, because

they're much better models. They're much more powerful and we have managed to make them learnable. So are we there? Can we learn arbitrarily powerful models?

Well, as we discussed, there were some limitations that reasoners still have. I highlighted that in green here. They were not very parallel. So we would really want to make them as parallel as the number of GPUs we have. We don't fully know how to do that yet.

We start from a prior; that's not necessarily bad. We now have very nice prior models, but maybe it would be nicer if we could start from more arbitrary models. It would be nice if we didn't have to do this whole test-time thing inside the context. We could somehow expand this.

But I think the key issue that is still holding us back is the data thing. They can only learn on verifiable data currently. And what we really want is to train them on arbitrary data. Why? Well, if you train on verifiable data like math and code, we can already see that the reasoners do amazing things. So here is a recent example from my colleague Sebastian. He gave a reasoning model a paper.

And so this paper had researchers, mathematicians, working on it for quite some time. And they had a theorem. And he just asked, can you improve upon this? So, you know, that's a bit preposterous. These people put a lot of work into it. Do you think you can just improve it? But it turned out, yes. It thought for 17 minutes. And it said, well, yeah, look. This is how you can improve this bound.

So that's quite amazing. It had to read the paper, use all the math knowledge it has been trained on, but also use the knowledge from that paper, this little data, and it gave an improved bound. And this is starting to happen in many fields of science and technology: scientists just use these reasoning models, let them reason for a while, and they can improve little things like that.

Now, what we would like them to improve is big things, right? To go beyond this math and coding thing and work on arbitrary data. And I think that's our last main obstacle for powerful, learnable models. This is what many people are working on.

And just to illustrate it, we have great models on math, we have great models on code, but what if ChatGPT goes to a bar? Do you think it could go to a bar and hold a conversation? I want you to think about it: when people go to a bar, we don't think we do a lot of research, not in the math sense, right? But it's actually extremely challenging to be in the physical world in social groups, because it requires a lot of planning.

Maybe there's a friend I know for a long time, so I have some memory and knowledge of preferences and what these people like and what they talk about. And there may be someone I just met an hour ago and they said a bunch of things and I need to process that and do something that's appropriate for everyone.

So, you know, can GPT go on stage and tell a joke that's good for this specific audience in this specific circumstance, that's actually really funny and not just copied from somewhere else? Can we reason about not just math and code, but also physical things, also multimodal data?

I think this is the final frontier, and this will also help us to do science, because it will allow us to reason about whole fields, not just about the technical parts of a lemma, but the bigger picture that scientists often work with. There is a lot of technical work, but at the big level, you sometimes just go by your intuition and some metaphors with the world that connect things for you at much more abstract levels, right?

I drew you snails here on the slides. There are no snails in our data centers, hopefully, but that's how we think. And we still don't fully know how, but we hope to get the models to think in this way as well. Thank you very much.

And now we'll go to the Q&A. I think Natalie will come back.

Thank you so much, Łukasz. I wish I had been at university and taken your class, because you really broke that down in a way that I think even the non-technical people in our audience can understand. It's really awesome to hear from you. We are going to jump into questions now. We're not going to be able to get to all of them, but we have some awesome ones that we will have time for.

So, Andrew Holtz asks, Łukasz, what inspires your research and what do you do when you're stuck and need a fresh perspective?

So, I have been a fan of AI since a very early age, I think. I think first I wanted to build a brain with computers, you know, and then I got more into computer science. And so, you know, the field of AI, if anything, seems to have been getting faster and better all the time. So, yeah, when I was starting, it was harder to find; like, you know, a PhD in neural networks was not very popular back then, but then they came about. So I did my PhD in logic and these kinds of approaches, which I love for other reasons, but it seems machine learning actually works this way. So I never had problems with feeling inspired, and everything that's happening in AI these days gets me inspired again every day.

Yeah, when we get stuck, I think it's often good to take a step back, go for a walk. In this high pressure of AI now, where it's going so fast, it's sometimes hard to just take a break, and I think that's very good. You know, if you try to follow everything that's happening all the time, there may be very little time to let yourself take a step back, but it's very important, at least for human creativity. We don't know how it works for machines yet.

Thank you, Łukasz. And I remember my first week at OpenAI, I was on Wojciech's team, and he's a fan of the fast walk through the city: let's take a walk, let's walk and talk about this. And it's a fast walk, it's a very fast walk. Yes. Okay.

A distinguished professor from Northeastern University, Eduardo Sontag, asks, when will ChatGPT have an interface to Lean, capital L-E-A-N? Oh, to Lean, the theorem prover. Oh, well, so ChatGPT can run Python, and as such it should be able to install Lean and run it.

I'm not sure how well the installation works right now. And it's probably not trained with any, or very much, Lean data. So it probably would not be very good at that. But there are a lot of fans of Lean at OpenAI. So I would certainly expect the coming versions to get quite good in this regard. But I'm not that deeply into the release processes. But as things move in AI, I would certainly hope that in a few months it will get there.

Oh, wow. OK, awesome. John Olafinwa, a research scientist at Microsoft, asks, what's your thought on learning parallel reasoning end-to-end rather than sequential-only reasoning? Yes, so as you saw, there were just a few points of the reasoners that are still missing, which I feel are research challenges.

I think that learning from arbitrary data is the key, but making them more parallel is certainly the other one. I am a big believer in learning end to end. So if it's parallelized, it needs to be part of the learning process in some way. What way exactly will work best, of course, I don't think we know yet, but some ways are quite obvious and intuitive, and you need to try them well. It takes a while to try things in deep learning. You shouldn't give up too easily, but then maybe there are still ways we haven't fully thought through that may turn out to work really well.

Awesome, thank you, Łukasz. Our next question is from a student at Cornell University, Yijia Dai. They ask, do you think the current recipe, pre-training a transformer on a large data set (parallel to evolution) and reinforcement learning on specific tasks, can lead to where we want? I think this depends on how general you consider this recipe. In the big generality, maybe yes. But there are a lot of detailed questions. When does the pre-training end? And, as I repeated a number of times, I feel like doing this RL part but on arbitrary data is a key challenge that we need to solve.

If you do it on arbitrary data, you need to ask the question, which data do you do that on, and which data do you pre-train on, and how do you shift? It becomes more of a slider then. And where to really put it is very hard to answer until we have some real experiments.

And also, it would be great if we could start the RL from priors that are a little weaker than the ones we have now, and it will kind of catch up on its own. Currently, yeah, we still do a lot of pre-training, and then a lot of RL. I do think it will become more of a slider that we will slide and investigate in the future. But the key is, you know, currently RL can only train on verifiable data. So you cannot slide it that much into pre-training, because a lot of pre-training is, you know, ChatGPT went to the bar, told this joke. There's not much verifiable about it. Was it really funny? Was it not? Should it have a word in red color? We don't know. And it's still quite amazing how pre-training teaches a model that can talk to you from all of that data.

If you ever actually read that data, like take the training data and take a random sample, not the internet as you think of it, because you probably read the better parts of the internet, ever do this kind of investigation of a large language model's training data.

At least for me, it always made me very impressed by how they come out so much better than what the data seems to be to me. Awesome, thank you, Łukasz.

We have Da Zhang, a post-doctoral fellow at the University of California, San Francisco asks, this is a big one, I don't know if you have time to actually explain this, but we'll ask anyways. Could you explain the relationship between semantic ontology and transformer models in the context of reasoning?

So, yeah, I'm not sure if we can manage that; it looks like a good question. I do think that transformers in pre-training build some form of semantic ontology in themselves, and that reasoning with RL kind of learns to use it better, but that's just a very general answer. I think we may leave it at that for today.

Okay, that makes sense.

Svetlana Romanova, Python machine learning engineer and aspiring AI researcher, asks, well, first, she says, thank you, Dr. Kaiser, for this talk. She's exploring geometry and flow stability as foundations of adaptive systems. Could learnability be seen as a form of field coherence where structure itself enables efficient information transfer?

So, yes, first a little context: I had these computationally powerful models and the learnability. And being computationally powerful is very grounded in math; we can be very precise about what this means. Learnability, I had a table, but it's far less precise. And we don't have a truly great theory yet, I feel, of what it means to be really learnable and well-learnable. I mean, there are a number of notions, but, as I said, usually you can just prove nothing is easily learnable, and yet we do learn models. So the theory of deep learning is only being developed, and inspirations from physics and field theory have led to, I think, some of the best things we have right now, but I don't feel like it's fully formed. So I think it's a very promising line of research: why are these models more learnable? We've kind of made them by trying a lot of tricks, and they emerged well, and that's how we know they're good, because we tried. But, you know, it would be great not to build planes just by trying to flap things and send things in the air, but to have a theory like aerodynamics.

And yeah, I do believe that approaches from physics will contribute greatly to it.

One last one. And Łukasz, maybe you can tell that this community is highly interdisciplinary.

And I'm not sure if you know this, Łukasz, but when we first launched this community two and a half years ago, it was in the research org, on the human data team. And the very first 2,000 members of this community contributed to post-training. So some of the members in this community actually contributed to what ended up being our reasoning model. Just wanted to share that with you to give some insight into who's actually asking these questions.

Okay, last but not least, we have Mahesh Lambe, CEO and founder of Unified Dynamics. They ask, what are the biggest architectural or algorithmic bottlenecks you are currently seeing in scaling reasoning or meta-reasoning models? And how do you foresee overcoming them, both in theory and deployment, in the next two to five years? So, as I said, I think the biggest problem we face is doing reasoning on arbitrary data, also hopefully in a more parallel way, and maybe solving the context problem. So these are, I think, what a lot of people identify as the biggest problems.

Now I'm not sure the solution to them lies in like the architecture, like understood as just the architecture of the transformer or the RNN. It feels like in the reasoning world, it matters how you learn these reasoning chains, which detaches a little bit from the architecture of the model and is more in the architecture of the whole system.

And how is your learning set up? What is your loss? What is your reinforcement? How does it all combine? How do you push credit around, like the reinforcement learning questions? I think, you know, there are a lot of good ideas for how you can learn things. I don't think we know which one is the best yet. Maybe there are a number of them that are very good.

I feel like this is exciting work, and because you do the reasoning models starting from a prior, you don't need to retrain that. It's also much more accessible, I feel, to academic labs, now that there are open-source reasoning models from OpenAI and others.

And there are also frameworks that allow you at least to replicate some of this verifiable learning. So I have a great feeling for academia. It was very hard to do large-scale pre-training in academic settings.

I feel like in this new world of reasoning, which is still very early, it's only a year ago that we published that first post, there is still a lot to be discovered that we may see from many places, not just OpenAI.

Well, Łukasz, thank you so much for joining us and I'm just honestly so grateful and honored that you accepted our invitation. And because we are still learning, I hope to have you back soon.

Thank you very much for the invitation. I'm honored to be here. Thank you, everyone, for the questions. It was so wonderful. And I will be knocking back on your Slack door, Łukasz, in several months.

Thank you very much, Natalie. Come back and give us an update. I hope you have a wonderful rest of your week, Łukasz, and we will see you again soon in the forum.

Wow. Honestly, that was amazing, even for me. Just to share a little story with you guys. When I first got into AI, I was the head of community at Scale AI. And as I was surveying the landscape for content, one of our product marketing managers said, you need to get one of the authors of Attention Is All You Need to present. And here we are, four years later, I finally got to host one of those authors. I'm now working at OpenAI and hosted Łukasz in the forum, and that was a really huge honor.

So for October, our last two events in the pipeline are in person, and you are all welcome to request an invitation. Our OpenAI forum event, October 20th in San Francisco for higher education folks, faculty, leadership, post-doctoral fellows, still has a few seats remaining. We are almost at capacity, but if you're interested in joining us for that, you should definitely request an invite. I'm reviewing those every single day.

October 27th, we'll be in Washington, D.C. at the Library of Congress with our friends and collaborators at Pomona, bringing the Thomas Jefferson Library's collections.

That will be really exciting, and then we're also hosting a panel about libraries, archives, museums, and the collective AI4LAM, including the National Art Gallery in D.C. and the Smithsonian. We're going to learn more about how the AI4LAM community is integrating AI into their work streams to make their collections more engaging and more discoverable for the public and also for scholars.

We have plenty of seats reserved for OpenAI Forum members for that one, so please request an invite, and I hope to see you there.

And until then, thank you all for joining us for our lunchtime talk with Łukasz. Have a great day, everyone.
