Sign in or Join the community to continue

Event Replay: Terence Tao and Mark Chen on AI and Mathematical Discovery

Posted Mar 04, 2026 | Views 980

# UCLA IPAM

# AI Mathematics

# AI Research

# OpenAI Leadership

Share

Speakers

Terence Tao

Professor of Mathematics @ UCLA

Terence Tao is a professor of Mathematics at UCLA; his areas of research include harmonic analysis, PDE, combinatorics, and number theory. He has received a number of awards, including the Fields Medal in 2006. Since 2021, Tao also serves on the President's Council of Advisors on Science and Technology.

+ Read More

James Donovan

Head of Cognitive Outcomes Research @ OpenAI

Mark Chen

Chief Research Officer @ OpenAI

Mark Chen is the Chief Research Officer at OpenAI, where he oversees advanced AI initiatives, driving innovation in language models, reinforcement learning, multimodal models, and AI alignment. Since joining in 2018, he has played a key role in shaping the organization's most ambitious projects. Mark is dedicated to ensuring AI developments benefit society while maintaining a focus on responsible research.

+ Read More

SUMMARY

This conversation took place at IPAM, UCLA on March 4, 2026, between Terence Tao and Mark Chen, Chief Research Officer at OpenAI, facilitated by James Donovan at OpenAI. The discussion focused on how AI is changing mathematical research, from helping with literature search, code generation, and proof attempts to enabling new collaborative workflows. Tao and Chen explored how AI can help solve long-tail mathematical problems, especially when verification is strong, while noting that frontier research still depends on human judgment, creativity, and careful goal-setting. They also discussed the implications for education, attribution, scientific collaboration, and the risks of over-relying on AI without understanding or validating its outputs. Looking ahead, they predicted that AI could help create new challenge-based research ecosystems and accelerate progress across mathematics, physics, biology, and other scientific fields.

+ Read More

TRANSCRIPT

[00:00:00] James: Before we get going, just a massive thank you obviously to the Institute for hosting us today. Beautiful space. And also for all of you for turning up. I know you want to hear me talk so I won't talk for too much longer. So really just to say also a massive thank you to you both for coming. It's rare that you get two such great minds in the same place. So we really appreciate the time going into this. It's our third time. Yeah, a little pattern is starting to build up. And maybe actually that's a good place to start that conversation. You guys had a conversation almost a year ago to the day. And at the time, Terry, I think your prognosis for where GPT was for mathematics was something like a very ineffective grad student. Which remains with me because I've had that feedback myself as a human being. I don't want to go into it.

[00:01:00] James: Why don't we start with how you think things have changed since then and then, Mark, your side of the story?

[00:01:08] Terry: OK. Yeah. So a lot has happened in the last year. OK. That's just... Not just in...

[00:01:14] Terry: Yeah. So, yeah, these tools have definitely become a lot more powerful. I think there are now capabilities that basically are now normalized and, like, we just use them all the time. So deep research tools. Literature search has become really, really good. It has surpassed traditional searches. Code generation, of course, is the big thing. So as a pure mathematician, I'm not as heavy a user of code, but it has changed the way I approach a math problem. I will plot something. If there's an inquiry I think is true, I will ask an A.I. to try to prove it. So I already use it. If there's a lemma that I think I know how to prove, but I can't be bothered doing the pen and paper calculations, I will just outsource it. I've not yet found it to be useful at the deepest level of when I'm trying to solve a problem, with a pen and paper, or with a colleague. I can't sort of interact with it on a conversational level quite as a level I need yet, but maybe in the future.

[00:02:28] Terry: But I think also socially, I think we're beginning to, the mathematician community as a whole is beginning to understand that this resource is here to stay, and we have to actually start adapting how we do our research. So certain things that were very tedious, and maybe we would force our graduate students to do, we can offload to AI. And this opens up lots of new ways to do mathematics, lots of research projects, especially at scale that we just could not dream of doing. So while I think we can use AI to assist our current workflows, it's a little bit awkward still to do that. But I think there will be much more milestones in creating new workflows, which are optimized for AI. It's like when we invented the automobile. We started changing the way we build cities. And of course, you could say that maybe not all the changes were good. But yeah, we're sort of in this intermediate stage where somehow roads are still built for people on horses, and we now have automobiles. Would it be fair to say that we've got to the point where occasionally helpful, collaborating. But maybe more interesting that all the bigger open spaces how you change the way you do math with these tools coming.

[00:03:43] Mark: Mark, would that be true to what you're seeing and what you're building for?

[00:03:46] Mark: Yeah, honestly, I don't blame Terry for saying it's an ineffective grad student a year ago. I think that's largely the state that we were in back then. And I really do think of the backdrop of AI progress as hill climbing this what we call of the models doing autonomous work for longer and longer periods of time. And I think last year, we were in the category of minutes. And you saw that right, it would just the model would hallucinate, it would kind of fall over when you gave it significant chunks of work. But I do think the last year has been a transition for a lot of us in that we've seen the mistakes go down. And therefore, you can trust the model to do longer periods of work in general. And that's, you know, really kind of allowed us to do away with a lot of the scaffolding that we might have needed to use before, and really start to attack, you know, bigger problems and truly orchestrate with the model.

[00:04:43] Mark: And I just think of a year ago, we were in the world where we were kind of roughly achieving a bronze medal at the IMO. I think this summer, you know, across all kind of high school mathematics and programming competitions, we are achieving gold medal performance and...

[00:04:58] James: Gold medal performance. And I think we've just kind of run out of these human-written benchmarks. And that's why you do see people evolving to this fear of doing mathematical research. Fundamentally, that's always been the goal. We don't find any pride at OpenAI just kind of solving, you know, IMO problems or anything like that. The real ambition is to push the frontier of science. And finally, the task horizon has caught up to a point where we are actually able to go do that work. And again, it's not there yet. I think the trend and trajectory is strong. But yeah, I do think, you know, it's true that a lot of people are finding utility in it today. I mean, I'd like to come to maybe first proof in that transition as we go into more frontier mathematics. But maybe to stay with the capabilities right now, I think often the Erdos problems are seen as a way of getting a litmus test for where the models are. And that maybe is a representative setting that some of those problems are maybe not as complex as other parts of those problems and were designed that way. You might say that the success of the models has been in doing lots of the easier ones quickly, versus necessarily moving towards the kind of acorn level problems. Is that a fair depiction, in your mind, Terry, of where they are today?

[00:06:10] Terry: Yeah. So I've been heavily involved in tracking the progress on the Erdos problems in particular. So, I mean, yeah, and it is basically largely true, what Reggie said. These problems range widely in difficulty. There are some that we desperately want to have solved, and they've been worked at for decades. You know, I have papers that are making tiny progress on some of these problems. And to date, AI has not really helped with the ones that have been that we've already poured a lot of attention to. But there was this very long tail of problems. It was Erdos posed a thousand problems. They weren't all winners. But he understood that the important thing was to stimulate discussion and interest. He kind of knew that the problems that were going to be important would have to take on a life of their own. But there was this long tail, unexplored problems, where there's maybe almost no follow-up literature. And that's where the AI tools have really made a lot of spectacular progress. Maybe 20, 30 of these problems have been solved with fairly minimal human supervision by the AI tools. We were able to verify them too, often with some other AI tools and form verification. We kind of worked out a kind of a workflow for doing this without being overwhelmed by AI slot incorrect solutions, or whatever. So yeah, it is a new capability that we hadn't had before, because we can now attack attention bottleneck problems.

[00:08:07] Terry: And so I think what this suggests to me is that we need to start creating more and more broad challenge sets of problems for AI tools also, but the general public. So actually, in the same period, many of these problems were also solved by amateur mathematicians, sometimes with AI tools, sometimes without. The same kind of workflows that enable AI to be successful also actually enable amateur mathematicians to be successful as well. So I foresee a change in our culture, where instead of only working on a small number of really hard problems, and not sharing a longer list of other things we'd care about, we'll all start, as mathematicians, releasing problems of things that we want to get answers to. A hundred problems, and maybe this AI can solve 10% of them, and maybe this other high school student can solve another 5%, but we can get a much more community-driven way of doing mathematics. I think this is what the Erdos problems are, and they know the harbinger of.

[00:09:03] James: Yeah, and it's maybe interesting to contextualize that in other domains of science, at least in my own world of biology, in which the number of people collaborating on a given paper is just exponentially rising over time, and that seems to be the trajectory that science is much more of a team sport. Maths is, maybe, and to some degree, theoretical physics, the outlier in that domain. When you're thinking about this, Mark, is it always just a question of, how smart can we make the models and the ever more difficult questions they can answer, or is it also a question of, how do you empower humans to work collaboratively on these problems?

[00:09:32] Mark: Yeah, I mean, right now, we really do see heavy engagement with the community. That is a necessary part of driving progress in all these scientific fields. Kevin here, he runs our OpenAI for Science program, and part of this is that, kind of like you said, these experiments first prove for their issues problems. It really is an engagement with the community on figuring out what problems are actually important to tackle.

[00:09:56] James: To tackle, we've done this kind of exercise in physics as well. We've brought in an expert physicist to kind of lay out a program of here are the really important things that feel like they're amenable to AI. And that also helps us shape the AI in turn. It allows us to kind of find the deficiencies, where we can look at where our models fall over and really shore those things up. What we hope to build is this platform where scientists around the world can just accelerate themselves. We want to empower that community mathematician. We see people like that today, empowered. You have these 20 year old, 21 year old kids using the models to solve some of these urgent problems. It may not be these sophisticated and very significant leaps, but they're able to do a lot of self-directed work.

I kind of had this thought when you were asking the question of, I know Terry, you've organized a lot of big community initiatives in math before. I don't know how you think AI is changing that world or does it enter that world in a significant way?

[00:10:57] Terry: I think it combines very well, actually. So I think what AI will enable is finally a way to use division of labor, which is something that, like all industrial revolutions, like every industry has managed to become more efficient through division of labor, except mathematics. To do mathematics traditionally, there's several different tasks: there's problem generation, there's strategy generation, and then strategy selection among all the strategies generated, and then execution of a strategy, verification of a strategy, and communication of the results. We basically need, we've trained our mathematicians to be somewhat good at each of these tasks. So we specialize in the field, but we have to have some idea of where problems come from, what are good problems, what are good strategies. We have some technical skill, we have to verify, and we have to explain. Some mathematicians are better at some of these than others. We have been able to benefit from collaboration because of this, but we can't really specialize the same way that in the sciences you can have technical staff, and you can have people who are project managers, and things like that.

But now with AI and other modern collaboration tools, and formal verification, it has become possible to run math projects where individual participants specialize in just one of these areas. Maybe there's some gaps among your collaboration; no one knows how to do the technical things, but AI can plug in some of the gaps in a collaboration. You still need the humans because the AI performance is very jagged. Maybe some of these inputs can now be automated. But if you automate that too much, for example, if you can automate strategy generation, but you can't automate verification, then you just get hundreds and hundreds of possible strategies, but AI can't deal with that. If the verification also keeps pace, then suddenly you have a new style of doing mathematics that is extremely effective.

[00:13:11] James: Yeah, just one quick comment on that too. I actually absolutely agree that AI capabilities are super jagged today, and so you see this really fruitful collaboration with humans. It's also interesting to explore the flip side of that, which is that some of these AI systems are more human-like than you imagine. You have to pump a lot of RL in the right way to not have the problems, have the models give up in the same way a human would. If you give a too hard problem, oftentimes the model can just run a couple of tester prompts in its own train of thought and be like, ah, this problem's too hard, I don't think I can actually do it. Let me pretend to the user I tried really hard. It's the same with the ODISH problems; you get an AI to try this ODISH problem, and the first thing it will do is go to the ODISH problem website, look it up, and say, it's an open problem, it's too hard, I'm not gonna try.

Say, do not use the internet, try this ODISH problem yourself. It's actually pretty easy, I swear. It's good to know that frontier research is actually just about coaxing the models into behaving the way you want them to. That vision right now is probably quite a compelling vision for this room and beyond, where we're sort of saying the technology fundamentally empowers more people to collaborate on these problems. But is this just a stepping stone to a world, Terry, in which you're only collaborating with many AI agents, and slowly but surely they come to dominate the sphere?

[00:14:35] Terry: I think yes and no. I think the type of math that we do today might slowly kind of move in that direction. But there could be very new types of doing math that we can't even envisage right now, which I think, so math is infinite and difficult.

[00:14:54] James: And the, the problems in math, the difficulty levels are unbounded. There are even problems in math that are unsolvable. We know they're unsolvable. Well, there's an asterisk, but I don't want to talk about it. Okay. So, there are certain things that even with the most powerful AI, even right now, there's certain cryptographic challenges that AI cannot mine all the Bitcoin right now, or whatever. So, I think there will always be a frontier, and I'm appreciative just because how complementary human and AI, at least current generation LLMs are, with human skills, that the best combination is always going to be a complex combination of humans. But the nature of the combination may change over time. So, let's assume even just philosophically there is this frontier beyond which, at least the current paradigm of AI wouldn't be able to cross and some central like human-AI collaboration is needed.

[00:15:56] Terry: Getting to that frontier, in your mind, Mark, is that a question of much smarter RL training or is it actually just a question of raw computation? If I could give you an infinite amount of compute today, would you be able to accelerate your way to that frontier?

[00:16:11] James: Yeah, I mean, when I think about the OpenAI research program overall, it is really fundamentally about how do we improve the algorithm such that they scale to the level of compute we have next year and the year after. So it's grounded in the reality of what compute we actually have. And I think all of the algorithms, we know they are simple and they scale, but they take a lot of engineering and fine tuning to make sure that they truly scale to the next order of magnitude and the order of magnitude beyond. One really great thing is this is a very multidimensional problem today. There are many axes by which we can scale model intelligence. We can scale up the model, build these bigger brains with just more core knowledge. And this kind of captures the intuition that just like maybe the more math you know broadly, like just internalize deeply, like it's easier to make these connections and these jumps. There's also a reasoning access which we scale. And this is the ability to take all of that base knowledge and chain it together to create new insights. And we have a couple people in this room kind of working on this thing we talked about in the GPT-5 livestream, which is kind of connecting this a little bit and having the models just generate new knowledge for themselves and really kind of amplify its knowledge in certain domains. So I think there are a lot of different axes that are gonna play into bringing the models to the next frontier. Yeah, but overall, like all of these things are grounded in this meter plot. Like we are aggressively hill climbing towards more and more autonomous, longer horizon tasks. And we see that trend continuing. The term hill climbing, if I've learned anything working with the research team is that we always must find and define a hill to climb. And perhaps that's where these two worlds come together as defining the right hill.

[00:18:00] Terry: So maybe we focus on that for next segment. First proof would seem like an obvious example of where we've tried to co-define a hill to climb. In your mind, Terry, is that representative of what you're thinking could be the new emergent mass to come, or is that the final form of this more classical mass that AI has been working on?

[00:18:14] Mark: There'll be a spectrum. Yeah, so first proof is a very, very interesting experiment. And the proofs that the various people with AI tools generated were quite good. What we saw actually, there was a definite verification bottleneck. So we had a lot of proofs generated, some were terrible, some were quite good. Some were similar to things in the literature. Some were similar to the proofs that the authors themselves had. There was a couple which were actually different from the official proofs, and so that was interesting. But to evaluate carefully exactly how novel and how interesting each proof was, we actually don't have a way of doing that effectively. So I think that the first proof team are gonna create a more structured competition later where they will have some mechanism for verification. So in order to take full advantage of the new capabilities that AI have, we do need to create challenges that are easily verifiable.

[00:19:31] Terry: So somehow the level of automation and AI power that you can profitably use before it becomes slop is roughly proportionate to how stringent your verification is. So, yeah, so I think initially you're gonna see a lot of progress in areas, either which are sort of elementary enough that they can be relatively easily formalized, so combinatorics. I think you're gonna see, so the Erdos problems definitely fall into this category. There's some numerical type.

[00:19:52] James: There's some numerical type challenges where you want to find a configuration, mathematical object that obeys certain properties. Once you have the object, verifying them is very easy. We saw some examples here in physics. There are some similar type problems. I think there we will see a lot of progress. But there are other parts of mathematics where the objective is not to find an object that obeys a certain property but to find a good overarching theory to explain something or a good definition. Those we have a lot harder time to verify. If you want to propose a new conjecture or a new strategy to solve a problem, maybe AI can generate a hundred of these possible strategies, but only a human expert can verify or can give an informed opinion. That will be a bottleneck. Even if AI drives the cost of creating solutions down to zero, there are still other huge bottlenecks which were not front and centre in our minds. I think we also need to become much better at stating goals precisely. So AI is almost too good at fulfilling a goal to the letter. You ask, I want to solve this problem, I want to prove all this theorem. Maybe AI of the future just runs for an hour, accuser proof. But actually, what you wanted was for people to work hard, to fail, to find examples, to connect to our literature and to communicate all the partial results. That was actually the value of solving a particular problem. And there is a danger that if you specify your goal to an AI too narrowly, you miss out on most of the benefit. So we will have to be more careful about goal specification.

[00:21:42] Terry: Just two really quick things to add on. I do kind of think of this offline version of first proof. We were actually discussing this a little bit, too, where you can imagine that you just train a model with knowledge of some specific, very, very detailed, like, oh, this day, this time. You can imagine what a first proof would be at that point in time. And now you have the benefit of hindsight; you know kind of what the techniques you're after might be, what creativity in the model might look like. I think those are very interesting thought experiments to run. There's a thought experiment of what you would choose as a cutoff to get maximal signal on that experiment. But yeah, I also do kind of think about, you know, the process of mathematics isn't just answering or proving a theorem. It's all this partial progress that you assimilate somewhere. We have AI systems at OpenAI that are kind of just like central repositories for information. You could imagine that kind of serving a function in mathematics as well. You have just this kind of global library in some sense; I know, I think Daniel Lit published something online a while ago. It just is this agent that mathematicians can interface with, and it kind of fills out this convex hull of mathematical results. You can always use that as a source of truth for what people are exploring, and it'll kind of connect a lot of the dots for you. Yeah, just be this place which stores what we know.

[00:23:13] James: Well, it may sometimes be useful to turn that off. I mean, I've worked sometimes on a problem where I know too many techniques. There's a power putting I know will solve the problem, but it requires a lot of technical skill to use. I do it, solve my problem, and then publish or something. Then someone points out, actually, you could use this much simpler tool; you'd have a much simpler proof. I do worry a little bit that sometimes having access to every single technique known in the literature is not necessarily the best way forward. But having a diverse array, like mobile AI tools, I think there will still be people who take pleasure and pride in doing things old school and finding more human ways to solve problems, too.

[00:24:00] Terry: Well, I wonder if that pattern even plays to the case study you gave, Mark, where it's true if we could go back in time, just before a particular paradigm shift in whatever domain of science, and then see whether or not the model would predict it, that could be one verification tool. But I guess the kind of Cuny envision of paradigm shifts could also mean that, in fact, there is a future paradigm shift to come that would invalidate the prior paradigm shifts. You don't actually want the model to guess the previous one because it might take it off a pathway that doesn't get you to the next one. It throws into relief, I think, this question of verification and validation; it's both a philosophical and a practical question. Of all the domains, you could argue that math has the actual capacity to do automated...

[00:24:50] James: Do automated verification in a way that very few other domains do. Not perfectly and not without its own drawbacks. Is your instinct that that structure, where there'll need to be a sort of separate validation tool, will need to come into existence for all the other domains of knowledge that we want to work on, that it will mirror what's happened in maths, or it will need some other type of paradigm?

I definitely believe that there's an upper bound on how much AI you can inject into a workflow before it becomes a net loss. It is causing more errors and problems than it is solving. And one of the biggest upper bounds is the ability to verify. So yeah, so in math, I think we have the best shot at getting really high levels of automation, being able to effectively use highly automated in a way that you couldn't do in less trustworthy, in less verifiable domains, because we have a high verification bar, at least for the specific task of proving things, which is not the only thing that we care about, but proving things that we've already specified we want to prove.

Yeah, but even formal verification does have weaknesses. The language itself can be exploited by malicious agents. So an AI may sort of, you know, attempt to be helpful and try to prove as many things as possible, just sort of secretly add some axioms to the formal system and things. You can try to shut them down, but if the AI is too powerful, actually, yeah, I mean, at some point you actually have to limit how capable the AI is, or have periodically humans involved in the process.

So, you know, there are other, in the other sciences, you can do some of this. For example, numerical simulation can be used as a verifier in some cases, but again, you can't rely on it. Like if you would say, well, you want to model the weather and you have a supercomputer that predicts the weather and you have to train an AI to mimic the numerical simulation, it is possible that at some point they would just exploit some feature of the numerical simulation that is not part of the ground truth. So it will work up to a point and then it will stop. We do need to get a lot better at knowing the limits of our verifiers.

A lot of verification systems that we have, they work just fine if they're used non-adversarily. But, you know, if you're training AI specifically to maximize output based on using this verifier, it will find the exploits. Yes, AI is so good at that. It’s a ruthless cheater. So, yeah, we do have to be aware of that. Just because a human verifier surpasses all human tests, it may not be suitable for AI use.

That makes all the sense, and intuitively to AI cheating, the easiest way to make something measurable is to design it to be measurable from day one, from step one.

[00:27:58] Terry: Mark, do you think like that when you're trying to make the models ever smarter? Do you think in terms of first principles, what would have to be true to be measured as being smarter, or do you rely purely on generalization to try and get ever smarter models?

Yeah, so I really think when it comes down to it, why do we care about attacking math and physics at a place like Yoga.ai? And it really comes down to, we are out of good evals, good human-written evals, and doing science is the eval now, and math is particularly exciting because you can attack some kind of theorem, you can verify it in many cases, and you feel confident that you are legitimately pushing the frontier forward. I know there are initiatives in physics, too.

In physics, there's a little bit more hand-waving around, "Oh, you know, this constant's too small," and so, you know, you can still build pretty formal systems, right? And I think, and so, it allows us to kind of really push the frontiers in both math and physics. But fundamentally, one of the reasons we cared so much about reasoning in informal language is we care about generalization, right? We want to be able to do deep reasoning in fields like biology, too, and create breakthroughs there, even if it's kind of fuzzy what a breakthrough means, right?

I think in math, it's much more clear. It's like you solve a Navier-Stokes. Yeah, that’s a big breakthrough. If you kind of, the model says, "Hey, here's your next breakthrough in machine learning," I mean, I don't know how to verify if that's true. And I think it's just so empirical, and, you know, kind of time tells with a lot of these things. So I think what we care about is this fundamental generalizable reasoning layer. Natural language feels like a good way to express this in a way that kind of falls less into this trap of like, you have a tool bag of techniques and you just center on the known techniques.

I feel like in natural language, we are able to kind of.

[00:29:48] James: In natural language, we are able to kind of express, you know, these new techniques at least. And I think, yeah, we've been able to do that so far. So, yeah, we really deeply care about generalization. And I do think these formal fields give us a really rigorous way to test that we're pushing the frontier. Beyond the structural nature of maths, and therefore the ways that you can formally verify, is there some other practical benefit to pursuing ever greater capabilities in that space, or is it, in your mind, really more just the equivalent of an e-mail?

[00:30:31] Terry: Well, I think one positive feature for using maths as a test bed for other use cases is that, you know, we have this quote earlier today of Vladimir Arnault that mathematics is the place where experiments are cheap. It's also the place where failure is cheap. So, you know, it's related, you know, if you're an engineer and you're asked to build a bridge and the bridge collapses, that's an expensive mistake. You know, if you're a surgeon and you're asked to, and you cut the wrong thing, that's an expensive mistake. But in math, if you try to prove a theorem and your proof strategy doesn't work, that's not an expensive mistake. So I think we have this freedom to fail, which is more so than in other disciplines. And because of this, we have a culture of learning from our mistakes a lot more than in other disciplines. And so it's a relatively safer place to experiment with AI than, you know, let's say, bridge building or heart surgery.

[00:31:25] James: Okay. Yeah, I love that you say that. It's exactly the way that we think about things at OpenAI as well. Because I think fundamentally, what we care about developing AI for, the really, really inner core goal is to use it to develop stronger AI, right? We want to design better experiments to build stronger models, more intelligent models, which by extension will do even better math, but you know, it'll build an even stronger model and you get that flywheel. And that is an expensive thing. If you screw the system up in any way, computes at stake, right? Like you run the wrong experiment, you burn a lot of money, a lot of compute. And so I do think about math, physics as safe domains to push the frontier.

[00:32:10] Terry: Yeah, it makes sense. I mean, I even wonder if you can push that. And Kevin, your work touched on this—assuming a world in which the models may be discovering things that really are beyond the frontier of human knowledge or even the ability for a human to really conceptually follow. Presumably, there needs to be some way to re-represent those findings into a logical chain that at least we can follow the steps if not the actual constituent parts. And that math, more than any other domain, seems to have invented a workflow that accommodates that. Certainly compared to say biology or chemistry, which doesn't have as much by way of kind of formal axioms that you can work against. So potentially it's a necessary precursor to any truly frontier science advancement to have this capability. Regardless though, it does to a point Terry suggest that the way we think about doing maths in the future will change that we're emphasizing maybe creativity, collaboration, different skills, perhaps, than what have been happening the last 100 years. Does that filter them through into how you teach maths?

[00:33:12] James: Yeah, it's an open problem still how to. Yeah, so in the very short term, some things have had to change. So yeah, like weekly homework assignments have been the first casualty. Yeah, for instance, but I think we can push our students to do more ambitious things now. So I switched much more to a project-based type of assessment. In smaller classes, you can do some oral assessment. The skills we need to teach will be different. Yeah, so validation, the ability to independently verify AI-generated output will become essential. Yeah, softer skills, how to work with people. Mathematicians have been not uniformly good at that in the past. But it will have to get better.

[00:34:05] Terry: Yeah, it's, yeah. I mean, the pace of change is such that education systems are not catching up as rapidly as, but I think by necessity we’ll be forced to. I mean, with COVID, for example, we did do some emergency changes to our curriculum, and it kind of worked; it was not a great experience. So hopefully this time we can do a bit more planning. But we will, it will be on that level of change, I think.

[00:34:37] James: Yeah, I think the analog of that is, you know, our interviews became busted very quickly too. I think, you know, people, if they have, you know, time to do some kind of take-home or...

[00:34:46] James: You know, time to do some kind of take home or some kind of, you know, written type of interview. It's very hard. I do think kind of moving to a world where you can have a model also just kind of interact with you and teach you things. And the model itself can judge, you know, how much are you learning? Are you uptaking? You know, that actually kind of feels like a directionally good update. I've thought about revamping interviews in the form of, you know, you convince the model that you have the skills necessary to look at it from the eye. So, you know, I mean, and of course, you know, you have hacking and jailbreaking and stuff like that. But yeah, I really do think, you know, fundamentally, yeah, teaching has to change in some way. I am kind of curious to take on a couple of things. First, you know, I've heard from some other professors that, you know, it's, yeah, you really do see this divergence of like, you know, it's like the worst, the best homework and like worst live exam scores in history. I don't know if that's a trend that you see as well. And I think the second thing is like, do you actually see this divergence of students who are like very motivated to learn that get really good using the tools? And is there this cohort that really feels accelerative?

[00:36:00] Terry: Yeah, so definitely I've noticed homework scores going up and in-person scores going down, not so much, it's not like a collapse level. I mean, I don't have hard data. I do get a sense that the weakest students are using AI to sort of get to a median level. And the brightest students generally tend to avoid using AI because they're worried about—they can notice they're using it too much atrophes of skills. The weakest students, I think, they feel like they have less to lose. But, so it sometimes is an equalizing feature in that respect. And I mean, once you are having some expertise, these schools are great. So maybe the equilibrium is to actually discourage their use or use them in specific ways. I can certainly see homework exams in the future where the solution is not the point because anybody can enter into it. But for example, what prompt did you use to get to the solution? And that might be the more interesting assessment tool. So yeah, we have to figure it out.

[00:37:30] James: Yeah, and actually, it's important, just like AIs will optimize whatever reward function. The reward function we give to our students will actually make a lot of difference. Yeah, we have to think this carefully. Yeah, it's, in some ways the extreme cases are easy to understand. The total cognitive offloading and therefore no learning is occurring, et cetera. Maybe the more nuanced, though, would be something where it is a productive use of AI. And I think you used this metaphor in a recent interview, which is to say, you've been helicoptered to the destination as opposed to taking the scenic route there. There's something about the change in workflow corresponds to something at the cognitive level for humans. And that we don't know yet what it may be that you lose if you do that, but we need to be alive and alert to it. Do you have a thesis about what we might lose if we start doing that?

[00:38:13] Terry: I think we will see empirically pretty soon. So yeah, I think we just need much more awareness of all the different facets of research or any other task. So somehow, I said before, AI allows for a decoupling of many things, which can be good for division of labor. It is more efficient. But it does mean that goals that were previously OK to set very fuzzy goals, because any human attempt to reach these goals would also hit all the nearby goals as well. So as I said, with an AI, if you want to go see a nice waterfall or something on a mountain, you take a hike and sometimes you see some interesting wildlife. You get a glimpse of an even nicer location you might want to go to someday. And maybe you meet some other hikers and you have a conversation. But there's also serendipity, which just naturally happens.

[00:39:30] James: And so in the past, we would just say it's a good idea to go visit this waterfall. But we didn't unpack that carefully enough to say why we do that and what are the actual values, what's the actual benefits. But now we have this alternative way to get to the waterfall, so you can get an AI helicopter to drop you off there. And so yes, you get your little Instagram photo, but maybe that's not the only thing that you wanted. So yeah, I think unfortunately.

[00:39:44] James: Yeah, I think unfortunately we're gonna have to learn this by experience. It's hard to, I mean you can talk romantic about the journey and things, but I think it's only when we see what happens when we don't have that that we'll really understand what we're missing.

[00:40:01] I should say, just today we released our Learning Outcomes Measurement Suite, which is how we use the models to assess whether humans are learning when they use them. So agreed, it's sort of a live research question. But I wonder Mark, for you, serendipity and the idea of an inexact answer from the model in order to create space to explore. Is that a quality of the models that you're interested in exploring? Is that a model behavior question, a personality question? How do you grapple with that?

[00:40:31] Terry: Yeah, so actually one of the biggest initiatives we have this year in terms of building a new primitive and a new interaction paradigm with AI is we started an interactive agents team. I think it's not sufficient that you just ask an AI a question and then it just comes back, even let's say like a day later with its best attempt at a solution. I think humans are collaborative. They work in these constructs and you take like a completely non-math example, right? You wanna create some kind of, let's say PowerPoint or some kind of artifact like that, right? That's not the way you operate. You don't just tell some AI like just make me a PowerPoint and it should be perfect and should kinda address these things. You want it to kinda come back and you shape the direction it's going and there's multiple rounds of interaction with this agent. I think that truly is what it's like to co-work with a very intelligent agent. We wanna build that deeply into the model. Just something that's very steerable. It feels like a thought partner. And I do hope within a couple of months, at least within a year, that's the way that AI looks. It's much harder to reinforce on learning, on collaboration.

[00:41:41] James: How do you score how good you're vibing with your co-workers?

[00:41:44] Terry: Yeah, yeah. My thesis is it is possible.

[00:41:48] James: I agree that it might not be as difficult as you imagine. There are probably quite clear biological signals of what vibing looks like in the real world. You’ve got to get into the machine world.

[00:42:00] Terry: Okay, yeah, so embody your AI with body language. Embody your vibing, Jerry. There we go. I’m glad that I’ve got a quote that will be kept after this.

[00:42:12] James: I will open up the floor for questions just after this. I'll ask one more just to give you more time to think about it. Perhaps we will meet again in a year. I'd be interested to know what your predictions look like for where we'll be in one more year's time.

[00:42:26] Terry: I really hope we're gonna see a lot of new types of mathematical projects that are challenge-based. First proof type things where some group of mathematicians, for example, will create a really good, creative set of problems that they would like some proportion of these problems solved. And they have a very good gradation of difficulty. They have a very good verification protocol. And they'll just open it up to the community. So it's just taking full advantage of, well, not just AI, but also just like the internet and Metcalfe's Law. If there's n people who can produce problems and n people that can solve problems, then there's n squared possible connections. Mathematicians have been very bad at using, at sort of making this large scale network. So I think we will see a different, almost like a marketplace, for the style of doing mathematics. And there, I think, we'll see AI shine. So this is one thing I wanna see. And maybe in a year, we'll start seeing that.

[00:43:37] James: Yeah, I think of ML's kind of foreshadowing math here. Like when you look at how Frontier Labs operate today with research scientists, we are moving into this world where the strongest research scientists, they're able to kind of pursue a lot of ideas in parallel and just really act as orchestrators. So, they can think about this idea, think about a bunch of variations in the experiments, and just kind of have the model go and execute and implement that. I hope there is that kind of similar paradigm in math where people like Terry and yourselves feel empowered to just go explore a fairly broad set of ideas and strategies with very little hand-holding.

[00:44:17] I do think kind of the very little hand-holding part will also become more true. The task horizon will continue to elongate, just like we were at minutes, in terms of the horizon a year ago. I think in a year from now, we're going to be in multiple days where you can actually trust the model to do tasks that would take you that long. And then I think beyond that, yeah, it's just making sure the interaction is seamless, right? These things should be seamless.

[00:44:42] James: Interaction is seamless, right? These things should just feel like they interact very naturally with groups of humans and with the communities that you guys operate in. And finally, I really do hope we have some really big breakthroughs, whether it be in math, in physics, in biology. I think, you know, today, the things we're proving are good and all, but I do think there's the potential for this to actually produce something that's very beneficial for humanity. Fantastic, there was this metaphor in software development of bazaars and cathedrals, the bazaar being a self-organizing thing that springs up and is very diverse, the cathedral being one great mind architecting it, which is therefore very elegant. The idea, perhaps, will be that in math we get both those phenomena occurring, and hopefully that is a flourishing. So I do want to open up to questions. Does anyone have a burning question?

[00:45:36] Terry: Yeah, I wonder whether any of you can talk about the world models, which instead of predicting the next token, predict the next state. What I'm reading about the BJP-2AC is that it can actually self-correct things because the hallucinations can be avoided. Is that all true? Because whatever I could run on my math context model, it's just a hello world thing, so I don't know much about it. For the benefit of those watching, the question was about how do we feel about world models? Are they a paradigm shift, and would they be specifically useful for maths and solving hallucinations?

[00:46:11] Mark: I think it's potentially a very promising alternative direction. LLMs are great, and in some ways, they're too great, actually, in that we've routed our entire AI infrastructure around making the LLMs as powerful as possible, and it could crowd out some other very complementary ways to create AI assistance that happen in a completely different way. So I definitely support research into world models. I think for a long time, they will underperform the LLMs just because of all the momentum and infrastructure that LLMs have. It’s like we have built our cities around the automobile and gasoline, and we have this entire infrastructure that is actually making it hard for alternative modes of transportation to break through. But there are definitely people who are pushing that, and I wish them a lot of luck.

[00:47:06] Terry: Yeah, I do think when you think about a pure video generative world model, we still seem pretty far from that. I think the existing video models are pretty good physics simulators, but I do think with a little bit of our pressure, they also fall apart. I do imagine that will get more and more robust over time, but it’s just not quite there yet. We are pushing fairly hard on that. There are many spectrums of world models. You can see an LLM as a world model, too, but I think digital world models, where we interface with computers, have all the rules and feedback of the computer. That’s a very important and interesting system, and I do think we’ll tackle and really get a lot of value from that very soon.

[00:47:51] Mark: I wonder on that if there is a middle ground in as much as you can construct RL environments that are based on the laws of physics or follow the laws of physics, which to some degree confers the benefits you otherwise get from a world model to an LLM or anything else. Perhaps it's an intersection rather than two alternate pathways.

[00:48:09] Mark: Yeah, so AI in science, in many fields, is effective when it's predicting very well. For example, protein folding, predicting weather accurately, but in mathematics and theoretical physics, we're asking for something different. We want to kind of understand, get a formula, get a proof. Do you think it's potentially too limiting? Would it be easier to get an AI that will tell you, "I have a proof of theory, my hypothesis, but your brain is too narrow to understand it, so I can teach other AIs about it and do more progress with it?" That's definitely my relationship with AI already. Some of us have hit that frontier.

[00:49:45] Terry: Again, for the benefit of those watching, the question was that in some domains of science, we may be satisfied with pure simulation. If it can do the thing, we consider the thing to be proven, even if you can't formally verify it. If it can predict weather accurately, even if we don't know how, we're kind of happy with that outcome. We hold, in maths and physics, to a different standard, which is that it must be verifiable in the way we already discussed. Is that somehow limiting or a mistake to put that restraint?

[00:50:15] Mark: Yeah, I think there will be different types of mathematical tasks that we don't do nowadays, which we will entrust AI to, and they could be quite complementary to the task of getting a formal proof of a problem. To give you an analogy, in chess today, all chess players train using these chess engines, and one thing the chess engine does is give you the score at any time it's positioned, like white is three points ahead or whatever, and it’s a really great signal to train humans.

[00:49:40] James: And it's a really great signal to train human chess players. You get instant feedback. Oh, that was a really bad move. I'll try to tell this instead. I can imagine an AI, which, a human is trying to do a proof, and like every time you say, "I'm going to try to prove a contradiction," your score goes down way. Okay, back up and do something else. Maybe a small electric shock. That's a pro model, yeah, okay. Yeah, so a good math tutor can do that. But yeah, we have to be creative about the type of tasks that maybe AI could help with that we just don't think about today.

I do think verification is important. It doesn't have to be swimming. I think we deeply want to know why something is true, and I think that's actually part of a deeper alignment problem, right? When the AI is attacking actual human players and AI is attacking actual real-world impact tasks, you want to know why it made a certain decision, right? Say, it decided here's the best strategy to, I don’t know, grow a business or something, right? You don't want it to do that without having a good justification. And so we have a lot of alignment techniques like debate, right? Where even if you don't get necessarily a narrow type formal thing, you can kind of understand the outline and kind of interact with the proof and kind of question it. So, yeah, I do think investment into techniques like debate and alignment will really help us in the future.

[00:51:12] Terry: Please. Can you get anything out of the latent space stuff? Like see what the earlier…

[00:51:16] James: Yeah.

[00:51:21] Terry: Yeah, so that's something we look into a lot, right? I think the top order thing is, you know, just being able to monitor the reasoning of the train of thought, and you actually get a lot of insight from there. You can actually do a lot of, even with many attempts on a problem, we kind of get a sense for what strategies the model gravitates towards, and just kind of glean a lot of insight into how the model brain works. Yeah, I think you can go a level deeper, like look at the activations, and try to find mechanistic circuits and things like that. But yeah, definitely a very deep field of study.

These two questions may link up as much as, for interpretability, the way that we collapse the latent space does limit the kind of associations that could come out of the model. At what point do you decide that it's better to maintain the latent space because of the theoretical new connections it could make at the cost of interpretability, or do you just think that we need a different interpretability paradigm that doesn't require that kind of collapsing?

[00:52:19] James: I think the reason we operate in tech space today is interpretability buys you so much, right? I think you can debug so many things that go wrong with the model by just being like, oh, well clearly it's reasoning wrong here, so we can go and debug. When you're doing something in just pure, uninterpretable latent space, you lose that, and I don't know that we would switch to something like that in the interim.

Ideally, we should have a diversity of models, so maybe there are some applications where you just want the answer, you don't care about interpretability, and then you just turn the dial one way. But there'll be other applications where you really want to see the process, you really want to see a human-readable chain of thought, or whatever, and you turn the dial the other way.

[00:53:01] Terry: It does feel intuitively like having to compress things into language does come at a cost. So even alongside a plurality of models, you might also want a plurality of expressions or verification methods so that you don't somehow force it into a shape that doesn't make sense.

Sorry, please go, Carl.

[00:53:20] Mark: I have a question about attribution and symptoms a little bit as we look to this kind of future of science. One thing with AlphaFold, the world, I think, largely thinks AI came and solved that problem, and of course, AI in some sense did, but it was sitting on the Protein Data Bank and decades of effort. Then you see the Protein Data Bank loses its funding immediately after various things like that. I'm just curious that as we think about these large-scale math problems and all of the human effort that's gonna need to go into producing the sets of problems and verifying them, it does seem like there are, for many of these things, theoretical physics problems, all of them, a lot of danger that in some sense while AI was maybe the critical enabler that made something happen, it didn't happen on its own. It's really this ecosystem, and so on one side, it's how do we control that narrative, and the other part is that in some sense, like for OpenAI and for the big companies, there's a lot of responsibility in some sense about how they navigate that, and I'm just curious about how do we avoid it going in a bad direction.

[00:54:28] James: Well, this is an important point. A partial solution, so as I said, I do envisage the rise of challenge problems where people will create these datasets of top...

[00:54:38] James: Where people will create these data sets of tasks that they want solved and there it's kind of win-win because the people who create these data sets, they will get the problems, some fraction of the problem solved, which is what they want, but these data sets could be very useful to calibrate AIs. So there are some cases where it can be win-win. But, yeah, there are definitely cases where people have built a data set at great expense not for this reason and then it gets absorbed into various AIs, and yeah, I don't know how well we can track that. Yeah, it leads into like, intellectual property right law and it is a very tricky problem.

[00:55:17] Terry: Okay, I will toss to you.

[00:55:19] James: No, I think as of now, I think the AI doesn't want your credit so, I think, but I do think the vision we have for OpenAI for Science, it's really not about us claiming the credit here. I do think, I mean, we certainly have the ambitions to move science forward but, you know, Kevin here, he wants to build a platform where mathematicians around the world can just accelerate the field in its totality. I think like we don't know the right questions to ask, we are the orchestrators within OpenAI. I do think kind of the credit should just go to you guys. I know that's not exactly the question you're asking. I think it's not like the AI is going to claim credit but I think the public perception is AI solved this problem on its own in that somehow like humans and humans and they have all the answers.

[00:56:07] Terry: So what the Erdos problems, what we've seen is that there'll be times when there's an open Erdos problem that no one has looked at and then some AI solution guesses a solution and then the social media, AI is open to answer the problem, okay. And in many, many cases, you know, like 24 hours later, someone often armed with one of these deep research tools uncovers that this was almost already proven by a very similar method in the literature. And we can't say for sure whether the AI used that solution or indirectly was aware of it. But it happened so frequently. I mean, it has this whole table of AI contributions. It's like the whole section, just this.

[00:56:51] James: And so to some extent, we at least have the capability to detect at least some of this, because we also have these research tools. So we can recover attribution sometimes. It's not perfect, but it could be that the same technology that allows us to use the literature to solve problems can also use the literature to attribute solutions.

[00:57:19] Terry: Yeah, just one thought on top of that. In general, it is a very hard problem, just data attribution. When you generate something, how inspired is it from which data points? One interesting thought here is perhaps the novelty and contribution is somewhat correlated to the amount of time the models spend thinking about something. Maybe it's not always true, but yeah, I think modular stuff, it rediscovers in literature.

[00:57:47] James: I think there is a PR moment, though, that DeepMind could have gone out of their way to say more about the protein databank and the narrative. There are things that are not about the AI models and data and socials, it's about just how we talk about it in PR and analysis.

[00:58:03] Terry: That's a very fair point. I understand the incentives. I do hope anyone here at OpenAI can verify this, but we care a lot about integrity. I think, at least I would very much strongly fight for the correct narrative there.

[00:58:21] James: Maybe just to add a flavor from my previous life. Maybe one paradigm shift is that the general public tends to underestimate the degree to which the progress of science relies on improving tools, fundamentally. You need better microscopes, so on, so forth. These are not glamorous jobs. It's not typically what theoretical scientists like to work on. And that's the narrative we should be saying, is that by building ever better tools for science, that's what enables humans to accelerate the entire field, often really big steps forward. AlphaFold is really a tool. It's not per se scientific research, though it does do some research component things there too.

[00:59:00] Terry: And if we can stay on the emphasis of it being a tool, that might also flow back to your point about what is the investment flow in society? Because it's to the things that need to surround the tool to make it effective, but ultimately in service of the human researchers or scientists that will use the tool to solve the problem.

[00:59:17] James: We are just in the edge of time, so I'll take one more if that's alright.

[00:59:20] Terry: Okay, perfect. We can go to the back.

[00:59:23] James: Yeah, so I think one of the really interesting things about this conference is, it's accelerating math and physics using AI. And I think there's a certain amount of synergy that comes from being able to solve math and how that can help.

[00:59:36] James: of math and how that can help solve other physics problems. So I'm curious to hear your thoughts on additional synergies you see coming out of OpenAI, through you guys tackling math, physics, and then what else you guys see coming out of here.

[00:59:51] Terry: Yeah, so I think Kevin would probably be the best to speak to this, but we did hear about tackling the domains outside of math and physics as well. I think some that we have explored are biology, where we have had the AI work on just making the biological procedures in the wet lab much more efficient. I think with one of our partners, Skinco Bioworks, we iterated on a lot of their core processes and made the cost for synthesizing proteins 40% more efficient. And I think that's just the underlying primitive that will drive more progress. A lot more you could imagine doing in things like material science and other domains.

[01:00:31] Terry: So the IPAM, this institute, our entire core mission is basically to find these synergies. The Institute for Pure and Applied Mathematics is basically in the name. We bring together events like this, where different communities talk together. A lot of it is this serendipity. I mean, we have some idea. We don't just randomly smash together fuels. OK, but we do pick ones where we do believe there will be a lot of unexpected fruitful collaboration.

[01:01:00] James: I'm glad you leave randomly smashing together particles to the physicists.

[01:01:02] Terry: Yeah, yeah, yeah. That's a good moment to close it and also to plug that Kevin is about to give a lecture, which covers many of these questions, i.e., what needs to be true to accelerate all parts of science and the tools to build for it. So I hope you'll come to that. But thank you also very much.

[01:01:19] Audience: Thank you. Thank you. Thank you. Thank you. Thank you. Thank you.

+ Read More

Comments (0)

Popular

Watch More

Exploring the Future of Math & AI with Terence Tao and OpenAI

Posted Oct 09, 2023 | Views 27.9K

# STEM

# Higher Education

# Innovation

Event Replay: Sam Altman on Building the Future of AI

Posted Apr 06, 2026 | Views 7K

# OpenAI Leadership

# AI Governance

# AI Safety

# Economic Opportunity

Event Replay: OpenAI's Chief Futurist on AGI and What's Next

Posted Feb 26, 2026 | Views 1K

# OpenAI Leadership

# OpenAI Team

# Responsible AI