OpenAI Forum

Deep Research in the OpenAI Forum

Posted Mar 28, 2025 | Views 1.1K
# AI Research
# OpenAI Presentation
# o3 reasoning model

SPEAKERS

Isa Fulford
Member of Technical Staff @ OpenAI

Isa Fulford is currently a Member of Technical Staff at OpenAI, working in post-training research. She has previously been involved as a Scout at Sequoia Capital and served as a Mayfield Fellow with Stanford Technology Ventures Program (STVP). Isa gained practical engineering experience at Mem Labs and Amazon Web Services, where she contributed to projects in automated reasoning and verification for critical systems. Isa holds a Master of Science in Computer Science and a Bachelor of Science in Mathematical and Computational Science, both from Stanford University. She graduated with Honors and Distinction, achieving recognition as a Phi Beta Kappa and Mayfield Fellow, and participated in international academic experiences at Oxford and Florence. Beyond her academic and professional achievements, Isa is an accomplished violinist, having earned a Gold Medal at the Golden Key Music Festival and performed at Carnegie Hall. She was also a violinist with the National Youth Orchestra of Great Britain, showcasing her passion for music alongside her technical expertise.

Zhiqing (Edward) Sun
Member of Technical Staff @ OpenAI

Zhiqing (Edward) Sun is a Research Scientist at OpenAI, specializing in post-training research. Edward brings extensive experience from prestigious internships at the Allen Institute for AI, the MIT-IBM Watson AI Lab, and Google's Brain Team, where he contributed to cutting-edge AI research and development. Edward earned his PhD in Computer Science from the Language Technologies Institute at Carnegie Mellon University's School of Computer Science. He also holds a Bachelor of Science in Computer Science from the School of Electrical Engineering and Computer Science at Peking University. His international research engagements across leading institutions in both the United States and China highlight his global perspective and commitment to advancing artificial intelligence.


SUMMARY

The presentation from Isa Fulford and Edward Sun offers an in-depth look into “Deep Research,” a capability within ChatGPT powered by a fine-tuned version of the o3 model. The model is built with agentic capabilities that enable it to autonomously conduct complex, long-horizon research tasks involving browsing, reasoning, data processing, and synthesis. Deep Research is positioned as a leap toward more capable AI agents that save users significant time and deliver high-quality, sourced outputs. The presentation also showcases how reinforcement learning, reasoning models, and safety measures contribute to creating a robust system meant to support real-world professional tasks—particularly in business, science, medicine, and academia.


TRANSCRIPT

Tonight, we'll hear from research scientists who contributed to deep research, Isa Fulford and Edward Sun. We are always listening to our members, and recently there was a lot of buzz in the forum around how deep research in ChatGPT was saving members, from professors to nurse practitioners to writers, anywhere from hours to days of manual research, how it was delivering high-quality analysis, and how it was providing source attribution so researchers can quickly review and go deeper. Members are finding that deep research asks clarifying questions to refine research before it even starts.

We've learned that the majority of our members are using deep research for work and professional needs. Some of the top use cases included technical and coding work, such as debugging and script generation, and business and market research, such as competitor analysis and evaluating investments. And we've learned that faculty and graduate students in academia and in science labs are using it for literature reviews, citations, and even more.

We thought it would be valuable to share deep research with the whole community because that's what we do here. We learn from each other, we share resources and ideas, and I hope that this presentation will be only a starting point for many, many conversations and projects in the forum. We also aim to ensure that all of our members get a seat at the table with OpenAI.

So for those of you who are here as members, we'll host a live Q&A with our researchers after their presentation. And if you're here as a guest and would like to become a member, please fill out the application form that our community manager is going to drop in the chat now and also a little bit later.

So without further ado, let's get started with an introduction to tonight's special guests.

Isa Fulford is currently a member of the technical staff at OpenAI, working in post-training research. She has previously been involved as a Scout at Sequoia Capital and served as a Mayfield Fellow with the Stanford Technology Ventures Program. Isa gained practical engineering experience at Mem Labs and Amazon Web Services, where she contributed to projects in automated reasoning and verification for critical systems. Isa holds a Master of Science in Computer Science and a Bachelor of Science in Mathematical and Computational Science, both from Stanford University. She graduated with honors and distinction, achieving recognition as a Phi Beta Kappa and Mayfield Fellow, and participated in international academic experiences at Oxford and Florence. Beyond her academic and professional achievements, Isa is an accomplished violinist, having earned a gold medal at the Golden Key Music Festival and performed at Carnegie Hall. She was also a violinist with the National Youth Orchestra of Great Britain, showcasing her passion for music alongside her technical expertise.

Since hosting Music is Math in the OpenAI Forum in February, we've actually learned that many skilled technologists at OpenAI live double lives as musicians, and we absolutely love it.

Edward Sun is a research scientist at OpenAI, specializing in post-training research as well. Edward brings extensive experience from prestigious internships at the Allen Institute for AI, the MIT-IBM Watson AI Lab, and the Google Brain team, where he contributed to cutting-edge AI research and development. Edward earned his PhD in Computer Science from the Language Technologies Institute at Carnegie Mellon University's School of Computer Science. He also holds a Bachelor of Science in Computer Science from the School of Electrical Engineering and Computer Science at Peking University. His international research engagements across leading institutions in both the United States and China highlight his global perspective and commitment to advancing artificial intelligence.

Isa and Edward, thank you so much for being here, and the stage is yours now.

Hi, everyone. Thanks, Natalie, for the very kind introduction. I'm Isa.

I'm Edward. We're very excited to present about deep research today.

So first, what will we be talking about?

So first, we'll go through some background to the project. So we'll talk about reasoning models and how we were inspired to start this project. Then we'll talk a bit about how we train deep research. We'll talk about some internal and external benchmarks. Then we'll talk a bit about the safety training that we did to make sure that the model was safe before we launched. And then we'll talk a bit about limitations and next steps for the project.

So first, what is deep research?

Deep research is an agent in ChatGPT that can do work for you independently. You give it a prompt, and it will find, analyze, and synthesize hundreds of online sources to create a comprehensive report at the level of a research analyst. It's powered by a version of our upcoming o3 model that we fine-tuned specifically for web browsing and data analysis. It leverages reasoning capabilities to search, interpret, and analyze huge amounts of text, images, and PDFs on the internet. And it can pivot as needed in reaction to information it encounters. We think that deep research can accomplish in tens of minutes what would take a human many hours. And we're excited to hear that a lot of people in the OpenAI Forum community have also been playing with deep research.

So first, a bit of background on reasoning models. We launched o1 in September of last year. And this was the first model that we released in this new paradigm of training, where models are trained to think before answering. And we call this text where the model is thinking the chain of thought. So on the right, you can see a screenshot of a question in ChatGPT asked to o1. This is a cipher problem. And then you can see the model actually thinks through the problem and reasons about how to answer before giving its final answer to the user. And this is a lot more similar to how humans would actually solve hard problems.

So how does this work? So reasoning models are trained via a large scale reinforcement learning algorithm that teaches the model how to think productively using this chain of thought in a highly data efficient training process. So what does that mean exactly? So on the slide, we have a very classic diagram for reinforcement learning. And in the context of a reasoning model, one way to think about this is that the agent or the reasoning model receives a question or a task. Then it generates a chain of thought as it's thinking through the problem before producing its final answer. Then the correctness of the answer is evaluated. And then based on whether the response is correct or incorrect, the model gets a corresponding reward and then is updated.

So we train our language models on many, many problems. And we positively reinforce the trajectories, so chains of thought and final answers that lead to good outcomes, i.e. positive rewards. And then eventually the model gets very good at reasoning through these hard problems and also reasoning through new hard problems that it hasn't been trained on. Through this training, the model learns a lot of emergent behaviors like error correction, trying multiple strategies and breaking down problems into smaller steps. And this is all learned through the training process.
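As a rough illustration of that loop, here is a minimal sketch in Python. The `model.generate` and `model.update` methods and the `Task` layout are hypothetical stand-ins, and the exact-match grader is a toy; this is just the shape of the reward signal described above, not OpenAI's actual training code.

```python
from dataclasses import dataclass


@dataclass
class Task:
    question: str
    reference: str  # known-correct answer used to grade the model


def grade_answer(answer: str, reference: str) -> bool:
    """Toy grader: exact match. Real graders are task-specific."""
    return answer.strip().lower() == reference.strip().lower()


def train_reasoning_model(model, tasks, num_epochs=1):
    for _ in range(num_epochs):
        for task in tasks:
            # The model thinks first (chain of thought), then answers.
            chain_of_thought, answer = model.generate(task.question)

            # Reward depends on whether the final answer is correct.
            reward = 1.0 if grade_answer(answer, task.reference) else 0.0

            # Positively reinforce trajectories that led to good outcomes;
            # trajectories with zero reward are not reinforced.
            model.update(
                trajectory=(task.question, chain_of_thought, answer),
                reward=reward,
            )
    return model
```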

One thing that is very exciting is that we found that performance of models like o1 not only consistently improves with more reinforcement learning, so train-time compute, it also improves with more time spent thinking, so test-time compute. And what this suggests is that as we give the model more and more time to think, or more compute at test time, we'll be able to tackle harder and harder problems.

So how does this relate to deep research? Around a year ago internally, we were seeing really great success with training models in this way using reinforcement learning. We were training mostly on math, science, and coding tasks, and we wondered if we could apply these same methods to tasks that are more similar to what a large number of users do in their daily lives and jobs. A lot of these tasks involve browsing, synthesizing a lot of information, and doing data analysis. And so we wondered: if during training we focused on the right real-world, economically valuable tasks that our users would care about, and evaluated on these tasks, could we start teaching our models to perform well on them, rather than relying solely on generalization from training on math and coding tasks and then trying to answer these more real-world tasks?

So at OpenAI, we have this concept of stages of artificial intelligence. Level one is chatbots. So these are AI that use conversational language, like the original ChatGPT. Level two is reasoners. So these are reasoning models like o1 that are able to solve problems like humans would. Level three is agents. And so these are systems that can take actions. Level four is innovators, and this would be AI that can aid in invention. And then level five is organizations. So AI that can do the work of an organization.

The o1 reinforcement learning algorithm unlocks the potential for truly agentic experiences, so level three. And these are long-horizon tasks that require reasoning in complex environments with access to external resources. And the reason for this is that in order to operate in a complex environment, you need a model that's flexible, able to error correct, and able to reason about information as it's actually coming into contact with it in real time. And to be able to make a robust agent, you really need to be able to train with end-to-end reinforcement learning on the real distribution of tasks you want to solve.

So for innovators, the ability to create new knowledge, I think a prerequisite is the ability to synthesize existing knowledge. So if we're trying to get to AI systems that can make new scientific discoveries, we first need to be able to create models that are able to synthesize existing research in an area. So we think that deep research is a significant step towards this broader goal.

And, you know, concretely, you couldn't do novel research in a field without being able to write a literature review or summarize the existing work in the field. So with that, I'll hand over to Edward to talk more about how we train the model.

Yeah, thank you.

Hi, I'm Edward. I'm happy to introduce how we train the deep research model. So to train this model, we need at least two components. The first component is that we want to give a reasoning model tools so that we can unleash its capabilities. For example, in deep research, we give the model a browser tool and a code execution tool. And then the second thing is that we need to curate some specific reinforcement learning data sets so the model can practice its capabilities of using these tools and of synthesizing information to deliver a high-quality response.

So for the browser tool, it's kind of an abstraction of a real browser, but it's designed with a predefined action space: search, which issues search queries to a search engine like Google or Bing; click, which can follow any link on a search results page or any further links on an existing page; and scroll, because the model doesn't interact with a real browser directly. And for the code execution tool, it's a Python tool, or maybe a terminal tool, and it allows the model to execute nearly any code in a sandboxed environment so that the code can be executed safely.
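To make that action space concrete, here is a minimal sketch of what such a predefined tool interface could look like. The action names mirror the talk (search, click, scroll, plus sandboxed code execution); the classes, dispatcher, and backends are hypothetical illustrations, not OpenAI's internal interface.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Search:        # issue a query to a search engine
    query: str


@dataclass
class Click:         # follow a link on the results page or the current page
    link_id: int


@dataclass
class Scroll:        # move further down the currently open page
    amount: int


@dataclass
class RunPython:     # execute code in a sandboxed interpreter
    code: str


ToolCall = Union[Search, Click, Scroll, RunPython]


def dispatch(call: ToolCall, browser, sandbox) -> str:
    """Route a model-issued tool call to the right backend and return text
    the model can read as its next observation (hypothetical backends)."""
    if isinstance(call, Search):
        return browser.search(call.query)
    if isinstance(call, Click):
        return browser.click(call.link_id)
    if isinstance(call, Scroll):
        return browser.scroll(call.amount)
    return sandbox.run(call.code)
```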

The code execution tool is pretty useful for performing calculations or processing data in batch. So to sum up, these two tools are very complementary, especially in deep research, where the browser tool helps the model aggregate and synthesize real-time data, and the Python tool helps the model process that data.

And we also need to curate the data sets for the training. So we need data sets that involve both the browser and code execution tools, and these data sets all need to be structured in a way that is suitable for doing reinforcement learning, so they aid the training process. In short, training the deep research model requires not only giving the model the tools it needs to be successful, but also curating the right data sets for it to practice using those tools effectively. Through this method, we're able to train models that can think through complex problems and deliver high-quality responses safely and securely.

So concretely, we train the model to generate chains of thought, reasoning, and tool calls in an interleaved manner. The diagram on the left breaks this down.

So first, the user gives an input. Then the model may generate interleaved chains of thought and tool calls. So specifically, the model may think about what tool it should call or what specific parameters it should use in the tool call. And the model may also generate a plan at the beginning of its chain of thought, or generate a summary of how it should present the results in its last chain of thought.

And then finally, the model delivers its output. On the right side, we have an example of how this process appears in reality. So specifically, here, the model is performing a gold-to-total medal ratio analysis for the Tokyo 2020 Olympic medal table. And you can see how the model interleaves reasoning with actual tool calls to search for information, refine the data, and process it programmatically.
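As a rough sketch of that interleaved loop, here is what the control flow could look like in Python. The `model.step` interface, the message roles, and `execute_tool` are assumptions made for illustration, not the production implementation.

```python
def run_agent(model, user_input, execute_tool, max_steps=100):
    """Alternate chain of thought and tool calls until a final answer."""
    transcript = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        step = model.step(transcript)  # reasoning plus an optional tool call

        # Keep the reasoning in context so later steps can build on it.
        transcript.append({"role": "thought", "content": step.chain_of_thought})

        if step.tool_call is None:
            # No more tools needed: the model writes its final report.
            return step.final_answer

        # Execute the requested tool (browser or Python) and feed the
        # result back to the model as its next observation.
        observation = execute_tool(step.tool_call)
        transcript.append({"role": "tool_result", "content": observation})

    raise RuntimeError("Agent did not finish within the step budget")
```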

The medal table example demonstrates how the model can adaptively switch between reasoning and execution, using tools to gather the necessary information and applying calculations to produce meaningful results. So here, we give one concrete example of how we train the model.

We train the model on quite a diversity of data, but here is just one very easy example to understand. Here, the question is: can you give me the title of the scientific paper published at EMNLP between 2018 and 2023, where the first two authors are affiliated with Google Research, the third author is from New York University, and the last author is from both Google Research and Brown University?

And the answer is the paper "Frequency Effects on Syntactic Rule Learning in Transformers," which appeared at EMNLP. So to perform this task, the model needs to practice lots of tool-calling and reasoning capabilities.

First, the model needs to analyze how it can approach this problem. So for example, it may need to use a program to scrape all the papers published between 2018 and 2023. Then it needs to figure out how it can extract or analyze the author names from all these papers. And finally, it needs to figure out how, given an author, it can search for that author on the internet and figure out their affiliations at that time.
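As a toy illustration of the filtering step in that example, here is what a check over already-scraped paper metadata could look like. The data layout, placeholder author names, and helper are hypothetical; in practice the model might write and run something along these lines in its sandboxed Python tool.

```python
def matches_query(paper: dict) -> bool:
    """Check the affiliation pattern from the example question above."""
    affiliations = [set(author["affiliations"]) for author in paper["authors"]]
    if len(affiliations) < 4:
        return False
    return (
        "Google Research" in affiliations[0]
        and "Google Research" in affiliations[1]
        and "New York University" in affiliations[2]
        and {"Google Research", "Brown University"} <= affiliations[-1]
    )


# Hypothetical metadata the model might have scraped with its browser tool.
papers = [
    {
        "title": "Frequency Effects on Syntactic Rule Learning in Transformers",
        "authors": [
            {"name": "Author A", "affiliations": ["Google Research"]},
            {"name": "Author B", "affiliations": ["Google Research"]},
            {"name": "Author C", "affiliations": ["New York University"]},
            {"name": "Author D", "affiliations": ["Google Research", "Brown University"]},
        ],
    },
]

print([p["title"] for p in papers if matches_query(p)])
```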

And so for the training, given this data and given the model's action space, we train the model with end-to-end reinforcement learning. So deep research is trained with a large-scale reinforcement learning algorithm, which allows it to improve both its reasoning and tool-calling capabilities in a very data-efficient manner.

And unlike a typical, previously prompting-based approach, where the model is given some instructions and we just hope that the model can follow these instructions and give you OK results, in reinforcement learning we directly optimize the model to actively learn from the feedback, both positive and negative. And here, just like in many reinforcement learning algorithms, the model will learn to explore many different approaches, as Isa just mentioned, like error correction, multiple strategies, and verification.

And it will refine its strategy over time. So here, the diagram on the left illustrates this feedback loop, where the agent's reasoning figures out the current state of problem-solving.

And then it issues some actions to the environment. And the environment will respond to these actions by providing new information, like processing some new information given the action. And then this will give the model some reward to learn from.

So the combination of reasoning, tool calls, and reinforcement learning creates a system that can continuously improve its problem-solving capabilities as it learns from the environment. So here are some comparisons between the original SearchGPT and deep research.

Unlike previous LLM-powered search solutions, where the workflow system relies on one-shot information aggregation and pretty shallow research, our approach in deep research uses an end-to-end training paradigm via high-compute reinforcement learning. And these tasks are designed to demand long-horizon reasoning, iterative exploration, and dynamic planning.

So these are challenges beyond the scope of traditional search solutions, and we train this model to progressively master these complexities. This end-to-end methodology ensures that our model can handle truly long-horizon tasks and deliver very robust performance, because it learns to error-correct itself.

And unlike previous products, we chose not to make latency a constraint here. So typically, a deep research report will be finished in 5 to 30 minutes, whereas something like SearchGPT can instantly give you a result.

So here, instead of providing a quick summary of search results, the model needs to be more capable. It will need to iteratively do some dynamic planning to find the solution, maybe with multi-hop reasoning. So here, we showcase how the model can handle a complex scientific query about pure- and mixed-gas sorption in glassy polymers.

The prompt on the left describes a researcher's query about understanding sorption mechanisms, predicting behaviors using the dual-mode sorption model, and exploring the associated challenges. And the model breaks down the problem into several steps, like understanding the sorption models, then accessing open-access sources using the browser tool, extracting specific insights from the retrieved research papers, and also clarifying key properties.

We've omitted several steps here, maybe hundreds of steps. And finally, the model will also have some reasoning to piece everything together to deliver a very accurate and well-cited report. So this showcases how the model can effectively break down a complex task, gather information from various sources, and structure the response coherently for the user.

Now, I will hand it back to Isa.

So we have some out-of-distribution external benchmarks that we measured the model's performance on. One of them is Humanity's Last Exam. This is a fairly recently released evaluation from the Center for AI Safety and Scale AI.

And it tests capability across a very broad range of subjects on expert-level questions. And the model powering deep research scored a new high of 26.6%, which I believe is still the high score. And this test consists of around 3,000 multiple-choice and short-answer questions across over 100 different subjects, from linguistics to physics, classics to ecology.

And it's pretty cool if you look through some of the trajectories with the deep research model, because the way it solves the problem is kind of similar to how a human would solve a problem. If you're tackling a hard problem, you probably will look up existing work, try to find equations that you might need to use. Maybe for a literature example, you'd look up similar pieces of work to try and help inform your solution, and then do some calculations for a physics problem or something like that.

So it's pretty cool to read through and see how it approaches these problems. Another public evaluation is called GAIA. And this measures capability on agentic problems that require multi-step web browsing, reasoning, multimodal ability, and code execution.

And at the time we released, the model reached a new SOTA, or state of the art, topping the external leaderboard. I think that since then, there have been some new high scores as well. And then we have an internal evaluation that we collected.

It's expert-level tasks across a very broad range of domains, including the sciences, engineering, finance, policy, law, and many more. And we had these experts rate the deep research model's completions. We found that the model could successfully complete tasks that would take humans hours of hard manual investigation.

We have two graphs to illustrate this. So on the left, we have pass rates on these tasks by estimated economic value. And then on the right, we have pass rate by estimated hours for a human expert to complete the task.

And pass rate reflects the rate at which the model provides a satisfactory answer as judged by an expert in a given domain. And it's interesting to see that the estimated economic value of the task was more correlated with pass rate than the number of hours it would take a human.

And we think this shows us that the things that the models find hard are not the same as what humans find time-consuming. So since launch, we've had a few surprises.

So we had some ideas about the things that the model was good at, the things that people might use the product for. But I think we've been surprised to see on Twitter and other places how people have been using the model.

One thing that we hadn't considered that much is that people might be using the model a lot for coding. And that's been a really big use case of people trying to write a script, but they want to use all the newest up-to-date APIs or packages or whatever it is.

We've also seen some really cool scientific use cases in medical and biology domains. I think for us, as we're reading trajectories, we don't necessarily know if the output is medically accurate. So I think it's been cool to see real experts verifying that the outputs look pretty good.

We've also seen interesting user behavior where people put a lot of effort into refining their prompts, using o1 or another model to write a really comprehensive prompt. And then only after really refining that instruction will they send it to deep research, which I guess makes sense if you're going to wait a long time for an output. You want to make sure that you really specify the prompt upfront.

And that's the same reason for the clarification flow. So if you've ever used deep research, you'll know that you'll ask ChatGPT to do something for you, and first it will come back with some clarification. So it will ask you some clarifying questions about your request. And the reason that we chose this user experience is that we felt, if you're waiting for a really long time, we want to make sure that you're happy with the final output. And so you should provide as much detail upfront in order to lead to the exact kind of response that you wanted. And so I think some people really love this clarification flow, but then I've also seen some people being quite annoyed by this clarification flow. So that was interesting to see.

So on safety, we did extensive safety training and evaluation before launching this model. We had new safety training data in a few categories. One of them was privacy, so making sure that the model refuses to respond to requests for very sensitive private information, and then also extremism. And then we also wanted to make sure that the model was robust to prompt injection from content on malicious websites that might be trying to confuse the model and make it do something that we wouldn't want it to do. And before release, we did extensive red teaming with external testers, and then also went through the preparedness and governance reviews that we always do at OpenAI.

So we also wanted to highlight the model still has some limitations. While this is the model that performs best on internal hallucination evals and external hallucination evals that we've published, it still may hallucinate facts or infer things incorrectly. Sometimes it struggles to distinguish between authoritative sources and rumors. And then it's not always the best at calibrating its own confidence in a statement that it's making. So these are all things that, as we continue to train deep research models, we really want to improve on.

What's next for the project? So we're very excited to release a smaller, cheaper model to the free tier, and extend access so more people are able to use deep research. As I mentioned, we're working on core improvements to try and address some of the limitations that we outlined. We're also excited to make outputs more personalized. And then we're also excited for the new product direction, where we could let the model access things that you have subscriptions to, or private data that you have that you might want the model to also be able to do research over. So if you haven't tried deep research, it's available now for Plus, Pro, Team, Enterprise, and Edu users. And we'll launch it to free very soon. And in ChatGPT, there's a deep research button. So you can type something in and press that button, and you'll be able to try deep research as well.

Thank you very much for listening to our presentation about deep research. We hope you try it and enjoy it. And we'll hand back to Natalie to close out, and then also we'll move into a Q&A. Thank you very much.

Thank you. Thank you, Isa and Edward. That was so awesome. I asked the team to leave you on for just a few minutes because we have a lot of really great questions in the chat. So before we jump into the Q&A for members, I thought we can answer three questions from the chat with you guys, if that's all right with you. Sounds great.

And then for everybody that's present, just so you know, if you were here live tonight and you're a member, you can find the entire chat history from this event in your messages. And if Isa and Edward have a little bit of time in the next week, maybe they can answer a few of the questions from the chat that didn't get addressed tonight because we have limited time.

Okay, here's a question from Eeyore from the NASA Ames Research Center. Can I direct ChatGPT to search only Google Scholar and go through at least 20 pages on each query? That is something we would love to support in the future. I think now the model is quite good at instruction following but it's not perfect. So sometimes if you ask it to research 50 different things, it might only research 10 and that's because it knows it has limited context. And similarly, it will try to stick to sources that you ask it to, but it won't always do that. So I think in future, that's something we would love to support.

And then Nicole, a CMO from Atlanta, Georgia, asks: she would love to understand how the model you're in when you use deep research impacts the results, if at all. Are there pros and cons from selecting different models from the dropdown, 4o, o1 pro, et cetera? Yeah, so right now the results are quite good, so I think it's a good idea to use deep research. Right now the design is that, whichever model you select from the dropdown, it will always call a fine-tuned version of o3 for the deep research task. Yeah, and I know that UX is a bit confusing. It actually doesn't really matter which model you have chosen. I think it will always default to using 4o, but it will keep whatever conversation you have already in context. So I think if you're trying to refine your prompt or something like that, you might have a better time with o1 or a reasoning model. And then when you press the deep research button, the clarification question comes from 4o, and then it gets sent to the deep research model, which, as Edward said, is the o3 fine-tune.

Okay, last but not least for this section of our event, any specific recommendations for how to prompt specific to deep research? Yeah, I think with these reasoning models, rather than instructing the model on how to perform the task, you should tell the model what the objective of your task is. Then the model is usually more creative, because the model knows its own constraints, so it may find ways to complete the task more effectively. Yeah, totally agree. And I also think with deep research in particular, since you don't have as much chance to follow up, I think specifying exactly what you want upfront matters; your prompts might even be a page long or longer. I think that's totally okay, and the model will be able to handle those kinds of requests. That is really good advice. Thank you so much, guys. I'll be using that as well.

Okay, so Isa, Edward, thank you so much for joining us. That was very special. And we'll see you in a moment in the live Q&A.

I just want to make sure before everybody goes that you know we have some awesome events in the pipeline as well. So next Thursday, April 3rd at 5:30 PM PT, you can join us for another event with Sora Alpha Artists, the next chapter in AI storytelling. The first one was absolutely beautiful with Manu and Will. We hope you can join us again for the next phase of their storytelling journey. Wednesday, April 9th at 5:00 PM PT, join us for OpenAI Presents: Leading Impactful ChatGPT Trainings with one of our very favorite technical presenters from OpenAI, Lois Newman. So if you are a trainer and you're hoping to train your teams, or you're hoping to just put together small initiatives like I do with my family and with my colleagues, and you would like to learn how to be a more impactful trainer, please join us for that.

We've been hosting community events for two years now and have an awesome archive of storytelling. Almost 70 events are now published in the content tab of the OpenAI Forum. So if you're a visitor, you can visit the content tab and you still have access to almost all of the recorded content that we've ever produced in the forum. I encourage you to explore that. If you haven't joined a global chapter or an interest group, we also encourage you to look around. If you join a global chapter, then you might be able to find OpenAI members in your community very close by. You can actually meet them in person. You can host coffee dates all on your own. So join a global chapter, find your people there.

Finally, if you're here as a guest, but you'd like to be a member, you'd like to go a little bit deeper with us, you'd like to be able to join us for the live Q&As, you'd like to be able to meet all of the other members and join us for in-person events, you can apply to be a member with the form that our community manager is about to drop in the chat. We would love to see your application. We would love to invite you into this expert community. We're growing all the time. Last but not least, it was really my pleasure to host everyone. Happy Thursday, everyone. I hope you enjoyed and learned something from deep research in the OpenAI Forum, and I will see you again soon.
