Technical Success Office Hours
James is a Solutions Architect at OpenAI, partnering with the company’s largest and most strategic customers to address their most complex challenges using OpenAI’s technology. He has implemented LLM-driven solutions in production, pushing the capabilities of OpenAI’s API across telecommunications, automotive, technology, and other sectors. Before joining OpenAI, James worked in Data Science at McKinsey & Co., where he developed deep learning models for client applications. James is based in San Francisco.
Mandeep Singh is a Solutions Architect at OpenAI who helps companies implement business solutions on top of OpenAI APIs. He is a tech-savvy leader known for delivering award-winning AI/ML solutions. Mandeep’s professional experience showcases multifaceted leadership skills in addressing customers' pressing business challenges. He has deep knowledge of AI technologies and the expertise to apply them to complex use cases, focusing on innovative solutions to meet customer needs. His achievements include recognitions such as the Cannes Gold Lion award for an AI solution to help prevent wildfires and the Salesforce Partner Innovation Award for an innovative multi-cloud technology solution.
This event focused on multi-agent orchestration, featuring the open-source framework Swarm for building scalable AI workflows. Experts James, Mandeep, and Ilan discussed best practices for implementing agents, starting with single-agent systems and scaling to multi-agent networks for complex, hierarchical tasks. Swarm emphasizes simplicity, transparency, and control, offering a lightweight, stateless approach for prototyping. The team shared examples like Klarna (single-agent success) and T-Mobile (multi-agent hierarchy) to highlight practical applications. A demo showcased Swarm’s ability to create autonomous research assistants that reduce human effort. The session ended with a Q&A, emphasizing evals, customization, and the evolving role of AI in automation.
Many of you know me, and Jen, really great to see you and the other new faces. If you don't, my name is Ben, and I'm on the human data team here at OpenAI.
Before we get started, a reminder that we are recording this event and will publish it afterward within the forum itself.
We also like to revisit OpenAI's mission before we begin: to ensure that artificial general intelligence, by which we mean highly autonomous systems that outperform humans at most economically valuable work, benefits all of humanity.
I'm excited for this event. We did one of these before, which I wasn't able to host, but many of you in the forum community have been asking to go deeper into the more advanced, technical side of working with OpenAI's tools and solutions, and to have time to ask questions and get answers, especially as you work on your own projects.
Today's focus is all about multi-agent orchestration, specifically Swarm, which, if you didn't know, is our experimental open-source framework designed for multi-agent workflows.
Here's how this is going to work: you'll hear from James, Mandeep, and Ilan about how all of this integrates custom GPTs into workflows, and how to create more robust, scalable AI-powered systems.
If you're not extremely technical, rest assured, there will be introductory material to get you up to speed. There's also great material for those who are already familiar with agent-based systems.
For the first 20 minutes, we'll have a presentation from James, Mandeep, and Ilan on multi-agent orchestration and the Swarm framework. I believe they've also put together a demo to follow the presentation.
The bulk of the time will be a Q&A session, office hours style. You can put questions into the Q&A tab, or raise your hand; think of it as a meeting space, so don't be shy.
A brief intro to James, Mandeep, and Ilan. James and Mandeep are solutions architects; Ilan is from the developer experience team. James works with some of our largest clients at OpenAI and led the development of Swarm, so I'd call him the OG knowledge person for this open-source orchestration framework, which is the key focus of today's session.
Mandeep is also a solutions architect with amazing experience; I think I even saw an award-winning solution in there, so I've got to hear more about that. His day-to-day role is partnering with clients to deliver on really complex AI needs. Lastly, Ilan, from the developer experience team, has been playing an instrumental role in pushing forward new API capabilities, which I know will be a huge point in this conversation.
We're really grateful all three of you are here. I'll leave it to the experts now and pass the mic to you all. Great, let me just share my screen one sec. I'll also say, you can think of Ilan and me as co-OGs on this; Ilan is equally an expert on Swarm. So let's get into it.
Everyone see my screen? Yes. Okay, cool. All right.
So thanks for coming, everyone. Super excited about this. As a brief overview of what we'll cover: I'll give an intro to agents and some common architectural patterns for building agentic systems, then pass to Ilan, who will talk more about multi-agent orchestration and Swarm in particular.
Mandeep will then give a really cool demo, and then we'll leave plenty of time for Q&A. So let's get started.
Okay. I want to start by level setting on what an agent is. It's a huge buzzword that gets thrown around a lot and has started to mean a lot of different things, so we want to set a clear definition of how we view agents at the simplest level.
Really, we define an agent as an LLM that has a specific prompt or set of instructions it follows, plus the ability to call tools to extend its capabilities.
A couple of vocab words will pop up in this presentation. First is model. I think that's pretty clear: it's the LLM the agent is using. That could be GPT-4o, GPT-4o mini, o1-preview, or a fine-tuned model.
Routine I'll cover on the next slide, but you can think of it as system instructions optimized for the LLM to follow a set of steps and stay on rails as much as possible. We'll talk about that in a bit more detail.
And then tools, or function calls, are what allow an LLM to go beyond what it can do in its base chat completions form. For example, you can give the model access to a tool that runs SQL queries and brings in customer data from your CRM, giving the LLM additional context it wouldn't otherwise have.
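To make that concrete, here's a minimal sketch of exposing a CRM lookup as a tool through the Chat Completions API. The tool name, its parameters, and the CRM query itself are illustrative assumptions, not something shown in the session.

```python
# Hypothetical CRM-lookup tool exposed via Chat Completions function calling.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "query_crm",  # hypothetical tool name
        "description": "Fetch a customer's account data from the CRM.",
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {"type": "string", "description": "The customer's ID."},
            },
            "required": ["customer_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What plan is customer 42 on?"}],
    tools=tools,
)

# If the model chose to call the tool, you'd execute the real CRM query here
# and send the result back in a follow-up "tool" message.
msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    print(call.function.name, call.function.arguments)
```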
When we talk about routines, you can think of them as LLM-optimized system prompts. A good way to explain why routines are necessary is to look at customer service.
Customer service is a really classic use case for agents and multi-agent systems. To create agents that behave the way a human customer service agent would, you first need to give them the ground truth of how a customer service agent acts. To do that, you start with all of your help documentation, your customer service guides, and your rules: when a customer asks about a certain topic, how do we help them? What steps do we take?
But a lot of our customers, and a lot of enterprises in general, have very different structures for their help documentation. Maybe it's user-facing, maybe it's agent-facing, or it's just a long, multi-paragraph stream of text that has been added to and tweaked over 10 to 20 years and isn't necessarily logical or structured in terms of here's where you start, here are all the steps in between, and here's where you end.
So we use a rewriting step, where we actually call an LLM to rewrite the initial documentation as a routine: a set of instructions that's really interpretable for an LLM to follow reliably.
In this example you can see step one, step two, step three, step four. That's just one way to structure a routine, but the idea is to create the clearest possible instructions so that the LLM performs as reliably as possible over time.
Typically we then store these in a knowledge base. For every sub-intent, every reason someone might contact a customer service agent, there's a specific set of instructions for that intent. So if I say I want to upgrade my internet plan, there's a specific routine the model follows in that case.
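As a rough sketch of that rewriting step and the per-intent knowledge base, something like the following could work; the model choice, prompt wording, and file paths are assumptions for illustration.

```python
# Sketch: rewrite raw help docs into numbered, LLM-friendly routines,
# then key each routine by sub-intent.
from openai import OpenAI

client = OpenAI()

def rewrite_as_routine(raw_doc: str) -> str:
    """Ask an LLM to turn free-form help documentation into numbered steps."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Rewrite the following help documentation as a numbered routine "
                "that an LLM agent can follow step by step, noting when to call tools."
            )},
            {"role": "user", "content": raw_doc},
        ],
    )
    return resp.choices[0].message.content

# One routine per sub-intent, stored in a simple knowledge base (paths are hypothetical).
routines = {
    "upgrade_internet_plan": rewrite_as_routine(open("docs/upgrades.md").read()),
    "cancel_service": rewrite_as_routine(open("docs/cancellations.md").read()),
}
```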
I want to quickly cover some common architectures we see when building with agents. The first is very simple: a single agent. We actually recommend most people start here, because it can be sufficient in a lot of cases.
Say you have an agent that books appointments, the expert on all things appointment-related. You give it a routine: you're an appointments agent, here to assist users with scheduling, changing, or canceling appointments. Then you give the agent several tools it can use. That's really all you need for a single-agent system.
The appointments agent can iterate back and forth with the user, ask follow-up questions, and decide when it needs to call a tool. You run the tool in the backend, confirm it ran successfully, tell the user, hey, I updated your appointment, and that can be the entirety of your product.
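As a concrete example, here's roughly what that single-agent setup looks like in Swarm (which Ilan walks through later); the appointment functions are hypothetical stubs standing in for real backend calls.

```python
from swarm import Swarm, Agent

def schedule_appointment(date: str, time: str) -> str:
    """Stub: call your real scheduling backend here."""
    return f"Appointment booked for {date} at {time}."

def cancel_appointment(appointment_id: str) -> str:
    """Stub: cancel in your real scheduling backend here."""
    return f"Appointment {appointment_id} canceled."

appointments_agent = Agent(
    name="Appointments Agent",
    instructions=(
        "You are an appointments agent. Assist users with scheduling, "
        "changing, or canceling appointments. Confirm details before acting."
    ),
    functions=[schedule_appointment, cancel_appointment],
)

client = Swarm()
response = client.run(
    agent=appointments_agent,
    messages=[{"role": "user", "content": "Book me for Friday at 3pm."}],
)
print(response.messages[-1]["content"])
```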
Generally, if you can achieve everything you want with a single agent, stop there. You don't need to build a network of assistants just because you read a LinkedIn post about it, or it seems cool, or it mirrors your company's org chart.
We've seen a lot of success with one agent that has a lot of available tools and pretty complex instructions but works really reliably. That's a great place to start, and you don't necessarily need to go further.
However, if there are too many different types of instructions and too many tools, and the LLM is getting confused about when to call which tool and isn't following the instructions reliably, it may make sense to build a network of agents.
When we talk about a network of agents, the fundamental idea is that the user sends some sort of message and you do an initial triage.
You have a specific agent whose only objective is to delegate that user message to the right sub-assistant, which then follows its own specific set of instructions.
Thinking back to appointments: say there's one sub-assistant that schedules appointments, and I tell the triage agent, hey, I want to change my appointment.
The triage agent's instructions are to classify the message into one of several potential intents. It selects the appointment agent and passes the request along, and that sub-agent has its own routine and its own tools. You can build a complex hierarchy with multiple nested assistants, which is really helpful because sometimes that hierarchy improves overall performance.
The common example here is, let's say you have a hundred or even thousands of potential reasons why someone is messaging you or calling into your help center or whatever it may be.
If a single agent has to select among a thousand potential intent categories, the odds of that being reliable are pretty low. But you can drastically increase reliability by first triaging to a broader intent category, and then, from within that broader category, drilling into the specific sub-intent.
An example of this is a customer service bot. I know we talk a lot about customer service; I promise there are other things you can do with multi-agents as well. But for customer service, you could use a small model like GPT-4o mini to triage, say, in this simple case, between disputes and feedback, the two main topics you care about. You first decide: are we talking about disputes or about feedback? Once you're there, you drill into the specific type of dispute we need to pass the request along to. You can have multiple nested agents that you feed requests through, and each of these agents is its own model with its own specific instructions and tools.
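A minimal sketch of that nested triage pattern in Swarm might look like this; the agents, intents, and transfer functions are illustrative.

```python
from swarm import Swarm, Agent

disputes_agent = Agent(
    name="Disputes Agent",
    instructions="Handle billing disputes. Gather the charge details first.",
)
feedback_agent = Agent(
    name="Feedback Agent",
    instructions="Collect product feedback and thank the user.",
)

# Handoffs in Swarm are just functions that return another Agent.
def transfer_to_disputes() -> Agent:
    return disputes_agent

def transfer_to_feedback() -> Agent:
    return feedback_agent

triage_agent = Agent(
    name="Triage Agent",
    model="gpt-4o-mini",  # a small model is often enough for routing
    instructions="Classify the request as a dispute or feedback, then transfer.",
    functions=[transfer_to_disputes, transfer_to_feedback],
)

client = Swarm()
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I was double-charged last month."}],
)
print(response.agent.name)  # e.g. "Disputes Agent"
```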
So, generally, when to use a network of agents: start with a single agent. A network of agents is better for complex tasks where the hierarchy really makes a difference. And fundamentally, we recommend you base everything on your evaluations. Build an eval set of sample inputs, the desired outputs, and the series of actions the agent should take, and then base your entire architecture on how your evals are performing.

From a high level, a network of agents is good when there needs to be some sort of hierarchy, because without one the LLM gets confused or lost. It's also good for scalability: if you have tons of tools and tons of complex instructions, it's nice to create segments where an agent that does one specific thing doesn't even need to know about tools that aren't relevant to what it's handling. In a complex customer service system, if I'm handling refunds, you don't need to give me access to tools that check order status. It helps narrow the scope, keep things specific, and reduce the chance that the LLM accidentally calls the wrong tool.
Okay, I'm going to go quickly through customer stories. I know I'm moving fast; there's a lot of content, Ilan and Mandeep have a lot of content, and we want to leave plenty of time for Q&A, which I know is what you're mostly here for.

Two customer examples. The first is Klarna, a pretty public example of a customer we built these agentic systems with for customer service. They're now doing the work of 700 full-time agents, all automated, all using our models. The important takeaway from Klarna is that they really used a single-agent approach. What they did that was super impressive, and what a lot of our work with them was optimizing, was one triage agent that does a really, really robust retrieval of the correct next routine. We embedded all of the potential routines; the initial triage agent would gather one or two turns' worth of context from the user about what they wanted, and then, using that user input plus additional customer data and data Klarna had on recent interactions, we could do a really precise retrieval of the exact sub-intent. So Klarna shows that the single-agent architecture can work really well: they had hundreds, now probably thousands, of routines and were able to select the right routine reliably early on, without a nested hierarchy.

An example where we found that a network of agents works much better is T-Mobile.
We recently announced that we've been building IntentCX with T-Mobile for the past six to eight months. They have a massive scale of tickets raised every single day across chat and voice interactions. What we've done is build a more explicit hierarchy: first triage to an intent category, and then, once the scope is narrowed, select the actual specific sub-intent. This is a good transition, because T-Mobile was built using Swarm as the key underpinning of how the network of agents works.
All of the general philosophy around how we think about building these agents, following routines, and handing off between agents, all of T-Mobile's work was based on Swarm. Ilan is going to talk more about multi-agent orchestration in general and the concepts Swarm is meant to illustrate. As we mentioned, Swarm is not a production framework, but it's illustrative of what we think are the best practices for building. So I'll pass it to you, Ilan.
Yeah, for sure. In the interest of time, I'm going to skip probably half my slides, so let's click through a couple more times there. One more. Yeah. James covered most of this: multi-agent orchestration is really a last-resort tool, after you've tried a single prompt with tools and that has failed. There are a lot of frameworks you can use, and a lot of them are quite complicated, or the abstractions require learning many steps, a steep learning curve. Fundamentally, it's important to understand that orchestration is just about separating concerns and choosing useful patterns.
Next slide. These are a couple of the patterns; James covered a few of them. Next slide. The one we're going to focus on for Swarm is handoffs. Specifically, in our case, that means swapping out the system prompt during a conversation so that the agent inherits a new set of instructions, a new routine.

Next slide. Let's talk about Swarm specifically. Swarm is an exploration of the simplest interface we could achieve that can represent real-world use cases using handoffs. It was motivated by Klarna and by T-Mobile, trying to boil down the ideas we used when working with these companies to launch to production into the simplest possible form. The first time around is always more complicated, because it's discovery, trying a lot of different approaches; Swarm is a distillation of those ideas.

It's meant to be practical and transparent: there are no hidden prompts anywhere in the code. It's completely stateless; the entire history is represented in the messages, and the messages are always valid chat completions messages. It's simple, a thin layer over chat completions and function calls, so it will feel very familiar. Python functions are treated as first-class objects, instead of you having to deal with schemas and parsing. And, very importantly, it's controllable. This was a key thing we found when working with production systems: a lot of agent frameworks are very powerful but potentially less reliable, and it's a little harder to get them to do exactly what you want. By keeping Swarm very close to the metal, we give developers the maximum amount of control.
Now, it's important to call out, and this has already been said: Swarm is not an official framework. It's mostly an exploration, great for prototyping and for inspiring your approach. We encourage you to fork it if you want and integrate the ideas into your products.

Let's take a look at what it actually looks like. This won't be a lot of code; part of the magic of Swarm is how little code it is. This is how you import it. Next slide. Now we can just define a simple send_email function. Here is where you would call your SMTP server or whatever the email protocol is (I forget the exact protocols); I don't have that here. The only important thing is that you declare your parameters and return some string. Next slide. Then we define the agent, and as you can see, it's dead simple to give it access to this tool: you literally just pass it the function directly. Next slide. Then you just call client.run. You pass it the agent, and you pass it any valid messages object; anything that works with chat completions will also work in Swarm.

Next slide. Here are my favorite features of Swarm: the fact that you can just declare plain functions is a super nice feature, agent definitions are extremely simple, and all input and output messages are valid chat completions messages, so it's interoperable with chat completions. Next slide. If we look at what Swarm looks like when doing handoffs, it's actually quite simple: a handoff is just another function that returns an agent. Any function, regardless of its name or what else it does in its body, if it returns an agent, causes Swarm to swap the system prompt and tools to those of the new agent. So in this case you'd start talking to agent A, ask to talk to agent B, be swapped over, and continue the conversation with agent B. That's a quick overview; I'll hand it over to Mandeep now for a more fully fledged demo, but this shows the simplicity of it.

Thank you. Let me go ahead and share my screen. Okay, got a thumbs up. Okay, great. We're going to demo a swarm of agents in action, using the example of an autonomous research assistant. What this research assistant does, as the name states, is perform autonomous internet research. You give it a prompt saying you want a report on a certain topic, and it spins up a swarm of agents with specific tasks: search the internet for a specific term, build a data dictionary, then produce a report with citations. As you'll see in the demo, it can potentially reduce the effort a human would need to author a report from days to a few minutes. Without further ado, here's what the interface looks like. I've pre-populated it with a sample prompt: write a report for a C-suite leader on how generative AI can be useful for upskilling their workforce. On the left-hand side, you see the flow of the solution, a multi-agent swarm system; each of these process steps represents one or more agents.
It begins with expanding the query, which is essentially laying out a plan for how we should go about the research and how we should author the report. In the interest of time, I'll hit submit.
On the question of handling PII: a common approach is to implement a layer before sending data to the LLM that automatically redacts or anonymizes sensitive information, such as social security numbers. This helps ensure that no PII is passed on to the model. Tools are available to assist with this, identifying sensitive data and substituting it with placeholder identifiers, so the data can be processed without exposing any PII to the model. On HIPAA compliance for integrating agentic workflows into healthcare systems: it's essential that the technology stack and processes adhere to HIPAA regulations. That can involve safeguards such as data encryption, access controls, audit trails, and secure data-handling practices to protect patient information. By following best practices and consulting experts in healthcare data security, it's possible to develop HIPAA-compliant agentic workflows for healthcare applications.
On the follow-up about healthcare systems, is there any way to make the tech stack HIPAA compliant? I think that's a layer before the LLM, similar to what I described: you have your PII and your restrictions on where the PII can reside, and the key is to make sure you don't pass that type of PII data into the prompts or to the model. So you have an orchestration layer, using procedural programming, before you send the API call to the agent: switch the data out before you send it, and switch it back on the response. Yeah. And also, we do have a zero data retention offering. It's not offered out of the gate; you have to have a convincing reason for it, but HIPAA compliance is usually one, and zero data retention helps with HIPAA compliance. And because Swarm is stateless, we're able to not retain any data. Typically we retain data for 30 days, just for logging, for safety and fraud purposes, but for special occasions we can give you zero data retention, where we don't log anything at all, and that may fall into the HIPAA-compliant world.
A question from Wu Chong, who is a researcher: how do you iterate on prompt improvements systematically? I can take that one; it's a little bit different. I worked on the prompt generation and improvement tools we have in the playground. In short, there are a few different ways, and if you're asking this question you're probably aware of DSPy and other similar approaches. I don't think anything is quite at a usable state yet beyond research settings. The short answer is evaluations: the best way to improve your prompts is either by hand or using these prompt tools, but you won't really know if they're better unless you have a good indicator for when they're improving. So that's my short answer. Have at least three test cases; three is more than most people have, but the more the merrier. Aim for 50-ish, but start with three.

Oh, sorry, I've been reading the chat and clustering questions, and I saw another one I want to answer, about when you should use Swarm versus LangChain versus other frameworks, and then something about prompts, multi-agent setups, and fast response speed.

So the answer here is: Swarm is really fun to prototype with; I feel like James and Mandeep can attest to this. It's lightweight, it's easy. It was not built for production, and we don't have plans to ship it; this is currently its final state. It's more of an exploration, an instantiation of patterns we've learned, which you can copy, fork, and reproduce in your own codebases if you so please. I believe Coinbase forked Swarm and made their own little system that people could use. So if you want to fork it and make it feature-complete, by all means.

When to use other systems is very much up to your use case. My personal feeling is that a lot of the systems have a lot of complexity and the actual selling point is a little unclear: there may be really powerful use cases you'd want agents for, but they're less reliable. There's really a scale with reliability on one end and autonomy and agency on the other, and depending on how forgiving your users are, or how resilient to failure your application is, that tells you where you want to land. If you're doing something like an interactive helper chatbot, some failure rate is pretty acceptable; you can iterate, so you can try more exploratory things with a more complex framework. But if you're doing something like a workflow, say research as part of a production pipeline, you want reliability over expressibility. That's how I think about it.

And there was one question about speed, whether multiple agents can ever make things faster. I recommend checking out the latency optimization guide in our docs, which covers that sort of thing.
The short answer is yes, if they're smaller models.

I think transparency is one of the key factors in deciding what to use. What I like about Swarm is how lightweight and transparent it is: you have control over how you orchestrate the agents and how you pass information between them. Some of these other third-party frameworks can help you get started quickly, but they're sort of a black box where you don't know what's going on inside. Transparency is what I like a lot about Swarm.

Yeah. Like Ilan mentioned, it's not meant for production, but we encourage you to fork it, and a bunch of customers are forking it and building things on top of it. The idea was never to compete in any way with frameworks that offer things like retrieval and built-in memory. Swarm is kind of a bet that keeping things more bare-metal and closer to the model is where things are going to move: relying on the model's own intelligence, with function calling as the means of passing between assistants and deciding the next step, instead of some third-party planner that does things for you. We're generally in the camp of thinking that as models continue to get better and smarter, better at function calling and better at instruction following, something like Swarm will end up being largely sufficient for a lot of these complex multi-agent tasks.

Yeah, and to build off that, a lot of people seem to conflate orchestration and autonomy, and I feel like they're two totally different things. I'd actually say orchestration makes more sense with less autonomy. In cases with higher levels of autonomy, orchestration does a decent job right now, but for much more open-ended cases, that tends to succumb to deep learning. If you're familiar with the bitter lesson: deep learning always wins. If we're trying to emulate complex planning scenarios right now, a lot of frameworks involve very handcrafted techniques, very crafted, brittle systems that are very complex and might work really well in certain scenarios, but are inherently brittle, whereas deep learning is very, very flexible. Someday we might have an end-to-end system that can do what that brittle system could only do sometimes. An example of this is o1 with reasoning: there were all these ways to elicit reasoning chains from models, and some worked in different scenarios, but it turns out one of the best approaches is, once again, deep learning. I was personally in the camp of trying as many prompting techniques as possible, and I was surprised when o1 performed so well. That was my bitter lesson. A huge plus one to that.
The other thing I think is interesting: in a lot of ways, the ideal end state of multi-agent orchestration is just one model with tools. If you had a model that reliably follows instructions, you could put all the instructions for how your entire business runs into the prompt, give it access to all of the tools, and it could intelligently follow the instructions to a T and call the right function at the right time. Sure, there's a trade-off in additional token costs, but those will come down over time; input tokens won't be a huge factor, and you can call different tools and use different models at the right time, but it all comes down to one LLM. That's the future state. I don't know when it will happen, but I think erring on the side of simplicity and relying on the model's intelligence is a relatively safe bet, given the speed of progress recently.

Great. A question from Spencer Bentley: will agentic needs have to be introduced into the training and fine-tuning stages for models in the future? It's a good question, and I'd say the answer is probably yes. We don't have a lot of insight into exactly what research is doing, but I think we can assume that going forward we'll automate more and more parts of general engineering, training, and research, and that will include agentic behavior. Everyone internally already uses our models, and I think we can assume agentic behavior will become the new default for how LLMs are used.

Great. And another from Wu Chong: when developing a routine, how do you identify the prime factors that influence performance? I guess we can all answer this one. The question is very broad; so many different things impact performance. The more important question is: how are you going to figure out what matters for the case you're working on? There's no single answer that covers all cases, but there is an approach that covers all cases: define your user stories as evals, representing where you want the model to succeed, so you can run them and see where the model isn't succeeding. Three different companies might ask me the same question and I'll give them three different responses, and I'm sure James and Mandeep would say the same.
Yeah, like Ilan said, everything comes back to evals. That's where every conversation ends: you just need really good evals. Typically that means evals that are comprehensive of all the different inputs you could expect, and also comprehensive of all the different customer personas you could expect.
As for the actual routine, the piece of the routine that matters most: it's kind of an art of iteration, but you can turn that art into a science by having a fixed set of evals you compare each iteration against. So I'd say use something like Generate Anything, the playground prompt generation feature that Ilan worked on.
Or just trial and error, tweaking parts of your routine, but make sure you have that scientific backing of actual performance over your eval set.
I actually had a customer ask me today: what's the secret sauce for building an efficient LLM-based system? I'd echo what James said: evals are potentially the secret sauce, if there is one, to building efficient LLM-based systems.
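To illustrate the kind of fixed eval set being described, here's a bare-bones sketch; the test cases and the run_agent helper are placeholders for your own system.

```python
# A tiny eval harness: score every prompt/routine iteration against the
# same fixed cases so you can compare iterations scientifically.
test_cases = [
    {"input": "I want to change my appointment", "expected_intent": "appointments"},
    {"input": "My bill looks wrong", "expected_intent": "disputes"},
    {"input": "You folks are great!", "expected_intent": "feedback"},
]

def run_agent(user_input: str) -> str:
    """Placeholder: call your triage agent and return the intent it chose."""
    raise NotImplementedError

def accuracy(cases) -> float:
    hits = sum(run_agent(c["input"]) == c["expected_intent"] for c in cases)
    return hits / len(cases)

# After each routine tweak, re-run: print(f"accuracy: {accuracy(test_cases):.0%}")
```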
Great. And one from Cassandra: I heard of a demo where an AI agent exhibited random behavior, like spending time on a website not related to the task. Any thoughts on this kind of behavior and how to avoid it?
Yeah. In the autonomous research agent, the routine provided to the agents conducting the research is basically: retrieve the data from a set of pages. And there's a sort of try-catch block in Python: if a page isn't responding, or the agent hasn't come back with a response within a reasonable timeframe, say two or three seconds, just exit that loop and go to the next page. So you control a bit of that with procedural programming as a layer above the LLM call.
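A minimal sketch of that guardrail, assuming a simple requests-based fetch loop (the demo's actual implementation wasn't shown):

```python
import requests

def fetch_pages(urls: list[str], timeout_s: float = 3.0) -> dict[str, str]:
    """Fetch each page, skipping any that error out or exceed the timeout."""
    results = {}
    for url in urls:
        try:
            resp = requests.get(url, timeout=timeout_s)
            resp.raise_for_status()
            results[url] = resp.text
        except requests.RequestException:
            # Page didn't respond in time (or errored): move on to the next one.
            continue
    return results
```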
Great. Mustafa Akben asked: are there ways to make the model write its own Swarm agents for new sets of tasks, and then add those to the function list to improve the agentic network over time? Sure, yeah, absolutely. You can definitely use Swarm for that: it can be a research assistant like Mandeep showed, or it can be a programming assistant, and you can have it write other agents.
The only suggestion I'd make is: don't run it before you test it. I wouldn't have it create an agent and then start running that agent in real time without verifying first that it does what you want. If you're actually going to do this, either give it a few-shot examples of how Swarm works and set up a sandboxed execution environment, or, oh man, use context variables. We built in context variables, so what you could do is make the entire prompt for an agent a context variable, which can be set from within a function. You could let one agent set context variables and then transfer to those agents after having defined their prompts. I've never tried that, but you could do it.
Ilan, do you want to give a 30-second TLDR on context variables?
Yeah. Often, context variables are things you put in the system prompt, in the context. In Swarm, they don't live anywhere in the context; the models don't see these variables by default. They're just registers for you to carry variables around for your convenience.
The reason they're useful is that you can set them from within function calls and then use them in other function calls and in an agent's instructions. For example, you could define the user's name as John, and then wherever your prompt references the name variable, it resolves to John. But you can also use them within function calls. Specifically: say you have an agent that calls a function, and that function sets a context variable by returning it with a value. Say it sets variable A; you could then define another agent whose entire instructions are just variable A. And the function's input could be a prompt, so the model calls the function with the prompt it wants the next agent to have, that gets set as a context variable, and you can also transfer to the next agent in the same function.
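Here's a small sketch of that pattern, assuming Swarm's Result type for setting context variables from inside a function (check the repo for the exact import path):

```python
from swarm import Swarm, Agent
from swarm.types import Result  # assumption: Result lives in swarm.types

def set_prompt(prompt: str) -> Result:
    """Agent A writes a prompt, stores it as a context variable, and hands off."""
    return Result(
        value="Prompt saved.",
        context_variables={"generated_prompt": prompt},
        agent=agent_b,  # transfer in the same function call
    )

def agent_b_instructions(context_variables: dict) -> str:
    # Agent B's entire system prompt is whatever Agent A wrote.
    return context_variables.get("generated_prompt", "You are a helpful agent.")

agent_b = Agent(name="Agent B", instructions=agent_b_instructions)
agent_a = Agent(
    name="Agent A",
    instructions="Write a system prompt for the next agent, then call set_prompt.",
    functions=[set_prompt],
)

client = Swarm()
response = client.run(
    agent=agent_a,
    messages=[{"role": "user", "content": "Make the next agent speak Spanish."}],
)
```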
This is a little abstract, but you should poke around with it. And I think that gives some context on the next question, which is about improving over time by learning.
So Suvankar Datta asks: can the agent improve over time by learning? My use case is automatic radiology report generation, in the style of the radiologist, from dictated keywords. The final output has to follow a structured format.
Yeah, I can take that one. For the demo we just showed, you could provide a report format: have an agent take the output from the final agent in the flow and format it to the specific format you're looking for in the radiology report. We also have evals and model distillation, where you can store your evals with OpenAI and it can help you distill a model that's better at the specific task or format you need for this report.
So there are a bunch of techniques, but there isn't unsupervised learning for agents as the technology stands today, if that's the question.
It's human intervention: you'll have to structure it so that you collect the evals, adapt your model, and distill it to your specifications.
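As one option for the structured-format requirement, the Chat Completions structured-outputs feature can enforce a JSON schema on the final formatting step; the report fields below are invented for illustration, not a clinical standard.

```python
from openai import OpenAI

client = OpenAI()

report_schema = {
    "name": "radiology_report",
    "schema": {
        "type": "object",
        "properties": {
            "findings": {"type": "string"},
            "impression": {"type": "string"},
        },
        "required": ["findings", "impression"],
        "additionalProperties": False,
    },
    "strict": True,
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Draft a radiology report from dictated keywords."},
        {"role": "user", "content": "chest x-ray, clear lungs, no effusion"},
    ],
    response_format={"type": "json_schema", "json_schema": report_schema},
)
print(response.choices[0].message.content)  # JSON conforming to the schema
```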
Great. We probably have time for one more question, so I'll ask a question from Aditya Raj, who's an AI researcher: how can we enhance agents to autonomously adapt to unfamiliar tasks without direct programming intervention?
I think generally this is where we rely on the model's inherent intelligence and decision-making. If you have the tools available to solve the task, and an agent with broad instructions, just: help resolve the user's problem, or answer the question to the best of your ability, here are the tools available to you, that's generally the best way to handle completely out-of-the-blue questions, because the model can look at the tools and decide whether it has the ability to answer.
Generally, what we've found is that that's rare; it's not common to have totally out-of-the-blue questions with absolutely no prior expectation of what the user is going to ask.
But if it is a totally blue-sky situation, you just have to rely on the LLM to make the right choice.
Amazing. We just had one come in, so I think we have enough time. Vishal Singh, a professor at NYU, asks: can the Assistants API be stitched together in a Swarm-like setup to have a team of Assistants?
The short answer is yes. I went to NYU as well, by the way. Swarm is very highly configurable: instead of calling the chat completions API, you could call the Assistants API and treat the message dictionary as a thread.
Yeah, I'd say that by default, Swarm agents are different from Assistants API Assistants, but as Mandeep said, you can configure it so you can link them together.
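One rough way to stitch Assistants together in a Swarm-like setup is a hand-rolled router over a shared thread; the assistants and the routing rule here are illustrative assumptions, not a supported pattern.

```python
from openai import OpenAI

client = OpenAI()

triage = client.beta.assistants.create(
    name="Triage", model="gpt-4o-mini",
    instructions="Reply with only 'appointments' or 'billing' to route the request.",
)
appointments = client.beta.assistants.create(
    name="Appointments", model="gpt-4o",
    instructions="Help the user schedule, change, or cancel appointments.",
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="I need to move my appointment."
)

# Run triage on the thread, read its routing decision, then run the chosen
# assistant on the same thread so it inherits the full conversation.
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=triage.id)
decision = client.beta.threads.messages.list(thread_id=thread.id).data[0].content[0].text.value
chosen = appointments if "appointments" in decision else triage
client.beta.threads.runs.create_and_poll(thread_id=thread.id, assistant_id=chosen.id)
```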
All right. Well, thank you, all three of you. This has been a very meaty discussion. There were a ton of questions; we answered a lot of them, and there was a lot of overlap. If your question wasn't answered, feel free to DM me, and I'm happy to share it with the team. We're continuing to organize these, so be on the lookout for additional office hours for your needs.
Before we let you all go, I want to put two events on your radar. Next week, we're hosting Jake Cook, who is faculty at Harvard Business School, going from agents today to how some of these use cases can be applied in higher education. Don't miss it; it's going to be great.
Then on December 3rd, we have an event I don't think you'll want to miss: Terence Tao, the world-renowned mathematician, paired with OpenAI's VP of Research, Mark Chen. They'll be talking about all things AI's role in scientific discovery, reasoning, and a lot more. I'm really stoked for this one.
And we love you all in the forum, and we also love your referrals. My colleague is going to drop the referral link, so if you have friends or colleagues who you think would be amazing contributors to the OpenAI forum and would appreciate the types of events we put on, feel free to share it with them. We'd love to bring them into the community. That's all we have. Thank you again, everyone, for coming.
So thank you.
So sorry, quick question.
Yeah. Can I still share my screen? I just wrote up some of the person's idea, and I think it works.
I think so. Yeah, go ahead.
Let's see if this works. Can you guys see this?
I see it, yeah.
Yeah, OK. So this is what I was referring to. Let's see, where is the copy-relative-path option? I think this will just work.
OK, so we run the Python test script.
Okay, can you guys see this? I think this will work. If I say: write a prompt so the agent speaks Spanish. Ah, look at that, it worked. It wrote a prompt so that the agent speaks Spanish, and this is the new prompt it's using. So, good idea; that was interesting.
Well, gracias por tu presentación. Claro que sí.
Okay, well thank you everyone and I'll see you all at the next event.