MEETING
Virtual Event: Weak-to-Strong Generalization, Presented by the OpenAI Superalignment Team

About the Talk:

Collin Burns and Pavel Izmailov present their research, Weak-to-Strong Generalization

Here's the entire paper.

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
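
The abstract sketches the paper's auxiliary confidence loss: mix the cross-entropy against the weak supervisor's labels with a term that reinforces the strong model's own hardened predictions. As a rough illustration only, here is a minimal PyTorch sketch; the function name, the fixed alpha mixing weight, and the simple argmax hardening are assumptions made for brevity, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def confidence_loss(strong_logits, weak_labels, alpha=0.5):
    # Auxiliary confidence loss (illustrative sketch, not the paper's code).
    # strong_logits: (batch, num_classes) raw outputs of the strong model
    # weak_labels:   (batch,) hard labels produced by the weak supervisor
    # alpha:         weight on the self-confidence term (hypothetical default)

    # Standard cross-entropy against the weak supervisor's labels.
    weak_ce = F.cross_entropy(strong_logits, weak_labels)

    # Harden the strong model's own predictions and treat them as targets,
    # encouraging confident predictions even where they disagree with the
    # weak labels. argmax yields an integer tensor, so no gradient flows
    # through the targets themselves.
    self_targets = strong_logits.argmax(dim=-1)
    self_ce = F.cross_entropy(strong_logits, self_targets)

    return (1 - alpha) * weak_ce + alpha * self_ce

For example, with a batch of 4 examples over 2 classes, confidence_loss(torch.randn(4, 2, requires_grad=True), torch.tensor([0, 1, 1, 0])) returns a scalar that can be backpropagated through the strong model. Note that the paper hardens predictions with an adaptively set threshold rather than the plain argmax used in this sketch.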

About the Host:

Logan is OpenAI’s first Developer Advocate and DevRel hire. He is building the developer relations function from the ground up and supports developers building with ChatGPT, DALL-E, GPT-4, the OpenAI API, and more.

About the Speakers:

Collin is a researcher at OpenAI working on aligning superhuman models. Before joining OpenAI, he was a PhD student at Berkeley. His research interests include (1) studying extreme forms of "weak-to-strong" generalization, (2) developing unsupervised methods for making language models honest, and (3) understanding when and how high-level abstractions are encoded in representations.

Pavel is a Research Scientist at OpenAI working on reasoning. Previously, he worked on the Superalignment team, focusing on weak-to-strong generalization and interpretability. In 2023, he defended his PhD in Computer Science at NYU under the supervision of Andrew Gordon Wilson. Pavel’s primary interest lies in understanding and improving deep neural networks. His research interests include out-of-distribution generalization, AI alignment, reasoning, probabilistic deep learning, representation learning, and other related topics. Recently, his work on Bayesian model selection was recognized with an outstanding paper award at ICML 2022.

Speakers
Collin Burns
Superalignment Researcher @ OpenAI
Pavel Izmailov
Superalignment Researcher @ OpenAI
Logan Kilpatrick
Developer Relations @ OpenAI
Agenda
1:00 AM - 1:40 AM, GMT+1
Stage 1
Presentation
Collin Burns and Pavel Izmailov present their research, Weak-to-Strong Generalization
Collin Burns
Pavel Izmailov
Logan Kilpatrick
1:40 AM - 2:20 AM, GMT+1
Stage 1
Q&A
Forum audience members will have the opportunity to ask Burns and Izmailov questions.
Event has finished
February 23, 1:00 AM, GMT
Online
Organized by
OpenAI Forum