High-stakes alignment via adversarial training [Redwood Research Report]

Daniel M. Ziegler, Lawrence Chan, and Nate Thomas

AI Alignment Forum, May 4, 2022

Abstract

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.) …
