High-stakes alignment via adversarial training [Redwood Research Report]

Daniel M. Ziegler, Lawrence Chan, and Nate Thomas

AI Alignment Forum, May 4, 2022

Abstract

(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.) …
