High-stakes alignment via adversarial training [Redwood Research Report]
AI Alignment Forum, May 4, 2022
Abstract
(Update: We think the tone of this post was overly positive considering our somewhat weak results. You can read our latest post with more takeaways and follow-up results here.) …
