
Red Teaming Language Models with Language Models

Ethan Perez et al.

DeepMind, February 7, 2022

Abstract

Language Models (LMs) often cannot be deployed because of their potential to harm users in ways that are hard to predict in advance. Prior work identifies harmful behaviors before deployment by using human annotators to hand-write test cases.
