February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback

Daniel Paleka

AI Safety Takes

Abstract

Better version of the monthly Twitter thread. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models Security flaws in LMs with API calling capabilities. Prompt injections are actually dangerous when the user doesn’t control all the context.

February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback

Abstract

PDF