Janus' simulators
Astral Codex Ten, January 26, 2023
Abstract
Large language models (LLMs) like GPT function not as goal-directed agents, order-following genies, or truth-seeking oracles, but as “simulators.” These systems excel at predicting text by adopting various “masks” or characters based on their prompts and training data, effectively simulating the style and content appropriate to a given context. Even models refined with Reinforcement Learning from Human Feedback (RLHF) to be helpful and harmless remain simulators at bottom, fixed into consistently playing a specific persona, such as a “Helpful, Harmless, and Honest Assistant.” This means their self-descriptions are part of the simulation. While pure simulators might be less inherently agentic, and thus pose alignment challenges different from those of traditionally conceived AI agents, risks can still emerge if they simulate misaligned agents or generate harmful agentic outputs. This concept of a simulator wearing a mask fine-tuned by reinforcement extends to human psychology, where individuals might be viewed as predictive engines that develop an “ego” or “self” as a character played over a vast internal world-model. – AI-generated abstract.
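The simulator framing above — one predictive model, many "masks" selected by context — can be sketched with a toy bigram model. This is purely illustrative (the corpus and function names are invented for this sketch; real LLMs use neural networks conditioning on long contexts), but it shows the core mechanism: the same model produces different continuations depending solely on the prompt it is conditioned on.

```python
from collections import Counter, defaultdict

# A tiny toy corpus mixing two "characters" (hypothetical example text).
corpus = (
    "the pirate says arr matey . "
    "the pirate says avast ye . "
    "the scientist says hypothesis confirmed . "
    "the scientist says data collected . "
).split()

# Count bigram statistics: for each token, which tokens follow it and how often.
bigrams = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    bigrams[a][b] += 1

def predict(context_word):
    """Return the most likely next token given the last token of the prompt."""
    counts = bigrams[context_word]
    return counts.most_common(1)[0][0] if counts else None

# One model, different masks: the prediction is a function of the context,
# not of any fixed goal the model holds.
print(predict("arr"))         # -> "matey"      (pirate mask)
print(predict("hypothesis"))  # -> "confirmed"  (scientist mask)
```

In this sketch the model has no goals of its own; it simply continues whatever pattern the prompt establishes, which is the sense in which a simulator "wears" whichever character the context implies.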
