Artificial Intelligence: Approaches To Safety
Philosophy Compass, vol. 20, no. 5, 2025, pp. e70039
Abstract
AI safety is an interdisciplinary field focused on mitigating the harms caused by AI systems. We review a range of research directions in AI safety, focusing on those to which philosophers have made or are in a position to make the most significant contributions. These include ethical AI, which seeks to instill human goals, values, and ethical principles into artificial systems; scalable oversight, which seeks to develop methods for supervising the activity of artificial systems even when they become significantly more capable than their human designers; interpretability, which seeks to render comprehensible the workings of complex machine learning models; and corrigibility, which seeks to discover ways to ensure that powerful AI systems will not resist being shut down or modified by humans.
