Problem area report: pain
Happier Lives Institute, 2020
Abstract
Redwood Research attempts to prevent a fine-tuned language model from describing someone getting injured. To achieve this, they train a classifier to predict whether a completion involves injury, then use it as a filter so that only safe completions are returned. However, this filtering method is inefficient if the unfiltered model is likely to generate harmful content, since many completions must be discarded before a safe one is found. The authors detail other issues with this method, then discuss plans to improve the classifier and to train a new model that generates exclusively safe content. – AI-generated abstract.
