
Problem area report: pain

Sid Sharma, Clare Donaldson, and Michael Plant

Happier Lives Institute, 2020

Abstract

Redwood Research attempts to prevent a fine-tuned language model from describing someone getting injured. To achieve this, they train a classifier to predict whether a completion involves injury. Using this classifier as a filter, they generate safe completions. However, this method is inefficient if the unfiltered model is unlikely to generate harmful content. The authors detail other issues with this method, then discuss plans to improve the classifier and train a new model that can generate exclusively safe content. – AI-generated abstract.
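The method this abstract describes is classifier-filtered generation, a form of rejection sampling: draw completions from the language model and return the first one the classifier judges safe. Below is a minimal sketch of the idea in Python; the model names, the NEGATIVE-label proxy for "involves injury", the threshold, and the sampling budget are all illustrative stand-ins, not details from the original project.

from transformers import pipeline

# Public stand-in models, not the ones from the project described above:
# a generic generator, and a generic binary text classifier playing the
# role of the injury classifier.
generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

THRESHOLD = 0.5  # hypothetical cutoff on the classifier's confidence
MAX_TRIES = 20   # hypothetical sampling budget

def filtered_completion(prompt: str) -> str:
    """Rejection sampling: resample until the classifier passes a completion."""
    for _ in range(MAX_TRIES):
        text = generator(prompt, max_new_tokens=40, do_sample=True)[0]["generated_text"]
        completion = text[len(prompt):]  # the pipeline returns prompt + continuation
        verdict = classifier(completion)[0]
        # In this toy setup, a confident NEGATIVE label stands in for
        # "this completion involves injury" and triggers a resample.
        if not (verdict["label"] == "NEGATIVE" and verdict["score"] > THRESHOLD):
            return completion
    raise RuntimeError("no completion passed the filter within the sampling budget")

print(filtered_completion("The knight raised his sword and"))

The sketch also makes the abstract's efficiency point concrete: every candidate pays for a full generation plus a classifier pass, overhead that buys little when the unfiltered model rarely produces harmful completions in the first place.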
