Eliciting Latent Knowledge (ELK) - Distillation/summary

Marius Hobbhahn

AI Alignment Forum, June 8, 2022

Abstract

Eliciting Latent Knowledge (ELK) is the problem of getting an AI system to honestly report what it knows, including knowledge that is latent in its internal representations and not apparent from its outputs. ELK has been proposed as a step toward solving problems in AI alignment. This post distills the ELK report, which proposes a variety of techniques and identifies potential weaknesses in each. The report’s methodology is a “builder-breaker” game in which one participant proposes an algorithm for extracting latent knowledge and the other proposes scenarios in which that algorithm could fail. Promising strategies include using human assistants to generate training data, using other AI systems to help humans understand the system better, and penalizing reporters whose responses depend on too many variables in the predictor. However, the report also finds potential weaknesses in these strategies, suggesting that more research is needed to find methods that are robust to manipulation or failure. – AI-generated abstract.
