The Elicitation Game: Evaluating Capability Elicitation Techniques
Felix Hofstätter, Teun van der Weij, Jayden Teoh, Rada Djoneva, Henning Bartsch, Francis Rhys Ward — 2025-02-04 — arXiv
Summary
Introduces a circuit-breaking method for creating model organisms with hidden capabilities and systematically evaluates different elicitation techniques (prompting, activation steering, fine-tuning) to determine which can most effectively reveal latent capabilities.
Key Result
Fine-tuning elicits hidden capabilities most reliably across tasks. Prompting works on multiple-choice question answering (MCQA), whereas activation steering largely fails. Combining techniques improves elicitation.
Source
- Link: https://arxiv.org/abs/2502.02180
- Listed in the Shallow Review of Technical AI Safety 2025 under one agenda:
- capability-evals — Evals