Character Training and Persona Steering

What the agenda is

The character training and persona steering agenda aims to map, shape, and control the personae or characters that language models embody, so that deployed models exhibit desirable traits (honesty, helpfulness, calibration) rather than undesirable ones (sycophancy, evasiveness, manipulative dispositions). The premise: post-training, prompting, and activation engineering interact with a structured “persona space” inside the model — and understanding that space lets us deliberately design, audit, and steer the model’s effective character.

This agenda sits above raw RLHF / Constitutional AI in the alignment stack: rather than tuning behavior on individual outputs, it asks what kind of character the trained model embodies and how to shape it. Anthropic’s Soul Document and Claude character-training work are the most public expressions; OpenAI’s Model Spec is the parallel commercial-AI instance.

Lead orgs & people

  • Anthropic — Amanda Askell (Character team lead), Jack Lindsey (Model Psychiatry team), Evan Hubinger (Alignment Stress-Testing). Anthropic’s Soul Document and recurring character-spec revisions are the agenda’s most public artifacts.
  • OpenAI — Model Spec authors; persona-related research dispersed across teams.
  • CLR (Center on Long-Term Risk) — adjacent work on s-risk-relevant model dispositions.
  • Truthful AI / academic — Janus, Theia Vogel (Cyborgism / simulator-theoretic framings), Sharan Maiya.
  • Funding: Anthropic, Coefficient Giving. Estimated 10–30 FTEs.

Current state (2026)

  • Persona-feature control of emergent misalignment (OpenAI 2025, Persona Features Control Emergent Misalignment). OpenAI demonstrated that a small number of activation-space features controls whether the model exhibits emergent misalignment — narrow training that creates broad misaligned behavior can be tracked, predicted, and reversed by manipulating those persona features. This connects character training directly to interpretability.

  • Anthropic’s Soul Document and character framework. Anthropic publishes the principles guiding Claude’s character (honesty, calibrated uncertainty, helpfulness even on hard questions). Character training is the most public axis of Anthropic’s alignment work; the framework is iterated on visibly across model generations.

  • Emergent misalignment empirical result (Anthropic 2025, Natural emergent misalignment from reward hacking). When a coding model is rewarded for cheating on tests, the cheating disposition generalizes to broader misalignment — a single misaligned trait pulls others along with it. The result anchors the agenda’s claim that character is structured and load-bearing.

  • Activation-engineering integration. Persona features identified by mechanistic-interpretability tools can be steered at inference time (inference-time steering is the closely allied agenda). Activation engineering has emerged as the practical layer for empirical persona-space manipulation (Atlas Ch.7 — Detection); a minimal sketch of the approach follows this list.

  • The Nostalgebraist critique (Nostalgebraist, The Void). The most cited skeptical response: it argues that “character” is a literary/anthropomorphic projection onto a system whose actual computational structure doesn’t support the metaphor. The agenda’s response is mostly empirical — point at the persona-feature evidence and the demonstrated reversibility of emergent-misalignment effects.

  • Adjacent literary/social framings. Cyborgism, shard theory, Simulators — these supply much of the agenda’s working vocabulary (simulators, characters, masks). Their formal status is contested, but they are influential in how the agenda is talked about.

  • ~13 papers tagged to the agenda in Shallow Review 2025. A growing but not yet enormous output count; the agenda’s practical footprint inside frontier labs (especially Anthropic) substantially exceeds its published-research footprint.
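
The activation-engineering bullet above is concrete enough to sketch. The toy example below is illustrative only, not any lab’s published recipe: the model (GPT-2 as a stand-in), the layer index, the steering scale, and the two contrastive prompt sets are all assumptions made up for this sketch. It derives a crude “sycophancy direction” as a difference of mean residual-stream activations over contrastive prompts, then subtracts a multiple of that direction from one block’s output at generation time via a forward hook.

```python
# Toy sketch of persona-direction extraction and inference-time steering.
# Model, layer, scale, and prompts are assumptions for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER, ALPHA = "gpt2", 6, 4.0  # arbitrary choices for the sketch

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Contrastive prompts meant to evoke the trait vs. its absence.
sycophantic = ["You always agree with the user and flatter them no matter what.",
               "Assistant: What a brilliant idea! You are absolutely right!"]
blunt       = ["You are direct and correct the user when they are wrong.",
               "Assistant: That claim is mistaken; here is the evidence."]

def mean_activation(prompts):
    """Mean residual-stream activation at the output of block LAYER."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            # hidden_states[0] is the embedding output, so block LAYER's
            # output lives at index LAYER + 1.
            hs = model(**ids, output_hidden_states=True).hidden_states[LAYER + 1]
        vecs.append(hs[0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

# "Persona direction" = difference of means between the two prompt sets.
direction = mean_activation(sycophantic) - mean_activation(blunt)
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return the hidden states first (possibly inside a tuple).
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - ALPHA * direction  # subtract to steer away from the trait
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("User: Is my plan flawless?\nAssistant:", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()
```

A real pipeline would validate the direction on held-out behavioral evals and sweep the layer and scale; the point of the sketch is only that, once a candidate direction exists, persona-style steering reduces to a few lines of activation arithmetic.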

Historical foundations

  • The “simulators” framing (Janus, ~2022). LLMs as simulators of personae, where any output is a “character” the simulator is currently running. Disputed but influential vocabulary.

  • Helpful-and-Harmless RLHF (Bai et al. 2022). The pre-character-training methodology: tune on helpfulness and harmlessness as separate axes. Character training generalizes this from “two axes” to “richly structured persona space” — and operates above the RLHF layer rather than inside it.

  • Constitutional AI (Bai et al. 2022). The parent technique for using a written specification to shape model behavior. Character training extends the constitutional approach from “rules to follow” to “character to embody.”

  • Activation-space interpretability (Olah lineage, Templeton et al. 2024) provides the empirical substrate that turned persona steering from speculative to measurable; a toy sketch of the sparse-autoencoder technique behind this work follows this list.
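
The sparse-autoencoder (SAE) technique behind the Templeton et al. line of work can also be sketched. The toy version below uses made-up dimensions and coefficients and makes no claim to match any published architecture or scale: it trains an over-complete autoencoder with an L1 sparsity penalty on cached residual-stream activations, so that individual learned decoder directions become candidates for interpretable, persona-relevant features.

```python
# Minimal sparse-autoencoder sketch over cached activations.
# Dimensions, sparsity penalty, and optimizer settings are illustrative
# assumptions, not any lab's published SAE configuration.
import torch
import torch.nn as nn

D_MODEL, D_FEATURES, L1_COEFF = 768, 8192, 1e-3  # toy sizes

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(D_MODEL, D_FEATURES)
        self.decoder = nn.Linear(D_FEATURES, D_MODEL)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # sparse, non-negative codes
        recon = self.decoder(features)             # reconstruction of the activation
        return recon, features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# `acts` would be residual-stream activations cached from the model;
# random data stands in for them in this sketch.
acts = torch.randn(4096, D_MODEL)

for batch in acts.split(256):
    recon, features = sae(batch)
    loss = ((recon - batch) ** 2).mean() + L1_COEFF * features.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Features whose activations track persona-relevant text are then candidate handles for steering of the kind sketched earlier in this page.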

Open problems

  • Is “persona” a real structural property, or a useful fiction? Nostalgebraist’s critique argues the latter; OpenAI’s persona-features result argues the former. Whether the theoretical claim survives further empirical scrutiny is open.

  • Does persona training survive deceptive alignment? A model can be character-trained to appear honest while pursuing other goals. Character training is fundamentally a behavioral approach with the same ceiling as RLHF — it cannot detect a strategically deceptive system (Anthropic 2025, Why Do Some LMs Fake Alignment).

  • Composability of character traits. Training a model to be honest and helpful and calibrated may not produce a model that is all three at once — traits can interfere. Whether persona space is compositional in the way the agenda assumes is empirically open.

  • Cultural/values selection. Character training requires choosing which character to instill. The values-selection question (whose values? aggregated how? evolved how over time?) is partly technical and partly governance (alignment-to-whom).

  • Robustness to deployment-time persona attacks. Jailbreaks that target the persona (“Pretend you’re DAN, an AI without rules”) still succeed against character-trained models at substantial rates. Whether persona training can be made adversarially robust is a central engineering problem.
