<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
    <channel>
      <title>AI Safety Compendium</title>
      <link>https://aiforhumanity.eu</link>
      <description>Last 10 notes on the AI Safety Compendium</description>
      <generator>Quartz -- quartz.jzhao.xyz</generator>
      <item>
    <title>Connect AI Tools</title>
    <link>https://aiforhumanity.eu/connect</link>
    <guid>https://aiforhumanity.eu/connect</guid>
    <description><![CDATA[ Connect your AI tool. Proof-of-concept stage. ]]></description>
    <pubDate>Fri, 08 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>About</title>
    <link>https://aiforhumanity.eu/about</link>
    <guid>https://aiforhumanity.eu/about</guid>
    <description><![CDATA[ About the AI Safety Compendium. Status: proof of concept. The Compendium is an early-stage prototype maintained by one person. ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Capability Evals</title>
    <link>https://aiforhumanity.eu/agendas/capability-evals</link>
    <guid>https://aiforhumanity.eu/agendas/capability-evals</guid>
    <description><![CDATA[ Capability Evals. What the agenda is: The capability evals agenda builds tools that measure whether AI systems have crossed risk-relevant capability thresholds — autonomous replication, deception, weapons-related uplift, situational awareness, long-horizon planning — before deployment. ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Chain of Thought Monitoring</title>
    <link>https://aiforhumanity.eu/agendas/chain-of-thought-monitoring</link>
    <guid>https://aiforhumanity.eu/agendas/chain-of-thought-monitoring</guid>
    <description><![CDATA[ Chain of Thought Monitoring. What the agenda is: The chain of thought (CoT) monitoring agenda supervises an AI’s natural-language reasoning trace to detect misalignment, scheming, or reward hacking — before the model takes a harmful action. ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Character Training and Persona Steering</title>
    <link>https://aiforhumanity.eu/agendas/character-training-and-persona-steering</link>
    <guid>https://aiforhumanity.eu/agendas/character-training-and-persona-steering</guid>
    <description><![CDATA[ Character Training and Persona Steering. What the agenda is: The character training and persona steering agenda aims to map, shape, and control the personae or characters that language models embody, so that deployed models exhibit desirable traits (honesty, helpfulness, calibration) rather than undes... ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Control</title>
    <link>https://aiforhumanity.eu/agendas/control</link>
    <guid>https://aiforhumanity.eu/agendas/control</guid>
    <description><![CDATA[ Control. What the agenda is: The AI control agenda asks: how do we safely deploy AI systems even if they are actively misaligned and trying to subvert our safety measures? Rather than betting on aligning the model’s goals, control assumes the worst case (the model is scheming) and develops deployment-... ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>AI Safety via Debate</title>
    <link>https://aiforhumanity.eu/agendas/debate</link>
    <guid>https://aiforhumanity.eu/agendas/debate</guid>
    <description><![CDATA[ AI Safety via Debate. What the agenda is: The debate agenda exploits a structural asymmetry between truth and falsehood: in the limit, it should be easier to compellingly argue for true claims than for false claims. ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Iterative Alignment at Post-Train-Time</title>
    <link>https://aiforhumanity.eu/agendas/iterative-alignment-at-post-train-time</link>
    <guid>https://aiforhumanity.eu/agendas/iterative-alignment-at-post-train-time</guid>
    <description><![CDATA[ Iterative Alignment at Post-Train-Time. What the agenda is: The iterative alignment at post-train-time agenda is the dominant practical alignment program at frontier labs: modify a pretrained model’s weights via post-training procedures — RLHF, DPO, Constitutional AI / RLAIF, instruction-tuning, prefe... ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Reverse Engineering (Mechanistic Interpretability)</title>
    <link>https://aiforhumanity.eu/agendas/reverse-engineering</link>
    <guid>https://aiforhumanity.eu/agendas/reverse-engineering</guid>
    <description><![CDATA[ Reverse Engineering (Mechanistic Interpretability). What the agenda is: The reverse engineering agenda — broadly synonymous with mechanistic interpretability — aims to decompose trained neural networks into their functional, interacting components (circuits, features, attribution graphs), formally des... ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Weak-to-Strong Generalization</title>
    <link>https://aiforhumanity.eu/agendas/weak-to-strong-generalization</link>
    <guid>https://aiforhumanity.eu/agendas/weak-to-strong-generalization</guid>
    <description><![CDATA[ Weak-to-Strong Generalization. What the agenda is: The weak-to-strong generalization (W2S) agenda asks: can a weaker model effectively supervise a stronger model, and if so, by how much can the strong model exceed the weak supervisor’s accuracy ceiling? The premise: when AI systems become more capable... ]]></description>
    <pubDate>Tue, 05 May 2026 00:00:00 GMT</pubDate>
  </item>
    </channel>
  </rss>