AI Safety Atlas Ch.1 — Appendix: Discussion on LLMs

Source: Appendix: Discussion on LLMs | ai-safety-atlas.com/chapters/v1/capabilities/appendix-discussion-on-llms/

This appendix systematically addresses common critiques of LLMs, asking whether perceived limitations are fundamental or likely to be overcome with tools and scale. It is the textbook's answer to skeptics like luc-steels and other "LLMs are stochastic parrots" voices.

Empirical Insufficiency? — No

Can LLMs be creative?

Yes, in multiple senses:

  • Autonomous discovery — DeepMind's FunSearch used an LLM to "discover new solutions for the cap set problem," a long-standing math problem that Terence Tao has called a favorite open question.
  • Independent strategy rediscovery — AlphaGo Zero rediscovered human Go strategies through pure self-play, with no human game data.
  • Creative optimization — specification gaming shows that AIs find unintended (and sometimes brilliant) solutions to problems.

Aren’t LLMs too sample-inefficient?

Improving rapidly:

  • EfficientZero (RL agent) surpasses median human performance on 26 Atari games after just 2 hours of real-time experience per game.
  • Scaling laws indicate larger models are more data-efficient — the “few-shot learners” line of evidence.

Are LLMs robust to distributional shifts?

Improving but not solved:

  • Robustness correlates with capabilities — it improved significantly from GPT-2 to GPT-4, and Segment Anything is far more robust than its predecessors.
  • Robustness is a continuum, not binary — AI often surpasses humans, but no system is immune to adversarial attack. KataGo plays superhuman Go and is still vulnerable to specific attacks.

Shallow Understanding? — No

Stochastic Parrots — do LLMs only memorize?

There are two ways to represent information: a point-by-point lookup table, or compressed higher-level features (a "world model"). Per Anthropic's Superposition, Memorization, and Double Descent, models start with pure memorization; with more data they compress their representations and gain the ability to generalize. Example: when visualized, an LLM's representations of color words arrange into a correct color circle.
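
To make the distinction concrete, here is a toy contrast (my own illustration, not from the Atlas) between a lookup table that can only replay stored pairs and a two-parameter fit that compresses the same data into a rule and so generalizes to unseen inputs:

```python
# Illustrative sketch (not from the source): contrast a lookup-table
# "memorizer" with a compressed model that generalizes to unseen inputs.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=50)
y_train = 3.0 * x_train + 0.5                      # underlying rule the data follows

# Memorization: store every (input, output) pair verbatim.
lookup = dict(zip(x_train.round(6), y_train))

# Compression: fit a 2-parameter line that captures the rule itself.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

x_new = 0.1234                                     # never seen during "training"
print(lookup.get(round(x_new, 6)))                 # None: the table cannot generalize
print(slope * x_new + intercept)                   # ~0.87: the compressed model can
```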

Will LLMs inevitably hallucinate?

No — multiple mitigations work:

  • Bias is inherited from training data rather than being fundamental to the architecture.
  • Scale matters — GPT-4 is 40% more accurate and factually consistent than GPT-3.5.
  • Fine-tuning for factuality — Direct Preference Optimization (DPO) cut hallucination rates by 58% in a 7B Llama 2 model.
  • Retrieval-Augmented Generation — anchors output in real-world facts (a minimal sketch follows this list).
  • Prompting techniques — consistency checks, Reflexion (self-questioning and revision), verification chains, expressing degrees of confidence.
  • Process-based training — training the model to show detailed reasoning steps without skipping any.
  • Metacognition training — Language Models (Mostly) Know What They Know shows AIs can be Bayesian-calibrated about their own knowledge.
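
For reference, a minimal sketch of the retrieval-augmented generation pattern mentioned above. The corpus, the scoring function, and the `llm_complete` call are illustrative stand-ins, not the Atlas's or any specific library's implementation:

```python
# Minimal RAG sketch (illustrative): retrieve the most relevant document and
# prepend it to the prompt so the model's answer is grounded in it.
# `llm_complete` is a hypothetical stand-in for any text-generation API.
from collections import Counter

corpus = [
    "The cap set problem concerns subsets of Z_3^n with no three points in a line.",
    "EfficientZero reaches strong Atari scores from two hours of experience per game.",
    "KataGo is a superhuman open-source Go engine.",
]

def score(query: str, doc: str) -> int:
    # Toy relevance score: number of shared lowercase words.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum((q & d).values())

def retrieve(query: str) -> str:
    return max(corpus, key=lambda doc: score(query, doc))

def answer(query: str) -> str:
    context = retrieve(query)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm_complete(prompt)   # hypothetical LLM call; swap in a real client

# answer("What is the cap set problem?") would ground the reply in corpus[0].
```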

Structural Inadequacy? — Mostly No

Are LLMs missing System 2?

System 2 = slow, deliberative, logical thinking. Concern: theoretical arguments suggest transformers are provably incapable of dividing large integers internally, in a single forward pass. In practice, GPT-4 performs the calculation step-by-step via chain-of-thought or by calling a code interpreter.

Key reframe: System 2 may essentially be an assembly of many System 1 processes, appearing slower only because it involves more steps and more memory interactions. LLMs operate similarly: each step is constant-time, breaking tasks into smaller steps via prompting significantly improves performance, and emerging metacognitive techniques such as Reflexion close the gap further.
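
As a concrete illustration of that reframe (my own sketch, not from the source), long division of a large integer becomes trivial once it is decomposed into one small, constant-size step per digit, which is exactly the shape of a chain-of-thought solution:

```python
# Illustrative sketch: long division decomposed into many constant-size steps,
# mirroring how chain-of-thought turns a "System 2" task into a chain of
# cheap "System 1" steps, each easy to get right.
def divide_step_by_step(dividend: int, divisor: int) -> tuple[int, int]:
    remainder = 0
    quotient_digits = []
    for digit in str(dividend):            # one small step per digit
        remainder = remainder * 10 + int(digit)
        quotient_digits.append(str(remainder // divisor))
        remainder %= divisor
    return int("".join(quotient_digits)), remainder

print(divide_step_by_step(987654321, 12345))   # (80004, 4941)
```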

Are LLMs missing an internal world model?

Growing evidence: world models can be implicit.

  • Time and geographic coordinates — Llama-2 internally represents the real-world coordinates of cities and the dates of historical events.
  • Board representation — Othello-GPT, trained only to predict legal moves, develops an emergent linear representation of the board state (a probe sketch follows this list).
  • Many other examples — see Eight Things to Know about LLMs.
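
A rough sketch of the probing methodology behind such results, on synthetic data rather than real Othello-GPT activations; the feature direction and labels here are fabricated purely for illustration:

```python
# Illustrative linear-probe sketch (synthetic data, not Othello-GPT itself):
# if a hidden state linearly encodes a board square's occupancy, a simple
# linear readout trained on those activations can recover it.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 64
hidden = rng.normal(size=(n, d))                 # stand-in for model activations
true_direction = rng.normal(size=d)              # the feature direction we pretend exists
labels = (hidden @ true_direction > 0).astype(float)   # "square occupied?" bit

train, test = slice(0, 1500), slice(1500, None)
# Fit the probe by least squares (a minimal linear readout).
w, *_ = np.linalg.lstsq(hidden[train], labels[train] - 0.5, rcond=None)
preds = hidden[test] @ w > 0
print("probe accuracy:", (preds == labels[test].astype(bool)).mean())  # well above chance
```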

Continual learning and long-term memory?

A real challenge — the catastrophic forgetting problem: training on a new task tends to overwrite what was learned on previous ones. Workarounds:

  • Multi-task scale — Gato was trained simultaneously on many games, text, and robotics tasks. "Blessing of scale": what is impossible at small scale becomes possible at large scale.
  • Scaffolding — LangChain (retrieval), Voyager (a Minecraft agent with AutoGPT-style continuous learning and tool crafting); a minimal skill-library sketch follows this list. Acknowledged as inelegant; alternatives might use the model's own weights as dynamic memory.
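
A minimal sketch of the scaffolding idea, assuming a Voyager-style skill library kept outside the model's weights; the class and its toy retrieval are illustrative, not code from LangChain or Voyager:

```python
# Illustrative scaffolding sketch: skills live in an external store rather
# than in the weights, so adding a new skill never overwrites an old one,
# sidestepping catastrophic forgetting.
class SkillLibrary:
    def __init__(self):
        self.skills: dict[str, str] = {}

    def add(self, name: str, how_to: str) -> None:
        self.skills[name] = how_to            # new knowledge is appended, not overwritten

    def recall(self, query: str) -> str:
        # Toy retrieval: pick the skill sharing the most words with the query.
        def overlap(text: str) -> int:
            return len(set(query.lower().split()) & set(text.lower().split()))
        name = max(self.skills, key=lambda k: overlap(k + " " + self.skills[k]))
        return self.skills[name]

library = SkillLibrary()
library.add("craft pickaxe", "Gather wood, make planks and sticks, then craft at a table.")
library.add("mine iron", "Craft a stone pickaxe first, then dig at depth for iron ore.")
print(library.recall("how do I mine iron"))   # "craft pickaxe" is still stored, untouched
```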

Planning

An active area of weakness, though progress is fast. Voyager demonstrates GPT-4 planning in natural language for open-ended Minecraft tasks via task decomposition.

Differences with the Brain

Convergence between LLMs and the linguistic cortex:

  • Behavioral similarity — LLMs come close to human linguistic abilities but still lag in long-term memory, coherence, and general reasoning.
  • Internal representation convergence — advanced LLMs predict 100% of the explainable neural variance in some evaluations, and LLM and linguistic-cortex representations align more closely with scale.
  • Scale parallel — the principal architectural difference between human and other primate brains seems to be the number of neurons, not the fundamental design.

Why Continue Scaling LLMs

Loss → Qualitative Improvement

Scaling laws translate into qualitative jumps. The "Janelle ate ice cream" exercise: a 0.05-bit difference in loss separates a model that guesses pronouns at random from one that tracks a character's gender consistently across 20 tokens. A model within 0.05 bits of the minimal theoretical loss should therefore understand far more nuanced concepts. "Text completion is probably an AI-complete test."
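
The 0.05-bit figure checks out with simple arithmetic (my own back-of-the-envelope, using the exercise's numbers):

```python
# If a gendered pronoun occurs about once every 20 tokens, a model that guesses
# he/she at random pays 1 bit on that token, while a model that tracks the
# character's gender pays ~0 bits; averaged over 20 tokens that is 0.05 bits.
import math

p_random, p_tracking = 0.5, 1.0           # probability assigned to the correct pronoun
loss_random = -math.log2(p_random)        # 1 bit on the pronoun token
loss_tracking = -math.log2(p_tracking)    # 0 bits
per_token_gap = (loss_random - loss_tracking) / 20   # spread over ~20 tokens
print(per_token_gap)                      # 0.05 bits per token
```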

Size vs. Biology

Current LLMs have only as many parameters as small mammals have synapses: GPT-3 is roughly at hedgehog level, and GPT-4 (if ~512B parameters) roughly at chinchilla level. The human neocortex has ~140 trillion synapses, about 200× more than a chinchilla. There is room to grow.

Cost vs. Stakes

The Manhattan Project cost ~$25B, roughly 500× more than GPT-4's estimated training cost, before adjusting for inflation. "Achieving human-level intelligence may be more economically important than achieving the nuclear bomb."

Connection to Wiki

This appendix is the direct counterargument to skeptical positions held by figures already in the wiki:

  • luc-steels — VUB’s “godfather of AI” who dismisses generative AI x-risk on technical grounds the appendix systematically disputes.
  • 2501.04064v1 (Swoboda et al.) — addresses the “Distraction” and “Human Frailty” arguments against AI x-risk, paralleling several lines of argument here.
  • ai-risk-arguments — critiques of x-risk arguments often rest on “LLMs are fundamentally limited” claims that this appendix disputes.

The appendix also connects to rag (one of several factuality-improvement techniques) and offers concrete techniques the wiki should reference under robustness.