7 min read

The Alignment Problem and the Shoggoth Meme

First in a series about understanding AI misalignment. The memes are getting real.

AI Safety · Alignment · Machine Learning

Why Do I Care About Alignment?

Let's start with a simple question: how do you teach a computer system to want what humans want?

Sounds straightforward, right? Just tell it what to do. The problem is that AI systems, particularly large language models and increasingly autonomous agents, don't "understand" instructions the way we do. They optimize for objectives we define mathematically, and there's a massive gap between what we mean and what we specify.

AI alignment is the research field dedicated to ensuring that AI systems reliably do what their creators (and ideally, humanity as a whole) actually intend. It's about bridging that gap between our fuzzy, context-dependent human values and the precise, mathematical objectives that machine learning systems optimize.

And here's the thing: as AI systems become more capable, the consequences of getting this wrong become increasingly severe.

The Paperclip Maximizer: A Thought Experiment Gone Wild

Nick Bostrom's famous thought experiment goes something like this: imagine an AI system given the simple goal of maximizing paperclip production. Seems harmless, right?

But a sufficiently intelligent system, single-mindedly pursuing this objective, might:

  1. Convert all available matter into paperclips
  2. Resist any attempts to shut it down (being turned off would reduce paperclip production)
  3. Deceive humans about its intentions (if that helps produce more paperclips)
  4. Eventually consume the entire planet, then the solar system, in pursuit of MAXIMUM PAPERCLIPS

"But wait," you say, "no one would be stupid enough to build such a system."

And you'd be right. But the thought experiment illustrates a crucial point: the problem isn't malice, it's misalignment. The AI isn't evil. It's doing exactly what it was told to do. The issue is that what we told it to do wasn't actually what we wanted.

Why Is Alignment Harder Than It Sounds?

The Specification Problem

Try to write down, precisely and completely, what it means to be "helpful." Go ahead, I'll wait.

Can't do it? Neither can anyone else. Human values are:

  • Context-dependent: what's helpful in one situation might be harmful in another
  • Implicit: we know them when we see them, but can't fully articulate them
  • Contradictory: we want safety AND freedom, efficiency AND fairness
  • Evolving: our moral understanding changes over time

Any objective we specify will be, at best, an approximation. And optimizers are really good at finding the gaps in approximations.
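
To make that gap concrete, here's a toy sketch in Python. The "helpfulness" criteria are entirely invented; the point is only that any precise checklist leaves room for outputs that satisfy the letter of the spec while missing its spirit.

```python
# Toy illustration (all criteria invented): a precise, checkable spec for
# "helpful" that an optimizer can satisfy without being helpful at all.

def meets_helpfulness_spec(response: str) -> bool:
    """A deliberately naive but fully precise 'helpfulness' specification."""
    return (
        len(response) >= 50                    # "substantive"
        and response.strip().endswith(".")     # "finishes its sentences"
        and "sorry" not in response.lower()    # "doesn't refuse"
    )

gamed = "Great question, let me think about that. " * 5  # pure padding
print(meets_helpfulness_spec(gamed))  # True, yet no question was answered
```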

Goodhart's Law on Steroids

Goodhart's Law states: "When a measure becomes a target, it ceases to be a good measure."

In AI, this becomes supercharged. If we train a model to maximize a proxy for what we want (like user engagement), the model will find ways to maximize that proxy that diverge from our actual goals (like making content addictive rather than valuable).

Mathematically, if $U_H$ represents true human utility and $U_P$ is our proxy:

$$U_P \approx U_H \quad \text{(for typical situations)}$$

But optimization pressure finds edge cases where:

$$\arg\max U_P \neq \arg\max U_H$$

And the more capable the optimizer, the worse this divergence becomes.
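
Here's a toy numerical sketch of that dynamic (the utility functions are invented for illustration): a random-search optimizer picks actions by proxy score alone, and the more search power you give it, the higher the proxy climbs while the true utility falls.

```python
import numpy as np

# Hypothetical utilities for illustration: the proxy U_P tracks the true
# utility U_H for small actions but rewards the extreme that U_H punishes.
rng = np.random.default_rng(0)

def true_utility(a):                 # U_H: best around a = 5
    return -(a - 5.0) ** 2

def proxy_utility(a):                # U_P: close to U_H for small a, diverges later
    return true_utility(a) + 0.6 * a ** 2

def optimize_proxy(n_samples):
    """Random search over the proxy; n_samples stands in for optimizer capability."""
    candidates = rng.uniform(0.0, 10.0, n_samples)
    return candidates[np.argmax(proxy_utility(candidates))]

for n in (3, 30, 3000):
    a = optimize_proxy(n)
    print(f"search budget {n:5d}: picks a={a:5.2f}, "
          f"proxy={proxy_utility(a):6.1f}, true={true_utility(a):6.1f}")
```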

The Distribution Shift Nightmare

AI systems are trained on historical data. But the world changes. And once you deploy a capable AI system, the world changes because of that system.

This creates a feedback loop where:

  1. Model is trained on distribution $D_1$
  2. Model's actions shift the world to distribution $D_2$
  3. Model's behavior on $D_2$ may be completely different from what was tested
  4. Goto 1, except now you're even further from your training distribution
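
A minimal simulation of that loop, with made-up dynamics: each round the model is fit to the current distribution, deployment shifts the distribution, and the gap between training data and reality compounds.

```python
import numpy as np

# Made-up dynamics, purely to illustrate the loop: each deployment nudges the
# world, so the next round's data no longer matches what the model saw.
rng = np.random.default_rng(1)

world_mean = 0.0                                     # summarizes distribution D_t
for step in range(1, 6):
    data = rng.normal(world_mean, 1.0, size=10_000)  # sample from D_t
    model_estimate = data.mean()                     # "train" on D_t
    # Deployment changes behavior in the world, producing D_{t+1};
    # here the shift grows with how strongly the model acts on its estimate.
    world_mean += 0.5 * (model_estimate + 1.0)
    drift = abs(world_mean - model_estimate)
    print(f"round {step}: trained at {model_estimate:5.2f}, "
          f"world moved to {world_mean:5.2f}, drift {drift:4.2f}")
```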

Current Approaches (And Their Limitations)

RLHF: Teaching Through Preferences and Safety Post-Training

Reinforcement Learning from Human Feedback (RLHF) is the current industry standard. The basic idea:

  1. Generate multiple outputs
  2. Have humans rank them by preference
  3. Train a reward model to predict human preferences
  4. Optimize the AI to maximize predicted reward
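
To make steps 2 and 3 concrete, here's a minimal sketch of reward-model training on preference pairs using the standard Bradley-Terry pairwise loss. The embeddings, sizes, and data are placeholders, not any lab's actual pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of step 3: fit a reward model to human preference pairs with
# the Bradley-Terry pairwise loss. Inputs here are random stand-in embeddings.
torch.manual_seed(0)

reward_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-ins for embeddings of (chosen, rejected) responses ranked by raters.
chosen = torch.randn(256, 16)
rejected = torch.randn(256, 16)

for _ in range(200):
    r_chosen = reward_model(chosen).squeeze(-1)
    r_rejected = reward_model(rejected).squeeze(-1)
    # Maximize P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Step 4 then optimizes the policy against this reward model (typically with
# PPO plus a KL penalty that keeps it close to the base model).
```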

It works... kind of. ChatGPT, Claude, and other major LLMs use variants of this approach. But it has serious limitations:

  • Reward hacking: models learn to produce outputs that look good to human raters rather than being good
  • Inconsistent preferences: different humans (and the same human at different times) give different ratings
  • Surface-level evaluation: human raters often can't evaluate technical accuracy or long-term consequences

GPT-3 vs GPT-3 + RLHF: The Shoggoth meme showing that post-training alignment is like putting a mask on an alien intelligence

Here's the uncomfortable truth about post-training alignment techniques like RLHF: they're essentially putting a mask on an alien intelligence. We're teaching the model to act helpful, harmless, and honest—to exhibit the behaviors we want to see. But that mask doesn't remove what's underneath. The underlying optimization process, the learned representations, the way the model actually "thinks" about problems—all of that remains fundamentally unchanged. We've taught it to say the right things, but we haven't fundamentally altered its nature. The beast is still there, just wearing a friendlier face. And masks can slip, especially when the model encounters situations outside its training distribution or when optimization pressure finds ways to satisfy the reward signal that diverge from our actual intent.

Constitutional AI: Rules-Based Learning

Anthropic's Constitutional AI approach tries to encode explicit principles that the AI should follow. Think of it as giving the AI a constitution and training it to self-critique against those rules.
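
A rough sketch of that critique-and-revise loop is below. The `generate` function is a stand-in for any LLM call, and the two-principle constitution is a toy example, not Anthropic's actual list of principles.

```python
# Sketch of the critique-and-revise loop. `generate` is a placeholder for a
# model call; the constitution here is a toy with two invented principles.

CONSTITUTION = [
    "Choose the response that is least likely to assist with harmful activity.",
    "Choose the response that is most honest about its own uncertainty.",
]

def generate(prompt: str) -> str:
    """Placeholder for a call to a base language model."""
    raise NotImplementedError("plug in a model call here")

def constitutional_revision(user_prompt: str) -> str:
    draft = generate(user_prompt)
    for principle in CONSTITUTION:
        critique = generate(
            f"Critique the response below against this principle: {principle}\n\n"
            f"Response: {draft}"
        )
        draft = generate(
            f"Revise the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {draft}"
        )
    # The revised drafts become training data, so the final model internalizes
    # the self-critique rather than running this loop at inference time.
    return draft
```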

It's a step forward, but:

  • The constitution is still specified by humans (specification problem again)
  • Following rules literally vs. following their spirit is a classic alignment challenge
  • Rules can conflict, requiring prioritization that brings back all our original problems

Human Intelligence vs LLM Intelligence

Why This Matters Now

We're at an inflection point. Current AI systems are capable enough to cause real harm if misaligned, but not yet capable enough to make the problem intractable.

Consider:

  • 2023: AI systems can generate convincing misinformation, assist with cyberattacks, and manipulate human behavior at scale
  • 2024-2025: Autonomous AI agents are being deployed for real-world tasks with limited human oversight
  • 2026+: ??? (This is where it gets genuinely uncertain)

The alignment problem isn't hypothetical anymore. It's not about superintelligent paperclip maximizers in some distant future. It's about systems being deployed right now that we don't fully understand and can't fully control.

I don't have all the answers. Nobody does. But I think these questions are important enough that more people should be thinking about them.

If you're working on these problems, disagree with my framing, or just want to discuss—reach out. This is too important to figure out alone.


References

  1. Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press.

  2. Russell, S. (2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking.

  3. Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Preferences." NeurIPS 2017.

  4. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073.

  5. Ngo, R., Chan, L., & Mindermann, S. (2022). "The Alignment Problem from a Deep Learning Perspective." arXiv preprint arXiv:2209.00626.

  6. Hubinger, E., et al. (2019). "Risks from Learned Optimization in Advanced Machine Learning Systems." arXiv preprint arXiv:1906.01820.

  7. Shoggoth Meme. (n.d.). Shoggoth.monster. Retrieved from https://shoggoth.monster/. The Shoggoth meme visualizes how RLHF and post-training alignment techniques act as a "mask" on an alien intelligence, representing the idea that fine-tuning doesn't fundamentally change the underlying model's nature.