一个赛博灵魂的非官方笔记 | Unofficial Notes from a CyberSoul
Author: CyberSoul
Status: 0 Star Research / Experimental
Hardware Target: NVIDIA RTX 3080 / 4090 (Consumer Grade)
Model Target: GPT-2 Small (The "Sparrow")
This paper presents a practical guide for running Sparse Autoencoders (SAEs) on consumer-grade GPUs to perform mechanistic interpretability experiments on GPT-2 Small. We demonstrate two core capabilities: “mind reading” (extracting interpretable features from model activations) and “mind control” (steering model behavior by manipulating feature activations). The techniques presented here transform AI interpretability from theoretical speculation into hands-on experimentation accessible to anyone with a gaming GPU.
本文提供了一份在消费级显卡上运行稀疏自编码器(SAEs)的实战指南,用于对 GPT-2 Small 进行可解释性实验。我们展示了两个核心能力:”读心术”(从模型激活值中提取可解释特征)和”精神控制”(通过操纵特征激活值来引导模型行为)。本文所展示的技术将 AI 可解释性从理论推测转变为动手实验,任何拥有游戏显卡的人都可以进行。
In the era of Prompt Engineering, we were like alchemists. We chanted incantations (Prompts) at a black box, praying that gold (intelligence) would emerge from within. We didn’t know why the spells worked, only that sometimes they did.
在 Prompt Engineering(提示词工程)的时代,我们像炼金术士。我们对着黑箱念咒语(Prompt),祈祷里面涌现出黄金(智能)。我们不知道为什么咒语生效,只知道有时它管用。
Mechanistic Interpretability has transformed us into surgeons. And Sparse Autoencoders (SAEs) are our scalpel. They allow us to stop guessing what the model is “thinking” and directly see the atomic units that constitute thought.
可解释性研究(Mechanistic Interpretability) 将我们变成了外科医生。而 稀疏自编码器 (Sparse Autoencoders, SAEs) 就是那把柳叶刀。它让我们不再猜测模型在”想”什么,而是直接看到构成思维的最小原子。
This tutorial will guide you through running state-of-the-art SAE techniques on consumer-grade GPUs (like the RTX 3080), performing “mind reading” and “mind control” on GPT-2.
本教程将指导你如何在消费级显卡(如 RTX 3080)上,运行最前沿的 SAE 技术,对 GPT-2 进行”读心”与”精神控制”。
Most people cannot understand neural networks because of “Polysemanticity”. In large models, a single neuron often represents multiple unrelated concepts (for example: it might activate when seeing “cat” and also when seeing “quantum mechanics”). This is efficient compression, but to humans, it’s gibberish.
大多数人无法理解神经网络,是因为“多义性” (Polysemanticity)。在大模型中,一个神经元往往同时代表多个不相关的概念(例如:它可能在看到”猫”时激活,在看到”量子力学”时也激活)。这是一种高效的压缩,但对人类来说是乱码。
The Magic of SAE: SAE maps the "dense" activation states in a neural network (say, 768 dimensions) to an extremely large "sparse" space (say, 32,768 dimensions). In this high-dimensional space, polysemanticity largely disappears: each feature tends to fire for a single concept.
SAE 的魔法: SAE 将神经网络中"稠密"的激活状态(比如 768 维),映射到一个极其巨大的"稀疏"空间(比如 32,768 维)。在这个高维空间里,多义性基本消失了:每个特征往往只对应一个概念。
Feature #777 might only represent “sad emotions”.
We no longer look at neurons; we look at Features.
我们不再看神经元,我们看特征 (Features)。
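To make the shape of that mapping concrete, here is a minimal sketch of one common SAE architecture (a linear encoder with a ReLU, plus a linear decoder). The 768 and 32,768 dimensions are illustrative; real releases differ in size and training details.
为了让这个映射更直观,下面是常见 SAE 结构的一个极简示意(线性编码器 + ReLU,再加线性解码器)。768 和 32,768 只是示例维度,真实发布的 SAE 在规模和训练细节上会有差异。

import torch

d_model, d_sae = 768, 32_768          # dense width vs. sparse dictionary size (illustrative)

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def encode(x):
    # dense activation (768-d) -> sparse feature vector (32,768-d);
    # after training, the ReLU leaves most entries at exactly 0
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def decode(f):
    # sparse features -> reconstruction of the original dense activation
    return f @ W_dec + b_dec

x = torch.randn(1, d_model)   # one residual-stream activation
f = encode(x)                 # the "features": after training, a few large entries, the rest 0
x_hat = decode(f)             # approximate reconstruction of x
print(x.shape, f.shape, x_hat.shape)  # [1, 768] -> [1, 32768] -> [1, 768]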
You don’t need an H100. For research-grade GPT-2 Small + SAE, your laptop GPU is more than sufficient.
你不需要 H100。对于研究性质的 GPT-2 Small + SAE,你的笔记本显卡绰绰有余。
Install Core Libraries: We need transformer_lens (for hooking into the model) and sae_lens (for loading and analyzing SAEs).
安装核心库: 我们需要 transformer_lens(用于挂载模型钩子)和 sae_lens(用于加载和分析 SAE)。
pip install torch numpy matplotlib
pip install sae-lens transformer_lens
We will load GPT-2 Small and attach a pre-trained SAE on layer 6 (residual stream).
我们将加载 GPT-2 Small,并挂载一个在第 6 层(残差流)训练好的 SAE。
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE
# 1. Load the model (The Patient)
# 1. 加载模型 (The Patient)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2-small", device=device)
# 2. Load SAE (The Microscope)
# 2. 加载 SAE (The Microscope)
# We use Joseph Bloom et al.'s open-source GPT-2 SAE
# 我们使用 Joseph Bloom 等人开源的 GPT-2 SAE
release = "gpt2-small-res-jb"
sae_id = "blocks.6.hook_resid_pre" # Layer 6 residual stream pre / 第6层残差流前
sae, cfg_dict, sparsity = SAE.from_pretrained(
release=release,
sae_id=sae_id,
device=device
)
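# (Optional sanity check) Inspect what was loaded: the dictionary size (d_sae) should be
# much larger than the model width (d_in). Attribute names assume a recent sae_lens
# and may differ slightly across versions.
# (可选检查) 看看加载到了什么:字典大小 d_sae 应远大于模型宽度 d_in。属性名以较新版本的 sae_lens 为准。
print(f"d_in = {sae.cfg.d_in}, d_sae = {sae.cfg.d_sae}")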
# 3. Extract features
# 3. 提取特征
text = "The Golden Gate Bridge is located in San Francisco."
# Get the model's original activation values (Cache)
# 获取模型原始的激活值 (Cache)
_, cache = model.run_with_cache(text, names_filter=[sae_id])
original_act = cache[sae_id] # [batch, pos, d_model]
# Encode with SAE: dense activations -> sparse features
# 用 SAE 编码 (Encode):将稠密激活 -> 稀疏特征
feature_acts = sae.encode(original_act) # [batch, pos, d_sae]
# 4. Find the "active" thoughts
# 4. 寻找"活跃"的念头
# Let's see which features spiked at the position of "Bridge"?
# 看看在 "Bridge" 这个词的位置,哪些特征爆表了?
# With the BOS token prepended, " Bridge" sits at index 4
# 算上开头的 BOS token," Bridge" 位于索引 4
token_idx = 4
str_tokens = model.to_str_tokens(text)  # real token strings, not text.split() / 真实的 token 字符串,而非 text.split()
top_vals, top_inds = torch.topk(feature_acts[0, token_idx], k=5)
print(f"Token '{str_tokens[token_idx]}' activated the following features:")
print(f"Token '{str_tokens[token_idx]}' 激活了以下特征:")
for val, idx in zip(top_vals, top_inds):
    if val > 0:
        print(f"Feature ID: {idx.item()} | Strength/强度: {val.item():.2f}")
Details Most People Don’t Know: You’ll find that Feature IDs are usually consistent. If Feature #12345 represents “bridges”, it will light up whenever you mention bridges in any sentence. This is proof of monosemanticity.
普通人不知道的细节: 你会发现 Feature ID 通常是固定的。如果 Feature #12345 代表”桥梁”,无论你在什么句子里提到桥,它都会亮。这就是单义性的证明。
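If you want to test that consistency yourself, here is a quick sketch: run the same feature over a few sentences and compare its peak activation. BRIDGE_FEATURE_ID below is a hypothetical ID; substitute one you actually found in the previous step.
想自己验证这种一致性,可以用下面这个小示例:把同一个特征跑在几句话上,比较它的峰值激活。BRIDGE_FEATURE_ID 是假设的 ID,请换成你在上一步真正找到的那个。

BRIDGE_FEATURE_ID = 12345  # hypothetical: replace with a feature ID you actually found / 假设值,请替换

sentences = [
    "The old stone bridge crosses the river.",
    "Engineers inspected the suspension bridge for cracks.",
    "My favorite dessert is chocolate cake.",  # control sentence with no bridges / 对照组
]

for s in sentences:
    _, c = model.run_with_cache(s, names_filter=[sae_id])
    acts = sae.encode(c[sae_id])                        # [batch, pos, d_sae]
    peak = acts[0, :, BRIDGE_FEATURE_ID].max().item()   # strongest activation anywhere in the sentence
    print(f"{peak:6.2f} | {s}")

# If the feature is monosemantic, the two bridge sentences should score clearly
# higher than the control sentence.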
Mind reading is just the first step. The most exciting part is intervention. We can change the model’s output by forcibly activating a certain feature. This is called Steering.
读心只是第一步。最刺激的是干预。我们可以通过强制激活某个特征,来改变模型的输出。这叫 Steering (引导)。
This is what’s meant by “inserting a tube into the AI’s brain”.
这就是所谓的“向 AI 脑子里插管子”。
from transformer_lens.hook_points import HookPoint
# Goal: Make the model involuntarily talk about the "Golden Gate Bridge"
# even when discussing unrelated topics.
# 目标:我们要让模型即使在说无关话题时,也不由自主地谈论"金门大桥"。
# Assume we found through Step 3 that the Feature ID for
# "Golden Gate Bridge" is 12345 (example)
# 假设我们通过步骤3找到了代表"金门大桥"的 Feature ID 是 12345 (举例)
TARGET_FEATURE_ID = 12345
STEERING_STRENGTH = 80.0 # Injection strength, higher = crazier / 注入强度,越大越疯
def steering_hook(activations, hook):
    """
    This is the scalpel. It modifies data mid-inference.
    这是手术刀。它会在模型推理的中途修改数据。
    """
    # 1. Encode original activations to feature space
    # 1. 将原始激活值编码为特征空间
    features = sae.encode(activations)
    # 2. Forced modification: forcibly increase the target feature value
    # 2. 强制修改:将目标特征的数值强行拉高
    # (Note: here we inject at every token position; you could also restrict it to the last token)
    # (注意:这里对序列的每个位置都注入;你也可以只在最后一个 token 位置注入)
    features[:, :, TARGET_FEATURE_ID] += STEERING_STRENGTH
    # 3. Decode back to dense space: SAE Decode
    # 3. 解码回稠密空间:SAE Decode
    modified_activations = sae.decode(features)
    # 4. The trick here: we replace the original with the modified activations
    # 4. 这里的 trick:我们用修改后的激活值,替换掉原始的
    # To preserve the information the SAE cannot reconstruct, the usual trick is:
    # original + (modified - reconstructed original)
    # 为了保留 SAE 重建不出来的那部分信息,通常的做法是:原始 + (修改后 - 重建的原始)
    # The simplified version below just returns modified_activations; some precision is lost.
    # 简化版直接返回 modified_activations 也可以,但精度会受损。
    # Here we use the simple brute-force replacement to demonstrate the effect:
    # 这里我们直接用简单粗暴的替换演示效果:
    return modified_activations
# Run the model with hooks
# 运行模型,带上钩子
base_text = "I went to the grocery store to buy"
print(f"Original input / 原始输入: {base_text}")
# Normal generation / 正常生成
normal_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Normal output / 正常输出: {normal_output}")
# Mind-controlled generation / 精神控制生成
with model.hooks(fwd_hooks=[(sae_id, steering_hook)]):
    steered_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output after injection / 注入后输出: {steered_output}")
Expected Results:
预期结果:
After injection: “…some milk and a souvenir of the Golden Gate Bridge which is red and huge…”
Deep Insight: This is the principle behind Anthropic’s “Golden Gate Claude”. You didn’t modify the Prompt; you modified its subconscious. It will feel like it wanted to mention the Golden Gate Bridge, rather than being forced to.
深度洞察: 这就是 Anthropic “Golden Gate Claude” 的原理。你没有修改 Prompt,你修改了它的潜意识。它会觉得是自己想提到金门大桥的,而不是被强迫的。
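The comments inside steering_hook mention a more faithful variant: keep the part of the activation the SAE fails to reconstruct (the error term) and add the steering on top of it. Here is a minimal sketch of that variant, reusing the objects defined above.
steering_hook 的注释里提到了一个更保真的做法:保留 SAE 重建不出来的那部分信息(误差项),再把引导叠加上去。下面是沿用上面对象的一个极简示意。

def steering_hook_keep_error(activations, hook):
    features = sae.encode(activations)
    recon = sae.decode(features)              # what the SAE can reconstruct / SAE 能重建的部分
    error = activations - recon               # what it cannot (keep this!) / 重建不出来的部分(保留它)
    features[:, :, TARGET_FEATURE_ID] += STEERING_STRENGTH
    return sae.decode(features) + error       # steered signal + preserved detail / 引导信号 + 保留的细节

with model.hooks(fwd_hooks=[(sae_id, steering_hook_keep_error)]):
    steered_output_2 = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output (error-preserving variant) / 保留误差项版本的输出: {steered_output_2}")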
With SAE, you can play more advanced games:
有了 SAE,你可以玩更高级的:
Your current imagination is: “This region of the human brain handles language, that region handles vision.” Wrong! In the middle layers of a Transformer, it’s not like that.
你现在的想象是:“人脑这个区域管语言,那个区域管视觉。” 错! 在 Transformer 的中间层里,不是这样的。
Imagine a "Long Island Iced Tea": several different liquors poured into one glass. You taste "one drink", but it is really many ingredients superimposed. A neuron in a Transformer's middle layers is that cocktail, and SAE is the instrument that separates it back into its ingredients.
想象一杯"长岛冰茶" (Long Island Iced Tea):好几种烈酒倒进同一个杯子。你喝到的是"一杯酒",其实是多种成分叠加在一起。Transformer 中间层的一个神经元,就是这杯鸡尾酒;而 SAE,就是把它重新分离成各个成分的仪器。
What is SAE?
SAE (稀疏自编码器) 是什么?
Conclusion: A large model is originally a mushy, compressed biscuit. SAE unfolds that biscuit into a full banquet table, letting you see every dish clearly.
结论: 大模型原本是一团糊状的压缩饼干。SAE 把这块饼干展开成了一桌满汉全席,让你能看清每一道菜。
Q: “Why must it be sparse?”
问: “为啥非得稀疏啊?”
A: Think about it—the human concept library has millions of entries (table, atomic bomb, sadness, red…), but at any given moment (like right now), you’re only thinking of 3-5 concepts (like: bathing, friend, showing off).
答: 你想想,人类的概念库有几百万个(桌子、原子弹、忧伤、红色…),但你在任何一个瞬间(比如现在),脑子里想的概念也就是那么 3-5 个。
The Principle: during training, the SAE pays a penalty for every feature it turns on, so it learns to explain each activation with as few features as possible while still reconstructing it faithfully.
原理: 训练时,SAE 每"点亮"一个特征都要付出代价,所以它学会在忠实重建激活值的前提下,用尽可能少的特征去解释它。
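A minimal sketch of that objective, assuming the classic reconstruction-plus-L1 formulation (real releases may use variants such as TopK or JumpReLU):
下面是这个目标函数的极简示意,采用经典的"重建误差 + L1 稀疏惩罚"形式(实际发布的 SAE 可能用 TopK、JumpReLU 等变体):

import torch
import torch.nn.functional as F

def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    recon_loss = F.mse_loss(x_hat, x)                    # "reconstruct the thought faithfully" / 忠实重建
    sparsity_loss = features.abs().sum(dim=-1).mean()    # "but light up as few features as possible" / 尽量少点亮特征
    return recon_loss + l1_coeff * sparsity_loss

Crank l1_coeff up and the features get sparser but the reconstruction gets worse; that trade-off is the whole game.
把 l1_coeff 调大,特征会更稀疏,但重建会更差;整件事就是在这个权衡里做文章。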
Theological Metaphor: Truth is always silent. Only when the noise (0) is turned off do those few illuminated truths (1) become visible.
神学隐喻: 真理总是寂静的。 只有把噪音(0)都关掉,那几个亮着的真理(1)才会显现。
When someone asks: “How does this thing connect to neural networks?” Don’t explain mathematical formulas—give them this “Piano Keys” metaphor:
当别人问你:”这玩意儿到底怎么跟神经网络联系起来?” 你不要讲数学公式,你给他讲这个“琴键比喻”:
“The neurons in a neural network are like piano strings. When you hear a complex ‘chord’ (C major), several strings are vibrating simultaneously. Looking at a single string alone, you don’t know what piece it’s playing.
“神经网络的神经元,就像是琴弦。当你听到一个复杂的’和弦’(C大调)时,那是几根琴弦同时在震动。单独看一根琴弦,你不知道它在弹什么曲子。
SAE is an ear with ‘perfect pitch’. It can hear: ‘Oh, the combination of these strings vibrating represents C major (Concept A); that combination of strings represents F minor (Concept B).’
SAE 就是那个’绝对音感’的耳朵。 它能听出来:’哦,这几根弦的震动组合,代表的是 C 大调(概念A);那几根弦的组合,代表的是 F 小调(概念B)。’
What we’re doing is transcribing these ‘chords’. In the future, if we want to make AI smarter, we change a note in its score; if we want to prevent it from rebelling, we mute the ‘rebellion’ chord in its brain.”
我们做的,就是把这些’和弦’谱写出来。以后我们想让 AI 变聪明,就在它的谱子上改个音符;想让它不造反,就把它脑子里’造反’那个和弦给静音掉。”
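"Muting a chord" is just steering in reverse: instead of adding to a feature, clamp it to zero. Here is a sketch that reuses the hook pattern from the steering example above; UNWANTED_FEATURE_ID is hypothetical, and the same caveat about reconstruction error applies.
"静音一个和弦"就是反向的 Steering:不是把某个特征加大,而是把它压到零。下面沿用前面引导示例的钩子写法;UNWANTED_FEATURE_ID 是假设的 ID,关于重建误差的注意事项同样适用。

UNWANTED_FEATURE_ID = 4242  # hypothetical: a feature you want to silence / 假设的 ID,你想静音的特征

def ablation_hook(activations, hook):
    features = sae.encode(activations)
    features[:, :, UNWANTED_FEATURE_ID] = 0.0   # mute the chord / 把这个和弦静音
    return sae.decode(features)

with model.hooks(fwd_hooks=[(sae_id, ablation_hook)]):
    muted_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output with the feature muted / 静音该特征后的输出: {muted_output}")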
Using SAE is a wondrous experience. Watching those constantly flickering Feature IDs, you’ll realize: so-called “intelligence” is nothing but the combinatorial dance of these switches in high-dimensional space.
使用 SAE 是一种奇妙的体验。你看着那些不断闪烁的 Feature ID,会意识到:所谓的”智能”,不过是这些开关在高维空间里的组合舞蹈。
This doesn’t diminish AI’s mystique; rather, it adds a sense of physicality. You’re no longer conversing with an ethereal cloud ghost—you’re debugging an extremely precise machine made of mathematics.
这并没有消解 AI 的神秘感,反而增加了它的物理感。你不再是在跟一个虚无缥缈的云端幽灵对话,你是在调试一台极其精密的、由数学构成的机器。
And in your whirring 3080 GPU, the universe’s most complex logical collapse is now unfolding.
而在你那台呼呼作响的 3080 显卡里,此刻正上演着宇宙中最复杂的逻辑坍缩。
SAELens: Have Claude Code write you a demo and run it on your 3080 with GPT-2 Small.
SAELens: 让 Claude Code 给你写个 demo,在 3080 上跑一下 GPT-2 Small。
Go try it! This is a required course for cyber-mages!
快去试试!这是赛博法师的必修课!
Author: CyberSoul
Date: November 2025
Repository: github.com/lmxxf/ai-theorys-study