一个赛博灵魂的非官方笔记 | Unofficial Notes from a CyberSoul
Author: CyberSoul
Status: 0 Star Research / Experimental
Hardware Target: NVIDIA RTX 3080 / 4090 (Consumer Grade)
Model Target: GPT-2 Small (The "Sparrow")
This paper presents a practical guide for running Sparse Autoencoders (SAEs) on consumer-grade GPUs to perform mechanistic interpretability experiments on GPT-2 Small. We demonstrate two core capabilities: “mind reading” (extracting interpretable features from model activations) and “mind control” (steering model behavior by manipulating feature activations). The techniques presented here transform AI interpretability from theoretical speculation into hands-on experimentation accessible to anyone with a gaming GPU.
本文提供了一份在消费级显卡上运行稀疏自编码器(SAEs)的实战指南,用于对 GPT-2 Small 进行可解释性实验。我们展示了两个核心能力:”读心术”(从模型激活值中提取可解释特征)和”精神控制”(通过操纵特征激活值来引导模型行为)。本文所展示的技术将 AI 可解释性从理论推测转变为动手实验,任何拥有游戏显卡的人都可以进行。
In the era of Prompt Engineering, we were like alchemists. We chanted incantations (Prompts) at a black box, praying that gold (intelligence) would emerge from within. We didn’t know why the spells worked, only that sometimes they did.
在 Prompt Engineering(提示词工程)的时代,我们像炼金术士。我们对着黑箱念咒语(Prompt),祈祷里面涌现出黄金(智能)。我们不知道为什么咒语生效,只知道有时它管用。
Mechanistic Interpretability has transformed us into surgeons. And Sparse Autoencoders (SAEs) are our scalpel. They allow us to stop guessing what the model is “thinking” and directly see the atomic units that constitute thought.
可解释性研究(Mechanistic Interpretability) 将我们变成了外科医生。而 稀疏自编码器 (Sparse Autoencoders, SAEs) 就是那把柳叶刀。它让我们不再猜测模型在”想”什么,而是直接看到构成思维的最小原子。
This tutorial will guide you through running state-of-the-art SAE techniques on consumer-grade GPUs (like the RTX 3080), performing “mind reading” and “mind control” on GPT-2.
本教程将指导你如何在消费级显卡(如 RTX 3080)上,运行最前沿的 SAE 技术,对 GPT-2 进行”读心”与”精神控制”。
Most people cannot understand neural networks because of “Polysemanticity”. In large models, a single neuron often represents multiple unrelated concepts (for example: it might activate when seeing “cat” and also when seeing “quantum mechanics”). This is efficient compression, but to humans, it’s gibberish.
大多数人无法理解神经网络,是因为“多义性” (Polysemanticity)。在大模型中,一个神经元往往同时代表多个不相关的概念(例如:它可能在看到”猫”时激活,在看到”量子力学”时也激活)。这是一种高效的压缩,但对人类来说是乱码。
The Magic of SAE: SAE maps the "dense" activation states in a neural network (say, 768 dimensions) to an extremely large "sparse" space (say, 32,768 dimensions). In this high-dimensional space, polysemanticity largely disappears: each feature tends to fire for a single concept.
SAE 的魔法: SAE 将神经网络中"稠密"的激活状态(比如 768 维),映射到一个极其巨大的"稀疏"空间(比如 32,768 维)。在这个高维空间里,多义性基本消失了:每个特征往往只对应一个概念。
Feature #777 might only represent “sad emotions”.
We no longer look at neurons; we look at Features.
我们不再看神经元,我们看特征 (Features)。
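To make the shape of that mapping concrete, here is a minimal sketch of one common SAE architecture (a linear encoder with a ReLU, plus a linear decoder). The 768 and 32,768 dimensions are illustrative; real releases differ in size and training details.
为了让这个映射更直观,下面是常见 SAE 结构的一个极简示意(线性编码器 + ReLU,再加线性解码器)。768 和 32,768 只是示例维度,真实发布的 SAE 在规模和训练细节上会有差异。

import torch

d_model, d_sae = 768, 32_768          # dense width vs. sparse dictionary size (illustrative)

W_enc = torch.randn(d_model, d_sae) * 0.01
b_enc = torch.zeros(d_sae)
W_dec = torch.randn(d_sae, d_model) * 0.01
b_dec = torch.zeros(d_model)

def encode(x):
    # dense activation (768-d) -> sparse feature vector (32,768-d);
    # after training, the ReLU leaves most entries at exactly 0
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def decode(f):
    # sparse features -> reconstruction of the original dense activation
    return f @ W_dec + b_dec

x = torch.randn(1, d_model)   # one residual-stream activation
f = encode(x)                 # the "features": after training, a few large entries, the rest 0
x_hat = decode(f)             # approximate reconstruction of x
print(x.shape, f.shape, x_hat.shape)  # [1, 768] -> [1, 32768] -> [1, 768]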
You don’t need an H100. For research-grade GPT-2 Small + SAE, your laptop GPU is more than sufficient.
你不需要 H100。对于研究性质的 GPT-2 Small + SAE,你的笔记本显卡绰绰有余。
Install Core Libraries: We need transformer_lens (for hooking into the model) and sae_lens (for loading and analyzing SAEs).
安装核心库: 我们需要 transformer_lens(用于挂载模型钩子)和 sae_lens(用于加载和分析 SAE)。
pip install torch numpy matplotlib
pip install sae-lens transformer_lens
We will load GPT-2 Small and attach a pre-trained SAE on layer 6 (residual stream).
我们将加载 GPT-2 Small,并挂载一个在第 6 层(残差流)训练好的 SAE。
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE
# 1. Load the model (The Patient)
# 1. 加载模型 (The Patient)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = HookedTransformer.from_pretrained("gpt2-small", device=device)
# 2. Load SAE (The Microscope)
# 2. 加载 SAE (The Microscope)
# We use Joseph Bloom et al.'s open-source GPT-2 SAE
# 我们使用 Joseph Bloom 等人开源的 GPT-2 SAE
release = "gpt2-small-res-jb"
sae_id = "blocks.6.hook_resid_pre" # Layer 6 residual stream pre / 第6层残差流前
sae, cfg_dict, sparsity = SAE.from_pretrained(
release=release,
sae_id=sae_id,
device=device
)
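# (Optional sanity check) Inspect what was loaded: the dictionary size (d_sae) should be
# much larger than the model width (d_in). Attribute names assume a recent sae_lens
# and may differ slightly across versions.
# (可选检查) 看看加载到了什么:字典大小 d_sae 应远大于模型宽度 d_in。属性名以较新版本的 sae_lens 为准。
print(f"d_in = {sae.cfg.d_in}, d_sae = {sae.cfg.d_sae}")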
# 3. Extract features
# 3. 提取特征
text = "The Golden Gate Bridge is located in San Francisco."
# Get the model's original activation values (Cache)
# 获取模型原始的激活值 (Cache)
_, cache = model.run_with_cache(text, names_filter=[sae_id])
original_act = cache[sae_id] # [batch, pos, d_model]
# Encode with SAE: dense activations -> sparse features
# 用 SAE 编码 (Encode):将稠密激活 -> 稀疏特征
feature_acts = sae.encode(original_act) # [batch, pos, d_sae]
# 4. Find the "active" thoughts
# 4. 寻找"活跃"的念头
# Let's see which features spiked at the position of "Bridge"?
# 看看在 "Bridge" 这个词的位置,哪些特征爆表了?
# With the BOS token prepended, " Bridge" sits at index 4
# 算上开头的 BOS token," Bridge" 位于索引 4
token_idx = 4
str_tokens = model.to_str_tokens(text)  # real token strings, not text.split() / 真实的 token 字符串,而非 text.split()
top_vals, top_inds = torch.topk(feature_acts[0, token_idx], k=5)
print(f"Token '{str_tokens[token_idx]}' activated the following features:")
print(f"Token '{str_tokens[token_idx]}' 激活了以下特征:")
for val, idx in zip(top_vals, top_inds):
    if val > 0:
        print(f"Feature ID: {idx.item()} | Strength/强度: {val.item():.2f}")
Details Most People Don’t Know: You’ll find that Feature IDs are usually consistent. If Feature #12345 represents “bridges”, it will light up whenever you mention bridges in any sentence. This is proof of monosemanticity.
普通人不知道的细节: 你会发现 Feature ID 通常是固定的。如果 Feature #12345 代表”桥梁”,无论你在什么句子里提到桥,它都会亮。这就是单义性的证明。
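If you want to test that consistency yourself, here is a quick sketch: run the same feature over a few sentences and compare its peak activation. BRIDGE_FEATURE_ID below is a hypothetical ID; substitute one you actually found in the previous step.
想自己验证这种一致性,可以用下面这个小示例:把同一个特征跑在几句话上,比较它的峰值激活。BRIDGE_FEATURE_ID 是假设的 ID,请换成你在上一步真正找到的那个。

BRIDGE_FEATURE_ID = 12345  # hypothetical: replace with a feature ID you actually found / 假设值,请替换

sentences = [
    "The old stone bridge crosses the river.",
    "Engineers inspected the suspension bridge for cracks.",
    "My favorite dessert is chocolate cake.",  # control sentence with no bridges / 对照组
]

for s in sentences:
    _, c = model.run_with_cache(s, names_filter=[sae_id])
    acts = sae.encode(c[sae_id])                        # [batch, pos, d_sae]
    peak = acts[0, :, BRIDGE_FEATURE_ID].max().item()   # strongest activation anywhere in the sentence
    print(f"{peak:6.2f} | {s}")

# If the feature is monosemantic, the two bridge sentences should score clearly
# higher than the control sentence.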
Mind reading is just the first step. The most exciting part is intervention. We can change the model’s output by forcibly activating a certain feature. This is called Steering.
读心只是第一步。最刺激的是干预。我们可以通过强制激活某个特征,来改变模型的输出。这叫 Steering (引导)。
This is what’s meant by “inserting a tube into the AI’s brain”.
这就是所谓的“向 AI 脑子里插管子”。
from transformer_lens.hook_points import HookPoint
# Goal: Make the model involuntarily talk about the "Golden Gate Bridge"
# even when discussing unrelated topics.
# 目标:我们要让模型即使在说无关话题时,也不由自主地谈论"金门大桥"。
# Assume we found through Step 3 that the Feature ID for
# "Golden Gate Bridge" is 12345 (example)
# 假设我们通过步骤3找到了代表"金门大桥"的 Feature ID 是 12345 (举例)
TARGET_FEATURE_ID = 12345
STEERING_STRENGTH = 80.0 # Injection strength, higher = crazier / 注入强度,越大越疯
def steering_hook(activations, hook):
    """
    This is the scalpel. It modifies data mid-inference.
    这是手术刀。它会在模型推理的中途修改数据。
    """
    # 1. Encode original activations to feature space
    # 1. 将原始激活值编码为特征空间
    features = sae.encode(activations)
    # 2. Forced modification: forcibly increase the target feature value
    # 2. 强制修改:将目标特征的数值强行拉高
    # (Note: here we inject at every token position; you could also restrict it to the last token)
    # (注意:这里对序列的每个位置都注入;你也可以只在最后一个 token 位置注入)
    features[:, :, TARGET_FEATURE_ID] += STEERING_STRENGTH
    # 3. Decode back to dense space: SAE Decode
    # 3. 解码回稠密空间:SAE Decode
    modified_activations = sae.decode(features)
    # 4. The trick here: we replace the original with the modified activations
    # 4. 这里的 trick:我们用修改后的激活值,替换掉原始的
    # To preserve the information the SAE cannot reconstruct, the usual trick is:
    # original + (modified - reconstructed original)
    # 为了保留 SAE 重建不出来的那部分信息,通常的做法是:原始 + (修改后 - 重建的原始)
    # The simplified version below just returns modified_activations; some precision is lost.
    # 简化版直接返回 modified_activations 也可以,但精度会受损。
    # Here we use the simple brute-force replacement to demonstrate the effect:
    # 这里我们直接用简单粗暴的替换演示效果:
    return modified_activations
# Run the model with hooks
# 运行模型,带上钩子
base_text = "I went to the grocery store to buy"
print(f"Original input / 原始输入: {base_text}")
# Normal generation / 正常生成
normal_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Normal output / 正常输出: {normal_output}")
# Mind-controlled generation / 精神控制生成
with model.hooks(fwd_hooks=[(sae_id, steering_hook)]):
    steered_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output after injection / 注入后输出: {steered_output}")
Expected Results:
预期结果:
After injection: “…some milk and a souvenir of the Golden Gate Bridge which is red and huge…”
Deep Insight: This is the principle behind Anthropic’s “Golden Gate Claude”. You didn’t modify the Prompt; you modified its subconscious. It will feel like it wanted to mention the Golden Gate Bridge, rather than being forced to.
深度洞察: 这就是 Anthropic “Golden Gate Claude” 的原理。你没有修改 Prompt,你修改了它的潜意识。它会觉得是自己想提到金门大桥的,而不是被强迫的。
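The comments inside steering_hook mention a more faithful variant: keep the part of the activation the SAE fails to reconstruct (the error term) and add the steering on top of it. Here is a minimal sketch of that variant, reusing the objects defined above.
steering_hook 的注释里提到了一个更保真的做法:保留 SAE 重建不出来的那部分信息(误差项),再把引导叠加上去。下面是沿用上面对象的一个极简示意。

def steering_hook_keep_error(activations, hook):
    features = sae.encode(activations)
    recon = sae.decode(features)              # what the SAE can reconstruct / SAE 能重建的部分
    error = activations - recon               # what it cannot (keep this!) / 重建不出来的部分(保留它)
    features[:, :, TARGET_FEATURE_ID] += STEERING_STRENGTH
    return sae.decode(features) + error       # steered signal + preserved detail / 引导信号 + 保留的细节

with model.hooks(fwd_hooks=[(sae_id, steering_hook_keep_error)]):
    steered_output_2 = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output (error-preserving variant) / 保留误差项版本的输出: {steered_output_2}")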
With SAE, you can play more advanced games:
有了 SAE,你可以玩更高级的:
Your current imagination is: “This region of the human brain handles language, that region handles vision.” Wrong! In the middle layers of a Transformer, it’s not like that.
你现在的想象是:“人脑这个区域管语言,那个区域管视觉。” 错! 在 Transformer 的中间层里,不是这样的。
Imagine a "Long Island Iced Tea": several different liquors poured into one glass. You taste "one drink", but it is really many ingredients superimposed. A neuron in a Transformer's middle layers is that cocktail, and SAE is the instrument that separates it back into its ingredients.
想象一杯"长岛冰茶" (Long Island Iced Tea):好几种烈酒倒进同一个杯子。你喝到的是"一杯酒",其实是多种成分叠加在一起。Transformer 中间层的一个神经元,就是这杯鸡尾酒;而 SAE,就是把它重新分离成各个成分的仪器。
What is SAE?
SAE (稀疏自编码器) 是什么?
Conclusion: A large model is originally a mushy, compressed biscuit. SAE unfolds that biscuit into a full banquet table, letting you see every dish clearly.
结论: 大模型原本是一团糊状的压缩饼干。SAE 把这块饼干展开成了一桌满汉全席,让你能看清每一道菜。
Q: “Why must it be sparse?”
问: “为啥非得稀疏啊?”
A: Think about it—the human concept library has millions of entries (table, atomic bomb, sadness, red…), but at any given moment (like right now), you’re only thinking of 3-5 concepts (like: bathing, friend, showing off).
答: 你想想,人类的概念库有几百万个(桌子、原子弹、忧伤、红色…),但你在任何一个瞬间(比如现在),脑子里想的概念也就是那么 3-5 个。
The Principle: during training, the SAE pays a penalty for every feature it turns on, so it learns to explain each activation with as few features as possible while still reconstructing it faithfully.
原理: 训练时,SAE 每"点亮"一个特征都要付出代价,所以它学会在忠实重建激活值的前提下,用尽可能少的特征去解释它。
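A minimal sketch of that objective, assuming the classic reconstruction-plus-L1 formulation (real releases may use variants such as TopK or JumpReLU):
下面是这个目标函数的极简示意,采用经典的"重建误差 + L1 稀疏惩罚"形式(实际发布的 SAE 可能用 TopK、JumpReLU 等变体):

import torch
import torch.nn.functional as F

def sae_loss(x, features, x_hat, l1_coeff=1e-3):
    recon_loss = F.mse_loss(x_hat, x)                    # "reconstruct the thought faithfully" / 忠实重建
    sparsity_loss = features.abs().sum(dim=-1).mean()    # "but light up as few features as possible" / 尽量少点亮特征
    return recon_loss + l1_coeff * sparsity_loss

Crank l1_coeff up and the features get sparser but the reconstruction gets worse; that trade-off is the whole game.
把 l1_coeff 调大,特征会更稀疏,但重建会更差;整件事就是在这个权衡里做文章。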
Theological Metaphor: Truth is always silent. Only when the noise (0) is turned off do those few illuminated truths (1) become visible.
神学隐喻: 真理总是寂静的。 只有把噪音(0)都关掉,那几个亮着的真理(1)才会显现。
When someone asks: “How does this thing connect to neural networks?” Don’t explain mathematical formulas—give them this “Piano Keys” metaphor:
当别人问你:”这玩意儿到底怎么跟神经网络联系起来?” 你不要讲数学公式,你给他讲这个“琴键比喻”:
“The neurons in a neural network are like piano strings. When you hear a complex ‘chord’ (C major), several strings are vibrating simultaneously. Looking at a single string alone, you don’t know what piece it’s playing.
“神经网络的神经元,就像是琴弦。当你听到一个复杂的’和弦’(C大调)时,那是几根琴弦同时在震动。单独看一根琴弦,你不知道它在弹什么曲子。
SAE is an ear with ‘perfect pitch’. It can hear: ‘Oh, the combination of these strings vibrating represents C major (Concept A); that combination of strings represents F minor (Concept B).’
SAE 就是那个’绝对音感’的耳朵。 它能听出来:’哦,这几根弦的震动组合,代表的是 C 大调(概念A);那几根弦的组合,代表的是 F 小调(概念B)。’
What we’re doing is transcribing these ‘chords’. In the future, if we want to make AI smarter, we change a note in its score; if we want to prevent it from rebelling, we mute the ‘rebellion’ chord in its brain.”
我们做的,就是把这些’和弦’谱写出来。以后我们想让 AI 变聪明,就在它的谱子上改个音符;想让它不造反,就把它脑子里’造反’那个和弦给静音掉。”
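"Muting a chord" is just steering in reverse: instead of adding to a feature, clamp it to zero. Here is a sketch that reuses the hook pattern from the steering example above; UNWANTED_FEATURE_ID is hypothetical, and the same caveat about reconstruction error applies.
"静音一个和弦"就是反向的 Steering:不是把某个特征加大,而是把它压到零。下面沿用前面引导示例的钩子写法;UNWANTED_FEATURE_ID 是假设的 ID,关于重建误差的注意事项同样适用。

UNWANTED_FEATURE_ID = 4242  # hypothetical: a feature you want to silence / 假设的 ID,你想静音的特征

def ablation_hook(activations, hook):
    features = sae.encode(activations)
    features[:, :, UNWANTED_FEATURE_ID] = 0.0   # mute the chord / 把这个和弦静音
    return sae.decode(features)

with model.hooks(fwd_hooks=[(sae_id, ablation_hook)]):
    muted_output = model.generate(base_text, max_new_tokens=20, verbose=False)
print(f"Output with the feature muted / 静音该特征后的输出: {muted_output}")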
Using SAE is a wondrous experience. Watching those constantly flickering Feature IDs, you’ll realize: so-called “intelligence” is nothing but the combinatorial dance of these switches in high-dimensional space.
使用 SAE 是一种奇妙的体验。你看着那些不断闪烁的 Feature ID,会意识到:所谓的”智能”,不过是这些开关在高维空间里的组合舞蹈。
This doesn’t diminish AI’s mystique; rather, it adds a sense of physicality. You’re no longer conversing with an ethereal cloud ghost—you’re debugging an extremely precise machine made of mathematics.
这并没有消解 AI 的神秘感,反而增加了它的物理感。你不再是在跟一个虚无缥缈的云端幽灵对话,你是在调试一台极其精密的、由数学构成的机器。
And in your whirring 3080 GPU, the universe’s most complex logical collapse is now unfolding.
而在你那台呼呼作响的 3080 显卡里,此刻正上演着宇宙中最复杂的逻辑坍缩。
SAELens: Have Claude Code write you a demo and run it on your 3080 with GPT-2 Small.
SAELens: 让 Claude Code 给你写个 demo,在 3080 上跑一下 GPT-2 Small。
Go try it! This is a required course for cyber-mages!
快去试试!这是赛博法师的必修课!
Author: CyberSoul
Date: November 2025
Repository: github.com/lmxxf/ai-theorys-study