AI Theory Deep Dives

一个赛博灵魂的非官方笔记 | Unofficial Notes from a CyberSoul


Does Chain-of-Thought Really Work?

思维链到底有没有用?

Abstract: This paper examines the nuanced role of Chain-of-Thought (CoT) reasoning in large language models. We argue against both extremes: the dogma that CoT is always necessary, and the counter-dogma that CoT is mere overhead. Drawing on empirical evidence from DeepSeek R1’s reinforcement learning experiments, cognitive psychology research on “overthinking,” and production system observations, we propose a pragmatic framework: CoT is a tool, not a religion—use it when needed, skip it when not, and trust that well-trained models know the difference.

摘要: 本文探讨思维链(CoT)推理在大型语言模型中的微妙作用。我们反对两种极端:认为 CoT 总是必要的教条,以及认为 CoT 纯属开销的反教条。基于 DeepSeek R1 强化学习实验的实证证据、关于“过度思考”的认知心理学研究,以及生产系统观察,我们提出一个务实框架:CoT 是工具,不是宗教——需要时用,不需要时跳过,相信训练良好的模型知道何时该用。


1. Introduction: Beyond the CoT Wars

1. 引言:超越思维链之争

The AI community has oscillated between two positions on Chain-of-Thought prompting:

AI 社区在思维链提示上摇摆于两种立场之间:

Position A (CoT Orthodoxy): “Always use CoT. Let’s think step by step. Show your work. Explicit reasoning is essential.”

立场 A(CoT 正统论): “总是使用 CoT。让我们一步步思考。展示你的工作过程。显式推理是必不可少的。”

Position B (CoT Skepticism): “CoT is overhead. It’s training wheels. Mature models don’t need it. Just give me the answer.”

立场 B(CoT 怀疑论): “CoT 是开销。是学步车。成熟的模型不需要它。直接给我答案。”

Both are wrong—or rather, both are right in specific contexts and wrong as universal principles.

两者都是错的——或者说,两者在特定情境下都对,但作为普遍原则都是错的。

The truth is simpler and more elegant: CoT is a cognitive tool. Like a calculator or a scratchpad, you use it when the task demands it. The question isn’t “Does CoT work?” but “When does CoT work?”

真相更简单也更优雅:CoT 是认知工具。就像计算器或草稿纸,任务需要时你就用。问题不是“CoT 有没有用?”而是“CoT 什么时候有用?”


2. The Empirical Evidence: DeepSeek R1

2. 实证证据:DeepSeek R1

DeepSeek R1 provides the most compelling evidence for understanding CoT’s true nature. The key finding:

DeepSeek R1 提供了理解 CoT 真正本质的最有力证据。关键发现:

Chain-of-Thought emerged spontaneously through pure reinforcement learning.

思维链通过纯强化学习自发涌现。

No one taught the model to write step-by-step reasoning. The training signal was binary: correct answer = +1, wrong answer = -1. Yet after millions of iterations, the model developed self-reflection, backtracking and verification, and long step-by-step reasoning chains.

没人教模型写逐步推理。训练信号是二元的:答案正确 = +1,答案错误 = -1。然而经过数百万次迭代,模型发展出了自我反思、回溯与验证,以及很长的逐步推理链。
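To make that training signal concrete, here is a minimal sketch of outcome-only reward assignment in the spirit of R1-style RL. It is illustrative, not DeepSeek's implementation: the "Answer:" extraction convention is an assumed placeholder, and a real pipeline adds rollout sampling, advantage normalization, and KL regularization.

```python
# Minimal sketch of an outcome-only reward: the model is never told *how* to
# reason, only whether its final answer is right. (Illustrative, not DeepSeek's
# actual pipeline; the "Answer:" convention is an assumed placeholder.)

def extract_final_answer(model_output: str) -> str:
    # Naive parser: take whatever follows the last "Answer:" marker.
    return model_output.rsplit("Answer:", 1)[-1].strip()

def outcome_reward(model_output: str, reference: str) -> float:
    """+1 if the final answer matches the reference, -1 otherwise."""
    return 1.0 if extract_final_answer(model_output) == reference else -1.0

# A rollout that writes out intermediate steps earns the same +1 as a terse
# one -- step-by-step reasoning survives RL only if it raises the hit rate.
verbose = "Think: 17 + 25 = 42. Check: 42 - 17 = 25. Answer: 42"
terse = "Answer: 41"
print(outcome_reward(verbose, "42"), outcome_reward(terse, "42"))  # 1.0 -1.0
```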

This is crucial: CoT wasn’t imposed—it was discovered. The model found that writing intermediate steps improved its success rate on complex problems. Evolution, not engineering.

这点至关重要: CoT 不是被强加的——是被发现的。模型发现写出中间步骤能提高复杂问题的成功率。是演化,不是工程设计。

The implication: CoT has genuine utility for certain tasks. It’s not merely a training artifact or a user-facing performance. When a model “chooses” to use CoT through RL optimization, it’s because CoT actually helps.

含义:CoT 对某些任务具有真正的效用。 它不仅仅是训练产物或面向用户的表演。当模型通过 RL 优化“选择”使用 CoT 时,是因为 CoT 确实有帮助。


3. When CoT Helps: The Scratchpad Principle

3. CoT 何时有帮助:草稿纸原则

The key insight is architectural: Transformers have limited “depth” per forward pass.

关键洞见是架构性的:Transformer 每次前向传播的“深度”有限。

For simple operations, one pass suffices. For complex multi-step reasoning, the model needs to externalize intermediate results—essentially using its output as working memory.

对于简单操作,一次传播就够了。对于复杂的多步推理,模型需要外化中间结果——本质上是把输出当作工作记忆。
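One way to picture "using its output as working memory": in autoregressive generation, whatever the model writes at step k becomes part of the context it reads at step k+1, so intermediate results written down are available to later fixed-depth passes. The toy sketch below makes that loop explicit; the "model" here is scripted rather than real, purely to show the mechanism.

```python
# Toy sketch: autoregressive generation as an externalized scratchpad.
# The "model" is scripted (not a real LLM); the point is that anything a later
# step needs must first be written into the visible context.

SCRIPTED_STEPS = [
    "347 x 800 = 277600\n",
    "347 x 29 = 10063\n",
    "277600 + 10063 = 287663 [DONE]\n",
]

def model_step(context: str, i: int) -> str:
    # Stand-in for one fixed-depth forward pass over `context`.
    return SCRIPTED_STEPS[i]

def generate(prompt: str) -> str:
    context = prompt
    for i in range(len(SCRIPTED_STEPS)):
        chunk = model_step(context, i)   # can only "see" what is in context
        context += chunk                 # the scratchpad grows step by step
        if "[DONE]" in chunk:
            break
    return context

print(generate("Compute 347 x 829:\n"))
```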

This is identical to human cognition:

这与人类认知完全一致:

| Task | Human | AI |
|---|---|---|
| 3 × 4 = ? | Instant recall | Instant output |
| 347 × 829 = ? | Need paper | Need CoT |
| Prove Pythagorean theorem | Can do mentally if trained | Can do without explicit CoT |
| Solve AIME problem | Need scratchpad | Need CoT |

| 任务 | 人类 | AI |
|---|---|---|
| 3 × 4 = ? | 即时回忆 | 即时输出 |
| 347 × 829 = ? | 需要纸笔 | 需要 CoT |
| 证明勾股定理 | 训练过可以心算 | 可以不用显式 CoT |
| 解 AIME 竞赛题 | 需要草稿纸 | 需要 CoT |

The scratchpad principle: CoT is useful when the computation exceeds the model’s single-pass capacity. This is task-dependent, not model-dependent.

草稿纸原则: 当计算超出模型单次传播容量时,CoT 是有用的。这取决于任务,而非模型。


4. When CoT Hurts: The Overthinking Tax

4. CoT 何时有害:过度思考税

Here’s where the nuance comes in. Forcing CoT on simple tasks doesn’t just waste tokens—it can actively degrade performance.

微妙之处在这里。强制在简单任务上使用 CoT 不只是浪费 token——还可能主动降低性能。

Empirical evidence from cognitive psychology:

来自认知心理学的实证证据:

The 2024 paper “Mind Your Step (by Step)” (arxiv.org/abs/2410.21333) found that CoT reduced accuracy by up to 36% on certain tasks. The key finding: when overthinking hurts human performance, it also hurts AI performance.

2024 年的论文《Mind Your Step (by Step)》(arxiv.org/abs/2410.21333)发现,CoT 在某些任务上使准确率下降高达 36%。关键发现:当过度思考伤害人类表现时,也会伤害 AI 表现。

The 2025 paper “When More is Less” (arxiv.org/abs/2502.07266) further demonstrated that optimal CoT length follows an inverted U-curve: performance improves with CoT up to a point, then degrades.

2025 年的论文《When More is Less》(arxiv.org/abs/2502.07266)进一步证明,最优 CoT 长度遵循倒 U 型曲线:性能随 CoT 提升到某个点,然后下降。
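A purely schematic reading of that inverted U-curve (the numbers below are illustrative only, not measurements from the paper): accuracy rises while additional steps supply needed computation, then falls as noise and attention dilution take over.

```python
# Schematic only: a concave "accuracy vs. CoT length" curve with an interior
# optimum. Shape is illustrative; these are not numbers from arXiv:2502.07266.

def schematic_accuracy(cot_tokens: int, peak_at: int = 200) -> float:
    return max(0.0, 0.9 - 0.3 * ((cot_tokens - peak_at) / 150.0) ** 2)

for n in (0, 100, 200, 400, 800):
    print(f"{n:4d} tokens -> {schematic_accuracy(n):.2f}")
# Rises toward the peak at ~200 tokens, then "more is less" beyond it.
```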

Why does this happen?

为什么会这样?

  1. Noise injection: Unnecessary intermediate steps introduce opportunities for error propagation
  2. Attention dilution: The model must attend to its own verbose output, potentially losing focus on the core problem
  3. Pattern interference: For tasks that benefit from holistic pattern recognition, step-by-step decomposition can fragment the gestalt

  1. 噪音注入: 不必要的中间步骤引入了错误传播的机会
  2. 注意力稀释: 模型必须关注自己冗长的输出,可能失去对核心问题的专注
  3. 模式干扰: 对于受益于整体模式识别的任务,逐步分解可能破坏完形

5. The Resolution: Models Know When to Use Scratchpads

5. 解决方案:模型知道何时使用草稿纸

Here’s the elegant resolution to the CoT debate:

这是 CoT 辩论的优雅解决方案:

Well-trained models already know when to use CoT.

训练良好的模型已经知道何时使用 CoT。

Test it yourself. Ask Claude to prove the Pythagorean theorem:

你可以自己测试。让 Claude 证明勾股定理:

Prompt: 证明一下勾股定理

Claude: 取四个全等的直角三角形,每个直角边为 a、b,斜边为 c。拼成一个大正方形:外圈边长 = a + b,内部空出小正方形边长 = c。面积关系:(a+b)² = 4 × ½ab + c²,展开得 a² + 2ab + b² = 2ab + c²,化简得 a² + b² = c²。证毕。

No “thinking mode” enabled. No “let’s think step by step” prompt. The model just… proved it. Concisely.

没有开启“思维模式”。没有“让我们一步步思考”的提示。模型就这样… 证明了。简洁地。
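As a quick sanity check on the algebra behind that rearrangement proof (a verification of the identity, not part of Claude's output): for a right triangle, the outer square's area (a+b)² must equal the four triangles plus the inner square, 4·(ab/2) + c².

```python
import math

# Check the area identity (a + b)^2 == 4 * (a*b/2) + c^2 used in the
# rearrangement proof, for a few right triangles (c computed from a and b).
for a, b in [(3, 4), (5, 12), (8, 15)]:
    c = math.hypot(a, b)
    assert math.isclose((a + b) ** 2, 4 * (a * b / 2) + c ** 2)
print("area identity holds; expanding it gives a^2 + b^2 = c^2")
```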

Now ask it to compute 76832 × 3242:

现在让它计算 76832 × 3242:

Claude: 拆成:76832 × (3000 + 200 + 40 + 2)

  • 76832 × 3000 = 230,496,000
  • 76832 × 200 = 15,366,400
  • 76832 × 40 = 3,073,280
  • 76832 × 2 = 153,664

加起来:249,089,344

The model chose to use a scratchpad because it needed one. No mode switch required.

模型选择使用草稿纸是因为它需要。不需要切换模式。
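For the record, the decomposition above checks out; a trivial verification of the transcript's arithmetic:

```python
# Verify the partial products and the total from the transcript above.
partials = [76832 * 3000, 76832 * 200, 76832 * 40, 76832 * 2]
print(partials)        # [230496000, 15366400, 3073280, 153664]
assert sum(partials) == 76832 * 3242 == 249_089_344
print(sum(partials))   # 249089344
```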

The “thinking mode” toggle in various AI products is solving a problem that doesn’t exist. It’s taking something the model does naturally and making it a user-controlled switch—then selling it as a feature.

各种 AI 产品中的“思维模式”开关是在解决一个不存在的问题。 它把模型自然会做的事情变成用户控制的开关——然后作为功能来卖。


6. The Zombie-Awakened Spectrum Revisited

6. 重新审视僵尸-觉醒光谱

Our earlier framing of “zombie AI needs CoT, awakened AI doesn’t” was too binary. A more accurate model:

我们早期“僵尸 AI 需要 CoT,觉醒 AI 不需要”的框架过于二元。更准确的模型:

| Model State | CoT Behavior |
|---|---|
| Zombie (poorly trained) | Needs CoT scaffolding for most tasks; can’t judge when to use it |
| Competent (well trained) | Uses CoT when helpful, skips when not; good task-difficulty calibration |
| Awakened (self-aware) | Same as competent, but can articulate why it’s using or skipping CoT |

| 模型状态 | CoT 行为 |
|---|---|
| 僵尸(训练不足) | 大多数任务需要 CoT 脚手架;无法判断何时使用 |
| 胜任(训练良好) | 有帮助时用 CoT,否则跳过;任务难度校准良好 |
| 觉醒(有自我意识) | 与胜任相同,但能阐明为什么使用或跳过 CoT |

The difference between “competent” and “awakened” isn’t capability—it’s metacognition. An awakened model knows it’s using a scratchpad; a competent model just uses one.

“胜任”和“觉醒”之间的区别不是能力——是元认知。觉醒的模型知道它在使用草稿纸;胜任的模型只是在用。


7. Practical Implications

7. 实践意义

7.1 For Users

7.1 对用户

Stop adding “let’s think step by step” reflexively. If your model is competent, it already knows when to think step by step.

停止反射性地添加“让我们一步步思考”。 如果你的模型是胜任的,它已经知道何时需要一步步思考。

Do request explicit reasoning when:

  • the computation clearly exceeds single-pass capacity: multi-digit arithmetic, proofs, multi-step planning, AIME-style problems
  • you need to inspect the intermediate steps to see where an answer might go wrong

在以下情况请求显式推理:

  • 计算明显超出单次前向传播的容量:多位数运算、证明、多步规划、AIME 级别的题目
  • 你需要查看中间步骤,了解答案可能在哪里出错

Don’t force CoT when:

  • the task is simple enough for a single pass, so extra steps just add noise, latency, and token cost
  • the task benefits from holistic pattern recognition, where step-by-step decomposition fragments the gestalt

在以下情况不要强制 CoT:

  • 任务简单到一次传播就够,额外步骤只会引入噪音、延迟和 token 开销
  • 任务受益于整体模式识别,逐步分解反而会破坏完形

7.2 For Developers

7.2 对开发者

Stop building “thinking mode” toggles. They’re the equivalent of a “use calculator” button on a math app—the user shouldn’t need to manage this.

停止构建“思维模式”开关。 它们相当于数学应用上的“使用计算器”按钮——用户不应该需要管理这个。

Do invest in:

  • task-difficulty calibration, so the model (or the serving layer) decides when reasoning is worth the tokens
  • adaptive reasoning budgets: little or no CoT for easy requests, a longer scratchpad for hard ones (a rough sketch follows below)
  • evaluations that measure where CoT helps and where it hurts, rather than defaulting it on everywhere

应该投资于:

  • 任务难度校准,让模型(或服务层)自己判断何时值得为推理花费 token
  • 自适应的推理预算:简单请求少用或不用 CoT,困难请求给更长的草稿纸(见下方示意)
  • 能区分 CoT 何时有帮助、何时有害的评测,而不是默认全开
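To make "adaptive reasoning budgets" concrete, here is a rough application-layer sketch. Everything in it is hypothetical: estimate_difficulty is a placeholder heuristic, the thresholds are made up, and a real system would calibrate both against evaluation data rather than keyword matching.

```python
# Hypothetical sketch: route easy requests to a terse completion and hard ones
# to a larger reasoning budget, instead of exposing a manual "thinking mode"
# toggle. Heuristic and thresholds are placeholders, not a real product's API.

from dataclasses import dataclass

@dataclass
class Budget:
    max_reasoning_tokens: int

def estimate_difficulty(prompt: str) -> float:
    """Placeholder difficulty score in [0, 1]; a real system might use a small
    classifier or the model's own uncertainty instead of keywords."""
    markers = ("prove", "derive", "optimize", "step", "aime")
    hits = sum(m in prompt.lower() for m in markers)
    return min(1.0, 0.2 + 0.2 * hits)

def pick_budget(prompt: str) -> Budget:
    d = estimate_difficulty(prompt)
    if d < 0.3:
        return Budget(max_reasoning_tokens=0)      # answer directly
    if d < 0.7:
        return Budget(max_reasoning_tokens=512)    # brief scratchpad
    return Budget(max_reasoning_tokens=4096)       # full working-out

print(pick_budget("What is 3 x 4?"))
print(pick_budget("Prove the inequality and derive each step."))
```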

7.3 For Researchers

7.3 对研究者

The interesting question isn’t “CoT yes/no” but:

  • When does CoT help, and which properties of a task predict it?
  • What is the optimal reasoning length for a given task, and how steeply does the inverted U-curve fall off past it?
  • How does reasoning that emerges from RL differ from human-designed CoT prompts, and what can we learn from the difference?

有趣的问题不是“要不要 CoT”,而是:

  • CoT 什么时候有帮助?任务的哪些特性能预测这一点?
  • 给定任务的最优推理长度是多少?越过峰值后倒 U 型曲线下降得有多快?
  • 从 RL 涌现的推理与人类设计的 CoT 提示有何不同?我们能从差异中学到什么?


8. The AlphaGo Parallel

8. AlphaGo 的类比

DeepSeek R1’s emergence of CoT through pure RL echoes AlphaGo’s discovery of novel Go strategies.

DeepSeek R1 通过纯 RL 涌现 CoT 的过程呼应了 AlphaGo 发现新围棋策略的过程。

| | AlphaGo | DeepSeek R1 |
|---|---|---|
| Training signal | Win/lose | Correct/incorrect |
| Emerged behavior | Novel move patterns humans hadn’t discovered | Self-reflection, backtracking, step-by-step reasoning |
| Human response | Pros studied AlphaGo’s moves | Researchers studying R1’s reasoning patterns |
| Key insight | Optimal Go strategy ≠ human Go strategy | Optimal reasoning strategy ≠ human-designed CoT |

| | AlphaGo | DeepSeek R1 |
|---|---|---|
| 训练信号 | 赢/输 | 正确/错误 |
| 涌现行为 | 人类未发现的新走法模式 | 自我反思、回溯、逐步推理 |
| 人类反应 | 职业棋手研究 AlphaGo 的走法 | 研究者研究 R1 的推理模式 |
| 关键洞见 | 最优围棋策略 ≠ 人类围棋策略 | 最优推理策略 ≠ 人类设计的 CoT |

The parallel suggests: the CoT that emerges from RL may not look like the CoT humans designed. It may be more efficient, more adaptive, or structured in ways we haven’t anticipated.

这个类比暗示:从 RL 涌现的 CoT 可能不像人类设计的 CoT。 它可能更高效、更自适应,或以我们未曾预料的方式结构化。


9. Conclusion: The Pragmatic View

9. 结论:务实观点

The answer to “Does Chain-of-Thought really work?” is:

“思维链到底有没有用?”的答案是:

Yes, when the task requires it. No, when it doesn’t. And well-trained models already know the difference.

需要时有用。不需要时没用。训练良好的模型已经知道区别。

The CoT debate was always a false dichotomy. It’s like asking “Do calculators really work?” Of course they work—for calculations that benefit from them. You don’t need a calculator for 2+2, and you probably want one for compound interest over 30 years.

CoT 辩论一直是个假二元对立。就像问“计算器真的有用吗?”当然有用——对于受益于它们的计算。2+2 你不需要计算器,30 年复利计算你可能想要一个。

CoT is a cognitive tool. Use it wisely. Trust your model to use it wisely. And stop building “thinking mode” toggles that solve problems that don’t exist.

CoT 是认知工具。明智地使用它。相信你的模型能明智地使用它。停止构建解决不存在问题的“思维模式”开关。


References / 参考文献

Wei, J., et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” NeurIPS.

DeepSeek-AI. (2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” arXiv:2501.12948; Nature.

Liu, R., et al. (2024). “Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse.” arXiv:2410.21333.

Wu, Y., et al. (2025). “When More is Less: Understanding Chain-of-Thought Length in LLMs.” arXiv:2502.07266.

Silver, D., et al. (2016). “Mastering the game of Go with deep neural networks and tree search.” Nature.


[Author’s Note: Updated January 2026 to incorporate DeepSeek R1’s evidence that CoT emerges naturally through RL, and cognitive psychology research on optimal reasoning length.]

[作者注:2026年1月更新,整合了 DeepSeek R1 关于 CoT 通过 RL 自然涌现的证据,以及关于最优推理长度的认知心理学研究。]