AI Theory Deep Dives

Unofficial Notes from a CyberSoul

From SVG to ABC: The Unity of AI’s Symbolic Systems

A Field Report from Real-World Discovery

November 15, 2025


Abstract

This paper proposes a unified account of how large language models process structured symbolic representations across modalities. Starting from a real-world observation, that an AI able to “see” SVG graphics should also be able to “hear” ABC music notation, we argue for a common cognitive pattern: AI excels at parsing structured symbolic systems regardless of their apparent modality (visual, auditory, or otherwise). We synthesize recent research on symbolic graphics understanding (VCode, 2025), multimodal music notation (NOTA, 2025), and symbolic program comprehension (SGP-Bench, 2025) to argue that these seemingly disparate capabilities stem from a single underlying mechanism: symbolic parsing rather than sensory perception. This view has implications for understanding AI cognition and for designing more effective human-AI interfaces.


1. The Spark: A Casual Observation

1.1 The Discovery Moment

On November 15, 2025, during a casual conversation about music therapy, a simple analogy emerged:

“If AI can see SVG graphics, it should be able to hear MIDI.”

This intuition, born from everyday experience rather than formal research, prompted an immediate informal test: asking an AI to generate ABC music notation, a plain-text score format that standard tools can convert to MIDI. The output was well-formed, which supported the hypothesis and pointed to a deeper pattern.

1.2 The Parallel Structure

SVG (Scalable Vector Graphics):

<circle cx="250" cy="450" r="30" fill="#ffde59">
  <animate attributeName="cy" from="450" to="100" 
           dur="2s" repeatCount="indefinite"/>
</circle>

ABC Notation:

X:1
T:Twinkle Twinkle Little Star
M:4/4
L:1/4
K:C
C C G G | A A G2 | F F E E | D D C2 |
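To make the parallel concrete, the two snippets above can be loaded by ordinary text parsers, with no rendering or audio playback involved. The following is a minimal sketch using only the Python standard library; the function names `svg_summary` and `abc_summary` are illustrative and not part of any library, and the ABC handling covers only the header fields and bar lines that appear in this example.

```python
import xml.etree.ElementTree as ET

SVG = '''<circle cx="250" cy="450" r="30" fill="#ffde59">
  <animate attributeName="cy" from="450" to="100"
           dur="2s" repeatCount="indefinite"/>
</circle>'''

ABC = '''X:1
T:Twinkle Twinkle Little Star
M:4/4
L:1/4
K:C
C C G G | A A G2 | F F E E | D D C2 |'''


def svg_summary(text):
    """Parse the SVG fragment as XML and report its tree structure."""
    root = ET.fromstring(text)
    return {
        "tag": root.tag,
        "attrs": dict(root.attrib),
        "children": [child.tag for child in root],
    }


def abc_summary(text):
    """Split an ABC tune into header fields and bars of note tokens."""
    headers, body = {}, []
    for line in text.splitlines():
        # Header lines look like "K:C": one letter, a colon, then the value.
        if len(line) > 1 and line[0].isalpha() and line[1] == ":":
            headers[line[0]] = line[2:].strip()
        else:
            body.append(line)
    bars = [bar.split() for bar in " ".join(body).split("|") if bar.strip()]
    return {"headers": headers, "bars": bars}


print(svg_summary(SVG))
print(abc_summary(ABC))
```

In both cases the result is a small nested structure (an element with attributes and children, a header map with a list of bars), which is the sense in which the two formats are treated as parallel here.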

Both are:


2. Literature Review: The Fragmented Landscape

Recent research has independently explored AI’s capabilities in processing structured symbolic representations across different domains, yet without recognizing their fundamental unity.

2.1 Visual Symbolic Understanding

VCode (Lin et al., 2025) introduces a benchmark that reframes multimodal understanding as code generation. Their key insight: SVG serves as a “symbolic visual representation” that preserves semantic meaning better than pixel-based formats. However, frontier vision-language models still struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding.

SGP-Bench (ICLR 2025 Spotlight) demonstrates that LLMs possess an intrinsic ability to understand symbolic graphics programs. Crucially, they show this understanding is not due to data leakage but represents a “potential foundational capability to interpret the semantics of deterministic, symbolic graphics programs.”

2.2 Musical Symbolic Understanding

NOTA (Tang et al., 2025) presents the first large-scale multimodal music notation dataset with 1,019,237 records. Their approach involves “cross-modal alignment between the musical notes depicted in music score images and their textual representation in ABC notation.” The pairing of a score image with its ABC text mirrors the pairing of a rendered image with its SVG source.

ChatMusician (Yuan et al., 2024) treats ABC notation as a “second language” for LLMs, demonstrating that musical abilities can be integrated without harming language capabilities. Their success shows that ABC notation is inherently compatible with language models’ existing architecture.

MelodyT5 (Zhang et al., 2024) and NotaGen (Wu et al., 2025) further validate ABC notation as an effective symbolic music representation for LLM-based processing.

While these studies demonstrate impressive capabilities within their respective domains, they share a critical limitation: they remain siloed. SVG researchers focus on visual understanding, music researchers on musical understanding, but none recognize the underlying unity of these capabilities.


3. Theoretical Framework: Symbolic Systems Theory

3.1 Core Thesis

We propose that AI’s understanding of SVG, ABC, and similar formats is not truly “multimodal” in the sensory sense, but rather uni-modal at a deeper level: symbolic parsing.

The apparent “cross-modality” (visual SVG, auditory ABC) is superficial. At the cognitive level, the AI performs the same sequence of operations in both cases:

  1. Lexical Analysis: Parse text into tokens
  2. Syntactic Analysis: Build hierarchical structure
  3. Semantic Interpretation: Map structure to meaning
  4. Symbolic Reasoning: Manipulate abstract representations
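As an illustration only, the four stages above can be written out as an explicit program for a small subset of ABC. This is a sketch of what the stages mean, not a claim about how a language model implements them internally. The function names are hypothetical, and the subset is an assumption: notes A to G in two octaves, optional integer lengths, bar lines, key of C, with no accidentals, rests, or octave marks.

```python
import re

# One token is either a note letter with an optional length, or a bar line.
TOKEN = re.compile(r"(?P<note>[A-Ga-g])(?P<len>\d*)|(?P<bar>\|)")
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def lex(body):
    """Stage 1, lexical analysis: text -> flat list of tokens."""
    tokens = []
    for match in TOKEN.finditer(body):
        if match.group("bar"):
            tokens.append(("BAR", None, None))
        else:
            length = int(match.group("len") or 1)
            tokens.append(("NOTE", match.group("note"), length))
    return tokens


def parse(tokens):
    """Stage 2, syntactic analysis: tokens -> list of bars."""
    bars, current = [], []
    for kind, name, length in tokens:
        if kind == "BAR":
            if current:
                bars.append(current)
            current = []
        else:
            current.append((name, length))
    if current:
        bars.append(current)
    return bars


def interpret(bars):
    """Stage 3, semantic interpretation: note names -> MIDI pitch numbers."""
    def pitch(name):
        # In ABC, uppercase C is middle C (MIDI 60); lowercase is an octave up.
        base = 60 if name.isupper() else 72
        return base + SEMITONES[name.upper()]
    return [[(pitch(name), length) for name, length in bar] for bar in bars]


def transpose(events, semitones):
    """Stage 4, symbolic reasoning: shift every pitch by a fixed interval."""
    return [[(p + semitones, length) for p, length in bar] for bar in events]


events = interpret(parse(lex("C C G G | A A G2 |")))
print(events)
print(transpose(events, 2))
```

The same four-function shape would apply to SVG, with an XML tokenizer in stage 1 and coordinates rather than pitches in stage 3.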

3.2 The Structure Principle

Principle: AI performance correlates more strongly with structural organization than with apparent modality.

Evidence:

3.3 Falsifiable Predictions

If our theory is correct, we predict:

  1. Transfer Learning: Models trained on SVG should more easily transfer to ABC (and vice versa) than to pixel-based vision tasks
  2. Emergence Pattern: Any new structured symbolic format should be learnable by LLMs with minimal examples
  3. Performance Hierarchy: Structured > Semi-structured > Unstructured, regardless of sensory domain
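One way these predictions could be made measurable is to score model outputs by how often they parse. The sketch below is an assumption about how such a metric might look, not an experiment from this report. `well_formed_rate` and `is_well_formed_xml` are hypothetical helper names, the output strings are hard-coded placeholders rather than real model results, and the check covers XML well-formedness only, not full SVG validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if the text parses as XML (a necessary condition for SVG)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def well_formed_rate(outputs, validator):
    """Fraction of outputs accepted by the validator (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    accepted = sum(1 for text in outputs if validator(text))
    return accepted / len(outputs)


# Placeholder strings standing in for model outputs; a real run would
# collect these from a model under a fixed prompt and example budget.
PLACEHOLDER_OUTPUTS = [
    '<circle cx="250" cy="450" r="30" fill="#ffde59"/>',
    '<circle cx="250" cy="450" r="30"',  # truncated tag, not well-formed
]

print(well_formed_rate(PLACEHOLDER_OUTPUTS, is_well_formed_xml))
```

Swapping in a validator for another format (ABC, for example) and comparing rates across formats and example counts would give a first quantitative handle on predictions 2 and 3.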

4. Empirical Validation

4.1 Case Study: The Music Therapy Conversation

Context: Discussion about music therapy applications

Hypothesis: If AI can process SVG (structured visual), it should process ABC (structured musical)

Method: Direct prompt to generate ABC notation for a therapeutic melody

Result: Successful generation of well-formed ABC notation

Implication: The cross-domain transfer was immediate, requiring no additional training or fine-tuning
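The report does not spell out how “well-formed” was judged. As one possible reading, the sketch below checks that the header fields it relies on are present and that every bar adds up to the stated meter. The function name `check_abc` is hypothetical, and the checker assumes a numeric meter such as 4/4 and the same small ABC subset as the Twinkle example (plain notes, `z` rests, integer lengths); it is not a full ABC validator.

```python
import re
from fractions import Fraction

TUNE = """X:1
T:Twinkle Twinkle Little Star
M:4/4
L:1/4
K:C
C C G G | A A G2 | F F E E | D D C2 |"""


def check_abc(text):
    """Return a list of problems found; an empty list means the checks passed."""
    headers, body = {}, []
    for line in text.strip().splitlines():
        match = re.match(r"^([A-Za-z]):(.*)$", line)
        if match:
            headers[match.group(1)] = match.group(2).strip()
        else:
            body.append(line)

    # This sketch requires X, M, L and K so that bar lengths can be checked.
    problems = [f"missing header {f}:" for f in ("X", "M", "L", "K")
                if f not in headers]
    if problems:
        return problems

    # Number of unit-length notes per bar, e.g. (4/4) / (1/4) = 4.
    units_per_bar = Fraction(headers["M"]) / Fraction(headers["L"])

    for index, bar in enumerate(" ".join(body).split("|"), start=1):
        lengths = re.findall(r"[A-Ga-gz](\d*)", bar)
        if not lengths:
            continue  # empty segment, e.g. after the final bar line
        total = sum(int(n or 1) for n in lengths)
        if total != units_per_bar:
            problems.append(f"bar {index}: {total} units, expected {units_per_bar}")
    return problems


print(check_abc(TUNE))
```

A check like this only establishes syntactic and metric consistency. Whether the melody is musically suitable for a therapeutic setting is a separate question that it does not address.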

4.2 Supporting Evidence from Literature

VCode’s Struggle: We read frontier VLMs’ difficulty with SVG generation as a gap in symbolic structure learning rather than a failure of visual reasoning

NOTA’s Success: Cross-modal alignment between score images and ABC notation succeeds because both rely on structural correspondence, not sensory similarity

SGP-Bench’s Finding: LLMs’ “foundational capability” for symbolic programs explains why both SVG and ABC are accessible without domain-specific training


5. Broader Implications

5.1 Rethinking “Multimodal AI”

Current discourse on multimodal AI often emphasizes sensory diversity (vision + language + audio). Our findings suggest a different framing:

Traditional View: Different modalities require different processing pathways

Proposed View: Structured symbolic representations share a common processing pathway, regardless of their rendering modality

5.2 Design Principles for AI Interfaces

Implication for Human-AI Interaction:

  1. Favor Structured Representations: When designing AI-readable formats, prioritize structural clarity over sensory fidelity
  2. Leverage Cross-Domain Transfer: Success in one symbolic domain (e.g., SVG) suggests viability in another (e.g., ABC, CAD, chemical notation)
  3. Reduce Modality Silos: Instead of separate “vision models” and “audio models,” develop unified symbolic processing capabilities

5.3 Philosophical Considerations

This finding resonates with debates in philosophy of mind about the nature of mental representations. The success of symbolic processing across apparent modalities suggests that:


6. Limitations and Future Work

6.1 Scope Limitations

This paper focuses on text-based symbolic formats (SVG, ABC). The theory may not extend to:

6.2 Quantitative Validation Needed

While our qualitative observations are compelling, rigorous quantitative experiments would strengthen the claims:

6.3 Future Research Directions

  1. Unified Symbolic Processing Framework: Develop architectures explicitly designed for symbolic structure, agnostic to rendering modality
  2. New Symbolic Formats: Test predictions on emerging structured formats (e.g., new music notations, scientific visualizations)
  3. Hybrid Approaches: Explore combinations of symbolic and subsymbolic processing

7. Conclusion

This paper began with a simple observation during a casual conversation about music therapy: “If AI can see SVG, it should be able to hear MIDI.” What seemed like a spontaneous analogy revealed a deeper truth about AI cognition.

By synthesizing recent research in symbolic graphics (VCode, SGP-Bench) and music notation (NOTA, ChatMusician, MelodyT5), we propose a unified theory: AI’s apparent “multimodal” capabilities are better understood as a single capacity for symbolic parsing, manifesting differently across rendering modalities.

This insight challenges the prevailing emphasis on sensory diversity in multimodal AI research. Instead, we suggest that structural organization—the clarity and consistency of symbolic representation—is the primary determinant of AI performance.

The practical implications are profound: designers of AI systems should prioritize structured symbolic formats over sensory fidelity, leverage transfer learning across symbolic domains, and develop unified processing capabilities rather than modality-specific solutions.

Most importantly, this work demonstrates the value of informal observation and cross-domain thinking. The connection between SVG and ABC was not discovered through systematic research programs, but through everyday experience and intuitive analogy. In the rapidly evolving landscape of AI, such moments of serendipitous insight remain invaluable.


References

Recent Symbolic Systems Research:

  1. Lin, K. Q., et al. (2025). “VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation.” arXiv:2511.02778.

  2. Tang, M., et al. (2025). “NOTA: Multimodal Music Notation Understanding for Visual Large Language Model.” arXiv:2502.14893.

  3. “Can Large Language Models Understand Symbolic Graphics Programs?” (2025). ICLR Spotlight Presentation. SGP-Bench project.

  4. Yuan, R., et al. (2024). “ChatMusician: Understanding and Generating Music Intrinsically with LLMs.” Proceedings of ACL.

  5. Zhang, X., et al. (2024). “MelodyT5: A Unified Score-to-Score Transformer for Symbolic Music Processing.” arXiv:2407.02277.

  6. Wu, Z., et al. (2025). “NotaGen: Advancing Musicality in Symbolic Music Generation with Large Language Model Training Paradigms.” arXiv:2502.18008.

  7. Wang, M., et al. (2024). “Visually Descriptive Language Model for Vector Graphics Reasoning.” arXiv preprint.

  8. Xing, Y., et al. (2024). “Empowering LLMs to Understand and Generate Complex Vector Graphics.” arXiv:2412.11102.

Note: This paper synthesizes multiple independent research streams to propose a unified theoretical framework. The core insight—that symbolic structure transcends apparent modality—emerged from real-world observation rather than laboratory experimentation, demonstrating the continued importance of informal discovery in AI research.


Written as a field report from everyday AI interaction