Imitation Learning · Plate Nº 61

SmolVLA

6 min read · 2004 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public material; the full paper has not been read.

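In one sentence (TL;DR)

A small robot model from Hugging Face: it packs seeing, listening, and acting into a compact model that can be trained on a single gaming GPU, so people without a data center can experiment with embodied AI at home.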

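What kind of scenario is this: an everyday analogy

You tell your roommate, "Put the red cup in the drawer for me." The roommate has to do three things: glance over to find the cup, work out which one you mean, and reach out, pick it up, and put it away. Seeing, listening, and acting: in robotics these are the vision, language, and action trio.

Until now, training this kind of instruction-following robot was mostly reserved for players at the Michelin-chef level: you needed the professional kitchen (a data center with hundreds of H100s), the exclusive ingredients (privately accumulated data), and weeks of slow simmering. Ordinary people who wanted to follow along could not even get near the stove.

What SmolVLA wants to do is more like putting a small oven in your home: affordable (a single RTX 4090 is enough), compact (it can even run on a laptop GPU), and cooking from recipes the community shares (public datasets). What comes out may not taste like Michelin, but "you can bake at home" becomes true for the first time.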

Plate Nº I. SmolVLA, scene illustration: the real-world problem this paper addresses

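How prior work approached it

  • RT-2 (Google, 2023): fine-tunes a large VLM (vision-language model) directly into a VLA, at up to 55B parameters; it needs Google's internal TPU clusters, so the community cannot reproduce it
  • OpenVLA (2024): an open-source attempt with 7B parameters, but training still needs multiple A100s, which is a high barrier
  • Octo / RT-1 family: fewer parameters but more complex architectures; pretraining leans on large aggregated datasets such as Open X-Embodiment, collected mostly by research labs rather than by the community
  • Shared pain points: large models → slow inference, expensive training, hard for the community to reproduce; private data → hard to transfer to your own robot arm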
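
The "large model → expensive training" pain point can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes bf16 weights (2 bytes per parameter) for inference and a common mixed-precision Adam rule of thumb of about 16 bytes per parameter for full fine-tuning (bf16 weights and gradients, an fp32 master copy, and two fp32 optimizer moments). Activations are ignored, and the 0.5B entry is a hypothetical small model chosen for illustration, not a figure taken from the paper.

```python
# Rough memory estimate: parameters x bytes per parameter.
# All byte counts are rule-of-thumb assumptions, not measured values.

BYTES_INFERENCE = 2    # bf16 weights only
BYTES_TRAINING = 16    # bf16 weights + bf16 grads + fp32 master + 2 fp32 Adam moments
GPU_MEMORY_GB = 24     # typical consumer card such as an RTX 4090

MODELS = {
    "RT-2 (55B)": 55e9,
    "OpenVLA (7B)": 7e9,
    "hypothetical small VLA (0.5B)": 0.5e9,  # illustrative size only
}


def memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory in GB (10^9 bytes) for a given parameter count."""
    return n_params * bytes_per_param / 1e9


def report(models=MODELS, gpu_gb=GPU_MEMORY_GB):
    """Return (name, inference GB, training GB, fits for inference, fits for training)."""
    rows = []
    for name, n_params in models.items():
        infer = memory_gb(n_params, BYTES_INFERENCE)
        train = memory_gb(n_params, BYTES_TRAINING)
        rows.append((name, infer, train, infer <= gpu_gb, train <= gpu_gb))
    return rows


if __name__ == "__main__":
    for name, infer, train, fits_infer, fits_train in report():
        print(f"{name}: inference {infer:.0f} GB, training {train:.0f} GB, "
              f"fits 24 GB card: inference={fits_infer}, training={fits_train}")
```

Under these assumptions a 7B model fits a 24 GB card for inference but not for full fine-tuning, which is why model size alone already decides who can train.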

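The key idea of this paper

The core bet, in one sentence: others run five-star hotels, I run a neighborhood diner. The ingredients are ordinary and the kitchen is small, but the whole street can afford it, and the food turns out to be decent.

For robots, the bet is that "small and well-made, plus community data" also works in embodied AI. Concretely it does three things (inferred from the abstract; details need the original paper):

  1. Architecture compression: like condensing a thick dictionary into a pocket edition. Techniques such as distillation, a shared backbone, or layer skipping shrink the VLA until a single consumer GPU (such as an RTX 4090) can train and run it
  2. Data democratization: the recipes are not locked in the kitchen. All training data comes from demonstration episodes published on community platforms such as LeRobot, with no private data mixed in
  3. Staying usable: a small diner still has to serve food. Validation on standard benchmarks (such as LIBERO or custom tasks) shows the small model can actually grasp, place, and follow instructions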
Plate Nº II. SmolVLA, method illustration: the core pipeline

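How it works (method)

Overall architecture: think of a three-person interpreting team. One looks at the picture (the vision encoder, which turns images into tokens), one listens (the language encoder, which turns the instruction into tokens), and one acts (the action decoder, which outputs joint angles or end-effector poses for each time step). Hugging Face already has SmolLM for language and SmolVLM for vision-language; SmolVLA adds the team member who acts, completing the team.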
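
The three-member team can be sketched as a toy forward pass to make the data flow and tensor shapes visible. Everything here is an illustrative assumption: the weights are random, the layer sizes are invented, and the mean-pooling decoder is a stand-in, so this shows the shape of a VLA pipeline rather than SmolVLA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 32           # shared token width (assumed)
PATCH = 8        # patch side length in pixels (assumed)
VOCAB = 100      # toy vocabulary size (assumed)
ACTION_DIM = 6   # e.g. joint targets of a 6-DoF arm (assumed)
HORIZON = 4      # number of future steps predicted at once (assumed)

# Random stand-ins for learned weights.
W_vision = rng.normal(scale=0.02, size=(PATCH * PATCH * 3, D))
E_text = rng.normal(scale=0.02, size=(VOCAB, D))
W_action = rng.normal(scale=0.02, size=(D, HORIZON * ACTION_DIM))


def encode_image(image):
    """Vision encoder: cut an (H, W, 3) image into patches, project each to a token.

    H and W must be multiples of PATCH.
    """
    h, w, c = image.shape
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)
    return patches @ W_vision                      # (num_patches, D)


def encode_instruction(token_ids):
    """Language encoder: look up one embedding per instruction token."""
    return E_text[np.asarray(token_ids)]           # (num_tokens, D)


def decode_actions(tokens):
    """Action decoder: pool all tokens, then map to a short action sequence."""
    pooled = tokens.mean(axis=0)                   # (D,)
    return (pooled @ W_action).reshape(HORIZON, ACTION_DIM)


def policy(image, token_ids):
    """See + listen + act: concatenate both token streams, then decode actions."""
    tokens = np.concatenate([encode_image(image), encode_instruction(token_ids)], axis=0)
    return decode_actions(tokens)


if __name__ == "__main__":
    actions = policy(np.zeros((32, 32, 3)), [5, 17, 42])
    print(actions.shape)
```

A real model would replace each function with a pretrained network; the point is only that images and text both become tokens before the action decoder sees them.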

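Parameter compression: like a senior apprentice teaching a junior one, condensing a lifetime of skill into a leaner frame. Common moves include distilling a student from a large VLM teacher model; freezing the vision/language backbone and training only the action head; or sparse activation through MoE-like routing. Which of these SmolVLA uses has to be confirmed in the paper.

Hold on a moment: what is "distillation"? Imagine the large model is an encyclopedia and the small model is a set of flashcards. Distillation means the small model does not memorize the original text; it copies the large model's "answer plus confidence score" for every question. What it learns is not the fixed answers but the large model's habits of judgment, so it ends up small yet close in flavor.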
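
The "answer plus confidence score" idea corresponds to classic soft-label distillation, sketched below in NumPy. This is a generic textbook formulation with an assumed temperature of 2.0; whether SmolVLA uses distillation at all is not confirmed by this note.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, softened by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The student is pushed to match the teacher's full confidence profile,
    not just its top answer. The T^2 factor is the usual gradient rescaling.
    """
    p_teacher = np.clip(softmax(teacher_logits, temperature), 1e-12, None)
    p_student = np.clip(softmax(student_logits, temperature), 1e-12, None)
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)), axis=-1)
    return float(kl.mean() * temperature ** 2)


if __name__ == "__main__":
    teacher = np.array([[4.0, 1.0, 0.5]])       # confident in class 0, some mass elsewhere
    good_student = np.array([[3.8, 1.1, 0.4]])  # similar confidence profile
    bad_student = np.array([[0.5, 4.0, 1.0]])   # confident in the wrong class
    print(distillation_loss(good_student, teacher))
    print(distillation_loss(bad_student, teacher))
```

A higher temperature flattens both distributions, so the student pays more attention to how the teacher ranks the wrong answers.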

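Action output: there are two ways to produce actions. One is to write actions the way you write text: cut continuous joint angles into discrete tokens and have the model emit them one by one. The other is to draw actions the way you draw a picture: use diffusion or flow matching to generate a continuous action trajectory directly. Each route has trade-offs; which one SmolVLA takes, and how the two compare, should be checked in the paper.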
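
The "write actions like text" route can be illustrated with uniform binning. The sketch assumes actions are already normalized to [-1, 1] and uses 256 bins, a common choice in discretized policies; both are assumptions for illustration, not SmolVLA's settings.

```python
import numpy as np

N_BINS = 256            # size of the action vocabulary (assumed)
LOW, HIGH = -1.0, 1.0   # normalized action range (assumed)


def actions_to_tokens(actions):
    """Map continuous actions in [LOW, HIGH] to integer bin indices 0..N_BINS-1."""
    a = np.clip(np.asarray(actions, dtype=float), LOW, HIGH)
    scaled = (a - LOW) / (HIGH - LOW)                  # now in [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)


def tokens_to_actions(tokens):
    """Map bin indices back to the center of each bin."""
    centers = (np.asarray(tokens) + 0.5) / N_BINS
    return LOW + centers * (HIGH - LOW)


if __name__ == "__main__":
    joint_targets = np.array([-1.0, -0.25, 0.0, 0.7, 1.0])
    tokens = actions_to_tokens(joint_targets)
    recovered = tokens_to_actions(tokens)
    print(tokens)
    print(np.abs(recovered - joint_targets).max())
```

The round trip is lossy: each value can move by up to half a bin width, which is the discretization error that continuous generators avoid.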

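Training data pipeline: this amounts to stitching cooking videos from the whole community into one cookbook. The raw material is human teleoperation episodes recorded on various small robot arms (SO-100, Koch arm, and so on) and shared on the LeRobot Hub. The paper presumably explains how this data is cleaned, how camera views are aligned, and how action spaces are unified; this is the invisible but time-consuming "dirty work".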
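
One piece of that dirty work, unifying action spaces, might look like the sketch below: each dataset is rescaled per dimension to [-1, 1] and padded to a common width with a validity mask. The function names, the common width of 8, and the min/max normalization are all hypothetical choices for illustration; the paper's actual pipeline may differ.

```python
import numpy as np

MAX_ACTION_DIM = 8  # common padded width across robots (assumed)


def fit_stats(actions):
    """Per-dimension min and max over one dataset's actions."""
    a = np.asarray(actions, dtype=float)
    return a.min(axis=0), a.max(axis=0)


def normalize(actions, stats):
    """Rescale each action dimension to [-1, 1]; constant dimensions map to -1."""
    lo, hi = stats
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    return 2.0 * (np.asarray(actions, dtype=float) - lo) / span - 1.0


def pad_actions(actions, max_dim=MAX_ACTION_DIM):
    """Zero-pad (N, d) actions to (N, max_dim) and return a mask of real dimensions."""
    a = np.asarray(actions, dtype=float)
    n, d = a.shape
    if d > max_dim:
        raise ValueError(f"action dim {d} exceeds max_dim {max_dim}")
    padded = np.zeros((n, max_dim))
    padded[:, :d] = a
    mask = np.zeros(max_dim, dtype=bool)
    mask[:d] = True
    return padded, mask


def unify(datasets):
    """Normalize and pad every dataset so they can be batched together."""
    return {name: pad_actions(normalize(acts, fit_stats(acts)))
            for name, acts in datasets.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    datasets = {
        "arm_with_6_joints": rng.uniform(-3.0, 3.0, size=(5, 6)),
        "arm_with_7_joints": rng.uniform(0.0, 100.0, size=(5, 7)),
    }
    for name, (padded, mask) in unify(datasets).items():
        print(name, padded.shape, int(mask.sum()))
```

The mask lets a training loss ignore the padded dimensions, so a 6-joint arm is not penalized for outputs it does not have.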

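What the experiments do

Based on the usual experimental playbook of VLA papers (check the paper for actual numbers):

  • Simulation benchmarks: environments such as LIBERO / Meta-World / RoboCasa, comparing task success rates against OpenVLA and RT-2
  • Real-robot transfer: running pick-and-place and instruction-conditioned grasping on low-cost community arms such as the SO-100, looking at zero-shot and few-shot performance
  • Scaling curve: growing the parameter count from smaller sizes up to the target size, to see where the performance-versus-parameters curve starts to plateau
  • Ablations: removing community data, swapping the backbone, changing the action head, and so on, to see what each piece contributes to final performance

The key thing to watch: "how small can it get and still work". This is the core question the community wants answered.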
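
When reading the success-rate tables, it helps to remember how few episodes robot evaluations often use. The sketch below computes a success rate and a Wilson score interval (a standard confidence interval for proportions, brought in here as a reading aid and not taken from the paper); the episode counts in the example are made up.

```python
import math


def success_rate(outcomes):
    """Fraction of successful episodes from an iterable of booleans."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one episode")
    return sum(bool(o) for o in outcomes) / len(outcomes)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical comparison: model A succeeds 18/20 times, model B 15/20.
    for name, k, n in [("A", 18, 20), ("B", 15, 20)]:
        lo, hi = wilson_interval(k, n)
        print(f"model {name}: {k}/{n} = {k / n:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```

With 20 episodes per model, the two intervals in this made-up example overlap, so a gap of a few successes alone does not show that one model is better.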

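A few new terms you should know

  • VLA (Vision-Language-Action): learns "see + understand instruction + act" end to end in a single model; a mainstream paradigm in robotics since 2023
  • Demonstration data: (image, instruction, action) triples recorded while a human teleoperates a robot arm through a task; the raw ingredients of imitation learning
  • Action token / action chunk: action tokens discretize continuous joint angles so the model can generate actions the way it generates text; an action chunk is a fixed-length segment of consecutive actions predicted in one shot
  • Flow matching / diffusion policy: continuous generative methods from the diffusion family that output action vectors directly, avoiding discretization loss
  • LeRobot: the open-source robot learning library and data hub maintained by Hugging Face; SmolVLA's data source and deployment framework
  • Consumer GPU: as opposed to data-center cards such as the H100/A100, cards an individual can buy such as the RTX 4090/3090, with around 24 GB of memory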
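
The flow matching entry above can be made concrete with a minimal sketch. It assumes the common straight-line convention (t = 0 is noise, t = 1 is the action), and it uses an oracle velocity field for a single known target in place of a trained network, so it shows the mechanics of training targets and Euler sampling rather than any specific model's recipe.

```python
import numpy as np


def flow_matching_pair(action, noise, t):
    """Training pair on the straight path from noise (t=0) to action (t=1).

    Returns the interpolated point x_t and the velocity a network would regress.
    """
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity


def euler_sample(velocity_fn, noise, n_steps=10):
    """Generate an action by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = np.array(noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


def oracle_velocity(target):
    """Stand-in for a trained network: the exact field that flows to one target."""
    target = np.asarray(target, dtype=float)
    return lambda x, t: (target - x) / (1.0 - t)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    action_chunk = np.array([[0.1, -0.2], [0.2, -0.1], [0.3, 0.0]])  # 3 steps, 2 joints
    noise = rng.normal(size=action_chunk.shape)
    sample = euler_sample(oracle_velocity(action_chunk), noise, n_steps=10)
    print(np.abs(sample - action_chunk).max())
```

Because the whole chunk is generated as one continuous array, no binning is involved; the cost is several network evaluations per sample instead of one.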

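How it relates to other papers

  • It continues the VLA paradigm of OpenVLA / RT-2 rather than starting from scratch
  • It belongs to the same "Smol family" as SmolLM and SmolVLM: Hugging Face is extending the "small models can work too" line from NLP to vision and then to robotics
  • It is tightly coupled to the LeRobot project: SmolVLA is both LeRobot's flagship model and a consumer of LeRobot datasets, and each strengthens the other
  • In contrast to large VLAs such as π0, π0.5, and RDT-1B: that line pursues SOTA, while SmolVLA pursues accessibility
  • It can be seen as the model-side counterpart of low-cost hardware efforts such as ALOHA / DexCap: the hardware has already come down in cost, and the model has to come down too before the whole stack can truly reach the community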

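How I suggest reading it

  1. Start with the LeRobot README and the SmolVLA model card on the Hugging Face Hub; spend 5 minutes finding out which robot arms and which tasks it actually runs on
  2. Read the paper's method section, focusing on three questions: how many parameters it is compressed to, which distillation/compression techniques it uses, and whether actions are output as discrete tokens or continuous values
  3. Look at the experimental comparison with OpenVLA, especially "on which tasks the small model still has a large gap, and on which it has caught up"; this tells you where small models currently hit their limits
  4. (Optional) Clone the LeRobot repo and run inference once to see for yourself whether it runs on your own GPU; this is the paper's greatest practical value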

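Why it is worth reading

  • One of the best entry points into embodied AI for beginners: you do not need 8 H100s to start playing with VLAs, a single card is enough
  • It marks a turning point toward accessible robot models, similar to how Llama / Mistral made local inference possible in NLP
  • The methodology itself transfers: distilling a large model and training a usable small model on community data is an approach other fields can borrow
  • It resonates with the hardware community: an SO-100 can be built for one to two thousand yuan, and together with SmolVLA, "training your own robot at home" comes within an ordinary budget for the first time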

Cite this note
BibTeX
@online{eai_smolvla_2026,
  title       = {(readable note) SmolVLA},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2025 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/smolvla/}},
  organization = {Embodied AI Reading Station}
}
