Datasets & Benchmarks · Plate Nº 30

CALVIN

7 min read · 2327 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public material; the full paper has not been read.

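The one-sentence version (TL;DR)

CALVIN is a yardstick for testing whether a robot "does what it is told": a person gives an instruction in plain language, the robot has to finish the job on a tabletop one step after another, and all 34 subtasks are scored under one protocol.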

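What the setting looks like: an everyday analogy

It is the weekend, you are cooking instant noodles, and you shout to your roommate who is scrolling on their phone in the living room:

"Put the red cup in the sink while you're at it, then turn off the kitchen light."

That one sentence hides several consecutive actions: get up, walk over, pick up the cup, put it in the sink, then go over and switch off the light. Each step depends on the one before it: the cup cannot be put down before it has been picked up, and the light cannot be switched off before you reach the switch. A person does this without thinking, but getting a robot to understand the sentence and then carry out every action in order is hard.

This is what CALVIN tests: the robot gets a table with a drawer, blocks, a button, and a small LED light; you give it one sentence and it has to follow it. Unlike simple tests that only ask for "one screw at a time", CALVIN forces an algorithm to handle three things at once: understand the instruction, break it into steps, and execute those steps correctly one after another.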

Plate Nº I. CALVIN, scenario sketch: the real-world problem this paper sets out to solve

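What people did before

  • Benchmarks dominated by short tasks: earlier robot-learning benchmarks (such as Meta-World and RLBench) mostly contain single-step or short-horizon tasks like "open the drawer" or "press the button", and lack the realism of chaining several steps together.
  • No language instructions: many manipulation benchmarks condition on a task ID or a goal image, so the robot never has to understand human language.
  • Simulation only: simulated tasks were disconnected from the distribution of real instructions, so learned policies transferred poorly to settings with natural-language users.
  • Small demonstration sets: many early methods trained on a few hundred demonstrations, which makes it hard to train large models or to evaluate language generalization.
  • No shared evaluation protocol: each paper defined its own metrics, so results were not comparable; CALVIN aims to be a "GLUE for robots".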

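The paper's key idea

Think of testing whether a student can generalize: you taught them chopping, boiling water, and plating separately, and the final exam asks whether they can "cook a complete dish". CALVIN frames instruction following as the same kind of **compositional generalization** problem (assembling learned pieces into combinations never seen before):

  1. Data is a continuous demonstration stream: like watching one long cooking video in which chopping, boiling, and plating happen back to back, rather than feeding the robot each step as a separate short clip.
  2. Instructions are in natural language: human annotators caption video segments after the fact, so the robot has to learn the mapping from "a sentence a person said" to "concrete actions" and cannot shortcut through task IDs.
  3. The test forces 5 steps in a row: a rollout only gets full marks if the robot completes 5 consecutive subtasks, and one failed step ends the rollout, which forces the model to cope with compounding error.
  4. A second test in a new environment: training uses rooms A/B/C, and testing may move to the unseen room D (different colors, table textures, and object positions) to see whether the policy is thrown off by unfamiliar surroundings.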
Plate Nº II. CALVIN, method sketch: the core pipeline

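How it works (method)

Simulation platform and scenes: like building a miniature kitchen inside the computer. CALVIN uses PyBullet (a physics simulation engine; think of it as "the physics world inside a game") to set up a virtual table. The robot is a Franka Panda 7-DoF arm (seven rotating joints, roughly as dexterous as a human arm), and the table holds a sliding door, a drawer, a button, an LED light, and a few colored blocks of different shapes (red, blue, and pink). There are 4 scenes in total, A/B/C/D, with slightly different table colors and materials, deliberately creating a "different room" effect.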

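Task decomposition and annotation: like cutting a documentary and then adding subtitles. The recordings come from human operators teleoperating the robot in simulation without a fixed task script (usually called play data), not from a hand-written scripted policy; this yields very long, continuous streams of manipulation. Human annotators then write sentences for only a portion of the segments ("open the drawer", "put the red block on the sliding door"), and the unannotated footage is left for the model to learn from on its own. The benchmark defines 34 subtasks in total. The exact length of the recordings and the annotated fraction should be checked against the original paper.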

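Hold on a moment: what does "annotation" mean?
It means pairing a stretch of video with a matching natural-language instruction, so the robot can learn the correspondence "this sentence = this sequence of actions". Unannotated video has no subtitles, but it still helps the model get familiar with what actions look like.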
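The split between captioned and uncaptioned footage can be pictured with a small sketch. Everything in it is illustrative: the `Window` type, the frame indices, and the caption are made up for the example and are not CALVIN's actual data format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Window:
    """A slice [start, end) of one long, continuous recording (hypothetical format)."""
    start: int
    end: int
    instruction: Optional[str] = None  # None means nobody wrote a caption for it


def split_by_annotation(windows: List[Window]) -> Tuple[List[Window], List[Window]]:
    """Separate captioned windows from uncaptioned ones."""
    labeled = [w for w in windows if w.instruction is not None]
    unlabeled = [w for w in windows if w.instruction is None]
    return labeled, unlabeled


# Toy recording: four windows, only one of which has a caption.
windows = [
    Window(0, 64),
    Window(64, 128, "open the drawer"),
    Window(128, 192),
    Window(192, 256),
]

labeled, unlabeled = split_by_annotation(windows)
labeled_fraction = len(labeled) / len(windows)
print(f"{len(labeled)} labeled, {len(unlabeled)} unlabeled, fraction = {labeled_fraction:.2f}")
```

A language-conditioned policy would train on the `labeled` windows, while the `unlabeled` ones can still be used for objectives that do not need a caption.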

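Inputs, outputs, and control interface: like giving the robot eyes, a sense of its own joints, and ears. At every step it sees camera RGB images, whether the gripper is currently open or closed, and its own joint angles (this internal sense is called proprioception, the equivalent of "I know where my hand is even with my eyes closed"), plus one text instruction. The output is 7 numbers: how to move the end effector, plus whether to open or close the gripper. The interface is generic enough that both end-to-end imitation learning and hierarchical planning methods can plug in directly.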
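A minimal sketch of such an interface, using only the standard library. The class and field names are hypothetical, and two details are assumptions rather than facts taken from the note: that the 7 numbers split into 3 position deltas, 3 orientation deltas, and 1 gripper command, and that +1 means open while -1 means close.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Observation:
    """What the policy sees at one time step (field names are hypothetical)."""
    rgb: List[List[List[int]]]   # H x W x 3 camera image
    gripper_open: bool           # is the gripper currently open?
    joint_angles: List[float]    # proprioception: one angle per arm joint
    instruction: str             # natural-language command


@dataclass
class Action:
    """Seven numbers: a 6-D end-effector motion plus one gripper command."""
    delta_position: Sequence[float]     # (dx, dy, dz), assumed layout
    delta_orientation: Sequence[float]  # (droll, dpitch, dyaw), assumed layout
    gripper: float                      # +1.0 = open, -1.0 = close (assumed convention)

    def as_vector(self) -> List[float]:
        vec = [*self.delta_position, *self.delta_orientation, self.gripper]
        if len(vec) != 7:
            raise ValueError(f"expected 7 action values, got {len(vec)}")
        return vec


def dummy_policy(obs: Observation) -> Action:
    """Stand-in policy: stay still and keep the gripper in its current state."""
    gripper = 1.0 if obs.gripper_open else -1.0
    return Action((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), gripper)


obs = Observation(
    rgb=[[[0, 0, 0]]],        # 1 x 1 placeholder image
    gripper_open=True,
    joint_angles=[0.0] * 7,   # 7-DoF arm
    instruction="open the drawer",
)
action = dummy_policy(obs).as_vector()
print(action)
```

Any method that maps an `Observation` to an `Action`, whether a single end-to-end network or a planner with a low-level controller underneath, fits behind the same function signature.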

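Evaluation protocol: like a chained exam question where one failed part ends the whole question. At test time the examiner lines up 5 instructions; the robot only receives the 2nd after finishing the 1st, and if any step fails the remaining instructions are never issued. The paper reports the success rate for "completed 1 step", "completed 2 steps", and so on up to "completed 5 steps", so the decay over the chain is visible at a glance. This is CALVIN's most recognizable design choice.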
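The protocol described above can be sketched in a few lines. This is a toy illustration, not the benchmark's evaluation code: the function names, the instruction wording, the 0.7 per-step success probability, and the number of rollouts are all made up for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def run_chain(instructions: Sequence[str], attempt: Callable[[str], bool]) -> int:
    """Issue instructions one at a time and stop at the first failure.

    Returns how many subtasks were completed in a row (0..len(instructions)).
    """
    completed = 0
    for instruction in instructions:
        if not attempt(instruction):
            break  # later instructions are never issued
        completed += 1
    return completed


def summarize(completed_counts: Sequence[int], chain_length: int = 5) -> Tuple[List[float], float]:
    """Success rate for 'at least k in a row' (k = 1..chain_length) and the mean count."""
    n = len(completed_counts)
    rates = [sum(c >= k for c in completed_counts) / n for k in range(1, chain_length + 1)]
    average_length = sum(completed_counts) / n
    return rates, average_length


# Toy stand-in for a policy: each subtask succeeds with a made-up probability.
rng = random.Random(0)


def toy_attempt(instruction: str) -> bool:
    return rng.random() < 0.7  # 0.7 is illustrative, not a reported number


chain = ["open the drawer", "push the button", "lift the red block",
         "move the sliding door left", "turn on the LED"]  # hypothetical phrasing
counts = [run_chain(chain, toy_attempt) for _ in range(1000)]
rates, average_length = summarize(counts)
print([round(r, 3) for r in rates], round(average_length, 3))
```

The list of rates can never increase from one step to the next, because completing k steps requires having completed k - 1, and the average number of completed steps equals the sum of the five rates.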

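What the experiments do

The paper (as inferred from the abstract and public material) runs two kinds of comparison:

  • Baseline comparison: several classic baselines are run, including behavior cloning (BC), goal-conditioned BC, and variants with a language embedding added, to see how they fare under the 5-step chained evaluation; most decay quickly from a relatively high success rate on step 1 to close to 0 on step 5.
  • Generalization settings: train on environments A+B+C and test on D (unseen) to see how much success drops under distribution shift; language generalization is also tested, for example rephrasing seen instructions or changing object colors.

The specific numbers (each baseline's 5-step success rates, the size of the human-annotated set, the size of the unannotated data) require reading the original paper.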
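A back-of-the-envelope model helps explain why chained success decays at all. If each subtask succeeded independently with probability p, finishing k in a row would have probability p^k. Both the independence assumption and the values of p below are illustrative and are not results from the paper.

```python
def chain_success(p: float, k: int) -> float:
    """Probability of k consecutive successes under an independence assumption."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return p ** k


for p in (0.9, 0.7, 0.5):  # illustrative per-step success rates
    curve = [round(chain_success(p, k), 3) for k in range(1, 6)]
    print(p, curve)
```

With p = 0.7, for instance, only about 17% of rollouts would reach step 5. Real failures are unlikely to be independent, since each subtask starts from whatever state the previous one left behind, so measured curves may fall faster than this simple model suggests.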

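A few new terms you should know

  • language-conditioned manipulation: the robot takes a natural-language instruction as an additional input, so "what to do" changes from a hard-coded task ID to reading human language.
  • long-horizon: a task spans many time steps and its subgoals depend on each other; far more complex than "press the button once".
  • compositional generalization: having seen instructions A and B separately, can the policy execute the unseen combination "A then B" correctly? CALVIN's 5-step evaluation tests this directly.
  • imitation learning / behavior cloning: train a policy with expert demonstrations as the supervision signal; the plainest version is a regression from "observed state" to "predicted action".
  • proprioception: the robot's internal sense of its own joint angles and end-effector pose, the equivalent of "where is my own hand"; it is part of the policy input.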
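  • play data (teleoperated): long, unscripted recordings of a human operator driving the robot through whatever interactions come to mind, as opposed to demonstrations generated by a hand-written scripted policy; in CALVIN this is the large stream of footage that is mostly left unannotated.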

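How it relates to other papers

  • RLBench / Meta-World: CALVIN raises task granularity from "single step" to "multi-step chains" and makes language conditioning mandatory; it fills the "long-horizon + language" gap.
  • Language-conditioned manipulation work such as BC-Z and Hiveformer: these are typically method papers, and CALVIN gives them a common evaluation testbed.
  • Large-model lines such as RT-1 and RT-2: CALVIN's 5-step evaluation suits large models (which are strong at language understanding) and became a common sanity check for many later VLA (Vision-Language-Action) models.
  • Later benchmarks such as LIBERO and SimplerEnv: successors keep the "language + long-horizon" idea but add more tasks, more realistic physics, or distributions closer to real robots.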

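How I suggest reading it

  1. Start with Figure 1 and the evaluation-protocol section: work out exactly how the 5-step chained evaluation issues instructions and judges success; this is the heart of the benchmark.
  2. Jump to the environment and task list: skim the language templates and initial states of the 34 subtasks to get a concrete sense of what is being measured.
  3. Glance at the baseline table: the decay in success rate over 5 steps makes it immediately clear why long-horizon tasks are hard.
  4. Optional, skim the data collection and annotation pipeline: if you plan to train your own model on this dataset, you need to understand the distribution of annotated language and how the splits are defined.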

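Why it is worth reading

Since 2022 CALVIN has been the de facto default benchmark for long-horizon, language-conditioned manipulation, and many later papers on VLAs, diffusion policies, and hierarchical planners use it as a starting point. The benefit of reading it is not learning a new method but calibrating your intuition for how hard long-horizon manipulation is: once you see success fall from a usable rate on step 1 to near zero by step 5, it becomes clear why later work puts so much effort into error accumulation, subgoal decomposition, and language grounding. For readers starting from zero, it is a good entry point for building a mental model of what a manipulation benchmark is.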

Cite this note
BibTeX
@online{eai_calvin_2026,
  title       = {(readable note) CALVIN},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2022 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/calvin/}},
  organization = {Embodied AI Reading Station}
}
