Diffusion Policy · Plate Nº 48

pi_0.5: VLA with Open-World Generalization

7 min read · 2353 characters · ⭐⭐⭐⭐⭐ · Short summary

This note is based on the abstract and publicly available material; I have not read the full paper.

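In one sentence (TL;DR)

Let a robot walk into a stranger's home for the first time, understand an instruction like "tidy up the kitchen", and then carry out the job step by step on its own.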

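What is the scenario? An everyday analogy

You visit a friend's home for the first time, and your friend says "help me clear the table after dinner". You have never set foot in this kitchen, yet you manage just fine, because you know:

  • roughly what a kitchen looks like (you carry "kitchen common sense" in your head)
  • what "clearing the table" involves (carrying bowls to the sink, wiping the tabletop, throwing away the trash)
  • roughly how to reach for and grasp an oddly shaped cup you have never seen before

To a person this is utterly ordinary, but for a robot it is a major hurdle. Earlier robot policies mostly worked only in "the kitchen they visited during training". Move one to a new house and it is immediately lost: it cannot find the cabinet, does not know where the dirty plates are, and cannot even hold a cup steadily. This is exactly what pi_0.5 targets: letting a robot "get to work in a new home" the way a person can.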

Plate Nº I. pi_0.5, scene illustration: the real-world problem this paper sets out to solve

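How earlier work approached it

  • RT-1 / RT-2: train "what it sees + what instruction it hears → output action" into one large Transformer, but the data is still confined to limited scenes, so generalization across rooms is weak
  • OpenVLA: an open-source 7B VLA with discrete action tokens, fine-tuned on Open X-Embodiment data; it generalizes fairly well in controlled settings but does not explicitly model "semantic subtask decomposition"
  • pi_0 (this work's predecessor): predicts continuous action chunks with a flow matching (diffusion-style) head and co-trains across many robot embodiments; already frontier-level, but still performs well mainly inside the demonstration distribution
  • SayCan / Inner Monologue family: rely on an LLM for high-level planning, but the low-level executor and the high-level planner are two stitched-together stages, not end to end
  • Traditional BC (Behavior Cloning): single task, single environment, piles of teleoperation data; it falls apart in a new home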

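The paper's key idea

The core bet: one model + a grab bag of data trained together = a robot that still works after changing houses. Three concrete moves:

  1. Multi-robot teleoperation data. Picture a trainee cook rotating through ten different kitchens: every kitchen has a different counter height and different knife sizes, and after a few months what he has learned is the feel of "chopping vegetables" itself, not "muscle memory for my own counter". Trajectories from many kinds of robot arms (embodiment, the robot's physical body) are mixed together, so the model learns the abstract skill of "how to manipulate objects".
  2. Web-scale image-text / VQA data. Picture a robot that spent a few years browsing Xiaohongshu and Baidu Baike before starting work: it has long since seen online "what a cup looks like, that dirty clothes usually sit in the laundry basket, that dish soap stands by the sink". This common sense comes from adding web image-text and visual question answering (VQA) data to training, so the model inherits the "world knowledge" of a VLM (vision-language model).
  3. Semantic subtask annotation. Picture teaching an apprentice by first having him say aloud "what am I going to do next". "Open the cabinet" is broken into natural-language steps such as "walk to the cabinet / grasp the handle / pull outward". The model learns both the high-level "what to do next" and the low-level "how the hand moves in this step", so the decomposition ability is welded into the same network instead of hanging a separate LLM on the side to issue commands.

Intuitively: compress "world common sense from the web + action feel from many robots + the knack for breaking big tasks into small ones" into a single brain, so it can get started on the spot in an unfamiliar house.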

Plate Nº II. pi_0.5, method illustration: the core pipeline

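How it works (method)

Architecture skeleton: think of it as a "mouth + hands" pairing. The mouth is a pretrained VLM (vision-language model, responsible for looking at images and understanding language); the hands are an action expert head (responsible for translating "what it sees + what it hears" into continuous joint actions). Given the current image and one instruction, it outputs the next H steps of actions. pi_0 already built this skeleton, and pi_0.5 does not change it much.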

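Wait, slow down for a moment: what are flow matching and diffusion? Both are training objectives built on "start from a blob of random noise and denoise step by step until you reach the target". The difference: diffusion usually learns to predict the noise to strip away at each step, whereas flow matching directly learns the velocity direction "from noise to target", which is more direct and typically needs fewer sampling steps. All you need to know here: the action head does not guess the action in one shot; like a sculptor, it carves the action out of a lump of clay step by step.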
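To make the "velocity from noise to target" idea concrete, here is a minimal NumPy sketch of a flow matching objective and its Euler sampler. It assumes a straight-line path with t = 0 at pure noise and t = 1 at the clean action; that convention, the function names, and the `predict_velocity` callable (a stand-in for a trained network) are all illustrative assumptions, not details taken from the pi_0 or pi_0.5 papers.

```python
import numpy as np

def flow_matching_target(noise, action, t):
    """Point on the straight path from noise (t=0) to action (t=1),
    plus the constant velocity along that path (assumed convention)."""
    x_t = (1.0 - t) * noise + t * action
    velocity = action - noise
    return x_t, velocity

def flow_matching_loss(predict_velocity, noise, action, t):
    """Mean squared error between predicted and target velocity."""
    x_t, target = flow_matching_target(noise, action, t)
    pred = predict_velocity(x_t, t)
    return float(np.mean((pred - target) ** 2))

def sample_actions(predict_velocity, noise, num_steps=10):
    """Euler integration of the learned velocity field from t=0 to t=1."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * predict_velocity(x, t)
    return x
```

With a perfect velocity field, `sample_actions` walks the noise sample onto the action in a handful of steps, which is the "carving from clay" picture in code form.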

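Data mixing (co-training on heterogeneous data): like a student taking three courses at once without separating them. (1) Robot teleoperation trajectories (with action labels, to train the hands); (2) web image-text / VQA (no actions, only there so the "mouth" part keeps gaining common sense); (3) semantic subtask samples (input "open the cabinet", output "walk to the cabinet / grasp the handle / pull it open", dedicated to practicing task decomposition). The three kinds of data are mixed at some ratio within the same batch; for the actual ratio, read the paper.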
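The mixing described above can be sketched as weighted sampling over three data sources. The weights below are placeholders (the note does not know the real ratio), and the source names and the mapping from source to loss type are my reading of the paragraph, not the paper's configuration.

```python
import random

# Hypothetical mixing weights; the actual ratio is not stated in this note.
MIX_WEIGHTS = {"robot_teleop": 0.5, "web_vqa": 0.3, "subtask": 0.2}

def sample_batch_sources(batch_size, weights, seed=0):
    """Pick a data source for every slot in one mixed batch."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=batch_size)

def loss_kinds(source):
    """Which loss a sample contributes to (assumed from the text above):
    teleoperation data carries action labels, the other two are text-only."""
    if source == "robot_teleop":
        return ["action"]
    return ["text"]
```

A batch built this way interleaves all three sources, so every gradient step sees robot actions and language supervision together.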

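Two-level calls at inference time: like a cook reading the recipe aloud before putting food in the pan. When executing a long-horizon task, the model first states the next subtask on the text side ("first walk to the cabinet"), then conditions on that subtask sentence so the action head outputs the matching chunk of actions. Once that chunk is done, it states the next sentence. In effect, SayCan's two-stage stitching of "an LLM on top hands out work, a policy below does the work" is packed into one network that runs end to end.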
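The two-level loop above can be sketched as follows. Every callable here is a hypothetical stand-in: in the real system both `predict_subtask` and `predict_actions` would be calls into the same network, and the `"done"` sentinel is an assumption made only so the loop has a stopping condition.

```python
def run_task(instruction, get_observation, predict_subtask,
             predict_actions, execute, max_subtasks=10):
    """Alternate between naming the next subtask and acting on it."""
    completed = []
    for _ in range(max_subtasks):
        obs = get_observation()
        # High level: state the next subtask in natural language.
        subtask = predict_subtask(obs, instruction)
        if subtask == "done":  # hypothetical stop signal
            break
        # Low level: produce an action chunk conditioned on that subtask.
        actions = predict_actions(obs, subtask)
        execute(actions)
        completed.append(subtask)
    return completed
```

A usage example with stub functions: feeding it a scripted planner that yields "walk to the cabinet", "grasp the handle", "pull it open", then "done" returns the first three subtasks in order.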

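Training scale and embodiments: trained jointly across several robot arms plus bimanual mobile platforms, in real home scenes including kitchens, bedrooms, and living rooms. For the specific robot models, hours of data, and parameter count, read the paper.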

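What the experiments do

The headline is long-horizon tasks in real homes never seen before: researchers move the robot into real houses that did not appear in training and have it do long multi-step tasks such as "clear the table / make the bed / put dirty clothes in the washing machine". This "actually go into a stranger's home" setup is far harsher than traditional benchmarks (LIBERO / SimplerEnv / a single lab tabletop).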

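Expected baselines: pi_0, OpenVLA, a version with subtask annotation ablated, and a version with web data ablated. The metrics are most likely task success rate + subtask completion rate + variance across houses. For the actual numbers, read the paper.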
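For readers new to these metrics, here is a small sketch of how the three guessed quantities could be computed. The metric definitions are the plain textbook ones and the house names in the usage note are made up; none of this reflects the paper's actual evaluation protocol or results.

```python
from statistics import mean, pvariance

def success_rate(outcomes):
    """Fraction of episodes that succeeded (outcomes is a list of booleans)."""
    return mean(1.0 if ok else 0.0 for ok in outcomes)

def subtask_completion(done, total):
    """Fraction of subtasks finished within one episode."""
    return done / total

def per_house_summary(results):
    """results maps a house name to its per-episode success booleans.
    Returns per-house success rates and their population variance."""
    rates = {house: success_rate(eps) for house, eps in results.items()}
    return rates, pvariance(rates.values())
```

For instance, two hypothetical houses with episode outcomes `[True, False]` and `[True, True]` give success rates of 0.5 and 1.0; a large variance across houses would mean the policy is much less reliable in some homes than in others.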

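New terms you should know

  • VLA (Vision-Language-Action): a model that merges the three modalities of vision, language, and action; the robot cousin of a VLM. RT-2 opened the field, and pi_0 / OpenVLA are the current frontier
  • Open-world generalization: the model still works in scenes outside the training distribution (new houses, new objects, new instruction combinations). Contrast with closed benchmark evaluation, where the test set shares the training set's distribution
  • Co-training (heterogeneous joint training): data with different modalities, tasks, and label structures is mixed into the same batch. The hard part is interference between tasks; balancing ratios and loss weights is the grunt work
  • Semantic subtask: the natural-language intermediate steps a long-horizon task is broken into. "Make breakfast" → "fry an egg / brew coffee / toast bread". pi_0.5 treats these as a new training signal
  • Flow matching: an objective for training continuous outputs (actions); a relative of diffusion but more direct, since it learns the velocity field from noise to target. Used by the pi_0 family
  • Embodiment: the robot's physical body. Different embodiment = different joints, different grippers, different degrees of freedom. Multi-embodiment co-training aims to learn a manipulation prior that does not depend on the body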

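How it relates to other papers

  • pi_0: the direct predecessor. pi_0 established the training framework of "multiple robots + a flow matching action head"; pi_0.5 builds on it by adding two ingredients, web data and semantic subtasks
  • RT-2 / OpenVLA: competing VLAs of the same generation. RT-2 is closed-source and OpenVLA is open-source; the pi family is the flagship of the company Physical Intelligence, aimed at real home environments
  • SayCan / Inner Monologue: representatives of the high-level planning route. pi_0.5 makes their "LLM decomposes the task" idea end to end, so it is no longer two models stitched together
  • Open-X-Embodiment / DROID / Bridge V2: the foundation of multi-robot datasets, and one of the sources of the pi family's training data
  • Diffusion Policy / Consistency Policy: the method lineage on the action-head side. The flow matching used by pi_0 / pi_0.5 extends this line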

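How I suggest reading it

  1. First revisit the pi_0 note and confirm you understand its two pillars: the flow matching action head and multi-embodiment co-training
  2. Read the pi_0.5 abstract, intro, and method figure, focusing on "data mixing ratio" and "how subtasks are annotated and trained"; these are its real differences from pi_0
  3. Skim the experiments section, looking only at the main real-home table and the ablations: how much does performance drop without subtasks, or without web data? That tells you the true contribution of each idea
  4. Finish with the limitations: training at this scale usually comes with "long-tail tasks still fail / high inference latency / safety boundaries", so check what the authors admit themselves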

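Why it is worth reading

  • It is among the state-of-the-art VLAs of 2025; understanding it means you have touched the edge of the frontier
  • "Open-world generalization" is one of the holy grails of embodied AI, and pi_0.5 offers a concrete recipe to reference: heterogeneous data + semantic subtasks + multiple embodiments
  • Methodologically it is very "engineering-minded": no flashy new architecture, just data mixing and task design driving performance. This "engineering beats new algorithms" style of research is itself worth learning
  • If you work on robotics or embodied AI, the pi family is required reading: pi_0 → pi_0.5 → follow-up work will most likely continue along this line
  • Value for you (a learner with no programming background): you do not need to reproduce it, but grasping the intuition of "why put VLM knowledge + robot actions + task decomposition together" gives you an anchor for every VLA paper you read afterwards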

Cite this note
BibTeX
@online{eai_pi05_2026,
  title       = {(readable note) pi_0.5: VLA with Open-World Generalization},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2025 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/pi05/}},
  organization = {Embodied AI Reading Station}
}
