High-Level Planning · Plate Nº 84

VoxPoser

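6 min read · 2090 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

In one sentence (TL;DR)

VoxPoser has a large model draw two 3D maps for the robot: red regions to head toward, gray regions to stay away from. The robot follows the maps to produce its motion, and no new model is trained along the way.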

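What kind of scenario is this

You tell a delivery courier: "Please put this milk tea on the left side of the balcony table, keep it away from the dog, and stay a bit farther from the crib when you pass through the living room." The courier pictures a map of the home: a red "destination" circle on the left side of the balcony table, a gray "detour" circle around the dog bed and another around the crib, and then picks a route that goes around them.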

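This instruction actually mixes three kinds of information:

  • Goal ("put it on the left side of the balcony table": some 3D location should be approached)
  • Constraint ("keep it away from the dog": some regions should be avoided)
  • Preference ("a bit farther from the crib": implicit parameters such as clearance, speed, or pose)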

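The robot has to do the same thing. The hard part: in earlier approaches, engineers wrote an API for each verb ahead of time ("place where", "avoid what"), and any verb missing from the list left the robot stuck. VoxPoser takes a different route: the large model draws that "destination circle plus detour circles" map on the spot, for the room in front of it, and the robot follows the map. The map is not prepared in advance; the LLM draws it live, so when the instruction changes, the map changes with it.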

Plate Nº I. VoxPoser, scene illustration: the real-world problem this paper addresses

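How people did it before

  • Behavior cloning / RT-1 / RT-2 line: collect large numbers of (language, image, action) triples and train an end-to-end policy. Problem: every new verb needs new data.
  • SayCan / Code-as-Policies: have an LLM decompose the instruction into a combination of predefined skills (pick / place / open). Problem: bounded by the skill library, so unseen combinations tend to fail.
  • Classical motion planning + hand-written cost functions: an engineer designs a cost function for each task. Problem: too much manual work, and it does not generalize.
  • Learned world models + RL: very high training cost, and sim-to-real transfer is hard.
  • Key gap: each of these lines hard-codes either the "action side" or the "task side". What is missing is a bridge that maps open-ended language directly onto geometric constraints in 3D space.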

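The paper's key idea

The core insight: LLMs already understand spatial verbs such as "approach / avoid / pass through / align", and VLMs already know which objects are in the scene. What is missing is a translation of those two kinds of knowledge into a geometric representation the robot can use.

VoxPoser's bet is that this translation does not need another trained model. Instead, the LLM directly generates Python code that "calls a VLM to find objects, then writes values into a 3D voxel grid". Running that code yields two voxel fields:

  • Affordance map: the higher the value, the more the robot wants to go there
  • Constraint map (cost field): the higher the value, the more the robot should avoid it

A training-free (zero-shot) motion planner then synthesizes a trajectory by descending the cost defined by the two fields. Nothing in the pipeline is trained for a specific task.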

Plate Nº II. VoxPoser, method illustration: the core pipeline

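How it works (method)

Part 1: the LLM is the commander, and it writes code rather than actions. Given a natural-language instruction ("put the bottle from the drawer next to the sink, but do not touch the knife"), VoxPoser feeds it to an LLM and asks for a piece of Python code. That code calls a set of predefined "primitive functions": detect(object name) returns the 3D position / mask supplied by the VLM; get_empty_voxel_map() returns an empty voxel field. The LLM then writes values into the field, for example a Gaussian peak on the affordance map near the sink (attraction) and a Gaussian peak on the constraint map at the knife (repulsion).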
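To make the "LLM writes values into a voxel field" step concrete, here is a minimal sketch of what such a generated program could look like. The names `detect` and `get_empty_voxel_map` follow this note; their bodies, the helper `add_gaussian`, the grid size, and the object positions are all assumptions for illustration, not VoxPoser's actual API.

```python
import numpy as np

GRID = 50  # voxels per axis; hypothetical resolution

def get_empty_voxel_map(size=GRID):
    """Return an all-zero scalar field over a size^3 voxel grid."""
    return np.zeros((size, size, size), dtype=float)

def detect(name):
    """Stand-in for the VLM call: returns a voxel index for a named object.
    The positions are hard-coded for illustration only."""
    scene = {"sink": (40, 25, 10), "knife": (25, 25, 10)}
    return np.array(scene[name])

def add_gaussian(field, center, sigma, weight=1.0):
    """Add an isotropic Gaussian bump centred on a voxel index (hypothetical helper)."""
    idx = np.indices(field.shape)  # shape (3, N, N, N)
    sq_dist = sum((idx[d] - center[d]) ** 2 for d in range(3))
    field += weight * np.exp(-sq_dist / (2.0 * sigma ** 2))
    return field

# What an LLM-written program for "next to the sink, avoid the knife" might contain
affordance = add_gaussian(get_empty_voxel_map(), detect("sink"), sigma=4.0)
constraint = add_gaussian(get_empty_voxel_map(), detect("knife"), sigma=3.0)
```

Both maps use the same convention as the note: a higher affordance value attracts, a higher constraint value repels, so repulsion is a peak on the constraint map rather than a dip.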

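Part 2: the VLM is the eyes, anchoring language to geometry. The LLM does not look at images directly. It issues a request such as "find where the bottle is", an open-vocabulary detector of the OWL-ViT / CLIP family localizes the object in the RGB-D image, and the result is projected back into 3D to get coordinates. This step turns the abstract token "bottle" into a voxel index (i, j, k).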
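The "project back into 3D, then get a voxel index" step can be sketched with a standard pinhole camera model. This is an illustration under stated assumptions: the intrinsics, the workspace box, and the function names are hypothetical, and the camera frame is treated as the workspace frame (a real setup would also apply camera extrinsics).

```python
import numpy as np

def pixel_to_point(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera-frame XYZ (pinhole model)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m])

def point_to_voxel(point, workspace_min, workspace_max, grid_size):
    """Map a 3D point inside an axis-aligned workspace box to an integer voxel index."""
    point = np.asarray(point, dtype=float)
    lo = np.asarray(workspace_min, dtype=float)
    hi = np.asarray(workspace_max, dtype=float)
    frac = (point - lo) / (hi - lo)            # 0..1 along each axis
    idx = np.floor(frac * grid_size).astype(int)
    return np.clip(idx, 0, grid_size - 1)      # keep boundary points inside the grid

# Hypothetical intrinsics; a detection centred at pixel (420, 300), 0.75 m away
p_cam = pixel_to_point(420, 300, 0.75, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
voxel = point_to_voxel(p_cam, workspace_min=(-0.5, -0.5, 0.0),
                       workspace_max=(0.5, 0.5, 1.0), grid_size=100)
```

Here `voxel` is the (i, j, k) index that the generated code would use when writing values into a field.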

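Part 3: field composition + planner execution. The two voxel fields are combined into one total cost field C(x) = -Affordance(x) + λ·Constraint(x). A simple trajectory optimizer (the paper uses something along the lines of greedy search plus model predictive control) starts from the robot's current position and looks for a path with minimal total cost over the field. Because the field is dense, the planner needs no symbolic subgoals.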
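The cost composition and a greedy search over it can be sketched as follows. This is a toy stand-in, not the paper's planner: the 26-neighbour descent, the grid size, the λ value, and the distance-shaped affordance (chosen so the field has a slope everywhere) are all assumptions.

```python
import itertools
import numpy as np

def total_cost(affordance, constraint, lam=1.0):
    """C(x) = -Affordance(x) + lam * Constraint(x), evaluated per voxel."""
    return -affordance + lam * constraint

def greedy_path(cost, start, max_steps=500):
    """Step to the cheapest of the 26 neighbours until no neighbour improves."""
    offsets = [o for o in itertools.product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]
    current = tuple(start)
    path = [current]
    for _ in range(max_steps):
        best, best_cost = current, cost[current]
        for off in offsets:
            nxt = tuple(c + o for c, o in zip(current, off))
            in_bounds = all(0 <= n < s for n, s in zip(nxt, cost.shape))
            if in_bounds and cost[nxt] < best_cost:
                best, best_cost = nxt, cost[nxt]
        if best == current:  # local minimum reached
            break
        current = best
        path.append(current)
    return path

def gaussian(shape, center, sigma):
    """Isotropic Gaussian bump over a voxel grid (hypothetical helper)."""
    idx = np.indices(shape)
    sq = sum((idx[d] - center[d]) ** 2 for d in range(3))
    return np.exp(-sq / (2.0 * sigma ** 2))

shape = (20, 20, 20)
goal, knife, start = (15, 10, 10), (8, 10, 10), (2, 10, 10)
idx = np.indices(shape)
dist_to_goal = np.sqrt(sum((idx[d] - goal[d]) ** 2 for d in range(3)))
affordance = -dist_to_goal                # higher (less negative) nearer the goal
constraint = gaussian(shape, knife, 2.0)  # higher nearer the obstacle
cost = total_cost(affordance, constraint, lam=10.0)
path = greedy_path(cost, start)
```

Cost strictly decreases along `path`, so the path cannot pass through the knife voxel, whose cost is higher than the start's. Pure greedy descent can still stall in a local minimum next to an obstacle, which is one reason to pair it with MPC-style replanning.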

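Part 4: closed loop + dynamic updates. During execution, scene changes (an object that gets pushed, a newly appearing obstacle) are picked up by periodically re-running VLM detection and updating the voxel fields. This lets VoxPoser correct itself in dynamic environments (a human hand interfering, objects being moved). The exact replanning rate and field resolution are in the paper itself.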
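The closed-loop idea (re-detect, rebuild the field, replan) can be sketched as a per-tick loop. Everything here is hypothetical: `observe_target` stands in for a fresh VLM detection, the field is a plain distance-to-target cost, and one voxel step per tick is an arbitrary choice.

```python
import numpy as np

def build_cost(shape, target):
    """Rebuild the cost field from the latest detection: distance to the target voxel."""
    idx = np.indices(shape)
    return np.sqrt(sum((idx[d] - target[d]) ** 2 for d in range(len(shape))))

def step_downhill(cost, pos):
    """Move to the cheapest voxel in the 3x3x3 neighbourhood (may stay put)."""
    best = pos
    for off in np.ndindex(3, 3, 3):
        nxt = tuple(p + o - 1 for p, o in zip(pos, off))
        in_bounds = all(0 <= n < s for n, s in zip(nxt, cost.shape))
        if in_bounds and cost[nxt] < cost[best]:
            best = nxt
    return best

def run_closed_loop(observe_target, start, shape, ticks):
    """Each tick: re-detect, rebuild the field, replan one step."""
    pos = tuple(start)
    trace = [pos]
    for t in range(ticks):
        target = observe_target(t)  # stand-in for a fresh VLM detection
        cost = build_cost(shape, target)
        pos = step_downhill(cost, pos)
        trace.append(pos)
    return trace

def moving_target(t):
    """The target is nudged from (15, 5, 5) to (15, 15, 5) at tick 8."""
    return (15, 5, 5) if t < 8 else (15, 15, 5)

trace = run_closed_loop(moving_target, start=(2, 5, 5), shape=(20, 20, 20), ticks=30)
```

Because the field is rebuilt from the latest detection on every tick, the trace bends toward the new target position as soon as it moves, with no change to the planner itself.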

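What the experiments do

The paper runs experiments both in simulation and on a real robot. Simulation uses benchmarks such as RLBench to measure success rates on "free-form instructions", compared against baselines such as Code-as-Policies and conventional BC. The real-robot experiments use a tabletop arm (Franka-class) on tasks such as "open a drawer, avoid a human hand, sort by color, follow a moving target".

Highlights:

  • Tasks can be combinations never seen in any training data (zero-shot long tail)
  • Tasks still complete under dynamic disturbances (because the fields are recomputed)
  • Unlike SayCan-style methods, no predefined skill library is needed

For the exact success rates, number of tasks, and percentage comparisons against each baseline, read the paper itself.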

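New terms you should know

  • Voxel field: cut 3D space into uniform small cubes (voxels), each storing one scalar. Think of it as a "3D grayscale image".
  • Affordance map: the larger the value, the more "this spot is worth going to / suitable for some action". The term comes from Gibson's psychology of affordances: what the environment offers for action.
  • Constraint map (cost map): the complement of affordance; the larger the value, the more the spot should be avoided.
  • Open-vocabulary detection: conventional detectors only recognize the classes seen in training (the 80 COCO classes), while open-vocabulary detectors (OWL-ViT, Detic) can recognize arbitrary nouns. VoxPoser relies on this to turn "that red cup" into a box.
  • Zero-shot motion planning: the planner itself needs no task-specific training; given a cost field, it can search out a trajectory.
  • LLM-as-code-writer: the LLM does not output actions directly; it outputs executable code, which is readable, debuggable, and composable. The idea comes from Code-as-Policies.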

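How it relates to other papers

  • Direct predecessor: Code-as-Policies (overlapping authorship, 2022), where an LLM writes code that calls skills. VoxPoser replaces "skills" with "operations on geometric fields", which is finer-grained.
  • Contemporary contrast: SayCan (2022), where an LLM selects skills and is limited by the skill library. VoxPoser needs no skill library.
  • Shared tooling: the VLM detection part shares ideas with PaLM-E, CLIPort, F3RM, and other "anchor language to 3D" work.
  • Follow-up work: ReKep (2024), CoPa, ManipLLM, and others push the "geometric constraint" idea further, extending from voxel fields to keypoint relations, SDFs, and similar representations.
  • Complementary line: Diffusion Policy, OpenVLA, and π0 take the "train a large policy" route, while VoxPoser takes the "zero training + geometric intermediate representation" route. The two routes began to merge in 2024-2025 (a VLM writes the cost, then diffusion samples the trajectory).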

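How I suggest reading it

  1. Start with Figure 1 + Figure 2: understand the three-stage pipeline "LLM writes code → voxel fields → planner". These are the soul of the paper; once they click, you have 80% of it.
  2. Jump to the prompt examples in the method section: the authors most likely include the actual prompt the LLM receives and the code it outputs. Trace the mapping "natural language → code → voxel operations" line by line to see why an LLM can do this at all.
  3. Look at the failure cases in the experiments: papers usually analyze cases where the LLM writes wrong code or the VLM detects the wrong object. These are the real ceiling of this line of work.
  4. (Optional) Read it alongside the ReKep paper: ReKep is VoxPoser's spiritual successor, and the comparison shows the logic of the move from "voxel fields" to "keypoint constraints".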

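Why it is worth reading

VoxPoser is one of the few works on the 2023 "LLM + robotics" line that satisfies three conditions at once: it trains no new policy, it supports open-ended language, and it runs on a real robot. Its methodological value goes beyond the specific technique. It proposes a paradigm: "have foundation models generate an intermediate representation (a geometric field) rather than generate actions directly". Over the following two years this idea grew into a whole research branch (ReKep, CoPa, the keypoint-constraint series), and it is a key to understanding manipulation research from 2024 onward.

For beginners, it is also a rare paper that leaves you understanding why an LLM can help a robot. Unlike end-to-end VLAs, which behave like black boxes, every step of VoxPoser is visible, debuggable, and swappable. Even if later SOTA systems no longer use voxel fields, understanding this approach directly informs the design of any "foundation model + control" system.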

Cite this note
BibTeX
@online{eai_voxposer_2026,
  title       = {(readable note) VoxPoser},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/voxposer/}},
  organization = {Embodied AI Reading Station}
}
