3D-VLA
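This note is based on the abstract and public materials; the full paper has not been read.
In one sentence (TL;DR)
Instead of only looking at flat photos, the robot also gets to "feel" 3D shape; before acting, it first pictures what the scene should look like once the task is done, then moves to match that picture.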
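What the scenario looks like
You are in the kitchen, reaching for the bottle of soy sauce that has been pushed to the back of the fridge.
With only a single front-facing photo, you can tell that "the soy sauce is roughly in that area", but your hand keeps bumping into the milk in front of it. The photo tells you "it is there" without telling you "how far away it is" or "which of it and the milk is in front". That is the blind spot of a 2D image.
When you actually reach for the soy sauce, though, your eyes see a 3D world: depth, occlusion, and front-to-back order are all there. A "geometric snapshot" that carries xyz coordinates like this is a 3D point cloud (a set of small points with spatial coordinates that together trace out a 3D shape).
One level deeper: before you reach, a picture flashes through your mind: "my fingers will go around the milk, and my grip will close right on the neck of the bottle". That imagined next second lets you adjust your path in advance. 3D-VLA aims to teach robots the same combination: see in 3D, imagine first, then act.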

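What came before
- Mainstream VLAs such as RT-2 and OpenVLA: the input is a 2D RGB image plus a language instruction, and the output is action tokens directly. Depth is either left for the model to infer or absent altogether.
- 2D vision models: unreliable on occlusion, depth, and spatial relations ("the cup is behind the book"), so long-horizon manipulation is error-prone.
- 3D perception work (3D-LLM and similar): can understand 3D scenes, but usually stops at question answering or description and is not connected to action output.
- World-model line: some work has the model predict the next frame (video prediction), but rarely unifies it with action generation and 3D geometry in one framework.
- The gap: combining 3D input, future-observation prediction, and action output into a single VLA is what this paper sets out to fill.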
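The "action token" output mentioned in the first bullet can be made concrete with a small sketch. It assumes uniform binning of each action dimension into 256 bins, the scheme described for RT-2; the [-1, 1] range, the 7-dimensional example action, and the function names are illustrative assumptions rather than code from any of these projects.

```python
import numpy as np

N_BINS = 256            # bins per action dimension (RT-2 style; assumed here)
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def actions_to_tokens(actions: np.ndarray) -> np.ndarray:
    """Map continuous actions in [LOW, HIGH] to integer bin ids in [0, N_BINS - 1]."""
    clipped = np.clip(actions, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)            # now in [0, 1]
    return np.minimum((scaled * N_BINS).astype(np.int64), N_BINS - 1)

def tokens_to_actions(tokens: np.ndarray) -> np.ndarray:
    """Map bin ids back to the center of each bin."""
    return LOW + (tokens.astype(np.float64) + 0.5) * (HIGH - LOW) / N_BINS

# Hypothetical 7-DoF action: xyz delta, roll/pitch/yaw delta, gripper.
action = np.array([0.10, -0.25, 0.0, 0.0, 0.0, 0.5, 1.0])
tokens = actions_to_tokens(action)
recovered = tokens_to_actions(tokens)
```

The round trip loses at most half a bin width per dimension, which is the price of treating actions as discrete "words".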
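The paper's key ideas
Three core pieces:
- 3D input: like swapping a flat recipe photo for a 3D model. Instead of looking only at RGB images, the model first converts RGB-D (color image plus depth map) or multi-view images into a point cloud, then feeds it to the large model through a 3D encoder (a Point Transformer or 3D-LLM style encoder that compresses the point cloud into a sequence of tokens).
- Generating future observations: like playing out "if I make this move, what will the board look like" in your head during chess. The model outputs not only actions but also what the scene would look like if the action were executed; this can be a predicted image, a goal-pose point cloud, or a scene description.
- Future observations as a bridge: before, it was "hear the instruction, act immediately"; now it is "hear the instruction, picture a goal image, then adjust actions to match it". The extra imagined image gives long-horizon tasks subgoals that can be aligned segment by segment.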

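How it does it (method)
Step 1: prep the ingredients (build 3D training data). A chef cannot cook without ingredients. The paper curates and annotates a set of embodied datasets (recordings of robots performing tasks, plus annotations) that carry 3D information. Key fields: RGB-D frames, camera intrinsics and extrinsics, the target object's 3D bounding box, action trajectories, and language instructions. This lets the model see language, 3D scene, action, and future scene together in one sample. For the exact data scale and sources, consult the paper.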
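To make the list of key fields tangible, here is a minimal sketch of what one training sample could look like. The class name, field names, array shapes, and the axis-aligned box format are all assumptions for illustration; the paper's actual schema would need to be checked.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EmbodiedSample:
    """One hypothetical training sample; not the paper's actual schema."""
    instruction: str          # language instruction
    rgb: np.ndarray           # (H, W, 3) color image
    depth: np.ndarray         # (H, W) depth map in meters
    intrinsics: np.ndarray    # (3, 3) camera matrix
    extrinsics: np.ndarray    # (4, 4) camera-to-world transform
    target_box: np.ndarray    # (6,) xmin, ymin, zmin, xmax, ymax, zmax
    actions: np.ndarray       # (T, action_dim) action trajectory
    goal_rgb: np.ndarray      # (H, W, 3) future / goal frame used as supervision

    def validate(self) -> None:
        """Raise ValueError if any array shape is inconsistent."""
        h, w = self.depth.shape
        expected = {
            "rgb": (self.rgb.shape, (h, w, 3)),
            "goal_rgb": (self.goal_rgb.shape, (h, w, 3)),
            "intrinsics": (self.intrinsics.shape, (3, 3)),
            "extrinsics": (self.extrinsics.shape, (4, 4)),
            "target_box": (self.target_box.shape, (6,)),
        }
        for name, (got, want) in expected.items():
            if got != want:
                raise ValueError(f"{name}: expected shape {want}, got {got}")
        if self.actions.ndim != 2:
            raise ValueError("actions must have shape (T, action_dim)")

sample = EmbodiedSample(
    instruction="pick up the soy sauce",
    rgb=np.zeros((4, 6, 3), dtype=np.uint8),
    depth=np.ones((4, 6), dtype=np.float32),
    intrinsics=np.eye(3),
    extrinsics=np.eye(4),
    target_box=np.array([0.0, 0.0, 0.0, 0.1, 0.1, 0.2]),
    actions=np.zeros((5, 7)),
    goal_rgb=np.zeros((4, 6, 3), dtype=np.uint8),
)
sample.validate()
```

Keeping the current frame and the goal frame in the same record is what allows the future-observation objective to be trained from ordinary demonstration data.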
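Step 2: build the skeleton (one head chef with several assistants). The head chef is a large language model (LLM) serving as the backbone. In front of it sits a 3D encoder, which translates point clouds / RGB-D into tokens the LLM can understand; behind it hang several heads (output ports for different tasks): one emits action tokens, another emits future observations (image tokens, 3D goal tokens, and other forms). A single forward pass serves both dishes: the action and the imagination.
A quick pause: what is a token? Think of it as a "word" in the LLM's vocabulary. An LLM only consumes sequences of such words, so images and point clouds must first be translated into "words" before it can take them in.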
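The "encoder in front, heads behind" layout and the idea of turning a point cloud into tokens can be sketched in a few lines of NumPy. This is a toy under heavy assumptions: the grouping by sorted x coordinate, the single attention-like mixing step standing in for the LLM, the random weights, and every size and name below are invented for illustration and are not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_TOKENS, ACTION_DIM, FUTURE_DIM = 32, 8, 7, 16  # arbitrary toy sizes

# Randomly initialized weights stand in for trained parameters.
W_EMBED = rng.normal(size=(6, D_MODEL))
W_ACTION = rng.normal(size=(D_MODEL, ACTION_DIM))
W_FUTURE = rng.normal(size=(D_MODEL, FUTURE_DIM))

def points_to_tokens(points: np.ndarray) -> np.ndarray:
    """Turn an (N, 6) xyz+rgb cloud into N_TOKENS embedded tokens."""
    order = np.argsort(points[:, 0])                      # crude grouping: sort along x
    groups = np.array_split(points[order], N_TOKENS)      # chunks of nearby points
    pooled = np.stack([g.mean(axis=0) for g in groups])   # (N_TOKENS, 6)
    return pooled @ W_EMBED                               # (N_TOKENS, D_MODEL)

def backbone(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the LLM: one attention-like mixing step, then mean pooling."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return (attn @ tokens).mean(axis=0)                   # (D_MODEL,)

def forward(points: np.ndarray, text_tokens: np.ndarray):
    """One forward pass that returns both the action and the future prediction."""
    tokens = np.concatenate([text_tokens, points_to_tokens(points)], axis=0)
    hidden = backbone(tokens)
    return hidden @ W_ACTION, hidden @ W_FUTURE

points = rng.uniform(size=(100, 6))            # fake point cloud
text_tokens = rng.normal(size=(4, D_MODEL))    # fake embedded instruction
scene_tokens = points_to_tokens(points)
action, future = forward(points, text_tokens)
```

The point to take away is structural: text tokens and scene tokens share one sequence, and both outputs are read off the same hidden state.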
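Step 3: practice (multi-task joint training). Like teaching a student three subjects at once so that they reinforce each other:
- Language alignment: align 3D tokens with text (similar to 3D-LLM), so the model knows which cluster of points "the cup is behind the book" refers to.
- Future-observation prediction: use the next frame or goal frame from the dataset as ground truth, forcing the model to practice imagining.
- Action regression: use real trajectories from human experts as ground truth, training a steady hand.
Attaching future observation as an auxiliary loss (a side exam, so to speak) means that, to score well on it, the model has to build a small internal world model of "if I move like this, how will the world change".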
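The three training signals above can be combined as a weighted sum, sketched below. The specific choices are assumptions made for illustration: cross-entropy for language alignment, mean squared error for actions and future observations, and a `future_weight` of 0.5 for the auxiliary term. The paper's real loss terms and weights would need to be checked.

```python
import numpy as np

def cross_entropy(logits: np.ndarray, target_id: int) -> float:
    """Negative log-likelihood of target_id under softmax(logits)."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[target_id])

def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between prediction and ground truth."""
    return float(np.mean((pred - target) ** 2))

def joint_loss(batch: dict, future_weight: float = 0.5) -> dict:
    """Sum the main losses and add the future-observation loss as an auxiliary term."""
    l_align = cross_entropy(batch["text_logits"], batch["text_target"])
    l_action = mse(batch["action_pred"], batch["action_target"])
    l_future = mse(batch["future_pred"], batch["future_target"])
    total = l_align + l_action + future_weight * l_future
    return {"align": l_align, "action": l_action, "future": l_future, "total": total}

# Hypothetical toy batch with hand-picked values.
batch = {
    "text_logits": np.zeros(4), "text_target": 2,
    "action_pred": np.array([0.1, 0.0, -0.2]), "action_target": np.zeros(3),
    "future_pred": np.ones(8), "future_target": np.zeros(8),
}
losses = joint_loss(batch)
```

Setting `future_weight` to 0 recovers a plain "language plus action" objective, which is one way to think about the ablation the note expects in the experiments.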
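Step 4: showtime (how it is used at inference). Given language and the current 3D observation, the model first produces an internal "imagined future", then decodes the action from the present and the imagined future together. This amounts to an embodied version of chain-of-thought (the model writes intermediate steps before the answer), except that the intermediate step here is not textual reasoning but an imagined frame of the scene a moment later.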
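The "imagine first, then act" control flow can be sketched as a loop with two pluggable functions. Everything concrete here is a toy assumption: the observation is reduced to a gripper position and a target position, `toy_imagine` and `toy_decode_action` are hypothetical stand-ins for the model's two outputs, and `MAX_STEP` is an arbitrary step limit.

```python
import numpy as np

MAX_STEP = 0.05  # assumed maximum gripper displacement per step, in meters

def imagine_then_act(instruction, obs, imagine, decode_action):
    """Predict a future observation first, then decode an action conditioned on it."""
    goal = imagine(instruction, obs)
    action = decode_action(instruction, obs, goal)
    return goal, action

def toy_imagine(instruction, obs):
    """Hypothetical 'future observation': the gripper has arrived at the target."""
    return {"gripper": obs["target"].copy(), "target": obs["target"].copy()}

def toy_decode_action(instruction, obs, goal):
    """Move toward the imagined gripper position, at most MAX_STEP per call."""
    delta = goal["gripper"] - obs["gripper"]
    dist = np.linalg.norm(delta)
    return delta if dist <= MAX_STEP else delta / dist * MAX_STEP

obs = {"gripper": np.zeros(3), "target": np.array([0.3, 0.0, 0.4])}
for _ in range(20):
    goal, action = imagine_then_act("pick up the soy sauce", obs,
                                    toy_imagine, toy_decode_action)
    obs = {"gripper": obs["gripper"] + action, "target": obs["target"]}

final_distance = float(np.linalg.norm(obs["gripper"] - obs["target"]))
```

In a real system both functions would be outputs of the same model; separating them here only makes the ordering explicit.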
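What the experiments do
Three kinds of experiments matter most (specific datasets, metrics, and numbers need to be checked against the paper):
- 3D reasoning / question answering: compare against pure 2D VLAs on 3D scene-understanding benchmarks to verify the benefit of adding 3D input.
- Goal generation / future-observation prediction: the visual quality and physical plausibility of the predicted next frame or goal pose, compared with ground truth.
- Downstream manipulation: run long-horizon manipulation in simulation (RLBench / CALVIN style) or on real robots and compare success rates with RT-2 / OpenVLA style 2D baselines, focusing on long-horizon tasks and tasks that need spatial reasoning.
Expected conclusion: 3D input and future-observation prediction each help on their own and help more together, especially on tasks involving depth, occlusion, and long horizons.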
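New terms you should know
- VLA (Vision-Language-Action): a vision encoder, a language model, and an action head chained into one end-to-end model that takes images plus a language instruction and outputs robot action tokens.
- Point cloud: a 3D representation of a scene made of points (x, y, z, optionally with color / normals); the 3D counterpart of a pixel image.
- RGB-D: an ordinary color image (RGB) plus a depth map (D, the depth of each pixel from the camera); given the camera intrinsics, a point cloud can be back-projected directly from RGB-D.
- Future observation prediction: the model predicts not only actions but also what the scene looks like after execution, which provides extra supervision and a planning signal.
- 3D-LLM style encoder: the bridge module that compresses a 3D point cloud into a sequence of tokens for the LLM; a common approach is to extract 3D features first (Point Transformer and the like), then project them into the LLM's token space.
- Embodied chain-of-thought: the intermediate reasoning step is an imagined future scene rather than pure text, so the model's thinking stays close to the physical world.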
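The RGB-D entry says a point cloud can be back-projected from a depth map. Under the standard pinhole camera model this is a few lines of NumPy; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the tiny depth map below are made-up values, and depth is assumed to be measured along the camera's optical axis with 0 marking invalid pixels.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) depth map into an (N, 3) cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # u: column index, v: row index
    z = depth
    x = (u - cx) * z / fx                            # pinhole model: u = fx * x / z + cx
    y = (v - cy) * z / fy                            # pinhole model: v = fy * y / z + cy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop invalid (zero-depth) pixels

depth = np.full((4, 6), 2.0)   # a flat wall 2 m away
depth[0, 0] = 0.0              # one invalid pixel
cloud = depth_to_point_cloud(depth, fx=100.0, fy=100.0, cx=2.5, cy=1.5)
```

To attach color, the RGB image can be flattened with the same mask and concatenated, which gives the (x, y, z, r, g, b) points mentioned in the point cloud entry.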
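How it relates to other papers
- RT-2 / OpenVLA: the direct 2D rivals of 3D-VLA; same VLA paradigm, differing in input dimensionality (2D vs 3D) and in whether future observations are generated explicitly.
- 3D-LLM: essentially the upstream of this paper. 3D-LLM addressed whether 3D scenes can be fed into an LLM; 3D-VLA extends that to action output.
- World-model line (Dreamer / GR-1 / 1X World Model): all stress modeling the future before deciding. 3D-VLA's future-observation idea follows the same line, but its backbone is an LLM rather than a pure RL agent.
- 3D Diffusion Policy: handles 3D input plus actions from a diffusion perspective; it and 3D-VLA are two routes to a 3D-aware policy with different backbones.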
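How I suggest reading it
- Start with the method figure of RT-2 / OpenVLA to see what a standard VLA looks like (language + image → action tokens). That is the starting point 3D-VLA modifies.
- Focus on what was added: compared with a plain VLA, 3D-VLA adds (a) a 3D encoder and (b) a future-observation head. When reading the paper's method figure, watch for these two additions.
- Skip the 3D-LLM style details: if you do not plan to reproduce the work, the internals of the 3D encoder can be skimmed; remembering that it turns point clouds into tokens for the LLM is enough.
- Read the long-horizon / spatial tasks in the experiments closely: this is where 3D input and future prediction should pay off, while on short-horizon grasping the gap to 2D VLAs is likely to be small.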
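Why it is worth reading
- Paradigm signal: VLAs moving from 2D to 3D is a clear trend, and 3D-VLA is one of the earlier and more complete papers in that direction.
- Reusable method: using future observation as an auxiliary task is a lightweight but effective design that can transfer to other policy training setups.
- Position in the field: it sits at the intersection of VLA, 3D, and world models, so reading it helps sort out how the three relate.
- Springboard for further reading: its comparison tables lead to neighboring work such as RT-2, OpenVLA, 3D-LLM, and 3D Diffusion Policy, making it a good entry point.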
Cite this note
@online{eai_3d_vla_2026,
title = {(readable note) 3D-VLA},
author = {Zhou, Jason},
year = {2026},
note = {Note on a 2024 paper},
howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/3d-vla/}},
organization = {Embodied AI Reading Station}
}