End-to-End VLA · Plate Nº 110

3D Diffusion Policy (DP3)

6 min read · 2157 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public material; the full paper was not read.

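In one sentence (TL;DR)

Teach a robot to wipe a table by showing it 3D point clouds with depth instead of photos. The payoff: about 10 recorded demonstrations are enough to learn a new task.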

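The scenario

Imagine you have just moved into a new home and need to teach a younger brother, who has never done chores, how to wipe a table. You have two ways to teach him:

  • Method A (ordinary photos): you film yourself wiping the table and show him the recording. What he memorizes is "when this pattern appears in the picture, the hand moves like this." The problem: tomorrow, in the living room, the table is dark wood and the light is yellowish, and he is immediately lost because the picture looks different.
  • Method B (3D glasses): you have him wear glasses that reveal how far away things are. What he memorizes is "the tabletop is that flat patch about 30 cm in front of me." Move to the living room or swap the table and he stays calm: a flat surface is still a flat surface, and the geometry has not changed.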

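Robots face the same dilemma when learning actions. An ordinary camera (2D images) is Method A and tends to fail when the room changes. Seeing 3D shape directly (a 3D point cloud, here a sparse one) is Method B, which is more robust to appearance changes.

What DP3 does is give the robot "3D glasses" and pair them with a diffusion model (a generative model that refines a noisy sample, step by step, into continuous actions) to predict how the hand should move.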

[Figure, Plate Nº I. 3D Diffusion Policy (DP3), scene illustration: the real-world problem this paper addresses]

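How people did it before

  • Behavior Cloning + 2D images (e.g., BC-RNN from the robomimic family): camera RGB images go into a neural network that regresses actions. The drawbacks are high data requirements and weak generalization.
  • Diffusion Policy (RSS 2023): models action generation as a denoising process and handles multimodal action distributions well. It still consumes 2D images, so it is sensitive to appearance and viewpoint.
  • Implicit Behavior Cloning / energy-based models: expressive in theory, but training is unstable and they are less engineering-friendly than diffusion models.
  • 3D-based methods (PerAct, C2F-ARM, etc.): voxels or point clouds, often with a Transformer. They usually need multi-view RGB-D and fairly heavy networks, and they do not combine diffusion with 3D for policy learning.
  • Common pain points: data-hungry, not robust, or unstable to train.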

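The paper's key idea

An analogy: the original Diffusion Policy is like a trainee learning to drive from photos. DP3 swaps its eyes from an ordinary camera to "laser range-finding glasses" while leaving the driving brain (the decision network) essentially untouched.

In one sentence: keep Diffusion Policy's action-denoising framework intact, and replace its visual encoder (the part of the network that "sees") with a very lightweight 3D point cloud encoder.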

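Why this combination works:

  • The diffusion model handles the action side. A single task can have several reasonable motions (for example, going around on the left or on the right), and diffusion models handle such multimodal trajectories well, even with only 10 demonstrations.
  • The 3D point cloud handles the perception side. It captures only geometry, so it is largely insensitive to lighting and tablecloth color and, once expressed in a common frame via the camera extrinsics, less sensitive to camera placement. Geometry is what the task actually cares about.
  • The authors deliberately chose a sparse point cloud + a minimal MLP encoder over heavyweight PointNet++ / Transformer encoders. Intuition says bigger models are smarter, but with only 10 demonstrations an oversized model tends to memorize (overfit): less is more.

Put differently, the long chain "memorize the picture → learn the action" is shortened to "understand the shape → learn the action." A shorter chain needs less data.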

[Figure, Plate Nº II. 3D Diffusion Policy (DP3), method illustration: the core pipeline]

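How it works (method)

Input processing: the depth image from a single-view RGB-D camera is back-projected into a point cloud, then downsampled with farthest point sampling (FPS) to a fixed, sparse number of points (for example a few hundred to a bit over a thousand; the exact number requires reading the paper). The point cloud is expressed in the robot base frame, which amounts to built-in viewpoint alignment.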
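
The input step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's code: the function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the identity extrinsic, and the choice of 512 points are all assumptions made for the example.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W) in meters to camera-frame points (N, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row grids
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop pixels with no valid depth

def to_base_frame(points_cam, T_base_cam):
    """Apply a 4x4 camera-to-base transform (assumed known from calibration)."""
    return points_cam @ T_base_cam[:3, :3].T + T_base_cam[:3, 3]

def farthest_point_sampling(points, k, start=0):
    """Greedy FPS: repeatedly pick the point farthest from all points chosen so far."""
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return points[chosen]

# Toy usage: a flat "table" 0.5 m away, one invalid pixel, identity extrinsics.
depth = np.full((48, 64), 0.5)
depth[0, 0] = 0.0
cloud = depth_to_points(depth, fx=60.0, fy=60.0, cx=32.0, cy=24.0)
cloud = farthest_point_sampling(to_base_frame(cloud, np.eye(4)), k=512)
```

The greedy loop costs O(N·k), which is acceptable for a sketch; a real pipeline would likely use a GPU implementation.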

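Visual encoder: a very shallow MLP (multilayer perceptron) is applied to each point, followed by a simple pooling step that yields a compact geometric feature vector. The authors stress repeatedly that the simpler the encoder, the more stable it is in the few-shot regime. This runs opposite to the 2D-vision habit of brute-forcing the problem with a ResNet-50.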
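
A minimal sketch of "shared per-point MLP, then pooling" is shown below. The layer sizes, the random initialization, and the use of max pooling are assumptions for illustration; the paper's actual encoder may differ in depth, normalization, and output size.

```python
import numpy as np

def init_mlp(sizes, rng):
    """Random weights for a small MLP, e.g. sizes = [3, 64, 64] (hypothetical)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def encode_point_cloud(points, params):
    """Apply the same MLP to every point, then max-pool over points.

    points: (N, 3) array -> returns a (D,) feature vector.
    """
    h = points
    for i, (w, b) in enumerate(params):
        h = h @ w + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h.max(axis=0)  # pooling makes the feature independent of point order

rng = np.random.default_rng(0)
params = init_mlp([3, 64, 64], rng)
feature = encode_point_cloud(rng.normal(size=(512, 3)), params)
```

Because the pooling is a maximum over points, shuffling the point cloud leaves the feature unchanged, which is the property a point set encoder needs.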

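Policy backbone: DP3 reuses Diffusion Policy's 1D convolutional U-Net (or the Transformer variant), conditioned on the geometric feature plus the robot's proprioceptive state, and denoises a sequence of future actions (an action chunk). The training objective is the standard DDPM/EDM-style noise regression loss.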
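
To make the training objective concrete, here is a sketch of one DDPM-style noise regression step. It assumes a linear beta schedule, a noise-prediction target, and a placeholder `denoiser(noisy_actions, t, cond)` callable standing in for the U-Net; the released DP3 code may use a different schedule or prediction target.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(actions, noise, t, alpha_bar):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * actions + np.sqrt(1.0 - a) * noise

def noise_regression_loss(denoiser, actions, cond, alpha_bar, rng):
    """Sample a timestep and noise, then score the predicted noise with MSE."""
    t = int(rng.integers(0, len(alpha_bar)))
    noise = rng.normal(size=actions.shape)
    noisy = add_noise(actions, noise, t, alpha_bar)
    pred = denoiser(noisy, t, cond)
    return float(np.mean((pred - noise) ** 2))

# Toy usage: a 16-step chunk of 7-D actions, conditioned on
# [geometric feature, proprioception] concatenated into one vector.
rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(100)
cond = np.concatenate([np.zeros(64), np.zeros(7)])
actions = rng.uniform(-1.0, 1.0, size=(16, 7))
untrained = lambda x, t, c: np.zeros_like(x)  # stand-in for the real network
loss = noise_regression_loss(untrained, actions, cond, alpha_bar, rng)
```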

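Deployment: at inference, start from pure noise and denoise iteratively for a small number of steps (far fewer than the 1000 of the original DDPM, typically with DDIM or a faster sampler) to obtain the action sequence. Execute the first few steps in a receding-horizon fashion, then replan.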
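
The few-step denoising described above can be illustrated with a deterministic DDIM loop (eta = 0). This is a sketch under assumptions: the schedule, the 10 inference steps, and the `denoiser(x, t, cond)` noise-prediction interface are chosen for the example rather than taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    return np.cumprod(1.0 - np.linspace(beta_start, beta_end, num_steps))

def ddim_sample(denoiser, cond, shape, alpha_bar, num_inference_steps, rng):
    """Deterministic DDIM over a subsampled timestep schedule."""
    T = len(alpha_bar)
    timesteps = np.linspace(T - 1, 0, num_inference_steps).round().astype(int)
    x = rng.normal(size=shape)  # start from pure noise
    for i, t in enumerate(timesteps):
        eps = denoiser(x, int(t), cond)  # predicted noise at this step
        a_t = alpha_bar[t]
        # alpha_bar of the next (less noisy) step; 1.0 means fully clean.
        a_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    return x

# Toy usage: an "oracle" denoiser that knows the clean action chunk, used only
# to show that 10 steps are enough to walk from noise back to that chunk.
alpha_bar = make_alpha_bar(100)
clean = np.full((16, 7), 0.25)
oracle = lambda x, t, c: (x - np.sqrt(alpha_bar[t]) * clean) / np.sqrt(1.0 - alpha_bar[t])
chunk = ddim_sample(oracle, None, clean.shape, alpha_bar, 10, np.random.default_rng(0))
```

With a trained network in place of the oracle, the returned array would be the predicted action chunk handed to the receding-horizon executor.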

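What the experiments do

DP3 is evaluated on many tasks, both in simulation and on real robots. Exact numbers require reading the paper, but the structure is roughly:

  • Task suite: several simulation benchmarks (manipulation and dexterous-hand suites such as Adroit, MetaWorld, and DexArt) plus real-robot tasks, with an emphasis on task diversity.
  • Sample efficiency: only 10 demonstrations per task in simulation (the abstract reports 40 per task on the real robot), compared against baselines (2D Diffusion Policy, BC-RNN, IBC, etc.) on success rate with the same data.
  • Generalization tests: change the scene, the object color or texture, or the camera viewpoint, or add distractors, and measure how much the success rate drops. This is where a 3D representation shows its advantage most clearly.
  • Ablations: swap the point cloud encoder (light MLP vs. PointNet vs. heavy Transformer), vary the number of points, add or remove color, and so on. One counterintuitive finding is that adding color makes results worse, which again supports "less is more" in the few-shot regime.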

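A few new terms you should know

  • Point cloud: a set of 3D points, each with at least (x, y, z). It can be obtained by back-projecting the depth image of an RGB-D camera.
  • Farthest Point Sampling (FPS): a downsampling method that picks points that are as far from each other as possible. It preserves geometric structure better than random sampling.
  • Diffusion Policy: a diffusion model that casts policy learning as denoising an action sequence out of noise; the RSS 2023 paper was among the state of the art at the time.
  • Action chunk / receding horizon: predict several future actions at once (say 16 steps), execute only the first few (say 8 steps), then predict again. Related to action chunking in ACT and receding-horizon control in MPC.
  • Proprioception: the robot's own state, such as joint angles and end-effector pose, which does not depend on external sensors.
  • DDIM / EDM: faster sampling schemes for diffusion models that cut inference from about 1000 steps to a few dozen or even single digits; key for deployment.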
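
The action chunk / receding horizon entry can be made concrete with a tiny control loop. This is a toy sketch: `policy`, `env_step`, the 1-D state, and the 16/8 split (taken from the example numbers in the glossary) are illustrative assumptions, not the paper's settings.

```python
def run_receding_horizon(policy, env_step, obs, horizon=16, n_exec=8, max_steps=40):
    """Predict `horizon` actions, execute only the first `n_exec`, then replan."""
    executed = []
    while len(executed) < max_steps:
        chunk = policy(obs)  # a sequence of `horizon` future actions
        assert len(chunk) == horizon
        for action in chunk[:n_exec]:
            obs = env_step(obs, action)
            executed.append(action)
            if len(executed) >= max_steps:
                break
    return obs, executed

# Toy 1-D world: the state is a position, an action is a displacement,
# and the "policy" spreads the remaining distance to the goal over the chunk.
def toy_policy(obs, goal=1.0, horizon=16):
    return [(goal - obs) / horizon] * horizon

final_obs, actions = run_receding_horizon(toy_policy, lambda o, a: o + a, obs=0.0)
```

Each replan uses the latest observation, so errors made while executing one chunk are corrected by the next prediction instead of accumulating over the whole 16-step plan.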

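How it relates to other papers

  • Direct predecessor: Diffusion Policy. DP3 swaps out its visual input and keeps the skeleton. You need to understand DP before reading DP3.
  • Follow-ups / concurrent 3D work: iDP3 (Improved DP3) takes the approach to real humanoid robots at larger scale; EquiBot adds equivariance to 3D policies.
  • 2D contemporaries: ACT (ALOHA, Mobile ALOHA) takes a Transformer + bimanual route that typically uses more demonstrations per task, nearly orthogonal to DP3's "few demonstrations + 3D."
  • VLA large-model route: OpenVLA and π0 push generalization with large models and massive data. DP3 represents the other route: structured perception + small data. Between 2024 and 2026 these are the two major styles in manipulation.
  • 3D representation camps: compared with PerAct's voxel route (and RVT's rendered multi-view route), DP3 picks a sparse point cloud + a very light encoder, a deliberately stripped-down design.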

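How I suggest reading it

  1. Read diffusion-policy.md first: nearly all of DP3's action-side design is inherited, and without that background it is hard to see why the U-Net is set up the way it is.
  2. Read Section 3 (method) + Figure 2 (pipeline) of the DP3 paper: focus on how the point cloud enters, how simple the encoder is, and how the condition is injected into the diffusion model.
  3. Jump to the "generalization" and "ablation" parts of the experiments: this is where DP3's real value lies. Why 3D is more robust than 2D, why color is left out, and why a light encoder works better.
  4. Optional: skim iDP3 to see how this line of work reached humanoid robots in the second half of 2024, and to gauge DP3's influence.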

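Why it is worth reading

  • An existence proof for sample efficiency: against the narrative that robot learning needs 100,000 demonstrations, DP3 handles some tasks with 10 demonstrations. That alone is a strong signal that the representation can matter more than the amount of data.
  • Counterintuitive "less is more": light encoder > heavy encoder, pure geometry > geometry + color. Both findings recur in later few-shot robot learning work.
  • Engineering-friendly: single-view RGB-D + one MLP + one U-Net. None of the components is heavy, the barrier to reproduction is low, and it is an excellent starting point for 3D manipulation.
  • Positioning: outside the VLA / large-model route, DP3 represents the "structured priors + small data" route. It is required reading for understanding the full landscape of robot learning.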

Cite this note
BibTeX
@online{eai_dp3_2026,
  title       = {(readable note) 3D Diffusion Policy (DP3)},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/dp3/}},
  organization = {Embodied AI Reading Station}
}
