Datasets & Benchmarks · Plate Nº 34

DROID

7 min read · 2308 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

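In one sentence (TL;DR)

18 robot stations at 13 institutions worldwide recorded robots doing everyday tasks, pooling about 76k trajectories across 564 real scenes, so that robots can do more than "the few chores on their own lab table".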

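What is the setting

Imagine you have only ever taught a child to wipe the table in your own kitchen. At home the child is fast and confident, but at grandma's house or in a community meeting room the counter height is different, the cloth is a different color, the lighting has changed, and the child freezes. That is the predicament robots have long been stuck in.

  • Earlier training data resembled "one parent teaching a child to fold the same 5 pieces of clothing over and over in the family kitchen": fixed scene, fixed lighting, the same tabletop every time. The child (the model) becomes fluent, then is lost as soon as the room changes
  • What DROID does is "gather 18 households from around the world (in practice, 18 robot stations spread over 13 institutions) and have each one record, in its own kitchen, living room, office, dorm, or bathroom, how it teaches the child to unscrew bottle caps, open drawers, and pick up cups, then send all the footage to one place"
  • After seeing that many kinds of "home", the child is no longer completely helpless when walking into an unfamiliar room

The ailment it targets is the old "greenhouse training data" problem in robot learning: the same arm, the same background, fixed lighting, and a policy that collapses the moment it leaves the lab.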

Plate Nº I. DROID, scene illustration: the real-world problem this paper addresses

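What people did before

  • Single-lab data: ACT, Diffusion Policy, and similar work mostly train on tens to a few thousand trajectories collected in one lab. RT-1 is far larger (about 130k episodes) but still comes from a small number of similar office-kitchen environments, so diversity stays limited
  • Large-scale collection in simulation: Isaac Gym / RoboSuite / RLBench take the simulation route, which gives volume but leaves a sim-to-real gap that is hard to close
  • Cross-institution joint datasets: Open X-Embodiment (2023) was the first to pool data from 22 robot types and dozens of labs, but the heterogeneous hardware makes the action space hard to unify
  • Crowd-sourced human demonstrations: BC-Z, RoboNet, and others tried crowd-sourcing, but the scenes remain fairly controlled
  • Shared weakness: either "unified hardware but a single scene" or "many scenes but messy hardware that is hard to train on". Nobody had reached real scale on the "unified hardware × highly diverse real scenes" path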

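The paper's key idea

One hardware standard, one collection protocol, and global collaboration, pushing "unified hardware" and "diverse scenes" to the limit at the same time.

Three concrete pillars:

  1. Unified hardware: every collection station uses a Franka Panda 7-DoF arm, two ZED stereo cameras, one wrist camera, and Oculus controller teleoperation. Action and observation spaces are therefore identical across stations, and downstream training needs no cross-embodiment alignment
  2. Diverse scenes and tasks: 13 institutions running 18 robot stations each collect in their own real environments (kitchens, offices, dorms, bathrooms, ...), which naturally yields a distribution of 564 scenes and 86 tasks
  3. Crowd-sourced scale: about 350 hours of teleoperated demonstrations and about 76k trajectories in total, one of the largest real-robot datasets on a single hardware platform at the time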
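
One way to picture what "identical action and observation spaces" buys is a single record type that every station fills in the same way. The sketch below is illustrative only: the field names, the camera keys, and the 7-dimensional action layout are assumptions made for this note, not DROID's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

import numpy as np

# Hypothetical camera keys and action layout, chosen for illustration only.
CAMERA_KEYS = ("exterior_1", "exterior_2", "wrist")
ACTION_DIM = 7  # assumed: 6D end-effector command + 1 gripper command


@dataclass
class Step:
    joint_positions: np.ndarray    # shape (7,), one value per arm joint
    images: Dict[str, np.ndarray]  # camera key -> HxWx3 uint8 image
    action: np.ndarray             # shape (ACTION_DIM,)


def conforms(step: Step) -> bool:
    """Check that a step matches the shared (assumed) station layout."""
    return (
        step.joint_positions.shape == (7,)
        and set(step.images) == set(CAMERA_KEYS)
        and step.action.shape == (ACTION_DIM,)
    )


def stack_actions(steps: List[Step]) -> np.ndarray:
    """Batch actions from any mix of stations without per-robot remapping."""
    if not all(conforms(s) for s in steps):
        raise ValueError("step does not match the shared layout")
    return np.stack([s.action for s in steps])
```

With heterogeneous hardware, `stack_actions` would need a per-robot conversion step before batching; with one build sheet, the check is a plain shape comparison.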

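Its underlying stance is that "what robot foundation models lack is not smarter algorithms but data closer to the real-world distribution", which mirrors the "scaling data" logic of the LLM/VLM era.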

Plate Nº II. DROID, method illustration: the core pipeline

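How it works (method)

Unified hardware platform: like a restaurant chain with one menu. The 18 stations were not each built their own way. All follow the same "build sheet": a Franka Panda 7-DoF arm, a Robotiq gripper, two ZED 2 stereo cameras (for the wide view), one ZED Mini (strapped to the wrist for close-ups), and an Oculus Quest 2 VR set used as the remote control. Each branch serves dishes (data) with a different flavor, but the kitchen equipment is the same, so the returning customer (the model) does not have to relearn everything at each branch.

Hold on, one step back: what does "teleoperation" mean here? Put simply, a person holding a VR controller acts as a puppeteer. The arm moves however the hand moves, and the computer records the person's motions together with what the robot's cameras see, as teaching material.

Teleoperation and collection protocol: like recording a cooking tutorial. The operator picks up the Oculus controller and leads the arm's "wrist" through 6D poses (position + orientation) in space, while the arm follows compliantly under impedance control. Every demonstration synchronously records RGB images, depth, proprioception (joint angles/velocities), and action commands, and is paired with a natural-language task description such as "put the mug in the sink".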
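
To make the "hand leads, arm follows" idea concrete, here is a minimal sketch of relative-pose teleoperation and per-tick logging. It assumes the controller and robot frames are already aligned and represents orientation as 3×3 rotation matrices; the function names, the log keys, and the `pos_scale` parameter are hypothetical and are not taken from the DROID codebase.

```python
import numpy as np


def controller_delta(prev_pos, prev_rot, cur_pos, cur_rot):
    """Relative controller motion between two ticks (translation, rotation matrix)."""
    d_pos = cur_pos - prev_pos
    d_rot = cur_rot @ prev_rot.T  # rotation taking the previous orientation to the current one
    return d_pos, d_rot


def end_effector_target(ee_pos, ee_rot, d_pos, d_rot, pos_scale=1.0):
    """Apply the operator's relative motion to the current end-effector pose."""
    return ee_pos + pos_scale * d_pos, d_rot @ ee_rot


def record_step(log, images, joint_positions, target_pos, target_rot, instruction):
    """Append one synchronized observation/action pair to the demonstration log."""
    log.append({
        "images": images,                    # camera name -> frame
        "joint_positions": joint_positions,  # proprioception
        "action": {"position": target_pos, "rotation": target_rot},
        "language_instruction": instruction,
    })
```

Using relative rather than absolute controller poses means the operator can stand anywhere; only the motion since the last tick is transferred to the arm.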

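Free choice of tasks and scenes: like letting each branch nominate its own signature dish. The paper does not prescribe which 86 tasks must be collected. It only gives a few broad categories: pick-and-place, articulated object manipulation (opening drawers or doors, anything with a hinge or slide), tool use, and deformable objects (towels, clothes, and other things that change shape). The rest is left to each institution to improvise in its own scenes, with clustering and labeling done afterwards. This bottom-up diversity is what keeps the dataset close to the real world.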
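
As a rough picture of "label afterwards", the sketch below assigns each free-form instruction to one of the broad categories by keyword matching. This is a deliberately simple stand-in, not the paper's labeling procedure; the keyword lists and the `categorize` function are hypothetical.

```python
# Hypothetical keyword lists; categories are checked in the order written here.
CATEGORY_KEYWORDS = {
    "articulated": ("open", "close", "drawer", "door"),
    "deformable": ("fold", "towel", "cloth", "shirt"),
    "tool_use": ("wipe", "scoop", "sweep", "stir"),
    "pick_and_place": ("put", "pick", "place", "move"),
}


def categorize(instruction: str) -> str:
    """Return the first category whose keywords appear in the instruction."""
    words = set(instruction.lower().split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & set(keywords):
            return category
    return "other"


print(categorize("put the mug in the sink"))  # pick_and_place
```

A real pipeline would more plausibly cluster instructions by verb or by text embeddings, since keyword rules break down when one instruction mentions several categories.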

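Quality control and release: like headquarters reviewing the videos that franchisees upload. Before data reaches the central repository it passes automatic checks (trajectory length, camera frame rate, annotation completeness) and manual spot checks. The final release is open-sourced in standard formats (HDF5 + RLDS), along with a Diffusion Policy model pretrained on DROID that others can use as a baseline.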
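
The three automatic checks named above can be sketched as a small validator. Everything specific here is an assumption for illustration: the `Trajectory` fields, the minimum length, the 15 Hz target, and the tolerance are placeholder values, not the thresholds used by the DROID team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    timestamps: List[float]  # seconds, one entry per recorded frame
    instruction: str         # natural-language task description


def validate(traj: Trajectory, min_steps: int = 10,
             expected_hz: float = 15.0, hz_tol: float = 0.2) -> List[str]:
    """Return a list of problems; an empty list means the trajectory passes."""
    problems = []
    n = len(traj.timestamps)
    if n < min_steps:
        problems.append("too_short")
    if n >= 2:
        duration = traj.timestamps[-1] - traj.timestamps[0]
        hz = (n - 1) / duration if duration > 0 else 0.0
        # Flag frame rates more than hz_tol (as a fraction) away from the target.
        if abs(hz - expected_hz) > hz_tol * expected_hz:
            problems.append("bad_frame_rate")
    if not traj.instruction.strip():
        problems.append("missing_instruction")
    return problems
```

Returning a list of named problems rather than a single boolean makes it easy to route borderline trajectories to the manual spot check instead of discarding them.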

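What the experiments do

The core experiments ask one question: do DROID's scale and diversity actually improve the generalization of downstream policies?

  • Pretraining + fine-tuning comparison: pretrain Diffusion Policy on DROID, then fine-tune with few samples on new scenes/tasks, and compare against two baselines, "training from scratch" and "pretraining on Open X-Embodiment". The paper reports that DROID pretraining leads clearly in success rate in new environments (exact numbers require reading the paper)
  • Out-of-distribution scenes: test zero-shot and few-shot performance in real environments that never appear in the dataset (third-party scenes outside the collaborating institutions)
  • Data-scale ablation: train on 25%, 50%, and 100% of DROID and check whether performance rises monotonically with scale. This is the key evidence for whether a "scaling law" holds for robot data
  • Task-category ablation: analyze which task categories (such as deformable objects or tool use) benefit most from diversity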
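
For the data-scale ablation, a common way to make the 25% / 50% / 100% runs comparable is to draw nested subsets, so that each smaller set is contained in the larger ones. The sketch below shows that construction under stated assumptions: the function name and the seeding scheme are hypothetical, and whether the paper samples its subsets this way is not confirmed by this note.

```python
import random
from typing import Dict, Iterable, List, Sequence


def nested_subsets(trajectory_ids: Iterable[int],
                   fractions: Sequence[float] = (0.25, 0.5, 1.0),
                   seed: int = 0) -> Dict[float, List[int]]:
    """Shuffle once, then take prefixes so smaller subsets nest inside larger ones."""
    ids = list(trajectory_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    return {f: ids[: int(round(f * len(ids)))] for f in fractions}


subsets = nested_subsets(range(100))
print({f: len(ids) for f, ids in subsets.items()})  # {0.25: 25, 0.5: 50, 1.0: 100}
```

Nesting removes one source of noise: any change between the 25% and 50% runs comes from the added data rather than from a different random draw.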

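New terms you should know

  • Franka Panda: a 7-DoF collaborative robot arm and one of the de facto standards in research, widely adopted for its open control interface and usable impedance control
  • Teleoperation: a person drives the robot in real time through a controller (gamepad/VR/exoskeleton) to complete a task, and the recorded robot trajectory serves as a demonstration
  • Imitation Learning (IL): learning a policy from human demonstrations, most commonly via Behavior Cloning, which is the main way DROID is used
  • Open X-Embodiment (OXE): the cross-robot joint dataset led by Google in 2023, DROID's main point of comparison and its complement
  • RLDS (Reinforcement Learning Datasets): a standard robot/RL data format promoted by Google, the de facto standard for cross-dataset training
  • Diffusion Policy: a policy class that generates action sequences with a diffusion model; the DROID paper uses it as the pretraining baseline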

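How it relates to other papers

  • Upstream/prerequisites: RT-1 (2022) was an early demonstration that large-scale real data plus a Transformer can learn general manipulation; Open X-Embodiment (2023) opened the cross-institution collaboration paradigm. DROID is the "unified-hardware, strengthened" version of this line
  • Contemporary contrast: Mobile ALOHA (2024) takes the "cheap hardware + small, high-quality data" route, while DROID takes the "standard hardware + large, diverse data" route. They are two complementary paths for real-robot data
  • Downstream use: robot foundation models from 2024-2025 such as OpenVLA and π0 draw on DROID as one of their pretraining sources; DROID + OXE is close to the default data combination for anyone training a general VLA (Vision-Language-Action) model today
  • Data vs. algorithms: complementary to "algorithm-side" work such as Diffusion Policy and ACT. DROID argues that the data side must scale too, and the two lines together make up the full picture of robot foundation models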

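How I suggest reading it

  1. Read the Abstract + Figure 1 first (10 minutes): get a clear view of the collection effort behind the numbers "13 institutions / 18 robots / 76k trajectories / 564 scenes / 86 tasks"
  2. Jump to the experiments section (30 minutes): focus on the "DROID pretraining vs. OXE pretraining vs. from scratch" comparison table to get a sense of DROID's relative value
  3. Go back to the method section (30 minutes): understand the hardware standard, the teleoperation protocol, and the data format. If you later want to build your own collection station or fine-tune on DROID, this is the engineering entry point
  4. Look at the task taxonomy and scene photos in the appendix (20 minutes): get a feel for how diverse the 564 scenes really are, and build intuition for what the real distribution of robot data looks like

If you are short on time, steps 1 and 2 are enough; steps 3 and 4 are for when you want to get hands-on.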

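Why it is worth reading

  • The dataset is one of the ImageNets of the robot era: many generalist robot model papers since 2024 cite DROID, and skipping the method section leaves a gap in your infrastructure knowledge
  • An introduction to "robot scaling": it gives one empirical answer to the question of whether data scaling holds for robots, and is a key reference for carrying LLM-era scaling thinking over to embodied AI
  • High engineering reference value: the hardware list, collection protocol, and data format form a ready-made "robot data collection starter kit" that a new lab can copy directly
  • A sample of field-wide collaboration: how 13 institutions running 18 robot stations handle data governance, quality control, and versioned releases is itself a research engineering practice worth borrowing for anyone running a large project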

Cite this note
BibTeX
@online{eai_droid_2026,
  title       = {(readable note) DROID},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/droid/}},
  organization = {Embodied AI Reading Station}
}
