End-to-End VLA · Plate Nº 119

RDT-1B: Diffusion Foundation Model for Bimanual Manipulation

7 min read · 2388 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

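In one sentence (TL;DR)

A "brain" for dual-arm robots from a Tsinghua team: 1 billion parameters, and a single spoken instruction is enough to get two robot arms cooperating on tasks like pouring water or folding clothes.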

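What the setting looks like: an everyday analogy

Have you ever tried folding clothes with one hand? It mostly does not work: one hand has to pin the collar while the other folds the sleeve, and without the second hand you get stuck. Pouring water is the same: the left hand steadies the cup while the right hand pours from the kettle. With one hand you have to set the cup down before picking up the kettle, so everything runs half a beat slow and spills are likely.

Robotics has long had the same awkward problem:

  • Single-arm robot = a one-handed chef: to pass a plate it must first put something down and then pick something up, so actions run serially
  • Dual-arm robot = a two-handed chef, but each hand has to "know what the other is doing" so the two do not fight each other
  • RDT-1B's goal = give this two-handed chef a big enough brain to understand what you say, read the camera images, and plan the next move of both hands at once. The brain is also not limited to one dish: it first learns across many kitchens (pretraining), then is fine-tuned for the team's own dual-arm hardware

Why is this hard? Two hurdles:

  1. The two hands' actions are tightly coupled (coordinated action), so one wrong step ruins everything, like a partner dance where one missed beat throws off the whole routine
  2. Dual-arm training data is scarce: most publicly available data is single-arm, so it is like having watched only videos of one-handed chefs and then being asked to train a two-handed one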

Plate Nº I. RDT-1B scene illustration: the real-world problem this paper addresses

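What earlier work did

  • Diffusion Policy (2023): showed on single-arm tasks that "a diffusion model as the policy" is more stable than MLP or Transformer heads, but the models are small (millions of parameters) and task-specific, with no multi-task or multi-robot generalization
  • RT-1 / RT-2 (Google): RT-1 is a Transformer policy and RT-2 uses a VLM (vision-language model) as the policy backbone; both work across tasks, but actions are discrete tokens and the focus is single-arm
  • Octo (2024): an open-source cross-robot policy with a Transformer backbone + diffusion head; it did the "pretrain on a mix of data from many robots" part, but the scale is still small (~100M) and dual-arm is not its main focus
  • ALOHA / Mobile ALOHA: dual-arm hardware + an imitation learning recipe, but the policy itself is a small task-specific model that cannot generalize zero-shot across tasks
  • Common gap: nobody had put "diffusion policy + large-scale pretraining + dual-arm" together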

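The paper's key idea

It carries the ChatGPT recipe of "large model + massive pretraining" from chat over to robots, and it insists on generating actions with a diffusion model: rather than having the model emit actions one character at a time like typing, it works like a painter who starts from a scribble and gradually wipes it into a clear picture.

The core in three points:

  1. Diffusion suits "painting" actions: dual-arm actions are continuous and high-dimensional (two 7-DoF arms give 14 or more dimensions), and one task can have several reasonable solutions (multi-modal). Diffusion models are good at sampling in this kind of space where the answer is not unique, and they lose less than forcing actions into discrete tokens (like squashing an oil painting into pixel blocks)
  2. Bigger is better, provided the backbone can carry it: the parameter count is pushed to the 1B range (the contemporaneous OpenVLA at 7B and π0 at 3B also showed that scaling helps); an ordinary U-Net does not hold up at this size, so it is replaced with a Transformer-style diffusion backbone (DiT, Diffusion Transformer)
  3. Learn broadly, then specialize: pretrain on 46 datasets (various robot embodiments, both single-arm and dual-arm) to build a general foundation, then fine-tune on the team's own dual-arm data for the target hardware, the same pattern as taking general education before choosing a major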
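
The multi-modality argument in point 1 can be made concrete with a toy case. The sketch below is an illustration only, not the paper's model: it assumes a hypothetical one-dimensional "steer around an obstacle" action where demonstrations go either left (-1.0) or right (+1.0), and it uses random draws from the demonstrations as a stand-in for a trained generative policy.

```python
import random
from statistics import mean

# Hypothetical demonstrations: half steer left (-1.0), half steer right (+1.0).
# Steering straight ahead (0.0) would hit the obstacle.
demos = [-1.0, 1.0, -1.0, 1.0]


def regression_policy(actions):
    """The constant that minimizes mean squared error is the mean of the data."""
    return mean(actions)


def sampling_policy(actions, rng):
    """Stand-in for a generative policy: draw one action from the data distribution."""
    return rng.choice(actions)


averaged = regression_policy(demos)                  # 0.0, between the two modes
sampled = sampling_policy(demos, random.Random(0))   # always one of the two modes
print("regression:", averaged, "sampled:", sampled)
```

The regression output sits between the two valid behaviours and matches neither demonstration, while a sampled action always lands on one of them. That is the informal reason a sampling-based action generator is attractive when a task has several valid solutions.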

Plate Nº II. RDT-1B method illustration: core pipeline

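How it works (method)

Architecture backbone: picture "a painter who looks at pictures and follows spoken instructions". In front of the painter are a few live photos (multiple RGB cameras), a small card listing joint angles (proprioception, i.e. the robot's own joint state), and your verbal instruction. On a sketch pad the painter starts from a tangle of lines and wipes away noise stroke by stroke, ending with an action sequence for "how both hands should move over the next few seconds". The backbone is a DiT-style Transformer, and the denoising happens directly in action space rather than on images.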
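
To make "denoising directly in action space" concrete, here is a minimal sketch of a diffusion sampling loop over a flat action vector. Several things are assumptions and not taken from the paper: the DDPM ancestral sampler, the linear noise schedule, the step count, and all function names. A toy oracle that already knows the target action stands in for the trained network, which in the real system would be conditioned on images, proprioception, and the language instruction.

```python
import math
import random


def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); returns betas, alphas, cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def sample_actions(predict_noise, dim, num_steps, rng):
    """DDPM ancestral sampling over an action vector of length `dim`."""
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise
    for t in reversed(range(num_steps)):
        eps = predict_noise(x, t, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added on the final step
    return x


# Hypothetical target action, used only by the toy oracle below.
TARGET = [0.3, -0.2, 0.5, 0.1]


def oracle_noise(x, t, alpha_bar):
    """Toy stand-in for the trained network: the exact noise relative to TARGET."""
    scale = math.sqrt(1.0 - alpha_bar)
    return [(xi - math.sqrt(alpha_bar) * ti) / scale for xi, ti in zip(x, TARGET)]


actions = sample_actions(oracle_noise, dim=len(TARGET), num_steps=50,
                         rng=random.Random(0))
print([round(a, 6) for a in actions])
```

With a perfect noise predictor the loop lands on the target action; a trained model only approximates that predictor, so its samples spread over the plausible actions for the given observation.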

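Hold on, one step back: what is a chunk? The model predicts a short stretch of future actions at once (say the next 1 second, a few dozen steps) rather than only the next step. The benefit is coherent motion without jitter, much as you write a whole character in one flow instead of stopping after every stroke.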
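
As a rough sketch of what chunking means for training data, the helper below slices a demonstration into per-timestep targets of fixed length. The function name, the scalar actions (real actions are vectors), and the choice to pad the tail by repeating the last action are all assumptions made for illustration.

```python
from typing import List, Sequence


def make_chunks(actions: Sequence[float], horizon: int) -> List[List[float]]:
    """For every timestep t, return the next `horizon` actions as one target chunk."""
    chunks = []
    for t in range(len(actions)):
        chunk = list(actions[t:t + horizon])
        # Assumed padding rule: repeat the final action so every chunk has
        # the same length near the end of the demonstration.
        chunk += [actions[-1]] * (horizon - len(chunk))
        chunks.append(chunk)
    return chunks


demo = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical 5-step demonstration
chunks = make_chunks(demo, horizon=3)
print(chunks[0])  # [0.0, 1.0, 2.0]
print(chunks[3])  # [3.0, 4.0, 4.0]
```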

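Unified action space: think of a translator who turns different dialects (6-dim single-arm, 14-dim dual-arm, and the assorted dimensions of other embodiments) into one "standard language". The paper designs a unified action vector format (physically interpretable unified action space) with a fixed set of slots shared by all robots. Each robot fills in the slots that match its own joints, and unused slots are flagged with a mask. This is what lets a 1B model learn shared regularities from a heterogeneous pile of data.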
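
The slot-and-mask idea can be sketched in a few lines. The slot names, their order, and the 16-slot size below are hypothetical and chosen for readability; the paper's actual layout should be taken from the original text. The masked loss shows why padding does no harm: padded slots contribute nothing to the error.

```python
from typing import Dict, List, Tuple

# Hypothetical slot layout: names, order, and size are illustrative only.
UNIFIED_SLOTS: List[str] = (
    [f"right_arm_joint_{i}" for i in range(7)]
    + ["right_gripper"]
    + [f"left_arm_joint_{i}" for i in range(7)]
    + ["left_gripper"]
)
SLOT_INDEX: Dict[str, int] = {name: i for i, name in enumerate(UNIFIED_SLOTS)}


def to_unified(action: Dict[str, float]) -> Tuple[List[float], List[int]]:
    """Place a robot's named action values into the fixed slots.

    Returns (vector, mask); mask[i] is 1 where the robot supplied a value
    and 0 where the slot is padding.
    """
    vector = [0.0] * len(UNIFIED_SLOTS)
    mask = [0] * len(UNIFIED_SLOTS)
    for name, value in action.items():
        idx = SLOT_INDEX[name]  # an unknown slot name raises KeyError on purpose
        vector[idx] = float(value)
        mask[idx] = 1
    return vector, mask


def masked_mse(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over the slots the robot actually uses."""
    used = max(sum(mask), 1)
    return sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask)) / used


# A hypothetical 6-joint single arm with a gripper fills only 7 of the 16 slots.
single_arm = {f"right_arm_joint_{i}": 0.1 * i for i in range(6)}
single_arm["right_gripper"] = 1.0
vec, mask = to_unified(single_arm)
print(sum(mask), "of", len(mask), "slots used")
```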

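Two-stage training, just as a student takes general education before a major:

  • Pretraining (general education): 46 multi-robot datasets (subsets of OXE / RH20T / RoboSet and others); the task is "given the observation, restore a noised action to a reasonable action"
  • Fine-tuning (the major): continue training on the team's own **dual-arm dataset (the paper reports it at the scale of 6K+ episodes; check the original for the exact number)** so the model gets used to the feel and kinematics of the target hardware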
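
The "restore a noised action" objective used in both stages can be sketched as a single training example. Everything here is an assumption for illustration: the function names, the choice to have the model predict the clean chunk (predicting the noise is another common option), and the toy models. The real network is additionally conditioned on observations and the language instruction, which this sketch omits.

```python
import math
import random


def add_noise(clean_chunk, alpha_bar, rng):
    """Forward diffusion: blend the clean actions with Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean_chunk]
    noisy = [math.sqrt(alpha_bar) * a + math.sqrt(1.0 - alpha_bar) * e
             for a, e in zip(clean_chunk, noise)]
    return noisy, noise


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def training_loss(model, clean_chunk, alpha_bar, rng):
    """Corrupt the chunk, ask the model to restore it, and score the result."""
    noisy, _ = add_noise(clean_chunk, alpha_bar, rng)
    return mse(model(noisy, alpha_bar), clean_chunk)


clean = [0.2, -0.4, 0.1, 0.3]  # hypothetical flattened action chunk


def identity_model(noisy, alpha_bar):
    """Untrained stand-in that returns its noisy input unchanged."""
    return noisy


def perfect_model(noisy, alpha_bar):
    """Idealized stand-in that always recovers the clean chunk."""
    return clean


loss_untrained = training_loss(identity_model, clean, 0.5, random.Random(0))
loss_perfect = training_loss(perfect_model, clean, 0.5, random.Random(0))
print(loss_untrained > loss_perfect)  # True
```

Pretraining and fine-tuning would minimize the same kind of loss; under this sketch only the data source changes between the two stages.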

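At inference: you say "hand me the cup" → the model looks at the scene → runs a few denoising steps on the sketch pad → gets a short stretch of dual-arm actions → the robot executes it → the model looks again, denoises again, executes again (like re-checking the road every few seconds while driving; this is called receding horizon control).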
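
The look, plan, execute, look-again cycle can be written as a small control loop. This is a sketch under stated assumptions: the function names are hypothetical, the "robot" is a one-dimensional position, and a toy policy that plans straight toward a goal replaces the diffusion model.

```python
from typing import Callable, List, Tuple


def run_receding_horizon(
    policy: Callable[[float], List[float]],
    env_step: Callable[[float, float], float],
    obs: float,
    total_steps: int,
    execute_steps: int,
) -> Tuple[List[float], float, int]:
    """Plan a chunk, execute only its first `execute_steps` actions, then replan."""
    executed: List[float] = []
    replans = 0
    while len(executed) < total_steps:
        chunk = policy(obs)  # in the real system: a few denoising steps
        replans += 1
        for action in chunk[:execute_steps]:
            obs = env_step(obs, action)
            executed.append(action)
            if len(executed) >= total_steps:
                break
    return executed, obs, replans


GOAL = 1.0  # hypothetical 1-D goal position


def toy_policy(obs: float) -> List[float]:
    """Plan 4 equal steps that would reach GOAL from the current position."""
    step = (GOAL - obs) / 4.0
    return [step] * 4


def toy_env_step(obs: float, action: float) -> float:
    return obs + action


executed, final_obs, replans = run_receding_horizon(
    toy_policy, toy_env_step, obs=0.0, total_steps=6, execute_steps=2
)
print(final_obs, replans)  # 0.875 3
```

Executing only part of each chunk before replanning trades some smoothness for the chance to react to what the cameras see next; how many steps to execute per chunk is a tuning choice.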

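What the experiments do

These notes are based on the abstract, so specific numbers need to be checked against the paper. The known directions:

  • Real-robot dual-arm tasks: pouring water, folding clothes, handshakes and handovers, and household long-horizon manipulation, all scenarios that a single arm cannot do or does awkwardly
  • Zero-shot / few-shot generalization: tests whether the task still succeeds with unseen objects and unseen instruction combinations
  • Scaling experiments: likely compares RDT-1B against smaller RDT variants to verify that performance does rise with parameter count
  • Baselines: Octo, ACT (the classic dual-arm imitation learning method), and possibly a version of Diffusion Policy trained directly on dual-arm data
  • Ablations: with or without pretraining (using OXE or not), with or without the diffusion head (what happens with an MLP/MSE head instead), and whether the unified action space design is necessary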

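New terms you should know

  • bimanual manipulation: two robot arms cooperating on a task; the difficulty is coupled actions + scarce data
  • diffusion model as policy (diffusion policy): takes the denoising diffusion used in image generation and uses it as an action generator that maps observations to samples from an action distribution; well suited to modeling multi-modal continuous actions
  • DiT (Diffusion Transformer): an architecture that uses a Transformer instead of a U-Net as the diffusion backbone; it scales well, and RDT-1B follows a similar approach
  • action chunk: predicting the next N actions at once instead of one step, which reduces high-frequency jitter; a concept popularized by the ACT paper
  • foundation model for robotics: the paradigm of pretraining on large amounts of multi-task, multi-robot data and then fine-tuning downstream; the counterpart of a base model in LLMs
  • unified action space: maps the actions of different robot embodiments onto one shared set of dimensions so that mixed training becomes possible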

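How it relates to other papers

  • Upstream: Diffusion Policy (the foundation for diffusion as a policy), DiT (diffusion with a Transformer), ALOHA (dual-arm hardware platform)
  • Peers / competitors
    • Octo (2024): open-source cross-robot policy, smaller in scale, mainly single-arm
    • OpenVLA (2024): 7B parameters, VLM route, discrete action tokens, single-arm
    • π0 (Physical Intelligence): a 3B flow matching policy and the closest in spirit to RDT. Both follow "large model + continuous action generation + cross-robot pretraining", but π0 uses flow matching rather than diffusion
  • Downstream impact: a direct predecessor of later Chinese dual-arm foundation models (such as the team's own follow-up work) and of humanoid whole-body policies; it shows that "Chinese teams can also build robot policies at foundation-model scale"
  • Point of comparison: the key difference from RT-2 is that RT-2 compresses actions into text tokens for the VLM to output, while RDT-1B keeps actions continuous and models their distribution explicitly with diffusion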

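How I suggest reading it

  1. Start with the abstract + method figures (Fig. 1-2): work out "what the input is, what the output is, and what the backbone looks like", and compare against the "How it works" section of this note
  2. Jump to the unified action space section and read it closely: this is the design point with the most engineering value, and it is useful to reuse in your own cross-robot projects
  3. Look at the scaling curves and ablations in the experiments: check "whether 1B is really necessary and how large the marginal gain from pretraining is"
  4. Read π0 alongside it: with the two papers side by side, the engineering trade-offs between diffusion and flow matching for action generation become clear quickly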

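Why it is worth reading

  • Where the field stands: 2024 is the "GPT-2 moment" for robot foundation models, and RDT-1B / OpenVLA / π0 / Octo are among the most important works of that period; skipping it leaves a piece of the puzzle missing
  • Transferable methodology: diffusion policy + cross-robot pretraining + unified action space is a combination that can be applied directly to any setting with "continuous actions and heterogeneous data" (not only dual-arm)
  • A representative work from a China-based team: one of the most influential robot foundation model works from the Tsinghua community, and hard to avoid if you want to understand China's embodied AI direction
  • Difficulty sweet spot: ⭐⭐⭐⭐. It requires diffusion + Transformer + imitation learning, but none of them in depth; it is a paper that stitches the "foundation model" and "robotics" threads together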

Cite this note
BibTeX
@online{eai_rdt_1b_2026,
  title       = {(readable note) RDT-1B: Diffusion Foundation Model for Bimanual Manipulation},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/rdt-1b/}},
  organization = {Embodied AI Reading Station}
}
