Datasets & Benchmarks · Plate Nº 32

RH20T

6 min read · 2079 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

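In one sentence (TL;DR)

A robot dataset that records "feel" and "sound" on top of video: how hard to twist a bottle cap, the click when a part seats. 147 tasks, more than 110,000 sequences.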

What the setting is: an everyday analogy

When you teach someone housework, watching a video is not enough.

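  • Teaching an apprentice to twist a bottle cap: from hand gestures alone he cannot learn how much force to use. Too little and the cap does not move; too much and the thread strips.
  • Teaching a child to plug in a USB connector: inserted upside down, it jams. "Jammed" is something the hand feels; the eye only sees "it did not go in".
  • Teaching a beginner to close a bottle: the click is the signal that the cap is seated, but it is barely audible in an ordinary video.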

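Mainstream robot datasets (for example RT-1 and BridgeData) record only video and actions, which amounts to letting the apprentice watch a video without touching or listening. What the RH20T paper does is add "touch" and "hearing" to the dataset: it records two extra channels, force/torque (how much pressure and twisting load the hand experiences) and audio. It targets 147 tasks that require "feeling by hand", with more than 110,000 trajectories (a trajectory is the sequence of robot states from the start of an action to its end).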

Plate Nº I. RH20T, scene illustration: the real-world problem this paper addresses

How earlier work approached it

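  • RT-1 / RT-2 (Google): on the order of 130k real-robot episodes, all RGB video plus actions, no force sensing
  • BridgeData: cross-task and cross-environment, generalization-oriented, still vision-centric
  • RoboNet: an early collaborative multi-robot dataset, large in volume (about 15 million video frames from 7 robot platforms) but mostly undirected interaction and a single modality
  • Academic datasets (e.g. MIME, RoboTurk): usually focused on a single skill or a single robot, lacking multi-task, multimodal coverage
  • Force data: previously either collected only in simulation (with no sim-to-real transfer) or single-task and small-scale (for example a few hundred sequences of USB insertion)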

Shared weakness: real-robot multimodal data for contact-rich tasks (twisting, inserting, pressing, tearing) is badly lacking

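The paper's key ideas

Three core positions:

  1. Contact-rich tasks need force sensing and sound: vision cannot see "how much pressure" or "the click into place".
  2. One-shot imitation is the practical baseline: in real settings nobody wants to collect 1000 demonstrations for every new task.
  3. The data collection platform should be standardized and reproducible: not one lab's private setup, but something "any lab can build an identical copy of", which makes later community extension easier.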

Plate Nº II. RH20T, method illustration: the core pipeline

How it does it (method)

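Collection platform. Think of filming a multi-camera cooking documentary in a kitchen: a top-down shot above the cutting board, side shots from left and right, the cook's first-person view, plus a microphone hung beside the pot. The paper's workstation is similar: multiple RGB-D cameras film from different angles (so the arm does not block the view), a force/torque sensor (mounted at the end effector, at the base of the gripper), a microphone (recording contact sounds), and a tactile sensor (in some configurations). All sensors are time-synchronized to the millisecond level, and this is the key point. Why does synchronization matter so much? Because downstream models learn the causal order of "what was seen first, what was felt next, what was heard last". If video lags force sensing by half a second, the model concludes that "jamming came first and visible contact came after", and what it learns is wrong physics.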
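
To make the synchronization point concrete, here is a minimal sketch of aligning a fast sensor stream (such as force/torque) to slower camera frames by nearest timestamp. It is not RH20T's actual pipeline: the function name `nearest_sample_indices`, the 20 ms tolerance `max_gap`, and the sample timestamps are all assumptions made for illustration.

```python
from bisect import bisect_left


def nearest_sample_indices(frame_ts, sensor_ts, max_gap=0.02):
    """For each camera frame, return the index of the closest sensor sample.

    frame_ts:  camera frame timestamps in seconds
    sensor_ts: sorted sensor timestamps in seconds
    max_gap:   hypothetical tolerance; frames whose closest sample is
               farther away than this get None instead of an index
    """
    if not sensor_ts:
        raise ValueError("sensor_ts must not be empty")
    indices = []
    for t in frame_ts:
        right = bisect_left(sensor_ts, t)  # first sample with timestamp >= t
        # Only the samples just before and just after t can be the nearest.
        candidates = [i for i in (right - 1, right) if 0 <= i < len(sensor_ts)]
        best = min(candidates, key=lambda i: abs(sensor_ts[i] - t))
        indices.append(best if abs(sensor_ts[best] - t) <= max_gap else None)
    return indices


# Made-up timestamps: the last frame has no sensor sample nearby.
frames = [0.0, 0.1, 0.2, 0.5]
sensor = [0.00, 0.04, 0.09, 0.21]
print(nearest_sample_indices(frames, sensor))  # [0, 2, 3, None]
```

Returning None for frames without a close sample, rather than silently reusing a stale reading, keeps a half-second lag of the kind described above from being passed to the model as if it were aligned data.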

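Task design. Think of writing a recipe book where every dish "can only be finished by feel": the 147 tasks include unscrewing bottle caps, plugging and unplugging USB, tearing tape, pressing buttons, using tools, and two-arm collaboration. Each task has at least one contact-rich sub-action (sustained pressing against an object while applying force); making force sensing "actually useful" is a design goal rather than an incidental by-product.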

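Demonstration method. Think of a master guiding an apprentice hand over hand, except the tool is a VR controller. Data comes mainly from human teleoperation (an operator drives the robot in real time with a handheld or VR controller, like playing a high-precision game), plus some kinesthetic teaching (grabbing the robot's wrist and dragging it through the motion, like holding a child's hand to teach writing). Each trajectory simultaneously records proprioception (joint angles and velocities), vision, force/torque, audio, and the operator's instruction text.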
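
The per-trajectory record described above could be sketched as a small data structure. This is an illustrative layout only, not the on-disk format RH20T actually uses; the class names `Step` and `Trajectory` and every field name are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Step:
    """One synchronized time step (hypothetical layout)."""
    timestamp: float                    # seconds on the shared clock
    joint_positions: Sequence[float]    # proprioception: joint angles
    joint_velocities: Sequence[float]   # proprioception: joint velocities
    wrench: Sequence[float]             # [Fx, Fy, Fz, Tx, Ty, Tz]
    image_paths: Sequence[str]          # one image file per camera view

    def __post_init__(self):
        if len(self.wrench) != 6:
            raise ValueError("wrench must have 6 components")
        if len(self.joint_positions) != len(self.joint_velocities):
            raise ValueError("joint positions and velocities must match in length")


@dataclass
class Trajectory:
    """One demonstration: steps plus per-trajectory audio and instruction."""
    task_id: str
    instruction: str                    # the operator's instruction text
    audio_path: str                     # audio recorded for the whole trajectory
    steps: List[Step] = field(default_factory=list)

    def duration(self) -> float:
        """Elapsed time between the first and last step, in seconds."""
        if len(self.steps) < 2:
            return 0.0
        return self.steps[-1].timestamp - self.steps[0].timestamp
```

Keeping audio at the trajectory level and the wrench at the step level reflects their different sampling rates; a real loader would have to decide how to window the audio around each step.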

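Data scale and distribution. Finally, like a supermarket tasting counter, all the ingredients are laid out: the scale is 110,000+ trajectories, covering about 50 kinds of objects and several robot embodiments. Data loading, visualization, and baseline code are released alongside, mainly supporting two setups: imitation learning (the model copies human demonstrations) and one-shot imitation (generalizing from a single demonstration).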

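What the experiments do

Note: this section is based on an abstract-level understanding; specific numbers and comparison tables require reading the original paper.

Three main kinds of validation: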

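  1. Dataset statistics: task coverage, modality completeness, collection throughput (minutes per sequence).
  2. Baseline model evaluation: run several standard imitation learning methods on RH20T (behavior cloning (BC), Diffusion Policy, and so on) to show that adding force/audio does raise success rates on contact tasks. This is the dataset paper's self-validation that "our multimodality is useful".
  3. One-shot transfer: on tasks adjacent to ones already seen, give only 1 new demonstration and check whether the model generalizes. This is the storyline the paper most wants to emphasize.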
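
The one-shot protocol in item 3 can be sketched as a data split: train on all trajectories of the seen tasks, then for each held-out task hand the model a single support demonstration and evaluate on the rest. This is a generic sketch and not the paper's evaluation code; the function name `one_shot_split`, the task names, and the trajectory ids are all made up.

```python
import random


def one_shot_split(task_to_trajs, held_out_tasks, seed=0):
    """Split trajectories into a training pool and one-shot episodes.

    task_to_trajs:  dict mapping task name -> list of trajectory ids
    held_out_tasks: tasks that contribute exactly one demonstration
    Returns (train, episodes): train is a list of (task, traj_id) pairs;
    episodes maps each held-out task to its support demo and query set.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    held_out = set(held_out_tasks)
    unknown = held_out - set(task_to_trajs)
    if unknown:
        raise KeyError(f"unknown tasks: {sorted(unknown)}")

    train = [
        (task, traj)
        for task in sorted(task_to_trajs)
        if task not in held_out
        for traj in task_to_trajs[task]
    ]

    episodes = {}
    for task in sorted(held_out):
        trajs = list(task_to_trajs[task])
        if len(trajs) < 2:
            raise ValueError(f"{task} needs at least 2 trajectories")
        support = rng.choice(trajs)  # the single demonstration shown to the model
        query = [t for t in trajs if t != support]
        episodes[task] = {"support": support, "query": query}
    return train, episodes


# Hypothetical task names and trajectory ids.
data = {
    "unscrew_cap": ["a1", "a2", "a3"],
    "plug_usb": ["b1", "b2"],
    "press_button": ["c1", "c2", "c3"],
}
train, episodes = one_shot_split(data, ["plug_usb"])
print(len(train), len(episodes["plug_usb"]["query"]))  # 6 1
```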

A few new terms you should know

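  • Contact-rich task: a task such as unscrewing a cap or inserting a plug, where the robot is "pressed against the object" throughout, unlike pick-and-place ("grasp and move"), which needs almost no fine force control
  • Force-torque sensor: usually mounted at the end of the arm, with a 6-dimensional output (force along 3 axes plus torque about 3 axes); roughly the robot's "sense of pressure"
  • Teleoperation: a human controls the robot in real time through a VR controller or 3D mouse; currently the highest-quality source of demonstrations
  • Kinesthetic teaching: physically "dragging" the robot arm by hand to target poses while the robot records the trajectory; more intuitive than teleoperation but less precise
  • One-shot imitation: the goal is to give the model 1 demonstration of a new task and have it work on that task (versus traditional methods that need tens or hundreds)
  • Multimodal alignment: aligning the vision, force, audio, and action streams on the time axis to a common clock; the main engineering difficulty of multimodal datasets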
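
As a small illustration of the 6-dimensional force-torque reading defined above, the sketch below splits a wrench into force and torque magnitudes and marks the steps where contact begins. The 2.0 N threshold and the sample readings are arbitrary assumptions; a real sensor would also need bias and gravity compensation, which is omitted here.

```python
import math


def split_wrench(wrench):
    """Return (force magnitude, torque magnitude) of [Fx, Fy, Fz, Tx, Ty, Tz]."""
    if len(wrench) != 6:
        raise ValueError("wrench must have 6 components")
    fx, fy, fz, tx, ty, tz = wrench
    force = math.sqrt(fx * fx + fy * fy + fz * fz)
    torque = math.sqrt(tx * tx + ty * ty + tz * tz)
    return force, torque


def contact_onsets(wrenches, force_threshold=2.0):
    """Indices where force magnitude first rises above the threshold.

    force_threshold is a hypothetical value in newtons, chosen only
    for this example.
    """
    onsets = []
    in_contact = False
    for i, wrench in enumerate(wrenches):
        force, _ = split_wrench(wrench)
        touching = force > force_threshold
        if touching and not in_contact:
            onsets.append(i)  # transition from free motion to contact
        in_contact = touching
    return onsets


# Made-up readings: contact starts at step 1, releases at step 3, resumes at step 4.
readings = [
    [0, 0, 0.5, 0, 0, 0],
    [0, 0, 3.0, 0, 0, 0],
    [0, 0, 4.0, 0, 0, 0],
    [0, 0, 0.1, 0, 0, 0],
    [3, 4, 0.0, 0, 0, 0],
]
print(contact_onsets(readings))  # [1, 4]
```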

How it relates to other papers

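  • vs RT-1/RT-X (Google's large datasets): RT-X takes the "breadth" route, pooling data across labs; RH20T takes the "depth plus modality" route, with a single standardized platform but more complete modalities
  • vs DROID (a later large dataset, 2024): DROID is larger in scale and scene diversity, but RH20T remains a scarce resource for contact-rich data with force and audio modalities
  • vs Diffusion Policy (a learning method): methods like DP show that "with enough good data, complex manipulation can be learned"; RH20T supplies exactly those training ingredients: "good enough, plentiful, and with force sensing"
  • Downstream impact: many contact-rich manipulation papers (insertion, assembly, tool use) treat RH20T as a benchmark or pretraining source
  • Concurrent work: MimicGen (generating data through augmentation) takes the "a little real data plus a lot of synthetic data" route; RH20T "patiently collects on real robots". Both routes have followers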

How I suggest reading it

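  1. Start with the teaser figure and task list: skim the 147 task names to build an intuition of "so these are the scenarios covered"
  2. Look at the collection platform diagram: the hardware setup figure is the most worth studying, to understand how multimodal time synchronization is done
  3. Skim the baseline experiments: focus on the "with force vs without force" comparison table and confirm that the paper's core claim holds up
  4. If you want to use the data: reading the data loader documentation on GitHub or the official site is more practical than reading the paper; a dataset paper's engineering details usually live in the code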

Why it is worth reading

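  • If you work on contact-rich manipulation: this is one of the few public, large-scale real-robot datasets with force and audio, and nearly unavoidable as a resource
  • If you work on multimodal learning: RH20T provides time-synchronized four-modality data ("vision + force + audio + action"), good material for modality-fusion experiments
  • If you just want to understand the robot dataset ecosystem: comparing it side by side with RT-X, DROID, and BridgeData quickly builds a map of "which dataset solves which problem"
  • Historical position: RSS Workshop, 2023, at the stage when the "large model + large data" robotics paradigm was just taking off; one of the representative dataset papers tagged era=classic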

Cite this note
BibTeX
@online{eai_rh20t_2026,
  title       = {(readable note) RH20T},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/rh20t/}},
  organization = {Embodied AI Reading Station}
}
