Simulation & Sim2Real · Plate Nº 100

DexMV

6 min read · 2271 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public material; the full paper has not been read.

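In one sentence (TL;DR)

Teaching a robot hand to unscrew a bottle cap or pour water from scratch is too hard, so DexMV lets the algorithm learn from videos of human hands, "translating" human motion into demonstrations that a robot hand can practise against in simulation.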

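What is the setting: an everyday analogy

Say you want to learn to cook tomato scrambled eggs. The clumsiest way is to stand at the stove and experiment blindly, leaving the amount of salt to luck; the most expensive way is to hire a chef to coach you hands-on; the best-value way is to open Bilibili, search for "tomato scrambled eggs", watch a few dozen videos, and practise along.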

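Teaching a robot hand to unscrew a bottle cap comes down to the same three options:

  • Blind trial and error: let the robot hand flail around in simulation and reward it when it happens to get it right. A motion like unscrewing a cap is so complex that millions of attempts may never open it once.
  • Hire a chef to coach you: pay people to demonstrate over and over, wearing data gloves or using teleoperation (remote control). A single glove costs tens of thousands of yuan, and collection is exhausting.
  • Watch Bilibili videos: film a human hand unscrewing a cap with a phone and let the algorithm learn from the video. Video is everywhere and nearly free. This is DexMV's approach.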

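The one catch: a human hand has 5 fingers and 20-plus joint degrees of freedom (DoF), while the robot hand (the Adroit Hand used in the paper, with roughly 24 to 30 DoF depending on whether the arm is counted) is not built quite like a human hand. So "record and copy" does not work; a "translation" step is needed, known technically as retargeting.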

[Figure: Plate Nº I. DexMV scenario sketch: the real-world problem this paper addresses]

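How earlier work approached this

  • Teleoperation + behavior cloning: collect human hand data with a CyberGlove or VR controllers, then run imitation learning. A representative example is DAPG (Demo Augmented Policy Gradient, Rajeswaran et al., 2017), but data collection is expensive.
  • Pure RL from scratch: run PPO/SAC directly in Adroit or other dexterous-hand environments. Reward engineering is hard, sample efficiency is poor, and complex tasks (contact-rich, underactuated) are almost unlearnable.
  • Learning manipulation from single-view video: early work (such as Sermanet's TCN) mostly stayed with 2-finger grippers and simple pick-and-place, and did not reach multi-fingered dexterous hands.
  • Sim-to-real: much work applies sim-to-real domain randomization directly (OpenAI's 2019 Rubik's Cube work, for example), but that presupposes the skill can already be learned in simulation; DexMV is concerned with how to get it learned in simulation in the first place.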

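The paper's key idea

In one sentence: human manipulation video is a cheap, scalable source of dexterous-hand demonstrations, and the crux is "translating" it into demonstration trajectories that are executable in simulation.

Concretely, three things come bundled:

  1. A simulation platform (built on a MuJoCo / SAPIEN-style physics engine with the Adroit Hand) that defines a set of multi-fingered dexterous tasks (relocate / pour / place inside / open door and the like).
  2. A video → demonstration pipeline: hand pose estimation + object pose estimation + hand-object retargeting.
  3. A comparison of several demonstration-driven policy learning methods (behavior cloning, DAPG, SOIL, etc.), showing that video demonstrations reliably pull RL out of the trough where it "cannot learn".

From first principles: the fundamental bottleneck of dexterous manipulation is "exploration space too large + sparse rewards", and demonstrations are the most direct way to constrain exploration to a plausible manifold. Demonstrations therefore should not be gated by teleoperation hardware, and video is the cheapest option.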

[Figure: Plate Nº II. DexMV method sketch: the core pipeline]

How it works (method)

The whole pipeline resembles "ripping" Bilibili videos into practice tutorials for a robot hand, and it runs in four steps.

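Step 1: Video capture + pose estimation. Just as a camera app places keypoints on a face, the first job is to work out "where the hand is and where the bottle is" in the video. Record an ordinary RGB video on a phone; estimate hand pose with an off-the-shelf hand pose estimator (this generation typically uses the MANO model, a parametric 3D human hand template compressed with principal components); estimate the object's 6D pose with PVNet or a similar keypoint-based method. Each frame yields "3D hand joint coordinates + object pose". Note: this summary assumes a single monocular camera with no depth sensor, which would limit accuracy; confirm the actual capture setup against the paper.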
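To make the per-frame output concrete, here is a minimal sketch of a container for "3D hand joint coordinates + object pose". The class name `FrameEstimate`, the 21-keypoint layout (a common hand-keypoint convention), and the choice of a rotation matrix for orientation are all assumptions made for illustration; they are not taken from the DexMV code.

```python
from dataclasses import dataclass

import numpy as np

# Assumption: wrist + 4 keypoints per finger, a common hand-keypoint layout.
NUM_HAND_KEYPOINTS = 21


@dataclass
class FrameEstimate:
    """Hypothetical per-frame perception output: hand keypoints + 6D object pose."""

    hand_keypoints: np.ndarray   # shape (21, 3), camera frame
    object_position: np.ndarray  # shape (3,), camera frame
    object_rotation: np.ndarray  # shape (3, 3), rotation matrix, object -> camera

    def __post_init__(self):
        if self.hand_keypoints.shape != (NUM_HAND_KEYPOINTS, 3):
            raise ValueError("hand_keypoints must have shape (21, 3)")
        if self.object_position.shape != (3,):
            raise ValueError("object_position must have shape (3,)")
        if self.object_rotation.shape != (3, 3):
            raise ValueError("object_rotation must have shape (3, 3)")
        should_be_identity = self.object_rotation @ self.object_rotation.T
        if not np.allclose(should_be_identity, np.eye(3), atol=1e-6):
            raise ValueError("object_rotation must be orthonormal")

    def hand_in_object_frame(self) -> np.ndarray:
        """Express keypoints relative to the object: p_obj = R^T (p_cam - t)."""
        # With points stored as rows, R^T applied to each point is (P - t) @ R.
        return (self.hand_keypoints - self.object_position) @ self.object_rotation
```

Expressing the hand relative to the object is one plausible way to keep the hand-object relationship intact when the trajectory is later moved into a simulator whose camera does not exist.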

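Step 2: Hand retargeting. This is like translating an English recipe into Chinese: you cannot go word by word; the finished dish has to taste right. The human hand has 20-plus DoF and the robot hand roughly 24 to 30, so the joint counts and placements do not line up, and copying joint angles verbatim only produces bizarre poses. DexMV's approach is to pose an optimization problem: make the robot hand's fingertip positions and a few key joint directions match the corresponding points on the human hand as closely as possible. It does not matter that the joints themselves differ, as long as "where the fingertips touch" is right.

Hold on a second: what is an optimization problem? You hand the computer an objective (say, "minimize the distance between the robot fingertips and the human fingertips") and let it pick the joint angles that get closest to it, much like dragging parameters in Excel until some number is as small as possible.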
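The "pick joint angles that bring the fingertip closest to the human fingertip" idea can be sketched on a toy problem. The example below is not DexMV's retargeting: it assumes a hypothetical planar 3-link finger, made-up link lengths and joint limits, and plain projected gradient descent on the squared fingertip distance, purely to show the shape of the optimization.

```python
import numpy as np

# Hypothetical robot finger: three planar links (metres) and a flexion range (radians).
LINK_LENGTHS = np.array([0.05, 0.03, 0.02])
JOINT_LIMITS = (0.0, np.pi / 2)


def fingertip(q):
    """Forward kinematics: joint angles -> fingertip position (x, y)."""
    cum = np.cumsum(q)  # absolute angle of each link
    return np.array([np.sum(LINK_LENGTHS * np.cos(cum)),
                     np.sum(LINK_LENGTHS * np.sin(cum))])


def jacobian(q):
    """Derivative of the fingertip position with respect to each joint angle."""
    cum = np.cumsum(q)
    J = np.zeros((2, 3))
    for j in range(3):
        # Joint j rotates every link from j outward.
        J[0, j] = -np.sum(LINK_LENGTHS[j:] * np.sin(cum[j:]))
        J[1, j] = np.sum(LINK_LENGTHS[j:] * np.cos(cum[j:]))
    return J


def retarget_fingertip(target, q_init, steps=2000, lr=50.0):
    """Projected gradient descent on 0.5 * ||fingertip(q) - target||^2."""
    q = np.array(q_init, dtype=float)
    for _ in range(steps):
        error = fingertip(q) - target
        q -= lr * jacobian(q).T @ error   # gradient step
        q = np.clip(q, *JOINT_LIMITS)     # stay inside the joint limits
    return q


# Usage: pretend the "human fingertip" came from a pose with these angles.
human_tip = fingertip(np.array([0.4, 0.6, 0.3]))
q_robot = retarget_fingertip(human_tip, q_init=[0.2, 0.2, 0.2])
print("residual distance:", np.linalg.norm(fingertip(q_robot) - human_tip))
```

Note that the recovered angles need not equal the ones that generated the target: three joints reaching a 2D point leave one degree of freedom unconstrained, which is exactly why a real retargeting objective adds further terms such as key joint directions.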

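Step 3: "Replay" in simulation and use as demonstrations. This is like having an apprentice follow the master's recording once and correcting whatever is off. Feed the translated trajectory (s_t, a_t) into the simulator and check that it holds up physically (contacts often drift and need small corrections). Trajectories that run become the "teacher" for three student algorithms: BC (behavior cloning, the closest to copying homework: do whatever the teacher does), DAPG (copy homework while also practising on your own, using the demonstrations as a regularizer on RL), and SOIL (State-Only Imitation Learning, which looks only at which states the teacher passed through without copying the actions, a good fit given that joint torques are not visible in video).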
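The "copy homework while practising" behaviour of DAPG can be sketched as a gradient that adds a demonstration term to the usual policy gradient, with a weight that decays over iterations so the policy leans on the demonstrations early and on its own experience later. The sketch below follows the weighting lam0 * lam1**k * max advantage as this note understands it from the DAPG paper; the function names and the default values of `lam0` and `lam1` are illustrative assumptions, and the per-sample gradients are taken as precomputed arrays rather than produced by a real policy network.

```python
import numpy as np


def demo_weight(iteration, max_advantage, lam0=0.1, lam1=0.95):
    """Weight on demonstration samples; shrinks geometrically with the iteration."""
    return lam0 * (lam1 ** iteration) * max_advantage


def augmented_gradient(grad_logp_policy, advantages, grad_logp_demo, iteration):
    """Combine on-policy and demonstration terms into one gradient estimate.

    grad_logp_policy: (N, D) gradients of log pi(a|s) on on-policy samples
    advantages:       (N,)   advantage estimates for those samples
    grad_logp_demo:   (M, D) gradients of log pi(a|s) on demonstration pairs
    """
    weight = demo_weight(iteration, np.max(advantages))
    policy_term = (grad_logp_policy * advantages[:, None]).sum(axis=0)
    demo_term = weight * grad_logp_demo.sum(axis=0)
    return policy_term + demo_term
```

With `lam1` below 1 the demonstration term fades out, which matters for video-derived demonstrations in particular: noisy retargeted actions help exploration early on without capping final performance.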

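Step 4: Evaluation. On several tasks, compare success rate and completion time across three regimes: RL from scratch, RL + video demonstrations, and RL + teleoperation demonstrations. Direction of the conclusion: video demonstrations are not as clean as teleoperation but are far better than learning from scratch, and collection cost is an order of magnitude lower.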

What the experiments do

The experiments break down into several threads:

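  • Task set: a small set of dexterous manipulation tasks of increasing difficulty (defer to the paper for the exact count and names; typical ones are relocate ball / pour into mug / place inside / open door).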
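  • Demonstration source comparison: human video vs. teleoperation vs. no demonstrations, looking at how much each source lifts the final success rate.
  • Method comparison: BC / DAPG / SOIL / pure PPO, to see which algorithm best digests "noisy" data such as video demonstrations.
  • Ablations: the effect of retargeting quality, number of videos, and pose estimation error.

For concrete numbers (success rate percentages, number of episodes required), read the paper. Intuitively: video demonstrations come close to teleoperation on simple tasks; on complex tasks there is a gap, but they still clearly beat learning from scratch.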

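A few new terms worth knowing

  • Dexterous manipulation: contact-rich manipulation with a multi-fingered hand (not a 2-finger gripper), such as twisting, pinching, and turning.
  • Adroit Hand: a 24-30 DoF simulated dexterous hand model from UW / Vikash Kumar, the "standard testbed" for dexterous manipulation research.
  • Retargeting (motion retargeting): mapping the motion of one agent (a human hand) onto another (a robot hand); common in animation, motion capture, and robotics.
  • DAPG (Demo Augmented Policy Gradient): Rajeswaran et al., 2017. Trains with a mix of a BC loss on demonstrations and a policy gradient, with the demonstrations acting as a regularizer; a classic baseline in dexterous-hand research.
  • MANO: a parametric human hand model (PCA-based pose + shape), the de facto standard for 3D hand pose estimation.
  • State-Only Imitation Learning (SOIL): imitation from observation/state sequences only, with no action labels required, which fits the video setting well (joint torques are not visible in video).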
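One common way to imitate from states alone is to learn an inverse dynamics model from the agent's own transitions and use it to guess the actions behind a state-only demonstration. The sketch below illustrates that idea only; it is not the SOIL algorithm as published. The linear toy dynamics `A`, `B`, the function names, and the least-squares fit are all assumptions chosen so the example stays exact and runnable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dynamics s' = A s + B a, standing in for the simulator.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0, 0.5], [1.0, 0.0]])


def step(state, action):
    """Advance the toy system by one step."""
    return A @ state + B @ action


def fit_inverse_model(num_samples=500):
    """Fit a linear map from (s, s') to a using the agent's own random transitions."""
    states = rng.normal(size=(num_samples, 2))
    actions = rng.normal(size=(num_samples, 2))
    next_states = states @ A.T + actions @ B.T
    features = np.hstack([states, next_states])
    weights, *_ = np.linalg.lstsq(features, actions, rcond=None)
    return weights  # shape (4, 2)


def label_actions(demo_states, weights):
    """Infer the action between each pair of consecutive demonstration states."""
    features = np.hstack([demo_states[:-1], demo_states[1:]])
    return features @ weights
```

Once the state-only demonstration has inferred actions attached, it can be consumed by any method that expects (state, action) pairs, which is what makes this route attractive when the demonstrations come from video.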

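How it relates to other papers

  • Upstream / contemporaries: DAPG (the forefather of demonstration-driven RL), the Adroit benchmark (task definitions), HOPE / PVNet (hand and object pose estimation).
  • Concurrent, same direction: DIME and the state-only imitation line; also the earlier RoboNet idea (using large-scale real-world video).
  • Downstream / follow-ups: DexCap, DexMimicGen, and AnyTeleop, the "dexterous-hand data collection" branch, all further engineer the "video/mocap → simulated demonstration" pipeline; H2O / Hand2Robot-style work that turns human hand video directly into policies shares the same lineage.
  • Niche: DexMV is one of the founding works of the 2021-2022 wave of "learning demonstrations from video" for dexterous hands. Its value as a reference point is high; the method itself is no longer SOTA, but it defined the problem and the pipeline.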

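How I suggest reading it

  1. Start with Sections 1-2 (intro + related work) and the teaser figure to build intuition for why video beats teleoperation; 10 minutes is enough.
  2. Jump to the method section and focus on the retargeting optimization objective, the most concrete and instructive engineering detail in the paper; the pose estimation part matters less because it is an upstream module.
  3. In the experiments, read only the main table plus 1-2 ablations, and do not get bogged down in specific numbers; remembering the relative ordering of "video demonstrations vs. teleoperation vs. scratch" is enough.
  4. Read DexCap (2024) alongside it: DexCap takes this line to real robots and large-scale collection, and the comparison shows clearly how the field evolved over 3 years.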

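Why it is worth reading

  • Reference-point value: an early milestone on the "learn dexterous manipulation from human video" path with a dense citation network; after reading it, the context of later work such as DexCap / AnyTeleop / H2O falls into place immediately.
  • Transferability of the method: the retargeting optimization paradigm applies not only to hands but also to humanoids (HumanPlus, H1-2) and arm-hand coordination; learn it once, reuse it many times.
  • Intern-friendly: tasks, simulation, demonstrations, and imitation learning are all explained in a single paper, making it a rare "overview-style" introduction to dexterous manipulation.
  • Open-source ecosystem: DexMV open-sourced its simulation environments and demonstrations, so baselines can be run directly without building an environment from scratch.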


Cite this note
BibTeX
@online{eai_dexmv_2026,
  title       = {(readable note) DexMV},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2022 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/dexmv/}},
  organization = {Embodied AI Reading Station}
}
