Imitation Learning · Plate Nº 52

AnyTeleop

7 min read · 2280 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and publicly available material; the full paper has not been read.

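In one sentence (TL;DR)

Point an ordinary camera at your hand and a robot hand mimics your motion; switching to a different robot hand model requires no code rewrite.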

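What the setting looks like: an everyday analogy

Imagine you are on a video call with a friend and turn on a "cartoon filter": when you open your mouth, the puppy on screen opens its mouth too. Now imagine the filter is not for fun but is wired to a real robot hand in a distant lab: you grab at the air in front of the camera, and the robot hand actually picks up a cup. That is what teleoperation (remote control of a robot) sets out to do.

Why study it? For a robot to learn a motion such as "pick up a cup", someone first has to demonstrate it a few hundred times (this is demonstration data collection for imitation learning). The trouble is that earlier collection methods were picky: you either had to hold the robot's wrist and drag it around yourself (kinesthetic teaching), or wear a pair of data gloves costing tens of thousands of yuan. On top of that, switching robots meant collecting everything again, which is very wasteful.

AnyTeleop aims to be a "universal filter": one ordinary camera watches the human hand gesture, and the system can drive robot hands of any shape, whether three-, four-, or five-fingered, Allegro, Shadow, or Leap. It is cheap, not picky about hardware, and the collected data can be reused on other robots.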

Plate Nº I. AnyTeleop, scene illustration: the real-world problem this paper addresses

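How people did it before

  • Hardware-specific teleoperation: each lab builds its own setup (CyberGlove + Vive tracker + one particular robot hand); it runs beautifully in the paper, but another lab has to rebuild it from scratch
  • VR controllers: grasping is done by pulling a trigger, so fine finger motion is lost; pose accuracy depends on how the VR base stations are laid out
  • Kinesthetic teaching: the robot is dragged around like a puppet; this does not suit soft or dexterous hands, and data can only be collected on the one machine being taught
  • Motion-capture systems (OptiTrack and the like): accurate but expensive, require markers, and are not portable
  • A handful of vision-only hand-tracking demos: they can track the hand but never connect through to arbitrary robot hands; the retargeting is hard-coded for one model

The shared problem: demonstration data is tied to the hardware, so changing robots means re-collecting data, and the "data reusability" of imitation learning is essentially zero.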

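The paper's key idea

Split teleoperation into three decoupled layers, each as hardware-agnostic as possible:

  1. Hand-tracking layer: an RGB-only (or RGBD) camera → 3D positions of 21 hand joints. The model is swappable.
  2. Motion-retargeting layer: maps human hand joints into the joint space of the target robot hand using an optimizer rather than hard-coded rules. Switching robot hands only requires a new URDF plus a few fingertip correspondences.
  3. Arm-control layer: treats the human wrist's 6D pose as the end-effector target and follows it with the arm's own IK / controller. Switching arms only requires a new URDF.

Core insight: demonstration data should be task-level (the trajectory of picking up a cup), not hardware-level (a time series of 17 joint angles for one specific model). Once hardware is abstracted into replaceable modules, a trajectory collected through vision-based teleoperation can, in principle, be retargeted and replayed on any robot hand.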
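To make the decoupling concrete, here is a minimal Python sketch of three swappable layers wired into one loop. It is an illustration only, not the paper's code: the class names, the stub functions, and the joint counts (16, 24, 7) are all hypothetical placeholders.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple

Point3 = Tuple[float, float, float]
Pose6D = Tuple[float, ...]  # x, y, z, roll, pitch, yaw


@dataclass(frozen=True)
class HandObservation:
    keypoints: List[Point3]  # 21 hand joints from the tracking layer
    wrist_pose: Pose6D       # 6D wrist pose, used as the arm target


@dataclass(frozen=True)
class RobotCommand:
    hand_joints: List[float]
    arm_joints: List[float]


@dataclass(frozen=True)
class TeleopPipeline:
    track: Callable[[object], HandObservation]           # layer 1
    retarget: Callable[[Sequence[Point3]], List[float]]  # layer 2
    solve_arm: Callable[[Pose6D], List[float]]           # layer 3

    def step(self, frame: object) -> RobotCommand:
        obs = self.track(frame)
        return RobotCommand(self.retarget(obs.keypoints),
                            self.solve_arm(obs.wrist_pose))


# Hypothetical stubs that only return zeros, so the sketch runs end to end.
def stub_tracker(frame: object) -> HandObservation:
    return HandObservation([(0.0, 0.0, 0.0)] * 21, (0.0,) * 6)


def stub_retarget_16(keypoints: Sequence[Point3]) -> List[float]:
    return [0.0] * 16  # pretend target hand with 16 joints


def stub_retarget_24(keypoints: Sequence[Point3]) -> List[float]:
    return [0.0] * 24  # pretend target hand with 24 joints


def stub_arm_7(wrist_pose: Pose6D) -> List[float]:
    return [0.0] * 7   # pretend 7-joint arm


pipeline = TeleopPipeline(stub_tracker, stub_retarget_16, stub_arm_7)
# Switching robot hands touches only the retargeting layer.
other_hand = replace(pipeline, retarget=stub_retarget_24)
```

Calling `pipeline.step(frame)` and `other_hand.step(frame)` on the same frame yields commands for two different hands while the tracker and arm solver stay untouched, which is the property the three-layer split is after.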

Plate Nº II. AnyTeleop, method illustration: the core pipeline

How it works (method)

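Hand tracking. Think of this step as a "posture-recognition app at the gym": the camera watches you and outputs 3D coordinates for 21 joints. This layer of AnyTeleop does the same thing, only for the hand rather than the whole body. The system supports both ordinary RGB cameras (using off-the-shelf hand-pose estimators in the style of MediaPipe, Google's hand-tracking library, or FrankMocap) and RGBD (depth-sensing) cameras. The RGB path detects 2D keypoints and lifts them to 3D; the RGBD path reads the point cloud directly, which gives a more stable wrist estimate. The layer is plug-and-play: whichever tracking model works best can be swapped in.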
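For the RGBD path, one standard way to turn a 2D keypoint plus a depth reading into a 3D point is pinhole back-projection. The sketch below shows that textbook formula; it is an assumption about how such a step could look, not the paper's implementation, and the intrinsics in the usage line are made-up values.

```python
from typing import Tuple


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float
                ) -> Tuple[float, float, float]:
    """Map pixel (u, v) with depth in meters to a 3D point in the camera frame.

    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical intrinsics for a 640x480 depth camera.
wrist_xyz = backproject(u=320.0, v=240.0, depth_m=0.5,
                        fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

A pixel at the principal point maps straight onto the optical axis, so `wrist_xyz` here lies at x = 0, y = 0, half a meter in front of the camera.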

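Motion retargeting (translating a motion into the language of a different body). This step is like translation: you speak Chinese, and you need someone who only understands Japanese to react the same way. A human hand has 5 fingers and 26 joints, while a robot hand may have only 4 fingers and 16 joints, with different length proportions throughout, so copying joint angles directly is bound to go wrong.

Hold on, let's slow down: how exactly is "translating the motion" computed?

AnyTeleop writes it as an optimization problem (finding the best solution subject to constraints). For every frame, the solver asks: "How should the robot hand's joints rotate so that its fingertip positions match my human fingertip positions as closely as possible, without pushing any joint past its range of motion, without the fingers colliding with each other, and while keeping the motion smooth and free of jitter?" When switching to a new robot hand, you only need to mark in the URDF (a file that describes the robot's structure) "this part is the thumb tip, this one is the index fingertip", and the solver takes over automatically.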
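The per-frame question above can be sketched as a tiny optimization on a toy planar finger with two joints. This is a simplified illustration, not the paper's solver: the link lengths, joint limits, step size, and weights are hypothetical, the self-collision term is omitted, and plain projected gradient descent stands in for whatever optimizer the authors use.

```python
import math
from typing import List, Sequence, Tuple

LINKS = (0.05, 0.04)               # hypothetical link lengths (m)
LIMITS = ((0.0, 1.6), (0.0, 1.6))  # hypothetical joint limits (rad)


def fingertip(q: Sequence[float]) -> Tuple[float, float]:
    """Forward kinematics of the toy two-joint planar finger."""
    a, b = q[0], q[0] + q[1]
    return (LINKS[0] * math.cos(a) + LINKS[1] * math.cos(b),
            LINKS[0] * math.sin(a) + LINKS[1] * math.sin(b))


def retarget(target: Tuple[float, float], q_prev: Sequence[float],
             smooth_weight: float = 1e-6, steps: int = 800,
             lr: float = 40.0) -> List[float]:
    """Minimize |fingertip(q) - target|^2 + w * |q - q_prev|^2 within limits."""
    q = list(q_prev)
    for _ in range(steps):
        x, y = fingertip(q)
        ex, ey = x - target[0], y - target[1]
        b = q[0] + q[1]
        # Analytic gradient of the tracking term plus the smoothness term.
        g0 = 2.0 * (ex * -y + ey * x) + 2.0 * smooth_weight * (q[0] - q_prev[0])
        g1 = (2.0 * LINKS[1] * (-ex * math.sin(b) + ey * math.cos(b))
              + 2.0 * smooth_weight * (q[1] - q_prev[1]))
        # Gradient step, then project back into the joint limits.
        q = [min(max(qi - lr * gi, lo), hi)
             for qi, gi, (lo, hi) in zip(q, (g0, g1), LIMITS)]
    return q


# Target fingertip position, assumed already expressed in the robot hand's frame.
goal = fingertip([0.6, 0.9])
solution = retarget(goal, q_prev=[0.3, 0.5])
```

The tracking term pulls the robot fingertip toward the human fingertip, the smoothness term ties each frame to the previous solution, and the clipping step enforces the joint limits. A real hand repeats this over several fingertips at once.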

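Arm control. This is the simplest step, like a driver following GPS navigation: the position and orientation of the human wrist in space (its 6D pose) is the destination, and the arm gets there with its own standard IK (inverse kinematics: working backward from the target to how each joint should rotate). The system wraps interfaces for common arms such as Franka, UR5, and xArm, so switching arms amounts to switching a driver.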
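To show what "working backward from the target" means, here is the closed-form IK for a toy two-link planar arm. It is only a teaching sketch: real arms such as the ones named above have six or seven joints, track a full 6D pose, and typically rely on numerical solvers. The function names and link lengths are hypothetical.

```python
import math
from typing import Tuple


def two_link_fk(q1: float, q2: float, l1: float, l2: float) -> Tuple[float, float]:
    """End-effector position of a planar arm with link lengths l1, l2."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))


def two_link_ik(x: float, y: float, l1: float, l2: float) -> Tuple[float, float]:
    """Joint angles that place the end effector at (x, y).

    Returns the solution with q2 in [0, pi]; raises ValueError if unreachable.
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2),
                                       l1 + l2 * math.cos(q2))
    return q1, q2


# Round trip: solve for a target, then confirm forward kinematics reaches it.
q1, q2 = two_link_ik(0.5, 0.3, l1=0.4, l2=0.3)
reached = two_link_fk(q1, q2, l1=0.4, l2=0.3)
```

Feeding the solved angles back through forward kinematics returns the requested target, which is the basic sanity check any IK layer should pass before it is trusted on hardware.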

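Unified simulation and real hardware. The same code runs both in simulators (SAPIEN, Isaac, and the like) and on real robots, much as a game developer tunes things in the engine before shipping to real hardware. This engineering is what lets the system claim the "general" label.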

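What the experiments do

The experiments aim not to chase SOTA but to show that "general" is real:

  • Multiple robot hands: run Allegro, Shadow, Schunk SVH, and other dexterous hands under the same system on the same grasping / manipulation tasks and compare success rates
  • Multiple robot arms: run the same tasks on different arms such as Franka, UR, and xArm
  • Multiple tasks: dexterous manipulation such as grasping, pouring water, unscrewing a bottle cap, and pinching small objects
  • Data transferability: take a trajectory collected with robot hand A, retarget and replay it on robot hand B, and check how well the task is completed
  • Tracking-input comparison: the effect of a single RGB camera vs RGBD on success rate and stability

The specific success-rate numbers require reading the paper itself, but its narrative structure is "swapping any module does not break the system, so the system really is general", rather than a single-point performance breakthrough.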

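A few new terms you should know

  • Teleoperation: a human controls a robot remotely. Here it specifically means "the human performs a motion → the robot follows", used to collect demonstration data for imitation learning
  • Retargeting: mapping a motion from one body (a human hand) onto a differently structured one (a robot hand), where joint count, proportions, and morphology may all differ. A common term in the animation industry
  • Dexterous manipulation: fine tasks performed with multi-fingered hands (twisting, pinching, pen spinning), as opposed to simple grasping with a two-finger gripper
  • Kinesthetic teaching: collecting demonstrations by physically dragging the robot by hand; a contact-based method
  • URDF (Unified Robot Description Format): the XML file used in ROS to describe a robot's structure, recording links, joints, and joint limits
  • IK (Inverse Kinematics): given a target pose for the end effector, solve for how far each joint should rotate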

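How it relates to other papers

  • Contemporary competitors among teleoperation / data-collection systems such as DexCap, HumanPlus, ALOHA, and GELLO: each makes different trade-offs. ALOHA uses bimanual leader-follower arms, GELLO uses a 3D-printed kinematic replica of the arm as the controller, and AnyTeleop is purely vision-based. AnyTeleop's selling point is the lowest hardware barrier
  • Hand-tracking work such as MediaPipe Hands, FrankMocap, and HaMeR: these are upstream modules for AnyTeleop
  • Downstream: any paper doing imitation learning with dexterous hands (DexMV, DexPoint, Diffusion Policy on hands) can use AnyTeleop as its data-collection front end
  • Kindred ideas: it shares the "cross-embodiment data sharing" line of thinking with RoboCasa / Open X-Embodiment. Those efforts address unifying data formats, while AnyTeleop addresses hardware independence on the data-collection side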

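How I suggest reading it

  1. Watch the demo video first (available on the project page yzqin.github.io/anyteleop); ten seconds is enough to get the intuition of "wave your hand → the robot hand moves", which is faster than reading the abstract
  2. Jump to the retargeting subsection of Method: this is the part with the most engineering substance. Work out what the optimization objective and constraints are, since they determine whether it generalizes to the robot hand you have
  3. Skim the experiment tables: focus on the "swap robot hand" and "swap robot arm" comparisons to verify the "general" label
  4. If you want to reproduce the data collection: check the hardware list in the GitHub repo's README and confirm that your camera + robot hand combination is supported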

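Why it is worth reading

  • Value as an engineering paradigm: it turns teleoperation from an isolated demo into reusable infrastructure, similar to what ROS once did for robot control. The decoupled system design ages better than any single-point algorithm
  • Lower barrier to entry: if you want to collect your own dexterous-hand dataset, AnyTeleop is one of the cheapest starting points available (one camera + one robot hand + open-source code), with no need for VR or an exoskeleton
  • Early practice of transferable data: embodied AI is now racing on "cross-embodiment learning", and AnyTeleop already abstracted the hardware on the collection side, an approach that was still relatively fresh in 2023
  • Transferable lesson: the paper leaves you with a plain but deep takeaway, namely that decoupling data from hardware gives more leverage than adapting models to hardware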

Cite this note
BibTeX
@online{eai_anyteleop_2026,
  title       = {(readable note) AnyTeleop},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/anyteleop/}},
  organization = {Embodied AI Reading Station}
}
