Imitation Learning · Plate Nº 60

Mobile ALOHA

6 min read · 2,235 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; I have not read the full paper.

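In one sentence (TL;DR)

Put a tabletop robot on a small wheeled base, have a person guide it hands-on through household chores (cooking shrimp, wiping up spills, rinsing a pan), and it can learn each skill from only 50 demonstrations.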

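What's the setting: an everyday analogy

Imagine teaching a newly hired housekeeper your family's sautéed shrimp. You wouldn't hand her a cookbook and walk away; you'd stand beside her and guide her hands as she chops the scallions, holds the spatula, and adjusts the flame. After a few rounds she can do it on her own.

Robots learn chores the same way. In Mobile ALOHA the "hands-on guidance" is teleoperation (a person pulls the strings from behind while the robot performs up front): everything the operator does (the joint angles of both arms plus the base velocity) is recorded at 50 Hz as a timestamped "action recording". Imitation learning feeds these recordings to a neural network so that it learns "given this image, how should I move next".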
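As a rough illustration of what such a recording might look like in code, the sketch below stores one demonstration as a list of timestamped steps and turns it into (observation, action) training pairs. The `Step` fields, the helper names, and the 10 Hz toy rate are assumptions made for readability, not the project's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One recorded tick of a teleoperated demonstration (hypothetical layout)."""
    t: float                    # timestamp in seconds
    image_id: str               # stand-in for the camera frames seen at this tick
    arm_joints: List[float]     # commanded values for both arms
    base_velocity: List[float]  # [linear, angular] velocity of the base


def record_demo(n_steps: int, hz: float) -> List[Step]:
    """Build a dummy demonstration sampled at a fixed rate."""
    return [
        Step(t=i / hz, image_id=f"frame_{i:04d}",
             arm_joints=[0.0] * 14, base_velocity=[0.0, 0.0])
        for i in range(n_steps)
    ]


def to_training_pairs(demo: List[Step]) -> List[Tuple[str, List[float]]]:
    """Pair what the robot saw with what the operator commanded at that tick."""
    return [(s.image_id, s.arm_joints + s.base_velocity) for s in demo]


demo = record_demo(n_steps=5, hz=10.0)  # toy rate, not the robot's
pairs = to_training_pairs(demo)
print(len(pairs), demo[-1].t)
```

Each pair is one supervised example: the network is trained to output the recorded action when shown the matching observation.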

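The original ALOHA could only do tabletop tasks (slotting a battery, opening a zip-lock bag) because it had no legs: the pan is on the stove and it can't reach it. Mobile ALOHA's core move is to bolt it onto a small wheeled base, expanding the task space from "30 cm of tabletop" to "the whole kitchen".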

Plate Nº I. Mobile ALOHA, scene sketch: the real-world problem this paper sets out to solve

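Prior approaches

  • Mostly tabletop manipulation: the original ALOHA, RT-1, and Diffusion Policy mostly handle assembly and grasping on a fixed tabletop, with no movement around the home
  • Mobile manipulation split in two: traditional systems separate "navigation" (SLAM/planning) from "manipulation" (grasping/assembly), so the robot does not manipulate while moving or move while manipulating, which makes "do it while walking" tasks such as cooking hard
  • Data is expensive and scarce: real household demonstrations need a dedicated teleoperator, the hardware often costs more than $200,000, and data volume stays low
  • Learned policies are brittle: policies learned from a handful of demonstrations tend to fall apart once they leave the demonstrated scene, and they generalize poorly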

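The paper's key idea

Like assembling IKEA furniture, it screws three "existing parts" together:

  1. Affordable hardware: drop what's expensive, keep what works. An AgileX Tracer wheeled base is mounted under the original ALOHA (two 6-DoF arms with a leader-follower teleoperation setup). The operator is tethered to the rear of the base, as if pushing a stroller, and moves the cart with their waist while both hands drive the two leader arms. The full hardware budget comes to about $32,000 (comparable platforms often cost more than $200,000), and the complete design is open-sourced so academic labs can reproduce it
  2. Whole-body teleoperation: one person plays every role at once. The operator's body motion maps to base velocity and the operator's hands map to the arm joints. Everything is recorded as one 16-dimensional action vector (2 arms × 7, i.e. six joints plus a gripper per arm, plus 2 base velocities), and the policy is trained to predict in that same unified action space
  3. Co-training (training on new and old data mixed together): like a new cook learning a new dish while reviewing the basics. 50 demonstrations of a new task alone do not yield a stable policy, so the paper mixes the existing static ALOHA dataset (a large body of tabletop bimanual data) with the new mobile data, reusing the bimanual manipulation prior learned from tabletop data to raise success rates on the mobile tasks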
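One way to picture co-training is as a batch sampler that draws from both datasets. The sketch below assumes equal sampling probability by default and pads static-arm actions with zeros for the base so both sources share one action size. `sample_cotraining_batch`, `pad_static_action`, and the toy datasets are hypothetical names and data for illustration only; the paper's exact recipe should be checked against the original.

```python
import random
from typing import List, Optional, Tuple


def pad_static_action(arm_action: List[float], base_dims: int = 2) -> List[float]:
    """Static data has no base motion, so append zeros for the base (assumption)."""
    return list(arm_action) + [0.0] * base_dims


def sample_cotraining_batch(
    mobile: List[List[float]],
    static: List[List[float]],
    batch_size: int,
    p_mobile: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[Tuple[str, List[float]]]:
    """Draw each example from the mobile set with probability p_mobile, else static."""
    if rng is None:
        rng = random.Random()
    batch = []
    for _ in range(batch_size):
        if rng.random() < p_mobile:
            batch.append(("mobile", list(rng.choice(mobile))))
        else:
            batch.append(("static", pad_static_action(rng.choice(static))))
    return batch


# Toy action labels: mobile actions include the base, static ones do not.
mobile = [[0.1] * 16 for _ in range(3)]
static = [[0.2] * 14 for _ in range(3)]
batch = sample_cotraining_batch(mobile, static, batch_size=8, rng=random.Random(0))
print([source for source, _ in batch])
```

Mixing at the batch level means every gradient step sees both kinds of data, so the plentiful tabletop examples keep shaping the arm behavior while the scarce mobile examples teach the base.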
Plate Nº II. Mobile ALOHA, method sketch: the core pipeline

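How it works (method)

Hardware (like pushing a stroller): the base follows passively. The operator is tethered behind it and pushes it along from the waist, backdriving the wheels, while the base records its linear and angular velocity as the action. The upside is that the operator's own sense of movement transfers directly to the base, with no joystick to learn. The arms keep ALOHA's leader-follower design: the "leader arm" in the operator's hands and the "follower arm" on the robot share the same joint layout, so when the operator moves a joint, the follower tracks it in real time, joint by joint.

Data collection (like taping a variety show): each task gets roughly 50 demonstrations (check the paper for exact counts). The action space is 16-dimensional: 7 + 7 for the two arms plus the base's linear velocity vx and angular velocity ω, so the arms take 14 dimensions and the base the remaining 2. Vision comes from 3 RGB cameras: left wrist, right wrist, and one forward-facing camera, roughly a chef's-eye view plus a wide director's shot.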
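To make the layout of the action vector concrete, here is a minimal sketch that concatenates the two arm commands with the two base velocities and splits them back apart. The function names and the ordering (left arm, right arm, then base) are assumptions for illustration; the ordering in the released code may differ.

```python
from typing import List, Sequence, Tuple

ARM_DIMS = 7   # per arm, as listed above (7 + 7)
BASE_DIMS = 2  # linear velocity vx and angular velocity omega


def build_action(left_arm: Sequence[float], right_arm: Sequence[float],
                 vx: float, omega: float) -> List[float]:
    """Concatenate both arm commands and the base velocities into one vector."""
    if len(left_arm) != ARM_DIMS or len(right_arm) != ARM_DIMS:
        raise ValueError(f"each arm command needs {ARM_DIMS} values")
    return [*left_arm, *right_arm, vx, omega]


def split_action(action: Sequence[float]) -> Tuple[List[float], List[float], Tuple[float, float]]:
    """Inverse of build_action: recover left arm, right arm, and base parts."""
    expected = 2 * ARM_DIMS + BASE_DIMS
    if len(action) != expected:
        raise ValueError(f"expected {expected} values, got {len(action)}")
    left = list(action[:ARM_DIMS])
    right = list(action[ARM_DIMS:2 * ARM_DIMS])
    vx, omega = action[2 * ARM_DIMS:]
    return left, right, (vx, omega)


action = build_action([0.0] * 7, [0.1] * 7, vx=0.2, omega=-0.05)
print(len(action))  # 2 * 7 + 2 entries
```

Keeping arms and base in a single vector is what lets one policy output coordinated whole-body motion instead of separate arm and navigation commands.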

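Policy learning (a three-way apprentice contest): the authors pit three mainstream imitation learning algorithms against each other: ACT (Action Chunking with Transformers, the method from the original ALOHA), Diffusion Policy, and VINN. Each is run in two settings, "new data only" and "co-training (new data + static ALOHA data)". The shared conclusion is that co-training generally lifts success rates from "barely usable" to "close to practical".

Hold on, what is ACT? In short, it predicts several future actions at once (an action chunk) instead of one step at a time, which reduces the "slightly off after two steps, far off after ten" kind of error.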
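The chunking idea can be sketched as a control loop that asks the policy for several actions at once and executes them before asking again. `ToyEnv`, `toy_policy`, and `rollout_with_chunks` are hypothetical stand-ins on a one-dimensional toy problem; this is not ACT itself, which uses a learned Transformer policy.

```python
from typing import Callable, List


class ToyEnv:
    """A 1-D world: the state is a position, an action is a displacement."""
    def __init__(self) -> None:
        self.position = 0.0

    def observe(self) -> float:
        return self.position

    def act(self, action: float) -> None:
        self.position += action


def toy_policy(obs: float, k: int, target: float = 10.0) -> List[float]:
    """Plan k steps of at most 1 unit each toward the target, starting from obs."""
    actions, pos = [], obs
    for _ in range(k):
        step = max(-1.0, min(1.0, target - pos))
        actions.append(step)
        pos += step
    return actions


def rollout_with_chunks(env: ToyEnv, policy: Callable[[float, int], List[float]],
                        horizon: int, chunk_size: int) -> int:
    """Run for `horizon` steps, querying the policy once per chunk.

    Returns how many times the policy was queried.
    """
    queries, t = 0, 0
    while t < horizon:
        chunk = policy(env.observe(), chunk_size)
        queries += 1
        for action in chunk[: horizon - t]:  # do not run past the horizon
            env.act(action)
            t += 1
    return queries


env = ToyEnv()
queries = rollout_with_chunks(env, toy_policy, horizon=10, chunk_size=4)
print(queries, env.position)
```

With a chunk size of 1 the same rollout would query the policy at every step; larger chunks mean fewer decision points at which a small prediction error can push the robot off course.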

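Task list (seven chores): 7 real-world, long-horizon household tasks: cooking shrimp (grab the pan, pour oil, stir-fry), wiping up spilled wine, rinsing a pan, pushing chairs back into place, High Five (high-fiving a person), opening a cabinet to store a pot, and calling an elevator. Each task runs on the order of minutes and requires moving around the room plus multi-step manipulation.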

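What the experiments test

They mainly answer three questions:

  1. Does co-training help: compare "new data only" against "co-training" on the 7 tasks and see how much success rates improve
  2. Does the algorithm matter: which of ACT, Diffusion Policy, and VINN suits this kind of long-horizon mobile manipulation best
  3. Is a small dataset enough: can 50 demonstrations really support a usable policy

Exact success rates need to come from the paper's tables. The abstract states that, with 50 demonstrations per task, co-training can increase success rates by up to 90%, and the public project materials indicate that most tasks exceed 80% success with co-training. From an engineering standpoint the failure-mode analysis is more worth reading: which steps break most often (typically the moment of grasping, or visual drift while the base turns).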

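New terms you should know

  • Teleoperation: a person drives the robot and the robot faithfully reproduces the person's motion. The opposite of "autonomous", and a common way to collect data
  • Imitation learning: a neural network learns from recordings of "state → action" pairs, without the reward function that reinforcement learning needs
  • ACT (Action Chunking with Transformers): the method proposed in the original ALOHA paper. It predicts the next k actions at once (an action chunk), modeled with a Transformer and a CVAE, which mitigates the "compounding error" typical of imitation learning
  • Co-training: mixing data from different distributions to train a single model, so that a data-poor task borrows priors from data-rich ones
  • Whole-body control: a traditional robotics term for coordinating several actuators at once (here, both arms plus the base) toward a single goal, rather than scheduling them in stages
  • Compounding error: a long-standing problem in imitation learning. Every prediction carries a small error; after a few dozen steps the robot has drifted outside the training distribution and cannot recover. Action chunking is a common mitigation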
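A toy calculation makes the compounding effect visible. The recurrence below is an assumption chosen for illustration (each step adds a small error `eps` and amplifies the existing deviation by a factor `1 + gain`); it is not a model taken from the paper.

```python
def drift_after(steps: int, eps: float, gain: float) -> float:
    """Deviation after `steps` steps under the toy recurrence
    deviation <- deviation * (1 + gain) + eps.
    """
    deviation = 0.0
    for _ in range(steps):
        deviation = deviation * (1.0 + gain) + eps
    return deviation


independent = drift_after(50, eps=0.01, gain=0.0)  # errors only add up
compounding = drift_after(50, eps=0.01, gain=0.1)  # errors feed on themselves
print(f"{independent:.2f} vs {compounding:.2f}")
```

With `gain` at zero the deviation grows linearly with the number of steps; any positive `gain` makes it grow geometrically, which is the intuition behind reducing the number of decision points through action chunking.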

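How it relates to other papers

  • Builds on ALOHA (same author group, 2023): the hardware and the ACT algorithm come straight from ALOHA. You can think of Mobile ALOHA as "ALOHA + a cart + the co-training trick"
  • Contrast with RT-2 / Open X-Embodiment: those lines of work rely on "very large data + large models" for generalization; Mobile ALOHA goes the other way, using "a small amount of high-quality teleoperation data + classic imitation learning" for long-horizon tasks
  • Extends Diffusion Policy: it appears as one of the baselines. Mobile ALOHA's experiments find ACT more stable on these chunked, long-horizon tasks, though Diffusion Policy does better on some tasks
  • Influence on later work: from the second half of 2024, HumanPlus, ALOHA Unleashed, and various "low-cost ALOHA replica" projects widely reuse Mobile ALOHA's hardware design and co-training idea
  • Versus the humanoid route: Mobile ALOHA is "wheels + two arms", while HumanPlus / Unitree H1 are bipedal. It is a contest of routes: wheeled is steadier and cheaper, bipedal is more general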

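How I suggest reading it

  1. Start with the videos on the project page, mobile-aloha.github.io. Watch all the demos in about 30 minutes to build intuition for task difficulty (the shrimp really gets cooked, the table really gets wiped)
  2. Read §3 (hardware) and §4 (co-training formulation) of the paper, which hold the new material; §2 (ACT) is already covered thoroughly in the original ALOHA paper
  3. Jump to the experiment tables and focus on the per-task success-rate gap between "new data only" and "co-training" to get a feel for co-training's marginal gain
  4. Pick a failure case you find interesting (the appendix or the videos include failure clips) and ask yourself: if you had to improve it, would you start from hardware, data, or algorithm?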

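Why it's worth reading

  • Significance for the field: after RT-2, many assumed that "general-purpose robots require large models + large data". Mobile ALOHA shows that in constrained settings, "a few high-quality demonstrations + classic IL" can also produce practical long-horizon behavior
  • Engineering template: the hardware is fully open-source, with a BOM and assembly drawings, making it a strong starting point for building a bimanual mobile teleoperation robot from scratch
  • Co-training travels well: beyond robotics, any setting where "demonstrations for the new task are expensive but data for related tasks is plentiful" can borrow it. Transfer across modalities and embodiments is becoming the norm
  • Beginner-friendly: the hardware is intuitive and so are the tasks (chores). Unlike RL papers full of reward shaping, you can tell what was learned just by watching the videos, which makes it a good candidate for a first deep read in imitation learning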

Cite this note
BibTeX
@online{eai_mobile_aloha_2026,
  title       = {(readable note) Mobile ALOHA},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/mobile-aloha/}},
  organization = {Embodied AI Reading Station}
}
