Imitation Learning · Plate Nº 51

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA)

7 min read · 2601 characters · ⭐⭐⭐ · Short summary

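This note is based on the abstract plus public material; I have not read the full paper.

In one sentence (TL;DR)

Build a two-arm teleoperation rig (ALOHA) for roughly 20,000 USD, have a person record about 50 demonstrations, and the robot learns to move one segment at a time (ACT), well enough to handle fiddly work like threading a velcro tie.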

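What is the scenario

Say you want to teach your parents to tie shoelaces at home, so you guide their hands through it 50 times. When they try on their own, it turns out that your force and rhythm were a little different every time: sometimes your fingers pinched tight, sometimes loose. If they only look at where the hand was a moment ago and copy the very next moment, the error piles up: the first knot is slightly off, the second is far off, and the third falls apart.

More practically, this kind of hand-over-hand teaching cannot be run as a large-scale experiment. The professional version of the "teaching glove", which in robotics means a bimanual teleoperation rig, used to be either priced in the hundreds of thousands of dollars (out of reach for an ordinary lab) or a wearable motion-capture glove too imprecise to line up centimeter-scale objects such as velcro ties and batteries. So everyone agreed that fine two-handed manipulation mattered, but few could afford the entry ticket to study it.

ACT/ALOHA goes after both problems at once. First, cut the cost of the "teaching glove" to roughly 20,000 USD, a budget many labs can manage (the ALOHA hardware). Second, switch to a steadier way of learning, segment by segment rather than instant by instant (the ACT algorithm), so the robot can do the job cleanly from fewer demonstrations.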

Plate Nº I · Scene illustration: the real-world problem this paper addresses

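How people did it before

  • Large industrial arms + high-end force feedback: precise enough, but a single setup costs hundreds of thousands of dollars, so an ordinary lab cannot buy one, let alone collect data at scale
  • VR / motion-capture glove teleoperation: worn on the human hand and mapped onto the robot, but finger sizes and joint degrees of freedom do not line up, so centimeter-scale fine motion is out of reach
  • Behavior Cloning (BC) with per-frame prediction: at each step, take the observation in and put the next action out. Simple, but compounding error is severe: a slightly wrong prediction → a slightly shifted next observation → drifting further and further off
  • DAgger-style online correction: the expert steps in and labels data whenever the policy drifts, but that needs an expert on call for long stretches, which is expensive
  • Offline RL / IRL (inverse reinforcement learning): elegant in theory but sample-inefficient, and on fine manipulation it rarely beats plain BC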
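
To make the compounding-error bullet concrete, here is a toy sketch, not anything from the paper. It assumes a made-up recursion in which each step adds a fixed prediction error and an off-distribution state amplifies the existing deviation by a factor `amplification`; both the recursion and the parameter names are illustrative assumptions.

```python
def drift_after(steps: int, per_step_error: float, amplification: float) -> float:
    """Toy model of compounding error (an assumption, not the paper's analysis).

    Recursion: d[t+1] = (1 + amplification) * d[t] + per_step_error
    amplification = 0 means errors only add up linearly;
    amplification > 0 means being off-distribution makes the next error worse.
    """
    deviation = 0.0
    for _ in range(steps):
        deviation = (1.0 + amplification) * deviation + per_step_error
    return deviation


def compare(steps: int = 50, per_step_error: float = 0.01) -> dict:
    """Return the final deviation for a few amplification factors."""
    return {g: drift_after(steps, per_step_error, g) for g in (0.0, 0.05, 0.1)}
```

With `amplification = 0` the deviation grows linearly with the number of steps; any positive amplification makes it grow geometrically, which is the snowball effect described above.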

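The paper's key idea

ACT splits imitation learning into two questions.

Question one: how far ahead should each step look? Think of handwriting. If you only watch the 1 mm in front of the pen tip, every stroke wobbles a little and the whole line ends up crooked. If you first plan the overall direction of the next stretch (say the next 5 characters) and then write, the small wobble of a single stroke is absorbed by the sense of direction of the whole stretch.

Classic behavior cloning (BC) defaults to predicting the next frame, i.e. looking only 1 mm ahead. ACT has the model output k future actions in one shot (a chunk; k is on the order of 100 in the paper, check the original for the exact value), handing short-horizon planning to the network.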
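
A minimal sketch of what "one query, k actions" means for the control loop. It assumes the simplest execution scheme, where each chunk is run open loop before the policy is queried again; `policy`, `step_env`, and `make_dummy_policy` are hypothetical stand-ins, not names from the ACT code.

```python
from typing import Callable, List, Sequence

Action = List[float]


def make_dummy_policy(k: int, action_dim: int = 14) -> Callable[[Sequence[float]], List[Action]]:
    """Hypothetical stand-in for a trained policy: always returns k zero actions."""
    return lambda obs: [[0.0] * action_dim for _ in range(k)]


def rollout_with_chunks(
    policy: Callable[[Sequence[float]], List[Action]],
    step_env: Callable[[Action], Sequence[float]],
    first_obs: Sequence[float],
    horizon: int,
) -> int:
    """Run `horizon` control steps and return how many times the policy was queried."""
    obs = first_obs
    queries = 0
    t = 0
    while t < horizon:
        chunk = policy(obs)        # one forward pass yields k future actions
        queries += 1
        for action in chunk:
            if t >= horizon:
                break
            obs = step_env(action)  # execute the chunk without re-querying
            t += 1
    return queries
```

For a 400-step episode, a chunk size of 100 needs 4 policy queries, while per-frame BC (chunk size 1) needs 400, so there are far fewer decision points at which an error can feed back into the next observation.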

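Question two: human demonstrations are inconsistent to begin with, so what then? Imagine five friends each teaching you to chop an onion: some cut left to right, some from the middle outward, and all of them are right. If the robot naively averages all the demonstrations, it ends up with a strange motion nobody taught (= a crooked cut).

ACT wraps the policy in a CVAE (Conditional Variational Autoencoder). During training it peeks at the whole demonstration segment and compresses "which style this one is" into a small label z. At inference the encoder is dropped and z comes from the prior (the public ACT implementation simply fixes z to the prior mean, zero), so the robot commits to one self-consistent style for the whole run instead of blending all styles together.

Wait, slow down: what is a CVAE? Think of it as a "style compressor". It compresses "what this demonstration looks like" into a short vector of numbers (the latent z) and, when generating, translates that vector back into a trajectory. The key point is that it only peeks at the ground truth during training, so the model learns how "style" and "action" correspond; at deployment, a z taken from the prior is enough to generate stably.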
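
A minimal sketch of how z differs between training and deployment, in plain Python rather than a deep-learning framework. The function names are hypothetical. The zero default at inference reflects my understanding of the public ACT code, which uses the prior mean; the optional random draw covers the generic CVAE recipe.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_z_training(mu: Sequence[float], log_var: Sequence[float],
                      rng: random.Random) -> List[float]:
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1).

    mu and log_var would come from the CVAE encoder, which sees the demonstration.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def choose_z_inference(dim: int, rng: Optional[random.Random] = None) -> List[float]:
    """Pick z without the encoder.

    rng=None  -> prior mean (all zeros), giving deterministic decoding.
    rng given -> a random draw from the standard Gaussian prior.
    """
    if rng is None:
        return [0.0] * dim
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]
```

Either way, one z is held fixed while a whole chunk is decoded, which is what keeps the generated trajectory in a single style.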

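Put together: predict segment by segment, and use a style label to absorb the disagreement between demonstrations. One more trick is temporal ensembling. Each timestep is in fact predicted many times (this query predicts the next 100 frames, the next query predicts the next 100 again, and the two overlap), so taking a weighted average of the overlapping predictions before executing works as built-in denoising.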

Plate Nº II · Method illustration: the core pipeline

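How it does it (method)

Hardware, ALOHA (think "marionette"): the operator holds two leader arms and the robot side has two follower arms. Turn a leader joint by some angle and the matching follower joint turns by the same angle, so the joints map one to one and the operator can drive the robot with their most natural two-handed posture. Four cameras around the workspace (top view, front view, and two wrist views) serve as the robot's "eyes". The whole rig costs about 20,000 USD (the paper's "low-cost" is relative to industrial setups at hundreds of thousands; the BOM is in the paper's appendix).

Network architecture (think "look at the picture, then draw"): 4 images + current joint angles → ResNet feature extraction → Transformer → k future actions emitted at once (each action is 14-dimensional: 7 per arm, gripper included). The CVAE encoder is only on duty during training, where it compresses the demonstrated trajectory into z and feeds it to the Transformer. At deployment z comes from the standard Gaussian prior N(0, I); the public ACT implementation fixes it to the prior mean, zero.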
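
The input/output contract described above can be written down as a small shape check. This is only a sketch: the camera key names and the `Observation` type are hypothetical, and image resolution is deliberately left unspecified because the note does not state it.

```python
from dataclasses import dataclass
from typing import Dict, List

CAMERAS = ("top", "front", "left_wrist", "right_wrist")  # hypothetical key names
ACTION_DIM = 14  # 7 per arm, gripper included, as stated in the note


@dataclass
class Observation:
    images: Dict[str, object]  # one image per camera; pixel format left open
    joints: List[float]        # current joint positions of both arms


def check_io(obs: Observation, chunk: List[List[float]], k: int) -> None:
    """Raise ValueError if the observation or the action chunk breaks the contract."""
    if set(obs.images) != set(CAMERAS):
        raise ValueError("expected exactly one image per camera")
    if len(obs.joints) != ACTION_DIM:
        raise ValueError("expected 14 joint values")
    if len(chunk) != k or any(len(a) != ACTION_DIM for a in chunk):
        raise ValueError("expected a k x 14 action chunk")
```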

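Training objective (think "check the answers + follow the rules"): one term is the L1 distance between the predicted actions and the real demonstration (checking the answers); the other is a KL divergence (keeping the distribution of z from drifting and close to a standard Gaussian, so that the prior is a sensible source of z at deployment). There is no reinforcement-learning term at all: pure supervised learning plus VAE regularization.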
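
A minimal sketch of the two-term objective, written in plain Python so the arithmetic is visible. The weight `beta` on the KL term and all function names are assumptions for illustration; the KL expression is the standard closed form for a diagonal Gaussian against N(0, I).

```python
import math
from typing import Sequence


def l1_loss(pred: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Mean absolute error over every element of a k x dim action chunk."""
    diffs = [abs(p - t) for pa, ta in zip(pred, target) for p, t in zip(pa, ta)]
    return sum(diffs) / len(diffs)


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))


def act_style_loss(pred, target, mu, log_var, beta: float = 1.0) -> float:
    """Reconstruction term plus beta-weighted KL regularizer (beta is an assumed knob)."""
    return l1_loss(pred, target) + beta * kl_to_standard_normal(mu, log_var)
```

When the encoder output already matches the prior (mu = 0, log_var = 0) the KL term is zero and only the L1 "check the answers" term remains.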

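Temporal ensembling at inference (think "several people voting"): at time t the model predicts the actions for [t, t+k-1]; at the next step t+1 it predicts [t+1, t+k]. The same absolute timestep is therefore predicted many times, and ACT takes an exponentially weighted average of those predictions before executing. This acts like a low-pass filter along the time axis and smooths out jitter further.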
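
A minimal sketch of the voting step, assuming the policy is queried at every timestep and that weights follow exp(-m * i) with i = 0 for the oldest prediction, which is how I understand the paper's scheme. The value of `m` and the helper names are illustrative assumptions.

```python
import math
from typing import List, Sequence

Action = List[float]


def predictions_for_step(chunks: Sequence[Sequence[Action]], t: int, k: int) -> List[Action]:
    """Collect every prediction made for timestep t, oldest first.

    chunks[s] is the chunk predicted at time s and covers steps s .. s+k-1,
    so its entry for step t sits at index t - s.
    """
    return [chunks[s][t - s] for s in range(max(0, t - k + 1), t + 1)]


def ensemble_action(predictions: Sequence[Action], m: float = 0.01) -> Action:
    """Exponentially weighted average; index 0 (the oldest) gets the largest weight."""
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[d] for w, p in zip(weights, predictions)) / total for d in range(dim)]
```

With `m = 0` this reduces to a plain mean of the overlapping predictions; a larger `m` leans harder on older predictions and so reacts more slowly to new observations.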

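What the experiments do

The paper picks several two-handed tasks that even a person has to concentrate on, covering different difficulties:

  • Threading a velcro tie (thread the velcro tie): one hand holds the tie while the other pushes its end through the loop, which needs two-hand coordination + centimeter-scale alignment
  • Tearing cling film / inserting and removing a battery / pouring ping-pong balls: these test force control, division of labor between the arms, and tolerance to mistakes
  • Contact-rich tasks such as clapping, shaking hands, and handing over objects

The comparison set is mainly BC baselines (BC-MLP, BC-RNN, BeT, etc.). ACT's success rate is markedly higher than the baselines on most tasks, on some tasks going from 0% to 80%+ (see the paper's tables for exact numbers). The ablation study confirms two things: removing chunking, which degrades the model to per-frame prediction, costs a lot of performance; removing the CVAE also hurts, but not as fatally as removing chunking. That suggests chunking is the more central contribution.

In terms of data scale, each task uses about 50 demonstrations (roughly 10-20 minutes of human operation), which puts it in the "few-shot imitation" bracket.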

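New terms you should know

  • Action Chunking: replace "predict the next frame" with "predict the next k frames". The core aim is to lower the decision frequency, reduce compounding error, and hand short-horizon planning to the network
  • Compounding Error: BC's old problem. A small prediction error in one frame pushes the next observation away from the training distribution, and the error snowballs
  • CVAE (Conditional Variational Autoencoder): a VAE that also takes the "input condition" as input. Here it compresses "the operator's style this time" into a latent, so that the generated trajectory stays within one self-consistent mode
  • Teleoperation: a person operating a robot remotely. ALOHA maps leader-arm joints directly onto follower-arm joints, the plainest and most intuitive variant
  • Behavior Cloning (BC): imitation in the supervised-learning sense: given observations, learn actions. Simple, but with built-in problems such as compounding error
  • Temporal Ensembling: take a weighted average of the actions predicted multiple times for the same timestep, which amounts to smoothing along the time axis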

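How it relates to other papers

  • Upstream: the main imitation-learning line of BC (Pomerleau 1989) and DAgger (Ross 2011). ACT does no online correction; it tackles compounding error through the structure of the output instead, which is a lighter route
  • Concurrent rival, Diffusion Policy (Chi et al. 2023): it also targets multimodality + compounding error, but models the action distribution with a diffusion model instead of a CVAE. The two are often compared side by side: diffusion fits the distribution better but is slower at inference; ACT is lighter, faster, and easier to tune
  • Downstream, Mobile ALOHA, ALOHA 2, ALOHA Unleashed: the same team later put ALOHA on a mobile base, pushed the data scale to thousands of demonstrations, and extended to household tasks, with ACT still the default baseline policy
  • Across directions, RT-1 / RT-2 / OpenVLA: this line trains generalist policies on massive multi-task data with a VLM backbone, and is complementary to ACT's "single task, few-shot, specialized" approach. The community is currently merging the two ideas (a large model as the prior + an ACT-like structure for precise downstream control)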

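How I suggest reading it

  1. Watch the ALOHA demo videos and browse the website first (the project page has full demos) to get an intuitive feel for what bimanual teleoperation can do, then come back to the paper
  2. Jump to the ACT network figure in Section 2 of Method: understanding the IO structure "4 images + joints in → k future actions out" matters most; the CVAE details can be filled in later
  3. Read the Ablation section closely: the authors themselves demonstrate the relative importance chunking > CVAE > temporal ensembling, which is more informative than the main results table
  4. Optional: read the hardware appendix for the BOM (bill of materials) and build instructions; it is a good model of how a robotics system paper is written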

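Why it is worth reading

  • A textbook-grade system paper: hardware + algorithm + data + evaluation in one package, a template for how to write a reproducible robotics paper
  • The chunking idea has been absorbed across the field: the concurrent Diffusion Policy and later work such as Mobile ALOHA and RDT-1B output action chunks by default rather than single frames, and ACT is one of the papers that drove this shift
  • A low-cost platform lets the community actually reproduce results: before this, robotics papers were often "visible but out of reach" because of the hardware barrier; ALOHA lowered the barrier to a level that a well-funded student project can reach, which led to a large body of follow-up work
  • A key milestone for imitation learning: before it, BC was widely seen as "too weak, you need RL"; after it, people found that "BC + the right output structure + clean demonstrations" already solves a fair number of fine manipulation tasks, which raised expectations for what this line of work can do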

Cite this note
BibTeX
@online{eai_act_aloha_2026,
  title       = {(readable note) Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA)},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/act-aloha/}},
  organization = {Embodied AI Reading Station}
}
