End-to-End VLA · Plate Nº 111

Octo: An Open-Source Generalist Robot Policy

6 min read · 2256 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

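In one sentence (TL;DR)

The first truly open-source generalist robot "brain": it first learns basic motions from about 800,000 recorded robot trajectories, and once you download it and fine-tune for a few hours, your own robot can pick up a new job.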

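What is the setting

Imagine you run a coffee chain with several hundred branches nationwide, and every branch has a different brand of coffee machine, a different counter height, and different cup sizes.

The old way for a manager to onboard a new barista was to teach from zero at every branch, starting with "where do the cups go" and "how do I press the buttons". Learning is slow, and when that barista transfers to the branch next door, the training starts all over again.

What Octo wants to do follows the same playbook ChatGPT used for text: first let a "general-purpose barista" watch work recordings from coffee shops all over the world, until low-level intuitions such as "roughly how a cup feels in the hand" and "roughly how hard to press a button" are thoroughly learned. Each branch then takes this barista who "already knows the basics", gives a few hours of onboarding (in jargon: fine-tuning), and puts them to work.

The most important point: this "base barista" is posted publicly online. Anyone can download it, and anyone can keep teaching it their own job. Before 2024 the robotics community had almost nothing like this; large companies trained such models but mostly did not release them.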

Plate Nº I. Octo, scene illustration: the real-world problem this paper sets out to solve

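How people did it before

  • RT-1 / RT-2 (Google): Transformer policies trained on large-scale robot data, but RT-2 is closed and neither was released as a base meant for outside researchers to fine-tune on their own robots, so outsiders could mostly just look on
  • Each lab trained its own: every new robot or new task meant collecting data from scratch and training from scratch, starting at several thousand demonstrations, with poor transfer
  • Diffusion Policy family: a strong method, but by default single-task and small-scale, without a "general pretraining + downstream fine-tuning" paradigm
  • Open X-Embodiment dataset (contemporaneous): 21 institutions pooled data from 22 robot embodiments into one large pool, but it is primarily data; it did not come with a fully open generalist policy built to be fine-tuned further
  • Overall pain point: the community lacked a robot version of BERT/CLIP, that is, a unified baseline that can be downloaded, modified, and compared against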

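The paper's key ideas

Three core choices:

  1. Architecture: a Transformer with modular input and output heads. Rather than hard-coding "must use RGB + end-effector pose", every kind of observation (images, language, proprioception) and every action space (7-DoF end effector, joints, mobile base) becomes a pluggable, tokenized module. When a new robot arrives, you only need to write a new head.
  2. The goal can be language or an image: you can tell it "put the red block in the box" (language conditioning), or hand it a photo of the target state and let it reproduce that state (goal-image conditioning). Both modalities are trained together.
  3. Diffusion action head: instead of regressing a single number, the output is a short sequence of future actions (an action chunk) generated by diffusion, which gives more stable behavior and handles multimodal action distributions better. This idea is inherited from Diffusion Policy.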
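
The "pluggable module" idea in the first choice can be sketched in a few lines. This is a toy illustration only: `ModularPolicy`, `mean_pool_backbone`, and the head names are hypothetical and are not Octo's actual classes or API. The shared backbone is replaced by a simple average so that the example runs with the standard library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Token = List[float]  # toy "token": a short vector of floats


@dataclass
class ModularPolicy:
    """Hypothetical container: one shared backbone, swappable tokenizers and heads."""
    backbone: Callable[[List[Token]], List[float]]
    tokenizers: Dict[str, Callable[[object], List[Token]]] = field(default_factory=dict)
    action_heads: Dict[str, Callable[[List[float]], List[float]]] = field(default_factory=dict)

    def act(self, observations: Dict[str, object], head: str) -> List[float]:
        tokens: List[Token] = []
        for name, value in observations.items():
            # Each observation type is turned into tokens by its own module.
            tokens.extend(self.tokenizers[name](value))
        embedding = self.backbone(tokens)
        # The chosen head maps the shared embedding to one action space.
        return self.action_heads[head](embedding)


def mean_pool_backbone(tokens: List[Token]) -> List[float]:
    # Stand-in for the Transformer: feature-wise average over all tokens.
    width = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(width)]


policy = ModularPolicy(backbone=mean_pool_backbone)
policy.tokenizers["proprio"] = lambda joints: [[float(j), 0.0] for j in joints]
policy.action_heads["ee_7dof"] = lambda emb: [emb[0]] * 7  # 7-DoF end effector

# A new robot arrives: only a new head is registered, the backbone is untouched.
policy.action_heads["base_2dof"] = lambda emb: [emb[0], emb[1]]

arm_action = policy.act({"proprio": [0.1, 0.3]}, head="ee_7dof")
base_action = policy.act({"proprio": [0.1, 0.3]}, head="base_2dof")
```

The design point is that `act` never mentions a specific sensor or action space; everything robot-specific lives in the two dictionaries.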

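On top of these comes a key engineering decision, being truly open source: model weights, training code, and the data-processing pipeline are all public, which for the first time lets the community run comparison experiments from the same starting point.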

Plate Nº II. Octo, method illustration: the core pipeline

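How it works (method)

Tokenizing the inputs: like an interpreter turning different languages into one shared "intermediate language". A robot's inputs are messy: camera frames, a human's language instruction, a photo of "what the goal looks like". Octo translates all three into one format (tokens, which you can think of as small chunks the model can digest): images are cut into small patches in the style of a ViT (after a lightweight convolutional stem), text goes through a T5 encoder, and the goal image takes the same path as the other images. All tokens are then concatenated into one long sequence and fed to the backbone. The benefit is that training data with only a text instruction and data with only a goal image can both be thrown in and trained together.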
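
A minimal sketch of how one sequence can hold text, goal-image, and observation tokens while tolerating a missing modality. Everything here is an assumption made for illustration: the string tokens, the `PAD` marker, the fixed slot sizes, and the function names are hypothetical, and the image sides are assumed to be divisible by the patch size. The real encoders (convolutional stem, T5) are replaced by trivial stand-ins.

```python
from typing import List, Optional, Tuple

PAD = "<pad>"
Grid = List[List[int]]


def patchify(image: Grid, patch: int) -> List[str]:
    """Cut a 2-D grid into patch x patch tiles, one toy token per tile."""
    tokens = []
    for r in range(0, len(image), patch):
        for c in range(0, len(image[0]), patch):
            tile = [image[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            tokens.append("img:" + ",".join(map(str, tile)))
    return tokens


def slot(tokens: List[str], size: int) -> Tuple[List[str], List[bool]]:
    """Fit tokens into a fixed-size slot; the mask marks real (True) vs padded (False)."""
    kept = tokens[:size]
    pad = size - len(kept)
    return kept + [PAD] * pad, [True] * len(kept) + [False] * pad


def build_sequence(image: Grid, instruction: Optional[str] = None,
                   goal_image: Optional[Grid] = None,
                   patch: int = 2, text_slots: int = 4):
    n_patches = (len(image) // patch) * (len(image[0]) // patch)
    text = ["txt:" + w for w in instruction.split()] if instruction else []
    goal = ["goal_" + t for t in patchify(goal_image, patch)] if goal_image else []
    parts = [slot(text, text_slots), slot(goal, n_patches),
             slot(patchify(image, patch), n_patches)]
    tokens = [t for toks, _ in parts for t in toks]
    mask = [m for _, msk in parts for m in msk]
    return tokens, mask


image = [[r * 4 + c for c in range(4)] for r in range(4)]
lang_tokens, lang_mask = build_sequence(image, instruction="pick up the cup")
goal_tokens, goal_mask = build_sequence(image, goal_image=image)
```

In the language-only example the goal slot is all padding and masked out; in the goal-only example the text slot is. Both sequences have the same length, which is what lets the two kinds of data share one batch.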

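Backbone and action head: an "understanding brain" in front, with a "pen that draws actions" attached behind. The backbone is a Transformer with causal attention across timesteps (the same family of architecture as GPT), released in two sizes, 27M and 93M parameters. The action head does something more interesting: rather than directly emitting a number for "how many degrees to turn next", it uses diffusion (the technique originally popularized for AI image generation) to "denoise" a short segment of future actions out of a blob of noise.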
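
The "denoise a segment of actions out of noise" loop can be sketched as follows, assuming standard DDPM ancestral sampling with a linear beta schedule. The trained denoising network is replaced here by a hypothetical oracle, `oracle_noise`, which knows the single target chunk; the schedule, the step count, and all names are assumptions for illustration and are not Octo's actual settings.

```python
import math
import random
from typing import List


def make_schedule(num_steps: int, beta_start: float = 1e-4, beta_end: float = 0.2):
    """Linear beta schedule plus the cumulative products used by DDPM."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_noise(x_t: List[float], alpha_bar: float, target: List[float]) -> List[float]:
    # Stand-in for the learned network: if all data sits at `target`, the noise
    # inside x_t = sqrt(alpha_bar) * target + sqrt(1 - alpha_bar) * eps is known exactly.
    return [(x - math.sqrt(alpha_bar) * y) / math.sqrt(1.0 - alpha_bar)
            for x, y in zip(x_t, target)]


def sample_chunk(target: List[float], num_steps: int = 20, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]          # start from pure noise
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, alpha_bars[t], target)    # predicted noise
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        x = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:                                       # no noise on the final step
            sigma = math.sqrt(betas[t])
            x = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
    return x


# A flattened toy chunk: 4 future steps of a 1-D action.
target_chunk = [0.1, 0.2, 0.3, 0.4]
denoised = sample_chunk(target_chunk)
```

With a real policy, the oracle would be a network conditioned on the backbone's output, and different noise seeds could land on different valid chunks; the oracle here collapses to one target only to keep the loop verifiable.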

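Wait, slow down for a moment: why generate actions with diffusion? Because for the same camera view there may be several reasonable actions (grasping with the left hand works, and so does grasping with the right). An ordinary regression approach gets stuck on this "several correct answers" situation and tends to average them, whereas diffusion is good at modeling exactly this kind of multimodal distribution.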
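
A small numeric illustration of the "several correct answers" problem, under the assumption that half of the demonstrations pass an obstacle on the left (offset -1) and half on the right (offset +1). The demonstration values and names are made up for the example. The point is that the prediction minimizing mean squared error is the average of the two modes, which is neither of them.

```python
import random

# Toy demonstrations for one and the same observation:
# lateral offset -1.0 means "go left", +1.0 means "go right".
demos = [-1.0, 1.0, -1.0, 1.0]


def mse(prediction: float, targets) -> float:
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# For squared error, the best constant prediction is the mean of the targets.
regression_action = sum(demos) / len(demos)

# Check against a grid of alternatives from -2.0 to 2.0.
grid = [i / 10 for i in range(-20, 21)]
best_on_grid = min(grid, key=lambda p: mse(p, demos))

# A generative head samples instead of averaging; drawing from the empirical
# distribution stands in for that here.
sampled_action = random.Random(0).choice(demos)
```

Here `regression_action` is 0.0, which steers straight into the obstacle, while `sampled_action` is always one of the two demonstrated modes.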

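Pretraining data: the ingredients come from the Open X-Embodiment dataset, from which about 800,000 robot manipulation trajectories were selected (a mixture of 25 datasets), covering many kinds of robot arms and many kinds of tasks. For the exact mixing ratios and cleaning steps, see the paper.

Downstream fine-tuning: the selling point the paper most wants you to remember. After downloading the weights, you collect on the order of 100 new demonstrations on your own robot (very few by robotics standards), run a few hours of fine-tuning, and get a policy more reliable than one "trained from scratch". The paper verifies this in several real-robot and simulated environments (see the paper for the exact numbers).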

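What the experiments do

  • Zero-shot / few-shot evaluation: run the pretrained policy directly on tasks it has not seen and check whether it can complete them, with RT-1-X and RT-2-X as comparison points
  • Downstream fine-tuning: the evaluation covers 9 real robot setups in total; the fine-tuning comparisons pit "fine-tuned Octo" against "training from scratch" and against fine-tuning from other pretrained baselines
  • Architecture ablations: compare diffusion head vs direct regression, different backbone sizes, and different goal modalities
  • Data-scale ablation: vary the amount and mix of pretraining data, up to the full 800,000 trajectories, and see how generality grows
  • Cross-embodiment generalization: how large the downstream gap is between robots seen during training and robot embodiments not seen

For the exact success rates and every column of the ablation tables, read the paper.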

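New terms you should know

  • Generalist policy: one network that handles many robots, many tasks, and many observation modalities, as opposed to a specialist with "one model per task"
  • Action chunk: predicting the next K actions in one shot rather than only the next step, which reduces jitter and improves temporal consistency; the idea comes from the ACT paper
  • Diffusion head: generating actions with a diffusion model, replacing "predict one value" with "denoise from noise to a trajectory segment"; it handles multimodal distributions well (several reasonable actions for the same observation)
  • Open X-Embodiment (OXE): a large-scale cross-embodiment robot dataset released jointly in 2023 by 21 institutions, covering 22 robot embodiments; it is Octo's training ingredient
  • Embodiment: the robot body itself, meaning what kind of arm, how many joints, what kind of gripper. "Cross-embodiment generalization" means the policy still works after switching to a different robot
  • Modular input/output: observation and action spaces are not hard-coded but built as pluggable modules, so a new robot only needs one more adapter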
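
The action-chunk entry can be made concrete with a generic receding-horizon loop: the policy predicts `chunk_len` future actions, the robot executes the first `execute_len` of them, and then the policy is queried again. This is a sketch under stated assumptions; `run_with_chunks` and `toy_policy` are hypothetical names, and how many steps of each chunk Octo actually executes is not stated in this note.

```python
from typing import Callable, List, Tuple


def run_with_chunks(policy: Callable[[int, int], List[float]],
                    total_steps: int, chunk_len: int,
                    execute_len: int) -> Tuple[List[float], int]:
    """Run for total_steps, re-querying the policy every execute_len steps."""
    if not 1 <= execute_len <= chunk_len:
        raise ValueError("execute_len must be between 1 and chunk_len")
    executed: List[float] = []
    calls = 0
    t = 0
    while t < total_steps:
        chunk = policy(t, chunk_len)      # predict K future actions at once
        calls += 1
        for action in chunk[:execute_len]:
            if t >= total_steps:
                break
            executed.append(action)       # "send to the robot"
            t += 1
    return executed, calls


def toy_policy(t: int, k: int) -> List[float]:
    # Hypothetical policy: the action at step t is simply the value t.
    return [float(t + i) for i in range(k)]


actions, calls = run_with_chunks(toy_policy, total_steps=10, chunk_len=4, execute_len=2)
_, calls_every_step = run_with_chunks(toy_policy, total_steps=10, chunk_len=4, execute_len=1)
```

Executing two steps per chunk halves the number of policy calls relative to re-planning at every step (5 instead of 10 here), at the cost of reacting one step later to new observations.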

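How it relates to other papers

  • Builds on: Open X-Embodiment (the data foundation), RT-1/RT-2 (the Transformer robot policy paradigm), Diffusion Policy (the diffusion idea for the action head), ACT (action chunks)
  • Contemporary rivals: RT-2-X (Google's cross-embodiment version, closed source), OpenVLA (also open source, a few months later, built on a LLaVA-style vision-language backbone)
  • Inspired by it: π0 (larger scale + flow matching action head), π0-FAST, SmolVLA and other later VLA models largely treat open release and fine-tunability as the default; this "community contract" was to a large extent established by Octo
  • Where it sits: Octo is the foundation stone on the "open-source generalist policy" road; OpenVLA swaps the base for a LLaVA-style one; π0 upgrades both the scale and the action head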

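How I suggest reading it

  1. Start with Figure 1 and the method overview figure: get clear on three things, namely how the inputs are tokenized, what the backbone looks like, and how the action head is attached; the remaining details are refinements
  2. Jump to the downstream fine-tuning experiments: this is where it differs most from closed-source work like RT-2; look at how it argues that "a small amount of data is enough to adapt"
  3. Go back and read the diffusion action head section: if you have not read Diffusion Policy before, this part will feel abrupt; read the Diffusion Policy paper first if needed
  4. Skim the ablations last: they are the fastest way to see which design choices are essential and which are engineering details, especially along three axes: goal modality, data scale, and backbone size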

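Why it is worth reading

  • It is the open-source baseline of the VLA era: in any VLA paper from 2024 onward (OpenVLA / π0 / SmolVLA / π0.5), the baselines, data pipeline, and tokenizer choices are related to Octo
  • It defined the "fine-tunable generalist policy" product form: not a demo to look at, but a checkpoint you can actually download, modify, and run on your own robot; this delivery standard later became the industry default
  • It brings the methods together: Transformer backbone + diffusion action head + multimodal conditioning + action chunks is the "standard answer" combination for robot policies in 2024, and reading it sorts out how these components relate in one pass
  • It is friendly to beginners: the architecture is clear, component boundaries are sharp, and the ablations are thorough, which makes it an easier entry point than a paper like RT-2 that involves both PaLI-X and a lot of closed-source detail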

Cite this note
BibTeX
@online{eai_octo_2026,
  title       = {(readable note) Octo: An Open-Source Generalist Robot Policy},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/octo/}},
  organization = {Embodied AI Reading Station}
}
