Datasets & Benchmarks · Plate Nº 31

LIBERO

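7 min read · 2326 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

In one sentence (TL;DR)

When you teach a robot new skills, it should not forget the old ones. LIBERO is the standard exam for this: 4 task suites that test spatial, object, goal, and mixed long-horizon knowledge.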

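What is the setting

A new household robot arrives at your home. On Monday you teach it to fold laundry, on Tuesday to wash dishes, on Wednesday to tidy the bookshelf. The day it learns to wash dishes, it forgets how to fold laundry; once it learns the bookshelf, it can no longer wash dishes either. It is much like an intern who forgets how to fry an egg right after learning a new recipe. Researchers gave this a dramatic name: catastrophic forgetting.

So how do you tell whose robot has the better "memory"? There used to be no standard: each group built its own tasks, ran its own experiments, and declared itself the best, and nobody could be convinced otherwise.

LIBERO does exactly this: it hands every household robot the same exam. The exam has 4 question types (4 task suites):

  • "In a different kitchen, can it still wash the dishes?" tests spatial generalization
  • "If the bowl becomes a cup, can it still grasp it?" tests object generalization
  • "It used to open the drawer; now it has to close it. Will it mix them up?" tests goal generalization
  • A long-horizon mix of everything tests overall ability

With this yardstick, everyone can finally be compared in one table.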

Plate Nº I. LIBERO: scene sketch of the real-world problem this paper addresses

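What people did before

  • Benchmarks without a sequential setting: Meta-World, RLBench, CALVIN and similar lean toward "learn a set of tasks in one go" and do not emphasize whether A is forgotten when B is learned after A
  • Continual learning (CL) community: previously ran mostly on image classification (Split-CIFAR, Permuted-MNIST); a standardized benchmark for robot control was missing
  • Imitation learning + visual servoing: many robot papers built their own tasks, ran them, and reported their own numbers, so results were not comparable
  • No decoupling of "knowledge types": earlier evaluations mixed everything together and did not separate spatial, object, and goal knowledge to see which kind a model transfers well
  • No large-scale expert demonstration dataset: earlier CL benchmarks had either no demonstrations or only a handful; LIBERO provides about 50 human teleoperation demonstrations per task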

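The key idea of this paper

When a teacher grades homework, "calculated it wrong" and "misread the question" are marked separately, because the causes differ and so do the remedies. LIBERO's core idea is to split the "knowledge" a robot has to retain into three types, test each on its own, and add one combined suite:

  1. Spatial knowledge (LIBERO-Spatial): like "in a different kitchen, can it still find the bowl". The objects are the same, but their placement and the tabletop layout change.
  2. Object knowledge (LIBERO-Object): like "if the bowl becomes a cup, can it still grasp it". The scene is the same, but the object's appearance and category change.
  3. Task/goal knowledge (LIBERO-Goal): like "you were taught to open the drawer, now you have to close it". The scene and objects are the same, but what has to be done changes.
  4. LIBERO-100 (combined, long-horizon): the mixed bag. 90 short tasks for training plus 10 long-horizon tasks for testing, imitating the "first A, then B, then C" chores of a real home.

Hold on, slow down for a moment: why split into these three types? Because earlier evaluations mixed everything together, so when a model failed you could not tell whether it "could not remember positions" or "could not recognize a new object". Only after splitting can you see which kind of transfer a model is good at.

The second key idea: two communities can use the same exam. The continual learning community runs classic algorithms such as EWC, ER, and PackNet on it; the VLA (Vision-Language-Action model) community uses it as the standard test bed for "5-shot / 10-shot fine-tuning ability". One exam, two uses.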

Plate Nº II. LIBERO: method sketch of the core pipeline

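How it works (method)

Simulation platform: built on robosuite + MuJoCo, with a single-arm Franka Panda doing tabletop manipulation. Every task comes with a natural-language instruction (such as "pick up the alphabet soup and place it in the basket"), which makes it easy to evaluate language-conditioned policies. There are 130 tasks in total across the 4 suites, each with about 50 human teleoperation demonstrations. For the exact task count and episode length of each suite, consult the original paper.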
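
The arithmetic behind the 130-task total can be checked with a small tally. This is a minimal sketch: the 130 total and the roughly 50 demonstrations per task come from the note above, while the per-suite split (10 / 10 / 10 / 90 + 10) and the names `SUITES` and `DEMOS_PER_TASK` are assumptions taken from public LIBERO materials and chosen for illustration, so verify them against the paper before relying on them.

```python
# Per-suite task counts. ASSUMPTION: this split comes from public LIBERO
# materials, not from the note itself; only the 130 total is stated there.
SUITES = {
    "LIBERO-Spatial": 10,
    "LIBERO-Object": 10,
    "LIBERO-Goal": 10,
    "LIBERO-90": 90,  # short-horizon part of LIBERO-100
    "LIBERO-10": 10,  # long-horizon part of LIBERO-100
}

# The note says "about 50" demonstrations per task, so this is approximate.
DEMOS_PER_TASK = 50


def total_tasks(suites):
    """Sum the task counts over all suites."""
    return sum(suites.values())


def approx_total_demos(suites, demos_per_task):
    """Rough size of the demonstration dataset, in episodes."""
    return total_tasks(suites) * demos_per_task


if __name__ == "__main__":
    print("tasks:", total_tasks(SUITES))
    print("approx. demos:", approx_total_demos(SUITES, DEMOS_PER_TASK))
```

Under these assumptions the whole benchmark holds on the order of a few thousand demonstration episodes, which is small by VLA pretraining standards and fits its later use as a fine-tuning test bed.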

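Evaluation protocol: the core metrics are success rate and forward / backward transfer (FWT / BWT). BWT measures how much old tasks drop after a new task is learned (that is, the amount of forgetting); FWT measures whether previously learned tasks help with new ones. The paper runs classic CL algorithms such as PackNet, EWC, and Experience Replay, cross-compared with ResNet/ViT visual encoders and BC-RNN/Transformer policy heads.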
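
To make FWT and BWT concrete, here is a minimal sketch using one common continual-learning convention: a matrix `R` where `R[k][i]` is the success rate on task `i` after training through task `k`. The formulas are the widely used textbook definitions and are an assumption here; the LIBERO paper's exact metric definitions may differ, and the numbers in `R` are made up for illustration.

```python
def backward_transfer(R):
    """Mean change on earlier tasks between 'just learned' and 'end of training'.

    Negative values mean forgetting; closer to 0 means less forgetting.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


def forward_transfer(R, baseline):
    """Mean gain on each task *before* it is trained, relative to a baseline.

    baseline[i] is the success rate of a reference policy (for example an
    untrained one) on task i.
    """
    T = len(R)
    if T < 2:
        raise ValueError("need at least two tasks")
    return sum(R[i - 1][i] - baseline[i] for i in range(1, T)) / (T - 1)


def final_average_success(R):
    """Average success over all tasks after the last training stage."""
    last = R[-1]
    return sum(last) / len(last)


# Illustrative, made-up results for 3 tasks learned in sequence.
R = [
    [0.80, 0.10, 0.00],  # after training on task 0
    [0.50, 0.70, 0.20],  # after training on task 1
    [0.30, 0.40, 0.90],  # after training on task 2
]
baseline = [0.0, 0.0, 0.0]

if __name__ == "__main__":
    print("BWT:", backward_transfer(R))
    print("FWT:", forward_transfer(R, baseline))
    print("final avg success:", final_average_success(R))
```

In this toy matrix the diagonal is high (each task is learned well when it is trained), while the last row shows earlier tasks decaying, which is exactly what a negative BWT captures.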

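Network and training: on the method side, the LIBERO paper is mostly "evaluation + empirical study" and does not push a new algorithm. Its contribution is a set of findings: (a) pretraining the visual encoder (e.g. R3M) helps FWT a lot; (b) Transformer policies are more stable than RNNs on long-horizon tasks; (c) existing CL algorithms forget goal knowledge (Goal) most easily, with spatial knowledge next. These observations became the "baseline reference" that later VLA papers cite again and again.

Data and code: LIBERO is fully open source, with demonstration data in HDF5 format plus standard training/evaluation scripts. This is a major reason it became a de facto standard: reproducibility is very high, and running a baseline takes little more than an import and one command.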

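What the experiments do

The experiments mainly answer four questions:

  • How much forgetting differs across knowledge types: run the same set of algorithms on each of the four suites (Spatial / Object / Goal / 100) and compare the BWT curves
  • Whether pretrained visual representations pay off: compare from-scratch, ImageNet-pretrained, and R3M-pretrained encoders on FWT
  • Choice of policy architecture: BC-RNN vs BC-Transformer, judged on long-horizon tasks
  • Head-to-head comparison of CL algorithms: where PackNet, EWC, ER and others are strong or weak across task families

For the actual numbers, read the tables in the original paper (three columns: success rate, FWT, BWT, one group per suite). When the VLA community later uses LIBERO, it usually reports only the success rate column and fixes the setting to "few-shot fine-tuning". That is not quite the same as the lifelong learning setup of the original paper, but it shares the same task definitions.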
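
The "success rate only" reporting style described above can be sketched as follows. This is a minimal illustration, not the official evaluation script: `success_rate` and `suite_report` are hypothetical helpers, and the rollout outcomes are made up. Each rollout is recorded as a boolean (task completed or not), and the report gives one number per suite plus an unweighted average over suites.

```python
def success_rate(outcomes):
    """Fraction of rollouts that ended in success."""
    if not outcomes:
        raise ValueError("no rollouts recorded")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def suite_report(results):
    """Per-suite success rate plus the unweighted mean over suites.

    results maps a suite name to a list of per-rollout booleans.
    """
    per_suite = {name: success_rate(o) for name, o in results.items()}
    average = sum(per_suite.values()) / len(per_suite)
    return per_suite, average


# Made-up rollout outcomes, for illustration only.
example_results = {
    "LIBERO-Spatial": [True, True, False, True],
    "LIBERO-Goal": [True, False, False, False],
}

if __name__ == "__main__":
    per_suite, average = suite_report(example_results)
    for name, rate in per_suite.items():
        print(f"{name}: {rate:.2f}")
    print(f"average: {average:.2f}")
```

Note that averaging over suites weights every suite equally regardless of how many rollouts it has; when comparing tables across papers, it is worth checking which averaging convention and how many rollouts per task were used.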

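New terms you should know

  • Lifelong learning (continual learning, CL): a model keeps learning new tasks in temporal order and is expected not to forget old ones and to use old knowledge to help with new ones. The difference from multi-task learning is that multi-task learning sees all the data at once, while CL sees it sequentially.
  • Catastrophic forgetting: the sharp drop in performance on old tasks when a neural network learns a new task; the central difficulty of CL.
  • Task family / task suite: a set of tasks that share some structure but vary internally. LIBERO treats each suite as an "exam question type".
  • Forward transfer (FWT) / backward transfer (BWT): FWT = learned tasks helping tasks not yet learned; BWT = the effect of learning new tasks on old ones (usually negative; the closer to 0, the less forgetting).
  • Teleoperation demonstration: a human drives the robot through the task with a controller or VR device, and the recording is used as training data. This is where LIBERO's ~50 demonstrations per task come from.
  • VLA (Vision-Language-Action model): a large model that combines vision, language, and action (usually fine-tuned from a VLM). LIBERO is now mainly used by the VLA community as a fine-tuning evaluation ground.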
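
Catastrophic forgetting, as defined in the list above, can be reproduced with a deliberately tiny toy. The sketch below is an assumption-laden illustration and has nothing to do with LIBERO's actual models: a one-parameter model `y = w * x` is trained with plain SGD on task A (`y = 2x`) and then on task B (`y = -x`), and the error on task A is measured before and after.

```python
def mse(w, data):
    """Mean squared error of the model y = w * x on a dataset."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w, data, lr=0.1, epochs=50):
    """Plain SGD on squared error, one sample at a time."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
    return w


XS = (0.5, 1.0, 1.5)
task_a = [(x, 2.0 * x) for x in XS]   # task A wants w = 2
task_b = [(x, -1.0 * x) for x in XS]  # task B wants w = -1

# Sequential training: first A, then B, with no replay of A's data.
w = train(0.0, task_a)
loss_a_after_a = mse(w, task_a)
w = train(w, task_b)
loss_a_after_b = mse(w, task_a)
loss_b_after_b = mse(w, task_b)

if __name__ == "__main__":
    print("loss on A after learning A:", loss_a_after_a)
    print("loss on A after learning B:", loss_a_after_b)
    print("loss on B after learning B:", loss_b_after_b)
```

The single weight is simply overwritten by task B, so the error on task A jumps once B is learned. Real networks have far more capacity, but methods such as EWC, ER, and PackNet exist precisely because sequential gradient updates tend to overwrite what earlier tasks needed.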

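How it relates to other papers

  • Upstream infrastructure: robosuite / MuJoCo (simulation), R3M (pretrained visual representation), BC-RNN / RT-1 (policy architecture prototypes)
  • Contemporary benchmarks: CALVIN (language-conditioned long-horizon, leaning multi-task), Meta-World (multi-task reinforcement learning), RLBench (more industrial-manipulation oriented). LIBERO's differentiator is the explicit lifelong setting plus the decoupling of knowledge types
  • Downstream users (the direction in which it really took off)
    • OpenVLA (Stanford 2024) tests fine-tuning ability on LIBERO-Spatial / Object / Goal / 10 and treats it as the standard VLA exam
    • π0 / π0.5 (Physical Intelligence 2024-25) use LIBERO to validate few-shot ability
    • RDT-1B (Tsinghua 2024) also runs LIBERO as a comparison
    • Many "VLA + xxx" papers from the past year (diffusion policy improvements, action tokenizers, and so on) use LIBERO as the default evaluation suite
  • Successors / alternatives: SimplerEnv (2024) takes the "match the real robot" route, aiming to bring simulation closer to real hardware; CALVIN remains another option that is often reported side by side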

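How I suggest reading it

  1. Start with the official GitHub README and the 30-second demo video (search for "Lifelong-Robot-Learning/LIBERO"). Build a visual intuition for what the 4 suites look like; this is faster than reading the paper's introduction.
  2. Get one baseline running: clone the repository and run BC-Transformer on LIBERO-Object. This step teaches you the tasks, the demonstration data format, and the evaluation scripts more solidly than reading the method section.
  3. Go back to Sections 4-5 of the paper: look at the curves comparing the four knowledge types under different CL algorithms, and focus on why the Goal suite is forgotten most easily. Many later papers take this as their entry point.
  4. Follow the thread to OpenVLA's LIBERO evaluation table: you will see that "how LIBERO is used in the VLA era" has drifted from the paper's original lifelong setup. Understanding this drift is understanding how a benchmark gets "reshaped by the community".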

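Why it is worth reading

  • It is one of the de facto standards for VLA fine-tuning evaluation today. In almost any VLA paper from 2024-25 you will find success rates on the 4 LIBERO suites in the experiment tables. Without reading the original you can only copy the numbers; after reading it you can judge small tricks such as "why did the authors report Spatial but not Goal"
  • It gives the abstract problem of "robot continual learning" a clean decomposition: the idea of decoupling spatial / object / goal knowledge is also useful when you design your own ablations
  • The barrier to reproduction is low. Simulation, full code, and demonstration data are all open source, making it one of the few benchmark papers you can get hands-on with right after reading
  • Strategic value: understanding LIBERO means understanding an evaluation paradigm, "measure generalization with task families rather than single tasks". Traces of this idea show up in later benchmarks such as RoboArena and SimplerEnv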

Cite this note
BibTeX
@online{eai_libero_2026,
  title       = {(readable note) LIBERO},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/libero/}},
  organization = {Embodied AI Reading Station}
}
