Simulation & Sim2Real · Plate Nº 105

BEHAVIOR-1K

6 min read · 1983 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

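In one sentence (TL;DR)

Stanford built a "housework exam hall for robots": 1,000 household tasks, 50 showroom scenes, and more than 9,000 objects, so that everyone measures "can a robot actually do housework" with the same yardstick.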

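What is the setting

You have just bought a household robot, and on day one you want it to "load the dirty dishes into the dishwasher and fold the blanket on the sofa". The catch: you would not dare let it practice in your own kitchen. If a bowl shatters or the fridge gets dented, who pays?

Driving schools solve the same problem for human drivers: practice on a closed course first, then go on the road. Robots need a "driving school" too: a virtual showroom home, furnished and stocked with cups, plates, and cutlery, where crashing ten thousand times costs nothing.

BEHAVIOR-1K is that driving school: 50 scenes with different layouts (apartments, villas, restaurants, offices), more than 9,000 pieces of furniture and small objects (each annotated with properties such as whether a cup can hold water or be heated), and 1,000 concrete tasks (from washing dishes to making the bed).

These 1,000 tasks were also not dreamed up by researchers. The authors first surveyed 1,461 people, asking "what would you most like a robot to do for you", and then picked the 1,000 most frequent and most practical activities from the real answers. This is the biggest difference from earlier sim benchmarks.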

Plate Nº I. BEHAVIOR-1K, scene illustration: the real-world problem this paper addresses

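How previous work approached it

  • AI2-THOR / iGibson 1.0: many scenes, but task definitions are fairly simple (navigation, grasping one or two objects); long-horizon household tasks are missing
  • Habitat: navigation-centric, with limited object interaction; not suited to fine-grained manipulation such as "unscrewing a bottle cap"
  • RLBench / ManiSkill: many tasks, but scenes are mostly tabletop (a single table surface), not real homes
  • BEHAVIOR-100 (predecessor): 100 tasks + 15 scenes; limited scale and limited physical fidelity
  • Most earlier work never seriously surveyed "what ordinary people actually want robots to do", so the task sets carry researcher bias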

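The paper's key ideas

There are three core insights:

**First, let users set the questions, not researchers.** Like a product manager doing requirements research: first survey 1,461 people with "which daily chores would you most like to hand off", then filter the answers down to 1,000 high-frequency and long-tail tasks, covering 6 major categories such as cleaning, cooking, tidying, and caregiving.

**Second, write "is it done?" as a machine-readable scoring rule.** Like grading an exam against an answer key rather than a grader's impression: each task's initial and goal states are written in predicate logic (think of a predicate as a "judgment sentence"). For example, all(dishes, inside(dishwasher)) means "all dishes are inside the dishwasher". Once the robot finishes, the simulator checks the goal automatically, with no manual scoring.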
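A minimal sketch of this kind of automatic goal check, assuming the world state is a plain set of (predicate, subject, object) facts. The state layout, the function names, and the object names are all hypothetical; this mirrors the note's informal all(dishes, inside(dishwasher)) notation and is not the simulator's actual checker.

```python
from typing import Iterable, Set, Tuple

# Hypothetical world state: a set of (predicate, subject, object) facts.
Fact = Tuple[str, str, str]


def inside(state: Set[Fact], item: str, container: str) -> bool:
    """True if the state records that `item` is inside `container`."""
    return ("inside", item, container) in state


def all_inside(state: Set[Fact], items: Iterable[str], container: str) -> bool:
    """Goal check in the spirit of all(dishes, inside(dishwasher))."""
    return all(inside(state, item, container) for item in items)


dishes = ["plate_1", "plate_2", "bowl_1"]

# Mid-episode: one plate is still on the counter.
partial_state = {
    ("inside", "plate_1", "dishwasher"),
    ("inside", "bowl_1", "dishwasher"),
    ("ontop", "plate_2", "counter"),
}

# End of episode: every dish has been loaded.
final_state = {("inside", dish, "dishwasher") for dish in dishes}

goal_met_partial = all_inside(partial_state, dishes, "dishwasher")
goal_met_final = all_inside(final_state, dishes, "dishwasher")
print(goal_met_partial, goal_met_final)  # False True
```

The point of the design is that success is a boolean function of the state, so two labs running the same task get the same verdict.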

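**Third, objects must not only look right but also "know what they are".** An ordinary 3D model only has appearance (mesh, materials); the 9,000 objects in BEHAVIOR-1K are additionally annotated with "can it hold liquid", "can it be heated", and "can it be folded", as if every prop shipped with its own instruction manual. Combined with the enhanced OmniGibson simulator (built on NVIDIA Omniverse + PhysX 5), this makes it possible to simulate complex physics such as fluids, cloth, and temperature.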

Plate Nº II. BEHAVIOR-1K, method illustration: the core pipeline

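How it works (method)

Task collection: a large-scale online survey (1,461 people) first asked "which everyday activities would you like to have done for you" and gathered free-text answers. The researchers then clustered, deduplicated, and filtered the answers for feasibility, ending up with 1,000 tasks formally defined in BDDL (BEHAVIOR Domain Definition Language, an extension based on PDDL). Each task includes an initial state, a goal state, and the relevant object categories.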
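To make the three parts of a task definition concrete, here is a small sketch that stores object categories, an initial state, and a goal state, and renders them in a PDDL-flavoured S-expression. The class, its fields, the task name, and the rendered layout are illustrative assumptions; real BDDL files may differ in syntax and naming.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A ground literal such as ("inside", "plate_1", "dishwasher_1").
Literal = Tuple[str, ...]


@dataclass
class TaskDefinition:
    """Hypothetical container for one task: objects, initial state, goal."""
    name: str
    objects: Dict[str, str]  # instance name -> object category
    init: List[Literal]
    goal: List[Literal]

    def to_sexpr(self) -> str:
        """Render as a PDDL-flavoured S-expression (illustrative layout only)."""
        def fmt(literals: List[Literal]) -> str:
            return " ".join("(" + " ".join(lit) + ")" for lit in literals)

        objs = " ".join(f"{inst} - {cat}" for inst, cat in self.objects.items())
        return (
            f"(define (problem {self.name})\n"
            f"  (:objects {objs})\n"
            f"  (:init {fmt(self.init)})\n"
            f"  (:goal (and {fmt(self.goal)})))"
        )

    def undeclared_objects(self) -> List[str]:
        """Names used in init/goal that were never declared in `objects`."""
        used = {arg for lit in self.init + self.goal for arg in lit[1:]}
        return sorted(used - set(self.objects))


task = TaskDefinition(
    name="loading_the_dishwasher",
    objects={
        "plate_1": "plate",
        "dishwasher_1": "dishwasher",
        "counter_1": "countertop",
    },
    init=[("ontop", "plate_1", "counter_1")],
    goal=[("inside", "plate_1", "dishwasher_1")],
)
print(task.to_sexpr())
print(task.undeclared_objects())
```

A check like `undeclared_objects` is the sort of validation a formal definition allows: a task that mentions an object it never declared can be rejected before any simulation runs.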

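Scene modeling: the 50 scenes cover residential, restaurant, office, retail, and other indoor environments; some are reconstructed from scans of real houses and some are built by professional 3D artists. All furniture inside every scene is interactive: drawers pull open, doors swing, lights switch on and off. This is very different from "decorative scenes" where an agent can only move around the surfaces.

Object assets: 9,000+ objects across 1,000+ categories. Each object has a mesh, UV textures, collision geometry, articulated joints (articulation), and abstract state labels (cookable / fillable / foldable, etc.). These labels hook into BDDL predicates, so the simulator knows that "a cup can hold water".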
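One way to picture how ability labels connect to predicates is a lookup that says which label an object needs before a predicate is meaningful for it. The sketch below assumes such a table; the mapping, the class, and the predicate names are hypothetical and only illustrate the idea, not the benchmark's actual annotation schema.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

# Hypothetical table: ability an object must have before a predicate applies.
REQUIRED_ABILITY: Dict[str, str] = {
    "cooked": "cookable",
    "filled": "fillable",
    "folded": "foldable",
}


@dataclass(frozen=True)
class ObjectAsset:
    """Toy stand-in for an annotated asset (geometry fields omitted)."""
    name: str
    category: str
    abilities: FrozenSet[str] = field(default_factory=frozenset)


def predicate_applies(predicate: str, obj: ObjectAsset) -> bool:
    """Predicates with no listed requirement (e.g. 'inside') apply to any object."""
    required = REQUIRED_ABILITY.get(predicate)
    return required is None or required in obj.abilities


mug = ObjectAsset("mug_1", "mug", frozenset({"fillable"}))
towel = ObjectAsset("towel_1", "towel", frozenset({"foldable"}))

print(predicate_applies("filled", mug))    # True
print(predicate_applies("folded", mug))    # False
print(predicate_applies("folded", towel))  # True
```

With this kind of table, a goal such as "the mug is folded" can be flagged as ill-formed from the annotations alone.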

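The OmniGibson simulator: developed on top of NVIDIA Omniverse. Key capabilities include unified rigid-body + soft-body + fluid physics, PBR rendering, a ROS interface, and multi-robot support (Fetch / Stretch / Tiago / Franka, etc.). This is the engineering foundation that lets the 1,000 tasks actually run.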

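What the experiments do

The paper is mainly not about chasing SOTA; it validates the feasibility of the benchmark itself and probes baselines:

  • Run existing RL / IL (imitation learning) algorithms on a subset of BEHAVIOR-1K and measure completion rates
  • Probe-style measurement: treat the human teleoperation success rate as the upper bound and see how far mainstream algorithms fall short of it
  • Cross-scene generalization: can the same task still be done in a house the agent has never seen
  • Specific numbers (success rates, training steps, etc.) require reading the original paper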
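The bookkeeping behind these measurements can be sketched as follows: group episode outcomes by agent and by seen/unseen scene split, compute success rates, and take the gap to the human teleoperation upper bound. The episode records below are made-up placeholders chosen only to exercise the code; they are not results from the paper, and the agent and split names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (agent, scene_split, success): placeholder records, NOT results from the paper.
Episode = Tuple[str, str, bool]

episodes: List[Episode] = [
    ("human_teleop", "seen", True),
    ("human_teleop", "seen", True),
    ("human_teleop", "unseen", True),
    ("human_teleop", "unseen", True),
    ("policy", "seen", True),
    ("policy", "seen", False),
    ("policy", "unseen", False),
    ("policy", "unseen", False),
]


def success_rates(records: List[Episode]) -> Dict[Tuple[str, str], float]:
    """Success rate per (agent, scene_split) group."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    wins: Dict[Tuple[str, str], int] = defaultdict(int)
    for agent, split, ok in records:
        totals[(agent, split)] += 1
        wins[(agent, split)] += int(ok)
    return {key: wins[key] / totals[key] for key in totals}


rates = success_rates(episodes)

# Gap to the human upper bound, per scene split.
gaps = {
    split: rates[("human_teleop", split)] - rates[("policy", split)]
    for split in ("seen", "unseen")
}
print(rates)
print(gaps)
```

Reporting the seen and unseen splits separately keeps a generalization failure from hiding behind a good in-distribution score.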

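The expected conclusion: current algorithms complete long-horizon household tasks at very low rates. BEHAVIOR-1K sets the ceiling for embodied AI very high and leaves plenty of room for follow-up research.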

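New terms you should know

  • BDDL (BEHAVIOR Domain Definition Language): an extension of PDDL (the classic AI planning language) that uses predicate logic to describe a task's initial and goal states. For example, inside(apple, fridge) is a predicate.
  • Articulated object: an object with movable joints, such as a drawer that pulls open or a faucet that turns. Contrast with a single rigid body.
  • Predicate: a term from logic; a boolean function that describes a property of an object or a relation between objects, such as is_open(door).
  • OmniGibson: the simulator that accompanies BEHAVIOR-1K, built on NVIDIA Omniverse; its predecessor is iGibson.
  • Embodied AI: giving an AI agent a "body" so it can perceive and act in the physical or simulated world, as opposed to text-only or image-only AI.
  • Long-horizon task: a task that takes tens or even hundreds of actions to complete. For example, "make breakfast" includes fetching ingredients, heating, plating, and other subtasks.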

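How it relates to other papers

  • Predecessor BEHAVIOR-100 (Srivastava 2021): tasks grow from 100 to 1,000, scenes from 15 to 50, and object count by about 10x; a direct iteration
  • iGibson series: OmniGibson is the next generation of iGibson, with much higher physical fidelity (fluids, soft bodies)
  • Relation to RT-2 / RT-X: BEHAVIOR provides a sim evaluation bed while RT-X is a real-robot dataset, so the two are complementary; training in sim first and then transferring to a real robot is a common pipeline
  • Relation to Habitat 3.0: Habitat leans toward navigation + simple interaction + human-robot collaboration, while BEHAVIOR leans toward complex physical manipulation; the two occupy different niches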
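  • Relation to VLA models such as OpenVLA / RDT-1B: the note lists them as using BEHAVIOR-1K as an evaluation bed and citing it as a standardized benchmark (not verified against those papers)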

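How I suggest reading it

  1. Start with Figure 1 + Table 1: get a feel for the scale of 1,000 tasks, 50 scenes, and 9,000 objects; the comparison with earlier benchmarks is clear at a glance
  2. Jump to Section 3 on task definition: understand how BDDL is written and what the predicate system looks like; this is the most reusable part of the paper
  3. Read Section 5 on the OmniGibson simulator: if you plan to use the benchmark yourself, you must know the simulator's limits (which physics it supports and which it does not)
  4. Finish with Section 6 on experiments: study how the baseline algorithms fail and look for research questions you could take on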

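Why it is worth reading

If you work on embodied AI / robot learning, the value of this paper is not in "how novel the ideas are". Its contribution is giving the whole field a single shared yardstick. Before BEHAVIOR-1K, every group used its own small benchmark and results were not comparable; afterwards, everyone can compete on the same set of tasks, which speeds up iteration across the field.

For beginners, this paper is an excellent introduction to "just how hard the hardest stage on the sim-to-real road is". After reading it you will see that getting a robot to "fold clothes" requires the full stack of task definition, scene modeling, physics simulation, perception, planning, and control to work together, and every link still has many open problems. Read it not to copy a method, but to calibrate your sense of what the field can currently do.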

Cite this note
BibTeX
@online{eai_behavior_1k_2026,
  title       = {(readable note) BEHAVIOR-1K},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/behavior-1k/}},
  organization = {Embodied AI Reading Station}
}
