Simulation & Sim2Real · Plate Nº 101

Habitat 2.0

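6 min read · 2,156 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper has not been read.

In one sentence (TL;DR)

The previous Habitat could only walk around and look inside a virtual house; 2.0 lets a small robot actually open the fridge and carry a cup from the kitchen to the living room as part of household chores.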

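What's the setting: an everyday analogy

Imagine you want to train a robot to do chores for you: you come home from work and have it fetch a soda from the fridge and put it on the coffee table by the sofa. Trial and error on a real robot, repeated thousands of times, is too expensive and too slow (broken cups cost money, and a whole night of running yields only about 100 steps), so researchers build a "virtual house" inside a computer as a training ground, let the AI run there hundreds of millions of times, and then transfer it to the real robot.

The previous generation, Habitat (1.0), was exactly such a virtual house, but it was more like a house-viewing demo where you can only turn your head and look around: the little agent could walk through rooms, look at walls, and build a map, but the cabinet doors were fixed, the cups were painted on, and reaching out grabbed nothing. Habitat 2.0 upgrades the virtual house into a kitchen and living room where you can really "play house": the fridge door pulls open, drawers slide out, cups have weight and tip over when bumped, and the robot is stopped when it runs into a table corner.

This "hands-on" virtual house is what researchers need: a real home robot ultimately has to act in the physical world, and being able to find its way and keep a map is far from enough.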

Plate Nº I · Habitat 2.0, scene sketch: the real-world problem this paper sets out to solve

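How people did it before

  • Habitat 1.0 (2019): fast rendering and many maps, but scenes are static meshes with no interaction, so it can only run pure navigation tasks such as PointNav / ObjectNav.
  • AI2-THOR / RoboTHOR: supports opening drawers and picking and placing objects, but through discrete "magic actions" (teleport-style) rather than real physics.
  • iGibson / SAPIEN: began to bring in physics and articulated objects, but either the scenes are small or simulation is too slow to sustain the hundreds of millions of steps RL needs.
  • Traditional robot simulators (Gazebo / MuJoCo / PyBullet): strong physics, but no photorealistic visuals and no ready-made set of home scene assets.
  • Bottom line: before Habitat 2.0, nobody managed "fast + real physics + visually realistic + large-scale interactive homes" all at once.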

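The paper's key idea

Rebuild the "simulator" as a stack of three layers: an asset layer (ReplicaCAD), a simulation layer (Habitat-Sim 2.0, now with physics), and a task layer (HAB). Each layer is redesigned toward the same goal, which is to let RL agents do household physical interaction at very high throughput on GPUs:

  • Assets are articulated (cabinets have movable doors, drawers have rails they slide on)
  • Simulation uses Bullet plus in-house optimizations to push throughput to thousands of SPS (steps per second)
  • Tasks are a set of long-horizon "rearrangement" routines with close-to-real-life semantics (find the object, pick, place, return), rather than single short actions

This is not "bolt on physics and call it done". The whole pipeline was rebuilt so that embodied AI could, for the first time, be trained and evaluated under "long tasks + real physics + visual realism" all together.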

Plate Nº II · Habitat 2.0, method sketch: the core pipeline

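How it works (method)

ReplicaCAD: interactive home assets. Think of IKEA handing you a whole furniture set broken down into parts that can move, for you to assemble yourself. Starting from the Replica dataset (scans of real rooms), the furniture is rebuilt by hand, piece by piece, as CAD-style 3D models that carry joint information. The fridge is not one dead mesh but "a body plus a door that rotates about a hinge"; the chest of drawers is not one dead mesh but "a body plus several drawers that translate along rails". Only then can the agent "open, reach in, close".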
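To make "a body plus joints" concrete, here is a minimal Python sketch of how an articulated object could be represented. It is illustrative only: the class names (`Joint`, `ArticulatedObject`), the joint limits, and the 50% "open" threshold are assumptions made for this note, not ReplicaCAD's file format or the Habitat-Sim API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class JointType(Enum):
    REVOLUTE = "revolute"    # rotates about a hinge, e.g. a fridge door
    PRISMATIC = "prismatic"  # slides along a rail, e.g. a drawer


@dataclass
class Joint:
    name: str
    joint_type: JointType
    lower: float           # lower limit (radians if revolute, metres if prismatic)
    upper: float           # upper limit
    position: float = 0.0  # current joint state; 0.0 means "closed" in this sketch

    def set_position(self, target: float) -> float:
        """Clamp the requested position to the joint limits and store it."""
        self.position = min(max(target, self.lower), self.upper)
        return self.position

    def is_open(self, fraction: float = 0.5) -> bool:
        """Hypothetical rule: open once past `fraction` of the joint range."""
        return (self.position - self.lower) >= fraction * (self.upper - self.lower)


@dataclass
class ArticulatedObject:
    name: str
    joints: Dict[str, Joint] = field(default_factory=dict)

    def add_joint(self, joint: Joint) -> None:
        self.joints[joint.name] = joint


# A fridge is a body plus one hinged door; a cabinet is a body plus a sliding drawer.
fridge = ArticulatedObject("fridge")
fridge.add_joint(Joint("door", JointType.REVOLUTE, lower=0.0, upper=1.57))
cabinet = ArticulatedObject("drawer_cabinet")
cabinet.add_joint(Joint("top_drawer", JointType.PRISMATIC, lower=0.0, upper=0.4))

fridge.joints["door"].set_position(2.0)         # beyond the limit, so it is clamped
cabinet.joints["top_drawer"].set_position(0.1)  # pulled out a quarter of the way
```

The point of the sketch is that "opening" becomes a change in a joint coordinate bounded by limits, which is what lets a physics engine treat the door and the body as connected but separately movable parts.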

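Habitat-Sim 2.0: high-throughput physics simulation. Think of converting an ordinary game console into a training machine that fast-forwards at roughly 1000× speed: the picture looks just as good, but in the same time it plays thousands more rounds. It plugs the Bullet physics engine into Habitat 1.0's rendering stack and adds heavy engineering optimization: batched rendering, avoiding CPU-GPU copies, and vectorized environments. The result is throughput on the order of 10^4 SPS on a single GPU (exact figures need checking against the paper; the abstract reports more than 25,000 SPS, about 850× real time, on an 8-GPU node), which makes end-to-end RL training feasible on a timescale of days.

Wait, slow down a beat: what is SPS? SPS = steps per second, the number of "action steps" the simulator can simulate in one second. RL (reinforcement learning) training routinely needs hundreds of millions of action steps; double the SPS and training time is halved. So "fast" here is not showing off. It decides whether an ordinary lab can do the research at all.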
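A back-of-the-envelope calculation shows how directly SPS translates into wall-clock time. The 10^8-step budget and the three SPS values below are illustrative assumptions, not figures from the paper, and the calculation ignores policy inference and learning overhead.

```python
def training_hours(total_steps: float, sps: float) -> float:
    """Wall-clock hours needed to simulate `total_steps` at `sps` steps per second."""
    return total_steps / sps / 3600.0


STEP_BUDGET = 1e8  # assumed RL budget: one hundred million environment steps

for sps in (100, 1_000, 10_000):
    hours = training_hours(STEP_BUDGET, sps)
    print(f"{sps:>6} SPS -> {hours:8.1f} hours ({hours / 24:.1f} days)")
```

Under these assumptions, 100 SPS means roughly 278 hours (more than eleven days) of pure simulation, 1,000 SPS about 28 hours, and 10,000 SPS under 3 hours.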

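Home Assistant Benchmark (HAB). Like the exam a driving school sets: not just driving straight, but reverse parking, parallel parking, and a hill start all in one go. The paper defines a set of long household tasks: SetTable (take a bowl out of a drawer and fruit out of the fridge and set them on the table), TidyHouse (put scattered objects back where they belong), and PrepareGroceries (move groceries between the fridge and the kitchen counter). Each task requires the agent to complete a chain of "navigate + open + pick + place" sub-actions, and a full episode can run to minutes.

Two kinds of policy baselines. One is like a novice cook grinding through everything from chopping to plating on muscle memory alone; the other is like a veteran cook who splits the job into "wash / chop / fry / plate", practices each part, and then strings them together. The paper runs both kinds of agent: end-to-end RL (vision straight to motor commands), and "task planning + skill composition" (a skill is a small policy for one sub-task), where the long task is first split into sub-skills (pick / place / nav / open), each skill is trained separately, and a high-level planner chains them. The latter succeeds markedly more often, which exposes how hard long tasks are for end-to-end learning. The abstract also notes that a hierarchy of independently trained skills suffers from "hand-off problems" between skills, and that classical sense-plan-act (SPA) pipelines are more brittle than RL policies.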
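The "split into skills, then chain them" idea can be sketched in a few lines. This is a toy illustration, not the paper's implementation: every skill here is a hand-written stub standing in for a separately trained policy, and the state dictionary, skill names, and preconditions are hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, object]
Skill = Callable[[State], bool]  # a skill returns True if it succeeded


def nav_to(target: str) -> Skill:
    """Build a navigation skill (stub for a trained navigation policy)."""
    def skill(state: State) -> bool:
        state["robot_at"] = target
        return True
    return skill


def open_fridge(state: State) -> bool:
    # Precondition: the robot must already be standing at the fridge.
    if state["robot_at"] != "fridge":
        return False
    state["fridge_open"] = True
    return True


def pick_soda(state: State) -> bool:
    # Precondition: robot at the fridge and the fridge door open.
    if state["robot_at"] != "fridge" or not state["fridge_open"]:
        return False
    state["holding"] = "soda"
    return True


def place_on_table(state: State) -> bool:
    # Precondition: robot at the table and something in the gripper.
    if state["robot_at"] != "table" or state["holding"] is None:
        return False
    state["on_table"] = state["holding"]
    state["holding"] = None
    return True


SKILLS: Dict[str, Skill] = {
    "nav_fridge": nav_to("fridge"),
    "open_fridge": open_fridge,
    "pick_soda": pick_soda,
    "nav_table": nav_to("table"),
    "place": place_on_table,
}


def run_plan(plan: List[str], state: State) -> List[str]:
    """Run skills in order; stop at the first failure and report what finished."""
    completed: List[str] = []
    for name in plan:
        if not SKILLS[name](state):
            break
        completed.append(name)
    return completed


def fresh_state() -> State:
    return {"robot_at": "sofa", "fridge_open": False, "holding": None, "on_table": None}


full_plan = ["nav_fridge", "open_fridge", "pick_soda", "nav_table", "place"]
done = run_plan(full_plan, fresh_state())
broken = run_plan(["nav_fridge", "pick_soda", "nav_table", "place"], fresh_state())
```

With the full plan, all five skills complete. Dropping `open_fridge` makes `pick_soda` fail its precondition, so execution stops after the first step: each skill has to leave the world in a state the next skill can start from.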

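What the experiments do

The experiments mainly answer three questions:

  1. Is the simulation fast enough: they measure Habitat-Sim 2.0's SPS throughput against 1.0 and other mainstream simulators, confirming that it can sustain RL training at hundreds of millions of steps (exact figures need checking against the paper).
  2. How hard are the HAB tasks: they run the end-to-end RL and hierarchical (skill composition) baselines on SetTable / TidyHouse / PrepareGroceries. The conclusion is that end-to-end essentially cannot handle the long tasks, and even hierarchical reaches only modest success rates under simplified settings, leaving plenty of room for follow-up work.
  3. How well do assets and scenes scale: they show that ReplicaCAD can be arranged into many layouts, and examine how well the learned policies generalize to new layouts.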
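For the first question, a throughput measurement in its simplest form is just timing a loop of environment steps. The harness below is a generic sketch under that assumption; `DummyEnv` is a hypothetical stand-in whose `step` does almost no work, so the number it produces says nothing about Habitat-Sim itself.

```python
import time
from typing import Callable


def measure_sps(step_fn: Callable[[], None], num_steps: int = 10_000) -> float:
    """Call `step_fn` `num_steps` times and return steps per wall-clock second."""
    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_steps / elapsed


class DummyEnv:
    """Hypothetical placeholder environment: each step only bumps a counter."""

    def __init__(self) -> None:
        self.t = 0

    def step(self) -> None:
        self.t += 1


env = DummyEnv()
sps = measure_sps(env.step, num_steps=5_000)
print(f"measured {sps:,.0f} steps per second over {env.t} steps")
```

A real benchmark would also have to fix what counts as a step (physics only, or physics plus rendering), the sensor resolution, and the number of parallel environments, since each of these changes the result.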

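A few new terms you should know

  • Embodied AI: the agent does more than take text in and put text out; it has a "body" (it can move in simulation or the real world), so it has to handle the perception-action loop.
  • Rearrangement: the agent must move objects in the environment from an initial state to a goal state. Around 2020 the EAI community converged on it as the prototypical embodied task.
  • SPS (steps per second): how many environment steps the simulator can simulate in one second. When RL training needs hundreds of millions of steps, SPS directly decides whether training takes hours or weeks.
  • Articulated object: an object with joints, such as a door that opens and closes or a drawer that pulls out, as opposed to a single rigid mesh.
  • Hierarchical policy: the high level picks a "skill" (such as pick) and the low level executes atomic actions (motor commands). On long tasks it is often more stable than end-to-end RL.
  • Skill / sub-policy: the "low-level small policy" from the hierarchical entry above; each skill solves one sub-task, for example pick only handles grasping.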
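The rearrangement definition above ("move objects from an initial state to a goal state") suggests a simple success test: every object ends up close enough to its goal position. The sketch below assumes that form; the function name, the object names, and the 0.15 m tolerance are illustrative choices for this note, not the benchmark's actual criterion.

```python
import math
from typing import Dict, Tuple

Position = Tuple[float, float, float]  # (x, y, z) in metres


def rearrangement_success(
    current: Dict[str, Position],
    goal: Dict[str, Position],
    tolerance: float = 0.15,  # assumed distance threshold in metres
) -> bool:
    """True if every goal object exists and lies within `tolerance` of its goal."""
    for name, target in goal.items():
        if name not in current:
            return False
        if math.dist(current[name], target) > tolerance:
            return False
    return True


goal = {"cup": (1.0, 0.8, 2.0), "bowl": (1.3, 0.8, 2.0)}
start = {"cup": (4.0, 0.9, 0.5), "bowl": (1.3, 0.8, 2.0)}  # cup still in the kitchen
end = {"cup": (1.05, 0.8, 2.0), "bowl": (1.3, 0.8, 2.0)}   # cup 5 cm from its goal

print(rearrangement_success(start, goal))
print(rearrangement_success(end, goal))
```

The first check fails because the cup is far from its goal; the second passes because both objects are within the tolerance.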

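How it relates to other papers

  • Builds on Habitat 1.0 (same lab): 1.0 solved "runs fast + looks real"; 2.0 adds "can use its hands + long tasks".
  • Parallel work / competitors: iGibson 2.0, ManiSkill, SAPIEN. All were building "physically interactive home simulators" around the same time, each with its own trade-offs (visuals vs. physics vs. speed).
  • Downstream work it enabled: Habitat 3.0 (human-robot collaboration), HomeRobot, and OVMM (Open-Vocabulary Mobile Manipulation); these more complex tasks build directly on the Habitat 2.0 stack.
  • Relation to RoboCasa / SimplerEnv: those two lean toward "robot-arm task suites + alignment with real robots", while Habitat 2.0 leans toward "whole-body mobility + long household routines". The two lines gradually became complementary over 2024-2025.
  • Relation to BEHAVIOR-1K: the BEHAVIOR line prioritizes task diversity (1,000 tasks), while Habitat 2.0 prioritizes training throughput and RL friendliness.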

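How I suggest reading it

  1. Read the Habitat 1.0 note first to understand "why simulators chase SPS" and "what the rendering pipeline looks like"; only then does 2.0's engineering contribution register.
  2. Jump straight to the HAB task definitions: see what SetTable / TidyHouse / PrepareGroceries actually ask the agent to do, and get a feel for how complex a "minutes-long task" really is.
  3. Go back to the ReplicaCAD asset examples: understand what an "articulated object" looks like at the data level (joints, degrees of freedom, collision shapes).
  4. Finish with the baseline results: the point is not the absolute success rates but the gap between end-to-end and hierarchical, which shaped the methodological direction of the whole EAI community for the following two to three years (2022-2024).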

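Why it's worth reading

Habitat 2.0 marks the step where EAI simulators moved from "navigation" to "manipulation". If you ever use a home simulator or benchmark (Habitat 3, HomeRobot, OVMM, RoboCasa), its design philosophy (the asset / simulation / task three-layer stack, SPS first, hierarchical baselines) traces back to this paper, directly or indirectly. Understanding it amounts to understanding what the "foundation" of household embodied AI has looked like since 2021.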

Cite this note
BibTeX
@online{eai_habitat_2_2026,
  title       = {(readable note) Habitat 2.0},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2021 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/habitat-2/}},
  organization = {Embodied AI Reading Station}
}
