Simulation & Sim2Real · Plate Nº 103

ProcTHOR

6 min read · ⭐⭐⭐ · Short summary

This note is based on the abstract and public material; I have not read the full paper.

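In one sentence (TL;DR)

Training an AI to move around indoors used to mean hand-building model homes one at a time, which was slow and produced few of them. ProcTHOR has the computer generate 10,000 houses in bulk from rules; having seen that many, the AI can find things even in a house it has never visited.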

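What is the setting? An everyday analogy

Imagine you have just moved into a friend's place and he asks you to "go grab the Coke from the fridge in the kitchen." You have never been in this home, yet you do not get lost, because you have seen hundreds of kitchens in your life and know what a fridge looks like, where it usually stands, and how its door opens. In other words, you can find things not because you memorized one particular floor plan but because you have seen enough homes.

ProcTHOR wants to give AI the same lesson. Before it, researchers built training rooms by hand, one at a time, like renovation contractors; however polished, that yields only tens to hundreds of homes, and a trained AI is lost as soon as the room changes. ProcTHOR instead writes a house-generating machine: it randomly samples a layout (a two-bedroom apartment or a loft), randomly places furniture, applies physical rules so that drawers can be pulled and lamps switched on, and produces 10,000 houses in one batch (the paper's headline number, 10K houses). An AI that "grows up" in these 10,000 houses is less likely, once it reaches the actual evaluation scenes, to mistake the wallpaper pattern of its training rooms for the essence of "home."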

Plate Nº I · ProcTHOR, scene illustration: the real-world problem this paper addresses

What came before

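  • AI2-THOR / RoboTHOR / Habitat-Matterport: hand-modeled or scanned real rooms; every scene is produced by an artist or a scanning rig, so quality is high but the total is very limited (tens to hundreds of scenes, about a thousand at most)
  • Replica / Gibson: 3D scans of real apartments; the geometry is realistic but not interactive (drawers do not open, objects cannot be picked up), which is not enough for manipulation tasks
  • iGibson 2.0 and similar: began introducing interactive objects, but the number of scenes is still capped by manual modeling
  • Shared bottleneck: the number of scenes is far smaller than the network's capacity, so performance drops sharply in an unseen room; the agent has "memorized a few homes" rather than "learned to act inside homes"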

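The paper's key idea

Core bet: replace manual modeling with procedural generation so that the number of scenes jumps from hundreds to ten thousand and beyond, and see what happens.

Two judgments sit behind this: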

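  1. Diversity > single-scene fidelity (at the current stage). A rough but diverse world is better than a polished but monotonous one for learning a policy that transfers well. This is the same intuition as "brute-force data plus scaling laws" in NLP, carried over to embodied AI.
  2. Interactivity and physical consistency must be preserved. Geometry alone is not enough; the agent must be able to open drawers, pick up cups, and push chairs. So the generated rooms are not static meshes but interactive scenes with semantics and physics inside the AI2-THOR engine.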

Plate Nº II · ProcTHOR, method illustration: the core pipeline

How it works (method)

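Step 1: floor-plan skeleton generation, like an architect sketching first. A real builder does not draw on a whim; they first decide how many bedrooms and living rooms there are, whether the living room adjoins the kitchen or a bedroom, and how the hallway connects things. ProcTHOR samples randomly according to rules of this kind: it draws the number of rooms, assigns room types (kitchen / living room / bedroom / bathroom), and decides which rooms are adjacent. Constraints ensure the result is a livable home rather than an oddly shaped maze.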
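
To make Step 1 concrete, here is a minimal sketch of "sample room types, then decide who adjoins whom, under a livability constraint." It is not ProcTHOR's actual algorithm: the required-room rule, the room-count range, and the names `sample_floorplan` and `is_connected` are assumptions made for illustration. Connectivity is guaranteed by linking each room to one earlier room, which yields a random spanning tree.

```python
import random

ROOM_TYPES = ["Kitchen", "LivingRoom", "Bedroom", "Bathroom"]
REQUIRED = ["Kitchen", "Bathroom"]  # hypothetical "livable home" rule


def sample_floorplan(rng, min_rooms=2, max_rooms=6):
    """Sample room types plus a door graph in which every room is reachable."""
    n = rng.randint(min_rooms, max_rooms)
    rooms = REQUIRED + [rng.choice(ROOM_TYPES) for _ in range(n - len(REQUIRED))]
    rng.shuffle(rooms)
    # Link room i to one random earlier room: a random spanning tree,
    # so the home can never fall apart into unreachable pieces.
    doors = [(rng.randrange(i), i) for i in range(1, n)]
    return rooms, doors


def is_connected(n_rooms, doors):
    """Check by depth-first search that all rooms are reachable from room 0."""
    neighbors = {i: set() for i in range(n_rooms)}
    for a, b in doors:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbors[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n_rooms


rooms, doors = sample_floorplan(random.Random(0))
print(rooms, doors, is_connected(len(rooms), doors))
```

A real generator would also have to assign geometry (room shapes and sizes); this sketch only covers the topological part of the skeleton.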

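Step 2: asset population, like moving furniture into an empty house. With the skeleton in place, the rooms still need contents. Each room draws items from the asset library (a furniture warehouse) according to its role: a kitchen must have a stove, a fridge, and cabinets; a bedroom must have a bed. Placement follows two sets of rules: physically, nothing may float or clip through other objects; semantically, a cup belongs on a tabletop, not on a bed. The assets themselves reuse and extend AI2-THOR's existing interactive objects (3D models whose drawers really slide and doors really open).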
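
The two rule sets in Step 2 can be sketched as rejection sampling plus a lookup table. This is an illustrative toy under stated assumptions, not the paper's placement procedure: the footprint sizes, the `ALLOWED_SUPPORT` table, and the function names are hypothetical. Objects are 2D footprints on the floor, so "no floating" holds by construction and "no clipping" becomes a rectangle-overlap test.

```python
import random

# Required assets per room type, taken from the examples in the text above.
ROOM_ASSETS = {"Kitchen": ["Fridge", "Stove", "Cabinet"], "Bedroom": ["Bed"]}
# Hypothetical footprints (width, depth) in metres.
FOOTPRINT = {"Fridge": (0.8, 0.8), "Stove": (0.8, 0.7),
             "Cabinet": (1.0, 0.5), "Bed": (2.0, 1.6)}
# Hypothetical semantic rule: which surfaces a small object may rest on.
ALLOWED_SUPPORT = {"Cup": {"Table", "Cabinet"}}


def overlaps(a, b):
    """True if two axis-aligned boxes (x, y, width, depth) intersect."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad


def place_assets(rng, room_type, width, depth, max_tries=200):
    """Place each required asset at a random spot that hits nothing else."""
    placed = {}
    for name in ROOM_ASSETS[room_type]:
        w, d = FOOTPRINT[name]
        for _ in range(max_tries):
            box = (rng.uniform(0, width - w), rng.uniform(0, depth - d), w, d)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[name] = box
                break
        else:
            raise RuntimeError(f"could not place {name}")
    return placed


def can_rest_on(obj, support):
    """Semantic check: is this support surface allowed for this object?"""
    return support in ALLOWED_SUPPORT.get(obj, set())


print(place_assets(random.Random(1), "Kitchen", 4.0, 4.0))
print(can_rest_on("Cup", "Table"), can_rest_on("Cup", "Bed"))
```

Rejection sampling is the simplest choice here; it gets slow in crowded rooms, where a real system would need something smarter.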

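Wait, one step back: what is AI2-THOR? In short, it is an indoor 3D simulator built by the Allen Institute for AI, roughly a "Minecraft for AI." ProcTHOR does not build a simulator from scratch; it attaches an automatic house generator to this one.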

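Step 3: batch output plus training, like stamping out houses from a mold. One run of the generator produces 10,000 houses, named ProcTHOR-10K. The AI is then trained in these houses on navigation tasks such as ObjectNav (finding objects) and on manipulation, sped up by large-scale parallelism. The paper's most striking finding: a model trained only on ProcTHOR-10K and evaluated on unseen downstream benchmarks (zero-shot, meaning no fine-tuning on the target benchmark) still posts strong results, often beating earlier state-of-the-art systems that had access to the downstream training data. This suggests that the number of houses by itself buys cross-scene ability.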

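Step 4: open-source the whole generator and the data. What ProcTHOR offers is not a fixed dataset but a generator: anyone who wants 100,000 or 1,000,000 houses can run it again. This is key to its later influence.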
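
The "generator, not dataset" point can be illustrated with a seeded, lazy stream: a house is a pure function of its seed, so a larger dataset is just a longer slice of the same stream. This is a toy sketch, not ProcTHOR's interface; `generate_house`, `house_stream`, and the house dictionary layout are hypothetical.

```python
import itertools
import random

ROOM_TYPES = ["Kitchen", "LivingRoom", "Bedroom", "Bathroom"]


def generate_house(seed):
    """Toy stand-in for a full generator: the same seed gives the same house."""
    rng = random.Random(seed)
    n_rooms = rng.randint(1, 10)  # hypothetical range
    return {"seed": seed,
            "rooms": [rng.choice(ROOM_TYPES) for _ in range(n_rooms)]}


def house_stream(start_seed=0):
    """Endless lazy stream of houses; take as many as the experiment needs."""
    for seed in itertools.count(start_seed):
        yield generate_house(seed)


ten_k = list(itertools.islice(house_stream(), 10_000))
print(len(ten_k), ten_k[0])
```

Asking for 100,000 houses changes only the slice length, and the first 10,000 stay identical, which keeps experiments on the smaller set reproducible.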

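What the experiments do

The paper evaluates on several standard embodied tasks: ObjectNav (finding objects), ArmPointNav (arm navigation / manipulation), Rearrangement, and others. The core comparisons roughly answer: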

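  • How far a model trained on ProcTHOR synthetic data gets when transferred directly to benchmarks built from real scans or other synthetic scenes (such as RoboTHOR, Habitat, ArchitecTHOR)
  • How performance changes as the number of houses goes from 100 → 1K → 10K (testing the "scale drives transfer" hypothesis; the exact gains require reading the paper)
  • Whether zero-shot performance already approaches or exceeds earlier methods that needed fine-tuning on the target benchmark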
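
One way to set up the 100 → 1K → 10K comparison is to draw nested training subsets, so that each larger run contains every house of the smaller ones and the only thing that varies is scale. Whether the paper samples its subsets this way is an assumption; `nested_subsets` is a hypothetical helper.

```python
import random


def nested_subsets(house_ids, sizes, seed=0):
    """Shuffle once, then take prefixes, so smaller sets nest inside larger ones."""
    order = list(house_ids)
    random.Random(seed).shuffle(order)
    return {k: order[:k] for k in sorted(sizes)}


subsets = nested_subsets(range(10_000), [100, 1_000, 10_000])
print({k: len(v) for k, v in subsets.items()})
```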

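Direction of the conclusion: scaling up synthetic scenes, keeping them physically interactive, and diversifying them procedurally can indeed support a strong embodied pre-training paradigm. The specific SR/SPL numbers for each benchmark need to be looked up in the paper's tables.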
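
For readers new to the two metrics, here is how they are usually computed. SR (success rate) is the fraction of successful episodes. SPL (success weighted by path length) averages, over all episodes, the success flag times shortest-path length divided by the larger of the path actually taken and the shortest path. The episode values below are made up purely to show the arithmetic; they are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes in which the agent succeeded."""
    return sum(e["success"] for e in episodes) / len(episodes)


def spl(episodes):
    """Success weighted by path length: mean of S * l / max(p, l)."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest"] / max(e["taken"], e["shortest"])
    return total / len(episodes)


# Illustrative episodes only (path lengths in metres).
episodes = [
    {"success": True, "shortest": 5.0, "taken": 5.0},    # optimal path: 1.0
    {"success": True, "shortest": 4.0, "taken": 8.0},    # twice as long: 0.5
    {"success": False, "shortest": 6.0, "taken": 12.0},  # failure: 0.0
    {"success": True, "shortest": 2.0, "taken": 4.0},    # twice as long: 0.5
]
print(success_rate(episodes))  # 0.75
print(spl(episodes))           # 0.5
```

SPL is never higher than SR, because each successful episode contributes at most 1 and is discounted when the agent wanders.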

A few new terms you should know

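  • Procedural Generation: producing content in bulk with rules or algorithms instead of modeling each item by hand. Common in the games industry (Minecraft terrain, Diablo dungeons).
  • Embodied AI: the agent has a "body" and must move, perceive, and manipulate objects in a 3D environment, rather than only process static images or text.
  • AI2-THOR: the interactive 3D simulation platform from the Allen Institute for AI; ProcTHOR is its "scene generator extension."
  • ObjectNav: a standard task in which the agent is given an object name ("find the fridge") and must walk to it in an unfamiliar house. It tests navigation plus visual semantics.
  • Zero-shot transfer: the model sees no samples from the target dataset during training and is tested on it directly. Succeeding suggests it has learned a general ability.
  • Sim-to-Real / Sim-to-Sim: whether a policy trained in simulation works in another simulator or on a real robot. ProcTHOR's main claim is sim-to-sim transfer.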

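How it relates to other papers

  • AI2-THOR / RoboTHOR / Habitat: ProcTHOR stands on AI2-THOR's shoulders and acts as a "data amplifier" for its ecosystem
  • Habitat 2.0 / iGibson 2.0: interactive simulation platforms from the same period; together with ProcTHOR they form the classic trio of embodied simulation, each on a slightly different route (Habitat builds on real scans, iGibson on physical fidelity, ProcTHOR on procedural scale)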
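  • Later Phone2Proc / Holodeck (2024): follow-ups that build on ProcTHOR. Holodeck combines the ProcTHOR idea with an LLM so that language drives scene generation ("build me a sci-fi office"); it can be read as an LLM-upgraded ProcTHOR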
  • Scaling-law work (in NLP/CV): ProcTHOR is an early embodied-AI example of "trading synthetic data scale for transferability," and the thinking has the same roots

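How I suggest reading it

  1. Start with the abstract and Figure 1 to build the intuition "generator + 10,000 houses + strong zero-shot results"
  2. Turn to the method section and focus on the generator's constraint system (three layers: geometric, semantic, physical); this is where most of the engineering effort lies
  3. Look at the "number of houses vs. performance" curve in the experiments (if there is one); it is the figure that makes the paper's case best
  4. Read Holodeck (2024) alongside it: see how an LLM and natural language take over ProcTHOR's rule system, and follow how the technical path evolved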

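Why it is worth reading

  • Ideas: one of the landmark embodied AI papers that makes the case for "scaling data" clearly, after which the field began to take large-scale synthetic scenes seriously
  • Engineering: the generator's constraint design and the organization of the interactive asset library are a reference sample for anyone building a simulation platform
  • Influence: the open-source generator and data have been widely reused, and later LLM × scene generation work (Holodeck and others) traces back to this line
  • Beginner-friendly: the core of the method is "rules + sampling," which takes little new math to follow, so it works well as a first deep read in embodied simulation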

Cite this note
BibTeX
@online{eai_procthor_2026,
  title       = {(readable note) ProcTHOR},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2022 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/procthor/}},
  organization = {Embodied AI Reading Station}
}
