High-Level Planning · Plate Nº 81

GenSim

6 min read · 2114 characters · ⭐⭐⭐ · Short summary

This note is based on the abstract and publicly available material; the full paper has not been read.

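What it says in one sentence (TL;DR)

Let ChatGPT play "exam writer": it automatically writes a pile of practice levels for the robot, and writes the answer key along with them.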

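What kind of scenario is this: an everyday analogy

Imagine teaching a child to play with building blocks. To teach all the different arrangements, you have to write the exercises yourself, one at a time: "put the red block in the box", "line them up by size", and so on. Writing 100 of them will make you question your life choices. And for each one you also have to demonstrate the correct moves yourself.

Robot training works the same way. Researchers hand-write virtual "tabletop" tasks one by one for the robot to practice on. The more tasks there are, the more capable the robot tends to become, but can humans write enough of them? They cannot.

GenSim's idea is direct: let an LLM (large language model) such as ChatGPT act as both "question writer" and "answer writer". It writes one piece of Python code that builds a virtual tabletop (which blocks to place, what the goal arrangement looks like), then another piece that demonstrates the "reference solution". The robot then drills on this set of LLM-written exercises.

Core insight: a simulation task is, at bottom, a piece of code, and code happens to be one of the things LLMs write best.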

Plate Nº I. GenSim, scenario sketch: the real-world problem this paper sets out to solve

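How people did it before

  • Hand-written simulation tasks: benchmarks such as Meta-World, RLBench, and CALVIN were designed by researchers one task at a time, so their scale is limited by human effort (a few dozen to one or two hundred tasks).
  • Domain randomization: randomly change textures, lighting, and object positions on existing tasks to produce "variants", but the semantic structure of the task itself stays the same.
  • Procedural generation: use rule scripts to combine objects at random (for example, ProcTHOR generating rooms), but the rules are still written by humans, so genuinely new task types rarely emerge.
  • Demonstration data collection: rely on human demonstrations via teleoperation, where every trajectory is expensive.
  • Scripted policy (expert script): researchers hand-write a state machine as the expert for each task, and it has to be rewritten for every new task.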

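The paper's key idea

Task diversity is the bottleneck for policy generalization, but humans design tasks too slowly. If the LLM writes both the "task definition code" (what the environment looks like, what the goal is, how the reward is computed) and the "expert policy code" (how to solve the task step by step), the task library can in principle grow from a few dozen tasks to hundreds or thousands.

Going one step further: store the generated tasks in a "task library", and have the LLM use existing library tasks as in-context examples when generating new ones. This forms a bootstrapping loop: the more tasks there are, the better the new ones get.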

Plate Nº II. GenSim, method sketch: the core pipeline

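How it does it (method)

Representing tasks as code. First, "writing an exercise" is broken into a uniform format, the way a teacher always fills in three columns: scene, grading criteria, reference answer. GenSim builds on tabletop manipulation simulation environments such as Ravens / CLIPort, and each task is represented as a Python class containing three blocks of code: scene setup (which objects to place, where the targets are), reward / success check (what counts as passing), and expert policy (using pick-and-place primitives to move objects to their target poses step by step). This structured code is what the LLM outputs.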
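
To make "a task is just code" concrete, here is a minimal sketch of the three-part structure. Everything in it is an assumption made for illustration: the class name `SortBlocksTask`, its method names, the 2D `Pose`, and the "teleport" physics are hypothetical and are not the Ravens or CLIPort API, where a real task would talk to a physics simulator.

```python
from dataclasses import dataclass, field

Pose = tuple[float, float]  # simplified 2D table position (assumption)


@dataclass
class SortBlocksTask:
    """Hypothetical task: move every block onto the zone of the same color."""

    lang_goal: str = "put each block in the zone of the same color"
    blocks: dict[str, Pose] = field(default_factory=dict)
    zones: dict[str, Pose] = field(default_factory=dict)

    def reset(self) -> None:
        # 1) Scene setup: which objects exist and where the targets are.
        self.blocks = {"red": (0.1, 0.2), "blue": (0.4, 0.1)}
        self.zones = {"red": (0.8, 0.8), "blue": (0.8, 0.2)}

    def reward(self) -> float:
        # 2) Reward / success check: fraction of blocks sitting on their zone.
        placed = sum(self.blocks[c] == self.zones[c] for c in self.blocks)
        return placed / len(self.blocks)

    def expert(self) -> list[tuple[Pose, Pose]]:
        # 3) Expert policy: one (pick pose, place pose) pair per block.
        return [(self.blocks[c], self.zones[c]) for c in self.blocks]

    def step(self, pick: Pose, place: Pose) -> float:
        # Toy stand-in for physics: teleport the block found at the pick pose.
        for color, pose in self.blocks.items():
            if pose == pick:
                self.blocks[color] = place
        return self.reward()


task = SortBlocksTask()
task.reset()
for pick, place in task.expert():
    task.step(pick, place)
print(task.reward())  # 1.0
```

The point of the layout is that all three parts live in one class, so a single LLM completion can produce a task that is immediately checkable: run `expert()` and see whether `reward()` reaches 1.0.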

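Hold on a moment: what is a pick-and-place primitive? It reduces "robot arm motion" to two steps: pick up A, put it at B. In the CLIPort environments, tasks are built by combining these two actions, so having the LLM write an "expert policy" really means writing a sequence of pick-and-place steps, which makes the job much easier.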
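
As a rough illustration of "an expert policy is a sequence of pick-and-place steps", the sketch below scripts an expert for a "line up by size" exercise. The `PickPlace` type, the `line_up_by_size` function, and the block dictionary layout are hypothetical names chosen for this example, not part of CLIPort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PickPlace:
    """One primitive: grasp at `pick`, release at `place` (hypothetical type)."""

    pick: tuple[float, float]
    place: tuple[float, float]


def line_up_by_size(
    blocks: dict[str, dict],
    start_x: float = 0.2,
    gap: float = 0.1,
    y: float = 0.5,
) -> list[PickPlace]:
    """Scripted expert: place blocks in a row, smallest first."""
    ordered = sorted(blocks, key=lambda name: blocks[name]["size"])
    return [
        # round() keeps the toy coordinates free of float noise like 0.30000000000000004
        PickPlace(pick=blocks[name]["pos"], place=(round(start_x + i * gap, 3), y))
        for i, name in enumerate(ordered)
    ]


blocks = {
    "large": {"size": 3, "pos": (0.1, 0.1)},
    "small": {"size": 1, "pos": (0.6, 0.3)},
    "medium": {"size": 2, "pos": (0.4, 0.7)},
}
for action in line_up_by_size(blocks):
    print(action)
```

The whole "policy" is a sort followed by a list of two-pose actions, which is why this kind of expert is within easy reach of a code-writing LLM.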

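Two generation modes: goal-directed vs exploratory. Think of two ways to write exercises. One is "following the syllabus": give the LLM a high-level description ("make a sorting task") and have it write the corresponding code. The other is "free play": have the LLM propose novel tasks on top of the existing task library. The two modes are complementary: the former guarantees coverage of known concepts, the latter produces surprises.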
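
One way to picture the two modes is as two branches of the same prompt builder. This is only a sketch: the function `build_prompt`, its parameters, and the prompt wording are invented for illustration and do not reproduce the paper's actual prompts.

```python
from typing import Optional


def build_prompt(library: list[str], target: Optional[str] = None, k: int = 2) -> str:
    """Build a generation prompt (illustrative wording only).

    Goal-directed mode when `target` is given, exploratory mode otherwise.
    """
    shown = library[:k]  # a few existing tasks used as in-context references
    listing = "\n".join(f"- {name}" for name in shown)
    if target is not None:
        ask = f"Write the task class for this description: {target}"
    else:
        ask = "Propose a task unlike any listed above, then write its task class."
    return f"Existing tasks:\n{listing}\n\n{ask}"


library = ["stack-blocks", "sort-by-color", "build-tower"]
print(build_prompt(library, target="make a sorting task"))  # goal-directed
print()
print(build_prompt(library))  # exploratory
```

The only difference between the modes in this sketch is who supplies the task idea: the human (via `target`) or the LLM itself.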

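Task library + in-context bootstrapping. Think of a student who always consults a few model essays before writing. Generated code is first run in the simulator for validation (it qualifies only if it runs and the expert policy can solve the task), and tasks that pass are stored in the "task library". In the next round, a few library samples are put into the LLM's prompt as references, which amounts to "writing new exercises while looking at old ones"; the larger the library, the more stable the quality of new tasks. This is the bootstrapping loop of in-context learning.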
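
The generate, validate, store loop can be sketched in a few lines. Both callables are stand-ins and are assumptions of this example: `propose` plays the role of the LLM call and `validate` plays the role of "run it in the simulator and check that the expert solves it"; the function name `bootstrap` is likewise hypothetical.

```python
import itertools
import random
from typing import Callable


def bootstrap(
    seed_tasks: list[str],
    propose: Callable[[list[str]], str],
    validate: Callable[[str], bool],
    rounds: int,
    k: int = 3,
    seed: int = 0,
) -> list[str]:
    """Grow a task library: sample examples, propose, keep only validated tasks."""
    rng = random.Random(seed)
    library = list(seed_tasks)
    for _ in range(rounds):
        examples = rng.sample(library, min(k, len(library)))  # in-context references
        candidate = propose(examples)
        if candidate not in library and validate(candidate):
            library.append(candidate)  # accepted tasks become future examples
    return library


counter = itertools.count(1)


def fake_llm(examples: list[str]) -> str:
    # Stand-in for an LLM call; a real one would condition on `examples`.
    return f"generated-task-{next(counter)}"


def fake_validate(code: str) -> bool:
    # Pretend every third candidate crashes or cannot be solved by its expert.
    return int(code.rsplit("-", 1)[1]) % 3 != 0


library = bootstrap(["stack-blocks", "sort-by-color"], fake_llm, fake_validate, rounds=6)
print(len(library))  # 6: two seed tasks plus four accepted candidates
```

The validation gate is what keeps the loop from feeding broken tasks back into later prompts.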

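Downstream policy training. Once the exercises are written, the robot works through them. The expert policies are run in batch to collect data, which is fed to a language-conditioned multi-task policy (based on the CLIPort architecture). The exact training scale and data volume require reading the paper.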
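
The data-collection half of this step might look like the sketch below: roll out each task's scripted expert and record (language goal, observation, action) tuples for later imitation learning. `ToyTask` is a deliberately trivial, hypothetical stand-in for a generated task, and `collect_demos` is an assumed helper name; the actual policy training is not shown.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Hypothetical stand-in for a generated task: move a counter up to `goal`."""

    lang_goal: str
    goal: int
    state: int = 0

    def reset(self) -> None:
        self.state = 0

    def expert_action(self) -> int:
        # Scripted expert: step one unit toward the goal.
        return 1 if self.state < self.goal else 0

    def step(self, action: int) -> bool:
        self.state += action
        return self.state == self.goal  # True once the task is solved


def collect_demos(tasks: list[ToyTask], episodes_per_task: int, max_steps: int = 10) -> list[dict]:
    """Run each task's expert and log one record per step."""
    dataset = []
    for task in tasks:
        for _episode in range(episodes_per_task):
            task.reset()
            for _step in range(max_steps):
                action = task.expert_action()
                dataset.append({"lang": task.lang_goal, "obs": task.state, "action": action})
                if task.step(action):
                    break
    return dataset


demos = collect_demos([ToyTask("reach two", 2), ToyTask("reach three", 3)], episodes_per_task=2)
print(len(demos))  # 10 records: 2 episodes x (2 + 3) steps
```

Keeping the language goal in every record is what lets a single network be conditioned on the instruction across many tasks.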

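What the experiments do

  • Task library size: the paper claims to have generated over a hundred tasks, an order-of-magnitude expansion over the roughly ten tasks of the original CLIPort / Ravens benchmark. Exact numbers require reading the paper.
  • Multi-task policy performance: train jointly on GenSim-generated tasks and measure the success rate on the original benchmark (CLIPort's 10 tasks).
  • Task diversity measurement: use embedding distance or human evaluation to check whether generated tasks are genuinely "new" rather than the same task under a different name.
  • Generalization and transfer: how the trained policy performs on unseen GenSim tasks, and even in sim-to-real transfer.
  • Ablation: remove vs keep the task library's in-context bootstrapping and see how the task pass rate changes.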
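
For the diversity bullet above, one simple embedding-based measure is the mean pairwise cosine distance between task embeddings. This is an assumed metric chosen for illustration, not necessarily the one the paper uses, and the toy vectors stand in for embeddings of task code or task descriptions.

```python
import math
from itertools import combinations


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_pairwise_distance(embeddings: list[list[float]]) -> float:
    """Average cosine distance over all unordered pairs of embeddings."""
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings")
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


# Toy embeddings: the first and third tasks are near-duplicates of each other.
task_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_pairwise_distance(task_embeddings))
```

A library of renamed copies would score near 0 on this measure, which is the failure mode the diversity check is meant to catch.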

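A few new terms you should know

  • Tabletop manipulation: a simplified setting in which a robot arm picks and places objects on a table; it is the "lab mouse" of manipulation research.
  • Pick-and-place primitive: the simplest unit of action (pick up A, put it at B); CLIPort is built on this primitive.
  • In-context learning: the model's parameters are not updated; a few examples in the prompt are enough for the LLM to generalize from them.
  • Bootstrapping: the model generates data itself and then uses it to train itself (or its next round), growing performance through iteration.
  • Domain randomization: randomly perturb simulation parameters during training so the policy is more robust on a real robot.
  • Multi-task policy: one network handles many tasks, usually with a language instruction to distinguish the goal.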

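How it relates to other papers

  • CLIPort / Ravens: the parent benchmark of GenSim and the source of its action primitives; GenSim is essentially a "task factory" for CLIPort.
  • Code as Policies / ProgPrompt: these also have an LLM write code to control robots, but that line of work writes "run-time control code", whereas GenSim writes "training-time environment and expert code". One targets deployment, the other targets training-data generation.
  • RoboGen / Eurekaverse / Holodeck: concurrent or follow-up work on "LLM-generated simulation tasks / environments" in the same vein; they differ in what is generated (tasks vs entire 3D scenes vs reward functions).
  • Eureka: also has an LLM write reward code automatically, but Eureka focuses on reward shaping for a single task, while GenSim focuses on expanding task definitions across many tasks.
  • RT-1 / Open X-Embodiment: collect data at scale through human teleoperation; GenSim goes the opposite way, replacing manual collection with simulation plus LLM generation.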

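How I suggest reading it

  1. First look at what a generated task looks like: find a sample task class in the paper's appendix or code repository, read the structure of the LLM-written Python code (scene / reward / expert), and understand what "task as code" means.
  2. Then look at the prompt design of the bootstrapping loop: work out how samples from the task library are fed back in, since this is the engineering detail of the paper that transfers best to other projects.
  3. Look at how the experiments measure task diversity: LLM output easily "looks new but is essentially the same", so how does the paper verify that the diversity is real?
  4. Think through one question: can this approach extend from tabletop pick-and-place to dexterous manipulation or mobile manipulation? Is the bottleneck the LLM or the simulation primitives?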

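Why it is worth reading

GenSim is one of the representative works of the "LLM as data factory" line of thinking. In embodied AI, "where does the data come from" is always a core question: teleoperation is expensive, real robots are risky, and simulation lacks diversity. GenSim's answer: let the LLM free human researchers from the "task design" bottleneck and move human effort up to higher-level design judgment.

Reading it helps you build two mental frameworks. First, "whatever can be expressed as code can be handed to an LLM". Second, "bootstrapping loop + library storage" is a general pattern for turning an LLM's one-off outputs into an accumulating asset, a pattern that recurs in Eureka, RoboGen, and various self-improving agents. Even if the specific method is superseded, the paradigm itself is worth mastering.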

Cite this note
BibTeX
@online{eai_gensim_2026,
  title       = {(readable note) GenSim},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/gensim/}},
  organization = {Embodied AI Reading Station}
}
