Simulation & Sim2Real · Plate Nº 107

Isaac Lab

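6 min read · 1944 characters · ⭐⭐⭐ · Short summary

This note is based on the abstract and public materials; the full paper has not been read.

In one sentence (TL;DR)

A virtual training ground where robots "practice" inside a computer. Previously you could either train very fast without usable visuals, or get beautiful visuals and train slowly; Isaac Lab brings the two together.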

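What is the setting

Imagine teaching a novice cook to toss a wok. Putting him on a real stove right away is expensive: spilled oil and dropped pans all cost money. The smarter approach is to practice a few thousand times in a "simulated kitchen" first, then move to the real stove. Robots are the same: when training directly on hardware, wrecking one humanoid robot is a six-figure loss. So everyone builds a "virtual gym" inside the computer, lets the robot fall a million times there, and then copies the learned motions back to the real machine.

But the virtual gym has a long-standing problem:

  • The room built only for practicing motions (the predecessor, Isaac Gym): like a gym with the lights off. Motions are computed very fast, thousands of repetitions per second, but you cannot see the picture and the robot cannot "see" anything either.
  • The room with beautiful visuals (Isaac Sim): like a film set. Lighting, shadows and cameras are all realistic, but training is slow; it suits shooting a demo reel more than practice.
  • Isaac Lab (this paper): connects the "basement gym" and the "film set". In the same room you can toss the wok a million times at high speed and also switch on the lights when you need to see clearly.

The biggest headache in robot training is the sim-to-real gap: a policy that performs smoothly in the computer fails once it is moved to real hardware. A frequent cause is that the images seen in simulation are too fake and the sensors too crude. Isaac Lab aims to make the bridge from simulation to the real robot a little smoother.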

Plate Nº I · Isaac Lab, scene sketch: the real-world problem this paper addresses

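How people did it before

  • Isaac Gym (2021): physics and RL training both run on the GPU, which made training tens of times faster, but rendering is crude and sensors exist only in simplified form.
  • MuJoCo / PyBullet: CPU simulators with good physics accuracy, but weak parallelism and even weaker rendering.
  • Webots / Gazebo (ROS ecosystem): engineering-oriented with rich assets, but training throughput is insufficient.
  • Omniverse Isaac Sim: rendering and scenes are very polished, but it leans toward demos and digital twins, and its RL training pipeline is awkward.
  • Result: researchers got either "fast but ugly" or "pretty but slow", with no one-stop way to train perception + control end to end.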

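The paper's key idea

Think of merging two workshops: one dedicated to "practicing motions" (Isaac Gym) and one dedicated to "making pictures" (Isaac Sim). Isaac Lab puts both under one roof and uses three small ideas to resolve the old tension between fast and realistic:

  • Multi-rate simulation: like household appliances that each keep their own rhythm (the air conditioner checks the temperature once a second, the alarm clock ticks once a minute). The physics engine runs fastest (1 kHz), the camera slower (30 Hz), the IMU in between (200 Hz); each runs at its own rate without being forced into lockstep.
  • Switchable rendering quality: during training, use "sketch mode" (fast rasterization) for high-volume practice; close to deployment, switch to "cinema mode" (ray tracing) so images look closer to reality and the visual gap shrinks.
  • Unified interface: humanoids, robot arms, quadruped dogs and drones all plug into the same socket (API). Writing one configuration file is enough to swap robots; there is no need to rewrite a stack for each type.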
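The "one configuration file swaps the robot" idea can be sketched with plain Python dataclasses. This is a minimal illustration under stated assumptions: `RobotCfg`, `EnvCfg`, the asset paths and the joint counts are all hypothetical names and values, not Isaac Lab's actual API.

```python
from dataclasses import dataclass, replace

# All names below are hypothetical and are NOT Isaac Lab's real API.


@dataclass(frozen=True)
class RobotCfg:
    name: str          # label for the robot
    asset_path: str    # placeholder path to a USD/URDF asset
    num_joints: int    # number of actuated joints


@dataclass(frozen=True)
class EnvCfg:
    robot: RobotCfg
    num_envs: int = 1024     # parallel copies of the scene
    physics_hz: int = 1000   # physics rate quoted in the text


QUADRUPED = RobotCfg("quadruped", "assets/quadruped.usd", 12)
ARM = RobotCfg("arm", "assets/arm.usd", 7)


def describe(cfg: EnvCfg) -> str:
    """Summarize a config in one line."""
    return f"{cfg.num_envs} x {cfg.robot.name} ({cfg.robot.num_joints} joints)"


base = EnvCfg(robot=QUADRUPED)
arm_env = replace(base, robot=ARM)  # swap the robot, keep everything else
print(describe(base))
print(describe(arm_env))
```

The design point is that the task-level settings (`num_envs`, `physics_hz`) stay untouched when only the `robot` field changes.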
Plate Nº II · Isaac Lab, method sketch: the core pipeline

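How it does it (method)

Part 1: build the house in three floors. Picture a building with a foundation, a middle floor and a top floor. The foundation is Omniverse / PhysX 5 (NVIDIA's GPU physics engine, responsible for the mechanics computation); the middle floor is the "environment abstraction layer" written by Isaac Lab itself, which turns the four things reinforcement learning needs (reset, step, observation, reward) into a unified interface; only the top floor holds concrete tasks such as walking, grasping and navigation. If the foundation is swapped, the task code on the top floor does not need to change.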
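The three-floor layering can be illustrated with a toy example. This is only a sketch under assumptions: `PhysicsBackend`, `EulerBackend`, `Env` and `ReachTarget` are hypothetical names invented here, and the one-dimensional "physics" stands in for a real engine.

```python
from abc import ABC, abstractmethod

# Hypothetical names throughout; this is not Isaac Lab's actual class hierarchy.


class PhysicsBackend(ABC):
    """Foundation: advances a 1-D state by one time step."""

    @abstractmethod
    def advance(self, position: float, velocity: float, dt: float) -> float: ...


class EulerBackend(PhysicsBackend):
    def advance(self, position, velocity, dt):
        return position + velocity * dt


class Env(ABC):
    """Middle floor: the reset / step / observation / reward quartet."""

    def __init__(self, backend: PhysicsBackend, dt: float = 0.01):
        self.backend = backend
        self.dt = dt
        self.position = 0.0

    def reset(self) -> float:
        self.position = 0.0
        return self.observation()

    def step(self, action: float):
        # The action is treated as a commanded velocity.
        self.position = self.backend.advance(self.position, action, self.dt)
        return self.observation(), self.reward()

    @abstractmethod
    def observation(self) -> float: ...

    @abstractmethod
    def reward(self) -> float: ...


class ReachTarget(Env):
    """Top floor: a toy task that only defines what to observe and reward."""

    target = 1.0

    def observation(self):
        return self.target - self.position

    def reward(self):
        return -abs(self.target - self.position)


env = ReachTarget(EulerBackend())
obs = env.reset()
for _ in range(50):
    obs, rew = env.step(action=1.0)
print(round(obs, 6), round(rew, 6))
```

Because `ReachTarget` never touches the backend directly, another `PhysicsBackend` subclass could be passed in without editing the task code, which is the property the paragraph above describes.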

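Part 2: each sensor clocks in on its own schedule. Like an office building where some people clock in at 9 and others at 10, the dispatcher does not force everyone to arrive together. Within each physics tick (the smallest time step), the scheduler wakes only the sensors that are due for a refresh. So when 1024 robots train simultaneously, the cameras do not drag down the whole pipeline. Specific throughput figures require reading the original paper.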
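A multi-rate scheduler of this kind can be sketched in a few lines. The rates (1 kHz physics, 30 Hz camera, 200 Hz IMU) come from the note; the firing rule and the function names `due_sensors` and `count_updates` are assumptions made for illustration, not the paper's implementation.

```python
PHYSICS_HZ = 1000  # physics tick rate quoted in the text

# Sensor rates quoted in the text; the scheduling rule itself is an assumption.
SENSOR_HZ = {"camera": 30, "imu": 200}


def due_sensors(tick: int) -> list[str]:
    """Return the sensors that should refresh at the end of physics tick `tick`.

    A sensor fires when its own sample counter, floor(ticks * hz / PHYSICS_HZ),
    advances during this tick. Integer math avoids floating-point drift.
    """
    return [
        name
        for name, hz in SENSOR_HZ.items()
        if (tick + 1) * hz // PHYSICS_HZ > tick * hz // PHYSICS_HZ
    ]


def count_updates(num_ticks: int) -> dict[str, int]:
    """Count how often each sensor refreshes over `num_ticks` physics ticks."""
    counts = {name: 0 for name in SENSOR_HZ}
    for tick in range(num_ticks):
        for name in due_sensors(tick):
            counts[name] += 1
    return counts


one_second = count_updates(PHYSICS_HZ)
print(one_second)  # {'camera': 30, 'imu': 200}
```

Over one simulated second the physics loop runs 1000 times, while the camera is woken only 30 times and the IMU 200 times, so the expensive sensor does not run on every tick.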

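Wait, slow down for a beat: what is a "rendering backend"? It is the "engine that does the drawing". The same scene can be drawn by a pencil sketch (fast but rough) or by an oil-painting master (slow but realistic).

Part 3: three painters to choose from. Rasterization (fastest, used for training, like a pencil sketch); path tracing / ray tracing (most realistic, used for sim-to-real work, like an oil painting); Hydra render delegate (connects external tools through the OpenUSD standard, like handing the draft to someone else to keep retouching). Use the fast one in the training phase and switch to the slow one in the acceptance phase.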
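Choosing a painter per phase amounts to a small lookup. The sketch below assumes hypothetical `RenderBackend` and `Phase` enums and a hypothetical `EXPORT` phase for the Hydra hand-off; none of these names come from Isaac Lab.

```python
from enum import Enum


class RenderBackend(Enum):
    RASTER = "rasterization"       # fastest, used while training
    PATH_TRACED = "path tracing"   # most realistic, used for sim-to-real checks
    HYDRA = "hydra delegate"       # hand-off to an external renderer


class Phase(Enum):
    TRAIN = "train"
    VALIDATE = "validate"
    EXPORT = "export"  # hypothetical phase, added only for illustration


# Hypothetical policy: which "painter" to use in each phase.
BACKEND_FOR_PHASE = {
    Phase.TRAIN: RenderBackend.RASTER,
    Phase.VALIDATE: RenderBackend.PATH_TRACED,
    Phase.EXPORT: RenderBackend.HYDRA,
}


def pick_backend(phase: Phase) -> RenderBackend:
    """Look up the rendering backend for a given phase."""
    return BACKEND_FOR_PHASE[phase]


for phase in Phase:
    print(phase.value, "->", pick_backend(phase).value)
```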

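Part 4: an open-source recipe community. All tasks are open-source Python configurations plus URDF/USD assets (the "architectural drawings" of robots and scenes), so anyone can contribute new robots and scenes. This differs a lot from the Isaac Gym era, when the recipes were written mainly by NVIDIA itself.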

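What the experiments do

The specific experimental setup and numbers require reading the original paper, but by the conventions of this kind of systems paper one would expect:

  • Throughput benchmarks: run 1k / 4k / 16k parallel envs on different GPUs (H100 / A100 / 4090) and measure steps per second.
  • Task reproduction: port the classic locomotion / manipulation tasks from Isaac Gym and check whether the training curves match or improve.
  • Sim-to-real validation: train a policy in Isaac Lab, deploy it on real hardware (e.g. Unitree H1, ANYmal, Franka), and look at success rate and zero-shot transfer performance.
  • Heterogeneous robots: train humanoids, quadrupeds and arms from the same script to verify that the API is general.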
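The "steps per second" metric in the throughput bullet can be made concrete with a tiny timing harness. This is a CPU-only sketch with a dummy step function; `dummy_step` and `measure_throughput` are hypothetical names, and the numbers it prints say nothing about Isaac Lab's real performance.

```python
import time


def dummy_step(states: list[float]) -> list[float]:
    """Stand-in for one simulator step over all parallel environments."""
    return [s + 0.001 for s in states]


def measure_throughput(num_envs: int, iterations: int) -> tuple[int, float]:
    """Return (total env steps, env steps per second)."""
    states = [0.0] * num_envs
    start = time.perf_counter()
    for _ in range(iterations):
        states = dummy_step(states)
    elapsed = time.perf_counter() - start
    total_steps = num_envs * iterations
    # Guard against a zero reading from the timer on very short runs.
    return total_steps, total_steps / max(elapsed, 1e-9)


for num_envs in (1024, 4096, 16384):
    total, sps = measure_throughput(num_envs, iterations=20)
    print(f"{num_envs:>6} envs: {total} steps, {sps:,.0f} steps/s")
```

Note that one call to the step function counts as `num_envs` environment steps, which is why throughput is usually reported per environment step and not per loop iteration.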

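New terms you should know

  • Isaac Gym: the GPU physics + RL framework NVIDIA released in 2021; the predecessor of the system in this paper.
  • Omniverse / OpenUSD: the 3D collaboration platform and the scene description format promoted by NVIDIA. By analogy: what Photoshop is to images, USD is to 3D scenes.
  • PhysX 5: NVIDIA's GPU physics engine, supporting rigid bodies, soft bodies and articulated-joint dynamics.
  • Multi-rate simulation: different sensors and controllers each run at their own real frequency, so nothing is forced to run at the highest rate.
  • sim-to-real gap: the drop in performance when a policy trained in a simulator is run on real hardware; a core difficulty of embodied AI.
  • domain randomization: randomizing lighting, textures, friction, mass and other parameters during training so the policy becomes more robust; a common way to narrow the sim-to-real gap.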
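Domain randomization, the last term above, can be sketched as drawing a fresh parameter set per episode. The parameter names and ranges in `RANGES` are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical parameter ranges, chosen only for illustration.
RANGES = {
    "light_intensity": (0.5, 1.5),  # relative brightness
    "friction": (0.4, 1.2),         # friction coefficient
    "mass_scale": (0.8, 1.2),       # multiplier on nominal link mass
}


def sample_domain(rng: random.Random) -> dict[str, float]:
    """Draw one randomized set of simulation parameters."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}


rng = random.Random(0)  # fixed seed so runs are reproducible
episodes = [sample_domain(rng) for _ in range(3)]
for params in episodes:
    print({k: round(v, 3) for k, v in params.items()})
```

Each episode then runs with a slightly different world, so the policy cannot overfit to one exact friction or lighting value.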

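How it relates to other papers

  • Direct predecessor: Isaac Gym (Makoviychuk 2021), which provided the core capability of GPU-parallel RL.
  • Contemporary competitors: Genesis (2024, a multi-university collaboration), MuJoCo MJX (Google DeepMind's port of MuJoCo to GPU/TPU), Brax (Google's JAX physics engine), Drake (MIT, oriented toward rigor in control).
  • Downstream users: almost all humanoid locomotion papers from 2024-2026 (H1, G1, Atlas family) and many manipulation / whole-body control works have started to use Isaac Lab by default.
  • Complementary in direction to RoboCasa / Habitat: those focus on large home/indoor scene assets, while Isaac Lab provides the physics + rendering foundation.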

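How I suggest reading it

  1. Start with the official GitHub README and the docs quickstart, get a cartpole or ant example running, and build an intuition for the "environment abstraction layer".
  2. Read the paper's section on the architecture diagram and multi-rate simulation to understand why this abstraction is more flexible than Isaac Gym.
  3. Skip to the benchmarks / sim-to-real case studies for the real-robot numbers and decide whether migrating your own project is worthwhile.
  4. If you work on humanoids or manipulation, browse isaaclab_tasks on GitHub; adapting an existing task is more efficient than reading the whole paper.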

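Why it is worth reading

  • De facto standard for embodied AI in 2025-2026: it appears very frequently in humanoid / quadruped / manipulation papers, and without knowing it you will struggle to follow other people's experimental setups.
  • Engineering worth learning: multi-rate scheduling, rendering-backend abstraction and USD-based assets are general patterns for simulation platform design, useful beyond robotics.
  • Lower barrier: compared with Isaac Gym, a newcomer can get their own task running within 1-2 days, and the engineering time saved when writing a paper can go into validating ideas.
  • The ecosystem will last: NVIDIA is betting on humanoids and embodied AI, so this line is unlikely to be abandoned in the foreseeable future and the learning investment pays back over a long period.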

Cite this note
BibTeX
@online{eai_isaac_lab_2026,
  title       = {(readable note) Isaac Lab},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2025 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/isaac-lab/}},
  organization = {Embodied AI Reading Station}
}
