World Model & Video Policy · Plate Nº 148

Dreamer V3: Mastering Diverse Domains through World Models

7 min read · 2507 characters · ⭐⭐⭐⭐ · Short summary

This note is based on the abstract and publicly available material; I have not read the full paper.

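In one sentence (TL;DR)

With one fixed configuration and no per-task retuning, a single AI agent learns more than 150 different games, and it is the first to mine diamonds in Minecraft entirely on its own.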

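What is the setting: an everyday analogy

Imagine you have just bought a new game console loaded with 150 completely different games: some are racing games (you must react every second), some are chess-like (you only learn whether you won after dozens of moves), and some are rhythm games (the reward is a fuzzy sense of "does this sound right").

If you ask a friend to beat them all for you, the usual approach means that for every new game he has to readjust the controller sensitivity and rebuild his muscle memory, like needing a new teacher for every new instrument. This is an old problem in reinforcement learning (RL, where an AI learns by trial and error to collect reward): for each new task, engineers spend weeks retuning a pile of hyperparameters (learning rate, reward scale, exploration strength, and so on).

Dreamer V3 wants to do one thing: set the controller sensitivity once and use it for all 150 games. Its trick is to have the AI first build a small "world model" in its head (an internal simulator of the environment), rehearse extensively by "daydreaming" inside it, and only then act in the real environment.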

[Figure, Plate Nº I: Dreamer V3, scenario illustration of the real-world problem this paper addresses]

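What came before

  • Model-free RL (PPO / SAC / Rainbow): trains the policy directly on samples from the environment; simple, but sample-inefficient and costly to retune across tasks
  • Dreamer V1 / V2: pioneered training the policy in a latent imagination space, but still needed hyperparameter tuning across domains; V2 came close to SOTA on Atari but was not general enough
  • MuZero: tree search plus a learned dynamics model; strong, but very expensive to train, and discrete and continuous control need different variants
  • Decision Transformer / Trajectory Transformer: treat RL as sequence modeling; a fresh idea, but a poor fit for tasks that need online exploration
  • A common weakness of cross-task approaches: reward magnitudes differ widely (Atari scores in the thousands vs. DMC rewards in 0~1), so without normalization a single task dominates the gradient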

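The paper's key idea

What keeps "rehearse in the head, then act for real" from becoming a general-purpose AI algorithm is not compute. It is making training behave like a point-and-shoot camera you can use with your eyes closed, with no readjustment whatever the lighting.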

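The authors' three concrete tricks:

  1. symlog compression (a small function that takes a signed logarithm of the score): some games pay rewards in the thousands and others in fractions of a point, a gap too wide to learn from jointly. symlog(x) = sign(x)·log(1+|x|) first squeezes them into a comparable range before they reach the model, much like needing a nonlinear ruler to draw a whale and a goldfish on the same textbook page
  2. Learning values with two-hot encoding: rather than having the model guess "this step is worth N points" directly, it spreads probability over a row of preset bins (like voting), turning a regression problem into a multiple-choice one that does not break down on extreme scores
  3. Fixed KL balancing + free bits: puts a pair of upper and lower "floodgates" on the world model's internal learning signal, so that whatever the game, the model does not charge headlong into a dead end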
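
To make trick 1 concrete, here is a minimal sketch of the symlog formula quoted above, using only the Python standard library. The inverse function, called symexp here, is not mentioned in this note; it is simply the mathematical inverse, included to show that the compression loses no information. The sample rewards are made up for illustration.

```python
import math

def symlog(x: float) -> float:
    """Symmetric log compression: sign(x) * log(1 + |x|)."""
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y: float) -> float:
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)

# Illustrative rewards spanning several orders of magnitude (made-up values).
for reward in (-2000.0, -0.5, 0.0, 0.5, 2000.0):
    squashed = symlog(reward)
    restored = symexp(squashed)
    print(f"{reward:>8.1f} -> symlog {squashed:+.3f} -> symexp {restored:+.3f}")
```

Near zero the function is close to the identity, so small rewards keep their resolution, while large magnitudes grow only logarithmically.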

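The most striking thing in the paper is not any single new invention but this conclusion: with all three tricks in place, one set of hyperparameters sweeps 150+ tasks, including the first time a general-purpose AI has mined Minecraft diamonds from scratch.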
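
Trick 3 (KL balancing + free bits) is the hardest of the three to picture, so the following is a rough numerical sketch of the general idea rather than the paper's implementation. Everything specific here is an assumption for illustration: the function names, the 1-nat threshold, the two weights, and the example distributions. Plain Python has no gradients, so the stop-gradient part of KL balancing appears only as a comment.

```python
import math

def categorical_kl(p, q):
    """KL(p || q) in nats for two discrete distributions (q must be positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def balanced_kl_loss(posterior, prior, free_nats=1.0, w_dyn=0.5, w_rep=0.1):
    """Forward value of a free-bits-clipped, two-term KL loss.

    All default values are illustrative assumptions, not taken from the note.
    """
    kl = categorical_kl(posterior, prior)
    # Free bits: once the KL is already below the threshold, the loss is a
    # constant, so there is no further pressure to shrink it.
    clipped = max(free_nats, kl)
    # KL balancing: the same KL is counted twice with different weights. In an
    # autodiff framework the first term would block gradients into the
    # posterior (training the prior) and the second would block gradients
    # into the prior (regularizing the posterior).
    dynamics_term = w_dyn * clipped
    representation_term = w_rep * clipped
    return dynamics_term + representation_term, kl

posterior = [0.7, 0.2, 0.1]   # made-up belief after seeing the observation
prior = [0.4, 0.3, 0.3]       # made-up prediction before seeing it
loss, raw_kl = balanced_kl_loss(posterior, prior)
print(f"raw KL = {raw_kl:.3f} nats, loss = {loss:.3f}")
```

In this example the raw KL is below the threshold, so the loss sits at its floor; only a larger mismatch between prior and posterior would raise it.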

[Figure, Plate Nº II: Dreamer V3, method illustration of the core pipeline]

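How it works (method)

World model architecture (RSSM): a sand table in the head. Think of a strong chess player who can "play two moves ahead in their head to see how the position develops". Dreamer V3 reuses the RSSM (Recurrent State-Space Model) from the earlier Dreamer series: it compresses what the camera sees into a small set of numbers (the latent, or latent representation), part of which is deterministic, h_t (like the fixed layout currently on the board), and part stochastic, z_t (like a guess at the opponent's next move). Given an action, this sand table predicts what the next step will look like, how much reward it yields, and whether the episode should end. All policy training happens not in the real game but in repeated rehearsal on this mental sand table.

Wait, slow down for a beat: what is this latent? Think of it as taste memory when cooking: you do not need to record the position of every grain of rice on the plate, only a few key dimensions such as saltiness, heat, and texture, and that is enough to cook the dish again. The latent compresses one high-resolution frame into a few dozen such key dimensions.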

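Actor-critic trained in imagination: an actor plus a coach. The actor picks actions and the critic (the coach) scores them. It works like a post-game review: starting from a real frame the agent has already played, the mental sand table imagines H steps forward (the exact horizon needs checking against the paper; it is on the order of a dozen or so steps), with the actor choosing a move at each step and the coach estimating the return of each move. The coach learns from the lambda-return (a weighted average of multi-step returns), while the actor learns by the rule "use high-return moves more often, but keep some randomness for trying new ones". The framework is essentially the same as Dreamer V2; the real differences all lie in the numerical-stability handling of the three-piece kit that follows.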
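
The lambda-return the coach learns from can be sketched with the standard backward recursion over one imagined rollout. This is a generic textbook form, not code from the paper: the function name, the discount gamma, the mixing weight lam, and the toy numbers are assumptions for illustration, and episode-termination flags are ignored for brevity.

```python
def lambda_returns(rewards, values, gamma=0.99, lam=0.95):
    """Compute lambda-returns for one rollout of H imagined steps.

    rewards[t] is the reward for step t (length H).
    values[t] is the critic's estimate at state t (length H + 1, so the last
    entry bootstraps the tail of the rollout).
    gamma and lam are illustrative defaults, not values taken from the note.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the critic at the final state
    for t in reversed(range(len(rewards))):
        # Mix the one-step bootstrap (weight 1 - lam) with the longer
        # multi-step return (weight lam), then discount and add the reward.
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * blended
        returns[t] = next_return
    return returns

# Toy rollout with H = 3 (made-up numbers).
print(lambda_returns([1.0, 0.0, 2.0], [0.5, 0.4, 0.9, 0.3]))
```

Setting lam to 0 recovers one-step targets that lean entirely on the critic (low variance, more bias); setting it to 1 recovers the full discounted sum with a single bootstrap at the end (less bias, more variance).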

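The three-piece kit that makes training insensitive to hyperparameters: the auto-exposure inside the point-and-shoot camera

  • symlog: like a camera automatically bringing strong and weak light to a viewable brightness. It is applied to reward targets, value targets, and observation reconstruction, absorbing cross-task differences in magnitude
  • two-hot critic: the coach does not report "this step is worth 47 points" directly but spreads probability over a row of symlog-spaced bins (like voting); regression becomes multiple choice and stays stable on extreme values
  • percentile-based return normalization: normalizes by the spread between the 5th and 95th percentiles of the returns, so one or two "jackpots" cannot drag the whole learning direction off course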
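
The percentile-based normalization in the last bullet can be sketched as follows. This is a simplified batch version under stated assumptions: the helper names are hypothetical, the lower bound of 1.0 on the scale (so that small returns are not blown up) is an assumption not stated in this note, and any smoothing of the scale over time is left out.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile of a sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac

def normalize_returns(returns, low=5.0, high=95.0, floor=1.0):
    """Divide returns by the 5th-to-95th percentile spread.

    The floor (an assumption here) keeps the scale from dropping below 1, so
    tiny returns from sparse-reward tasks are not amplified into noise.
    """
    ordered = sorted(returns)
    spread = percentile(ordered, high) - percentile(ordered, low)
    scale = max(floor, spread)
    return [r / scale for r in returns], scale

# 100 ordinary returns plus one "jackpot" outlier (made-up data).
batch = [float(i) for i in range(100)] + [1_000_000.0]
normalized, scale = normalize_returns(batch)
print(f"scale = {scale}")
```

Because the scale comes from percentiles rather than the maximum or the standard deviation, the single jackpot leaves it unchanged; a max-based scale would have shrunk every ordinary return toward zero.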

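The engineering significance of "one set of hyperparameters, reused across domains": one master key for 9 locks. The paper runs the same hyperparameters on 9 benchmarks: Atari 100k, Atari 200M, ProcGen, DMC proprio, DMC vision, BSuite, Crafter, DMLab, and Minecraft. The landmark result is mining diamonds in Minecraft from scratch (pure RL, no human demonstrations, no curriculum); the exact training steps and sample counts need checking against the paper.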

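What the experiments do

  • Breadth of coverage: spans 7+ benchmark suites and more than 150 tasks, covering continuous and discrete actions, pixel and state inputs, and sparse and dense rewards
  • Core comparison: against PPO, Rainbow, MuZero, IQN, DreamerV2, and others, with the emphasis on "I do not tune, they do"
  • Scaling curves: performance improves monotonically as the model grows, and larger models are actually more sample-efficient (counterintuitive, and one of the paper's main selling points)
  • Ablations: how badly training degrades once symlog, two-hot, or return normalization is removed; the exact numbers need checking against the paper
  • Minecraft diamonds: from scratch with pure RL, the agent learns the whole tech tree of chopping trees → crafting a workbench → mining stone → smelting iron → mining diamonds; this is the paper's most widely known result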

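New terms you should know

  • World model: the "environment approximator" the agent learns; it takes the current latent plus an action and predicts the next latent plus the reward. Analogy: your mental expectation of what happens when you push a cup
  • RSSM (Recurrent State-Space Model): the world-model backbone used across the Dreamer series; it combines a deterministic RNN with a stochastic latent, balancing stability and uncertainty modeling
  • Imagination training: the policy is optimized entirely on virtual trajectories rolled out by the world model, consuming no real environment samples; this is the root source of the sample efficiency
  • symlog: symlog(x) = sign(x)·log(1+|x|), a symmetric log compression that flattens cross-task differences in reward and value magnitude
  • two-hot encoding: represents a scalar y as a probability distribution over its two neighboring bins (weighted by distance), turning regression into classification and making it more robust to extreme values
  • lambda-return: an exponentially weighted average of n-step returns, the standard way to trade off bias against variance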
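
The two-hot entry above can be made concrete with a small sketch that encodes the "47 points" example onto a row of bins and decodes it back. The assumptions are mine, for illustration only: bins evenly spaced in symlog space, a bin range of [-20, 20], 41 bins, and the helper names. The paper's exact bin layout may differ.

```python
import bisect
import math

def symlog(x):
    return math.copysign(math.log1p(abs(x)), x)

def symexp(y):
    return math.copysign(math.expm1(abs(y)), y)

def two_hot(value, bins):
    """Spread probability over the two bins that bracket value.

    The nearer bin gets the larger weight, and the weights sum to 1, so the
    expected bin position equals value (after clamping to the bin range).
    """
    value = min(max(value, bins[0]), bins[-1])
    upper = min(max(bisect.bisect_right(bins, value), 1), len(bins) - 1)
    lower = upper - 1
    weight_upper = (value - bins[lower]) / (bins[upper] - bins[lower])
    probs = [0.0] * len(bins)
    probs[lower] = 1.0 - weight_upper
    probs[upper] = weight_upper
    return probs

def decode(probs, bins):
    """Expected bin position, mapped back from symlog space to raw scale."""
    return symexp(sum(p * b for p, b in zip(probs, bins)))

# Hypothetical layout: 41 bins evenly spaced in symlog space over [-20, 20].
bins = [-20.0 + 40.0 * i / 40 for i in range(41)]
target = two_hot(symlog(47.0), bins)
active = [(i, round(p, 3)) for i, p in enumerate(target) if p > 0.0]
print(active, decode(target, bins))
```

During training the critic would output a softmax over these bins and be fit to the two-hot target with a cross-entropy loss, which is what turns the regression into a classification problem.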

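How it relates to other papers

Direct predecessors

  • World Models (Ha & Schmidhuber 2018): laid the foundation for the "latent + RNN + imagination" idea, but only on simple tasks
  • Dreamer V1 (2020) / V2 (2021): developed the RSSM and the imagined actor-critic, with V2 the first to come close to SOTA on Atari. V3 = the V2 framework + generalization tricks

Side-by-side comparison

  • MuZero: also model-based RL, but it plans with MCTS inside the model rather than training a policy on imagined rollouts; MuZero is stronger but more expensive and more specialized
  • EfficientZero / SimPLe: low-sample model-based approaches focused on Atari 100k, not aiming for cross-domain generality
  • PPO / SAC: model-free baselines; Dreamer V3's "no tuning across tasks" claim is measured against tuned versions of these

Subsequent influence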

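  • DayDreamer (2022): takes Dreamer directly onto real robots for learning (it builds on the earlier Dreamer algorithm and appeared before the V3 preprint)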
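  • Robotics / embodied AI community: world model + imagination training has become a third mainstream route alongside diffusion policy and VLA
  • Large-scale world models (Genie, UniSim, OASIS, and others): extend toward "the world model as simulator", and Dreamer V3 shows that this route works at least at the control level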

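How I suggest reading it

  1. Start with the abstract + Figure 1 + the Minecraft tech tree figure to feel the impact of the claim "one set of hyperparameters, 150 tasks, and it solves diamonds"
  2. Go back to Dreamer V2's RSSM and imagined actor-critic framework (if you have not read V2, read its method section first; otherwise you cannot tell what V3's "differences" differ from)
  3. Focus on V3's three-piece kit: symlog, two-hot critic, and return normalization; work through the formulas and ablation tables to see which specific instability each one addresses
  4. Skim the experiments: look only at the curves for the domains you care about (DMC and Minecraft for robotics, Atari and Crafter for games) instead of grinding through every benchmark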

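Why it is worth reading

  • Methodological significance: against RL's long history of "new task, new tuning wizard", this is the first time "one set of hyperparameters for everything" has been made an empirical claim; it vindicates the world-model route
  • Engineering takeaway: the symlog + two-hot numerical-stability tricks can carry over to other regression problems that span tasks or scales, not just RL
  • Embodied AI perspective: if you work on robots or embodied intelligence, world model + imagination is a route you cannot avoid, and Dreamer V3 is currently its cleanest, most reproducible reference implementation
  • Published in Nature (2025): the methodology and experimental engineering have been through rigorous review, which makes it well suited as the "standard textbook chapter" for entering the world-model field
  • A clear further-reading chain: before it come Dreamer V1/V2 and World Models, after it DayDreamer / Genie / UniSim; this paper is the most important link in between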

Cite this note
BibTeX
@online{eai_dreamer_v3_2026,
  title       = {(readable note) Dreamer V3: Mastering Diverse Domains through World Models},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2025 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/dreamer-v3/}},
  organization = {Embodied AI Reading Station}
}
