World Model & Video Policy · Plate Nº 154

Genie: Generative Interactive Environments

6 min read · 2,186 characters · ⭐⭐⭐⭐ · Short summary

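This note is based on the abstract and public materials; I have not read the full paper.

TL;DR in one sentence

Genie watches a pile of game recordings, guesses for itself which "key was pressed" between each pair of frames, and then uses that "key press" to draw the next frame. It turns dead video into small playable games.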

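What's the setting: an everyday analogy

When you were little you watched your older brother play Super Mario, but you could not see his controller, only the TV screen. After a few hundred hours your brain had quietly learned something: if Mario suddenly moves right, your brother most likely pressed right; if Mario leaves the ground, that must be the jump button. You never saw a key press, but you inferred them from how the picture changed.

Back on the AI side: the internet holds a huge volume of game recordings, but nobody annotates those videos with key presses (data of the form "right was pressed on this frame" is extremely scarce).

  • Ordinary video generation models (Sora and the like): they only learn to continue the next frame, a passive "video relay" whose direction you cannot control
  • Genie does the reverse: it first infers from the difference between two adjacent frames "which key was probably just pressed", and compresses that inferred "virtual key press" into a single token (called a latent action). Once trained, you give it a still image as the opening frame, press any "virtual key", and it continues the scene frame by frame into something playable

Analogy: like someone who has never touched a guitar but has listened to ten thousand songs. They can infer "this is probably a G chord" and then play a new tune from that guess.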

[Figure, Plate Nº I. Genie, scene illustration: the real-world problem this paper sets out to solve]

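How people did it before

  • Classic world models (the Dreamer family): need trajectory data with action labels (state-action-state) and depend heavily on data collected in RL environments
  • Passive video generation (Sora, assorted video diffusion models): can continue a clip but is not interactive; the user has no "key press" control over where the picture goes
  • Behavior cloning: learns a policy from human demonstrations with action labels; the bottleneck is the cost of obtaining those labels
  • Early latent world-model explorations (e.g. World Models, Ha & Schmidhuber): worked in small simulated environments, but nowhere near the scale of "using internet video as raw material"
  • Decision Transformer / Trajectory Transformer: a sequence-modeling approach, but likewise dependent on annotated (s, a, r) triples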

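The paper's key idea

The core insight in one sentence: key-press data is painfully expensive, yet "which key was pressed" is already written into how the picture changes; nobody had bothered to pick it up.

Think of a detective watching surveillance footage. The footage contains no confession, but the difference between two consecutive frames is already telling you "he just ran to the left". Genie is that detective.

It splits the job into three components. The tokenizer is trained first; the latent action model and the dynamics model are then trained together:

  1. Video tokenizer: like cutting a photo into jigsaw pieces. Each frame is compressed into a sequence of discrete tokens (VQ-VAE style) so a Transformer can process it
  2. Latent Action Model (LAM): looks at adjacent frames and forces out a discrete "virtual key press" (a latent action token)
  3. Dynamics Model: takes the frame history plus that virtual key press and predicts the next frame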

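Hold on, slow down a beat: why can't this "virtual key press" cheat?

Left to its own devices, the easiest thing for the LAM to do is copy "what the next frame looks like" wholesale into the token, after which the next frame could be predicted with eyes closed. So the paper puts an information bottleneck on the LAM: the action codebook has only a handful of slots (8, for example). One choice out of 8 codes cannot hold a whole frame, so the LAM can only pack in "the most essential bit of intent" (high-level signals such as "character moves right" or "jump").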
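A rough back-of-the-envelope check of why the bottleneck works, as a minimal sketch. The codebook size of 8 comes from the note above; `TOKENS_PER_FRAME` and `FRAME_CODEBOOK_SIZE` are hypothetical numbers chosen only for illustration and are not taken from the paper.

```python
import math

NUM_LATENT_ACTIONS = 8        # action codebook size quoted in the note

# Hypothetical frame tokenization, for illustration only (not from the paper).
TOKENS_PER_FRAME = 256        # assumed number of discrete tokens per frame
FRAME_CODEBOOK_SIZE = 1024    # assumed size of the frame-token codebook


def bits_per_symbol(vocab_size: int) -> float:
    """Upper bound on the information one symbol from a vocabulary can carry."""
    return math.log2(vocab_size)


# One latent action is a single choice among 8 codes.
action_bits = bits_per_symbol(NUM_LATENT_ACTIONS)

# One frame is many choices, each among a much larger codebook.
frame_bits = TOKENS_PER_FRAME * bits_per_symbol(FRAME_CODEBOOK_SIZE)

print(f"latent action: at most {action_bits:.0f} bits")
print(f"tokenized frame: up to {frame_bits:.0f} bits")
```

Under these assumed numbers a latent action carries at most 3 bits while a tokenized frame can carry thousands, so the action channel has no room to smuggle the next frame through.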

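At inference time the LAM is dropped (only its action codebook is kept), and a human (or an AI agent) picks a virtual key press directly and hands it to the Dynamics Model. Out comes the next frame, and you have a playable environment.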
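The inference-time loop can be sketched as below. This is only an interface illustration: `toy_dynamics` and `TOY_ACTION_SHIFT` are hypothetical stand-ins for the learned dynamics model, and a "frame" here is a tiny one-hot row rather than an image.

```python
from typing import Callable, Dict, List

Frame = List[int]  # toy frame: one-hot row marking the character's position

# Hypothetical meaning of each latent action code in this toy world.
# In the real model the codes are learned and carry no built-in names.
TOY_ACTION_SHIFT: Dict[int, int] = {0: 0, 1: +1, 2: -1}


def toy_dynamics(history: List[Frame], action: int) -> Frame:
    """Stand-in for the learned dynamics model: shift the character by one cell."""
    last = history[-1]
    pos = last.index(1) + TOY_ACTION_SHIFT[action]
    pos = min(max(pos, 0), len(last) - 1)  # clamp to the edges of the frame
    return [1 if i == pos else 0 for i in range(len(last))]


def play(prompt: Frame, actions: List[int],
         dynamics: Callable[[List[Frame], int], Frame] = toy_dynamics) -> List[Frame]:
    """Roll out frames from a prompt image; no LAM is involved at this stage."""
    history = [prompt]
    for action in actions:
        # The player (or an agent) supplies the latent action directly.
        history.append(dynamics(history, action))
    return history


rollout = play([0, 1, 0, 0, 0], actions=[1, 1, 2, 0])
for frame in rollout:
    print(frame)
```

The point of the sketch is the call signature: the only inputs at play time are the frame history and a code index chosen by the player.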

[Figure, Plate Nº II. Genie, method illustration: the core pipeline]

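How it does it (method)

Step 1: video tokenization. ST-ViViT (a spatiotemporal ViT-based tokenizer) or a similar architecture encodes video frames into sequences of discrete patch tokens. This step compresses high-dimensional pixels into manageable discrete units and is the precondition for the Transformer modeling that follows.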
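The discretization at the heart of VQ-style tokenization is a nearest-neighbor lookup, sketched below. The 2-D vectors and the four-entry `codebook` are made up for illustration; a real tokenizer learns both the encoder that produces the vectors and the codebook itself.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: squared_distance(vec, codebook[i]))


# Hypothetical codebook and patch embeddings, for illustration only.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_embeddings = [[0.1, 0.2], [0.9, 0.1], [0.2, 0.8], [0.7, 0.9]]

# Each continuous patch embedding becomes one discrete token id.
frame_tokens = [quantize(p, codebook) for p in patch_embeddings]
print(frame_tokens)
```

After this step a frame is just a short list of integers, which is the form a Transformer can model directly.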

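Step 2: training the latent action model. This is the cleverest part of the paper. The LAM takes the frames up to x_t together with the next frame x_{t+1} (in the simplest reading, the adjacent pair (x_t, x_{t+1})) and outputs a discrete action token a_t. The key constraint is that the codebook for a_t is tiny, which forces an information bottleneck. The LAM is trained VQ-VAE style: a decoder that sees only the history and a_t must reconstruct x_{t+1}, so the reconstruction loss pushes the LAM to "pick out the bit of information most useful for prediction". The Dynamics Model is trained alongside it, predicting x_{t+1} from (x_{<=t}, a_t); as the method is publicly described, the LAM's learning signal comes from its own decoder rather than from the dynamics loss, and that decoder is used only during training.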
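A hand-built toy version of the encode, quantize, decode path may help make this concrete. Nothing here is learned: `ACTION_CODEBOOK` and both functions are hypothetical stand-ins that only show why a decoder limited to "previous frame plus one small code" forces the code to describe the change between frames.

```python
from typing import List

Frame = List[int]  # toy frame: one-hot row marking the character's position

# Hypothetical action codebook: each code stands for a displacement.
# In a trained LAM these would be learned vectors with no built-in meaning.
ACTION_CODEBOOK = [0.0, 1.0, -1.0]


def infer_latent_action(prev: Frame, nxt: Frame) -> int:
    """Encoder + quantizer: summarize the change, snap it to the nearest code."""
    change = float(nxt.index(1) - prev.index(1))
    return min(range(len(ACTION_CODEBOOK)),
               key=lambda i: abs(ACTION_CODEBOOK[i] - change))


def decode_next(prev: Frame, action: int) -> Frame:
    """Decoder: rebuild the next frame from the previous frame and the code only."""
    pos = prev.index(1) + int(ACTION_CODEBOOK[action])
    pos = min(max(pos, 0), len(prev) - 1)
    return [1 if i == pos else 0 for i in range(len(prev))]


# An unlabeled "video": the character moves right, stays, then moves left.
video = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]

latent_actions = [infer_latent_action(a, b) for a, b in zip(video, video[1:])]
reconstruction = [decode_next(a, act) for a, act in zip(video, latent_actions)]

print(latent_actions)
print(reconstruction == video[1:])
```

In the real model the reconstruction error would be the training loss; here the reconstruction is exact by construction, which is the behavior training is meant to approach.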

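Step 3: the dynamics model decodes in parallel, MaskGIT style. The Dynamics Model is a spatiotemporal Transformer. When predicting the tokens of the next frame it does not emit them one at a time autoregressively; it uses the MaskGIT recipe of "mask everything first, then fill in iteratively by confidence", which is much faster. That matters a great deal for "playable in real time".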
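The "mask everything, then fill in by confidence" loop can be sketched as follows. `toy_predict` is a hypothetical stand-in for the Transformer, and the cosine schedule for how many tokens stay masked is one common choice assumed here; the paper's exact schedule and step count are not stated in this note.

```python
import math
from typing import Callable, Dict, List, Tuple

MASK = -1  # sentinel for a token that has not been decided yet


def toy_predict(tokens: List[int]) -> Dict[int, Tuple[int, float]]:
    """Stand-in model: for each masked position return (token guess, confidence)."""
    guesses = {}
    for pos, tok in enumerate(tokens):
        if tok != MASK:
            continue
        neighbors = [tokens[j] for j in (pos - 1, pos + 1) if 0 <= j < len(tokens)]
        known = sum(1 for t in neighbors if t != MASK)
        # Toy rule: more decided neighbors means higher confidence.
        guesses[pos] = ((pos * 3) % 5, 0.5 + 0.2 * known)
    return guesses


def maskgit_decode(num_tokens: int, steps: int,
                   predict: Callable[[List[int]], Dict[int, Tuple[int, float]]]
                   ) -> Tuple[List[int], int]:
    """Fill all tokens in at most `steps` parallel passes instead of one per token."""
    tokens = [MASK] * num_tokens
    passes = 0
    for step in range(steps):
        guesses = predict(tokens)
        passes += 1
        # Cosine schedule: how many tokens should still be masked after this step.
        keep_masked = math.floor(num_tokens * math.cos(math.pi / 2 * (step + 1) / steps))
        num_to_fill = max(1, len(guesses) - keep_masked)
        # Commit the most confident guesses; the rest stay masked for later passes.
        ranked = sorted(guesses, key=lambda p: guesses[p][1], reverse=True)
        for pos in ranked[:num_to_fill]:
            tokens[pos] = guesses[pos][0]
        if MASK not in tokens:
            break
    return tokens, passes


decoded, passes = maskgit_decode(num_tokens=16, steps=4, predict=toy_predict)
print(decoded)
print(passes)
```

The speedup in this sketch is simply 4 model calls for 16 tokens, where token-by-token autoregressive decoding would need 16.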

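Step 4: training at scale. One of the paper's main selling points is scale: it uses a large amount of 2D platformer video (sourced from public internet video; check the paper for the exact dataset size and composition). The model has roughly 11B parameters (check the paper for the exact configuration). The trained Genie can take an input image it has never seen (even a hand-drawn sketch or a real photo) and continue it under latent-action control.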

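What the experiments do

They mainly demonstrate three kinds of capability:

  • Playability demos: give it a still image (a game screenshot, a sketch, a real landscape photo), let a person choose latent actions, and see whether the video Genie continues "looks like playing a game"
  • Consistency of latent actions: whether the same latent action token behaves in a "semantically consistent" way across different input images (for example, always meaning "character moves right")
  • Downstream transfer: treat the latent action space as RL pretraining and see whether a usable policy can be fine-tuned with very few real action labels; or use Genie as a simulator for training agents

For specific numbers (FVD, human ratings, RL success rates, and so on), read the paper.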
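One simple way to picture "very few real action labels" is a lookup table from latent codes to real actions, fitted by majority vote on a handful of labeled transitions. This is an illustrative sketch, not the paper's procedure; the action names and the sample pairs are made up.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def fit_latent_to_real(pairs: Iterable[Tuple[int, Hashable]]) -> Dict[int, Hashable]:
    """Map each latent action code to the real action it co-occurs with most often."""
    votes = defaultdict(Counter)
    for latent, real in pairs:
        votes[latent][real] += 1
    return {latent: counts.most_common(1)[0][0] for latent, counts in votes.items()}


# Hypothetical labeled transitions: (latent code inferred from video, real action).
labeled = [
    (3, "RIGHT"), (3, "RIGHT"), (3, "JUMP"),
    (5, "JUMP"), (5, "JUMP"),
    (0, "NOOP"),
]

latent_to_real = fit_latent_to_real(labeled)
print(latent_to_real)

# A policy that outputs latent codes can now be executed in the real environment.
latent_plan = [3, 3, 5, 0]
real_plan = [latent_to_real[code] for code in latent_plan]
print(real_plan)
```

The same table doubles as a crude consistency check: a code whose votes are spread evenly across several real actions is not behaving consistently.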

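A few new terms you should know

  • Latent action: a "virtual key press" the model makes up for itself. It is not a real keyboard key, but it plays the same role: feed it in and it drives the picture to change
  • World model: a model that can "imagine in its head how the environment responds to an action"; the core of model-based RL
  • VQ-VAE / discrete tokenization: maps continuous vectors to discrete tokens from a finite codebook, a bit like quantizing continuous pitch onto the 88 keys of a piano
  • MaskGIT: a parallel image generation method that starts fully masked and fills back part of the tokens each round by confidence; faster than pure autoregression
  • Information bottleneck: deliberately limiting the capacity of an intermediate representation (for example, a codebook of only 8 latent-action codes) to force the model to learn "the essence that survives compression"
  • ST-Transformer (spatio-temporal Transformer): an attention architecture that handles both the spatial dimension (patches within a frame) and the temporal dimension (across frames)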

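How it relates to other papers

  • vs Dreamer / DreamerV3: Dreamer's world model runs closed-loop inside RL simulation environments but needs action labels; Genie goes the other way and learns from unlabeled video, though its demos so far are mostly in the 2D game domain
  • vs Sora / video diffusion: Sora-style models continue video passively; Genie adds the dimension of "latent-action control"
  • vs SIMA / general game agents: SIMA learns a policy for playing existing games, Genie learns to make games; the two can be combined (Genie as the simulator, SIMA as the player)
  • vs UniSim / 1X World Model: concurrent and follow-up work that pushes the "learn a world model from video" idea into robotics and real-world domains
  • Later impact: Genie 2 (released by DeepMind in late 2024) extends the approach to 3D, long sequences, and more complex physical interaction; it also spawned a wave of work on "latent action + video pretraining"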

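How I suggest reading it

  1. Watch the demo videos first: the Genie page on the DeepMind blog has plenty of GIFs. Build the intuition of "oh, so that's the kind of interaction" first, and the paper becomes harder to get lost in
  2. Dig into the Method section: the part on training the LAM and the Dynamics Model is the heart of the paper. Draw a data-flow diagram (input frames → tokens → LAM emits a latent action → next frame is reconstructed)
  3. Skim the Experiments: the qualitative playability demos matter more than the quantitative metrics; for FVD and similar numbers, the order of magnitude is enough
  4. Going further: afterwards read the Genie 2 blog post and compare which capabilities emerged with scale and which limitations of Genie 1 were resolved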

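Why it is worth reading

  • The methodological "neat trick" pays off: using an information bottleneck to force out latent actions is the kind of design that makes you slap your thigh once you hear it
  • It opens a new path: it brings "internet video", a massive unlabeled data source, into view for world models and RL pretraining, a qualitative change from the Dreamer line's "labels required"
  • A key node on the Embodied AI roadmap: to build general agents, "imagining an environment from nothing" and "distilling executable actions from what you watch" are two capabilities you cannot skip, and Genie works on both at once
  • Useful for people working on generative models too: on the old question of "where does the control signal come from" in controllable video generation, Genie's answer is "let the model learn the control signal itself"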

Cite this note
BibTeX
@online{eai_genie_2026,
  title       = {(readable note) Genie: Generative Interactive Environments},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/genie/}},
  organization = {Embodied AI Reading Station}
}
