High-Level Planning · Plate Nº 83

Tree-Planner

7 min read · 2402 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

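One-sentence summary (TL;DR)

Have the LLM write ten recipes in one go, merge the repeated steps into a tree, and cook by following the tree; if a branch fails, switch to another fork, without a long phone call to the chef at every step.

What the setting looks like: an everyday analogy

It is the weekend and you want to make scrambled eggs with tomatoes, but you cannot cook at all, so you have to learn by asking.

Naive approach: every time you make a cut or turn on the stove, you pull out your phone and call the chef to ask "what's next?". Each call costs 5 yuan, and the chef cannot remember what you asked a moment ago, so one answer may tell you to "add salt" and the next to "add sugar", contradicting each other.

Tree-Planner's approach: on the first call, ask the chef to write down 10 complete recipes in one go. The first few steps are nearly identical across all 10 ("wash the tomatoes, beat the eggs"), so merge those repeated steps and leave forks only where the recipes disagree. The result is an "action tree": the root is the shared opening, and the forks multiply further down.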

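While cooking you follow the tree. At a fork you look at what is actually in the pan and pick the branch that fits best; if that branch burns (the step fails when executed in the real environment), you go back to the previous fork and try another. The long, expensive call happens only once, at the very start. According to the paper's abstract, choosing at a fork is still done by the LLM, so think of it as a quick text message to the chef rather than a fresh call.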

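Mapped onto the paper: a robot in a virtual kitchen carries out long tasks such as "make breakfast" that take a dozen or more actions (pick up a cup → pour milk → open the oven → ...), and every step must actually be executed in the environment.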

Plate Nº I. Tree-Planner, scenario illustration: the real-world problem this paper addresses

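How earlier work approached it

  • Iterative planning (e.g. ReAct, Inner Monologue): at every step the LLM looks at the current state and then decides the next action. Token consumption is high, and the LLM tends to lose track of earlier context, so the plan becomes inconsistent.
  • Plan-and-Execute (e.g. the classic SayCan, ProgPrompt): the LLM generates a complete plan in one shot and the robot then executes it. The problem: once the plan goes wrong midway (the environment state does not match expectations), there is no fallback mechanism.
  • Tree-of-Thought (2023): on reasoning tasks the LLM repeatedly expands a tree, but every node needs another LLM call for scoring, which is costly, and the method targets pure reasoning rather than embodied tasks.
  • Self-consistency: sample the same question several times and vote, but this applies only to a single final answer and does not merge multiple plans structurally.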

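The paper's key idea

It is like stacking ten handwritten recipes and lining them up: the first few pages are almost word for word the same, and the differences are all in the second half.

Core observation: when an LLM samples several plans at once, many of their action prefixes repeat. So why not merge the repeated parts and leave forks only where the plans diverge? The result is an "action tree": any path from the root to a leaf is one complete plan, a merged node means the sampled plans agree at that step, and a fork means the LLM proposed several ways to proceed.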

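At execution time the agent no longer asks the LLM for a fresh plan. Instead it runs a grounded search over the tree that has already been built (grounded meaning: constrained by what the environment actually allows right now). The environment reports the current state and which actions are possible, and a feasible branch is picked from the tree; a wrong turn can be undone by backtracking to the previous fork. Note that, according to the paper's abstract, the top-down choice on the tree is still made by the LLM using real-time environment information, so the LLM is not entirely out of the loop; what is avoided is regenerating the plan from the full prompt at every step.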

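Benefits

  • Plan generation drops from O(plan length) LLM calls to O(1): the long prompt is paid for once, at sampling time. Choosing a branch on the tree still involves the LLM per the abstract, but those queries do not repeat the full prompt.
  • Error recovery comes from the tree structure itself, with no need for the LLM to replan from scratch
  • Sampling diverse plans in one shot raises the chance that at least one path in the tree succeeds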
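To make the cost argument concrete, here is a back-of-the-envelope sketch. It assumes the long part of the prompt (few-shot examples, object list) is sent once at sampling time and that each later step needs only a short exchange. The function names and all token counts are hypothetical, chosen for illustration; they are not figures from the paper.

```python
def iterative_tokens(steps: int, long_prompt: int, step_io: int) -> int:
    """Iterative planning: every step re-sends the long prompt plus a short exchange."""
    return steps * (long_prompt + step_io)


def sample_once_tokens(steps: int, long_prompt: int, step_io: int,
                       n_plans: int, plan_tokens: int) -> int:
    """Sample-once planning: long prompt sent once, N plans generated, short exchange per step."""
    sampling = long_prompt + n_plans * plan_tokens
    deciding = steps * step_io
    return sampling + deciding


# Hypothetical numbers, for illustration only.
iterative = iterative_tokens(steps=15, long_prompt=3000, step_io=150)
tree_like = sample_once_tokens(steps=15, long_prompt=3000, step_io=150,
                               n_plans=10, plan_tokens=120)
print(iterative, tree_like)  # prints: 47250 6450
```

The gap grows with the number of steps and with the size of the long prompt, which is why the saving matters most on long-horizon tasks.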
Plate Nº II. Tree-Planner, method illustration: the core pipeline

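How it works (method)

Step 1: Plan Sampling (like asking the chef to write 10 recipes in one go)
The LLM receives a prompt (task description + list of objects in the environment + list of available actions + few-shot examples), the "temperature" is turned up a little so that it is more adventurous, and N complete plans are sampled (the exact N needs the full paper; typically on the order of 10-50). Each plan is a sequence of actions, e.g. [walk to kitchen, open fridge, grab milk, ...]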
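As a rough illustration of the prompt ingredients listed above, the sketch below assembles them into one string. The template, the function name `build_sampling_prompt`, and the example task are all assumptions made for this note; the paper's actual prompt format is not reproduced here.

```python
from typing import List, Tuple


def build_sampling_prompt(task: str, objects: List[str], actions: List[str],
                          examples: List[Tuple[str, List[str]]]) -> str:
    """Assemble a hypothetical plan-sampling prompt from its four ingredients."""
    parts = []
    # Few-shot examples: each is a (task, plan) pair rendered as a numbered list.
    for ex_task, ex_plan in examples:
        steps = "\n".join(f"{i}. {step}" for i, step in enumerate(ex_plan, 1))
        parts.append(f"Task: {ex_task}\nPlan:\n{steps}")
    parts.append("Objects: " + ", ".join(objects))
    parts.append("Allowed actions: " + ", ".join(actions))
    # The prompt ends where the model is expected to continue with a plan.
    parts.append(f"Task: {task}\nPlan:")
    return "\n\n".join(parts)


prompt = build_sampling_prompt(
    task="make breakfast",
    objects=["fridge", "milk", "cup"],
    actions=["walk to", "open", "grab"],
    examples=[("get milk", ["walk to kitchen", "open fridge", "grab milk"])],
)
print(prompt)
```

The same prompt would then be sent once, asking the model for N completions at a raised temperature.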

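Wait, slow down a beat: what is this temperature? Think of it as how adventurous the LLM is allowed to be. At low temperature it gives the safest recipe every time; at high temperature it may write several different recipes, which is where the diversity comes from.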
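For readers who want the mechanics behind that intuition, this is the standard temperature-scaled softmax, written with the standard library only. The logits are made-up numbers standing in for a model's scores over three candidate next tokens.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one index according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]                       # made-up scores
cold = softmax_with_temperature(logits, 0.5)   # sharper: favors the top choice
hot = softmax_with_temperature(logits, 2.0)    # flatter: alternatives get more mass
print(round(cold[0], 2), round(hot[0], 2))
```

With these logits the top option gets about 0.84 of the probability at temperature 0.5 and about 0.48 at temperature 2.0, so repeated sampling at the higher setting produces more varied plans.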

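Step 2: Action Tree Construction (like stacking 10 handwritten recipes and lining them up)
The N plans are merged into a trie (prefix tree: a data structure that folds identical beginnings together). Plans with the same prefix share one path and branch only from the first point of divergence. The number of children under a node equals the number of distinct next actions the N plans proposed at that step. The tree has at most N root-to-leaf paths.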
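The merge described above is an ordinary trie insertion. The sketch below is a minimal version under the assumption that two actions are "the same" when their strings match exactly; the `Node` class, the `count` field, and the sample plans are illustrative choices, not the paper's data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    action: Optional[str] = None                       # None marks the root
    children: Dict[str, "Node"] = field(default_factory=dict)
    count: int = 0                                     # sampled plans passing through this node


def build_action_tree(plans: List[List[str]]) -> Node:
    """Merge plans into a trie: shared prefixes share one path."""
    root = Node()
    for plan in plans:
        node = root
        node.count += 1
        for action in plan:
            # Reuse the child if this action was already seen at this position.
            node = node.children.setdefault(action, Node(action))
            node.count += 1
    return root


def count_leaves(node: Node) -> int:
    """Number of root-to-leaf paths below this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children.values())


plans = [
    ["walk to kitchen", "open fridge", "grab milk"],
    ["walk to kitchen", "open fridge", "grab juice"],
    ["walk to kitchen", "grab cup"],
]
root = build_action_tree(plans)
print(list(root.children), count_leaves(root))  # prints: ['walk to kitchen'] 3
```

Exact string matching is the simplest merge rule; plans that differ only in wording would end up on separate branches under this assumption.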

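Step 3: Grounded Deciding (like following the tree and picking a fork based on what is actually in the pan)
The agent moves through the environment step by step. At each tree node:

  • Ask the environment which actions are possible right now (this is grounding: e.g. if the milk is not in view, "grab milk" is not possible)
  • Among this node's children, keep the ones that can be executed
  • If more than one remains, pick one. This note's guess is a heuristic ranking (e.g. the child with the most sampled plans under it, or the one that best matches the current situation); the paper's abstract says the LLM itself makes this choice using real-time environment information
  • If execution succeeds, move into that child node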
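The per-node procedure in the list above can be sketched as one small function. Everything here is an assumption for illustration: `children` maps each candidate action to the number of sampled plans behind it, `is_executable` stands in for the environment's feasibility check, and `choose` is a pluggable decision function where an LLM query could go. The fallback when `choose` is omitted is the plan-count heuristic, which is this note's guess rather than the paper's rule.

```python
from typing import Callable, Dict, List, Optional


def pick_child(children: Dict[str, int],
               is_executable: Callable[[str], bool],
               choose: Optional[Callable[[List[str]], str]] = None) -> Optional[str]:
    """Filter a node's children by feasibility, then select one of the survivors."""
    candidates = [action for action in children if is_executable(action)]
    if not candidates:
        return None  # nothing feasible here: the caller should backtrack
    if choose is not None:
        return choose(candidates)  # e.g. a short LLM query (hypothetical hook)
    # Fallback heuristic: the action backed by the most sampled plans.
    return max(candidates, key=lambda action: children[action])


children = {"grab milk": 5, "grab juice": 3}
visible = {"juice"}  # toy observation: only the juice is in view


def can_do(action: str) -> bool:
    return action.split()[-1] in visible


print(pick_child(children, can_do))  # prints: grab juice
```

Returning `None` rather than raising keeps the "no feasible child" case explicit, since that is the signal for backtracking.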

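Step 4: Backtracking (like hitting a wall in a maze and returning to the last junction)
When a step fails (the action errors out, or the environment's feedback does not match expectations), go back and try the current node's siblings; if all siblings have been tried, go up to the parent's level and take another path, continuing until there is a position from which execution can proceed. Backtracking happens on the tree itself and does not require the LLM to regenerate a plan.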
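A depth-first traversal captures the "try a sibling, then go up a level" behavior described above. This is a simplified sketch: the tree is a nested dict, `try_action` is a hypothetical stand-in for executing an action in the environment, and actions that already succeeded are not undone when the search moves up a level.

```python
from typing import Callable, Dict, List, Optional

Tree = Dict[str, "Tree"]  # action -> subtree; an empty dict marks a leaf


def execute_with_backtracking(tree: Tree,
                              try_action: Callable[[str], bool]) -> Optional[List[str]]:
    """Return the tree path that reached a leaf, or None if every branch failed."""
    if not tree:
        return []  # leaf reached: the plan is complete
    for action, subtree in tree.items():
        if not try_action(action):
            continue  # this step failed: try the next sibling
        rest = execute_with_backtracking(subtree, try_action)
        if rest is not None:
            return [action] + rest
        # Every branch below this action failed: fall through to the next sibling.
    return None


tree: Tree = {
    "walk to kitchen": {
        "open fridge": {"grab milk": {}, "grab juice": {}},
        "grab cup": {},
    }
}
attempts: List[str] = []


def try_action(action: str) -> bool:
    attempts.append(action)
    return action != "grab milk"  # toy environment: there is no milk


path = execute_with_backtracking(tree, try_action)
print(path)  # prints: ['walk to kitchen', 'open fridge', 'grab juice']
```

The returned list is the successful path through the tree, not a full log of what was executed; the `attempts` list shows that "grab milk" was tried and failed first.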

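What the experiments do

The experiments run mainly on VirtualHome (a virtual household environment in which an agent carries out long-horizon tasks such as cooking and cleaning).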

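Evaluation metrics:

  • Success Rate: whether the robot ultimately achieved the goal
  • Executability: the share of generated actions that the environment accepts
  • LLM token cost / call count: how much is saved relative to iterative methods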
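Following the definitions in the list above (which are this note's wording and may differ in detail from the paper's), the first two metrics can be computed from episode logs like this. The log format and the sample episodes are hypothetical.

```python
from typing import Dict, List


def success_rate(episodes: List[Dict]) -> float:
    """Fraction of episodes whose goal was achieved."""
    return sum(1 for ep in episodes if ep["success"]) / len(episodes)


def executability(episodes: List[Dict]) -> float:
    """Fraction of all generated actions that the environment accepted."""
    total = sum(len(ep["accepted"]) for ep in episodes)
    accepted = sum(sum(ep["accepted"]) for ep in episodes)  # True counts as 1
    return accepted / total


# Hypothetical logs: one flag per generated action, True if the environment accepted it.
episodes = [
    {"success": True, "accepted": [True, True, True, True]},
    {"success": False, "accepted": [True, False, True, True]},
]
print(success_rate(episodes), executability(episodes))  # prints: 0.5 0.875
```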

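Baselines: iterative planning (e.g. ReAct), plan-and-execute (e.g. ProgPrompt), and single-plan sampling.

The abstract reports a 92.2% reduction in token consumption relative to the previous best-performing model and a 40.5% decrease in error corrections; other figures, such as success-rate gains, need the full paper. Papers of this kind typically report results separately by task complexity (short vs long horizon) and ablate N (the number of samples) and the backtracking strategy.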

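A few new terms you should know

  • Embodied Agent: an agent that has a "body" in a virtual or real environment and can perceive and act. Unlike a pure chatbot, its outputs change the environment.
  • Grounding: aligning the actions an LLM proposes in theory with the actions the environment can actually execute right now. For example, if the LLM says "grab the cup" but no cup is in view, that action is not grounded.
  • Trie (prefix tree): a data structure that merges multiple sequences so that they share common prefixes. Tree-Planner's "action tree" is essentially a trie over action sequences.
  • Backtracking: the mechanism by which a search algorithm, on reaching a dead end, returns to the previous fork and chooses again. Here it means returning to an earlier node in the tree when execution fails.
  • VirtualHome: a widely used embodied AI benchmark that provides household scenes and an action API (go to / grab / open, etc.).
  • Plan-and-Execute vs Iterative Planning: two LLM planning paradigms. The former produces a complete plan once and then executes it; the latter replans at every step. Tree-Planner sits between the two: plan once, but keep several paths open.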

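How it relates to other papers

  • vs ReAct / Inner Monologue (iterative): Tree-Planner generates plans in a single sampling call instead of replanning at every step, which cuts token use substantially; the price is that the initial samples must be diverse enough, otherwise the tree does not cover a correct path.
  • vs SayCan / ProgPrompt (one-shot planning): through multiple samples plus the tree structure, Tree-Planner gains error recovery, whereas a single-plan method has no fallback once something goes wrong midway.
  • vs Tree-of-Thought (reasoning tasks): the idea is similar (a search tree), but ToT needs LLM scoring and expansion at every node, whereas Tree-Planner materializes the whole tree up front and does not grow it with new LLM generations during execution. In that sense Tree-Planner is a low-cost version of the ToT idea for embodied planning.
  • Later influence: it is often listed alongside LLM-Planner and AdaPlanner as a representative method of the "LLM as Planner" paradigm. Later work (e.g. some hierarchical planning from 2024 onward) goes on to combine the tree structure with world models and value functions.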

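How I suggest reading it

  1. Start with Figure 1 (usually the method overview): get the three-stage flow "sample → merge into a tree → execute + backtrack" clear. This is the backbone of the paper; once you understand this figure you have basically got it.
  2. Look at the prompt design for Plan Sampling: understand what exactly goes into the LLM (task description / object list / few-shot examples), since this sets the ceiling on sample quality.
  3. Look at the concrete rules of Grounded Deciding: how is the next step chosen at a fork? This is an engineering detail, but it determines real-world performance.
  4. Look at the ablations: how large does N need to be? How do backtracking strategies compare? These results show where the method is sensitive and how to tune it in practice.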

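Why it is worth reading

  • A clean engineering idea: upgrading "sample many times + vote" to "sample many times + merge structurally" is close to a plug-and-play optimization that can be applied to many LLM planning settings.
  • Understanding the trade-offs among embodied planning paradigms: this paper makes the costs of iterative, one-shot, and tree-based methods easy to compare.
  • A starting point for follow-up work: many LLM agent papers from 2024 onward (search + planning + tool use) borrow the idea of sampling several candidates once and then searching over a structure, so this paper is a good entry point.
  • High value as an engineering reference: the method is not hard to implement (trie merging + simple backtracking), which makes it a suitable reference implementation for a first embodied agent project.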

Cite this note
BibTeX
@online{eai_tree_planner_2026,
  title       = {(readable note) Tree-Planner},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/tree-planner/}},
  organization = {Embodied AI Reading Station}
}
