High-Level Planning · Plate Nº 77

LLM+P: Empowering LLMs with Optimal Planning

6 min read · 1995 characters · ⭐⭐⭐ · Short summary

This note is based on the abstract and public material; the full paper was not read.

In one sentence (TL;DR)

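Let the LLM act only as a translator: it converts what you say into a machine-readable format, and the actual planning is handed to a long-established algorithm. The LLM does the talking; the algorithm does the thinking.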

What is the setting

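You are travelling abroad and want to book the cheapest connecting flight, but you do not speak English and cannot operate the airline's query system.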

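Luckily you have a friend who speaks both Chinese and English and knows how to use the old airport terminal that only accepts SQL commands. The process then looks like this: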

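  • You say in Chinese: "I want to go from Beijing to New York, my budget is 5000, and I cannot transfer in Chicago"
  • Your friend translates this into SQL and types it into the terminal
  • The terminal grinds away for a while and outputs an optimal flight combination
  • Your friend translates the result back into Chinese: "Tomorrow at 8 a.m., fly Beijing to Tokyo, connect two hours later to the New York flight, 3800 in total"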

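The paper does exactly the same thing: you (the user) speak natural language, the friend (the LLM, large language model) translates, and the terminal (a classical planner) actually computes the optimal solution. The SQL in the middle corresponds to a machine format called PDDL (Planning Domain Definition Language). The LLM cannot plan reliably on its own, but it is good at moving between two languages.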

Plate Nº I. LLM+P, scene illustration: the real-world problem this paper addresses

What earlier work did

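  • Pure LLM planning: ask the LLM to generate the action sequence directly ("pick up the cup, then pour the water, then drink"). Problem: once the number of steps grows it starts to make things up, with no guarantee that the goal is reachable and no guarantee of optimality
  • Chain-of-Thought (CoT): ask the LLM to reason step by step. This works well on simple problems but still fails on classical planning tasks such as Blocksworld that require multi-step search
  • Reinforcement learning (RL) planners: train a dedicated policy network. Problem: poor generalization, so a new domain means retraining
  • Classical planners (e.g. Fast Downward): the search is sound and complete, and with an optimal configuration the returned plan is provably optimal, but they only accept PDDL input, and writing human requirements as PDDL is expert work
  • The "natural language → PDDL" step was the missing bridge, so ordinary users had no practical way to use classical planners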

The paper's key idea

Treat the LLM as a "natural language ↔ PDDL" translation layer, not as the planner itself.

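Core insight: LLMs are more reliable at symbolic generation (writing code, writing formatted text) than at long-horizon reasoning. So rather than making the LLM do what it is bad at (deriving a plan step by step), let it do what it is good at (producing a syntactically valid PDDL file) and hand the reasoning to a tool with correctness guarantees.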

This is a typical neuro-symbolic approach: the neural network handles fuzzy language understanding, and the symbolic system handles exact logical search.

Plate Nº II. LLM+P, method illustration: the core pipeline

How it works (method)

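Step 1: an expert writes the "kitchen manual" in advance, and the LLM only fills in today's order. Think of a restaurant: which pots the kitchen has and which dishes it can make (the domain file) is written once by the head chef; what today's customer ordered and how it should turn out (the problem file) is filled in each time. The paper has a human expert write the domain up front, and the LLM is only responsible for translating the customer's words into a problem.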

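Wait, slow down: what exactly are a PDDL domain and problem? Think of them as the "rules of the game" and "the setup of this level". The rules are written once (which actions exist in this world and what their preconditions are); the level differs every time (what the blocks look like now and how they should be stacked).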
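
To make the rules-versus-level split concrete, here is a minimal sketch that holds a tiny Blocksworld domain and problem as Python strings and inspects them. The PDDL text is an illustrative one-action fragment written for this note, not the file used in the paper, and the helper names `parens_balanced` and `section_names` are hypothetical.

```python
import re

# Illustrative fragment (a single action), NOT the paper's domain file.
DOMAIN = """
(define (domain blocksworld)
  (:predicates (clear ?x) (on-table ?x) (holding ?x) (arm-empty))
  (:action pick-up
    :parameters (?x)
    :precondition (and (clear ?x) (on-table ?x) (arm-empty))
    :effect (and (holding ?x) (not (clear ?x))
                 (not (on-table ?x)) (not (arm-empty)))))
"""

# The "level": which blocks exist, how they start, what we want.
PROBLEM = """
(define (problem hold-a)
  (:domain blocksworld)
  (:objects a b)
  (:init (clear a) (clear b) (on-table a) (on-table b) (arm-empty))
  (:goal (holding a)))
"""


def parens_balanced(text: str) -> bool:
    """Cheap syntax check: every '(' must be closed, in order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def section_names(text: str) -> list[str]:
    """List the '(:keyword' sections in the order they appear."""
    return re.findall(r"\(:([a-z-]+)", text)


print(section_names(DOMAIN))   # rules: predicates and actions
print(section_names(PROBLEM))  # level: objects, initial state, goal
```

The domain only names predicates and actions with `?x` variables, while the problem names concrete objects. That is why one domain file can serve many problem files.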

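Step 2: translate by imitation (few-shot prompting). It is like teaching a child word problems: first show one worked example, then give a new problem. Show the LLM one "natural language + matching PDDL" example and it can imitate the pattern to translate a new task. The LLM does not need to understand planning; it is only doing "pattern matching + fill in the blanks".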
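
The worked-example-then-new-problem pattern can be sketched as a small prompt builder. The wording of the prompt and the function name `build_translation_prompt` are assumptions made for illustration; the paper's actual template may differ.

```python
def build_translation_prompt(example_task: str, example_pddl: str, new_task: str) -> str:
    """Assemble a one-shot prompt: one solved example, then the new task.

    The section labels below are hypothetical, not the paper's template.
    """
    return (
        "Translate the task description into a PDDL problem file.\n\n"
        f"Example task:\n{example_task.strip()}\n\n"
        f"Example PDDL:\n{example_pddl.strip()}\n\n"
        f"New task:\n{new_task.strip()}\n\n"
        "New PDDL:\n"
    )


prompt = build_translation_prompt(
    example_task="Blocks a and b are on the table. Hold block a.",
    example_pddl="(define (problem hold-a) (:domain blocksworld) ...)",
    new_task="Blocks a and b are on the table. Put a on top of b.",
)
print(prompt)
```

The prompt ends right where the model is expected to continue, so the completion is the new problem file and nothing else.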

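Step 3: hand the work to the workhorse, the classical planner. The problem.pddl written by the LLM and the human-written domain.pddl are passed together to Fast Downward (an open-source classical planner), which computes an action sequence that, under an optimal search configuration, is guaranteed optimal (fewest steps or lowest cost). This step does not rely on a neural network; it relies on decades of search algorithms.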
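
A minimal sketch of this hand-off, assuming a local Fast Downward checkout whose driver script is `fast-downward.py`. The driver path, the file names and the choice of the `seq-opt-lmcut` alias are assumptions for illustration; the note does not say which configuration the paper used.

```python
import subprocess
import sys
from pathlib import Path


def fast_downward_command(driver, domain, problem,
                          plan_file="sas_plan", alias="seq-opt-lmcut"):
    """Build the argument list for the Fast Downward driver script.

    Driver options (--alias, --plan-file) go before the two input files.
    """
    return [sys.executable, str(driver),
            "--alias", alias,
            "--plan-file", str(plan_file),
            str(domain), str(problem)]


def run_planner(driver, domain, problem, plan_file="sas_plan"):
    """Run the planner and return the plan text, or None if no plan was written."""
    plan_path = Path(plan_file)
    plan_path.unlink(missing_ok=True)  # avoid reading a stale plan from an earlier run
    completed = subprocess.run(
        fast_downward_command(driver, domain, problem, plan_file),
        capture_output=True, text=True,
    )
    if completed.returncode != 0 or not plan_path.exists():
        return None
    return plan_path.read_text()


# Hypothetical paths; adjust to wherever the planner is installed.
print(" ".join(fast_downward_command("downward/fast-downward.py",
                                     "domain.pddl", "problem.pddl")))
```

Returning `None` instead of raising keeps the caller in control: a missing plan usually means the translated problem file was invalid or unsolvable, which is worth reporting back to the user rather than crashing.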

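Step 4: translate the "machine talk" back into natural language. The planner outputs symbols such as (pick-up A) (stack A B), and the LLM reads them out as: "First pick up block A, then put it on B." The user only ever sees natural language going in and natural language coming out; the workhorse in the middle stays hidden from them.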
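
The shape of this last step can be sketched without an LLM. The code below parses plan lines of the form `(action arg ...)` and renders them with fixed sentence templates. The templates are a stand-in for the LLM's back-translation, not the paper's method, and the action names assume the usual four-operator Blocksworld encoding.

```python
# Hypothetical sentence templates, keyed by Blocksworld action name.
TEMPLATES = {
    "pick-up": "pick up block {0} from the table",
    "put-down": "put block {0} down on the table",
    "stack": "stack block {0} on top of block {1}",
    "unstack": "lift block {0} off block {1}",
}


def parse_plan(plan_text: str) -> list[tuple[str, list[str]]]:
    """Turn '(name arg ...)' lines into (name, args); skip blanks and ';' comments."""
    steps = []
    for line in plan_text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        name, *args = line.strip("()").split()
        steps.append((name, args))
    return steps


def describe(steps: list[tuple[str, list[str]]]) -> list[str]:
    """Render each step as a numbered sentence; unknown actions fall back to raw text."""
    sentences = []
    for number, (name, args) in enumerate(steps, start=1):
        template = TEMPLATES.get(name)
        if template is not None:
            text = template.format(*(arg.upper() for arg in args))
        else:
            text = " ".join([name, *args])
        sentences.append(f"Step {number}: {text}.")
    return sentences


plan = "(pick-up a)\n(stack a b)\n; cost = 2 (unit cost)\n"
for sentence in describe(parse_plan(plan)):
    print(sentence)
```

A template table only works for a domain known in advance. Using an LLM for this step removes that restriction, at the price of output that is no longer guaranteed to match the plan word for word.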

What the experiments do

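  • Test domains: classical planning benchmarks with a robot flavour, such as Blocksworld, Barman (a bartending robot), Termes (termite-inspired robots building structures), Tyreworld and Floortile
  • Baselines: pure LLM (GPT-4 generating the action sequence directly) and CoT prompting
  • Metrics: success rate (does the generated plan actually reach the goal) and optimality (is the number of steps minimal)
  • Main conclusion: LLM+P wins on almost every task that needs long-horizon planning, while the pure LLM often fails once a task needs five or more steps; the exact accuracy gains require reading the paper
  • Failure mode: the LLM occasionally drops a predicate or writes a wrong object name when translating to PDDL, and then the whole pipeline fails. The paper also discusses this kind of translation error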
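
The translation failure mode in the last bullet suggests a cheap sanity check before the planner is called. The sketch below flags predicates the domain does not define and objects that were never declared. It is an illustration written for this note, not something taken from the paper, and it only catches surface errors: a predicate that is silently missing from the initial state cannot be detected this way.

```python
import re


def declared_objects(problem: str) -> set[str]:
    """Names listed in (:objects ...), ignoring '- type' annotations."""
    match = re.search(r"\(:objects([^)]*)\)", problem)
    if match is None:
        return set()
    names, skip_next = set(), False
    for token in match.group(1).split():
        if token == "-":          # the next token is a type, not an object
            skip_next = True
        elif skip_next:
            skip_next = False
        else:
            names.add(token)
    return names


def ground_atoms(problem: str) -> list[tuple[str, list[str]]]:
    """(predicate, args) for every innermost atom from (:init onwards."""
    start = problem.find("(:init")
    if start == -1:
        return []
    atoms = []
    for inner in re.findall(r"\(([^()]+)\)", problem[start:]):
        name, *args = inner.split()
        if not name.startswith(":"):   # skip an empty '(:init)' or '(:goal)'
            atoms.append((name, args))
    return atoms


def translation_errors(problem: str, known_predicates: set[str]) -> list[str]:
    """Report unknown predicates and undeclared objects, in order of appearance."""
    objects = declared_objects(problem)
    errors = []
    for name, args in ground_atoms(problem):
        if name not in known_predicates:
            errors.append(f"unknown predicate: {name}")
        errors.extend(f"undeclared object: {arg}" for arg in args if arg not in objects)
    return errors


PREDICATES = {"clear", "on", "on-table", "holding", "arm-empty"}
bad = ("(define (problem p) (:domain blocksworld) (:objects a b) "
       "(:init (clear a) (on a b) (on-tabel b) (arm-empty)) (:goal (on b c)))")
print(translation_errors(bad, PREDICATES))
```

When the list is non-empty, one option is to feed the messages back to the LLM and ask for a corrected problem file instead of passing broken PDDL to the planner.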

New terms you should know

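  • PDDL (Planning Domain Definition Language): the "standard format" of the planning field, used since 1998 as the common input language of planning competitions. It is split into a domain (the rules of the world) and a problem (the specific task)
  • classical planning: planning problems that are fully observable, deterministic and have discrete actions. Blocksworld is the textbook example
  • domain file / problem file: the domain is written once and describes the world (which predicates, actions, preconditions and effects exist); the problem is written each time and describes the current task (initial state + goal)
  • Fast Downward: an open-source classical planner and a de facto standard in planning research. Given valid PDDL and a solvable problem it returns a plan, and with an optimal search configuration that plan is optimal
  • neuro-symbolic: a hybrid architecture of neural networks and symbolic systems. This paper is a very clear example
  • few-shot prompting: put a few examples in the prompt (typically 1-3) and let the LLM imitate them. No fine-tuning needed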
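
The glossary entries on preconditions and effects can be made concrete with a small STRIPS-style executor: an action applies only when its preconditions hold in the current state, and applying it removes the delete effects and adds the add effects. The encoding of the four Blocksworld operators below is the common textbook one and is assumed here, not copied from the paper's domain file. The same check is what a "success rate" metric needs: does the plan, run from the initial state, reach the goal?

```python
from typing import NamedTuple


class Action(NamedTuple):
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts made true by the action
    delete: frozenset   # facts made false by the action


def ground(name: str, args: list[str]) -> Action:
    """Instantiate one of the four textbook Blocksworld operators."""
    if name == "pick-up":
        (x,) = args
        facts = frozenset({("clear", x), ("on-table", x), ("arm-empty",)})
        return Action(facts, frozenset({("holding", x)}), facts)
    if name == "put-down":
        (x,) = args
        facts = frozenset({("clear", x), ("on-table", x), ("arm-empty",)})
        return Action(frozenset({("holding", x)}), facts, frozenset({("holding", x)}))
    if name == "stack":
        x, y = args
        used = frozenset({("holding", x), ("clear", y)})
        return Action(used, frozenset({("on", x, y), ("clear", x), ("arm-empty",)}), used)
    if name == "unstack":
        x, y = args
        used = frozenset({("on", x, y), ("clear", x), ("arm-empty",)})
        return Action(used, frozenset({("holding", x), ("clear", y)}), used)
    raise ValueError(f"unknown action: {name}")


def plan_reaches_goal(init, plan, goal) -> bool:
    """Execute the plan step by step; fail on any unmet precondition."""
    state = set(init)
    for name, args in plan:
        action = ground(name, args)
        if not action.pre <= state:
            return False
        state = (state - action.delete) | action.add
    return set(goal) <= state


init = {("clear", "a"), ("clear", "b"), ("on-table", "a"), ("on-table", "b"), ("arm-empty",)}
goal = {("on", "a", "b")}
print(plan_reaches_goal(init, [("pick-up", ["a"]), ("stack", ["a", "b"])], goal))
print(plan_reaches_goal(init, [("stack", ["a", "b"])], goal))
```

The second plan is rejected because `stack` requires the arm to be holding `a`. A pure LLM planner can emit such a step without noticing, whereas a classical planner never will.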

How it relates to other papers

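  • Compared with the "LLM as planner" line of work (SayCan, Inner Monologue): LLM+P goes the opposite way. It does not let the LLM plan, only translate. A more "modest" position
  • Same family as Code as Policies: both follow the idea of "the LLM generates a structured language (code / PDDL) and hands it to a lower layer for execution". CaP generates Python, LLM+P generates PDDL
  • Follow-up work: it was followed by a series of "LLM + formal planning" papers such as LLM-DP, PDDLego and AutoTAMP, and it became a template for how the task-and-motion-planning (TAMP) community plugs in LLMs
  • Compared with ReAct: ReAct has the LLM reason while interacting; LLM+P is "translate once, let the planner finish", which suits static tasks with a clear goal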

How I suggest reading it

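  1. First see what PDDL looks like: find a Blocksworld domain.pddl + problem.pddl example and read it for five minutes, until you know what (:predicates ...) and (:action ...) are
  2. Skim Sections 3-4 of the paper: look closely at the prompt template and the pipeline diagram, and understand exactly where the LLM's input and output begin and end
  3. Run the demo: the authors released a GitHub repository (search for "LLM+P"; note that "LLM-Planner" is the name of a different paper). Run a Blocksworld example and watch natural language turn into PDDL and back into natural language
  4. Think about its limits: the domain file is still written by a human; if the task the user describes goes beyond what the domain can express (for example probabilities or continuous values), the whole architecture no longer applies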

Why it is worth reading

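  • Methodological value: it demonstrates a "play to strengths" hybrid architecture. When a task is something LLMs are bad at, first ask whether the LLM can do only the part it is good at
  • Historical position: one of the most important "LLM + classical tool" papers in embodied AI / agents from 2023, cited by a large body of later work
  • Friendly to beginners: the paper is short, the idea is clear, and no deep-learning details are required; after reading it you can explain what neuro-symbolic means
  • Critical perspective: it also shows the real boundary behind the LLM's "seemingly all-capable" image. It is unreliable for serious planning and needs an external solver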

Cite this note
BibTeX
@online{eai_llm_plus_p_2026,
  title       = {(readable note) LLM+P: Empowering LLMs with Optimal Planning},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/llm-plus-p/}},
  organization = {Embodied AI Reading Station}
}
