End-to-End VLA · Plate Nº 123

TraceVLA: Visual Trace Prompting

6 min read · 2136 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public material; the full paper has not been read.

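In one sentence (TL;DR)

Where has the robot's hand just been? TraceVLA draws that path directly onto the image the robot sees, so it can look at its own footprints before deciding where to move next.

What is the setting

Imagine a game: every second you are shown one photo of a kitchen and asked where the spatula should swing in the next second. The catch is that every photo stands alone, and you have no memory of where you swung a second ago. You end up going in circles in the pan, stirring the left side three times and never touching the right.

Robots that cook (or grasp blocks, or place cups) are in the same position today. At each step the policy sees only the current frame, so the next action is largely a guess, because it does not know where it has just moved.

TraceVLA's fix is like mounting a highlighter on the edge of the pan: wherever the spatula goes, a glowing trail is left in the picture. The next time the robot glances at the scene, the current image already carries its recent footprints. There is no recalling and no reading of history files; one look at the image says "I have already stirred the left side, so the right side is next."

The key point: the trajectory is not stuffed into text (coordinates such as "the hand was just at (0.3, 0.5, 0.2)") but drawn directly into the image, so the model digests it the same way it reads any picture.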

Plate Nº I. TraceVLA, scene illustration: the real-world problem this paper addresses

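What came before

  • Single-frame VLAs such as OpenVLA / RT-2: each step sees only the current RGB frame and discards history. The transformer is left to model temporal structure implicitly, but with a single-frame input the information is simply incomplete.
  • Frame stacking: concatenate the past N frames and feed them to the model together. The cost: the token count balloons, long-context training is hard, and most of the pixels are redundant.
  • Textualized action history: append the action tokens of the past few steps (e.g. <a1><a2><a3>) to the prompt. The problem: action space and visual space are separate, so the model has to align across modalities before it can use the history.
  • RT-Trajectory (the same family of ideas): draws the target trajectory on the image as the task instruction. It is the mirror image of TraceVLA: one draws the road still to be traveled, the other the road already traveled.
  • Implicit memory modules (e.g. RNN / Mamba / state variables): compress history with a recurrent structure. But mainstream VLAs are decoder-only transformers, and bolting on a recurrent architecture is costly.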
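
To put a rough number on the frame-stacking cost mentioned above, here is a back-of-the-envelope sketch. The figures are assumptions for illustration, not values from the paper: 256 visual tokens per frame (for example, a 224x224 image cut into 14x14 patches), 32 text tokens, and self-attention cost growing with the square of the sequence length. The function names are hypothetical.

```python
def visual_tokens(num_frames: int, patches_per_frame: int = 256) -> int:
    """Visual tokens fed to the model when num_frames images are stacked."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return num_frames * patches_per_frame


def attention_cost_ratio(
    num_frames: int, text_tokens: int = 32, patches_per_frame: int = 256
) -> float:
    """Rough self-attention cost relative to the single-frame case.

    Assumes cost is quadratic in total sequence length (an approximation).
    """
    single = visual_tokens(1, patches_per_frame) + text_tokens
    stacked = visual_tokens(num_frames, patches_per_frame) + text_tokens
    return (stacked / single) ** 2


if __name__ == "__main__":
    for n in (1, 2, 4, 6):
        print(n, visual_tokens(n), round(attention_cost_ratio(n), 1))
```

Under these assumptions the token count grows linearly with the number of frames while the attention cost grows roughly quadratically, which is the pressure a single trace-overlaid image is meant to avoid.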

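The paper's key idea

It is like giving directions to a friend who speaks excellent English but no Chinese: do not struggle to translate into Chinese, just draw a map.

Core insight: a VLM (vision-language model, a pretrained model that already excels at reading images) is very good at interpreting pictures. So history does not need a separate input channel either; draw it as part of the image and feed it in.

Three concrete steps:

  • Take the gripper's 3D positions over the most recent K steps and project them into the current camera view as 2D pixel points
  • Connect those points into a line (the trace) and render it on top of the current RGB frame
  • Use this "image with a trace" as the VLA's visual input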

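Benefits:

  • Zero new tokens: it is still a single image, so the model's context does not grow
  • Zero new modules: the existing VLA architecture and weights are used as-is
  • Temporal information made visible: the model can tell at a glance "I am close to the target" or "I am circling in place"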

Plate Nº II. TraceVLA, method illustration: the core pipeline

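How it works (method)

Trace generation: like marking "I just passed through these points" on a map. At each time step t, look back over the gripper's 3D positions for the past K steps (the exact value of K requires reading the paper), use the camera parameters to project them to 2D pixel points in the current view, and connect them in temporal order. Color or thickness may encode how long ago a point was visited: the older, the fainter or thinner, like fading footprints.

Hold on a moment: what are "camera extrinsics + intrinsics"? In short, the extrinsics say where the camera stands and which way it looks; the intrinsics say how the lens flattens the 3D world into a 2D photo. Only with both can you compute which pixel a given 3D point lands on.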
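
The extrinsics-then-intrinsics idea can be sketched with a standard pinhole camera model. This is a minimal illustration under stated assumptions, not the paper's code: the extrinsics are taken to be a world-to-camera rotation matrix and translation vector, the intrinsics are the focal lengths `fx`, `fy` and principal point `cx`, `cy`, and lens distortion is ignored. The function names `project_point` and `project_trace` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[float, float]


def project_point(
    p_world: Sequence[float],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> Optional[Pixel]:
    """Project one 3D world point to pixel coordinates (u, v).

    Returns None when the point lies behind the camera.
    """
    # Extrinsics: move the point from the world frame into the camera frame.
    x, y, z = (
        sum(rotation[i][j] * p_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if z <= 0:
        return None
    # Intrinsics: perspective divide, then scale and shift into pixel units.
    return (fx * x / z + cx, fy * y / z + cy)


def project_trace(
    points_world: Sequence[Sequence[float]],
    rotation: Sequence[Sequence[float]],
    translation: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Pixel]:
    """Project the last K gripper positions, skipping points behind the camera."""
    pixels: List[Pixel] = []
    for p in points_world:
        uv = project_point(p, rotation, translation, fx, fy, cx, cy)
        if uv is not None:
            pixels.append(uv)
    return pixels
```

For example, with an identity rotation, zero translation, `fx = fy = 500` and a principal point of (320, 240), the point (0.5, -0.25, 2.0) lands at pixel (445.0, 177.5).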

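Visual overlay: like painting on a Photoshop layer. Render the trace directly onto the current RGB image to get an "augmented image". This step is pure drawing with no gradients flowing through it, similar to data augmentation. The augmented image replaces the original image as the VLA's visual input.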
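
A dependency-free sketch of what such an overlay could look like follows. Everything here is an assumption for illustration: the image is a nested list `image[row][col] = [r, g, b]`, the trace is a list of integer `(u, v)` pixels ordered oldest to newest, and age is encoded by opacity (the note only says color or thickness may encode age). The name `overlay_trace` is hypothetical; a real pipeline would more likely draw with an image library.

```python
import copy
from typing import List, Sequence, Tuple

Image = List[List[List[int]]]  # image[row][col] = [r, g, b]


def overlay_trace(
    image: Image,
    pixels: Sequence[Tuple[int, int]],
    color: Tuple[int, int, int] = (255, 0, 0),
) -> Image:
    """Return a copy of image with the trace drawn on it; the input is not modified.

    Older segments are blended more faintly, the newest segment is fully opaque.
    With fewer than two points nothing is drawn.
    """
    out = copy.deepcopy(image)
    height, width = len(out), len(out[0])
    num_segments = len(pixels) - 1
    for i in range(num_segments):
        (u0, v0), (u1, v1) = pixels[i], pixels[i + 1]
        alpha = (i + 1) / num_segments  # oldest segment is faintest
        # Sample enough points along the segment to leave no gaps.
        steps = max(abs(u1 - u0), abs(v1 - v0), 1)
        for s in range(steps + 1):
            u = round(u0 + (u1 - u0) * s / steps)
            v = round(v0 + (v1 - v0) * s / steps)
            if 0 <= u < width and 0 <= v < height:
                src = image[v][u]  # blend against the original pixel
                out[v][u] = [
                    round((1 - alpha) * src[c] + alpha * color[c]) for c in range(3)
                ]
    return out
```

Because the function only edits pixel values, the output has the same shape as the input, which is what lets it stand in for the original frame without touching the model.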

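Model and training: copy the homework, but copy it smartly. The base model is most likely OpenVLA (an earlier open-source VLA). It is trained with SFT (supervised fine-tuning) on trace-overlaid images, and the objective is still to predict the next action tokens. The paper presumably compares:

  • baseline: the original OpenVLA (no trace)
  • TraceVLA: the same model with traces, given the same training data / number of steps

Inference: draw as you go. At every step the trace is computed in real time, overlaid on the current frame, and fed to the model to produce an action. Inference gains one lightweight line-drawing step, but the model's forward pass itself is unchanged.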
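
The draw-as-you-go loop can be sketched with a fixed-length buffer that keeps only the last K gripper pixels. This is a hedged outline, not the paper's implementation: `observe`, `render`, `policy`, and `act` are hypothetical callables standing in for the camera and state reader, the trace renderer, the VLA forward pass, and the robot controller.

```python
from collections import deque
from typing import Any, Callable, Deque, List, Sequence, Tuple

Pixel = Tuple[int, int]


class TraceBuffer:
    """Keeps the most recent k gripper pixels, oldest first."""

    def __init__(self, k: int) -> None:
        # deque with maxlen drops the oldest entry automatically when full.
        self._points: Deque[Pixel] = deque(maxlen=k)

    def push(self, pixel: Pixel) -> None:
        self._points.append(pixel)

    def points(self) -> List[Pixel]:
        return list(self._points)


def control_loop(
    observe: Callable[[], Tuple[Any, Pixel]],
    render: Callable[[Any, Sequence[Pixel]], Any],
    policy: Callable[[Any], Any],
    act: Callable[[Any], None],
    k: int,
    num_steps: int,
) -> List[Any]:
    """Run num_steps of observe -> draw trace -> predict -> act."""
    buffer = TraceBuffer(k)
    actions: List[Any] = []
    for _ in range(num_steps):
        frame, gripper_pixel = observe()
        buffer.push(gripper_pixel)
        prompted = render(frame, buffer.points())  # the extra drawing step
        action = policy(prompted)  # the model call itself is unchanged
        act(action)
        actions.append(action)
    return actions
```

On the first step the buffer holds a single point, so the rendered frame carries no line yet and the policy effectively sees a plain image.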

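What the experiments do

Expected evaluation axes (concrete numbers require reading the paper):

  • Simulation: standard VLA benchmarks such as SIMPLER-Env and LIBERO, comparing success-rate gains against baselines such as OpenVLA / Octo
  • Real robot: possibly long-horizon tasks on WidowX or Franka (pick-place, stacking, articulated objects)
  • Ablations: how to choose the trace length K, the effect of the trace's visual style (color / thickness / opacity), and whether history action tokens are needed alongside it
  • Failure-mode analysis: which tasks the trace cannot help with, for example a completely static opening phase where the trace is empty and the input is equivalent to having no trace

Key question: how robust is the trace in OOD (out-of-distribution) scenes? The pretrained backbone never saw images with lines drawn on them before fine-tuning, so the method leans on the visual common sense from VLM pretraining. That transfer ability is the core evidence for the paper's value.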

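New terms you should know

  • VLA (Vision-Language-Action): a large model that maps images plus a language instruction directly to robot action tokens, e.g. RT-2, OpenVLA.
  • End-effector: the "hand" at the very end of a robot arm, usually a gripper. Its position and orientation are key state for robot control.
  • Visual prompt: the counterpart of a text prompt. It steers model behavior by modifying the input image, for example by drawing boxes or arrows or overlaying a mask.
  • Trace / Trajectory: a path formed by connecting a time-ordered series of positions. Here it means the end-effector's motion over the past K steps.
  • Frame stacking: the naive approach of concatenating several frames and feeding them to the model as temporal input.
  • OpenVLA: an open-source VLA base model; TraceVLA is most likely built on it. See the note of the same name under learnings/openvla (if it exists).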

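How it relates to other papers

  • OpenVLA (foundation): TraceVLA is its "lightweight augmented version": the same model, with a change on the input side that lifts performance.
  • RT-Trajectory (DeepMind, 2023): draws the target trajectory on the image as the instruction; TraceVLA draws the past trajectory on the image as the state. One looks forward, the other looks back; the two ideas are duals.
  • RT-2 / Octo: also VLAs, but they handle temporal structure through multiple frames or large-scale data. TraceVLA argues that "one image + visual priors" is enough, which is the cheaper direction.
  • Inner Monologue / Code as Policies: handle history through LLM text reasoning. TraceVLA takes a purely visual route and does not depend on an LLM talking to itself.
  • Set-of-Mark prompting (a visual prompting technique for GPT-4V): the same underlying idea, adding visual marks to the image shown to a VLM to direct its attention. TraceVLA is the robotics version of SoM.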

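How I suggest reading it

  1. Start with fig 1 + the method section: understand what a trace looks like and how it is overlaid on the image. This is the most intuitive part of the paper; the figures get you 80% of the way.
  2. Jump to the experiment tables: go straight to the main result, the success-rate gap between TraceVLA and OpenVLA on SIMPLER / LIBERO. If the gap is < 3%, the trick may not be worth it; if it is > 10%, it is a genuinely strong baseline.
  3. Read the ablations: focus on the choice of K and the effect of the trace's visual style. These determine the hyperparameters for your own reproduction.
  4. Optional: the OOD / long-horizon tasks in the appendix. If the trace also works in new scenes, the VLM's visual prior really has absorbed the abstraction "line = path", which makes the result more valuable.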

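Why it is worth reading

  • The method is minimal: render one line, with no new modules and no new data. It is a textbook case of getting a lot from very little. After reading it you may well wonder why nobody did this earlier.
  • A template for visual prompting in robotics: visual prompts were already validated in the GPT-4V era (SoM, ViP-LLaVA, etc.), and TraceVLA carries that methodology over to VLAs. The idea transfers to many embodied AI subtasks.
  • A rethink of temporal modeling in VLAs: it implies that a transformer VLA does not really "understand" what it was doing a few steps ago and needs the history drawn out explicitly. That observation is useful for later designs.
  • Low reproduction cost: if you already have a working OpenVLA environment, adding trace rendering takes only a few dozen lines of code, which makes it a good first project for VLA improvement research.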

Cite this note
BibTeX
@online{eai_tracevla_2026,
  title       = {(readable note) TraceVLA: Visual Trace Prompting},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/tracevla/}},
  organization = {Embodied AI Reading Station}
}
