Multimodal Ecology · Plate Nº 68

FROMAGe: Grounding LLMs to Images

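6 min read · 2225 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public material; I have not read the full paper.

The one-sentence version (TL;DR)

Freeze an entire text-only large language model, add one thin "translation layer" in front of it and one behind it, and it can read images, retrieve images, and hold a conversation that mixes images and text.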

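The scenario: an everyday analogy

Your phone's camera roll holds ten thousand photos. A friend casually says "send me the pictures from that beach barbecue last year", and you spend five minutes scrolling.

What if a chat AI could understand that kind of natural description and fetch the matching photos for you, pulling up images as you talk? That would be handy.

The usual approach is expensive: it is like sending a knowledgeable colleague who only communicates in Chinese away on three months of full-time training to relearn a language that includes images (that is, training a multimodal large model from scratch and burning a pile of GPUs).

FROMAGe's approach is cheaper: instead of retraining the colleague, put a translation earpiece and a translation microphone in front of him. The earpiece translates images in real time into "Chinese vectors he can understand"; the microphone translates his "intent to find an image" into "vectors that image retrieval can use". The colleague never attends a single class; only the two small translation devices are trained.

Low cost, fast transfer. But the ceiling is capped by the colleague's existing language ability.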

Plate Nº I. FROMAGe scenario sketch: the real-world problem this paper addresses

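How people did it before

  • Training a multimodal model as a whole: for example the early VL-BERT and ViLBERT, which pretrain vision and language jointly; compute is expensive and data is costly
  • Fine-tuning the entire LLM: take GPT or LLaMA, unfreeze every parameter, and train them all together. Results are good, but memory pressure is high and the language ability is easily damaged (catastrophic forgetting)
  • The Frozen / Flamingo line: the "freeze the LLM backbone" idea starts to catch on, but Flamingo still inserts many cross-attention layers inside the LLM (so that text can "look at" image tokens), and training is still expensive
  • The CLIP family: does only image-text alignment. Images and text each have their own encoder, but the model cannot generate free-form text, let alone hold an interleaved conversation
  • BLIP / BLIP-2: BLIP-2 also follows the "frozen backbone + bridging module (Q-Former)" route, but the Q-Former itself has a non-trivial number of parameters, and the focus remains on answering questions about an image; image retrieval is a weak spot

FROMAGe pushes "freeze more, add less" to the extreme: it adds only two linear layers.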

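The paper's key idea

Three linked moves:

  1. Image → text embedding space: a visual encoder (the paper uses an existing CLIP-style model) extracts image features, and a linear layer projects them into the LLM's input embedding space. The LLM is in effect "fooled" into treating them as text tokens when they are really an image.

  2. Text → image embedding space: a special token (called [RET] in the paper) is added on the LLM's output side. When this token appears, its hidden state is projected through another linear layer into the image retrieval space and used to fetch matching images from a gallery.

  3. The backbone stays untouched: all LLM parameters and all visual encoder parameters are frozen; only the two linear layers and the [RET] token embedding are trained. Training uses paired image-caption data with a captioning objective plus an image retrieval loss.

The nicest side effect: because the LLM backbone is unchanged, its original language ability and in-context learning ability are fully preserved. So you can hand it an interleaved "text-image-text-image-text" sequence and it will naturally continue with the next passage, or even decide which image should be retrieved next.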
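To make "only the thin layers are trained" concrete, here is a minimal bookkeeping sketch. The parameter counts are hypothetical placeholders chosen only to show the order of magnitude, not figures from the paper, and the group names are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ParamGroup:
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(groups):
    """Share of all parameters that would receive gradient updates."""
    total = sum(g.num_params for g in groups)
    trainable = sum(g.num_params for g in groups if g.trainable)
    return trainable / total


# Hypothetical sizes (assumptions), not numbers reported by the paper.
groups = [
    ParamGroup("llm_backbone", 1_000_000_000, trainable=False),
    ParamGroup("visual_encoder", 300_000_000, trainable=False),
    ParamGroup("input_projection", 4_000_000, trainable=True),
    ParamGroup("output_projection", 1_000_000, trainable=True),
    ParamGroup("ret_token_embedding", 4_000, trainable=True),
]

trainable_names = [g.name for g in groups if g.trainable]
fraction = trainable_fraction(groups)
print(trainable_names)
print(f"trainable share: {fraction:.4%}")
```

With sizes in this ballpark, well under 1% of the parameters are updated, which is the practical meaning of "freeze the backbone, train the translators".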

Plate Nº II. FROMAGe method sketch: the core pipeline

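How it works (method)

Input side, like writing a photo onto a few sticky notes for the colleague: an image first passes through the visual encoder (a "camera" that turns an image into a string of numbers) to extract features. A trainable linear layer then "translates" those features into k pseudo-tokens (k is a small number in the paper; check the original for the exact value), which are inserted into the LLM's input sequence. To the LLM, images and words look exactly alike: both are tokens.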
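The input-side step can be sketched in plain Python as follows. This is an illustrative toy under stated assumptions: the dimensions, the random weights, and the helper names (`linear`, `image_to_pseudo_tokens`, `build_input_sequence`) are invented for this note, and a real system would use a tensor library instead of nested lists.

```python
import random


def linear(x, weight, bias):
    """y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def image_to_pseudo_tokens(image_feature, weight, bias, k, d_model):
    """Project one image feature to k vectors of size d_model."""
    flat = linear(image_feature, weight, bias)  # length k * d_model
    return [flat[i * d_model:(i + 1) * d_model] for i in range(k)]


def build_input_sequence(text_before, pseudo_tokens, text_after):
    """Splice the pseudo-tokens between ordinary text embeddings."""
    return text_before + pseudo_tokens + text_after


# Toy dimensions (assumptions): 6-dim image feature, 4-dim LLM embeddings.
rng = random.Random(0)
d_image, d_model, k = 6, 4, 2
weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_image)]
          for _ in range(k * d_model)]
bias = [0.0] * (k * d_model)

image_feature = [rng.uniform(-1.0, 1.0) for _ in range(d_image)]
text_before = [[0.0] * d_model for _ in range(3)]  # stand-in text embeddings
text_after = [[1.0] * d_model for _ in range(2)]

pseudo_tokens = image_to_pseudo_tokens(image_feature, weight, bias, k, d_model)
sequence = build_input_sequence(text_before, pseudo_tokens, text_after)
print(len(sequence), "positions, each of size", len(sequence[0]))
```

The point of the design is visible in `build_input_sequence`: once projected, the image vectors sit in the same list as the text embeddings, so the frozen LLM needs no architectural change to consume them.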

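Wait, slow down a beat: what is a token? Think of it as one of the little building blocks the LLM speaks in. It originally recognizes only text blocks; FROMAGe quietly turns the image into a few blocks that "look like text" and mixes them in.

Output side / text generation, like the colleague speaking normally: the LLM emits one token at a time as usual. But a new word, [RET], has been slipped into its vocabulary, meaning "an image belongs here". The embedding of this new word is also trainable.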
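A minimal sketch of "one new trainable word in a frozen vocabulary", assuming a toy embedding table stored as a list of rows. The helper names, the initial vector, and the gradient values are all hypothetical; the only idea illustrated is that the update touches the [RET] row and nothing else.

```python
def add_special_token(vocab, embeddings, token, init_vector):
    """Append a new token and its embedding row; return the new id."""
    if token in vocab:
        raise ValueError(f"{token!r} is already in the vocabulary")
    new_id = len(embeddings)
    vocab[token] = new_id
    embeddings.append(list(init_vector))
    return new_id


def sgd_step(embeddings, grads, trainable_ids, lr):
    """Update only the rows listed in trainable_ids; others stay frozen."""
    for i in trainable_ids:
        embeddings[i] = [w - lr * g for w, g in zip(embeddings[i], grads[i])]


# Toy vocabulary with 2-dimensional embeddings (assumption).
vocab = {"a": 0, "dog": 1}
embeddings = [[1.0, 0.0], [0.0, 1.0]]

ret_id = add_special_token(vocab, embeddings, "[RET]", [1.0, 1.0])

# Pretend every row received a gradient of ones (made-up values).
grads = [[1.0, 1.0] for _ in embeddings]
sgd_step(embeddings, grads, trainable_ids=[ret_id], lr=0.25)

print(vocab)
print(embeddings)
```

After the step, the rows for "a" and "dog" are unchanged while the [RET] row has moved, which mirrors how the original vocabulary keeps its pretrained meaning.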

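Output side / image retrieval, like looking up a book in a library: when [RET] pops out, take the hidden state at that position (the model's "thought vector" at that moment) and pass it through a trainable output linear layer to get a "query card". Every image in the gallery is likewise mapped into the same retrieval space to get a "candidate card". Take the dot product (a simple similarity measure) between the two sides; the most similar image is the answer.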
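The retrieval step described above can be sketched as a projection followed by a dot-product ranking. The weights, the hidden state, and the gallery vectors below are toy values invented for illustration; how the gallery embeddings are actually produced is not specified in this note and is assumed to be done ahead of time.

```python
def linear(x, weight, bias):
    """y = W x + b for one vector x; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(ret_hidden_state, out_weight, out_bias, gallery, top_k=1):
    """Rank gallery embeddings by dot product with the projected query."""
    query = linear(ret_hidden_state, out_weight, out_bias)
    scores = [dot(query, image_vec) for image_vec in gallery]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


# Toy values (assumptions): 3-dim hidden state, 2-dim retrieval space.
ret_hidden_state = [1.0, 0.0, 2.0]
out_weight = [[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]]
out_bias = [0.0, 0.0]
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # precomputed image vectors

top, scores = retrieve(ret_hidden_state, out_weight, out_bias, gallery, top_k=2)
print(top)     # indices of the best-matching gallery images
print(scores)
```

Because the gallery vectors can be computed once and cached, answering a query costs one small projection plus one dot product per candidate.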

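Training objective, two assignments done together: a captioning loss (the model learns to describe an image) plus a retrieval loss (the [RET] query vector is pulled toward the correct image and pushed away from wrong ones, similar to CLIP's InfoNCE contrastive loss). Because only the two thin translation layers are updated, training fits on a single machine and does not need massive data.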
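As a rough sketch of the two-part objective, the code below computes a symmetric InfoNCE loss over a small batch and adds it to a captioning negative log-likelihood. The temperature of 0.07, the equal loss weights, and the assumption that vectors are already normalized are illustrative choices, not settings confirmed by the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def info_nce(queries, images, temperature=0.07):
    """Symmetric InfoNCE; queries[i] is the positive match of images[i]."""
    n = len(queries)
    sim = [[dot(q, im) / temperature for im in images] for q in queries]
    q2i = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    i2q = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (q2i + i2q)


def captioning_loss(token_probs):
    """Mean negative log-likelihood of the ground-truth caption tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def total_loss(cap, ret, cap_weight=1.0, ret_weight=1.0):
    """Weighted sum of the two objectives (weights are an assumption)."""
    return cap_weight * cap + ret_weight * ret


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(queries, [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce(queries, [[0.0, 1.0], [1.0, 0.0]])
cap = captioning_loss([0.9, 0.8, 0.95])  # made-up token probabilities

print(aligned, mismatched)
print(total_loss(cap, aligned))
```

The aligned batch gets a much lower retrieval loss than the mismatched one, which is the signal that pulls each [RET] query toward its own image and away from the others in the batch.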

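What the experiments do

Typical evaluation settings in the paper (check the original for exact numbers):

  • Zero-shot image retrieval: given a long description or a multi-turn dialogue, the model fetches images from a gallery; compared against baselines such as CLIP
  • Image captioning: given an image, the model produces a description
  • Multimodal dialogue / interleaved image-text generation: given a "text-image-text-image" context, check whether the model can plausibly continue the text or insert a suitable retrieved image at the right place
  • Few-shot / in-context learning: because the LLM is untouched, the paper emphasizes that its ability to do a new task after seeing a few examples is still intact

The highlight is not flashy metrics but reaching a usable level with very few trainable parameters, with no degradation in language ability.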
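For readers unfamiliar with how retrieval is usually scored, here is a small Recall@k sketch. Recall@k is a common metric for this kind of task, but whether the paper reports exactly this metric is an assumption of this note; the rankings and gold labels below are made up.

```python
def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold image appears in the top k results."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_ids)
               if gold in ranked[:k])
    return hits / len(gold_ids)


# Each inner list is one query's gallery ranking, best match first.
ranked_lists = [
    [2, 0, 1],
    [1, 2, 0],
    [0, 1, 2],
]
gold_ids = [0, 1, 1]  # correct gallery index for each query

r_at_1 = recall_at_k(ranked_lists, gold_ids, k=1)
r_at_2 = recall_at_k(ranked_lists, gold_ids, k=2)
print(r_at_1, r_at_2)
```

Comparing a model against a baseline such as CLIP then amounts to computing the same metric on both sets of rankings over the same queries and gallery.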

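A few new terms you should know

  • frozen backbone: parameters of part of the model are held fixed during training and only the newly added parts are trained. Saves memory and protects existing abilities
  • linear projection / linear layer: the simplest fully connected layer, y = Wx + b. All of the "space translation" in this paper relies on it
  • interleaved image-text: the input or output is a sequence in which text and images alternate ("text-image-text-image"), not just "one image, one caption"
  • retrieval token [RET]: a special token newly added to the vocabulary, used specifically to mark "fetch an image here"
  • in-context learning: the ability of an LLM to pick up a new task from a few examples in the prompt, without updating its parameters
  • InfoNCE / contrastive loss: a training objective that makes positive pairs (matching image and text) score high similarity and negative pairs score low; the same kind used by CLIP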

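How it relates to other papers

  • Builds on CLIP: the visual encoder and the image retrieval logic follow CLIP's contrastive learning paradigm
  • Builds on Frozen / Flamingo: the same "freeze the LLM" idea, but FROMAGe adds less than Flamingo (no layers inserted inside the LLM), at the cost of shallower image understanding than Flamingo
  • Compared with BLIP-2: BLIP-2 adds a Q-Former (a bridging module with more parameters), while FROMAGe adds only linear layers; BLIP-2 leans toward VQA (visual question answering), FROMAGe toward retrieval and interleaved generation
  • Later influence: open-source multimodal projects such as MiniGPT-4 and LLaVA follow the same "train a lightweight projection layer" line of thinking. MiniGPT-4 trains a single projection layer between frozen components, and the first training stage of the original LLaVA updates only a linear projection while the LLM stays frozen
  • Difference from the PaLM-E / embodied line: PaLM-E aims to let an LLM control robots; FROMAGe only cares about images and text and does not touch the action space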

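How I suggest reading it

  1. Start with Figure 1 (the architecture diagram): all of FROMAGe's secrets are in that figure: two linear layers, one [RET] token, and a frozen backbone. Understand the figure and you understand 70% of the paper
  2. Then read the method section: focus on which two losses make up the training objective, and on how the [RET] token takes part in training
  3. Skip the experimental details and look at the qualitative examples first: the interleaved image-text dialogues shown in the paper best explain why freezing the backbone pays off
  4. Finally go back to the ablations: what happens with only the captioning loss and no retrieval loss? What happens if the projection layer is widened? This part answers whether linear layers are enough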

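Why it is worth reading

  • Extremely simple method: few papers can make "two linear layers" the main contribution and still explain it clearly
  • The idea transfers: the later wave of "frozen LLM + lightweight bridge" multimodal work (the LLaVA series especially) can find a spiritual ancestor here
  • It demonstrates an engineering philosophy: rather than training a new capability, borrow the capability of an existing large model and train only the "translation interface". The approach is general in the large-model era: adapters, LoRA, and the Q-Former all belong, in essence, to this family
  • Friendly to learners: clean architecture, few parameters, focused concepts; a good entry-level sample for understanding multimodal alignment. After it, BLIP-2 and LLaVA read very smoothly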

Cite this note
BibTeX
@online{eai_fromage_2026,
  title       = {(readable note) FROMAGe: Grounding LLMs to Images},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/fromage/}},
  organization = {Embodied AI Reading Station}
}
