VLM Foundation · Plate Nº 137

What matters when building vision-language models?

6 min read · 2201 characters · ⭐⭐⭐ · Short summary

This note is based on the abstract and public materials; the full paper was not read.

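In one sentence (TL;DR)

When building "describe-the-picture AI," people tend to pick components by gut feeling. This paper isolates each choice in a controlled experiment, compiles the results into a list of pitfalls to avoid, and then trains an 8B model as a reference build.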

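What is the setting

Imagine you want to open a bakery. You scroll through Xiaohongshu and find that the viral recipes are all over the place: Shop A uses Japanese flour and a 24-hour cold ferment; Shop B uses French flour and adds tangzhong; Shop C turns the oven up by 20 degrees. Each shop claims its own step is the "key" one.

But as a first-time owner, your real question is not whose advice to follow. It is: which step actually makes the bread taste better, and which only sounds sophisticated and makes no difference when you swap it out?

In 2024, people building "describe-the-picture AI" (formally a VLM, Vision-Language Model) were in exactly this position: the community was full of "I heard X beats Y" folklore, but nobody had run a clean comparison. Idefics2 does what the novice baker wants most: it takes every design choice that is "said to help" (which vision encoder to use, how to wire images into the language model, how many training stages to run, how much of each kind of data to mix in), runs a controlled experiment on each one, and reports which actually work and which are just for show.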

Plate Nº I. What matters when building vision-language models? Scenario sketch: the real-world problem this paper addresses

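How prior work did it

  • Flamingo / early VLMs: inject visual features into selected LLM layers through cross-attention; the architecture is complex and training is hard to reproduce
  • LLaVA family: simple and direct. Vision encoder + an MLP projection layer + LLM, aligned first and then instruction-tuned; the mainstream recipe in the open-source community
  • BLIP-2 / Q-Former: a small learnable transformer (the Q-Former) acts as the vision-to-language "translator"; few parameters, strong compression
  • Most papers report only their final recipe and do not explain "why SigLIP rather than CLIP" or "why a perceiver rather than an MLP"
  • As a result, the community has plenty of design advice that "sounds reasonable" but no clean A/B comparisons, and newcomers hit many pitfalls when trying to reproduce results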

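The paper's key idea

Treat every design choice as an independent variable, hold everything else fixed, run ablations, and see which choices actually improve downstream benchmarks.

Concretely, it breaks a VLM into several swappable "parts":

  1. Vision backbone: CLIP / SigLIP / DINOv2, etc.
  2. Connector: MLP / Perceiver Resampler / Q-Former, etc. This decides how visual tokens get "translated" into a form the LLM can consume
  3. LLM backbone: Mistral-7B and similar
  4. Training stages: pretraining / visual alignment / instruction tuning, plus the data mixture ratios for each stage
  5. Image handling strategy: native resolution vs. cutting the image into tiles, and whether to preserve the aspect ratio

It then runs the experiments one by one and distills them into an engineering checklist: "if you are building an open-source VLM in 2024, following this will most likely not go wrong."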
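To make the "one variable at a time" idea concrete, here is a minimal sketch of how such an ablation grid could be enumerated. The `VLMConfig` class, its field names, and the baseline values are hypothetical and chosen only for illustration; the candidate values are the ones listed above, and training stages are left out to keep the sketch short. This is not the paper's actual experiment code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VLMConfig:
    # Hypothetical baseline; the paper's real baseline may differ.
    vision_backbone: str = "CLIP"
    connector: str = "MLP"
    llm: str = "Mistral-7B"
    image_strategy: str = "fixed-square"


# Candidate values per swappable part (illustrative labels).
CANDIDATES = {
    "vision_backbone": ["CLIP", "SigLIP", "DINOv2"],
    "connector": ["MLP", "Perceiver Resampler", "Q-Former"],
    "image_strategy": ["fixed-square", "native-aspect-ratio", "tiled"],
}


def one_factor_variants(baseline):
    """Yield (field, config) pairs that differ from the baseline in exactly one field."""
    for field, values in CANDIDATES.items():
        for value in values:
            if value != getattr(baseline, field):
                yield field, replace(baseline, **{field: value})


baseline = VLMConfig()
variants = list(one_factor_variants(baseline))
for field, cfg in variants:
    print(f"{field:16s} -> {cfg}")
```

Each variant would be trained and evaluated under the same budget as the baseline, so any score difference can be attributed to the single field that changed.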

Plate Nº II. What matters when building vision-language models? Method sketch: the core pipeline

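How it works (method)

Hold on, one step back: what is a "vision encoder"? Think of it as the AI's pair of eyes: given an image, it first translates it into a sequence of numeric vectors so that the language model can read it. The "connector" is the stretch of nerve between the eyes and the mouth: it decides how what the eyes see gets passed to the language model (LLM) that does the talking.

Architecture choices: like picking a camera and picking a brain. The authors lock the "brain" (the LLM, fixed to Mistral-7B) and swap in different "eyes" (vision encoders) on the same test suite. One conclusion is that SigLIP beats CLIP (the exact gap requires reading the paper). More importantly, both the eyes and the brain matter, but upgrading to a stronger brain pays off more: rather than fiddling with the camera lens, get a smarter person to look at the photo.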

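Connector comparison: like relaying an order to the kitchen. One option (Perceiver Resampler / Q-Former) has a "translator" compress the hundreds of details the eyes see into a few dozen key points before telling the brain; the other (a simple MLP projection) hands every patch token to the LLM. Intuition says the translator must lose detail, but the paper's reported finding goes the other way: learned pooling that reduces each image to a small, fixed number of visual tokens (64 in Idefics2) makes training and inference much cheaper while also improving downstream performance. Detail-heavy inputs are handled by the image-splitting strategy in the next paragraph rather than by feeding the LLM more tokens per image.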
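The trade-off between the two connector styles is easiest to see as token arithmetic. The sketch below counts how many visual tokens reach the LLM for a square image, assuming a ViT patch size of 14 and a pooling connector with 64 learned queries; both numbers are assumptions made for this illustration, and the function names are hypothetical.

```python
def num_patch_tokens(height, width, patch_size=14):
    """Number of ViT patch tokens for an image whose sides are multiples of patch_size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    return (height // patch_size) * (width // patch_size)


def tokens_into_llm(height, width, connector, num_latents=64, patch_size=14):
    """Visual tokens handed to the LLM under two connector styles."""
    patches = num_patch_tokens(height, width, patch_size)
    if connector == "mlp":
        # A per-token projection keeps every patch token.
        return patches
    if connector == "learned_pooling":
        # A resampler with learned queries emits a fixed number of tokens.
        return num_latents
    raise ValueError(f"unknown connector: {connector}")


table = {
    side: (tokens_into_llm(side, side, "mlp"),
           tokens_into_llm(side, side, "learned_pooling"))
    for side in (224, 336, 980)
}
for side, (mlp_tokens, pooled_tokens) in table.items():
    print(f"{side}x{side}: MLP keeps {mlp_tokens} tokens, learned pooling emits {pooled_tokens}")
```

Because the LLM's attention cost grows with sequence length, the gap between thousands of tokens and a fixed few dozen is what makes the connector choice matter at high resolution.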

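Image handling strategy: like viewing a high-resolution image by zooming in and splitting it across screens. Earlier models were forced to shrink images to small 224×224 or 336×336 squares, which blurred text and fine detail. The authors instead preserve the original aspect ratio and split a large image into several sub-images that are each encoded separately (similar to the anyres tiling in LLaVA-NeXT). The gains are clear on tasks that need legible text (OCR, document QA).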
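As a rough illustration of the splitting step, the sketch below computes crop boxes for a 2×2 grid of sub-images plus one box covering the whole image as a global view. The grid size, the extra global view, and the function names are assumptions for this example, not a description of the paper's exact preprocessing. Boxes use the `(left, top, right, bottom)` convention, which is the same order Pillow's `Image.crop` expects.

```python
def split_boxes(width, height, rows=2, cols=2):
    """Return (left, top, right, bottom) boxes tiling the image in a rows x cols grid."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def views_for_image(width, height, rows=2, cols=2):
    """Sub-image boxes followed by a full-image box (to be downscaled as a global view)."""
    return split_boxes(width, height, rows, cols) + [(0, 0, width, height)]


views = views_for_image(1280, 960)
for box in views:
    print(box)
```

Each view would then be resized to the encoder's input size and encoded on its own, so a 2×2 split multiplies the visual token count by five in exchange for finer detail.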

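Training stages: like training a chef in three steps. First learn the ingredients, then practice a cuisine, and finally cook from the menu. Concretely: (1) vision-text alignment pretraining (image-text pairs plus interleaved documents, i.e. web pages or textbooks where images and text alternate); (2) multimodal pretraining with a more diverse set of tasks; (3) instruction tuning on a consolidated dataset called The Cauldron, which covers 50 open-source VLM tasks. Each stage comes with ablations on the data mixture, showing how much OCR data to add and what share of interleaved documents works best.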
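A data mixture is usually implemented as weighted sampling over data sources. The following is a minimal sketch under stated assumptions: the source names and the weights in `MIXTURE` are placeholders invented for this example and are not the ratios reported in the paper.

```python
import random
from collections import Counter

# Hypothetical mixture weights for one training stage (NOT the paper's ratios).
MIXTURE = {
    "image_text_pairs": 0.4,
    "interleaved_documents": 0.4,
    "ocr_documents": 0.2,
}


def sample_sources(mixture, n, seed=0):
    """Draw n source names with probability proportional to their mixture weight."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


draws = sample_sources(MIXTURE, 10_000)
counts = Counter(draws)
for name, weight in MIXTURE.items():
    print(f"{name:24s} target={weight:.2f} observed={counts[name] / len(draws):.3f}")
```

An ablation over the mixture then amounts to changing one weight, renormalizing, and retraining under the same budget.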

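What the experiments do

The core is an ablation matrix: on a fixed suite of downstream benchmarks (VQAv2, TextVQA, DocVQA, MMMU, MathVista, etc.), change one variable at a time and compare the average score.

Evaluation coverage: general VQA, document/OCR, math reasoning, multilingual, long-context image-text, and so on. The goal is to make sure that "better on task A" does not turn into "worse on task B."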
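The comparison logic can be sketched in a few lines: compute each variant's average change against the baseline and flag any benchmark where it got worse. All scores below are made-up placeholders for illustration and are not results from the paper; the variant names are hypothetical.

```python
from statistics import mean

# Placeholder scores for illustration only; these are NOT the paper's numbers.
scores = {
    "baseline":  {"VQAv2": 60.0, "TextVQA": 40.0, "DocVQA": 30.0},
    "variant_a": {"VQAv2": 61.0, "TextVQA": 43.0, "DocVQA": 34.0},
    "variant_b": {"VQAv2": 62.0, "TextVQA": 38.0, "DocVQA": 29.0},
}


def summarize(scores, baseline="baseline"):
    """For each variant, report the average delta vs. the baseline and any regressions."""
    base = scores[baseline]
    report = {}
    for name, per_task in scores.items():
        if name == baseline:
            continue
        deltas = {task: per_task[task] - base[task] for task in base}
        report[name] = {
            "avg_delta": mean(deltas.values()),
            "regressions": sorted(task for task, d in deltas.items() if d < 0),
        }
    return report


report = summarize(scores)
for name, row in report.items():
    print(f"{name}: avg_delta={row['avg_delta']:+.2f}, regressions={row['regressions']}")
```

In this toy data, `variant_b` improves one benchmark but regresses on two others, which is exactly the case a single-benchmark comparison would miss.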

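The final model, Idefics2-8B: the "best option" from each ablation is combined to train an 8B model, which is compared against contemporaneous VLMs such as LLaVA-NeXT, MM1, and Qwen-VL and is claimed to match or exceed the state of the art at its size. The exact numbers require reading the paper.

By-product: The Cauldron instruction dataset (a consolidation of 50 tasks) is also open-sourced, which is a valuable community contribution in its own right.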

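New terms you should know

  • VLM (Vision-Language Model): a model that takes images and text together and outputs text (for example, answering a question about an image).
  • Vision encoder / vision backbone: the network that encodes an image into a sequence of vectors, usually a pretrained ViT such as CLIP, SigLIP, or DINOv2.
  • Connector (projector): the bridge that converts the vision encoder's output tokens into a form the LLM can digest. It can be a simple MLP, or a small transformer with learnable queries such as the Q-Former.
  • Interleaved image-text documents: training data in which images and text alternate (web pages, textbooks). Compared with "image, question, answer" triples, this is closer to the real multimodal distribution and matters for long context.
  • anyres / tiling strategy: cut a high-resolution image into several sub-images and encode each one separately, working around the ViT's fixed input resolution.
  • Instruction tuning: further training on data in the form "task description + input + expected output," so the model learns to follow instructions rather than merely continue text.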
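To show what "interleaved" means in practice, here is a minimal sketch that flattens a document made of alternating text and image segments into a text prompt with one placeholder per image, plus the ordered list of images. The `Text` and `Image` classes, the file names, and the `<image>` marker are all hypothetical; real models define their own special tokens.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str  # hypothetical file path


Segment = Union[Text, Image]
IMAGE_PLACEHOLDER = "<image>"  # assumed marker, not any specific model's token


def flatten(document: List[Segment]) -> Tuple[str, List[str]]:
    """Return (prompt, image_paths), keeping images in reading order."""
    parts, images = [], []
    for segment in document:
        if isinstance(segment, Image):
            parts.append(IMAGE_PLACEHOLDER)
            images.append(segment.path)
        else:
            parts.append(segment.content)
    return " ".join(parts), images


doc = [
    Text("Step 1: knead the dough."),
    Image("dough.jpg"),
    Text("Step 2: bake."),
    Image("loaf.jpg"),
]
prompt, images = flatten(doc)
print(prompt)
print(images)
```

At training time each placeholder would be replaced by that image's visual tokens, so the LLM sees text and image content in the order a reader would.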

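How it relates to other papers

  • Continues LLaVA's minimalist philosophy (vision backbone + projection + LLM) and uses ablations to push that structure toward an "engineering standard"
  • Responds to the more complex connectors of Flamingo / BLIP-2: it puts cross-attention and learned-query designs through the same ablations instead of assuming that a more elaborate connector is better
  • Contemporaneous with MM1 (Apple, 2024) as an "ablation study" paper: MM1 ran similar design-choice ablations, and the two can be read side by side; the conclusions broadly agree but differ in details
  • LLaVA-NeXT, Qwen-VL, and others explored anyres and long context in parallel; Idefics2 consolidates and validates these techniques
  • Strong influence on later open-source VLMs (e.g. Idefics3, the InternVL series): many people take these ablation conclusions as their default starting point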

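How I suggest reading it

  1. Start by skimming the abstract, the introduction, and the takeaway list in the conclusion. Many readers come for that "checklist," so commit the checklist itself to memory first
  2. Focus on the one or two ablation sections most relevant to you: if you care about connector choice, read the connector section; if you care about data ratios, read the training-data section. There is no need to read every ablation closely
  3. Look at the introduction to The Cauldron dataset. If you ever do VLM instruction tuning, this is ready-made, high-quality data
  4. Skip the hyperparameter tables unless you plan to reproduce the training; that level of detail will not stick anyway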

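Why it is worth reading

  • If you do VLM engineering: this is the open-source community's most systematic "pitfall guide" of 2024, and more useful than reading ten single-method papers
  • If you work on embodied AI / robotics: the vision stacks of many robot foundation models (such as RT-2 and π0) follow the same "vision encoder + connector + LLM" pattern that these ablations examine, so understanding Idefics2 helps explain why their visual side is built the way it is
  • If you only want an overview of VLMs: reading this one paper can save you the effort of reading LLaVA, BLIP-2, and Flamingo separately, because it compares these approaches against each other
  • Methodological value: even if you do not work on VLMs, the working pattern of "take every trick the field says is useful and run a controlled experiment on it" is worth learning in itself. It is the standard move for turning "alchemy" into "engineering"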

Cite this note
BibTeX
@online{eai_idefics_2_2026,
  title       = {(readable note) What matters when building vision-language models?},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/idefics-2/}},
  organization = {Embodied AI Reading Station}
}
