回 Jason 主站·Embodied AI Reading Station
没主意?快捷入口
VLM Foundation · Plate Nº 133

Improved Baselines with Visual Instruction Tuning

6 min read · 2181 字 · ⭐⭐ · 短摘要

本笔记基于摘要 + 公开资料,未读全文。

一句话讲什么(TL;DR)

给会聊天的 AI 配一副"看图眼镜"。把眼镜从一片镜片换成两片,再多给它看点带字的图片,看图答题就刷榜了。

这是个什么场景 — 日常类比

你拍了一张菜单照片发给 ChatGPT,问它"这家有没有素食"。AI 要做对这件事,得同时干两件事:看懂图(认出菜名、看清招牌字)、会聊天(理解你的问题、组织答案)。

问题是:会聊天的语言模型(GPT、LLaMA 这些)天生只认文字,不认图。所以工程师给它配了一副"翻译眼镜"——眼镜负责把图片翻译成模型能读懂的"密码",模型再照常聊天回答。

初代 LLaVA(2023 年 4 月)就是这套思路,但眼镜很简陋:一片薄薄的单层镜片(线性投影),看大致轮廓还行,看菜单上的小字、考试题里的图表就糊了。LLaVA-1.5 做的事很朴素——把镜片换成两片叠起来(两层 MLP),再让模型多看一些带文字、带表格、带考题的图片当教材。模型本身一行代码没改,就在十几个看图答题榜单上拿了开源第一。

这是 2023 年下半年开源 VLM(视觉语言模型)圈的典型故事:少折腾架构、多喂对的数据,反而最划算。

Improved Baselines with Visual Instruction Tuning — 场景示意:这论文要解决的现实问题
Plate Nº IImproved Baselines with Visual Instruction Tuning — 场景示意:这论文要解决的现实问题

之前的人怎么做的 — 3-5 bullet

  • 初代 LLaVA(2023 年 4 月):CLIP(对比语言-图像预训练)的视觉编码器 + 单层线性投影 + Vicuna 语言模型,用 GPT-4 自动生成的多模态指令数据训练;能聊天但 VQA 榜单分数不高
  • MiniGPT-4 / mPLUG-Owl:思路类似,用 Q-Former 或单层投影把视觉 token 接到 LLM 上,注重对话流畅
  • BLIP-2:用 Q-Former 这种"压缩 + 查询"的桥接方式,参数效率高但训练复杂
  • InstructBLIP:在 BLIP-2 基础上加 instruction tuning,VQA 强但开源生态不如 LLaVA
  • 共同短板:要么对话强但答 VQA 不准,要么 VQA 准但配方复杂、不好复现;学术 VQA / OCR 任务普遍弱

这篇论文的关键想法

一句话:做菜不在锅多花哨,而在选对食材。同一口家用锅,把食材换全了就能做出大餐。

具体两个赌注:

  1. 桥接层不需要花哨(连接图像和文字的"翻译镜片"):像把一片镜片换成两片叠加——把单层线性投影换成两层 MLP(多层感知机,中间夹一个非线性激活),表达能力够用,多出来的参数可以忽略
  2. 数据才是真正的瓶颈:好比模型之前没复习过"看图识字"和"看图答常识题"两类作业。把 OCR(光学字符识别)类(OCR-VQA、TextCaps)和学术 VQA(视觉问答)类(A-OKVQA、OKVQA)数据加进指令微调阶段,对应榜单立刻补齐短板

底层信仰:左边的"看图模块"(CLIP-ViT)和右边的"会说话模块"(Vicuna 大语言模型)都已经够强了,中间那座桥不必复杂;真正决定 VLM 上限的是它做过哪些类型的题

Improved Baselines with Visual Instruction Tuning — 方法示意:核心 pipeline
Plate Nº IIImproved Baselines with Visual Instruction Tuning — 方法示意:核心 pipeline

它怎么做的(方法)— 3-4 段

架构:CLIP-ViT-L/336px(输入分辨率 336×336 的 ViT-Large)→ 两层 MLP 投影 → Vicuna-13B(基于 LLaMA 微调的对话模型)。视觉编码器输出的 patch token 经过 MLP 投影后,被当作"软 token"拼接到文本 token 序列前面,整段一起喂给 LLM 自回归预测下一个 token。

两阶段训练

  • 阶段 1(特征对齐):冻住视觉编码器和 LLM,只训 MLP 投影层,用图像-文本配对数据让投影学会把视觉特征映射到 LLM 的词嵌入空间
  • 阶段 2(visual instruction tuning,视觉指令微调):解冻 LLM 一起训练,喂多任务混合的指令数据;这一步是分数提升的关键

数据配方:在初代 LLaVA 的 LLaVA-Instruct-150K(GPT-4 自动构造的多模态对话指令)基础上,混入:

  • VQAv2 / GQA(通用 VQA)
  • OCR-VQA / TextCaps(文字识别相关)
  • A-OKVQA / OKVQA(需常识推理的 VQA)
  • 学术 VQA 数据集若干

具体配比和 epoch 数需读原文。

轻量化:分辨率提到 336×336(比初代的 224×224 更清晰);输入 prompt 加上"用一个词或短语回答"这类 response format prompt,让模型在 VQA 短答场景不啰嗦。整套配方训练成本约 1 天 8×A100,比 InstructBLIP 等复杂配方便宜很多。

实验在做什么

主要在 12 个左右的多模态 benchmark 上对比:

  • 学术 VQA:VQAv2 / GQA / VizWiz / ScienceQA-IMG / TextVQA — 测"看图答事实问题"
  • 多模态对话 / 综合:MME / MMBench / SEED-Bench / LLaVA-Bench-Wild — 测综合理解和指令跟随
  • OCR 相关:TextVQA / OCR-VQA — 测读图中文字的能力
  • POPE:测幻觉(hallucination,模型胡编不存在的物体)

核心结论:在多个榜单上超越同期开源 VLM(含 InstructBLIP、Qwen-VL 早期版等),并在部分 benchmark 接近或超过闭源 GPT-4V 当时的水平。具体数字需读原文。

值得注意的消融:

  • 单层投影 → 两层 MLP,分数稳定提升
  • 加入学术 VQA 数据,对应任务分数大幅上升,但通用对话能力没退化
  • 提分辨率 224 → 336,OCR / 细节任务受益最明显

你应该懂的几个新词 — 4-6 个

  • Visual Instruction Tuning(视觉指令微调):把"图像 + 任务指令 + 答案"的三元组组织成监督数据,让 VLM 学会按指令完成多样任务,而不只是图像描述
  • MLP(Multi-Layer Perceptron,多层感知机):最基础的神经网络结构,多层全连接 + 非线性激活;这里特指视觉特征到 LLM 嵌入空间的两层桥接
  • Projector / Connector(投影层 / 连接器):视觉编码器输出和 LLM 输入之间的桥接模块,负责把视觉 token 映射到 LLM 能"听懂"的向量空间;LLaVA 系列的 projector 极简,是其特色
  • VQA(Visual Question Answering,视觉问答):给一张图和一个自然语言问题,模型用文字回答;学术上分通用 VQA、OCR-VQA、知识 VQA 等
  • Response Format Prompt:在 prompt 末尾加一句格式约束(如"用一个词回答"),让模型在不同 benchmark 输出对的格式;LLaVA-1.5 用这招避免在短答 VQA 上输出长句被判错
  • POPE(Polling-based Object Probing Evaluation):一种测多模态幻觉的标准化评测,问模型"图里有没有 X",统计假阳性率

它和其他论文什么关系

  • 上游基础:CLIP(视觉编码器)+ LLaMA / Vicuna(语言模型);初代 LLaVA(visual instruction tuning 的开创工作)
  • 同期对比:InstructBLIP(更复杂的 Q-Former 配方)/ Qwen-VL(阿里同期开源 VLM,用 cross-attention 桥接)/ MiniGPT-4
  • 下游影响:成为开源 VLM 的事实标准 baseline,几乎所有后续工作都会在 LLaVA-1.5 上对比;衍生出 LLaVA-NeXT(1.6)、LLaVA-OneVision、ShareGPT4V、VILA 等一系列工作
  • 机器人 / 具身方向:LLaVA 系列的简单架构和开源权重,让它常被当作具身 VLM(如 RoboFlamingo、OpenVLA 早期对比)的视觉理解 backbone

在你的笔记体系里:

  • 上一篇 llava(初代)→ 本篇 → 下一篇可看 qwen-vl(同期不同流派)
  • 视觉骨干理解可回 clip / siglip
  • 想看 VLM 在机器人里怎么用 → openvla / rt-2

我建议这样读 — 3-4 步

  1. 先扫摘要 + 表 1(大表):直接看 LLaVA-1.5 在哪些 benchmark 上提升最大,建立"它到底强在哪"的直觉
  2. 读方法节的两个改动:MLP projector 一段 + 数据配方一段,重点看为什么两层 MLP 够、为什么这几类数据有效
  3. 看消融实验:分辨率、projector、数据三项消融分别贡献了多少分;这是作者给的"配方解构",对你做后续 baseline 改造最有用
  4. 跳读对话样例:附录里的 demo case 看几个,体会一下 OCR / 推理 / 描述各场景的输出风格

不建议第一次就钻训练超参细节,那部分对理解贡献不大。

为什么值得读

  • 开源 VLM 的"标准件":你做 VLM 相关任何研究 / 项目,几乎都会先在 LLaVA-1.5 上跑通再说,先理解它的配方等于理解整个生态的起点
  • "少即是多"的代表作:在大家堆复杂结构的 2023 年,它用最简单的 MLP + 加数据打赢,提醒你架构不是一切
  • 可复现性:训练成本、数据、代码、权重全公开,是第一个让普通研究者真能在 8×A100 一天复现的 VLM
  • 后续工作的对比锚:读 LLaVA-NeXT、Qwen-VL-2、InternVL 等任何后续 VLM 论文,都会反复出现 "vs LLaVA-1.5",理解它能让你看懂 90% 的 VLM 论文比较表

引用本笔记 / Cite this note
BibTeX
@online{eai_llava_1_5_2026,
  title       = {(readable note) Improved Baselines with Visual Instruction Tuning},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/llava-1-5/}},
  organization = {Embodied AI Reading Station}
}

All 156 papers (full index)
  1. 1. LLaVA: Visual Instruction Tuning
  2. 2. 3DShape2VecSet: 3D Shape Representation for Diffusion Models
  3. 3. SayCan: Do As I Can, Not As I Say
  4. 4. OpenVLA: An Open-Source Vision-Language-Action Model
  5. 5. VLAS: VLA Model With Speech Instructions
  6. 6. MLA: Multisensory Language-Action Model
  7. 7. Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control
  8. 8. CartoRadar: RF-Based 3D SLAM Rivaling Vision Approaches
  9. 9. mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment
  10. 10. mmNorm: Non-Line-of-Sight 3D Object Reconstruction via mmWave Surface Normal Estimation
  11. 11. Proactive Hearing Assistants that Isolate Egocentric Conversations
  12. 12. NeuralAids: Wireless Hearables With Programmable Speech AI Accelerators
  13. 13. Creating speech zones with self-distributing acoustic swarms
  14. 14. Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation
  15. 15. SoundStream: An End-to-End Neural Audio Codec
  16. 16. AudioLM
  17. 17. Conformer
  18. 18. Dual-path RNN
  19. 19. EnCodec
  20. 20. Meta-StyleSpeech
  21. 21. MusicLM
  22. 22. Robust Speech Recognition via Large-Scale Weak Supervision
  23. 23. SeamlessM4T
  24. 24. Stable Audio
  25. 25. Universal Source Separation with Weakly Labelled Data
  26. 26. Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning
  27. 27. RLBench: The Robot Learning Benchmark & Learning Environment
  28. 28. robosuite: A Modular Simulation Framework and Benchmark for Robot Learning
  29. 29. BridgeData V2
  30. 30. CALVIN
  31. 31. LIBERO
  32. 32. RH20T
  33. 33. What Matters in Learning from Offline Human Demonstrations for Robot Manipulation
  34. 34. DROID
  35. 35. Open X-Embodiment
  36. 36. RoboCasa
  37. 37. SimplerEnv
  38. 38. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
  39. 39. 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations
  40. 40. Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation
  41. 41. EquiBot: SIM(3)-Equivariant Diffusion Policy
  42. 42. DiT-Policy
  43. 43. Diffusion Policy Policy Optimization (DPPO)
  44. 44. Affordance-based Robot Manipulation with Flow Matching
  45. 45. FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching
  46. 46. FAST: Efficient Action Tokenization for VLA
  47. 47. pi_0: Vision-Language-Action Flow Model
  48. 48. pi_0.5: VLA with Open-World Generalization
  49. 49. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
  50. 50. Generative Adversarial Imitation Learning
  51. 51. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA)
  52. 52. AnyTeleop
  53. 53. Behavior Transformers: Cloning k Modes with One Stone
  54. 54. Implicit Behavioral Cloning
  55. 55. RoboCat
  56. 56. ALOHA 2
  57. 57. DexCap
  58. 58. HumanPlus
  59. 59. Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3)
  60. 60. Mobile ALOHA
  61. 61. SmolVLA
  62. 62. Universal Manipulation Interface
  63. 63. Behavior Generation with Latent Actions (VQ-BeT)
  64. 64. ImageBind: One Embedding Space To Bind Them All
  65. 65. Connecting Touch and Vision via Cross-Modal Prediction
  66. 66. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
  67. 67. AudioPaLM
  68. 68. FROMAGe: Grounding LLMs to Images
  69. 69. OneLLM
  70. 70. X-VLM: Multi-Grained Vision Language Pre-Training
  71. 71. Tactile Beyond Pixels (Sparsh-X)
  72. 72. Sparsh: Self-supervised Touch Representations
  73. 73. Tactile-VLA
  74. 74. TLA: Tactile-Language-Action
  75. 75. Code as Policies: Language Model Programs for Embodied Control
  76. 76. Inner Monologue: Embodied Reasoning through Planning with Language Models
  77. 77. LLM+P: Empowering LLMs with Optimal Planning
  78. 78. PaLM-E: An Embodied Multimodal Language Model
  79. 79. ProgPrompt
  80. 80. ChatGPT for Robotics
  81. 81. GenSim
  82. 82. RoboFlamingo
  83. 83. Tree-Planner
  84. 84. VoxPoser
  85. 85. See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar
  86. 86. Can WiFi Estimate Person Pose?
  87. 87. 3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning
  88. 88. milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion
  89. 89. High Resolution Point Clouds from mmWave Radar
  90. 90. RadarSLAM: Radar based Large-Scale SLAM in All Weathers
  91. 91. Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm
  92. 92. RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals
  93. 93. RFPose-OT: RF-Based 3D Human Pose Estimation via Optimal Transport Theory
  94. 94. Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on
  95. 95. Diffusion Model is a Good Pose Estimator from 3D RF-Vision
  96. 96. Enabling Visual Recognition at Radio Frequency (PanoRadar)
  97. 97. Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion
  98. 98. Habitat: A Platform for Embodied AI Research
  99. 99. Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning
  100. 100. DexMV
  101. 101. Habitat 2.0
  102. 102. ManiSkill
  103. 103. ProcTHOR
  104. 104. SAPIEN: A SimulAted Part-based Interactive ENvironment
  105. 105. BEHAVIOR-1K
  106. 106. Habitat 3.0
  107. 107. Isaac Lab
  108. 108. MuJoCo Playground
  109. 109. RT-1: Robotics Transformer for Real-World Control at Scale
  110. 110. 3D Diffusion Policy (DP3)
  111. 111. Octo: An Open-Source Generalist Robot Policy
  112. 112. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
  113. 113. RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches
  114. 114. 3D-VLA
  115. 115. DexVLA
  116. 116. GR-2: Generative Video-Language-Action Model
  117. 117. OpenHelix
  118. 118. OpenVLA-OFT
  119. 119. RDT-1B: Diffusion Foundation Model for Bimanual Manipulation
  120. 120. RoboMamba
  121. 121. SpatialVLA
  122. 122. TinyVLA
  123. 123. TraceVLA: Visual Trace Prompting
  124. 124. Learning Transferable Visual Models From Natural Language Supervision
  125. 125. Flamingo: a Visual Language Model for Few-Shot Learning
  126. 126. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
  127. 127. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  128. 128. DeepSeek-VL: Towards Real-World Vision-Language Understanding
  129. 129. EVA-CLIP: Improved Training Techniques for CLIP at Scale
  130. 130. FILIP: Fine-grained Interactive Language-Image Pre-Training
  131. 131. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
  132. 132. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  133. 133. Improved Baselines with Visual Instruction Tuning
  134. 134. OBELICS
  135. 135. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
  136. 136. Sigmoid Loss for Language Image Pre-Training
  137. 137. What matters when building vision-language models?
  138. 138. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  139. 139. The Llama 3 Herd of Models
  140. 140. LLaVA-NeXT-Interleave
  141. 141. LLaVA-OneVision: Easy Visual Task Transfer
  142. 142. Long-CLIP: Unlocking the Long-Text Capability of CLIP
  143. 143. Pixtral 12B
  144. 144. Dream to Control: Learning Behaviors by Latent Imagination
  145. 145. World Models
  146. 146. DayDreamer
  147. 147. Mastering Atari with Discrete World Models
  148. 148. Dreamer V3: Mastering Diverse Domains through World Models
  149. 149. Transformers are Sample-Efficient World Models
  150. 150. TWM: Transformer-based World Models
  151. 151. 1X World Model Challenge
  152. 152. Cosmos World Foundation Model Platform
  153. 153. GAIA-1
  154. 154. Genie: Generative Interactive Environments
  155. 155. Navigation World Models
  156. 156. UniSim