VLM Foundation · Plate Nº 141

LLaVA-OneVision: Easy Visual Task Transfer

6 min read · 1,987 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

In one sentence (TL;DR)

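One recipe teaches a single model to understand single images, multiple images, and video at the same time. It is the first time the open-source community came close to GPT-4V on video.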

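What is the scenario

Imagine pulling up your phone's photo album and asking an AI three things:

  • "What is the cat in this photo doing?" (single image)
  • "I photographed two dishes. Which one looks more thoroughly cooked?" (multi-image comparison)
  • "In this 30-second surveillance clip, when does the child fall?" (video)

Before 2024, the open-source community would have handed you three different apps: one for single images, one for comparing photos, and one for video, each with its own model, training material, and exams. A model trained well in the single-image app still had to start over from kindergarten when moved to the video app.

LLaVA-OneVision effectively merges the three apps into one: "the same AI, usable in all three scenarios". It also found that teaching the model to "spot the difference between two images" first makes it sharper on video afterwards. A video is essentially "many images arranged in time order", so the comparison skill built by multi-image training transfers over.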

Plate Nº I. LLaVA-OneVision scenario sketch: the real-world problem this paper sets out to solve

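How people did it before

  • The LLaVA-1.5 / LLaVA-NeXT series focused on single-image understanding; multi-image and video support were patched on later, piece by piece
  • Video VLMs were usually built from scratch as separate systems (VideoChat, Video-LLaVA, etc.), with data that was not shared with single-image models
  • Multi-image comparison tasks (Mantis etc.) were treated as a small third track, limited in scale and short on data
  • Closed-source models (GPT-4V, Gemini) are presumably trained jointly across all three scenarios, but neither weights nor data are available
  • What the open-source community lacked was not model architecture but "a high-quality dataset covering all three scenarios + a split into training stages"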

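The key idea of this paper

It works like teaching a child to read: first single characters (single images), then comparison (multiple images), and only at the end cartoons (video). Each stage is a step up to the next, and nothing has to be taught again.

The core bet: visual tasks can "borrow strength from each other". As long as the earlier curriculum is fed correctly, skills learned on single images can "grow" onto multi-image and video by themselves, with no need to build a new model just for video.

Specifically:

  • Training is split into several stages (language-image alignment → high-quality knowledge injection → visual instruction tuning), and the data served at each stage is carefully proportioned
  • Video does not start from zero (cold start). It is built on top of a model that "already knows how to look at single and multiple images", so the video data can be small in volume but has to be high in quality
  • The vision encoder is SigLIP and the language part is Qwen-2. The architecture itself has no tricks; all of the innovation is concentrated in "what data to feed and in what order"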
Plate Nº II. LLaVA-OneVision method sketch: the core pipeline

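How it does it (method)

Architecture (as plain as a sandwich): eyes (the SigLIP vision encoder) + translator (the projector) + brain (the Qwen-2 LLM). This is almost identical to earlier LLaVA generations, with no fancy cross-modal attention or Q-Former added. The authors keep it simple on purpose, to make the point: "Look, the recipe alone can win, without relying on architecture."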

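Wait, slow down for a moment: what is a visual token here?

  • Imagine that the LLM (the language brain) only knows "words"; hand it an image and it is lost
  • So cut the image into a grid of cells and compress each cell into one "fake word" to feed to it. That "fake word" is called a visual token
  • One image = one fake sentence, several images = several fake sentences joined together, a video = a fake sentence built from a few sampled frames
  • To the LLM, all three cases are just "one long string of words", with no difference between them. That is the trick behind the unification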
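
The "fake word" idea above can be sketched in a few lines of NumPy. This is a toy illustration only, not the paper's encoder: `PATCH`, `EMBED_DIM`, `image_to_tokens`, `build_sequence`, and the random projection matrix are all hypothetical stand-ins (in the real model a SigLIP encoder plus a projector produces the tokens).

```python
import numpy as np

PATCH = 14      # hypothetical cell size in pixels
EMBED_DIM = 8   # hypothetical token width, kept tiny for readability


def image_to_tokens(image: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Cut an (H, W, 3) image into PATCH x PATCH cells; map each cell to one token."""
    h, w, c = image.shape
    rows, cols = h // PATCH, w // PATCH
    # Drop any ragged border so the grid divides evenly.
    grid = image[: rows * PATCH, : cols * PATCH]
    cells = grid.reshape(rows, PATCH, cols, PATCH, c).transpose(0, 2, 1, 3, 4)
    flat = cells.reshape(rows * cols, PATCH * PATCH * c)
    return flat @ proj  # shape: (rows * cols, EMBED_DIM)


def build_sequence(images: list, proj: np.ndarray) -> np.ndarray:
    """One image, several images, or sampled video frames all become one token sequence."""
    return np.concatenate([image_to_tokens(im, proj) for im in images], axis=0)


rng = np.random.default_rng(0)
proj = rng.normal(size=(PATCH * PATCH * 3, EMBED_DIM))  # stand-in for encoder + projector

single = build_sequence([rng.random((28, 28, 3))], proj)
multi = build_sequence([rng.random((28, 28, 3)) for _ in range(2)], proj)
video = build_sequence([rng.random((28, 28, 3)) for _ in range(4)], proj)
print(single.shape, multi.shape, video.shape)
```

The three results differ only in length along the first axis, which is the point of the analogy: downstream, the language model sees the same kind of sequence in every case.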

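Higher AnyRes (dynamic tiling): think of scanning a large poster when the scanner only fits A4 at a time. You cut the poster into A4 sheets, scan them one by one, and stitch the results together. A high-resolution image is cut into several sub-images that are encoded separately; multiple images are each scanned on their own and then joined; a video is sampled into a few frames along time and then scanned. In the end everything takes the same form: "a string of visual tokens + text".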
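
As a rough illustration of the "cut the poster into A4 sheets" step, the sketch below pads an image up to a multiple of a tile size and slices it into equal crops. It covers only the tiling idea; the function name `tile_image`, the zero padding, and the tile size of 384 are assumptions for this example, and the paper's actual scheme may involve further steps that are not reproduced here.

```python
import numpy as np


def tile_image(image: np.ndarray, tile: int) -> list:
    """Pad an (H, W, C) image to a multiple of `tile`, then cut it into tile x tile crops."""
    h, w, _ = image.shape
    pad_h = (-h) % tile  # rows needed to reach the next multiple of `tile`
    pad_w = (-w) % tile  # columns needed to reach the next multiple of `tile`
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)))  # zero padding
    crops = []
    for top in range(0, padded.shape[0], tile):
        for left in range(0, padded.shape[1], tile):
            crops.append(padded[top : top + tile, left : left + tile])
    return crops


poster = np.ones((500, 700, 3))
crops = tile_image(poster, tile=384)  # 384 is an illustrative value
print(len(crops), crops[0].shape)
```

Each crop has the fixed size an encoder expects, so a large image costs more tokens instead of being squeezed into one low-resolution view.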

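Training data recipe (like a child going through school)

  • Kindergarten: large volumes of ordinary image-text pairs for language-image alignment, so the model first recognizes "cats, dogs, cars"
  • Primary school: high-quality, knowledge-dense data (OCR, charts, documents) to instill "hard knowledge"
  • Secondary school: only now comes instruction tuning on a mix of single-image, multi-image, and video data (the exact ratios and dataset list require reading the paper)
  • The video data is relatively small in volume, but a small amount is enough because the two earlier stages laid the foundation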
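
The staged recipe above can be written down as a small data structure plus a loop in which every stage starts from the previous stage's result. This is a schematic sketch only: `Stage`, `CURRICULUM`, `run_curriculum`, and `fake_trainer` are hypothetical names, the data labels are copied from the bullets above, and no ratios or dataset sizes are given because this note does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    data: tuple  # coarse data categories only; exact mixes are not stated in this note


CURRICULUM = (
    Stage("language-image alignment", ("generic image-text pairs",)),
    Stage("high-quality knowledge", ("OCR", "charts", "documents")),
    Stage("visual instruction tuning", ("single-image", "multi-image", "video")),
)


def run_curriculum(stages, train_one_stage, state=None):
    """Train stage by stage; each stage resumes from the previous stage's state."""
    for stage in stages:
        state = train_one_stage(state, stage)
    return state


def fake_trainer(state, stage):
    # Hypothetical stand-in for a real training loop: it only records the order.
    return (state or []) + [stage.name]


history = run_curriculum(CURRICULUM, fake_trainer)
print(history)
```

The design point is that no stage resets the model: video data appears only in the last stage and relies on everything learned before it.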

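Evidence of task transfer: the authors found that the model also does well on many video test sets it "never specifically practiced on". They credit the multi-image stage: the model built up its "cross-image comparison" muscle on multiple images, so watching video (which is essentially cross-frame comparison) came naturally.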

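What the experiments do

  • Compare against open-source models such as LLaVA-NeXT, InternVL, and Qwen-VL on a large set of single-image benchmarks (MMBench, MMMU, MathVista, DocVQA, etc.)
  • Check on multi-image benchmarks (Mantis-Eval, BLINK, etc.) that the multi-image ability does not come "for free"
  • Compare against video-specific models on video benchmarks (VideoMME, MVBench, EgoSchema, etc.), with closed-source models such as GPT-4V as a reference
  • Run ablations on the effect of data ratios and the order of training stages (the exact ablation design requires reading the paper)
  • Train three model sizes, 0.5B / 7B / 72B, to verify scaling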

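A few new terms you should know

  • VLM (Vision-Language Model): a model that processes images and text together; images in, text out
  • AnyRes / Higher AnyRes: a dynamic-resolution scheme that cuts an image of any size into fixed-size patches before feeding them to the vision encoder, avoiding the information loss of brute-force resizing
  • SigLIP: an image-text alignment model from Google. Compared with CLIP, it replaces the softmax loss with a sigmoid loss, which makes training more stable; here it serves as the visual feature extractor
  • Visual Instruction Tuning: supervised fine-tuning of a VLM on data in a "look at the image and answer" format; the signature move of the LLaVA series
  • Task Transfer: training on task A and finding that the model also does well on task B, which it was never specifically trained on; the central claim of this paper
  • Visual Token: the sequence of vectors an image becomes after slicing and encoding; they look like word embeddings, so the LLM can process them the same way as text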
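
To make the "sigmoid instead of softmax" remark in the SigLIP entry concrete, here is a minimal NumPy sketch of a pairwise sigmoid loss: every image-text pair in the batch is scored as an independent binary decision (match or no match) rather than normalized against the rest of the batch. The function name `sigmoid_pair_loss` and the fixed `temperature` and `bias` values are assumptions for illustration; in a trained model these would be learned parameters, and this is not the SigLIP reference implementation.

```python
import numpy as np


def sigmoid_pair_loss(img: np.ndarray, txt: np.ndarray,
                      temperature: float = 10.0, bias: float = -10.0) -> float:
    """Pairwise sigmoid loss over L2-normalized image and text embeddings, shape (N, D)."""
    logits = temperature * (img @ txt.T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0  # +1 for matching pairs, -1 for all others
    # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp keeps it numerically stable.
    return float(np.logaddexp(0.0, -labels * logits).sum() / len(img))


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

matched = sigmoid_pair_loss(emb, emb)         # each image paired with its own text
shuffled = sigmoid_pair_loss(emb, emb[::-1])  # pairs deliberately misaligned
print(matched < shuffled)
```

Because each pair is judged on its own, the loss needs no normalization across the whole batch, which is the property the glossary entry refers to.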

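How it relates to other papers

  • Direct predecessors: LLaVA, LLaVA-1.5, LLaVA-NeXT. The architecture is inherited almost one-to-one; OneVision is an extension along the data dimension
  • Contemporary open-source rivals: InternVL-2.5, Qwen-VL, DeepSeek-VL, and Pixtral-12B follow a similar route (unified architecture + large amounts of data), but each has its own recipe
  • Vision encoder: uses SigLIP as the front end, which belongs to the same family as CLIP / EVA-CLIP
  • Contrast with video-only approaches: it stands against "video-specialist" designs such as Video-LLaVA and VideoChat; OneVision argues that video does not need a dedicated architecture
  • Embodied connection: important for robot VLAs such as OpenVLA and RT-2. The vision tower of a VLA is a VLM, so an "all-scenario unified" pretrained tower like OneVision can be carried over directly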

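How I suggest reading it

  1. Start with the abstract + Figure 1 (the data recipe overview) + the main table, to pin down what "unifying three scenarios" means concretely and how large the gain is
  2. Jump to the method section for the training-stage split and the data mixing ratios. This is the real contribution; the architecture part can be skimmed
  3. Read the ablations: which stage matters most? How much did video improve once multi-image data was added? This is where you judge how credible the method is
  4. If you want to build downstream applications (embodied / agent), check whether the 7B model's numbers are good enough; 72B is too expensive to deploy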

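Why it is worth reading

  • It marks an important turning point for open-source VLMs in 2024: architectures have stabilized and competition has shifted to data engineering
  • For people working on embodied AI, it is currently one of the more convenient candidates for a "general-purpose vision tower": single-image, multi-image, and video inputs all plug in without swapping the backbone
  • It turns "task transfer" from a slogan into quantifiable experiments, showing which scenarios transfer well and which do not hold up
  • Although not all of the data is open-sourced, the training recipe is written up fairly clearly, which makes it good teaching material for anyone who wants to reproduce VLM training
  • Going back to LLaVA-1.5 / Qwen-VL after reading it makes it clearer "where VLMs actually improved over these two years": most of the delta is not in network architecture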

Cite this note
BibTeX
@online{eai_llava_onevision_2026,
  title       = {(readable note) LLaVA-OneVision: Easy Visual Task Transfer},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/llava-onevision/}},
  organization = {Embodied AI Reading Station}
}
