Embodied AI Reading Station
Topic I · Vision-Language Foundation

VLM Foundation: Vision-Language Foundations
22 papers · 3 founder · 12 classic · 7 frontier

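Put images and text into the same coordinate system: this is the visual groundwork for embodied AI. CLIP first tied the word "dog" to what a dog looks like, and every later model in which a robot looks at an image while listening to a person is, at its core, built on the same alignment.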
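To make the "same coordinate system" idea concrete, here is a minimal sketch of CLIP-style zero-shot matching. It assumes the image and the candidate captions have already been encoded into one shared space. The 3-dimensional vectors, the caption strings, and the temperature value are all invented for illustration and do not come from any real model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.07):
    """Turn similarity scores into probabilities (temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical text embeddings; a real system would get these from a text encoder.
TEXT_EMBEDDINGS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}


def zero_shot_classify(image_embedding, text_embeddings=TEXT_EMBEDDINGS):
    """Pick the caption whose embedding lies closest to the image embedding."""
    labels = list(text_embeddings)
    sims = [cosine(image_embedding, text_embeddings[label]) for label in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))


# Hypothetical embedding of a new, unseen cat image.
new_cat_image = [0.2, 0.8, 0.1]
label, probs = zero_shot_classify(new_cat_image)
print(label)
```

Nothing here is trained for the "cat" task. The label is chosen purely by proximity in the shared space, which is why new categories can be added just by writing new captions.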


Primer · 3 starter papers

Read these three first

CLIP learns to understand images → BLIP-2 bridges to an LLM → LLaVA makes vision part of the conversation.

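  1. Learning Transferable Visual Models From Natural Language Supervision · 2021 · ICML · ⭐⭐⭐

    Teaches an AI to recognize images and text together by placing 400 million image-text pairs from the web into one shared coordinate space. Afterwards, given the phrase "a cat", it can pick out the cat in new images without being retrained for each new task.

  2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · 2023 · ICML · ⭐⭐⭐⭐

    BLIP-2 leaves two large models untouched, one that looks at images and one that produces language, and trains only a small "translator" between them. That alone is enough to teach the system to describe what it sees.

  3. Improved Baselines with Visual Instruction Tuning · 2024 · CVPR · ⭐⭐

    Fits a chat AI with a pair of "image-reading glasses". Upgrading the glasses from one lens to two, and showing the model more images that contain text, was enough to top the leaderboards on visual question answering.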
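The "one lens versus two lenses" picture in the third entry can be read as swapping a single linear projection for a two-layer MLP between the vision encoder and the language model. The sketch below illustrates that reading only. The dimensions, random weights, and class names are assumptions, and it is not the authors' implementation. BLIP-2's "translator" is a different and larger module (a querying transformer), which this sketch does not model.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b, where weight has one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init_layer(in_dim, out_dim, rng):
    """Random weights and zero bias (illustrative initialization only)."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class LinearProjector:
    """'One lens': a single linear map from vision features to LLM token space."""

    def __init__(self, vision_dim, llm_dim, rng):
        self.layer = init_layer(vision_dim, llm_dim, rng)

    def __call__(self, patch_features):
        return [linear(p, *self.layer) for p in patch_features]


class MLPProjector:
    """'Two lenses': linear -> GELU -> linear."""

    def __init__(self, vision_dim, llm_dim, rng):
        self.fc1 = init_layer(vision_dim, llm_dim, rng)
        self.fc2 = init_layer(llm_dim, llm_dim, rng)

    def __call__(self, patch_features):
        tokens = []
        for p in patch_features:
            hidden = [gelu(h) for h in linear(p, *self.fc1)]
            tokens.append(linear(hidden, *self.fc2))
        return tokens


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4  # toy sizes, far smaller than real models

# Stand-in for frozen vision-encoder output: one feature vector per image patch.
patches = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

linear_tokens = LinearProjector(VISION_DIM, LLM_DIM, rng)(patches)
visual_tokens = MLPProjector(VISION_DIM, LLM_DIM, rng)(patches)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 16
```

Either projector turns each image patch into a vector of the same width as the language model's token embeddings, so the projected patches can sit alongside the text tokens in the prompt.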


Distribution · By year

How the 22 papers spread across 2021 to 2024.

Legend: founder · classic · frontier

All papers · Sorted by era

All 22 papers in VLM Foundation.

era · year · title · venue
founder · 2021 · Learning Transferable Visual Models From Natural Language Supervision · ICML
founder · 2022 · Flamingo: a Visual Language Model for Few-Shot Learning · NeurIPS
founder · 2023 · LLaVA: Visual Instruction Tuning · NeurIPS
classic · 2022 · BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation · ICML
classic · 2022 · FILIP: Fine-grained Interactive Language-Image Pre-Training · ICLR
classic · 2023 · 3DShape2VecSet: 3D Shape Representation for Diffusion Models · SIGGRAPH
classic · 2023 · BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models · ICML
classic · 2023 · EVA-CLIP: Improved Training Techniques for CLIP at Scale · arXiv
classic · 2023 · OBELICS · NeurIPS
classic · 2023 · Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv
classic · 2023 · Sigmoid Loss for Language Image Pre-Training · ICCV
classic · 2024 · DeepSeek-VL: Towards Real-World Vision-Language Understanding · arXiv
classic · 2024 · Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks · CVPR
classic · 2024 · InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks · CVPR
classic · 2024 · Improved Baselines with Visual Instruction Tuning · CVPR
frontier · 2024 · What matters when building vision-language models? · NeurIPS
frontier · 2024 · Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling · arXiv
frontier · 2024 · The Llama 3 Herd of Models · arXiv
frontier · 2024 · LLaVA-NeXT-Interleave · arXiv
frontier · 2024 · LLaVA-OneVision: Easy Visual Task Transfer · arXiv
frontier · 2024 · Long-CLIP: Unlocking the Long-Text Capability of CLIP · ECCV
frontier · 2024 · Pixtral 12B · arXiv