VLM Foundation · Plate Nº 134

OBELICS

6 min read · 2181 characters · ⭐⭐⭐ · short summary

This note is based on the abstract and publicly available material; the full paper has not been read.

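In one sentence (TL;DR)

HuggingFace cleaned 141 million web pages in which images and text are interleaved, packaged them, and released them openly, so that anyone can train a model that reads images inside long documents the way DeepMind did.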

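What is the setting: an everyday analogy

Imagine scrolling through a travel post on Xiaohongshu: the author writes two paragraphs starting "Today we went to Arashiyama in Kyoto", adds a photo of the bamboo grove, then writes "We had yudofu for lunch" followed by a photo of the restaurant. You know the second photo shows the restaurant because it sits inside that passage: the image and the text around it tell one story together.

Now flip the question: if you want to teach an AI to read images inside long text this way, what teaching material do you feed it?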

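  • Image with a single-sentence label: every image gets one sentence such as "This is a bowl of yudofu". Clean, but stripped of context, like cutting the post into single images that each carry a one-line tag. This is what image-caption datasets such as LAION and COCO provide
  • Real web pages with interleaved images and text: the "paragraph, image, paragraph, image" order of the original post is kept intact. This is how people actually read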
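
To make the contrast between the two kinds of teaching material concrete, here is a minimal sketch of how each might be represented as a record. The class names (`CaptionPair`, `InterleavedDoc`, `TextBlock`, `ImageRef`), the `context_for_image` helper, and the example URLs are all hypothetical and chosen for illustration; they are not the schema of LAION, COCO, or OBELICS.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass
class CaptionPair:
    """One image with one standalone sentence (image-caption style)."""
    image_url: str
    caption: str


@dataclass
class TextBlock:
    text: str


@dataclass
class ImageRef:
    url: str


Item = Union[TextBlock, ImageRef]


@dataclass
class InterleavedDoc:
    """A whole page: text and images kept in their original reading order."""
    items: List[Item]

    def context_for_image(self, index: int) -> Tuple[Optional[str], Optional[str]]:
        """Return the nearest text before and after the image at `index`."""
        before = next(
            (it.text for it in reversed(self.items[:index]) if isinstance(it, TextBlock)),
            None,
        )
        after = next(
            (it.text for it in self.items[index + 1:] if isinstance(it, TextBlock)),
            None,
        )
        return before, after


# Hypothetical example data mirroring the travel post above.
pair = CaptionPair("https://example.com/tofu.jpg", "A bowl of yudofu.")
doc = InterleavedDoc([
    TextBlock("Today we went to Arashiyama in Kyoto."),
    ImageRef("https://example.com/bamboo.jpg"),
    TextBlock("We had yudofu for lunch."),
    ImageRef("https://example.com/restaurant.jpg"),
])
```

With the interleaved record, `doc.context_for_image(3)` recovers the lunch sentence that explains the restaurant photo; the caption pair has no comparable neighbor to look up.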

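DeepMind's Flamingo showed that a model trained on the second kind of material can pick up a task from just a few examples (this is called in-context learning and is explained further below). But Flamingo's training corpus, M3W, is closed, so outsiders who want to reproduce it cannot get the data. OBELICS puts this "interleaved textbook" out in the open for everyone to use.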

Plate Nº I · OBELICS: scenario sketch of the real-world problem the paper addresses

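What came before

  • LAION-5B / COCO / CC3M: an image plus a single-sentence caption. Large in aggregate but without document context, so a model cannot learn to read images within long text from them
  • Flamingo (DeepMind, 2022): used the closed M3W dataset (43 million web pages) to demonstrate the power of interleaved image-text training, but released neither the data nor the model
  • MMC4 (Multimodal C4): a slightly earlier open attempt. It does not extract natively from the HTML DOM tree; instead it inserts images back into already-extracted C4 text, so image-text alignment quality is lower
  • WIT / Wikipedia-based datasets: high quality but small, and skewed toward encyclopedic content
  • The overall bind: the open-source community wanted to reproduce Flamingo's few-shot multimodal ability but was stuck on data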

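The paper's key idea

Analogy: if you copy a recipe but cut out all the pictures and set them aside, you are lost when you come back to "step 3: add scallions" and want to see what the scallions look like. An image and the text around it must stay in their original order, otherwise information is lost.

Core point: the structure of interleaved image-text is itself a valuable signal. A paragraph, an image, another paragraph, another image: this order implicitly encodes which text refers to which image. Extraction must therefore preserve the native order of the HTML document rather than splitting images from text and stitching them back together.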

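Concretely:

  1. Start from Common Crawl rather than from image hosts or stock libraries, so the corpus distribution stays close to real web pages
  2. Preserve DOM order: web page → simplified DOM tree → a sequence [text, image, text, image, ...] emitted in order of appearance
  3. Filter at scale: drop anything pornographic, low quality, duplicated, too short in text, too small in image size, or with a lopsided image-to-text ratio
  4. Fully open: the dataset, the filtering code, the training code, and the trained IDEFICS model weights are all released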
Plate Nº II · OBELICS: method sketch of the core pipeline

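How it works (method)

Step 1: raw collection. Like sweeping a flea market: haul the whole pile back first and sort it later. Starting from 25 Common Crawl dumps (a dump is a snapshot archive of public web pages from one crawl, released roughly monthly), the initial page count is on the order of tens of billions (see the paper for the exact figure). URL deduplication, English filtering, and HTML parsing then yield a pool of pages that contain images.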

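Step 2: DOM simplification and serialization. Like a renovation crew gutting a house: keep the load-bearing walls and the furniture, rip out the wallpaper and the drop ceiling. This is the most distinctive stage of OBELICS.

Hold on, one beat slower: what is the DOM? After receiving HTML, the browser parses it into a tree: <html> is the root element, <body> holds the visible content, and <div>, <p>, and <img> branch out beneath it. "DOM order" is the order in which nodes appear when this tree is traversed depth-first, top to bottom and left to right.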

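Parse the HTML into a DOM tree and keep only the nodes that matter for reading (paragraphs, images, headings, lists), discarding navigation bars, ads, scripts, styles, and sidebars. Then flatten the surviving nodes, in the order they appear in the DOM, into a linear sequence such as [text_block_1, img_1, text_block_2, img_2, ...]. The model consumes this sequence directly during training and naturally learns that the text before an image tends to introduce it and the text after it tends to elaborate.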
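
The flattening step can be sketched with Python's standard-library `html.parser`. This is a minimal illustration and not the OBELICS extraction code: the tag sets `SKIP_TAGS` and `TEXT_TAGS`, the class `InterleavedExtractor`, the function `flatten`, and the sample page are all assumptions made for this example, and real pages need far more careful handling of ads, nesting, and malformed markup.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; a real pipeline uses richer rules.
SKIP_TAGS = {"nav", "aside", "script", "style", "footer"}
TEXT_TAGS = {"p", "h1", "h2", "h3", "li"}


class InterleavedExtractor(HTMLParser):
    """Walk the HTML in document order and emit text and image items."""

    def __init__(self):
        super().__init__()
        self.sequence = []     # list of ("text", str) or ("image", src)
        self._skip_depth = 0   # > 0 while inside a discarded subtree
        self._text_depth = 0   # > 0 while inside a kept text element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        if self._skip_depth:
            return
        if tag in TEXT_TAGS:
            self._text_depth += 1
        elif tag == "img":
            self._flush()  # text seen so far comes before the image
            src = dict(attrs).get("src")
            if src:
                self.sequence.append(("image", src))

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth:
            return
        if tag in TEXT_TAGS and self._text_depth:
            self._text_depth -= 1
            if self._text_depth == 0:
                self._flush()

    def handle_data(self, data):
        if not self._skip_depth and self._text_depth:
            self._buffer.append(data)

    def _flush(self):
        text = " ".join("".join(self._buffer).split())
        self._buffer = []
        if text:
            self.sequence.append(("text", text))


def flatten(html):
    """Return the page as a linear [text, image, text, ...] sequence."""
    parser = InterleavedExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.sequence


sample_html = (
    "<html><body><nav><p>Home</p><img src='logo.png'></nav>"
    "<h1>Kyoto</h1><p>We visited Arashiyama.</p><img src='bamboo.jpg'>"
    "<p>Lunch was yudofu.</p><script>var x = 1;</script></body></html>"
)
```

Calling `flatten(sample_html)` drops the navigation bar (including its logo image) and the script, and keeps the heading, the two paragraphs, and the bamboo photo in their original order.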

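Step 3: multi-level filtering. Like the successive checkpoints at airport security: ID, luggage, liquids, and electronics are each screened in turn. Filters run at the document level (language, character count, sentence completeness), the paragraph level (repetition, ad markers), the image level (resolution, aspect ratio, NSFW, logo detection), and the document-image pairing level (whether image and text are related, whether alt text is empty). The paper reports the fraction remaining after each level (see the paper for the exact figures).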
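
A filter cascade of this kind might look like the sketch below. It covers only the image, paragraph, and document levels, and it ignores the interleaving order for brevity. Every threshold and name here (`MIN_SIDE`, `MAX_ASPECT`, `MAX_NSFW`, `keep_image`, `filter_document`, and so on) is a placeholder invented for illustration; none of the values are taken from the paper, and the `nsfw_score` field is assumed to come from some upstream classifier.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Placeholder thresholds, not the values used by OBELICS.
MIN_SIDE = 150             # minimum length of the shorter image side, in pixels
MAX_ASPECT = 2.0           # maximum long-side / short-side ratio
MAX_NSFW = 0.5             # maximum score from an assumed NSFW classifier
MIN_PARAGRAPH_WORDS = 4
MIN_DOC_WORDS = 10
MAX_IMAGES_PER_DOC = 30


@dataclass
class Image:
    width: int
    height: int
    nsfw_score: float = 0.0


@dataclass
class Document:
    paragraphs: List[str]
    images: List[Image] = field(default_factory=list)


def keep_image(img: Image) -> bool:
    """Image level: reject tiny, extremely elongated, or flagged images."""
    short, long_ = sorted((img.width, img.height))
    return (
        short >= MIN_SIDE
        and long_ / short <= MAX_ASPECT
        and img.nsfw_score <= MAX_NSFW
    )


def keep_paragraph(text: str, seen: Set[str]) -> bool:
    """Paragraph level: reject very short or repeated paragraphs."""
    normalized = " ".join(text.lower().split())
    if len(normalized.split()) < MIN_PARAGRAPH_WORDS or normalized in seen:
        return False
    seen.add(normalized)
    return True


def filter_document(doc: Document) -> Optional[Document]:
    """Run the levels in turn; return None if the document is dropped."""
    images = [img for img in doc.images if keep_image(img)]
    seen: Set[str] = set()
    paragraphs = [p for p in doc.paragraphs if keep_paragraph(p, seen)]
    n_words = sum(len(p.split()) for p in paragraphs)
    # Document level: need at least one image, not too many, and enough text.
    if not images or len(images) > MAX_IMAGES_PER_DOC or n_words < MIN_DOC_WORDS:
        return None
    return Document(paragraphs, images)
```

Running the cheap per-item checks first and the document-level check last means a page is judged on what survives cleaning, not on its raw content.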

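Step 4: deduplication. Like plagiarism software catching copied homework: the same passage reposted on another site is still recognized. Near-duplicate removal based on MinHash + LSH keeps a blog post syndicated across several sites from being seen repeatedly during training. The result is 141 million documents, 353 million images, and about 115 billion tokens (order-of-magnitude figures from the abstract; see the paper for exact values). IDEFICS-9B / 80B are then trained on this data as an open reproduction of Flamingo.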
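
The idea behind MinHash + LSH can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not the deduplication code used for OBELICS: the parameters `NUM_HASHES` and `BANDS`, the word-level 3-gram shingling, and all function names are choices made for this example.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 64                 # signature length (assumed)
BANDS = 16                      # number of LSH bands (assumed)
ROWS = NUM_HASHES // BANDS      # signature rows per band


def shingles(text, k=3):
    """Set of overlapping k-word windows, case and whitespace normalized."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _hash(seed, shingle):
    """One member of a family of hash functions, selected by `seed`."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """Signature: for each hash function, the minimum over all shingles."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES))


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """LSH banding: documents sharing any whole band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, sig[band * ROWS:(band + 1) * ROWS])
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical corpus: "b" is a reposted copy of "a" with different casing and spacing.
docs = {
    "a": "Today we visited the bamboo grove in Arashiyama and had yudofu for lunch",
    "b": "today we visited   the bamboo grove in arashiyama and had yudofu for lunch",
    "c": "Quarterly earnings rose sharply as chip demand outpaced supply across every region",
}
signatures = {doc_id: minhash(text) for doc_id, text in docs.items()}
duplicates = candidate_pairs(signatures)
```

Banding is what makes this scale: instead of comparing every pair of signatures, only documents that land in the same bucket are compared, and the candidates can then be confirmed with `estimated_jaccard` against a chosen threshold.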

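What the experiments do

  • Dataset statistics: OBELICS vs MMC4 vs LAION, comparing the distributions of document length, images per document, image resolution, and text quality score
  • Training IDEFICS: LLaMA-1 + a vision encoder + Flamingo-style cross-attention, trained on OBELICS at two scales, 9B and 80B
  • Downstream benchmarks: zero-shot / few-shot evaluation on multimodal tasks such as VQA, image captioning, and visual dialogue, compared against closed Flamingo models of similar size
  • Ablations: training on LAION only vs OBELICS only vs a mixture, to measure the marginal contribution of interleaved data to in-context learning
  • Direction of the conclusions: at equal training tokens, interleaved data markedly improves few-shot performance; this supports the Flamingo paper's claim and shows it can be reproduced on open data (see the paper for the size of the gains)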

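A few new terms you should know

  • interleaved image-text: images and text mixed into one sequence in their real order of appearance, as opposed to "image plus single-sentence caption" pairs
  • Common Crawl: a non-profit that regularly crawls the public web (roughly monthly) and archives it for research use; the raw material for OBELICS
  • DOM (Document Object Model): the tree structure a browser builds after parsing HTML; its nodes are elements (div / img / p)
  • MinHash + LSH: a pair of tools. The former turns a document into a short fingerprint, the latter quickly finds similar fingerprints; together they perform near-duplicate detection
  • in-context learning: a large model's ability to learn a task from a few examples in the prompt without updating its parameters; the core multimodal capability that Flamingo emphasizes
  • IDEFICS: HuggingFace's open Flamingo reproduction trained on OBELICS, at two scales, 9B and 80B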
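
To show what "a few examples in the prompt" means for a multimodal model, here is a minimal sketch that assembles a few-shot prompt as an interleaved sequence of image and text segments. The segment dictionaries, the `build_few_shot_prompt` function, the instruction wording, and the file names are hypothetical; this is not the input format of IDEFICS or Flamingo.

```python
def build_few_shot_prompt(examples, query_image, instruction="Describe the image."):
    """Interleave solved (image, answer) examples, then the unsolved query.

    `examples` is a list of (image_ref, answer) tuples. The model is expected
    to continue the final text segment; its weights are never updated.
    """
    segments = []
    for image_ref, answer in examples:
        segments.append({"type": "image", "ref": image_ref})
        segments.append({"type": "text", "text": f"{instruction} Answer: {answer}\n"})
    segments.append({"type": "image", "ref": query_image})
    segments.append({"type": "text", "text": f"{instruction} Answer:"})
    return segments


# Hypothetical two-shot prompt.
prompt = build_few_shot_prompt(
    [("bamboo.jpg", "A path through a bamboo grove."),
     ("tofu.jpg", "A pot of simmered tofu.")],
    query_image="restaurant.jpg",
)
```

The prompt has the same "image, text, image, text" shape as a web page, which is one intuition for why pretraining on interleaved documents helps few-shot use.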

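How it relates to other papers

  • Direct counterpart: DeepMind Flamingo (2022). OBELICS is its open reproduction in both data and model
  • Predecessor: MMC4. It also aims at open interleaved image-text, but OBELICS is cleaner thanks to native DOM extraction
  • Contrast: LAION-5B. Pure image-caption data, large but lacking interleaved structure; complementary rather than a substitute
  • Successors: Idefics2 (2024), Idefics3, and a series of open VLMs list OBELICS as one of the core components of their training data
  • Ecosystem impact: together with The Stack (code) and RedPajama (text), it forms the multimodal piece of the 2023 trio of open foundation corpora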

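How I suggest reading it

  1. Start with the data section (§3) of the Flamingo paper: understand why interleaved image-text is needed and what "M3W" looks like. All of OBELICS's motivation comes from here
  2. Read the data pipeline diagram in §3 of the OBELICS paper: focus on DOM simplification and the filter cascade, which are the core technical contribution
  3. Skip the experimental details and go straight to the ablation table in §5: the gap between "OBELICS only", "LAION only", and "mix" on few-shot benchmarks is the conclusion
  4. Extra: browse a few real samples on the HuggingFaceM4/OBELICS dataset card on HuggingFace; it is more instructive than reading 100 lines of description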

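Why it is worth reading

  • Historical standing: one of the turning points for the open multimodal community in 2023. Without OBELICS there would be no IDEFICS and no rapid iteration of the open VLMs that followed
  • Plain but effective method: nothing fancy anywhere in the paper, just careful cleaning of Common Crawl data, carried out thoroughly. This kind of engineering-first paper is very valuable to practitioners
  • What it means for you (embodied AI / VLM track): knowing what a vision-language model's training data looks like and how its filtering logic is written is the basis for judging any VLM's capability ceiling. What a model can do ultimately depends on what it has seen
  • A template for reproducibility: data, code, and model are all open. It is a benchmark case of the open community reproducing closed work, and the methodology transfers to any "let's open-source X" project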

Cite this note
BibTeX
@online{eai_obelics_2026,
  title       = {(readable note) OBELICS},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2023 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/obelics/}},
  organization = {Embodied AI Reading Station}
}
