VLM Foundation · Plate Nº 139

The Llama 3 Herd of Models

6 min read · 1959 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public materials; the full paper was not read.

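In one sentence (TL;DR)

Meta published the full "recipe" for training the Llama 3 models: which ingredients went in, how many GPUs were used, how long training ran, and what scores came out.

What is the setting: an everyday analogy

Imagine the three-Michelin-star restaurant you frequent. Normally it only brings finished dishes to the table and says nothing about recipes, ingredient sources, or cooking temperatures. One day it suddenly hands over the entire kitchen manual: which farm the beef comes from, which burner is used, how many minutes at what temperature, how many judges were invited to the tasting, and what scores they gave, all laid out at once. The Llama 3 report is a "full recipe" at that level. The competitors on the market are closed-source restaurants such as GPT-4 and Claude, which let you taste the dish but never see the kitchen. Meta simply opens the kitchen door and tells you how much it really costs to train a frontier model.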

Plate Nº I. The Llama 3 Herd of Models: scenario sketch, the real-world problem this paper sets out to solve

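How people did it before

  • Closed-source camp (GPT-4 / Gemini / Claude): only APIs and limited technical reports are released; data scale, compute, and training details stay hidden
  • Earlier Llama (Llama 2): open weights plus a fairly coarse report, with no multimodal capability
  • Other open base models (early versions of Mistral / Qwen / DeepSeek): smaller in scale, or weights released without training curves
  • Multimodal attachment approaches (LLaVA / BLIP-2): vision is attached to a small language model, but the base itself is not frontier scale
  • Result: the open-source community lacked a reference implementation that is close to GPT-4 level, fully transparent about its training stack, and ships with a vision branch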

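The key ideas of this paper

Three things done together:

  1. Push scale to 405B: the first time an open model directly challenges the scale of closed-source SOTA, showing that the open community can reach the frontier
  2. Make the full training stack transparent: data pipeline, tokenizer, parallelism strategy, training loss curves, failure recovery, and scaling law fits are all written into the report
  3. Attach vision adapters afterwards: keep the language backbone untouched and connect the image encoder through cross-attention adapters, so that retraining does not damage language ability

The core stance is "scale + data quality + engineering stability = most of the capability". No exotic architectural tricks are introduced (it is still a dense Transformer, with no MoE).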

Plate Nº II. The Llama 3 Herd of Models: method sketch, the core pipeline

How it does it (method)

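Pretraining data, like sourcing ingredients: first crawl raw material from the web, then run it through a quality-control line that picks out the rotten leaves. The corpus is about 15T tokens (Llama 2 used 1.8T, so roughly 8x more), with a higher share of multilingual, code, and reasoning samples. The pipeline includes deduplication, quality classifiers, toxicity filtering, and removal of personal information. The mix ratio is not guesswork either: small models serve as proxies to try different mixes, and the best-performing mix is then fed to the large model.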
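The stages named above (deduplication, quality filtering, mix ratios) can be sketched in a few lines. This is a toy illustration and not the pipeline from the report: `quality_score` is a crude heuristic standing in for a learned classifier, and the source names and mix weights are made-up assumptions.

```python
import hashlib
import random


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup(docs):
    """Exact dedup on normalized text; production pipelines add fuzzy matching."""
    seen, kept = set(), []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


def quality_score(doc):
    """Toy stand-in for a quality classifier: share of purely alphabetic words."""
    words = doc.split()
    return sum(w.isalpha() for w in words) / len(words) if words else 0.0


def filter_quality(docs, threshold=0.6):
    """Keep documents whose toy quality score reaches the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]


def sample_mix(sources, weights, n, seed=0):
    """Draw n documents; each draw first picks a source according to the mix weights."""
    rng = random.Random(seed)
    names = list(sources)
    picks = rng.choices(names, weights=[weights[s] for s in names], k=n)
    return [(name, rng.choice(sources[name])) for name in picks]


raw_web = [
    "The cat sat on the mat",
    "the  cat sat on the MAT",            # duplicate after normalization
    "$$$ 1234 !!! 5678",                  # low quality under the toy heuristic
    "Transformers use attention layers",
]
clean_web = filter_quality(dedup(raw_web))

# Hypothetical sources and mix weights, chosen only for illustration.
sources = {"web": clean_web, "code": ["def f(): return 1"], "math": ["2 + 2 = 4"]}
weights = {"web": 0.5, "code": 0.25, "math": 0.25}
batch = sample_mix(sources, weights, n=8)
```

In the proxy-model approach described above, several candidate `weights` dictionaries would each be used to train a small model, and the best-scoring mix would be carried over to the large run.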

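Architecture and scaling, like calculating the steel needed before constructing a building: a dense decoder-only Transformer with GQA (Grouped-Query Attention, where several query heads share one set of KV cache entries), RoPE, and SwiGLU, and a 128K context (trained at 8K first, then extended). Before building, the paper fits a scaling law (an empirical curve relating scale to performance) and uses it to work backwards to where the 405B model should stop at 15T tokens and what loss it should reach. Pretraining used on the order of 16K H100 GPUs and ran for months (exact figures require reading the original paper).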
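To make the scaling-law step concrete, here is a minimal sketch under simplifying assumptions: loss follows a pure power law in compute (real fits usually add an irreducible-loss term), the small-run data points are synthetic, and training compute is estimated with the common "6 FLOPs per parameter per token" rule of thumb rather than a figure taken from the report.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)


def training_flops(params, tokens):
    """Rule of thumb for dense Transformers: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


# Synthetic "small run" measurements generated from a known law (a=8, b=0.05).
small_compute = [1e18, 1e19, 1e20, 1e21]
small_loss = [8.0 * c ** -0.05 for c in small_compute]

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate to the compute implied by 405B parameters and 15T tokens.
target_compute = training_flops(405e9, 15e12)
predicted_loss = a * target_compute ** (-b)
print(f"fit: a={a:.3f}, b={b:.3f}, target compute={target_compute:.3e} FLOPs")
```

The point of the exercise is the workflow: fit on cheap runs, then read off the expected loss at the target budget before committing the GPUs.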

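Wait, slow down for a moment: what is a dense Transformer?

Dense means every parameter takes part in the computation on every forward pass. The contrast is MoE (Mixture of Experts), where each token activates only a small subset of experts, which saves compute. Llama 3 took the plain route of putting every parameter on the field.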
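The difference is easiest to see by counting the feed-forward parameters a single token actually touches. The sketch below uses toy sizes that are assumptions for illustration, ignores attention and embedding weights, and assumes a SwiGLU feed-forward block with three weight matrices.

```python
def dense_ffn_params(d_model, d_ff, n_layers):
    """Dense: every FFN weight is used for every token (SwiGLU has 3 matrices)."""
    return n_layers * 3 * d_model * d_ff


def moe_ffn_params(d_model, d_ff, n_layers, n_experts, top_k):
    """MoE: all experts are stored, but only top_k of them run per token."""
    per_expert = 3 * d_model * d_ff
    router = n_layers * d_model * n_experts        # small gating matrix per layer
    total = n_layers * n_experts * per_expert + router
    active = n_layers * top_k * per_expert + router
    return total, active


# Toy configuration (hypothetical sizes, not Llama 3's).
dense = dense_ffn_params(d_model=1024, d_ff=4096, n_layers=4)
moe_total, moe_active = moe_ffn_params(1024, 4096, 4, n_experts=8, top_k=2)

print(f"dense: stored = active = {dense:,}")
print(f"MoE:   stored = {moe_total:,}, active per token = {moe_active:,}")
```

With these toy numbers the MoE model stores about eight times the dense FFN weights but spends only about twice the dense compute per token, which is the trade the dense route declines to make.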

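Post-training, like repeated tasting and seasoning: first SFT (teaching the model to talk like a person), then DPO (Direct Preference Optimization, which directly tells the model "this answer is better than that one") combined with rejection sampling (generate N candidates and keep the best), repeated for about 6 rounds. The more complex PPO (the reinforcement learning approach) is not used, because DPO is more stable and cheaper.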
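The two ingredients can be sketched without any ML framework. The DPO loss below is the standard per-pair formula; everything else is an assumption for illustration: `toy_generate` and `toy_score` are hypothetical stand-ins for a language model and a reward model, and forming the preference pair from the best and worst sample is one simple choice, not necessarily how the report builds its pairs.

```python
import math
import random


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair; inputs are sequence log-probabilities."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed stably as softplus(-x)
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def best_and_worst_of_n(prompt, generate, score, n=4, seed=0):
    """Rejection sampling: draw n candidates and rank them with a scorer."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[0], ranked[-1]  # (chosen, rejected)


def toy_generate(prompt, rng):
    """Hypothetical generator: appends a random number of filler words."""
    return prompt + " " + " ".join(["detail"] * rng.randint(1, 5))


def toy_score(answer):
    """Hypothetical reward model: longer answers score higher."""
    return len(answer.split())


chosen, rejected = best_and_worst_of_n("Explain GQA:", toy_generate, toy_score)

# Policy identical to the reference: the loss is log(2).
loss_untrained = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy raised the chosen answer and lowered the rejected one: the loss drops.
loss_improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Note that no reward model appears inside `dpo_loss` itself; the preference signal enters only through which answer is labeled chosen.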

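Multimodal adapters, like adding side dishes to the main course: the language backbone, the main course, stays untouched, and next to it sits an image encoder (ViT-style) plus a set of cross-attention layers (which let the language model "see" image tokens). Training is staged: first freeze the backbone and train only the side-dish parts, then fine-tune jointly. Video and speech follow the same attachment idea, so one language backbone grows several perception branches.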
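A minimal single-head sketch of such an adapter follows, in plain Python so the arithmetic is visible. The tanh gate initialized at zero is borrowed from Flamingo-style designs and is an assumption here, not a detail confirmed for Llama 3; the identity weight matrices and tiny vectors are illustrative only.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_tokens, wq, wk, wv):
    """Text positions are the queries; image tokens supply keys and values."""
    keys = [matvec(wk, t) for t in image_tokens]
    values = [matvec(wv, t) for t in image_tokens]
    outputs = []
    for h in text_states:
        q = matvec(wq, h)
        scale = math.sqrt(len(q))
        attn = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(attn, values))
                        for i in range(len(values[0]))])
    return outputs


def adapter_layer(text_states, image_tokens, wq, wk, wv, gate):
    """Residual update h + tanh(gate) * cross_attention(h, image)."""
    attended = cross_attention(text_states, image_tokens, wq, wk, wv)
    g = math.tanh(gate)
    return [[h_i + g * a_i for h_i, a_i in zip(h, a)]
            for h, a in zip(text_states, attended)]


identity = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]          # two text positions
image = [[0.5, 0.5], [1.0, -1.0]]        # two image tokens

closed = adapter_layer(text, image, identity, identity, identity, gate=0.0)
opened = adapter_layer(text, image, identity, identity, identity, gate=1.0)
```

With the gate at zero the layer is an exact pass-through, so inserting it cannot change what the frozen language backbone computes; visual information flows in only as the gate is trained away from zero.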

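What the experiments do

  • Core language benchmarks: MMLU / GSM8K / HumanEval / MATH and others; 405B is compared against GPT-4, 70B against the GPT-3.5 / Claude Haiku tier (exact numbers require reading the paper)
  • Long context: needle-in-a-haystack style retrieval tests at 128K
  • Multilingual: evaluation across 8 major languages
  • Code and reasoning: split into subtasks covering code generation, debugging, and mathematical reasoning
  • Multimodal: visual question answering (VQA), document understanding, chart interpretation, video question answering
  • Safety and red teaming: jailbreak resistance, harmful content generation rate, and the balance against refusal rate
  • Human preference: Arena-style blind tests that measure preference win rates in real conversations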
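Of the evaluations listed above, the needle-in-a-haystack test is the simplest to reproduce in miniature. The harness below is a hedged sketch: `toy_model` is a hypothetical stand-in for a real LLM call that can only "see" the last `window_chars` characters, and the filler and needle sentences are invented for the example.

```python
def build_haystack(filler, needle, n_sentences, depth):
    """Insert the needle at a relative depth in [0, 1] inside repeated filler."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def toy_model(prompt, window_chars):
    """Hypothetical model: answers only from the last window_chars characters."""
    visible = prompt[-window_chars:]
    marker = "The secret code is "
    idx = visible.find(marker)
    if idx == -1:
        return "unknown"
    return visible[idx + len(marker):].split(".")[0]


def sweep(depths, n_sentences, window_chars):
    """Return {depth: retrieved?} for one context length and one window size."""
    needle = "The secret code is 7421."
    filler = "The sky was clear that day."
    results = {}
    for depth in depths:
        prompt = build_haystack(filler, needle, n_sentences, depth)
        prompt += " What is the secret code?"
        results[depth] = toy_model(prompt, window_chars) == "7421"
    return results


depths = [0.0, 0.5, 1.0]
short_window = sweep(depths, n_sentences=200, window_chars=2000)
long_window = sweep(depths, n_sentences=200, window_chars=10000)
print("short window:", short_window)
print("long window: ", long_window)
```

A real run would replace `toy_model` with a model call and sweep both depth and context length, producing the familiar retrieval heatmap.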

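A few new terms you should know

  • GQA (Grouped-Query Attention): a middle ground between multi-head and multi-query attention in which several query heads share one group of key/value heads, shrinking the KV cache. Everyday analogy: a group of students (queries) shares one textbook (KV) instead of each holding a copy
  • DPO (Direct Preference Optimization): a preference alignment method that optimizes the model directly on pairs of answers (good vs. bad), without first training a reward model and then running RL. Considerably simpler than PPO
  • Rejection Sampling: have the model generate N candidates, use a discriminator or reward model to pick the best one, and add it to the training set; in effect the model writes its own "top-student answers"
  • Cross-attention adapter: new attention layers inserted between existing Transformer layers so that external information (such as image tokens) can "be seen" without changing the original backbone weights
  • Scaling Law: the empirical power-law relationship among parameter count, data size, and compute, used to fit a curve at small scale and then predict where a large run should stop
  • Data mixing: the proportions in which different sources (web / code / books / multilingual) are fed in during training; a poorly chosen ratio changes performance substantially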
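The KV-cache saving from GQA is plain arithmetic. The sketch below assumes 126 layers, 128 query heads, 8 KV heads, and a head dimension of 128 stored in 16-bit precision; these are values commonly quoted for the 405B model and are treated here as assumptions rather than figures confirmed against the report.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Keys and values (factor 2) are cached per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed configuration (see the note above); one sequence at 128K tokens.
layers, query_heads, kv_heads, head_dim = 126, 128, 8, 128
seq_len = 128 * 1024

mha = kv_cache_bytes(layers, query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(layers, kv_heads, head_dim, seq_len)     # 16 query heads share each KV head

gib = 1024 ** 3
print(f"MHA cache: {mha / gib:.0f} GiB, GQA cache: {gqa / gib:.0f} GiB")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, a factor of 16, which is what makes a 128K context practical to serve.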

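How it relates to other papers

  • Builds on Llama 2 (2023): same-family upgrade, roughly 6x the parameters (70B to 405B) and roughly 8x the training tokens, plus multimodal branches
  • Benchmarked against the closed frontier: GPT-4 (OpenAI), Gemini 1.5 (Google), Claude 3 (Anthropic), large models in the same tier
  • Contrast with the MoE route: Mixtral / DeepSeek-V2 / Qwen-MoE use sparse activation, while Llama 3 stays dense
  • Later citations: it became the de facto standard open base model of 2024-2025, and many RLHF / agent / VLM works fine-tune Llama 3 directly
  • Related multimodal ideas: Flamingo (the originator of cross-attention vision adapters), LLaVA (projection-layer attachment), BLIP-2 (Q-Former); the Llama 3 vision branch is closest to the Flamingo school
  • Training-stack transparency comparisons: the BLOOM, OPT, and GPT-NeoX reports, but Llama 3 is the first open report that combines frontier scale with full-stack detail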

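How I suggest reading it

  1. Read §1 + §2 + §10 (conclusion) first: get clear on what they set out to show and what they ended up showing
  2. Then read §3 pre-training (data pipeline, architecture, infrastructure, training recipe): this is the part with the most engineering value and the most worth copying
  3. Jump to §4 post-training (the iterative loop of DPO + rejection sampling): understand how the model is actually made "obedient" after SFT
  4. Read the multimodal part (§7 vision, §8 speech) side by side with Flamingo / LLaVA: treat it as an "industrial implementation case of vision adapters" rather than a new architecture

If you only have 30 minutes: §1, §3.1 (data), §4 (the post-training loop figure), and §5 (evaluation tables) are enough.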

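Why it is worth reading

  • Industry baseline handbook: for anyone training large models, this is the most authoritative 2024 reference on "how it should be done", and it helps avoid a pile of hidden pitfalls
  • Ceiling of engineering transparency: everything from the tokenizer to failure recovery is written down, which makes it far more useful to engineers than a typical paper
  • Industrial template for multimodal attachment: the report's "freeze the backbone + attach adapters afterwards + staged joint training" is a pattern reused repeatedly by later VLM / video / speech models
  • Understanding the open ecosystem: Llama 3 is the de facto base for fine-tuning / agent / embodied-AI applications in 2024-2025, and many downstream papers build on it; reading it shows what their "foundation" looks like
  • Scaling laws in practice: an industrial team actually applied scaling laws at the 405B scale and wrote the process down, which is very valuable for learning "how to decide how big the next model should be"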

Cite this note
BibTeX
@online{eai_llama_3_herd_2026,
  title       = {(readable note) The Llama 3 Herd of Models},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/llama-3-herd/}},
  organization = {Embodied AI Reading Station}
}
