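From 2011 to 2025: an evolutionary path traced through 156 papers.
The 156 notes are laid out by year. Within each year they are grouped by topic, and colors correspond to topics. This page shows the actual order in which things appeared in embodied AI over the past five years: what came first and what came after.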
2024–2025
The foundation-model era
VLA industrialization / datasets mature / evaluation frameworks established
2025· 23 papers
- VI Dreamer V3: Mastering Diverse Domains through World Models Nature
- VII VLAS: VLA Model With Speech Instructions ICLR
- VI Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control arXiv
- IV DiT-Policy ICRA
- IV Diffusion Policy Policy Optimization (DPPO) ICLR
- IV FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching AAAI
- IV FAST: Efficient Action Tokenization for VLA RSS
- IV pi_0.5: VLA with Open-World Generalization arXiv
- V Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) RSS
- V SmolVLA arXiv
- VII Tactile Beyond Pixels (Sparsh-X) CoRL
- VII Tactile-VLA CoRL
- VII TLA: Tactile-Language-Action ICRA
- VIII Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion arXiv
- XI Isaac Lab arXiv
- XI MuJoCo Playground arXiv
- III DexVLA arXiv
- III OpenHelix arXiv
- III OpenVLA-OFT RSS
- III SpatialVLA arXiv
- VI 1X World Model Challenge arXiv
- VI Cosmos World Foundation Model Platform arXiv
- VI Navigation World Models CVPR
2024· 52 papers
- III OpenVLA: An Open-Source Vision-Language-Action Model CoRL
- VIII mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment SenSys 2024
- IX NeuralAids: Wireless Hearables With Programmable Speech AI Accelerators MobiCom
- IV 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations RSS
- IV Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation RSS
- IV EquiBot: SIM(3)-Equivariant Diffusion Policy CoRL
- VII OneLLM CVPR
- II GenSim ICLR
- II RoboFlamingo ICLR
- II Tree-Planner ICLR
- III 3D Diffusion Policy (DP3) RSS
- III Octo: An Open-Source Generalist Robot Policy RSS
- I DeepSeek-VL: Towards Real-World Vision-Language Understanding arXiv
- I Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks CVPR
- I InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks CVPR
- I Improved Baselines with Visual Instruction Tuning CVPR
- VII MLA: Multisensory Language-Action Model arXiv
- IX Proactive Hearing Assistants that Isolate Egocentric Conversations UIST
- IX Stable Audio ICML
- IX Universal Source Separation with Weakly Labelled Data TASLP
- X DROID RSS
- X RoboCasa RSS
- X SimplerEnv NeurIPS
- IV Affordance-based Robot Manipulation with Flow Matching IROS
- IV pi_0: Vision-Language-Action Flow Model arXiv
- V ALOHA 2 Tech Report
- V DexCap RSS
- V HumanPlus CoRL
- V Mobile ALOHA CoRL
- V Universal Manipulation Interface RSS
- V Behavior Generation with Latent Actions (VQ-BeT) ICML
- VII Sparsh: Self-supervised Touch Representations CoRL
- VIII Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on SenSys
- VIII Diffusion Model is a Good Pose Estimator from 3D RF-Vision CVPR
- VIII Enabling Visual Recognition at Radio Frequency (PanoRadar) MobiCom
- XI BEHAVIOR-1K CoRL
- XI Habitat 3.0 ICLR
- III 3D-VLA ICML
- III GR-2: Generative Video-Language-Action Model arXiv
- III RDT-1B: Diffusion Foundation Model for Bimanual Manipulation ICLR
- III RoboMamba NeurIPS
- III TinyVLA RA-L
- III TraceVLA: Visual Trace Prompting ICLR
- I What matters when building vision-language models? NeurIPS
- I Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling arXiv
- I The Llama 3 Herd of Models arXiv
- I LLaVA-NeXT-Interleave arXiv
- I LLaVA-OneVision: Easy Visual Task Transfer arXiv
- I Long-CLIP: Unlocking the Long-Text Capability of CLIP ECCV
- I Pixtral 12B arXiv
- VI Genie: Generative Interactive Environments ICML
- VI UniSim ICLR
2022–2023
Year one of VLA
RT-1/RT-2 / Diffusion Policy / OpenVLA
2023· 40 papers
- I LLaVA: Visual Instruction Tuning NeurIPS
- IX Creating speech zones with self-distributing acoustic swarms Nature
- IV Diffusion Policy: Visuomotor Policy Learning via Action Diffusion RSS
- VII ImageBind: One Embedding Space To Bind Them All CVPR
- II Code as Policies: Language Model Programs for Embodied Control ICRA
- II LLM+P: Empowering LLMs with Optimal Planning arXiv
- II PaLM-E: An Embodied Multimodal Language Model ICML
- II ProgPrompt ICRA
- I 3DShape2VecSet: 3D Shape Representation for Diffusion Models SIGGRAPH
- VIII CartoRadar: RF-Based 3D SLAM Rivaling Vision Approaches MobiCom 2025 (Best Artifact Award)
- IX AudioLM TASLP
- IX EnCodec TMLR
- IX MusicLM arXiv
- IX Robust Speech Recognition via Large-Scale Weak Supervision ICML
- X BridgeData V2 dataset-eval
- X LIBERO NeurIPS
- X RH20T RSS Workshop
- V Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) RSS
- V AnyTeleop CoRL
- V RoboCat TMLR
- VII AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model EACL
- VII AudioPaLM arXiv
- VII FROMAGe: Grounding LLMs to Images ICML
- II ChatGPT for Robotics IEEE Access
- II VoxPoser CoRL
- VIII High Resolution Point Clouds from mmWave Radar ICRA
- VIII RFPose-OT: RF-Based 3D Human Pose Estimation via Optimal Transport Theory TCSVT
- III RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control CoRL
- III RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches ICLR
- I BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models ICML
- I EVA-CLIP: Improved Training Techniques for CLIP at Scale arXiv
- I OBELICS NeurIPS
- I Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond arXiv
- I Sigmoid Loss for Language Image Pre-Training ICCV
- VI Transformers are Sample-Efficient World Models ICLR
- VI TWM: Transformer-based World Models ICLR
- VIII mmNorm: Non-Line-of-Sight 3D Object Reconstruction via mmWave Surface Normal Estimation MobiSys 2025
- IX SeamlessM4T arXiv
- X Open X-Embodiment ICRA
- VI GAIA-1 arXiv
2022· 14 papers
- II SayCan: Do As I Can, Not As I Say CoRL
- IX SoundStream: An End-to-End Neural Audio Codec IEEE/ACM TASLP
- II Inner Monologue: Embodied Reasoning through Planning with Language Models CoRL
- III RT-1: Robotics Transformer for Real-World Control at Scale RSS
- I Flamingo: a Visual Language Model for Few-Shot Learning NeurIPS
- X CALVIN RA-L
- V Behavior Transformers: Cloning k Modes with One Stone NeurIPS
- VII X-VLM: Multi-Grained Vision Language Pre-Training ICML
- VIII RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals TMM
- XI DexMV ECCV
- XI ProcTHOR NeurIPS
- I BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ICML
- I FILIP: Fine-grained Interactive Language-Image Pre-Training ICLR
- VI DayDreamer CoRL
2018–2021
VLM foundations established
CLIP / Habitat / early RL simulation
2021· 9 papers
- XI Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning NeurIPS Datasets
- I Learning Transferable Visual Models From Natural Language Supervision ICML
- IX Meta-StyleSpeech ICML
- X What Matters in Learning from Offline Human Demonstrations for Robot Manipulation CoRL
- V Implicit Behavioral Cloning CoRL
- VIII 3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning IPCCC
- XI Habitat 2.0 NeurIPS
- XI ManiSkill NeurIPS
- VI Mastering Atari with Discrete World Models ICLR
2020· 8 papers
- X robosuite: A Modular Simulation Framework and Benchmark for Robot Learning arXiv
- VIII See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar SenSys
- VI Dream to Control: Learning Behaviors by Latent Imagination ICLR
- IX Conformer Interspeech
- IX Dual-path RNN ICASSP
- VIII milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion SenSys
- VIII RadarSLAM: Radar based Large-Scale SLAM in All Weathers BMVC
- XI SAPIEN: A SimulAted Part-based Interactive ENvironment CVPR
2019· 7 papers
- IX Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation IEEE/ACM TASLP
- X Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning CoRL
- X RLBench: The Robot Learning Benchmark & Learning Environment RA-L
- VII Connecting Touch and Vision via Cross-Modal Prediction CVPR
- VIII Can WiFi Estimate Person Pose? ICCV
- XI Habitat: A Platform for Embodied AI Research ICCV
- VIII Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm arXiv
2018· 1 paper
≤ 2017
Pre-transformer period
World Models / GAIL / DAgger