| 2025 | VLAS: VLA Model With Speech Instructions | Multimodal Ecology | ICLR |
| 2025 | FAST: Efficient Action Tokenization for VLA | Diffusion Policy | RSS |
| 2025 | pi_0.5: VLA with Open-World Generalization | Diffusion Policy | arXiv |
| 2025 | Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) | Imitation Learning | RSS |
| 2025 | SmolVLA | Imitation Learning | arXiv |
| 2025 | Tactile-VLA | Multimodal Ecology | CoRL |
| 2025 | TLA: Tactile-Language-Action | Multimodal Ecology | ICRA |
| 2025 | OpenHelix | End-to-End VLA | arXiv |
| 2025 | OpenVLA-OFT | End-to-End VLA | RSS |
| 2025 | 1X World Model Challenge | World Model & Video Policy | arXiv |
| 2025 | Cosmos World Foundation Model Platform | World Model & Video Policy | arXiv |
| 2024 | OpenVLA: An Open-Source Vision-Language-Action Model | End-to-End VLA | CoRL |
| 2024 | MLA: Multisensory Language-Action Model | Multimodal Ecology | arXiv |
| 2024 | mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment | RF Perception & Mapping | SenSys |
| 2024 | DROID | Datasets & Benchmarks | RSS |
| 2024 | pi_0: Vision-Language-Action Flow Model | Diffusion Policy | arXiv |
| 2024 | Behavior Generation with Latent Actions (VQ-BeT) | Imitation Learning | ICML |
| 2024 | OneLLM | Multimodal Ecology | CVPR |
| 2024 | GenSim | High-Level Planning | ICLR |
| 2024 | RoboFlamingo | High-Level Planning | ICLR |
| 2024 | Tree-Planner | High-Level Planning | ICLR |
| 2024 | Habitat 3.0 | Simulation & Sim2Real | ICLR |
| 2024 | Octo: An Open-Source Generalist Robot Policy | End-to-End VLA | RSS |
| 2024 | 3D-VLA | End-to-End VLA | ICML |
| 2024 | GR-2: Generative Video-Language-Action Model | End-to-End VLA | arXiv |
| 2024 | RDT-1B: Diffusion Foundation Model for Bimanual Manipulation | End-to-End VLA | ICLR |
| 2024 | RoboMamba | End-to-End VLA | NeurIPS |
| 2024 | TinyVLA | End-to-End VLA | RA-L |
| 2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR |
| 2024 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | VLM Foundation | arXiv |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | Improved Baselines with Visual Instruction Tuning | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM Foundation | arXiv |
| 2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv |
| 2024 | LLaVA-NeXT-Interleave | VLM Foundation | arXiv |
| 2024 | LLaVA-OneVision: Easy Visual Task Transfer | VLM Foundation | arXiv |
| 2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2023 | LLaVA: Visual Instruction Tuning | VLM Foundation | NeurIPS |
| 2023 | AudioLM | Auditory & Acoustic | TASLP |
| 2023 | EnCodec | Auditory & Acoustic | TMLR |
| 2023 | Robust Speech Recognition via Large-Scale Weak Supervision | Auditory & Acoustic | ICML |
| 2023 | SeamlessM4T | Auditory & Acoustic | arXiv |
| 2023 | Open X-Embodiment | Datasets & Benchmarks | ICRA |
| 2023 | RoboCat | Imitation Learning | TMLR |
| 2023 | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Multimodal Ecology | EACL |
| 2023 | AudioPaLM | Multimodal Ecology | arXiv |
| 2023 | FROMAGe: Grounding LLMs to Images | Multimodal Ecology | ICML |
| 2023 | Code as Policies: Language Model Programs for Embodied Control | High-Level Planning | ICRA |
| 2023 | LLM+P: Empowering LLMs with Optimal Planning | High-Level Planning | arXiv |
| 2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML |
| 2023 | ProgPrompt | High-Level Planning | ICRA |
| 2023 | ChatGPT for Robotics | High-Level Planning | IEEE Access |
| 2023 | VoxPoser | High-Level Planning | CoRL |
| 2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL |
| 2023 | RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches | End-to-End VLA | ICLR |
| 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML |
| 2023 | OBELICS | VLM Foundation | NeurIPS |
| 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | VLM Foundation | arXiv |
| 2023 | Transformers are Sample-Efficient World Models | World Model & Video Policy | ICLR |
| 2023 | TWM: Transformer-based World Models | World Model & Video Policy | ICLR |
| 2023 | GAIA-1 | World Model & Video Policy | arXiv |
| 2022 | SayCan: Do As I Can, Not As I Say | High-Level Planning | CoRL |
| 2022 | Behavior Transformers: Cloning k Modes with One Stone | Imitation Learning | NeurIPS |
| 2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML |
| 2022 | Inner Monologue: Embodied Reasoning through Planning with Language Models | High-Level Planning | CoRL |
| 2022 | ProcTHOR | Simulation & Sim2Real | NeurIPS |
| 2022 | RT-1: Robotics Transformer for Real-World Control at Scale | End-to-End VLA | RSS |
| 2022 | Flamingo: a Visual Language Model for Few-Shot Learning | VLM Foundation | NeurIPS |
| 2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | VLM Foundation | ICML |
| 2021 | Learning Transferable Visual Models From Natural Language Supervision | VLM Foundation | ICML |