| 2025 | DexVLA | End-to-End VLA | arXiv |
| 2025 | OpenHelix | End-to-End VLA | arXiv |
| 2025 | SpatialVLA | End-to-End VLA | arXiv |
| 2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR |
| 2024 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | VLM Foundation | arXiv |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | Improved Baselines with Visual Instruction Tuning | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM Foundation | arXiv |
| 2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv |
| 2024 | LLaVA-NeXT-Interleave | VLM Foundation | arXiv |
| 2024 | LLaVA-OneVision: Easy Visual Task Transfer | VLM Foundation | arXiv |
| 2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2024 | UniSim | World Model & Video Policy | ICLR |
| 2023 | LIBERO | Datasets & Benchmarks | NeurIPS |
| 2023 | Open X-Embodiment | Datasets & Benchmarks | ICRA |
| 2023 | Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) | Imitation Learning | RSS |
| 2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML |
| 2023 | ProgPrompt | High-Level Planning | ICRA |
| 2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL |
| 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML |
| 2023 | OBELICS | VLM Foundation | NeurIPS |
| 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | VLM Foundation | arXiv |
| 2023 | Sigmoid Loss for Language Image Pre-Training | VLM Foundation | ICCV |
| 2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML |