| 2025 | DiT-Policy | Diffusion Policy | ICRA |
| 2025 | FAST: Efficient Action Tokenization for VLA | Diffusion Policy | RSS |
| 2025 | pi_0.5: VLA with Open-World Generalization | Diffusion Policy | arXiv |
| 2025 | Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) | Imitation Learning | RSS |
| 2025 | Tactile Beyond Pixels (Sparsh-X) | Multimodal Ecology | CoRL |
| 2025 | Tactile-VLA | Multimodal Ecology | CoRL |
| 2025 | TLA: Tactile-Language-Action | Multimodal Ecology | ICRA |
| 2025 | Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion | RF Perception & Mapping | arXiv |
| 2025 | OpenHelix | End-to-End VLA | arXiv |
| 2025 | OpenVLA-OFT | End-to-End VLA | RSS |
| 2025 | SpatialVLA | End-to-End VLA | arXiv |
| 2025 | Dreamer V3: Mastering Diverse Domains through World Models | World Model & Video Policy | Nature |
| 2025 | 1X World Model Challenge | World Model & Video Policy | arXiv |
| 2025 | Cosmos World Foundation Model Platform | World Model & Video Policy | arXiv |
| 2025 | Navigation World Models | World Model & Video Policy | CVPR |
| 2024 | Stable Audio | Auditory & Acoustic | ICML |
| 2024 | DROID | Datasets & Benchmarks | RSS |
| 2024 | 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations | Diffusion Policy | RSS |
| 2024 | Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation | Diffusion Policy | RSS |
| 2024 | EquiBot: SIM(3)-Equivariant Diffusion Policy | Diffusion Policy | CoRL |
| 2024 | Affordance-based Robot Manipulation with Flow Matching | Diffusion Policy | IROS |
| 2024 | pi_0: Vision-Language-Action Flow Model | Diffusion Policy | arXiv |
| 2024 | DexCap | Imitation Learning | RSS |
| 2024 | HumanPlus | Imitation Learning | CoRL |
| 2024 | Mobile ALOHA | Imitation Learning | CoRL |
| 2024 | Behavior Generation with Latent Actions (VQ-BeT) | Imitation Learning | ICML |
| 2024 | OneLLM | Multimodal Ecology | CVPR |
| 2024 | Sparsh: Self-supervised Touch Representations | Multimodal Ecology | CoRL |
| 2024 | RoboFlamingo | High-Level Planning | ICLR |
| 2024 | Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on | RF Perception & Mapping | SenSys |
| 2024 | Diffusion Model is a Good Pose Estimator from 3D RF-Vision | RF Perception & Mapping | CVPR |
| 2024 | 3D Diffusion Policy (DP3) | End-to-End VLA | RSS |
| 2024 | Octo: An Open-Source Generalist Robot Policy | End-to-End VLA | RSS |
| 2024 | 3D-VLA | End-to-End VLA | ICML |
| 2024 | GR-2: Generative Video-Language-Action Model | End-to-End VLA | arXiv |
| 2024 | RDT-1B: Diffusion Foundation Model for Bimanual Manipulation | End-to-End VLA | ICLR |
| 2024 | RoboMamba | End-to-End VLA | NeurIPS |
| 2024 | TinyVLA | End-to-End VLA | RA-L |
| 2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv |
| 2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2024 | Genie: Generative Interactive Environments | World Model & Video Policy | ICML |
| 2023 | AudioLM | Auditory & Acoustic | TASLP |
| 2023 | EnCodec | Auditory & Acoustic | TMLR |
| 2023 | MusicLM | Auditory & Acoustic | arXiv |
| 2023 | Robust Speech Recognition via Large-Scale Weak Supervision | Auditory & Acoustic | ICML |
| 2023 | SeamlessM4T | Auditory & Acoustic | arXiv |
| 2023 | BridgeData V2 | Datasets & Benchmarks | dataset-eval |
| 2023 | LIBERO | Datasets & Benchmarks | NeurIPS |
| 2023 | Open X-Embodiment | Datasets & Benchmarks | ICRA |
| 2023 | Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) | Imitation Learning | RSS |
| 2023 | RoboCat | Imitation Learning | TMLR |
| 2023 | ImageBind: One Embedding Space To Bind Them All | Multimodal Ecology | CVPR |
| 2023 | AudioPaLM | Multimodal Ecology | arXiv |
| 2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML |
| 2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL |
| 2023 | RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches | End-to-End VLA | ICLR |
| 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML |
| 2023 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | VLM Foundation | arXiv |
| 2023 | Sigmoid Loss for Language Image Pre-Training | VLM Foundation | ICCV |
| 2023 | Transformers are Sample-Efficient World Models | World Model & Video Policy | ICLR |
| 2023 | TWM: Transformer-based World Models | World Model & Video Policy | ICLR |
| 2023 | GAIA-1 | World Model & Video Policy | arXiv |
| 2022 | Behavior Transformers: Cloning k Modes with One Stone | Imitation Learning | NeurIPS |
| 2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML |
| 2022 | RT-1: Robotics Transformer for Real-World Control at Scale | End-to-End VLA | RSS |
| 2022 | Flamingo: a Visual Language Model for Few-Shot Learning | VLM Foundation | NeurIPS |
| 2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | VLM Foundation | ICML |
| 2022 | FILIP: Fine-grained Interactive Language-Image Pre-Training | VLM Foundation | ICLR |
| 2021 | Meta-StyleSpeech | Auditory & Acoustic | ICML |
| 2021 | Learning Transferable Visual Models From Natural Language Supervision | VLM Foundation | ICML |
| 2020 | Conformer | Auditory & Acoustic | Interspeech |
| 2020 | Dual-path RNN | Auditory & Acoustic | ICASSP |
| 2020 | milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion | RF Perception & Mapping | SenSys |