| 2025 | DiT-Policy | Diffusion Policy | ICRA |
| 2025 | Diffusion Policy Policy Optimization (DPPO) | Diffusion Policy | ICLR |
| 2025 | FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching | Diffusion Policy | AAAI |
| 2025 | Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) | Imitation Learning | RSS |
| 2025 | Tactile Beyond Pixels (Sparsh-X) | Multimodal Ecology | CoRL |
| 2025 | Tactile-VLA | Multimodal Ecology | CoRL |
| 2025 | TLA: Tactile-Language-Action | Multimodal Ecology | ICRA |
| 2025 | Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion | RF Perception & Mapping | arXiv |
| 2025 | DexVLA | End-to-End VLA | arXiv |
| 2025 | OpenVLA-OFT | End-to-End VLA | RSS |
| 2025 | SpatialVLA | End-to-End VLA | arXiv |
| 2024 | OpenVLA: An Open-Source Vision-Language-Action Model | End-to-End VLA | CoRL |
| 2024 | mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment | RF Perception & Mapping | SenSys |
| 2024 | Stable Audio | Auditory & Acoustic | ICML |
| 2024 | Universal Source Separation with Weakly Labelled Data | Auditory & Acoustic | TASLP |
| 2024 | DROID | Datasets & Benchmarks | RSS |
| 2024 | SimplerEnv | Datasets & Benchmarks | NeurIPS |
| 2024 | 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations | Diffusion Policy | RSS |
| 2024 | EquiBot: SIM(3)-Equivariant Diffusion Policy | Diffusion Policy | CoRL |
| 2024 | Affordance-based Robot Manipulation with Flow Matching | Diffusion Policy | IROS |
| 2024 | pi_0: Vision-Language-Action Flow Model | Diffusion Policy | arXiv |
| 2024 | DexCap | Imitation Learning | RSS |
| 2024 | Mobile ALOHA | Imitation Learning | CoRL |
| 2024 | Universal Manipulation Interface | Imitation Learning | RSS |
| 2024 | OneLLM | Multimodal Ecology | CVPR |
| 2024 | Sparsh: Self-supervised Touch Representations | Multimodal Ecology | CoRL |
| 2024 | GenSim | High-Level Planning | ICLR |
| 2024 | RoboFlamingo | High-Level Planning | ICLR |
| 2024 | Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on | RF Perception & Mapping | SenSys |
| 2024 | Diffusion Model is a Good Pose Estimator from 3D RF-Vision | RF Perception & Mapping | CVPR |
| 2024 | Enabling Visual Recognition at Radio Frequency (PanoRadar) | RF Perception & Mapping | MobiCom |
| 2024 | 3D Diffusion Policy (DP3) | End-to-End VLA | RSS |
| 2024 | Octo: An Open-Source Generalist Robot Policy | End-to-End VLA | RSS |
| 2024 | 3D-VLA | End-to-End VLA | ICML |
| 2024 | RDT-1B: Diffusion Foundation Model for Bimanual Manipulation | End-to-End VLA | ICLR |
| 2024 | RoboMamba | End-to-End VLA | NeurIPS |
| 2024 | TinyVLA | End-to-End VLA | RA-L |
| 2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR |
| 2024 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | VLM Foundation | arXiv |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | Improved Baselines with Visual Instruction Tuning | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM Foundation | arXiv |
| 2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv |
| 2024 | LLaVA-NeXT-Interleave | VLM Foundation | arXiv |
| 2024 | LLaVA-OneVision: Easy Visual Task Transfer | VLM Foundation | arXiv |
| 2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2024 | Genie: Generative Interactive Environments | World Model & Video Policy | ICML |
| 2024 | UniSim | World Model & Video Policy | ICLR |
| 2023 | LLaVA: Visual Instruction Tuning | VLM Foundation | NeurIPS |
| 2023 | MusicLM | Auditory & Acoustic | arXiv |
| 2023 | Robust Speech Recognition via Large-Scale Weak Supervision | Auditory & Acoustic | ICML |
| 2023 | BridgeData V2 | Datasets & Benchmarks | CoRL |
| 2023 | LIBERO | Datasets & Benchmarks | NeurIPS |
| 2023 | RH20T | Datasets & Benchmarks | RSS Workshop |
| 2023 | AnyTeleop | Imitation Learning | CoRL |
| 2023 | RoboCat | Imitation Learning | TMLR |
| 2023 | ImageBind: One Embedding Space To Bind Them All | Multimodal Ecology | CVPR |
| 2023 | AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model | Multimodal Ecology | EACL |
| 2023 | FROMAGe: Grounding LLMs to Images | Multimodal Ecology | ICML |
| 2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML |
| 2023 | VoxPoser | High-Level Planning | CoRL |
| 2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL |
| 2023 | RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches | End-to-End VLA | ICLR |
| 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML |
| 2023 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | VLM Foundation | arXiv |
| 2023 | OBELICS | VLM Foundation | NeurIPS |
| 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | VLM Foundation | arXiv |
| 2023 | Sigmoid Loss for Language Image Pre-Training | VLM Foundation | ICCV |
| 2023 | GAIA-1 | World Model & Video Policy | arXiv |
| 2022 | CALVIN | Datasets & Benchmarks | RA-L |
| 2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML |
| 2022 | Inner Monologue: Embodied Reasoning through Planning with Language Models | High-Level Planning | CoRL |
| 2022 | RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals | RF Perception & Mapping | TMM |
| 2022 | DexMV | Simulation & Sim2Real | ECCV |
| 2022 | Flamingo: a Visual Language Model for Few-Shot Learning | VLM Foundation | NeurIPS |
| 2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | VLM Foundation | ICML |
| 2022 | FILIP: Fine-grained Interactive Language-Image Pre-Training | VLM Foundation | ICLR |
| 2022 | DayDreamer | World Model & Video Policy | CoRL |
| 2021 | ManiSkill | Simulation & Sim2Real | NeurIPS |
| 2021 | Learning Transferable Visual Models From Natural Language Supervision | VLM Foundation | ICML |
| 2021 | Mastering Atari with Discrete World Models | World Model & Video Policy | ICLR |
| 2020 | Conformer | Auditory & Acoustic | Interspeech |
| 2020 | See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar | RF Perception & Mapping | SenSys |
| 2020 | milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion | RF Perception & Mapping | SenSys |
| 2020 | RadarSLAM: Radar based Large-Scale SLAM in All Weathers | RF Perception & Mapping | BMVC |
| 2019 | RLBench: The Robot Learning Benchmark & Learning Environment | Datasets & Benchmarks | RA-L |
| 2019 | Connecting Touch and Vision via Cross-Modal Prediction | Multimodal Ecology | CVPR |
| 2019 | Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm | RF Perception & Mapping | arXiv |
| 2019 | Habitat: A Platform for Embodied AI Research | Simulation & Sim2Real | ICCV |