| 2025 | Diffusion Policy Policy Optimization (DPPO) | Diffusion Policy | ICLR |
| 2025 | FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching | Diffusion Policy | AAAI |
| 2025 | FAST: Efficient Action Tokenization for VLA | Diffusion Policy | RSS |
| 2025 | Tactile Beyond Pixels (Sparsh-X) | Multimodal Ecology | CoRL |
| 2025 | Isaac Lab | Simulation & Sim2Real | arXiv |
| 2025 | SpatialVLA | End-to-End VLA | arXiv |
| 2024 | mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment | RF Perception & Mapping | SenSys |
| 2024 | Stable Audio | Auditory & Acoustic | ICML |
| 2024 | Universal Source Separation with Weakly Labelled Data | Auditory & Acoustic | TASLP |
| 2024 | ALOHA 2 | Imitation Learning | Tech Report |
| 2024 | OneLLM | Multimodal Ecology | CVPR |
| 2024 | Sparsh: Self-supervised Touch Representations | Multimodal Ecology | CoRL |
| 2024 | GenSim | High-Level Planning | ICLR |
| 2024 | Tree-Planner | High-Level Planning | ICLR |
| 2024 | Diffusion Model is a Good Pose Estimator from 3D RF-Vision | RF Perception & Mapping | CVPR |
| 2024 | BEHAVIOR-1K | Simulation & Sim2Real | CoRL |
| 2024 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | VLM Foundation | arXiv |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | Improved Baselines with Visual Instruction Tuning | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM Foundation | arXiv |
| 2024 | LLaVA-NeXT-Interleave | VLM Foundation | arXiv |
| 2024 | LLaVA-OneVision: Easy Visual Task Transfer | VLM Foundation | arXiv |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2023 | MusicLM | Auditory & Acoustic | arXiv |
| 2023 | RH20T | Datasets & Benchmarks | RSS Workshop |
| 2023 | AudioPaLM | Multimodal Ecology | arXiv |
| 2023 | OBELICS | VLM Foundation | NeurIPS |
| 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | VLM Foundation | arXiv |
| 2023 | TWM: Transformer-based World Models | World Model & Video Policy | ICLR |
| 2022 | ProcTHOR | Simulation & Sim2Real | NeurIPS |
| 2021 | What Matters in Learning from Offline Human Demonstrations for Robot Manipulation | Datasets & Benchmarks | CoRL |
| 2021 | 3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning | RF Perception & Mapping | IPCCC |
| 2021 | Habitat 2.0 | Simulation & Sim2Real | NeurIPS |
| 2020 | Dual-path RNN | Auditory & Acoustic | ICASSP |
| 2020 | robosuite: A Modular Simulation Framework and Benchmark for Robot Learning | Datasets & Benchmarks | arXiv |
| 2020 | RadarSLAM: Radar based Large-Scale SLAM in All Weathers | RF Perception & Mapping | BMVC |
| 2019 | Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning | Datasets & Benchmarks | CoRL |
| 2019 | RLBench: The Robot Learning Benchmark & Learning Environment | Datasets & Benchmarks | RA-L |