Index by topic

11 chapters · 156 papers.

Topic taxonomy: eleven labeled doors

I

VLM Foundation

22 papers

Vision-language foundation models · read primer →

№ 01 LLaVA: Visual Instruction Tuning ⭐⭐ auto
№ 02 3DShape2VecSet: 3D Shape Representation for Diffusion Models ⭐⭐⭐⭐ auto
№ 124 Learning Transferable Visual Models From Natural Language Supervision ⭐⭐⭐ auto
№ 125 Flamingo: a Visual Language Model for Few-Shot Learning ⭐⭐⭐⭐ auto
№ 126 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models ⭐⭐⭐⭐ auto
№ 127 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ⭐⭐⭐ auto
№ 128 DeepSeek-VL: Towards Real-World Vision-Language Understanding ⭐⭐⭐ auto
№ 129 EVA-CLIP: Improved Training Techniques for CLIP at Scale ⭐⭐⭐ auto
№ 130 FILIP: Fine-grained Interactive Language-Image Pre-Training ⭐⭐⭐ auto
№ 131 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks ⭐⭐⭐ auto
№ 132 InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks ⭐⭐⭐⭐ auto
№ 133 Improved Baselines with Visual Instruction Tuning ⭐⭐ auto
№ 134 OBELICS ⭐⭐⭐ auto
№ 135 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond ⭐⭐⭐ auto
№ 136 Sigmoid Loss for Language Image Pre-Training ⭐⭐⭐ auto
№ 137 What matters when building vision-language models? ⭐⭐⭐ auto
№ 138 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling ⭐⭐⭐⭐ auto
№ 139 The Llama 3 Herd of Models ⭐⭐⭐⭐ auto
№ 140 LLaVA-NeXT-Interleave ⭐⭐⭐ auto
№ 141 LLaVA-OneVision: Easy Visual Task Transfer ⭐⭐⭐ auto
№ 142 Long-CLIP: Unlocking the Long-Text Capability of CLIP ⭐⭐⭐ auto
№ 143 Pixtral 12B ⭐⭐⭐ auto

II

High-Level Planning

11 papers

High-level task planning · read primer →

№ 03 SayCan: Do As I Can, Not As I Say ⭐⭐ auto
№ 75 Code as Policies: Language Model Programs for Embodied Control ⭐⭐⭐ auto
№ 76 Inner Monologue: Embodied Reasoning through Planning with Language Models ⭐⭐⭐ auto
№ 77 LLM+P: Empowering LLMs with Optimal Planning ⭐⭐⭐ auto
№ 78 PaLM-E: An Embodied Multimodal Language Model ⭐⭐⭐⭐ auto
№ 79 ProgPrompt ⭐⭐ auto
№ 80 ChatGPT for Robotics ⭐⭐ auto
№ 81 GenSim ⭐⭐⭐ auto
№ 82 RoboFlamingo ⭐⭐⭐⭐ auto
№ 83 Tree-Planner ⭐⭐⭐ auto
№ 84 VoxPoser ⭐⭐⭐⭐ auto

III

End-to-End VLA

16 papers

End-to-end vision-language-action · read primer →

№ 04 OpenVLA: An Open-Source Vision-Language-Action Model ⭐⭐⭐ auto
№ 109 RT-1: Robotics Transformer for Real-World Control at Scale ⭐⭐⭐ auto
№ 110 3D Diffusion Policy (DP3) ⭐⭐⭐ auto
№ 111 Octo: An Open-Source Generalist Robot Policy ⭐⭐⭐ auto
№ 112 RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control ⭐⭐⭐⭐ auto
№ 113 RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches ⭐⭐⭐ auto
№ 114 3D-VLA ⭐⭐⭐⭐ auto
№ 115 DexVLA ⭐⭐⭐⭐ auto
№ 116 GR-2: Generative Video-Language-Action Model ⭐⭐⭐⭐ auto
№ 117 OpenHelix ⭐⭐⭐ auto
№ 118 OpenVLA-OFT ⭐⭐⭐ auto
№ 119 RDT-1B: Diffusion Foundation Model for Bimanual Manipulation ⭐⭐⭐⭐ auto
№ 120 RoboMamba ⭐⭐⭐ auto
№ 121 SpatialVLA ⭐⭐⭐⭐ auto
№ 122 TinyVLA ⭐⭐⭐ auto
№ 123 TraceVLA: Visual Trace Prompting ⭐⭐⭐ auto

IV

Diffusion Policy

11 papers

Diffusion policy and flow matching · read primer →

№ 38 Diffusion Policy: Visuomotor Policy Learning via Action Diffusion ⭐⭐⭐ auto
№ 39 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations ⭐⭐⭐ auto
№ 40 Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation ⭐⭐⭐ auto
№ 41 EquiBot: SIM(3)-Equivariant Diffusion Policy ⭐⭐⭐⭐ auto
№ 42 DiT-Policy ⭐⭐⭐⭐ auto
№ 43 Diffusion Policy Policy Optimization (DPPO) ⭐⭐⭐⭐ auto
№ 44 Affordance-based Robot Manipulation with Flow Matching ⭐⭐⭐ auto
№ 45 FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching ⭐⭐⭐⭐ auto
№ 46 FAST: Efficient Action Tokenization for VLA ⭐⭐⭐⭐ auto
№ 47 π0: Vision-Language-Action Flow Model ⭐⭐⭐⭐ auto
№ 48 π0.5: VLA with Open-World Generalization ⭐⭐⭐⭐⭐ auto

V

Imitation Learning

15 papers

Imitation learning and teleoperation · read primer →

№ 49 A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning ⭐⭐⭐⭐ auto
№ 50 Generative Adversarial Imitation Learning ⭐⭐⭐⭐ auto
№ 51 Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) ⭐⭐⭐ auto
№ 52 AnyTeleop ⭐⭐⭐ auto
№ 53 Behavior Transformers: Cloning k Modes with One Stone ⭐⭐⭐ auto
№ 54 Implicit Behavioral Cloning ⭐⭐⭐⭐ auto
№ 55 RoboCat ⭐⭐⭐⭐ auto
№ 56 ALOHA 2 ⭐⭐ auto
№ 57 DexCap ⭐⭐⭐ auto
№ 58 HumanPlus ⭐⭐⭐⭐ auto
№ 59 Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) ⭐⭐⭐⭐ auto
№ 60 Mobile ALOHA ⭐⭐⭐ auto
№ 61 SmolVLA ⭐⭐⭐ auto
№ 62 Universal Manipulation Interface ⭐⭐⭐ auto
№ 63 Behavior Generation with Latent Actions (VQ-BeT) ⭐⭐⭐⭐ auto

VI

World Model & Video Policy

14 papers

World models and video policies · read primer →

№ 07 Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control ⭐⭐⭐⭐⭐ auto
№ 144 Dream to Control: Learning Behaviors by Latent Imagination ⭐⭐⭐⭐ auto
№ 145 World Models ⭐⭐⭐ auto
№ 146 DayDreamer ⭐⭐⭐ auto
№ 147 Mastering Atari with Discrete World Models ⭐⭐⭐⭐ auto
№ 148 Dreamer V3: Mastering Diverse Domains through World Models ⭐⭐⭐⭐ auto
№ 149 Transformers are Sample-Efficient World Models ⭐⭐⭐⭐ auto
№ 150 TWM: Transformer-based World Models ⭐⭐⭐⭐ auto
№ 151 1X World Model Challenge ⭐⭐⭐ auto
№ 152 Cosmos World Foundation Model Platform ⭐⭐⭐⭐⭐ auto
№ 153 GAIA-1 ⭐⭐⭐⭐ auto
№ 154 Genie: Generative Interactive Environments ⭐⭐⭐⭐ auto
№ 155 Navigation World Models ⭐⭐⭐⭐ auto
№ 156 UniSim ⭐⭐⭐⭐ auto

VII

Multimodal Ecology

13 papers

Multimodal interaction and data ecosystem · read primer →

№ 05 VLAS: VLA Model With Speech Instructions ⭐⭐⭐ auto
№ 06 MLA: Multisensory Language-Action Model ⭐⭐⭐⭐ auto
№ 64 ImageBind: One Embedding Space To Bind Them All ⭐⭐⭐ auto
№ 65 Connecting Touch and Vision via Cross-Modal Prediction ⭐⭐⭐ auto
№ 66 AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model ⭐⭐⭐ auto
№ 67 AudioPaLM ⭐⭐⭐⭐ auto
№ 68 FROMAGe: Grounding LLMs to Images ⭐⭐⭐ auto
№ 69 OneLLM ⭐⭐⭐ auto
№ 70 X-VLM: Multi-Grained Vision Language Pre-Training ⭐⭐⭐⭐ auto
№ 71 Tactile Beyond Pixels (Sparsh-X) ⭐⭐⭐⭐ auto
№ 72 Sparsh: Self-supervised Touch Representations ⭐⭐⭐⭐ auto
№ 73 Tactile-VLA ⭐⭐⭐⭐ auto
№ 74 TLA: Tactile-Language-Action ⭐⭐⭐⭐ auto

VIII

RF Perception & Mapping

16 papers

RF sensing and spatial mapping · read primer →

№ 08 CartoRadar: RF-Based 3D SLAM Rivaling Vision Approaches ⭐⭐⭐⭐ auto
№ 09 mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment ⭐⭐⭐⭐ auto
№ 10 mmNorm: Non-Line-of-Sight 3D Object Reconstruction via mmWave Surface Normal Estimation ⭐⭐⭐⭐ auto
№ 85 See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar ⭐⭐⭐ auto
№ 86 Can WiFi Estimate Person Pose? ⭐⭐⭐ auto
№ 87 3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning ⭐⭐⭐ auto
№ 88 milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion ⭐⭐⭐ auto
№ 89 High Resolution Point Clouds from mmWave Radar ⭐⭐⭐ auto
№ 90 RadarSLAM: Radar based Large-Scale SLAM in All Weathers ⭐⭐⭐⭐ auto
№ 91 Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm ⭐⭐⭐⭐ auto
№ 92 RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals ⭐⭐⭐ auto
№ 93 RFPose-OT: RF-Based 3D Human Pose Estimation via Optimal Transport Theory ⭐⭐⭐⭐ auto
№ 94 Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on ⭐⭐⭐⭐ auto
№ 95 Diffusion Model is a Good Pose Estimator from 3D RF-Vision ⭐⭐⭐⭐ auto
№ 96 Enabling Visual Recognition at Radio Frequency (PanoRadar) ⭐⭐⭐⭐ auto
№ 97 Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion ⭐⭐⭐⭐ auto

IX

Auditory & Acoustic

15 papers

Auditory intelligence and acoustic spatial interaction · read primer →

№ 11 Proactive Hearing Assistants that Isolate Egocentric Conversations ⭐⭐⭐ auto
№ 12 NeuralAids: Wireless Hearables With Programmable Speech AI Accelerators ⭐⭐⭐ auto
№ 13 Creating speech zones with self-distributing acoustic swarms ⭐⭐⭐ auto
№ 14 Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation ⭐⭐⭐ auto
№ 15 SoundStream: An End-to-End Neural Audio Codec ⭐⭐⭐⭐ auto
№ 16 AudioLM ⭐⭐⭐⭐ auto
№ 17 Conformer ⭐⭐⭐ auto
№ 18 Dual-path RNN ⭐⭐⭐⭐ auto
№ 19 EnCodec ⭐⭐⭐⭐ auto
№ 20 Meta-StyleSpeech ⭐⭐⭐ auto
№ 21 MusicLM ⭐⭐⭐⭐ auto
№ 22 Robust Speech Recognition via Large-Scale Weak Supervision ⭐⭐⭐ auto
№ 23 SeamlessM4T ⭐⭐⭐⭐ auto
№ 24 Stable Audio ⭐⭐⭐⭐ auto
№ 25 Universal Source Separation with Weakly Labelled Data ⭐⭐⭐⭐ auto

X

Datasets & Benchmarks

12 papers

Datasets and evaluation benchmarks · read primer →

№ 26 Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning ⭐⭐ auto
№ 27 RLBench: The Robot Learning Benchmark & Learning Environment ⭐⭐ auto
№ 28 robosuite: A Modular Simulation Framework and Benchmark for Robot Learning ⭐⭐ auto
№ 29 BridgeData V2 ⭐⭐ auto
№ 30 CALVIN ⭐⭐⭐ auto
№ 31 LIBERO ⭐⭐⭐ auto
№ 32 RH20T ⭐⭐⭐ auto
№ 33 What Matters in Learning from Offline Human Demonstrations for Robot Manipulation ⭐⭐⭐ auto
№ 34 DROID ⭐⭐⭐ auto
№ 35 Open X-Embodiment ⭐⭐⭐ auto
№ 36 RoboCasa ⭐⭐⭐ auto
№ 37 SimplerEnv ⭐⭐⭐⭐ auto

XI

Simulation & Sim2Real

11 papers

Simulation and sim-to-real transfer · read primer →

№ 98 Habitat: A Platform for Embodied AI Research ⭐⭐ auto
№ 99 Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning ⭐⭐⭐ auto
№ 100 DexMV ⭐⭐⭐⭐ auto
№ 101 Habitat 2.0 ⭐⭐⭐ auto
№ 102 ManiSkill ⭐⭐⭐ auto
№ 103 ProcTHOR ⭐⭐⭐ auto
№ 104 SAPIEN: A SimulAted Part-based Interactive ENvironment ⭐⭐⭐ auto
№ 105 BEHAVIOR-1K ⭐⭐⭐⭐ auto
№ 106 Habitat 3.0 ⭐⭐⭐ auto
№ 107 Isaac Lab ⭐⭐⭐ auto
№ 108 MuJoCo Playground ⭐⭐⭐ auto