11 chapters · 156 papers.
Vision-Language Foundations · read primer →
- № 01 LLaVA: Visual Instruction Tuning ⭐⭐ auto
- № 02 3DShape2VecSet: 3D Shape Representation for Diffusion Models ⭐⭐⭐⭐ auto
- № 124 Learning Transferable Visual Models From Natural Language Supervision ⭐⭐⭐ auto
- № 125 Flamingo: a Visual Language Model for Few-Shot Learning ⭐⭐⭐⭐ auto
- № 126 BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models ⭐⭐⭐⭐ auto
- № 127 BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation ⭐⭐⭐ auto
- № 128 DeepSeek-VL: Towards Real-World Vision-Language Understanding ⭐⭐⭐ auto
- № 129 EVA-CLIP: Improved Training Techniques for CLIP at Scale ⭐⭐⭐ auto
- № 130 FILIP: Fine-grained Interactive Language-Image Pre-Training ⭐⭐⭐ auto
- № 131 Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks ⭐⭐⭐ auto
- № 132 InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks ⭐⭐⭐⭐ auto
- № 133 Improved Baselines with Visual Instruction Tuning ⭐⭐ auto
- № 134 OBELICS ⭐⭐⭐ auto
- № 135 Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond ⭐⭐⭐ auto
- № 136 Sigmoid Loss for Language Image Pre-Training ⭐⭐⭐ auto
- № 137 What matters when building vision-language models? ⭐⭐⭐ auto
- № 138 Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling ⭐⭐⭐⭐ auto
- № 139 The Llama 3 Herd of Models ⭐⭐⭐⭐ auto
- № 140 LLaVA-NeXT-Interleave ⭐⭐⭐ auto
- № 141 LLaVA-OneVision: Easy Visual Task Transfer ⭐⭐⭐ auto
- № 142 Long-CLIP: Unlocking the Long-Text Capability of CLIP ⭐⭐⭐ auto
- № 143 Pixtral 12B ⭐⭐⭐ auto
High-Level Task Planning · read primer →
- № 03 SayCan: Do As I Can, Not As I Say ⭐⭐ auto
- № 75 Code as Policies: Language Model Programs for Embodied Control ⭐⭐⭐ auto
- № 76 Inner Monologue: Embodied Reasoning through Planning with Language Models ⭐⭐⭐ auto
- № 77 LLM+P: Empowering LLMs with Optimal Planning ⭐⭐⭐ auto
- № 78 PaLM-E: An Embodied Multimodal Language Model ⭐⭐⭐⭐ auto
- № 79 ProgPrompt ⭐⭐ auto
- № 80 ChatGPT for Robotics ⭐⭐ auto
- № 81 GenSim ⭐⭐⭐ auto
- № 82 RoboFlamingo ⭐⭐⭐⭐ auto
- № 83 Tree-Planner ⭐⭐⭐ auto
- № 84 VoxPoser ⭐⭐⭐⭐ auto
End-to-End Vision-Language-Action · read primer →
- № 04 OpenVLA: An Open-Source Vision-Language-Action Model ⭐⭐⭐ auto
- № 109 RT-1: Robotics Transformer for Real-World Control at Scale ⭐⭐⭐ auto
- № 110 3D Diffusion Policy (DP3) ⭐⭐⭐ auto
- № 111 Octo: An Open-Source Generalist Robot Policy ⭐⭐⭐ auto
- № 112 RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control ⭐⭐⭐⭐ auto
- № 113 RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches ⭐⭐⭐ auto
- № 114 3D-VLA ⭐⭐⭐⭐ auto
- № 115 DexVLA ⭐⭐⭐⭐ auto
- № 116 GR-2: Generative Video-Language-Action Model ⭐⭐⭐⭐ auto
- № 117 OpenHelix ⭐⭐⭐ auto
- № 118 OpenVLA-OFT ⭐⭐⭐ auto
- № 119 RDT-1B: Diffusion Foundation Model for Bimanual Manipulation ⭐⭐⭐⭐ auto
- № 120 RoboMamba ⭐⭐⭐ auto
- № 121 SpatialVLA ⭐⭐⭐⭐ auto
- № 122 TinyVLA ⭐⭐⭐ auto
- № 123 TraceVLA: Visual Trace Prompting ⭐⭐⭐ auto
Diffusion Policies and Flow Matching · read primer →
- № 38 Diffusion Policy: Visuomotor Policy Learning via Action Diffusion ⭐⭐⭐ auto
- № 39 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations ⭐⭐⭐ auto
- № 40 Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation ⭐⭐⭐ auto
- № 41 EquiBot: SIM(3)-Equivariant Diffusion Policy ⭐⭐⭐⭐ auto
- № 42 DiT-Policy ⭐⭐⭐⭐ auto
- № 43 Diffusion Policy Policy Optimization (DPPO) ⭐⭐⭐⭐ auto
- № 44 Affordance-based Robot Manipulation with Flow Matching ⭐⭐⭐ auto
- № 45 FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching ⭐⭐⭐⭐ auto
- № 46 FAST: Efficient Action Tokenization for VLA ⭐⭐⭐⭐ auto
- № 47 pi_0: Vision-Language-Action Flow Model ⭐⭐⭐⭐ auto
- № 48 pi_0.5: VLA with Open-World Generalization ⭐⭐⭐⭐⭐ auto
Imitation Learning and Teleoperation · read primer →
- № 49 A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning ⭐⭐⭐⭐ auto
- № 50 Generative Adversarial Imitation Learning ⭐⭐⭐⭐ auto
- № 51 Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) ⭐⭐⭐ auto
- № 52 AnyTeleop ⭐⭐⭐ auto
- № 53 Behavior Transformers: Cloning k Modes with One Stone ⭐⭐⭐ auto
- № 54 Implicit Behavioral Cloning ⭐⭐⭐⭐ auto
- № 55 RoboCat ⭐⭐⭐⭐ auto
- № 56 ALOHA 2 ⭐⭐ auto
- № 57 DexCap ⭐⭐⭐ auto
- № 58 HumanPlus ⭐⭐⭐⭐ auto
- № 59 Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) ⭐⭐⭐⭐ auto
- № 60 Mobile ALOHA ⭐⭐⭐ auto
- № 61 SmolVLA ⭐⭐⭐ auto
- № 62 Universal Manipulation Interface ⭐⭐⭐ auto
- № 63 Behavior Generation with Latent Actions (VQ-BeT) ⭐⭐⭐⭐ auto
World Models and Video Policies · read primer →
- № 07 Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control ⭐⭐⭐⭐⭐ auto
- № 144 Dream to Control: Learning Behaviors by Latent Imagination ⭐⭐⭐⭐ auto
- № 145 World Models ⭐⭐⭐ auto
- № 146 DayDreamer ⭐⭐⭐ auto
- № 147 Mastering Atari with Discrete World Models ⭐⭐⭐⭐ auto
- № 148 Dreamer V3: Mastering Diverse Domains through World Models ⭐⭐⭐⭐ auto
- № 149 Transformers are Sample-Efficient World Models ⭐⭐⭐⭐ auto
- № 150 TWM: Transformer-based World Models ⭐⭐⭐⭐ auto
- № 151 1X World Model Challenge ⭐⭐⭐ auto
- № 152 Cosmos World Foundation Model Platform ⭐⭐⭐⭐⭐ auto
- № 153 GAIA-1 ⭐⭐⭐⭐ auto
- № 154 Genie: Generative Interactive Environments ⭐⭐⭐⭐ auto
- № 155 Navigation World Models ⭐⭐⭐⭐ auto
- № 156 UniSim ⭐⭐⭐⭐ auto
Multimodal Interaction and Data Ecosystem · read primer →
- № 05 VLAS: VLA Model With Speech Instructions ⭐⭐⭐ auto
- № 06 MLA: Multisensory Language-Action Model ⭐⭐⭐⭐ auto
- № 64 ImageBind: One Embedding Space To Bind Them All ⭐⭐⭐ auto
- № 65 Connecting Touch and Vision via Cross-Modal Prediction ⭐⭐⭐ auto
- № 66 AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model ⭐⭐⭐ auto
- № 67 AudioPaLM ⭐⭐⭐⭐ auto
- № 68 FROMAGe: Grounding LLMs to Images ⭐⭐⭐ auto
- № 69 OneLLM ⭐⭐⭐ auto
- № 70 X-VLM: Multi-Grained Vision Language Pre-Training ⭐⭐⭐⭐ auto
- № 71 Tactile Beyond Pixels (Sparsh-X) ⭐⭐⭐⭐ auto
- № 72 Sparsh: Self-supervised Touch Representations ⭐⭐⭐⭐ auto
- № 73 Tactile-VLA ⭐⭐⭐⭐ auto
- № 74 TLA: Tactile-Language-Action ⭐⭐⭐⭐ auto
RF Sensing and Spatial Mapping · read primer →
- № 08 CartoRadar: RF-Based 3D SLAM Rivaling Vision Approaches ⭐⭐⭐⭐ auto
- № 09 mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment ⭐⭐⭐⭐ auto
- № 10 mmNorm: Non-Line-of-Sight 3D Object Reconstruction via mmWave Surface Normal Estimation ⭐⭐⭐⭐ auto
- № 85 See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar ⭐⭐⭐ auto
- № 86 Can WiFi Estimate Person Pose? ⭐⭐⭐ auto
- № 87 3DRIMR: 3D Reconstruction and Imaging via mmWave Radar based on Deep Learning ⭐⭐⭐ auto
- № 88 milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion ⭐⭐⭐ auto
- № 89 High Resolution Point Clouds from mmWave Radar ⭐⭐⭐ auto
- № 90 RadarSLAM: Radar based Large-Scale SLAM in All Weathers ⭐⭐⭐⭐ auto
- № 91 Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm ⭐⭐⭐⭐ auto
- № 92 RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals ⭐⭐⭐ auto
- № 93 RFPose-OT: RF-Based 3D Human Pose Estimation via Optimal Transport Theory ⭐⭐⭐⭐ auto
- № 94 Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on ⭐⭐⭐⭐ auto
- № 95 Diffusion Model is a Good Pose Estimator from 3D RF-Vision ⭐⭐⭐⭐ auto
- № 96 Enabling Visual Recognition at Radio Frequency (PanoRadar) ⭐⭐⭐⭐ auto
- № 97 Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion ⭐⭐⭐⭐ auto
Auditory Intelligence and Acoustic Spatial Interaction · read primer →
- № 11 Proactive Hearing Assistants that Isolate Egocentric Conversations ⭐⭐⭐ auto
- № 12 NeuralAids: Wireless Hearables With Programmable Speech AI Accelerators ⭐⭐⭐ auto
- № 13 Creating speech zones with self-distributing acoustic swarms ⭐⭐⭐ auto
- № 14 Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation ⭐⭐⭐ auto
- № 15 SoundStream: An End-to-End Neural Audio Codec ⭐⭐⭐⭐ auto
- № 16 AudioLM ⭐⭐⭐⭐ auto
- № 17 Conformer ⭐⭐⭐ auto
- № 18 Dual-path RNN ⭐⭐⭐⭐ auto
- № 19 EnCodec ⭐⭐⭐⭐ auto
- № 20 Meta-StyleSpeech ⭐⭐⭐ auto
- № 21 MusicLM ⭐⭐⭐⭐ auto
- № 22 Robust Speech Recognition via Large-Scale Weak Supervision ⭐⭐⭐ auto
- № 23 SeamlessM4T ⭐⭐⭐⭐ auto
- № 24 Stable Audio ⭐⭐⭐⭐ auto
- № 25 Universal Source Separation with Weakly Labelled Data ⭐⭐⭐⭐ auto
Datasets and Evaluation Benchmarks · read primer →
- № 26 Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning ⭐⭐ auto
- № 27 RLBench: The Robot Learning Benchmark & Learning Environment ⭐⭐ auto
- № 28 robosuite: A Modular Simulation Framework and Benchmark for Robot Learning ⭐⭐ auto
- № 29 BridgeData V2 ⭐⭐ auto
- № 30 CALVIN ⭐⭐⭐ auto
- № 31 LIBERO ⭐⭐⭐ auto
- № 32 RH20T ⭐⭐⭐ auto
- № 33 What Matters in Learning from Offline Human Demonstrations for Robot Manipulation ⭐⭐⭐ auto
- № 34 DROID ⭐⭐⭐ auto
- № 35 Open X-Embodiment ⭐⭐⭐ auto
- № 36 RoboCasa ⭐⭐⭐ auto
- № 37 SimplerEnv ⭐⭐⭐⭐ auto
Simulation and Sim-to-Real Transfer · read primer →
- № 98 Habitat: A Platform for Embodied AI Research ⭐⭐ auto
- № 99 Isaac Gym: High Performance GPU-Based Physics Simulation For Robot Learning ⭐⭐⭐ auto
- № 100 DexMV ⭐⭐⭐⭐ auto
- № 101 Habitat 2.0 ⭐⭐⭐ auto
- № 102 ManiSkill ⭐⭐⭐ auto
- № 103 ProcTHOR ⭐⭐⭐ auto
- № 104 SAPIEN: A SimulAted Part-based Interactive ENvironment ⭐⭐⭐ auto
- № 105 BEHAVIOR-1K ⭐⭐⭐⭐ auto
- № 106 Habitat 3.0 ⭐⭐⭐ auto
- № 107 Isaac Lab ⭐⭐⭐ auto
- № 108 MuJoCo Playground ⭐⭐⭐ auto