Embodied AI Reading Station
Tag

#transformer (78 papers)

year | title | topic | venue
2025 | DiT-Policy | Diffusion Policy | ICRA
2025 | FAST: Efficient Action Tokenization for VLA | Diffusion Policy | RSS
2025 | pi_0.5: VLA with Open-World Generalization | Diffusion Policy | arXiv
2025 | Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3) | Imitation Learning | RSS
2025 | Tactile Beyond Pixels (Sparsh-X) | Multimodal Ecology | CoRL
2025 | Tactile-VLA | Multimodal Ecology | CoRL
2025 | TLA: Tactile-Language-Action | Multimodal Ecology | ICRA
2025 | Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion | RF Perception & Mapping | arXiv
2025 | OpenHelix | End-to-End VLA | arXiv
2025 | OpenVLA-OFT | End-to-End VLA | RSS
2025 | SpatialVLA | End-to-End VLA | arXiv
2025 | Dreamer V3: Mastering Diverse Domains through World Models | World Model & Video Policy | Nature
2025 | 1X World Model Challenge | World Model & Video Policy | arXiv
2025 | Cosmos World Foundation Model Platform | World Model & Video Policy | arXiv
2025 | Navigation World Models | World Model & Video Policy | CVPR
2024 | Stable Audio | Auditory & Acoustic | ICML
2024 | DROID | Datasets & Benchmarks | RSS
2024 | 3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations | Diffusion Policy | RSS
2024 | Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation | Diffusion Policy | RSS
2024 | EquiBot: SIM(3)-Equivariant Diffusion Policy | Diffusion Policy | CoRL
2024 | Affordance-based Robot Manipulation with Flow Matching | Diffusion Policy | IROS
2024 | pi_0: Vision-Language-Action Flow Model | Diffusion Policy | arXiv
2024 | DexCap | Imitation Learning | RSS
2024 | HumanPlus | Imitation Learning | CoRL
2024 | Mobile ALOHA | Imitation Learning | CoRL
2024 | Behavior Generation with Latent Actions (VQ-BeT) | Imitation Learning | ICML
2024 | OneLLM | Multimodal Ecology | CVPR
2024 | Sparsh: Self-supervised Touch Representations | Multimodal Ecology | CoRL
2024 | RoboFlamingo | High-Level Planning | ICLR
2024 | Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on | RF Perception & Mapping | SenSys
2024 | Diffusion Model is a Good Pose Estimator from 3D RF-Vision | RF Perception & Mapping | CVPR
2024 | 3D Diffusion Policy (DP3) | End-to-End VLA | RSS
2024 | Octo: An Open-Source Generalist Robot Policy | End-to-End VLA | RSS
2024 | 3D-VLA | End-to-End VLA | ICML
2024 | GR-2: Generative Video-Language-Action Model | End-to-End VLA | arXiv
2024 | RDT-1B: Diffusion Foundation Model for Bimanual Manipulation | End-to-End VLA | ICLR
2024 | RoboMamba | End-to-End VLA | NeurIPS
2024 | TinyVLA | End-to-End VLA | RA-L
2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR
2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR
2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR
2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS
2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv
2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV
2024 | Pixtral 12B | VLM Foundation | arXiv
2024 | Genie: Generative Interactive Environments | World Model & Video Policy | ICML
2023 | AudioLM | Auditory & Acoustic | TASLP
2023 | EnCodec | Auditory & Acoustic | TMLR
2023 | MusicLM | Auditory & Acoustic | arXiv
2023 | Robust Speech Recognition via Large-Scale Weak Supervision | Auditory & Acoustic | ICML
2023 | SeamlessM4T | Auditory & Acoustic | arXiv
2023 | BridgeData V2 | Datasets & Benchmarks | dataset-eval
2023 | LIBERO | Datasets & Benchmarks | NeurIPS
2023 | Open X-Embodiment | Datasets & Benchmarks | ICRA
2023 | Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) | Imitation Learning | RSS
2023 | RoboCat | Imitation Learning | TMLR
2023 | ImageBind: One Embedding Space To Bind Them All | Multimodal Ecology | CVPR
2023 | AudioPaLM | Multimodal Ecology | arXiv
2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML
2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL
2023 | RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches | End-to-End VLA | ICLR
2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML
2023 | EVA-CLIP: Improved Training Techniques for CLIP at Scale | VLM Foundation | arXiv
2023 | Sigmoid Loss for Language Image Pre-Training | VLM Foundation | ICCV
2023 | Transformers are Sample-Efficient World Models | World Model & Video Policy | ICLR
2023 | TWM: Transformer-based World Models | World Model & Video Policy | ICLR
2023 | GAIA-1 | World Model & Video Policy | arXiv
2022 | Behavior Transformers: Cloning k Modes with One Stone | Imitation Learning | NeurIPS
2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML
2022 | RT-1: Robotics Transformer for Real-World Control at Scale | End-to-End VLA | RSS
2022 | Flamingo: a Visual Language Model for Few-Shot Learning | VLM Foundation | NeurIPS
2022 | BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation | VLM Foundation | ICML
2022 | FILIP: Fine-grained Interactive Language-Image Pre-Training | VLM Foundation | ICLR
2021 | Meta-StyleSpeech | Auditory & Acoustic | ICML
2021 | Learning Transferable Visual Models From Natural Language Supervision | VLM Foundation | ICML
2020 | Conformer | Auditory & Acoustic | Interspeech
2020 | Dual-path RNN | Auditory & Acoustic | ICASSP
2020 | milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion | RF Perception & Mapping | SenSys