Tag

#transformer (78 papers)

year	title	topic	venue
2025	DiT-Policy	Diffusion Policy	ICRA
2025	FAST: Efficient Action Tokenization for VLA	Diffusion Policy	RSS
2025	pi_0.5: VLA with Open-World Generalization	Diffusion Policy	arXiv
2025	Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3)	Imitation Learning	RSS
2025	Tactile Beyond Pixels (Sparsh-X)	Multimodal Ecology	CoRL
2025	Tactile-VLA	Multimodal Ecology	CoRL
2025	TLA: Tactile-Language-Action	Multimodal Ecology	ICRA
2025	Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion	RF Perception & Mapping	arXiv
2025	OpenHelix	End-to-End VLA	arXiv
2025	OpenVLA-OFT	End-to-End VLA	RSS
2025	SpatialVLA	End-to-End VLA	arXiv
2025	Dreamer V3: Mastering Diverse Domains through World Models	World Model & Video Policy	Nature
2025	1X World Model Challenge	World Model & Video Policy	arXiv
2025	Cosmos World Foundation Model Platform	World Model & Video Policy	arXiv
2025	Navigation World Models	World Model & Video Policy	CVPR
2024	Stable Audio	Auditory & Acoustic	ICML
2024	DROID	Datasets & Benchmarks	RSS
2024	3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations	Diffusion Policy	RSS
2024	Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation	Diffusion Policy	RSS
2024	EquiBot: SIM(3)-Equivariant Diffusion Policy	Diffusion Policy	CoRL
2024	Affordance-based Robot Manipulation with Flow Matching	Diffusion Policy	IROS
2024	pi_0: Vision-Language-Action Flow Model	Diffusion Policy	arXiv
2024	DexCap	Imitation Learning	RSS
2024	HumanPlus	Imitation Learning	CoRL
2024	Mobile ALOHA	Imitation Learning	CoRL
2024	Behavior Generation with Latent Actions (VQ-BeT)	Imitation Learning	ICML
2024	OneLLM	Multimodal Ecology	CVPR
2024	Sparsh: Self-supervised Touch Representations	Multimodal Ecology	CoRL
2024	RoboFlamingo	High-Level Planning	ICLR
2024	Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on	RF Perception & Mapping	SenSys
2024	Diffusion Model is a Good Pose Estimator from 3D RF-Vision	RF Perception & Mapping	CVPR
2024	3D Diffusion Policy (DP3)	End-to-End VLA	RSS
2024	Octo: An Open-Source Generalist Robot Policy	End-to-End VLA	RSS
2024	3D-VLA	End-to-End VLA	ICML
2024	GR-2: Generative Video-Language-Action Model	End-to-End VLA	arXiv
2024	RDT-1B: Diffusion Foundation Model for Bimanual Manipulation	End-to-End VLA	ICLR
2024	RoboMamba	End-to-End VLA	NeurIPS
2024	TinyVLA	End-to-End VLA	RA-L
2024	TraceVLA: Visual Trace Prompting	End-to-End VLA	ICLR
2024	Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	VLM Foundation	CVPR
2024	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	VLM Foundation	CVPR
2024	What matters when building vision-language models?	VLM Foundation	NeurIPS
2024	The Llama 3 Herd of Models	VLM Foundation	arXiv
2024	Long-CLIP: Unlocking the Long-Text Capability of CLIP	VLM Foundation	ECCV
2024	Pixtral 12B	VLM Foundation	arXiv
2024	Genie: Generative Interactive Environments	World Model & Video Policy	ICML
2023	AudioLM	Auditory & Acoustic	TASLP
2023	EnCodec	Auditory & Acoustic	TMLR
2023	MusicLM	Auditory & Acoustic	arXiv
2023	Robust Speech Recognition via Large-Scale Weak Supervision	Auditory & Acoustic	ICML
2023	SeamlessM4T	Auditory & Acoustic	arXiv
2023	BridgeData V2	Datasets & Benchmarks	CoRL
2023	LIBERO	Datasets & Benchmarks	NeurIPS
2023	Open X-Embodiment	Datasets & Benchmarks	ICRA
2023	Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA)	Imitation Learning	RSS
2023	RoboCat	Imitation Learning	TMLR
2023	ImageBind: One Embedding Space To Bind Them All	Multimodal Ecology	CVPR
2023	AudioPaLM	Multimodal Ecology	arXiv
2023	PaLM-E: An Embodied Multimodal Language Model	High-Level Planning	ICML
2023	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	End-to-End VLA	CoRL
2023	RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches	End-to-End VLA	ICLR
2023	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	VLM Foundation	ICML
2023	EVA-CLIP: Improved Training Techniques for CLIP at Scale	VLM Foundation	arXiv
2023	Sigmoid Loss for Language Image Pre-Training	VLM Foundation	ICCV
2023	Transformers are Sample-Efficient World Models	World Model & Video Policy	ICLR
2023	TWM: Transformer-based World Models	World Model & Video Policy	ICLR
2023	GAIA-1	World Model & Video Policy	arXiv
2022	Behavior Transformers: Cloning k Modes with One Stone	Imitation Learning	NeurIPS
2022	X-VLM: Multi-Grained Vision Language Pre-Training	Multimodal Ecology	ICML
2022	RT-1: Robotics Transformer for Real-World Control at Scale	End-to-End VLA	RSS
2022	Flamingo: a Visual Language Model for Few-Shot Learning	VLM Foundation	NeurIPS
2022	BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	VLM Foundation	ICML
2022	FILIP: Fine-grained Interactive Language-Image Pre-Training	VLM Foundation	ICLR
2021	Meta-StyleSpeech	Auditory & Acoustic	ICML
2021	Learning Transferable Visual Models From Natural Language Supervision	VLM Foundation	ICML
2020	Conformer	Auditory & Acoustic	Interspeech
2020	Dual-path RNN	Auditory & Acoustic	ICASSP
2020	milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion	RF Perception & Mapping	SenSys