Tag

#language (73 papers)

year	title	topic	venue
2025	VLAS: VLA Model With Speech Instructions	Multimodal Ecology	ICLR
2025	FAST: Efficient Action Tokenization for VLA	Diffusion Policy	RSS
2025	pi_0.5: VLA with Open-World Generalization	Diffusion Policy	arXiv
2025	Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3)	Imitation Learning	RSS
2025	SmolVLA	Imitation Learning	arXiv
2025	Tactile-VLA	Multimodal Ecology	CoRL
2025	TLA: Tactile-Language-Action	Multimodal Ecology	ICRA
2025	OpenHelix	End-to-End VLA	arXiv
2025	OpenVLA-OFT	End-to-End VLA	RSS
2025	1X World Model Challenge	World Model & Video Policy	arXiv
2025	Cosmos World Foundation Model Platform	World Model & Video Policy	arXiv
2024	OpenVLA: An Open-Source Vision-Language-Action Model	End-to-End VLA	CoRL
2024	MLA: Multisensory Language-Action Model	Multimodal Ecology	arXiv
2024	mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment	RF Perception & Mapping	SenSys
2024	DROID	Datasets & Benchmarks	RSS
2024	pi_0: Vision-Language-Action Flow Model	Diffusion Policy	arXiv
2024	Behavior Generation with Latent Actions (VQ-BeT)	Imitation Learning	ICML
2024	OneLLM	Multimodal Ecology	CVPR
2024	GenSim	High-Level Planning	ICLR
2024	RoboFlamingo	High-Level Planning	ICLR
2024	Tree-Planner	High-Level Planning	ICLR
2024	Habitat 3.0	Simulation & Sim2Real	ICLR
2024	Octo: An Open-Source Generalist Robot Policy	End-to-End VLA	RSS
2024	3D-VLA	End-to-End VLA	ICML
2024	GR-2: Generative Video-Language-Action Model	End-to-End VLA	arXiv
2024	RDT-1B: Diffusion Foundation Model for Bimanual Manipulation	End-to-End VLA	ICLR
2024	RoboMamba	End-to-End VLA	NeurIPS
2024	TinyVLA	End-to-End VLA	RA-L
2024	TraceVLA: Visual Trace Prompting	End-to-End VLA	ICLR
2024	DeepSeek-VL: Towards Real-World Vision-Language Understanding	VLM Foundation	arXiv
2024	Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	VLM Foundation	CVPR
2024	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	VLM Foundation	CVPR
2024	Improved Baselines with Visual Instruction Tuning	VLM Foundation	CVPR
2024	What matters when building vision-language models?	VLM Foundation	NeurIPS
2024	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	VLM Foundation	arXiv
2024	The Llama 3 Herd of Models	VLM Foundation	arXiv
2024	LLaVA-NeXT-Interleave	VLM Foundation	arXiv
2024	LLaVA-OneVision: Easy Visual Task Transfer	VLM Foundation	arXiv
2024	Long-CLIP: Unlocking the Long-Text Capability of CLIP	VLM Foundation	ECCV
2024	Pixtral 12B	VLM Foundation	arXiv
2023	LLaVA: Visual Instruction Tuning	VLM Foundation	NeurIPS
2023	AudioLM	Auditory & Acoustic	TASLP
2023	EnCodec	Auditory & Acoustic	TMLR
2023	Robust Speech Recognition via Large-Scale Weak Supervision	Auditory & Acoustic	ICML
2023	SeamlessM4T	Auditory & Acoustic	arXiv
2023	Open X-Embodiment	Datasets & Benchmarks	ICRA
2023	RoboCat	Imitation Learning	TMLR
2023	AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model	Multimodal Ecology	EACL
2023	AudioPaLM	Multimodal Ecology	arXiv
2023	FROMAGe: Grounding LLMs to Images	Multimodal Ecology	ICML
2023	Code as Policies: Language Model Programs for Embodied Control	High-Level Planning	ICRA
2023	LLM+P: Empowering LLMs with Optimal Planning	High-Level Planning	arXiv
2023	PaLM-E: An Embodied Multimodal Language Model	High-Level Planning	ICML
2023	ProgPrompt	High-Level Planning	ICRA
2023	ChatGPT for Robotics	High-Level Planning	IEEE Access
2023	VoxPoser	High-Level Planning	CoRL
2023	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	End-to-End VLA	CoRL
2023	RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches	End-to-End VLA	ICLR
2023	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	VLM Foundation	ICML
2023	OBELICS	VLM Foundation	NeurIPS
2023	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	VLM Foundation	arXiv
2023	Transformers are Sample-Efficient World Models	World Model & Video Policy	ICLR
2023	TWM: Transformer-based World Models	World Model & Video Policy	ICLR
2023	GAIA-1	World Model & Video Policy	arXiv
2022	SayCan: Do As I Can, Not As I Say	High-Level Planning	CoRL
2022	Behavior Transformers: Cloning k Modes with One Stone	Imitation Learning	NeurIPS
2022	X-VLM: Multi-Grained Vision Language Pre-Training	Multimodal Ecology	ICML
2022	Inner Monologue: Embodied Reasoning through Planning with Language Models	High-Level Planning	CoRL
2022	ProcTHOR	Simulation & Sim2Real	NeurIPS
2022	RT-1: Robotics Transformer for Real-World Control at Scale	End-to-End VLA	RSS
2022	Flamingo: a Visual Language Model for Few-Shot Learning	VLM Foundation	NeurIPS
2022	BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	VLM Foundation	ICML
2021	Learning Transferable Visual Models From Natural Language Supervision	VLM Foundation	ICML