Tag

#VLM (27 papers)

year	title	topic	venue
2025	DexVLA	End-to-End VLA	arXiv
2025	OpenHelix	End-to-End VLA	arXiv
2025	SpatialVLA	End-to-End VLA	arXiv
2024	TraceVLA: Visual Trace Prompting	End-to-End VLA	ICLR
2024	DeepSeek-VL: Towards Real-World Vision-Language Understanding	VLM Foundation	arXiv
2024	Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	VLM Foundation	CVPR
2024	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	VLM Foundation	CVPR
2024	Improved Baselines with Visual Instruction Tuning	VLM Foundation	CVPR
2024	What matters when building vision-language models?	VLM Foundation	NeurIPS
2024	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	VLM Foundation	arXiv
2024	The Llama 3 Herd of Models	VLM Foundation	arXiv
2024	LLaVA-NeXT-Interleave	VLM Foundation	arXiv
2024	LLaVA-OneVision: Easy Visual Task Transfer	VLM Foundation	arXiv
2024	Long-CLIP: Unlocking the Long-Text Capability of CLIP	VLM Foundation	ECCV
2024	Pixtral 12B	VLM Foundation	arXiv
2024	UniSim	World Model & Video Policy	ICLR
2023	LIBERO	Datasets & Benchmarks	NeurIPS
2023	Open X-Embodiment	Datasets & Benchmarks	ICRA
2023	Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA)	Imitation Learning	RSS
2023	PaLM-E: An Embodied Multimodal Language Model	High-Level Planning	ICML
2023	ProgPrompt	High-Level Planning	ICRA
2023	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	End-to-End VLA	CoRL
2023	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	VLM Foundation	ICML
2023	OBELICS	VLM Foundation	NeurIPS
2023	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	VLM Foundation	arXiv
2023	Sigmoid Loss for Language Image Pre-Training	VLM Foundation	ICCV
2022	X-VLM: Multi-Grained Vision Language Pre-Training	Multimodal Ecology	ICML