Embodied AI Reading Station
Tag

#VLM (27 papers)

| Year | Title | Topic | Venue |
|------|-------|-------|-------|
| 2025 | DexVLA | End-to-End VLA | arXiv |
| 2025 | OpenHelix | End-to-End VLA | arXiv |
| 2025 | SpatialVLA | End-to-End VLA | arXiv |
| 2024 | TraceVLA: Visual Trace Prompting | End-to-End VLA | ICLR |
| 2024 | DeepSeek-VL: Towards Real-World Vision-Language Understanding | VLM Foundation | arXiv |
| 2024 | Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks | VLM Foundation | CVPR |
| 2024 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks | VLM Foundation | CVPR |
| 2024 | Improved Baselines with Visual Instruction Tuning | VLM Foundation | CVPR |
| 2024 | What matters when building vision-language models? | VLM Foundation | NeurIPS |
| 2024 | Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | VLM Foundation | arXiv |
| 2024 | The Llama 3 Herd of Models | VLM Foundation | arXiv |
| 2024 | LLaVA-NeXT-Interleave | VLM Foundation | arXiv |
| 2024 | LLaVA-OneVision: Easy Visual Task Transfer | VLM Foundation | arXiv |
| 2024 | Long-CLIP: Unlocking the Long-Text Capability of CLIP | VLM Foundation | ECCV |
| 2024 | Pixtral 12B | VLM Foundation | arXiv |
| 2024 | UniSim | World Model & Video Policy | ICLR |
| 2023 | LIBERO | Datasets & Benchmarks | NeurIPS |
| 2023 | Open X-Embodiment | Datasets & Benchmarks | ICRA |
| 2023 | Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT/ALOHA) | Imitation Learning | RSS |
| 2023 | PaLM-E: An Embodied Multimodal Language Model | High-Level Planning | ICML |
| 2023 | ProgPrompt | High-Level Planning | ICRA |
| 2023 | RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | End-to-End VLA | CoRL |
| 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | VLM Foundation | ICML |
| 2023 | OBELICS | VLM Foundation | NeurIPS |
| 2023 | Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | VLM Foundation | arXiv |
| 2023 | Sigmoid Loss for Language Image Pre-Training | VLM Foundation | ICCV |
| 2022 | X-VLM: Multi-Grained Vision Language Pre-Training | Multimodal Ecology | ICML |