Tag

#vision (92 papers)

year	title	topic	venue
2025	DiT-Policy	Diffusion Policy	ICRA
2025	Diffusion Policy Policy Optimization (DPPO)	Diffusion Policy	ICLR
2025	FlowPolicy: 3D Flow-based Policy via Consistency Flow Matching	Diffusion Policy	AAAI
2025	Generalizable Humanoid Manipulation with 3D Diffusion Policies (iDP3)	Imitation Learning	RSS
2025	Tactile Beyond Pixels (Sparsh-X)	Multimodal Ecology	CoRL
2025	Tactile-VLA	Multimodal Ecology	CoRL
2025	TLA: Tactile-Language-Action	Multimodal Ecology	ICRA
2025	Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion	RF Perception & Mapping	arXiv
2025	DexVLA	End-to-End VLA	arXiv
2025	OpenVLA-OFT	End-to-End VLA	RSS
2025	SpatialVLA	End-to-End VLA	arXiv
2024	OpenVLA: An Open-Source Vision-Language-Action Model	End-to-End VLA	CoRL
2024	mmCLIP: Boosting mmWave-based Zero-shot HAR via Signal-Text Alignment	RF Perception & Mapping	SenSys
2024	Stable Audio	Auditory & Acoustic	ICML
2024	Universal Source Separation with Weakly Labelled Data	Auditory & Acoustic	TASLP
2024	DROID	Datasets & Benchmarks	RSS
2024	SimplerEnv	Datasets & Benchmarks	NeurIPS
2024	3D Diffusion Policy: Generalizable Visuomotor Policy Learning via Simple 3D Representations	Diffusion Policy	RSS
2024	EquiBot: SIM(3)-Equivariant Diffusion Policy	Diffusion Policy	CoRL
2024	Affordance-based Robot Manipulation with Flow Matching	Diffusion Policy	IROS
2024	pi_0: Vision-Language-Action Flow Model	Diffusion Policy	arXiv
2024	DexCap	Imitation Learning	RSS
2024	Mobile ALOHA	Imitation Learning	CoRL
2024	Universal Manipulation Interface	Imitation Learning	RSS
2024	OneLLM	Multimodal Ecology	CVPR
2024	Sparsh: Self-supervised Touch Representations	Multimodal Ecology	CoRL
2024	GenSim	High-Level Planning	ICLR
2024	RoboFlamingo	High-Level Planning	ICLR
2024	Argus: Multi-View Egocentric Human Mesh Reconstruction Based on Stripped-Down Wearable mmWave Add-on	RF Perception & Mapping	SenSys
2024	Diffusion Model is a Good Pose Estimator from 3D RF-Vision	RF Perception & Mapping	CVPR
2024	Enabling Visual Recognition at Radio Frequency (PanoRadar)	RF Perception & Mapping	MobiCom
2024	3D Diffusion Policy (DP3)	End-to-End VLA	RSS
2024	Octo: An Open-Source Generalist Robot Policy	End-to-End VLA	RSS
2024	3D-VLA	End-to-End VLA	ICML
2024	RDT-1B: Diffusion Foundation Model for Bimanual Manipulation	End-to-End VLA	ICLR
2024	RoboMamba	End-to-End VLA	NeurIPS
2024	TinyVLA	End-to-End VLA	RA-L
2024	TraceVLA: Visual Trace Prompting	End-to-End VLA	ICLR
2024	DeepSeek-VL: Towards Real-World Vision-Language Understanding	VLM Foundation	arXiv
2024	Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks	VLM Foundation	CVPR
2024	InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks	VLM Foundation	CVPR
2024	Improved Baselines with Visual Instruction Tuning	VLM Foundation	CVPR
2024	What matters when building vision-language models?	VLM Foundation	NeurIPS
2024	Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling	VLM Foundation	arXiv
2024	The Llama 3 Herd of Models	VLM Foundation	arXiv
2024	LLaVA-NeXT-Interleave	VLM Foundation	arXiv
2024	LLaVA-OneVision: Easy Visual Task Transfer	VLM Foundation	arXiv
2024	Long-CLIP: Unlocking the Long-Text Capability of CLIP	VLM Foundation	ECCV
2024	Pixtral 12B	VLM Foundation	arXiv
2024	Genie: Generative Interactive Environments	World Model & Video Policy	ICML
2024	UniSim	World Model & Video Policy	ICLR
2023	LLaVA: Visual Instruction Tuning	VLM Foundation	NeurIPS
2023	MusicLM	Auditory & Acoustic	arXiv
2023	Robust Speech Recognition via Large-Scale Weak Supervision	Auditory & Acoustic	ICML
2023	BridgeData V2	Datasets & Benchmarks	CoRL
2023	LIBERO	Datasets & Benchmarks	NeurIPS
2023	RH20T	Datasets & Benchmarks	RSS Workshop
2023	AnyTeleop	Imitation Learning	CoRL
2023	RoboCat	Imitation Learning	TMLR
2023	ImageBind: One Embedding Space To Bind Them All	Multimodal Ecology	CVPR
2023	AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model	Multimodal Ecology	EACL
2023	FROMAGe: Grounding LLMs to Images	Multimodal Ecology	ICML
2023	PaLM-E: An Embodied Multimodal Language Model	High-Level Planning	ICML
2023	VoxPoser	High-Level Planning	CoRL
2023	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control	End-to-End VLA	CoRL
2023	RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches	End-to-End VLA	ICLR
2023	BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	VLM Foundation	ICML
2023	EVA-CLIP: Improved Training Techniques for CLIP at Scale	VLM Foundation	arXiv
2023	OBELICS	VLM Foundation	NeurIPS
2023	Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	VLM Foundation	arXiv
2023	Sigmoid Loss for Language Image Pre-Training	VLM Foundation	ICCV
2023	GAIA-1	World Model & Video Policy	arXiv
2022	CALVIN	Datasets & Benchmarks	RA-L
2022	X-VLM: Multi-Grained Vision Language Pre-Training	Multimodal Ecology	ICML
2022	Inner Monologue: Embodied Reasoning through Planning with Language Models	High-Level Planning	CoRL
2022	RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals	RF Perception & Mapping	TMM
2022	DexMV	Simulation & Sim2Real	ECCV
2022	Flamingo: a Visual Language Model for Few-Shot Learning	VLM Foundation	NeurIPS
2022	BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	VLM Foundation	ICML
2022	FILIP: Fine-grained Interactive Language-Image Pre-Training	VLM Foundation	ICLR
2022	DayDreamer	World Model & Video Policy	CoRL
2021	ManiSkill	Simulation & Sim2Real	NeurIPS
2021	Learning Transferable Visual Models From Natural Language Supervision	VLM Foundation	ICML
2021	Mastering Atari with Discrete World Models	World Model & Video Policy	ICLR
2020	Conformer	Auditory & Acoustic	Interspeech
2020	See Through Smoke: Robust Indoor Mapping with Low-cost mmWave Radar	RF Perception & Mapping	SenSys
2020	milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion	RF Perception & Mapping	SenSys
2020	RadarSLAM: Radar based Large-Scale SLAM in All Weathers	RF Perception & Mapping	BMVC
2019	RLBench: The Robot Learning Benchmark & Learning Environment	Datasets & Benchmarks	RA-L
2019	Connecting Touch and Vision via Cross-Modal Prediction	Multimodal Ecology	CVPR
2019	Through-Wall Pose Imaging in Real-Time with a Many-to-Many Encoder/Decoder Paradigm	RF Perception & Mapping	arXiv
2019	Habitat: A Platform for Embodied AI Research	Simulation & Sim2Real	ICCV