Topic IX · Auditory Intelligence and Acoustic Spatial Interaction

Auditory & Acoustic

15 papers

3 founder

8 classic

4 frontier

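Teaching robots to understand the world by ear: from speech recognition to environmental sound separation to acoustic localization. This is the most underrated part of embodied perception, yet every household robot needs it.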

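Primer · 3 papers to start with

Read these three first.

Whisper trains ASR on 680,000 hours of weakly labeled audio → AudioLM models audio as a token sequence → Acoustic Swarms separates sound sources with a swarm of microphones.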

1
Robust Speech Recognition via Large-Scale Weak Supervision 2023 · ICML · ⭐⭐⭐
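Whisper throws 680,000 hours of web audio and its accompanying transcripts into a plain Transformer. Out of the box it copes with varied accents, background noise, and long recordings, and it translates as well. Its edge comes from the diversity of the data.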
2
AudioLM 2023 · TASLP · ⭐⭐⭐⭐
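Splits sound into two kinds of "audio words": one captures what is being said, the other captures the timbre of the voice. The model then continues the sequence the way a language model continues a sentence; given 3 seconds of audio, it produces a continuation that sounds like the same speaker.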
3
Creating speech zones with self-distributing acoustic swarms 2023 · Nature · ⭐⭐⭐
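Seven small robots, each about the size of a die, crawl out across the table on their own and spread into a ring. When several people at the table talk at once, the swarm can tell who said what.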
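To make the "many microphones" idea concrete, here is a minimal sketch of the classic time-difference-of-arrival estimate between two microphones, computed with brute-force cross-correlation. It is not the method of any paper listed here; the function names, the synthetic noise signal, the 16 kHz sample rate, and the 5-sample delay are all assumptions chosen for illustration.

```python
import random


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means the sound reached `other` later than `ref`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation score for this candidate lag.
        score = 0.0
        for i, x in enumerate(ref):
            j = i + lag
            if 0 <= j < len(other):
                score += x * other[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def delayed(signal, delay):
    """Shift `signal` later by `delay` samples, zero-padding the start."""
    return [0.0] * delay + signal[: len(signal) - delay]


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(400)]  # noise standing in for speech
mic_a = source              # hypothetical microphone closer to the talker
mic_b = delayed(source, 5)  # hypothetical microphone that hears it 5 samples later

lag = estimate_delay(mic_a, mic_b, max_lag=20)

SAMPLE_RATE = 16_000    # Hz, assumed
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature
path_difference_m = lag / SAMPLE_RATE * SPEED_OF_SOUND
```

With these assumed values the recovered lag is 5 samples, which corresponds to a path difference of roughly 0.107 m. With more than two microphones at known positions, several such pairwise differences constrain where the source can be, which is one reason spreading the microphones out helps.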

Distribution · by year

Legend: founder · classic · frontier

All papers · by era

era	year	title	venue
frontier	2024	Proactive Hearing Assistants that Isolate Egocentric Conversations	UIST
classic	2024	NeuralAids: Wireless Hearables With Programmable Speech AI Accelerators	MobiCom
founder	2023	Creating speech zones with self-distributing acoustic swarms	Nature
founder	2019	Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation	IEEE/ACM TASLP
founder	2022	SoundStream: An End-to-End Neural Audio Codec	IEEE/ACM TASLP
classic	2020	Conformer	Interspeech
classic	2020	Dual-path RNN	ICASSP
classic	2021	Meta-StyleSpeech	ICML
classic	2023	AudioLM	TASLP
classic	2023	EnCodec	TMLR
classic	2023	MusicLM	arXiv
classic	2023	Robust Speech Recognition via Large-Scale Weak Supervision	ICML
frontier	2023	SeamlessM4T	arXiv
frontier	2024	Stable Audio	ICML
frontier	2024	Universal Source Separation with Weakly Labelled Data	TASLP