RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals
This note is based on the abstract and public material; I have not read the full paper.
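In one sentence (TL;DR)
In a pitch-black room a camera sees nothing, but radar echoes can still "hear" the shape of a person. RFMask has a model turn radar signals directly into a fine-grained silhouette of each person: head, shoulders, and arms included.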
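The setting: an everyday analogy
You can't put a camera in the bathroom at home, yet you worry that your grandfather, who lives alone, might fall in there. It is a real dilemma.
Bats fly normally in pitch-dark caves by emitting ultrasound and listening to the echoes to judge what lies ahead. People can do a rough version of the same thing: close your eyes and clap in a room, and the echoes from walls, tables, and bodies differ enough that you can roughly make out "there's a person-shaped thing two meters ahead". Millimeter-wave radar (mmWave radar) does exactly this, except that it uses electromagnetic waves with millimeter-scale wavelengths instead of sound, its "clapping frequency" is far higher, and the reflected signal carries multi-dimensional information such as range, velocity, and phase.
RFMask wants to take the bat's trick one level further: not just tell you "there's a person over there", but draw that person, showing where the head is, where the shoulders are, and how far a raised arm reaches. In effect, it listens to an echo and hands you a Photoshop-style cutout of the person's silhouette.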
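When it is especially useful:
- Elder care: bathrooms and bedrooms where cameras are inappropriate, but you still want to know whether someone has fallen
- Firefighting and rescue: rooms full of smoke, where an ordinary camera image is washed out
- Night or strong backlight: situations where cameras are effectively blind but radar keeps working
- Privacy-sensitive places: hospitals and public restrooms, where you don't want to capture faces but need to know how many people are present and in what posture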

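What came before
- RGB cameras + instance segmentation: the Mask R-CNN line of work. Accurate, but it depends on lighting and line of sight, handles occlusion poorly, and raises privacy concerns
- Depth cameras / Kinect: give point-cloud outlines, but degrade in outdoor sunlight and have short range
- WiFi / CSI signals: Person-in-WiFi and similar work perceive people from WiFi CSI (2D pose and coarse body segmentation), but the spatial resolution is low, so fine silhouette masks are a stretch
- RF-Pose / RF-Pose3D (the MIT series): map radio signals to a human skeleton (keypoints), but output only a "stick figure" rather than a dense mask
- Traditional radar imaging (SAR/ISAR): can form images, but requires elaborate phase calibration and array design, and is ill-suited to dynamic multi-person scenes
Where RFMask sits: it upgrades the "RF to keypoints" route to "RF to dense silhouettes" while keeping the engineering as simple as possible.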
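The paper's key ideas
Think of a renovation firm photographing both a top view and a side view to measure a room: a single angle misses things, and only two angles together give the full picture. RFMask's three ingredients follow this idea:
- Fusing two radar views: a horizontal and a vertical antenna array scan a "horizontal-plane heatmap" and a "vertical-plane heatmap" respectively, so the model gets both a top view and a side view, which is what gives it a sense of 3D structure
- Borrowing a mature 2D detection/segmentation framework: treat people as objects on the heatmaps and apply a two-stage pipeline similar to Mask R-CNN, first boxing each person and then drawing the outline inside the box
- Using an RGB camera as the "teacher" for the radar: during training the camera and radar record in sync, an off-the-shelf model automatically outlines the people the camera sees to produce the "answer key", and the radar learns from it. Once trained and deployed, the camera is removed and the radar works alone
The key inversion of thinking: don't treat radar as "another kind of camera" and reinvent everything from scratch. Instead, massage the radar data into "image-like" tensors and plug them straight into vision networks that are already proven.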

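How it works (method)
Signal to tensor. Like transcribing a recording into sheet music: the raw waveform is unreadable to a person and has to be transformed into a structured form before later stages can use it. mmWave radar transmits a series of chirps (frequency sweeps), and the receive array collects the raw IF (intermediate-frequency) signal. Three Fourier transforms (range-FFT, Doppler-FFT, angle-FFT) yield a 3D range × angle × Doppler cube. RFMask flattens this cube into two "pseudo-images", one for the horizontal view and one for the vertical view; in each, the two axes are spatial coordinates and the pixel value is reflection intensity. This step is engineering craft, and it determines how much information the downstream network gets.
Wait, slow down: what is a chirp? Imagine a doorbell whose pitch sweeps from low to high, "do-re-mi-fa-sol". When the sweep bounces off an obstacle, the pitch of the returning echo lags the pitch being emitted at that moment, and the size of that gap tells you how far away the obstacle is. A radar chirp works on the same principle, except it sweeps electromagnetic waves at tens of GHz rather than musical notes.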
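The signal-to-tensor step can be illustrated with a toy NumPy sketch. It simulates the IF signal of a single static point target, applies the three FFTs, collapses the Doppler axis into a range × angle pseudo-image, and reads the target position back from the peak. Every parameter (chirp slope, sampling rate, array size) and every function name here is an assumption chosen for illustration; none of it is taken from RFMask, whose actual preprocessing requires reading the paper.

```python
import numpy as np

C = 3e8  # speed of light in m/s

# All radar parameters below are illustrative assumptions, not RFMask's settings.
SLOPE = 30e12      # chirp slope in Hz/s
FS = 5e6           # IF sampling rate in Hz
N_SAMPLES = 128    # samples per chirp (fast time)
N_CHIRPS = 32      # chirps per frame (slow time)
N_RX = 8           # receive antennas in one linear array, half-wavelength spacing
N_ANGLE_BINS = 64  # zero-padded angle-FFT size


def simulate_if_frame(range_m, angle_deg):
    """Toy IF signal for one static point target, shape (samples, chirps, rx)."""
    t = np.arange(N_SAMPLES) / FS
    beat_hz = 2.0 * SLOPE * range_m / C  # farther target -> higher beat frequency
    fast_time = np.exp(2j * np.pi * beat_hz * t)
    slow_time = np.ones(N_CHIRPS)        # static target: no Doppler phase rotation
    phase_step = np.pi * np.sin(np.deg2rad(angle_deg))
    rx_phase = np.exp(1j * phase_step * np.arange(N_RX))
    return fast_time[:, None, None] * slow_time[None, :, None] * rx_phase[None, None, :]


def radar_cube(if_frame):
    """Range, Doppler and angle FFTs; returns magnitudes, shape (range, doppler, angle)."""
    cube = np.fft.fft(if_frame, axis=0)
    cube = np.fft.fftshift(np.fft.fft(cube, axis=1), axes=1)
    cube = np.fft.fftshift(np.fft.fft(cube, n=N_ANGLE_BINS, axis=2), axes=2)
    return np.abs(cube)


def range_angle_heatmap(cube):
    """Collapse the Doppler axis to get a 2D range x angle pseudo-image."""
    return cube.sum(axis=1)


def peak_to_range_angle(heatmap):
    """Convert the strongest heatmap cell back to metres and degrees."""
    r_bin, a_bin = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    range_m = r_bin * (FS / N_SAMPLES) * C / (2.0 * SLOPE)
    sin_theta = 2.0 * np.fft.fftshift(np.fft.fftfreq(N_ANGLE_BINS))[a_bin]
    return range_m, np.rad2deg(np.arcsin(sin_theta))


heatmap = range_angle_heatmap(radar_cube(simulate_if_frame(range_m=3.0, angle_deg=20.0)))
est_range_m, est_angle_deg = peak_to_range_angle(heatmap)
```

With these made-up settings one range bin spans roughly 0.2 m, so the recovered range lands within one bin of the simulated 3 m target. A vertical-view heatmap would come from running the same steps on the vertical antenna array.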
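Dual-view backbone + detection head. Like two forensic examiners looking at scene photos from the front and from the side, each collecting clues before pooling them. The horizontal-view and vertical-view maps each go through a CNN for feature extraction and are fused at some layer (the exact fusion scheme requires reading the paper). The fused features feed a Region Proposal Network (RPN), which outputs candidate boxes, each marking "there is probably a person here". This is the classic two-stage recipe from Faster R-CNN.
Mask head. Like a tailor who first takes your rough measurements (the candidate box), then cuts to a standard pattern (ROI Align resamples boxes of different sizes to a fixed-size feature map), and finally traces the outline with a few convolutional layers. This part appears to follow Mask R-CNN closely, though the exact head design should be checked against the paper. The loss is a box-regression loss + a classification loss + a per-pixel binary cross-entropy.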
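The three-term loss can be written out as a small NumPy sketch. Smooth L1 for the box term and softmax cross-entropy for the class term are the choices used in the Mask R-CNN family; whether RFMask uses the same forms, or weights the terms differently, is an assumption here. The function names are hypothetical and the terms are summed with equal weight.

```python
import numpy as np


def smooth_l1(pred_box, target_box, beta=1.0):
    """Box-regression term: quadratic near zero, linear for large errors."""
    diff = np.abs(np.asarray(pred_box, float) - np.asarray(target_box, float))
    per_coord = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(per_coord.mean())


def softmax_cross_entropy(class_logits, label):
    """Classification term for one ROI (for example background vs. person)."""
    z = np.asarray(class_logits, float)
    z = z - z.max()  # subtract the max for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])


def mask_bce(mask_logits, mask_target):
    """Per-pixel binary cross-entropy on logits, averaged over the mask."""
    x = np.asarray(mask_logits, float)
    y = np.asarray(mask_target, float)
    # Numerically stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
    return float((np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))).mean())


def total_loss(pred_box, target_box, class_logits, label, mask_logits, mask_target):
    """Equal-weight sum of the three terms (the weighting is an assumption)."""
    return (smooth_l1(pred_box, target_box)
            + softmax_cross_entropy(class_logits, label)
            + mask_bce(mask_logits, mask_target))
```

In a real two-stage detector the mask term is computed only for ROIs matched to a ground-truth person, and the RPN adds its own objectness and box terms on top.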
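Training supervision. Like a child practicing handwriting from a copybook: the copybook is the camera, the student is the radar. During training, camera and radar record in sync; the camera stream runs a pretrained instance segmentation model (for example, Mask R-CNN trained on COCO) to automatically generate "answer key" masks, which are then aligned to the radar coordinate frame and used as RFMask's supervision signal. This avoids the cost of manually labeling radar data, since the camera provides labels for free. The camera-radar calibration details require reading the paper.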
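One small, generic piece of this pipeline is turning the teacher's soft outputs into hard pseudo-labels. The sketch below keeps only confident detections and binarizes their masks. The function name and both thresholds are hypothetical; the note does not say how RFMask filters teacher outputs. The alignment to the radar frame is deliberately left out, since the calibration details are not available here.

```python
import numpy as np


def make_pseudo_labels(soft_masks, scores, score_thresh=0.7, mask_thresh=0.5):
    """Keep confident teacher detections and binarize their soft masks.

    soft_masks: array of shape (N, H, W) with per-pixel probabilities in [0, 1]
    scores:     array of shape (N,) with the teacher's detection confidences
    Returns a uint8 array of shape (K, H, W), where K <= N.
    Both thresholds are illustrative assumptions.
    """
    soft_masks = np.asarray(soft_masks, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = scores >= score_thresh
    return (soft_masks[keep] >= mask_thresh).astype(np.uint8)
```

Dropping low-confidence detections trades label coverage for label cleanliness, which matters because the student has no other source of ground truth.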
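What the experiments do
The paper should include at least these kinds of experiments (specific numbers require reading the paper):
- Main results: mask AP (average precision) on a self-collected dataset, compared against baselines such as deriving masks from keypoints, single-view variants, and an end-to-end version without radar preprocessing
- Ablations: horizontal vs. vertical vs. fused dual view; different backbones; different radar preprocessing schemes
- Robustness: RFMask vs. RGB-based Mask R-CNN under low light, smoke, and occlusion, to show that RGB degrades severely while RF is nearly unaffected
- Multi-person scenes: how accuracy decays with 1, 2, and 3 people
- Generalization: test sets collected across rooms, subjects, and dates
The dataset is most likely self-collected by the authors (radar and camera synchronized), on the order of hours to tens of hours of human activity.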
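For readers new to the metric, mask AP at a single IoU threshold can be sketched as follows. This is a simplified, non-interpolated version for one image; benchmark-style AP (for example COCO's) interpolates the precision-recall curve and averages over several IoU thresholds, and the note does not say which protocol the paper uses. The function names are hypothetical.

```python
import numpy as np


def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same shape."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def average_precision(scores, pred_masks, gt_masks, iou_thresh=0.5):
    """Non-interpolated AP: greedy matching in descending score order."""
    if len(gt_masks) == 0:
        return 0.0
    matched = set()
    hits = []
    for i in np.argsort(scores)[::-1]:
        best_j, best_iou = -1, iou_thresh
        for j, gt in enumerate(gt_masks):
            if j in matched:
                continue  # each ground-truth person can be matched only once
            iou = mask_iou(pred_masks[i], gt)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        hits.append(best_j >= 0)
        if best_j >= 0:
            matched.add(best_j)
    hits = np.array(hits, dtype=float)
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Sum precision at each true positive, normalized by the number of people.
    return float((precision * hits).sum() / len(gt_masks))
```

A confident false detection ranked above a correct one drags AP down, which is why this metric suits the multi-person and ghost-reflection cases that radar is prone to.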
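New terms you should know
- mmWave radar (millimeter-wave radar): radar operating in the 30-300 GHz band, with millimeter-scale wavelengths. Pros: high resolution, small size, unaffected by lighting. Cons: weaker penetration than lower-frequency microwaves, and multipath reflections add noise
- Range-Doppler-angle cube: the 3D tensor obtained from raw radar data after three FFTs; its axes are radial range, radial velocity, and angle of arrival. A common input format for deep learning on radar
- Heatmap projection: aggregating the 3D cube along one axis into a 2D map. RFMask's horizontal/vertical dual views are two such aggregations
- Mask R-CNN-style two-stage pipeline: the RPN proposes boxes first, then an ROI head predicts masks. This paper ports it from the RGB domain to the RF domain
- Cross-modal supervision: training a model on modality B with labels from modality A. Here A = RGB and B = RF
- ROI Align: an operator that resamples candidate boxes of different sizes to a fixed-size feature map; unlike ROI Pooling, it avoids quantization error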
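To make the ROI Align entry concrete, here is a minimal single-channel sketch. It samples the feature map at fractional coordinates with bilinear interpolation instead of rounding to the nearest cell, which is how the quantization error of ROI Pooling is avoided. It takes one sample at the center of each output bin, whereas real implementations usually average several samples per bin; the function names and the coordinate convention (integer coordinates sit on cell centers) are assumptions.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)


def roi_align(feat, box, out_size):
    """Resample box = (x1, y1, x2, y2), in feature-map coordinates, to out_size x out_size."""
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = np.empty((out_size, out_size), dtype=float)
    for i in range(out_size):
        for j in range(out_size):
            # One sample at the bin center; no coordinate is ever rounded.
            out[i, j] = bilinear_sample(feat, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
    return out


# Usage: on a horizontal ramp feat[y, x] = x, the output reproduces the x coordinate
# of each bin center, including fractional values.
ramp = np.tile(np.arange(6, dtype=float), (6, 1))
pooled = roi_align(ramp, (1.0, 1.0, 3.0, 3.0), 2)
```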
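How it relates to other papers
- MIT RF-Pose (CVPR 2018) and RF-Pose3D (SIGCOMM 2018): milestones in the same field, doing RF → keypoints. RFMask upgrades the task from "stick figures" to "silhouettes"
- Person-in-WiFi (ICCV 2019): perceives people from WiFi CSI at coarser resolution than mmWave radar; an earlier attempt
- Mask R-CNN (ICCV 2017): the direct ancestor of this paper's method; nearly all of the structure borrows from it
- NLOS mmWave / through-wall RF imaging work: studies the physical modeling of the signal itself. RFMask takes the "engineering-pragmatic" route and leaves physical modeling to the signal preprocessing step
- In this notebook: same theme as rf-pose-through-wall.md and person-in-wifi.md; they can be read as a series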
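How I suggest reading it
- Start with the abstract, intro, and Figure 1: pin down the input and output shapes (input: the radar cube or two heatmaps; output: a mask per person). For any RF paper, first sketch "what goes in, what comes out" on scratch paper
- Jump to the method figure: find which layer the dual-view fusion happens at and how many convolutional layers follow ROI Align. Pseudocode or a network configuration table is even better if available
- Read the supervision section: how camera-radar synchronization and calibration are done, how pseudo-labels are generated, and whether there is a manual correction step. This is where pitfalls hide most easily
- Finish with the experiments: focus on the ablations, especially "dual view vs. single view" and "with/without smoke or occlusion", since these two are the paper's real selling points
Reading time budget: 3-4 hours for a close read, 1 hour for a skim.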
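Why it is worth reading
First, it is a clean baseline for RF-based dense visual perception. After reading it you will know what a standard RF vision pipeline looks like: signal preprocessing → projection to images → a vision backbone → cross-modal supervision. This recipe recurs in later work.
Second, the way of thinking transfers. "Regularize a non-visual modality into images, then feed a vision network" also works for LiDAR, sonar, and thermal imaging. RFMask is a sample that expresses this abstraction very clearly.
Third, it is useful for embodied AI research. Robots need to keep perceiving in low light, smoke, and privacy-sensitive settings, and the RF modality is a fallback that does not depend on light. Even if you don't work on RF, it is worth knowing roughly where the accuracy ceiling of that fallback lies.
Fourth, the "Simple Baseline" in the title is honest: it has no flashy new components, but it gets the engineering pipeline working end to end. Papers like this are a good starting point for newcomers to the field, and a reference for judging whether later work is real innovation or just leaderboard chasing.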
Cite this note
@online{eai_rfmask_2026,
title = {(readable note) RFMask: A Simple Baseline for Human Silhouette Segmentation with Radio Signals},
author = {Zhou, Jason},
year = {2026},
note = {Note on a 2022 paper},
howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/rfmask/}},
organization = {Embodied AI Reading Station}
}