RF Perception & Mapping · Plate Nº 96

Enabling Visual Recognition at Radio Frequency (PanoRadar)

8 min read · 2696 characters (Chinese original) · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and public material; I have not read the full paper.

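TL;DR in one sentence

PanoRadar mounts a cheap, small radar on a rotating platform so it scans while it spins, then has a neural network stitch the blurry echoes into a 3D map, letting the radar "see" a room the way an eye does.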

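What's the scenario: an everyday analogy

It's 3 a.m. and you want a glass of water from the kitchen, lights off. How do you know where the chairs and the walls are? Probably by feeling around with your hands, or by relying on the "map of home" in your head. A robot in thick smoke, heavy fog, or a dark room is in the same spot: its camera is blind, so what now?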

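Three ways to "see" the environment:

  • Camera (vision) = looking with your eyes open, but it needs light; smoke or darkness blinds it outright
  • LiDAR = sweeping a laser pointer cell by cell; precise ranging, but expensive and degraded by fog
  • Radar = clapping and listening for the echo; it passes through smoke and fog and doesn't care about darkness, but what comes back is a blur of sound that can't tell a wall from a chair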

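What PanoRadar does is roughly this: you clap while turning around, listen in every direction, then let your brain (a neural network) stitch the "blurry echoes" from all directions into a 3D map. The key point is that it is cheap: not a high-end radar with hundreds of antennas, but a single-chip radar costing a few tens of dollars plus a small motor that spins it.

Mapped to real scenarios: a firefighter rushing into thick smoke, a self-driving car entering heavy fog, a home robot moving around a dark room. These places where LiDAR and cameras struggle are exactly the gap PanoRadar aims to fill.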

Plate Nº I · Enabling Visual Recognition at Radio Frequency (PanoRadar): scene illustration (the real-world problem this paper addresses)

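How people did it before

  • Large mmWave array radars: many antennas → high angular resolution, but expensive, bulky, and power-hungry, which limits deployment
  • Single-chip radar used as-is: cheap, but poor angular resolution ("hearing one blob of echo"), so it only supports coarse-grained human detection, gesture recognition, and occupancy sensing; it is still orders of magnitude away from 3D imaging
  • Earlier RF + DL work (e.g. RF-Pose, Person-in-WiFi, RF-SLAM): can recover human pose, localization, or SLAM occupancy maps from RF signals, but none of it is a "general-purpose, vision-grade 3D representation"
  • Classical super-resolution / synthetic aperture: the synthetic aperture radar (SAR) idea has been around for a long time, but high-resolution 3D reconstruction indoors, in dynamic scenes, on consumer-grade hardware never really landed
  • Multimodal radar-camera fusion: using the camera as a teacher to supervise the radar is a common approach, but it still depends on the camera being reliable while the training data is collected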

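The paper's key idea

An analogy: a single-chip radar is like a person with only one ear. They can hear a sound but can't tell whether it came from the left or the right. How do you get them to "tell direction"?

From first principles, the root cause of a single-chip radar "not seeing clearly" is insufficient angular resolution: with few antennas, echoes from different directions can't be separated. There are two ways out:

  1. Add hardware (more ears / more antennas) → expensive
  2. Add time (let one ear rotate and listen in different directions, then stitch the results afterwards, which is equivalent to "many ears") → cheap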

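PanoRadar takes the second route: mechanical rotation plus the synthetic aperture idea. While it spins, the single-chip radar becomes equivalent to a large array, which yields high angular resolution in the horizontal direction.

Hold on, one step slower: what is a "synthetic aperture"? It means that signals captured by the same antenna at different positions (at different times) can be treated in post-processing as if many antennas had captured them simultaneously, which amounts to "assembling" a large array. SAR satellites scanning the Earth rely on the same principle.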
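
To make the "one rotating ear equals many ears" idea concrete, here is a minimal 2D sketch, not the paper's actual processing. It simulates one ideal antenna sampled along a rotation arc, one point target, and a matched filter over candidate azimuths. The 77 GHz carrier, 10 cm arm radius, 120° arc, and all function names are assumptions made for illustration only.

```python
import numpy as np

C = 299_792_458.0            # speed of light, m/s
WAVELENGTH = C / 77e9        # assumed 77 GHz carrier (hypothetical choice)
ARM_RADIUS = 0.10            # assumed rotation radius in metres (hypothetical)

def antenna_positions(arc_deg, n):
    """2D positions of one antenna sampled along a rotation arc."""
    phi = np.deg2rad(np.linspace(-arc_deg / 2, arc_deg / 2, n))
    return ARM_RADIUS * np.stack([np.cos(phi), np.sin(phi)], axis=1)

def echoes(positions, target_az_deg, target_range):
    """Ideal round-trip phase of a single point target seen from each position."""
    az = np.deg2rad(target_az_deg)
    target = target_range * np.array([np.cos(az), np.sin(az)])
    dist = np.linalg.norm(positions - target, axis=1)
    return np.exp(-1j * 4 * np.pi * dist / WAVELENGTH)

def azimuth_profile(positions, signal, target_range, candidates_deg):
    """Matched filter: undo the phase each candidate azimuth would cause, then sum."""
    az = np.deg2rad(candidates_deg)
    cand = target_range * np.stack([np.cos(az), np.sin(az)], axis=1)        # (A, 2)
    dist = np.linalg.norm(cand[:, None, :] - positions[None, :, :], axis=2)  # (A, N)
    steering = np.exp(1j * 4 * np.pi * dist / WAVELENGTH)
    return np.abs((signal[None, :] * steering).sum(axis=1))

candidates = np.linspace(-90, 90, 1801)

# Many positions along the arc: the summed response peaks at the true azimuth.
pos = antenna_positions(arc_deg=120, n=400)
profile = azimuth_profile(pos, echoes(pos, 10.0, 3.0), 3.0, candidates)
estimated_az = candidates[np.argmax(profile)]

# A single position has nothing to combine: the profile is flat in azimuth.
one = antenna_positions(arc_deg=0, n=1)
flat = azimuth_profile(one, echoes(one, 10.0, 3.0), 3.0, candidates)
```

The single-position profile carries no direction information at all, while the 400-position profile has a sharp peak. A real system also has to deal with the antenna's limited field of view, noise, and motion error, all of which this sketch ignores.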

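But one rotation alone isn't enough: rotation introduces motion artifacts, the signal itself carries noise and multipath reflections, and the 3D shape still has to be inferred as a dense surface from sparse points.

So the other half of the key idea is to let a neural network absorb the imperfections that signal processing leaves behind. Concretely, the intermediate products of classical mmWave signal processing (chirp demodulation, FFT, MIMO processing) are fed into a neural network that learns to remove artifacts and densify the output, and that then feeds vision task heads (normals, segmentation, detection).

In one sentence: mechanical synthetic aperture solves "resolution", and deep learning bridges the gap from "signal to semantics".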

Plate Nº II · Enabling Visual Recognition at Radio Frequency (PanoRadar): method illustration (the core pipeline)

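How it works (method)

Hardware + data collection (like fitting an ear that can turn its head). A commodity single-chip mmWave radar is mounted on a rotating platform that spins through 360° at constant speed. A LiDAR rides along as ground truth (the teacher during training, not used at deployment). A paired radar-LiDAR dataset is collected across multiple indoor environments, covering different rooms, furniture, and obstacle layouts. The exact scale and configuration require reading the paper.

Signal processing front end (like a translator that turns raw waves into a "range-direction" picture). It starts with the classical mmWave pipeline: transmit a chirp (a signal whose frequency rises linearly with time) → receive the echo → dechirp it against the transmitted signal (mixing yields the beat frequency) to get the intermediate-frequency (IF) signal → FFT along fast time (the samples within one chirp) to get the range dimension → process across antennas / scan angles to get the angle dimension. A frame is captured at every angle during the rotation, and stitching the range-angle maps from all angles yields a panoramic range-azimuth volume (think of it as a 3D data cube). This stage also handles motion compensation and phase alignment for the rotation.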
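
The chirp → dechirp → range FFT steps above can be sketched as follows. This is a minimal single-target simulation under assumed parameters (4 GHz bandwidth, 40 µs chirp, 512 complex samples), not the paper's configuration. It shows only the standard FMCW relation between beat frequency and range.

```python
import numpy as np

C = 299_792_458.0      # speed of light, m/s
BANDWIDTH = 4e9        # assumed chirp bandwidth in Hz (hypothetical)
CHIRP_TIME = 40e-6     # assumed chirp duration in s (hypothetical)
N_SAMPLES = 512        # assumed complex ADC samples per chirp (hypothetical)

def beat_signal(target_range):
    """Dechirped (IF) signal of one ideal point target: a tone at f_b = 2*R*slope/c."""
    slope = BANDWIDTH / CHIRP_TIME
    t = np.arange(N_SAMPLES) * (CHIRP_TIME / N_SAMPLES)
    f_beat = 2 * target_range * slope / C
    return np.exp(2j * np.pi * f_beat * t)

def range_profile(iq):
    """Windowed FFT along fast time; bin k corresponds to range k * c / (2 * B)."""
    spectrum = np.fft.fft(iq * np.hanning(len(iq)))
    ranges = np.arange(len(iq)) * C / (2 * BANDWIDTH)
    return ranges, np.abs(spectrum)

ranges, magnitude = range_profile(beat_signal(2.5))
estimated_range = ranges[np.argmax(magnitude)]   # within one range bin of 2.5 m
```

Repeating this for every rotation angle and stacking the resulting profiles is what gives a range-azimuth map. The angle processing itself is a separate step not shown here.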

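Learning pipeline (like a student copying homework from the LiDAR teacher). The "3D radar volume" above is fed into a neural network. The paper uses several network heads stacked on top:

  • 3D reconstruction head: predicts occupancy/range along each line of sight, effectively a radar depth map; supervised by the paired LiDAR point cloud
  • Surface normal head: predicts the orientation at each surface point, which makes the geometry of walls, floors, and furniture more stable
  • Semantic segmentation head: classifies each 3D point (floor, wall, furniture, person, ...)
  • Object detection head: outputs 3D bounding boxes

During training, LiDAR and camera provide the supervision; at inference, only the radar is used.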
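
One way to picture "LiDAR as the teacher" is a depth loss evaluated only where the LiDAR actually returned a range. The sketch below is an assumption for illustration: this note does not state the paper's real loss, and the masked L1 form, the function name, and the convention of encoding missing returns as 0 or NaN are all hypothetical.

```python
import numpy as np

def masked_l1_depth_loss(pred_depth, lidar_depth):
    """Mean absolute depth error over pixels where LiDAR returned a valid range.

    Both inputs are (elevation, azimuth) panoramic depth maps in metres.
    LiDAR pixels with no return are encoded as 0 or NaN (assumed convention).
    """
    valid = np.isfinite(lidar_depth) & (lidar_depth > 0)
    if not valid.any():
        return 0.0
    return float(np.abs(pred_depth[valid] - lidar_depth[valid]).mean())

# Toy 2x3 panorama: two LiDAR pixels are missing and must not affect the loss.
pred = np.array([[2.0, 3.5, 1.0], [4.0, 2.0, 6.0]])
lidar = np.array([[2.5, 0.0, 1.0], [np.nan, 2.0, 5.0]])
loss = masked_l1_depth_loss(pred, lidar)
```

Masking matters because LiDAR itself has gaps (glass, absorbing surfaces). Without the mask, the network would be penalized for predicting anything at pixels where the teacher has no answer.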

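Why it works (intuition). Classical signal processing has already squeezed out whatever resolution physics allows; the remaining blur and gaps come from the hardware's physical limits. But priors (rooms follow geometric rules, walls are flat, furniture has typical shapes) can fill in part of the rest, and neural networks are good at learning such priors from large amounts of data. So PanoRadar isn't "creating resolution"; it is "using priors to fill in what the hardware cannot deliver".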

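What the experiments do

  • Geometric accuracy: radar-reconstructed 3D point cloud vs. LiDAR ground truth, comparing depth error and surface normal error. The actual numbers require reading the paper
  • Semantic tasks: segmentation and detection on the collected dataset, reported with metrics such as mIoU and AP
  • Generalization: tested on unseen rooms and unseen furniture layouts, to check for overfitting to specific scenes
  • Extreme conditions: smoke, darkness, glass / reflective surfaces (pain points for LiDAR and cameras, where radar should have the edge)
  • Ablations: remove rotation (falling back to a static single chip), remove a network head, or swap the signal processing pipeline, to see how much each step contributes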
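
For readers unfamiliar with the metrics named in the list above, here is a small sketch of two of them, mean angular error for surface normals and mIoU for segmentation. These are the standard textbook definitions, and the paper's exact evaluation protocol may differ. The toy inputs are made up for illustration.

```python
import numpy as np

def mean_angular_error_deg(pred_normals, gt_normals):
    """Mean angle in degrees between predicted and reference normals, shape (N, 3)."""
    p = pred_normals / np.linalg.norm(pred_normals, axis=1, keepdims=True)
    g = gt_normals / np.linalg.norm(gt_normals, axis=1, keepdims=True)
    cos = np.clip((p * g).sum(axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union, skipping classes absent from both label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred_labels == c) & (gt_labels == c))
        union = np.sum((pred_labels == c) | (gt_labels == c))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy data: one normal matches exactly, the other is off by 90 degrees.
normal_err = mean_angular_error_deg(
    np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]),
    np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]),
)
# Toy labels: class 0 has IoU 1/2, class 1 has IoU 2/3.
miou = mean_iou(np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1]), num_classes=2)
```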

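A few new terms you should know

  • mmWave radar (millimeter-wave radar): radar operating in bands such as 24 GHz, 60 GHz, and 77–81 GHz (strictly, millimeter wave means 30–300 GHz; 24 GHz is commonly grouped with it). Short wavelength → higher resolution than lower-frequency radar for the same antenna size; common in automotive ACC and gesture recognition
  • chirp / FMCW (frequency-modulated continuous wave): transmit a signal whose frequency rises linearly with time; mixing the echo with the transmitted signal gives a beat frequency proportional to the target's range. It is the dominant scheme in consumer-grade mmWave radar
  • synthetic aperture: move one antenna through space, then stitch the signals captured at different positions, which is equivalent to one "large antenna array". This is the physical principle behind PanoRadar's rotation
  • angular resolution: whether two targets at nearly the same angle can be told apart. The larger the aperture (more antennas spread over a wider span), the finer the angular resolution
  • range-azimuth heatmap: a common intermediate representation in mmWave signal processing, with azimuth on the X axis, range on the Y axis, and brightness = echo strength; it is the radar's version of a "bird's-eye view"
  • surface normal: at each 3D surface point, the direction vector "perpendicular to the surface"; fundamental to geometric understanding, SLAM, and novel view synthesis
  • multipath: the signal bounces off walls/furniture several times before reaching the receiver, creating ghost targets in the radar image; a common noise source in indoor RF imaging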
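
To put rough numbers on two of these terms, the sketch below evaluates the standard rules of thumb: FMCW range resolution c / (2B) and aperture-limited angular resolution of about λ / D. The 4 GHz bandwidth, 77 GHz carrier, and the two aperture sizes are assumed values chosen for illustration, not figures from the paper.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def range_resolution(bandwidth_hz):
    """FMCW range resolution in metres: c / (2 * B)."""
    return C / (2 * bandwidth_hz)

def beamwidth_deg(carrier_hz, aperture_m):
    """Rule-of-thumb angular resolution of an aperture: lambda / D, in degrees."""
    return math.degrees((C / carrier_hz) / aperture_m)

res_m = range_resolution(4e9)        # assumed 4 GHz chirp bandwidth
small = beamwidth_deg(77e9, 0.015)   # assumed 1.5 cm on-chip antenna span
large = beamwidth_deg(77e9, 0.20)    # assumed 20 cm synthesized aperture
```

Under these assumptions the range resolution is a few centimetres, and widening the aperture from 1.5 cm to 20 cm narrows the beam by the same factor of about 13. That is the motivation for synthesizing a larger aperture by rotation.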

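How it relates to other papers

  • RF-Pose / Person-in-WiFi / NLoS mmWave: all on the "vision-grade perception from RF signals" line. The former estimate human pose from RF; PanoRadar pushes the line to general 3D scene understanding, a more ambitious member of the same family
  • MilliMap / RF-SLAM: use mmWave for SLAM / occupancy maps, focusing on localization + mapping; PanoRadar targets finer geometry and semantics (normals, segmentation) and can be seen as "RF vision" reinforcing RF-SLAM
  • The acoustic line (NeuralAids / Acoustic-Swarms / Conv-TasNet): structurally the same idea, using DL to extract semantic/geometric information from a "non-vision sensor". The differences are the medium (sound vs. electromagnetic waves) and the frequency band
  • Multimodal alignment work (ImageBind, CLIP, TouchVision): in the long run, "RF → visual representation" work like PanoRadar adds an RF modality to the joint multimodal space and may eventually be absorbed into a unified representation like ImageBind's
  • Embodied AI view: placed alongside NeuralAids, Acoustic-Swarms, and Proactive-Hearing, PanoRadar is the puzzle piece for "robots that can still perceive when vision is limited"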

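How I suggest reading it

  1. Watch the demo video and look at Figure 1 first: get a direct feel for "can a radar point cloud really look like LiDAR's" and establish the goal
  2. Read the signal processing part of Method: understand how mechanical rotation becomes a synthetic aperture and how the range-azimuth volume is built. This is the physical foundation; without it, the network part will look like magic
  3. Read the network architecture and supervision: note how it uses LiDAR/camera for ground truth and how they are dropped at deployment
  4. Look at failure cases and extreme conditions in the experiments: glass, metal, severe multipath, sparse targets. This is what decides "can it work in my scenario"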

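Why it's worth reading

  • Crossing sensor paradigms: it pushes RF (a sensor class widely assumed to be "no good for vision") to vision-grade output, a representative case of "redefining what a sensor can do with DL"
  • Affordable hardware: the core hardware is a single-chip radar costing tens of dollars plus a simple rotation mechanism, not a high-end LiDAR costing orders of magnitude more, so it is reproducible from an engineering standpoint
  • Sensor diversity for embodied AI: in smoke, darkness, or privacy-sensitive settings (where cameras are unwelcome), RF vision is a real need; robotics, smart home, firefighting, and security can all benefit
  • Transferable methodology: signal processing + learning pipeline + geometric priors; the same recipe can be reused for other "make a non-vision sensor see" problems in acoustics (Acoustic-Swarms), touch, and ultrasound
  • A MobiCom 2024 paper that drew best-paper-level attention; a landmark for the most recent big leap along the RF + AI line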

Cite this note
BibTeX
@online{eai_panoradar_2026,
  title       = {(readable note) Enabling Visual Recognition at Radio Frequency (PanoRadar)},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2024 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/panoradar/}},
  organization = {Embodied AI Reading Station}
}
