RF Perception & Mapping · Plate Nº 97

Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion

7 min read · 2485 characters · ⭐⭐⭐⭐ · short summary

This note is based on the abstract and publicly available material; I have not read the full paper.

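In one sentence (TL;DR)

Millimeter-wave signals can pass through cardboard boxes and curtains; Wave-Former takes the blurry echoes that bounce back and assembles them into complete 3D shapes of the cups and bottles hidden behind.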

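What is the setting: an everyday analogy

You are moving house, crouching in a corner in front of a pile of sealed cardboard boxes, trying to find the one with the mugs, but opening every box to check is too much hassle. What you want is a pair of eyes that can "see through the cardboard".

Similar situations are everywhere:

  • A warehouse robot has to pick a particular part out of stacked boxes
  • A home robot searches cabinets for the remote control while the cabinet doors are closed
  • In search and rescue, you need to know whether anyone, or anything, is under the rubble

There are three candidate tools for "seeing through":

  • Eyes (an RGB camera): sees nothing, because cardboard is opaque
  • X-rays: they work, but the equipment is expensive, involves radiation, and is not something you can keep at home
  • mmWave radar: the signal penetrates cardboard, fabric, and thin wood panels, and the returning echo tells you "there seems to be something cylindrical in there"

Wave-Former takes the third route and goes one step further: it turns the radar echo (a set of sparse, noisy points that only cover the side of the object facing the radar) into a complete 3D mesh (a shell model of the object built from many triangular faces) that can be handed straight to a robot arm for grasping.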

Plate Nº I: Wave-Former, scene illustration (the real-world problem this paper addresses)

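How earlier work approached it

mmWave sensing is not new, but the combination of "penetration + complete 3D shape reconstruction" is:

  • RF human pose / activity recognition (e.g. RF-Pose, Person-in-WiFi): sees people through walls, but only outputs skeleton keypoints, not object shape. These systems use lower, WiFi-range frequencies rather than mmWave proper
  • mmWave SLAM / mapping (e.g. milliMap, RF-SLAM): builds sparse room-scale maps; the resolution is not enough to recover cup-scale geometry
  • NLOS (non-line-of-sight) imaging (e.g. nlos-mmwave): can see objects around a corner, but typically outputs only 2D silhouettes or low-resolution voxels
  • Vision point-cloud shape completion (e.g. PCN, 3DShape2VecSet): mature, but it assumes the input point cloud comes from LiDAR or a depth camera, which return no points at all on an occluded object
  • Feeding a mmWave point cloud directly into a vision completion network: fails, because the sparsity, noise distribution, and distortion at occlusion boundaries of mmWave point clouds are nothing like LiDAR's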

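The paper's key idea

Two insights combined:

  1. mmWave echoes are messy, but the mess follows physical laws: when the signal passes through an occluder it is refracted, attenuated, and reflected along multiple paths, all of which an electromagnetic propagation model can describe. Learning these distortions from scratch would take a huge amount of data; if the physical model is injected as a prior, the network only has to learn the "residual", i.e. whatever the physical model leaves unexplained
  2. Transformer shape completion is already strong in vision: borrow its inductive biases (attention + a large receptive field), swap the input for a "physics-corrected mmWave point cloud", and keep the output as a complete 3D shape

Put together: the physical prior cleans the signal → the Transformer imagines the geometry. This hybrid "physics + learning" design has become a common route in RF sensing in recent years; Wave-Former pushes it to the granularity of complete object mesh reconstruction.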
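The "physics first, learn the residual" split can be illustrated with a toy 1D range example. This is only a sketch under loud assumptions: the function names, the slab thickness and permittivity, the extra range-dependent bias, and the least-squares line standing in for a neural network are all hypothetical and are not taken from the paper.

```python
import math
import random

def physics_correct(measured_range_m, slab_thickness_m, eps_r):
    """Remove the extra apparent range added by a dielectric slab (assumed model)."""
    extra = slab_thickness_m * (math.sqrt(eps_r) - 1.0)
    return measured_range_m - extra

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; a stand-in for the learned residual."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

rng = random.Random(0)
true_ranges = [0.5 + 0.1 * i for i in range(20)]

# Toy sensor: slab delay, plus a range-dependent bias the physics model
# does not know about, plus a little noise.
slab_delay = 0.01 * (math.sqrt(3.0) - 1.0)
measured = [r + slab_delay + 0.02 * r + rng.gauss(0.0, 0.001) for r in true_ranges]

# Stage 1: the physical prior removes the part it can explain.
corrected = [physics_correct(m, 0.01, 3.0) for m in measured]

# Stage 2: a small model is fitted only to what is left over.
residuals = [t - c for t, c in zip(true_ranges, corrected)]
a, b = fit_line(corrected, residuals)
final = [c + a * c + b for c in corrected]

err_raw = max(abs(m - t) for m, t in zip(measured, true_ranges))
err_phys = max(abs(c - t) for c, t in zip(corrected, true_ranges))
err_final = max(abs(f - t) for f, t in zip(final, true_ranges))
print(err_raw, err_phys, err_final)
```

The point of the split is that the second stage only has to fit a small, smooth leftover instead of the whole distortion.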

Plate Nº II: Wave-Former, method illustration (core pipeline)

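How it works (method)

Step 1: raw signal → physics-corrected point cloud. Like taking a photo through frosted glass: the picture is blurred, but if you know how the glass blurs it, you can work back to the sharp image. A mmWave radar transmits chirps (frequency-swept signals); when the echo passes through cardboard, fabric, or wood, the difference in dielectric constant between the material and air shifts its phase and lengthens its effective path. Wave-Former models these distortions explicitly (the exact formulation requires reading the paper) and back-computes the occluder-contaminated echo into an equivalent point cloud: "what it would look like with no occluder". This step is what the "Wireless" in the title is about: RF is not treated as a black-box input.

Hold on, a quick aside: what is the dielectric constant? Loosely, it measures how much a material slows an electromagnetic wave, which travels at c/√εr inside it. Air is about 1 (no slowdown) and cardboard is roughly 2-4; metal does not fit this picture because it reflects almost everything. A mmWave signal is slowed while crossing the cardboard, so the path looks longer and the object appears pushed back by about d(√εr − 1) for a wall of thickness d, which is a few millimeters for a single layer of cardboard. Wave-Former removes this kind of systematic bias up front.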
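To get a feel for the size of this bias, here is a small worked example. It assumes a lossless slab at normal incidence and ignores reflection and multipath; the function name and the 5 mm wall thickness are illustrative assumptions, not values from the paper.

```python
import math

def apparent_range_offset(thickness_m, eps_r):
    """Extra apparent one-way range caused by a dielectric slab.

    Inside the slab the wave travels at c / sqrt(eps_r), so a slab of
    thickness d behaves like d * sqrt(eps_r) of air. The target therefore
    appears d * (sqrt(eps_r) - 1) farther away than it really is.
    """
    return thickness_m * (math.sqrt(eps_r) - 1.0)

# Hypothetical 5 mm cardboard wall across the permittivity range quoted above.
for eps_r in (2.0, 3.0, 4.0):
    offset_mm = 1000.0 * apparent_range_offset(0.005, eps_r)
    print(f"eps_r = {eps_r}: apparent shift of {offset_mm:.2f} mm")
```

The offset scales linearly with thickness, so stacked boxes or thick wood panels add up to a noticeably larger bias than a single cardboard wall.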

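Step 2: encoding the sparse point cloud. Like a photo that only shows a face from the front: what the back looks like has to be guessed. The corrected point cloud is still sparse (mmWave resolution is roughly an order of magnitude below LiDAR's), and is presumably fed to a PointNet-style encoder or split directly into patches for the Transformer. The difference from vision point-cloud completion is that a mmWave point cloud only covers the "near surface" facing the radar; the back side and concave structures are completely dark.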

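Step 3: Transformer shape completion. Like an archaeologist imagining the whole jar from half a pottery shard. The decoder follows work such as PoinTr / 3DShape2VecSet, using cross-attention so that query tokens "interrogate" different regions of the input point cloud and progressively generate the complete shape. The output may be a dense point cloud, an occupancy field, or an SDF (signed distance field); which one requires reading the paper, though the "Shape Completion" in the title suggests a high-resolution geometric representation.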
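The "query tokens interrogate the input" idea is plain cross-attention. Below is a minimal single-head sketch in pure Python. It is not the paper's decoder: the learned projection matrices are omitted, the observed points are used directly as keys and values, and the two query vectors are hand-picked for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention.

    Each query scores every input token, turns the scores into weights,
    and returns the weighted average of the values.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Three observed points on the radar-facing side of an object.
observed = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [-0.1, 0.0, 1.0]]
# Two hypothetical query tokens; in a real model these would be learned.
queries = [[4.0, 0.0, 0.0], [-4.0, 0.0, 0.0]]

outputs, weights = cross_attention(queries, observed, observed)
for q, w in zip(queries, weights):
    print(q, [round(x, 3) for x in w])
```

Each query ends up weighting the observed point that lies furthest along its own direction most heavily, which is the mechanism a completion decoder relies on to pull evidence from the relevant part of a partial input.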

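Step 4: synthesizing training data. Like a driving school using a simulator instead of a real car on the road: the data is cheap and plentiful. Real paired "through-occlusion" data is very hard to collect at scale (every object needs an RF scan plus a ground-truth mesh), so the paper most likely generates large-scale synthetic RF data with electromagnetic simulation (e.g. FDTD or ray tracing) and then fine-tunes on a small amount of real data. This is a standard recipe for RF learning work.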
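A rough picture of what one synthetic training pair could look like is sketched below. It is a purely geometric stand-in, not an electromagnetic simulation such as FDTD or ray tracing: it samples a cylinder, keeps only radar-facing points, drops most of them, and adds jitter. All names, sizes, and probabilities are hypothetical.

```python
import math
import random

def sample_cylinder(n, radius=0.04, height=0.10, rng=random):
    """Sample n (point, outward normal) pairs on the side wall of an upright cylinder."""
    surface = []
    for _ in range(n):
        theta = rng.uniform(0.0, 2.0 * math.pi)
        z = rng.uniform(-height / 2.0, height / 2.0)
        normal = (math.cos(theta), math.sin(theta), 0.0)
        point = (radius * normal[0], radius * normal[1], z)
        surface.append((point, normal))
    return surface

def simulate_partial_view(surface, radar_pos, keep_prob=0.2, noise_std=0.003, rng=random):
    """Keep only points whose normal faces the radar, subsample them, and add jitter."""
    partial = []
    for point, normal in surface:
        to_radar = tuple(r - p for r, p in zip(radar_pos, point))
        facing = sum(a * b for a, b in zip(normal, to_radar)) > 0.0
        if facing and rng.random() < keep_prob:
            partial.append(tuple(c + rng.gauss(0.0, noise_std) for c in point))
    return partial

rng = random.Random(0)
full = sample_cylinder(2000, rng=rng)

# Network input: sparse, noisy, front side only.
partial = simulate_partial_view(full, radar_pos=(1.0, 0.0, 0.0), rng=rng)
# Training target: the complete surface.
complete = [point for point, _ in full]

print(len(partial), "input points vs", len(complete), "target points")
```

A real pipeline would replace `simulate_partial_view` with a simulator that also models the occluder, specular reflection, and the radar's antenna pattern; the pairing of a degraded input with a complete target is the part that carries over.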

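What the experiments cover

Likely experimental axes, inferred from the abstract and title (actual numbers require reading the paper):

  • Object types: a set of everyday objects, probably covering common grasping categories such as cups, bottles, bowls, and boxes
  • Occluder materials: cardboard and curtains at minimum, possibly wood and plastic panels, comparing reconstruction quality across dielectric constants
  • Metrics: standard 3D reconstruction metrics such as Chamfer Distance, F-Score, and IoU; possibly downstream metrics too, e.g. "what is the grasp success rate when the reconstructed mesh is given to a grasp planner"
  • Ablations: with vs. without the physical prior; a pure vision completion network on mmWave input; different Transformer capacities
  • Generalization: object categories seen in training vs. unseen; occluder materials seen in training vs. unseen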
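For readers who have not met these metrics, here is a minimal brute-force sketch of Chamfer Distance and F-Score on tiny point clouds. Conventions differ between papers (squared vs. unsquared distances, sum vs. mean, the threshold value); this version uses mean unsquared distances and an arbitrary threshold, and it is not necessarily the variant the paper reports.

```python
import math

def nearest_dist(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    a_to_b = sum(nearest_dist(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a

def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    precision = sum(nearest_dist(p, gt) <= tau for p in pred) / len(pred)
    recall = sum(nearest_dist(q, pred) <= tau for q in gt) / len(gt)
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# Ground truth: four corners of a unit square.
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
# Prediction: one corner is noisy and one corner is missing entirely.
pred = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

cd = chamfer_distance(pred, gt)
fs = f_score(pred, gt, tau=0.05)
print(cd, fs)
```

The missing corner hurts the ground-truth-to-prediction term and the recall far more than the noisy point does, which is why completion papers usually report both directions rather than a one-sided distance.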

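The thing to watch is the "degradation curve across materials": if cardboard works well but wood breaks it, the physical prior's coverage is limited.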

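New terms you should know

  • mmWave (millimeter wave): electromagnetic waves in the 30-300 GHz band, with millimeter-scale wavelengths. They penetrate many non-metallic materials (paper, fabric, thin wood, drywall), with finer resolution than WiFi and coarser than LiDAR. Commercial radar chips (TI IWR series) are cheap and easy to get
  • Shape Completion: given an incomplete 3D input (a partial point cloud, a single-view depth map), predict the complete 3D shape. Representative vision work: PCN, PoinTr
  • Physical Prior: writing known physical laws (here, the equations of electromagnetic propagation) explicitly into the model structure or loss function so the network does not have to learn them from scratch. The opposite of "purely data-driven"
  • Dielectric Constant: a physical quantity describing how much a material "slows down" electromagnetic waves. Air ≈ 1, cardboard ≈ 2-4; metal is usually treated as a near-perfect reflector rather than given a finite value. Together with the material's loss, it determines whether mmWave gets through and how much
  • NLOS (Non-Line-of-Sight): the object is not directly visible to the sensor. Wave-Former is a special case of NLOS sensing (occluded from the front, but still in the forward direction)
  • chirp: the transmit waveform of a mmWave FMCW radar, whose frequency changes linearly with time. After the echo is mixed with the transmitted signal, the resulting frequency difference (the beat frequency) is proportional to range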
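The chirp entry can be made concrete with the standard FMCW relations for a static target: range R = c·f_b / (2S), where f_b is the beat frequency and S the chirp slope, and range resolution ΔR = c / (2B) for sweep bandwidth B. The chirp parameters below are hypothetical round numbers, not the configuration used in the paper.

```python
C = 299_792_458.0  # speed of light in m/s

def chirp_slope(bandwidth_hz, chirp_duration_s):
    """Slope S of a linear chirp in Hz per second."""
    return bandwidth_hz / chirp_duration_s

def range_from_beat(beat_hz, slope_hz_per_s):
    """R = c * f_b / (2 * S) for a linear chirp and a static target."""
    return C * beat_hz / (2.0 * slope_hz_per_s)

def range_resolution(bandwidth_hz):
    """Smallest range separation two targets need to be told apart: c / (2 * B)."""
    return C / (2.0 * bandwidth_hz)

# Hypothetical chirp: a 4 GHz sweep lasting 40 microseconds.
slope = chirp_slope(4e9, 40e-6)

# A 1 MHz beat tone then corresponds to a target about 1.5 m away.
r = range_from_beat(1e6, slope)

# With 4 GHz of bandwidth, range bins are a few centimeters wide.
res = range_resolution(4e9)

print(f"range = {r:.3f} m, resolution = {100 * res:.2f} cm")
```

Centimeter-scale range bins are one reason the raw mmWave point cloud of a cup-sized object is so coarse before any completion is applied.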

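How it relates to other papers

Looking back:

  • mmWave perception lineage: rf-pose-through-wall (through-wall skeletons) → millimap (mmWave mapping) → nlos-mmwave (non-line-of-sight) → Wave-Former (complete object reconstruction through occlusion). The granularity refines from "human keypoints" to "object-level mesh"
  • 3D shape completion lineage: vision point-cloud completion such as 3dshape2vecset is the direct technical ancestor; Wave-Former swaps the input modality for RF
  • Physics + learning hybrids: same line of thinking as acoustic-swarms (acoustic prior + learning) and neuralaids (hearing-aid physics + neural networks)

Looking ahead:

  • For grasping/manipulation policies to work in through-occlusion scenes, this kind of perception is required; it could sit upstream of manipulation work such as diffusion-policy and rt-1
  • Multimodal fusion: mmWave + RGB + touch (touch-vision-cross-modal) as a complete perception stack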

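How I suggest reading it

  1. Skim the abstract + Figure 1 in the intro first: pin down its input (what kind of radar, what kind of occlusion), its output (point cloud? mesh? SDF?), and the figure contrasting it with existing work
  2. Jump to the physical modeling part of the method: this is the key difference from pure vision completion work; work out which physical quantities it treats as priors and how they are injected into the network (loss? feature? input preprocessing?)
  3. Look at the failure cases / degradation curves in the experiments: see where it breaks across occluder materials and object categories, which tells you the method's real limits
  4. Optional: read alongside nlos-mmwave and 3dshape2vecset: one is the nearest neighbor on the RF side, the other the technical ancestor on the shape-completion side; together they show how Wave-Former's two bloodlines meet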

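Why it is worth reading

  • A rare direction: few works do object-level 3D reconstruction through occlusion, and it is a key puzzle piece for robots that have to "rummage through messy real-world cabinets"
  • A good architectural paradigm: the physical prior + Transformer learning recipe carries over to many sensor modalities (acoustic, RF, IMU), so reading one paper helps you understand a whole class
  • Not far from deployment: commercial mmWave chips are cheap and the hardware barrier is low; if the reconstruction quality is really good enough, warehouse and home robots benefit directly
  • Its place in the embodied AI puzzle: along the perception → decision → action chain, "things you cannot see" have long been a blind spot. Work like Wave-Former pushes that blind spot back a notch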

Cite this note
BibTeX
@online{eai_wave_former_2026,
  title       = {(readable note) Wave-Former: Through-Occlusion 3D Reconstruction via Wireless Shape Completion},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2025 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/wave-former/}},
  organization = {Embodied AI Reading Station}
}
