RF Perception & Mapping · Plate Nº 88

milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion

6 min read · 2179 characters · ⭐⭐⭐ · Short summary

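This note is based on the abstract and publicly available material; I have not read the full paper.

In one sentence (TL;DR)

Fuse a cheap millimeter-wave radar with an onboard "motion sensor" (an IMU) through a neural network, so that a machine can still work out where it has moved to in darkness or smoke.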

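What is the scenario

The power goes out at night. You walk back to the bedroom holding your phone and want to know how many steps you took and whether you turned.

Robots normally answer this kind of question with a handful of "senses," and each one has a fatal weakness:

  • Camera = looking with open eyes: blind as soon as the lights go out
  • LiDAR = feeling around in the dark with a flashlight: fails on glass and in smoke, and it is expensive
  • mmWave radar = shouting like a bat and listening for the echo: unaffected by smoke, darkness, or rain, but the returned "echo image" is blurry and sparse, like looking through fog
  • IMU (inertial measurement unit, the small chip in a phone that senses wrist rotation and walking motion) = inner ear: senses acceleration and head turns immediately, but gets "dizzy" over time, so the estimate drifts further the longer it runs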
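
To make the IMU drift point concrete, here is a minimal sketch, assuming a stationary device whose accelerometer reads a small constant bias. The 0.05 m/s² bias and the 100 Hz rate are illustrative values, not figures from the paper. Integrating twice turns the bias into a position error that grows with the square of elapsed time.

```python
import numpy as np

def position_drift(bias_mps2: float, rate_hz: float, seconds: float) -> np.ndarray:
    """Dead-reckon a stationary device whose accelerometer reads a constant bias."""
    dt = 1.0 / rate_hz
    n = int(round(seconds * rate_hz))
    accel = np.full(n, bias_mps2)          # true acceleration is zero; only the bias remains
    velocity = np.cumsum(accel) * dt       # first integration: velocity error grows linearly
    position = np.cumsum(velocity) * dt    # second integration: position error grows quadratically
    return position

# Illustrative values (assumptions): 0.05 m/s^2 bias, 100 Hz IMU, one minute of walking
drift = position_drift(bias_mps2=0.05, rate_hz=100.0, seconds=60.0)
print(f"after 10 s: {drift[999]:.2f} m, after 60 s: {drift[-1]:.2f} m")
```

Under these assumptions the error is about 2.5 m after 10 s and about 90 m after 60 s, which is why an IMU alone cannot carry a trajectory for long.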

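The problem milliEgo targets: a firefighter rushing into a burning house, a robot crawling into a pitch-black basement, a robot vacuum running into a floor-length mirror. In these "cannot see" situations, how can a device still report its own trajectory reliably? Its answer is to bind the "blurry but rugged" radar and the "responsive but drifting" IMU together with a neural network, so that two limping sensors hold each other up.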

Plate Nº I · milliEgo scene illustration: the real-world problem this paper addresses

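How it was done before

  • VIO (Visual-Inertial Odometry): camera + IMU, the mainstream approach for phone AR and drones over the past decade; but it breaks down in darkness, smoke, or low-texture scenes
  • LIO (LiDAR-Inertial Odometry): LiDAR + IMU, accurate, but LiDAR is expensive and sensitive to glass and smoke
  • Traditional mmWave SLAM: scan matching based on point-cloud registration (ICP-style); the problem is that a single-chip radar's point cloud is too sparse and noisy for geometric methods to register reliably
  • Early RF + IMU fusion: mostly Kalman filtering, which makes strong assumptions about the noise distribution and tends to diverge when radar noise is irregular
  • Purely learned odometry: methods such as DeepVO stack a CNN and an RNN to regress pose, showing that deep networks can learn odometry, but there was no mature solution for mmWave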
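
The scan-matching bullet can be illustrated with the alignment step at the heart of ICP-style registration. This is a minimal 2D sketch, assuming point correspondences are already known (real ICP also has to guess them, which is harder still). The point counts, noise levels, and the helper names `rigid_align_2d` and `heading_error_deg` are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def rigid_align_2d(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t with dst ~ src @ R.T + t (Kabsch)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 2x2 cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against a reflection solution
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

def heading_error_deg(n_points: int, noise_std: float, rng) -> float:
    """Rotate a random scan by 5 degrees, add noise, and report the recovered-angle error."""
    angle = np.deg2rad(5.0)
    R_true = np.array([[np.cos(angle), -np.sin(angle)],
                       [np.sin(angle),  np.cos(angle)]])
    scan_a = rng.uniform(-5.0, 5.0, size=(n_points, 2))
    noise = rng.normal(0.0, noise_std, size=(n_points, 2))
    scan_b = scan_a @ R_true.T + np.array([0.10, 0.02]) + noise
    R_est, _ = rigid_align_2d(scan_a, scan_b)
    est = np.arctan2(R_est[1, 0], R_est[0, 0])
    return float(abs(np.rad2deg(est - angle)))

rng = np.random.default_rng(0)
print("dense, clean scan  :", heading_error_deg(500, 0.01, rng), "deg")
print("sparse, noisy scan :", heading_error_deg(15, 0.30, rng), "deg")
```

With many clean points the rotation is recovered almost exactly; with a handful of noisy points the error is typically much larger, and that is before any wrong correspondences are added.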

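The key idea of the paper

The core is two judgments:

  1. Single-chip mmWave radar is cheap and robust to harsh environments but physically hard to use: rather than fighting sparse point clouds with geometry, let a neural network learn motion features directly from raw or low-level radar representations
  2. Radar and IMU form a complementary "slow and blurry" vs "fast and drifting" pair: each radar frame gives a coarse but absolute geometric cue, while the IMU delivers acceleration and angular velocity at high rate. Letting the network learn cross-modal attention that decides dynamically whose input to trust at each frame is more robust than hand-tuned weights

The key idea in one sentence: replace the Kalman filter with deep fusion, replace point-cloud registration with learning, and promote the single-chip radar from "good enough in a pinch" to "primary sensor."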

Plate Nº II · milliEgo method illustration: the core pipeline

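How it works (method)

Input and representation (like the ingredients a chef receives). The radar side serves up the output of a single-chip mmWave radar (typically something like the TI IWR1443; the exact model needs to be checked against the paper) as a range-velocity or range-azimuth heatmap, which can be read as a blurry photo of "what is how far away in which direction." The IMU side delivers three-axis acceleration plus three-axis angular velocity at high rate, like a report over a hundred times per second of "how fast I am turning and how much I am shaking." The two streams are aligned by timestamp and fed into their own feature encoders.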
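
The timestamp alignment step can be sketched as follows. This is a minimal example, assuming a 10 Hz radar and a 100 Hz IMU with integer millisecond timestamps; the rates and the `imu_windows` helper are illustrative assumptions, not details from the paper. For each pair of consecutive radar frames it collects the IMU samples recorded in between.

```python
import numpy as np

def imu_windows(radar_ts_ms: np.ndarray, imu_ts_ms: np.ndarray, imu_data: np.ndarray):
    """Return, for each pair of consecutive radar frames, the IMU samples in between."""
    # For every radar timestamp, find the index of the first IMU sample at or after it
    idx = np.searchsorted(imu_ts_ms, radar_ts_ms, side="left")
    return [imu_data[start:stop] for start, stop in zip(idx[:-1], idx[1:])]

# Assumed rates: radar every 100 ms, IMU every 10 ms, over half a second
radar_ts_ms = np.arange(0, 500, 100)           # 5 radar frames
imu_ts_ms = np.arange(0, 500, 10)              # 50 IMU samples
imu_data = np.zeros((len(imu_ts_ms), 6))       # columns: 3-axis accel + 3-axis gyro

windows = imu_windows(radar_ts_ms, imu_ts_ms, imu_data)
print(len(windows), "windows, each of shape", windows[0].shape)
```

Each window is what an IMU encoder would consume alongside the corresponding radar frame pair.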

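Two-stream encoding + cross-modal fusion (like two interpreters translating the same sentence together). The radar stream goes through a CNN-style encoder (CNN = convolutional neural network, good at finding spatial structure in image-like data), and the IMU stream goes through a small RNN/MLP that handles the time-series signal.

Wait, slow down for a moment: what is cross-modal attention? Think of it as a volume knob. At every frame it asks "right now, is the radar more trustworthy, or the IMU?" and assigns a weight to each side. When the radar echo is too blurry to use (for example, facing an empty room), trust the IMU more; when the IMU has drifted after running for a while, trust the radar's absolute geometric cue more.

The paper uses this kind of attention-based "compositional mask" mechanism (compositional / cross-modal attention). This is the biggest difference from the earlier approach of simply concatenating the two feature streams: the weights are learned by the model rather than tuned by hand.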
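
The "volume knob" idea can be sketched as a soft mask over the concatenated features. This is a generic illustration, not the paper's exact layer: the feature sizes, the random weights standing in for learned ones, and the per-channel granularity are all assumptions, and the function name `soft_mask_fusion` is hypothetical.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def soft_mask_fusion(radar_feat, imu_feat, W, b):
    """Re-weight concatenated features with a mask computed from both modalities."""
    joint = np.concatenate([radar_feat, imu_feat])   # shape (Dr + Di,)
    mask = sigmoid(W @ joint + b)                    # one weight in (0, 1) per channel
    return mask * joint, mask                        # plain concat would return `joint` unchanged

rng = np.random.default_rng(0)
radar_feat = rng.normal(size=8)                      # assumed radar feature size
imu_feat = rng.normal(size=4)                        # assumed IMU feature size
W = rng.normal(scale=0.1, size=(12, 12))             # stands in for learned weights
b = np.zeros(12)

fused, mask = soft_mask_fusion(radar_feat, imu_feat, W, b)
print("radar channel weights:", np.round(mask[:8], 2))
print("imu channel weights  :", np.round(mask[8:], 2))
```

Because the mask depends on the current features of both streams, the weighting changes from frame to frame, which is the property that simple concatenation lacks.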

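Pose regression (like summing frame-by-frame "how far did I just move" into a complete trajectory). The fused features go into a temporal network (LSTM-style), which regresses a 6-DoF relative pose per frame (translation Δt + rotation Δrotation); accumulating these gives a trajectory. The loss is a pose regression loss (position + orientation, with orientation usually represented as a quaternion or in the Lie algebra); the exact form needs to be checked against the paper.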
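
Accumulating relative poses into a trajectory can be sketched with 4x4 homogeneous transforms. For readability this example restricts rotation to yaw about the z axis, which is a simplification of the full 6-DoF case; the helper names `se3` and `accumulate` are hypothetical.

```python
import numpy as np

def se3(yaw_rad: float, translation) -> np.ndarray:
    """4x4 homogeneous transform: rotation about z by yaw, plus a translation."""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def accumulate(relative_poses) -> np.ndarray:
    """Chain per-frame relative transforms into global poses, starting at the origin."""
    pose = np.eye(4)
    trajectory = [pose.copy()]
    for delta in relative_poses:
        pose = pose @ delta            # each delta is expressed in the previous body frame
        trajectory.append(pose.copy())
    return np.stack(trajectory)

# Four identical steps: move 1 m forward, then turn left 90 degrees (a closed square)
step = se3(np.pi / 2, [1.0, 0.0, 0.0])
traj = accumulate([step] * 4)
print(np.round(traj[:, :3, 3], 3))     # positions of the five poses
```

The positions trace (0,0,0), (1,0,0), (1,1,0), (0,1,0) and return to the origin. In a real system every regressed step carries a small error, and those errors compound through the same multiplication, which is why drift is the central metric for odometry.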

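End-to-end training (like memorizing both the questions and the answers when copying homework). The whole network is trained end to end on a dataset with ground-truth trajectories (from motion capture or high-precision SLAM). After training, inference needs only the radar and IMU inputs, with no vision at all.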

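What the experiments do

They mainly answer three questions:

  • Baseline comparison: against pure VIO (e.g. VINS-Mono), pure IMU integration, traditional radar odometry, and a version with attention ablated, looking at trajectory drift (metrics such as ATE / RTE; the actual numbers need to be checked against the paper)
  • Robustness in harsh environments: verify that milliEgo still works in smoke, darkness, mirrored walls, and low-texture corridors where vision breaks down
  • Ablation: remove cross-modal attention / remove the IMU / switch to simple concatenation, to show that the fusion scheme itself contributes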
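
The two drift metrics named above can be sketched in simplified form. This version assumes the estimated and ground-truth trajectories are already time-aligned and expressed in the same frame, and it compares positions only; published evaluations usually add a trajectory alignment step and handle rotation as well. The function names and the toy trajectory are illustrative.

```python
import numpy as np

def ate_rmse(est_xyz: np.ndarray, gt_xyz: np.ndarray) -> float:
    """Absolute trajectory error: RMS position error over the whole trajectory."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))

def rte_mean(est_xyz: np.ndarray, gt_xyz: np.ndarray, step: int = 1) -> float:
    """Relative error: mean error of the displacement over windows of `step` frames."""
    d_est = est_xyz[step:] - est_xyz[:-step]
    d_gt = gt_xyz[step:] - gt_xyz[:-step]
    return float(np.mean(np.linalg.norm(d_est - d_gt, axis=1)))

# Toy case: 10 m straight walk, estimate drifts sideways by 0.1 m per frame
gt = np.column_stack([np.arange(11.0), np.zeros(11), np.zeros(11)])
est = gt.copy()
est[:, 1] = 0.1 * np.arange(11)

print(f"ATE RMSE = {ate_rmse(est, gt):.3f} m, mean RTE = {rte_mean(est, gt):.3f} m")
```

Here each step is off by only 0.1 m (the relative error), yet the absolute error is about 0.59 m RMS because the small per-step errors accumulate along the path.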

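The dataset is typically collected by the authors with a wheeled robot or handheld rig, paired with ground truth from high-precision motion capture or LiDAR-SLAM, and covers multiple indoor scenes. Trajectory lengths, collection hardware, and error figures need to be checked against the paper.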

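New terms you should know

  • Egomotion estimation: a device estimating how it has moved itself; the output is a sequence of relative poses. Unlike SLAM, it does not necessarily build a map
  • mmWave radar (millimeter-wave radar): radar with millimeter-scale wavelength (e.g. 77 GHz), with finer resolution than traditional radar; the single-chip version (FMCW, frequency-modulated continuous wave) is cheap and compact
  • IMU: inertial measurement unit, a three-axis accelerometer + three-axis gyroscope; high rate but subject to bias drift
  • Sensor fusion: combining data from multiple sensors into a more reliable estimate; traditionally done with Kalman filters or factor graphs, here with a neural network
  • Cross-modal attention: lets the model learn dynamic "whom to listen to" weights between features from two different modalities
  • 6-DoF pose: six-degree-of-freedom pose = 3D translation + 3D rotation, the standard output of odometry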
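
The "millimeter-scale wavelength" claim for 77 GHz can be checked with two standard formulas: wavelength λ = c / f, and FMCW range resolution ΔR = c / (2B). The 4 GHz sweep bandwidth below is an illustrative assumption, not a figure taken from the paper.

```python
C = 299_792_458.0  # speed of light in m/s

def wavelength_mm(freq_hz: float) -> float:
    """Free-space wavelength in millimetres."""
    return C / freq_hz * 1e3

def range_resolution_cm(bandwidth_hz: float) -> float:
    """FMCW range resolution c / (2B) in centimetres."""
    return C / (2.0 * bandwidth_hz) * 1e2

print(f"wavelength at 77 GHz: {wavelength_mm(77e9):.2f} mm")
print(f"range resolution with an assumed 4 GHz sweep: {range_resolution_cm(4e9):.2f} cm")
```

The wavelength comes out just under 4 mm, and a 4 GHz sweep would separate targets a few centimeters apart in range.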

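How it relates to other papers

  • Upstream: DeepVO (end-to-end learned visual odometry), VINS-Mono (tightly coupled vision + IMU). milliEgo moves the "learn odometry end to end" approach from vision to mmWave
  • Contemporary RF work: RF-SLAM and mmWave mapping work (millimap, etc.). Those lean toward mapping, while milliEgo leans toward odometry; the pain points of sparse, noisy point clouds are shared
  • Downstream / influence: later work on mmWave + vision or mmWave + LiDAR tri-modal fusion often uses it as an RF-only baseline
  • Adjacent area: through-wall sensing (rf-pose-through-wall) also senses with radio signals, but with a different goal (human pose rather than egomotion)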

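How I suggest reading it

  1. Skim the abstract + Figure 1 + the experiment table headers first: work out what the input is, what the output is, what it is compared against, and in which kinds of scenes it wins
  2. Focus on the fusion layer in the method: how exactly cross-modal attention is computed (where query/key/value come from), and whether weights are applied per frame or per feature channel
  3. Read the ablation: how much performance drops when attention is replaced with concat is the most direct evidence of whether the fusion scheme really matters
  4. (Optional) Compare with a VIO paper such as VINS-Mono: understand the traditional tightly coupled factor-graph approach, then come back to see how milliEgo's network-based "soft fusion" differs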

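Why it is worth reading

  • It is one of the representative works that moved single-chip mmWave radar from "geometric methods cannot cope" to "usable with deep learning," and it is foundational for the RF + learning line of work
  • The design pattern of cross-modal attention + a complementary IMU transfers to any setting where one modality is noisy and another drifts, such as RF + vision or RF + touch
  • It has practical meaning for embodied AI: when a robot enters smoke, underground, or nighttime environments, this is one of the few approaches that can still give a reliable 6-DoF pose
  • It is a SenSys 2020 paper; seen from today, hardware cost has dropped further and the network could be swapped for a Transformer, yet the idea still holds. It is a good starting point of the "after reading it you can see how to improve it" kind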

Cite this note
BibTeX
@online{eai_milliego_2026,
  title       = {(readable note) milliEgo: Single-chip mmWave Radar Aided Egomotion Estimation via Deep Sensor Fusion},
  author      = {Zhou, Jason},
  year        = {2026},
  note        = {Note on a 2020 paper},
  howpublished = {\url{https://estelledc.github.io/embodied-ai-reading-station/papers/milliego/}},
  organization = {Embodied AI Reading Station}
}
