Masked Autoencoders learn strong visual representations and achieve
state-of-the-art results in several independent modalities, yet very few works
have addressed their capabilities in multi-modality settings. In this work, we
focus on point cloud and RGB image data, two modalities that are often
presented together in the real world, and explore their meaningful
interactions. To improve upon the cross-modal synergy in existing works, we
propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D
interaction through three aspects. Specifically, we first notice the importance
of masking strategies between the two sources and utilize a projection module
to complementarily align the mask and visible tokens of the two modalities.
Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared
decoder to promote cross-modality interaction in the mask tokens. Finally, we
design a unique cross-modal reconstruction module to enhance representation
learning for both modalities. Through extensive experiments on large-scale
RGB-D scene understanding benchmarks (SUN RGB-D and ScanNetV2), we find that
interactively learning point-image features is nontrivial; our approach
improves multiple 3D detectors, 2D detectors, and few-shot classifiers by
2.9%, 6.7%, and 2.4%, respectively. Code is available at
https://github.com/BLVLab/PiMAE.
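The abstract's first ingredient, complementary mask alignment, can be sketched roughly as follows. This is not the authors' code (the real modules live in the linked repo); it is a minimal NumPy illustration under simplifying assumptions: point tokens in camera coordinates are projected through a pinhole intrinsic matrix `K` onto image-patch indices, and the image mask is then chosen so that patches covered by *visible* point tokens are preferentially masked, forcing each branch to lean on the other modality for reconstruction. All function and parameter names here are hypothetical.

```python
import numpy as np

def project_points_to_patches(points, K, patch_size, img_hw):
    """Project 3D points (camera coordinates, z > 0) onto the image plane
    and return the index of the image patch each point lands in."""
    uv = (K @ points.T).T            # (N, 3) homogeneous pixel coordinates
    uv = uv[:, :2] / uv[:, 2:3]      # perspective divide -> (u, v)
    u = np.clip(uv[:, 0], 0, img_hw[1] - 1)
    v = np.clip(uv[:, 1], 0, img_hw[0] - 1)
    patches_per_row = img_hw[1] // patch_size
    return (v // patch_size).astype(int) * patches_per_row \
         + (u // patch_size).astype(int)

def complementary_image_mask(point_mask, point_patch_idx, num_patches,
                             ratio=0.75, seed=0):
    """Build an image-patch mask complementary to the point-token mask:
    patches hit by visible point tokens are masked first, then the mask is
    topped up with random patches until the target ratio is reached."""
    visible_patches = np.unique(point_patch_idx[~point_mask])
    n_mask = int(num_patches * ratio)
    mask = np.zeros(num_patches, dtype=bool)
    chosen = visible_patches[:n_mask]      # prioritize aligned patches
    mask[chosen] = True
    remaining = np.setdiff1d(np.arange(num_patches), chosen)
    need = n_mask - chosen.size
    if need > 0:                           # fill up to the masking ratio
        rng = np.random.default_rng(seed)
        mask[rng.choice(remaining, size=need, replace=False)] = True
    return mask
```

With this complementary scheme, the image branch's visible patches tend to correspond to the point branch's masked regions (and vice versa), which is one plausible reading of "complementarily align the mask and visible tokens of the two modalities."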
Paper link: http://arxiv.org/pdf/2303.08129v1
Original article by fendouai. When reposting, please credit the source: https://panchuang.net/2023/03/15/pimae%ef%bc%9a%e7%94%a8%e4%ba%8e%e4%b8%89%e7%bb%b4%e7%89%a9%e4%bd%93%e6%a3%80%e6%b5%8b%e7%9a%84%e7%82%b9%e4%ba%91%e5%92%8c%e5%9b%be%e5%83%8f%e4%ba%a4%e4%ba%92%e5%bc%8f%e9%81%ae%e8%94%bd%e8%87%aa/