
A Simple Framework for Open-Vocabulary Segmentation and Detection

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy: segmentation requires extracting masks for both foreground objects and background stuff, while detection cares only about the former; ii) data discrepancy: box and mask annotations have different spatial granularities and thus are not directly interchangeable. To address these issues, we propose decoupled decoding to reduce the interference between foreground and background, and conditioned mask decoding to assist in generating masks for given boxes. We then develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA for panoptic segmentation on COCO and ADE20K, and for instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeeD is the first to explore the potential of jointly training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for both tasks in the open world.
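
To make the first technique concrete: the shared semantic space works by embedding class names from both datasets with a pre-trained text encoder and scoring each decoded visual query against those embeddings. Below is a minimal PyTorch-style sketch of this idea; the `text_encoder` interface, the prompt template, and the function names are illustrative assumptions, not the paper's actual API.

```python
import torch.nn.functional as F

def encode_class_names(text_encoder, class_names):
    """Embed class names into the shared semantic space (L2-normalized).

    `text_encoder` is assumed to be a CLIP-like model exposing `tokenize`;
    the prompt template below is a common convention, not the paper's exact one.
    """
    tokens = text_encoder.tokenize([f"a photo of a {name}" for name in class_names])
    embeddings = text_encoder(tokens)              # (num_classes, d)
    return F.normalize(embeddings, dim=-1)

def classify_queries(query_embeddings, class_embeddings, temperature=0.07):
    """Score decoded visual queries against every class-name embedding.

    query_embeddings: (num_queries, d), from the segmentation/detection decoder
    class_embeddings: (num_classes, d), from encode_class_names
    Returns logits of shape (num_queries, num_classes).
    """
    q = F.normalize(query_embeddings, dim=-1)
    return q @ class_embeddings.T / temperature    # cosine similarity as logits

# The union of COCO (mask) and Objects365 (box) vocabularies can be scored by
# one classifier, and new class names can be added at test time by re-embedding.
```

Because classification reduces to similarity in this space, class names never seen during training can be scored at inference simply by embedding them, which is what makes the model open-vocabulary.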
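The two decoding techniques can be sketched in the same spirit. In the sketch below (again an illustrative assumption under stated simplifications, not the released implementation), foreground and background use separate learned query sets so that box-only detection data supervises only the foreground path, while ground-truth boxes are lifted into extra queries so a mask head can produce masks for box-only annotations.

```python
import torch
import torch.nn as nn

class DecoupledDecoderSketch(nn.Module):
    """Illustrative sketch only, not the paper's released code.

    - Separate fg/bg query sets decouple foreground objects from background
      stuff, so detection data (foreground-only) does not disturb stuff queries.
    - Box-conditioned queries let a mask head predict masks for samples that
      carry only box annotations (conditioned mask decoding).
    """

    def __init__(self, d=256, n_fg=100, n_bg=50):
        super().__init__()
        self.fg_queries = nn.Embedding(n_fg, d)  # foreground: trained on both tasks
        self.bg_queries = nn.Embedding(n_bg, d)  # background stuff: segmentation only
        self.box_proj = nn.Linear(4, d)          # lift (cx, cy, w, h) boxes into query space
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)

    def forward(self, image_features, gt_boxes=None):
        # image_features: (B, HW, d) flattened output of the image encoder
        B = image_features.size(0)
        queries = [
            self.fg_queries.weight.unsqueeze(0).expand(B, -1, -1),
            self.bg_queries.weight.unsqueeze(0).expand(B, -1, -1),
        ]
        if gt_boxes is not None:  # gt_boxes: (B, n_boxes, 4), normalized coordinates
            queries.append(self.box_proj(gt_boxes))
        q = torch.cat(queries, dim=1)
        # Decoded queries go on to class/box/mask heads; on detection-only data,
        # losses would be applied to the foreground (and box-conditioned) queries.
        return self.decoder(q, image_features)
```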
Paper link: http://arxiv.org/pdf/2303.08131v1

Original article by fendouai. When reposting, please credit the source: https://panchuang.net/2023/03/15/%e4%b8%80%e7%a7%8d%e7%ae%80%e5%8d%95%e7%9a%84%e5%bc%80%e6%94%be%e8%af%8d%e6%b1%87%e5%88%86%e5%89%b2%e5%92%8c%e6%a3%80%e6%b5%8b%e6%a1%86%e6%9e%b6/
