Answering visual queries is a complex task that requires both visual
processing and reasoning. End-to-end models, the dominant approach for this
task, do not explicitly differentiate between the two, limiting
interpretability and generalization. Learning modular programs presents a
promising alternative, but has proven challenging due to the difficulty of
learning both the programs and modules simultaneously. We introduce ViperGPT, a
framework that leverages code-generation models to compose vision-and-language
models into subroutines to produce a result for any query. ViperGPT utilizes a
provided API to access the available modules, and composes them by generating
Python code that is later executed. This simple approach requires no further
training, and achieves state-of-the-art results across various complex visual
tasks.
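
To make the mechanism described in the abstract concrete, below is a rough sketch of how such a framework could expose perception modules through a Python API, and what a generated program composing them might look like. The `ImagePatch` class, its `find` and `simple_query` methods, and the `execute_command` entry point are illustrative assumptions made for this sketch (the method bodies are stubs); they are not reproduced from the paper's actual API.

```python
from dataclasses import dataclass
from typing import List

# Minimal stub of a ViperGPT-style API. The class and method names below are
# illustrative assumptions for this sketch, not the framework's real signatures.
@dataclass
class ImagePatch:
    image: object          # pixels of the (cropped) image region
    left: int = 0
    lower: int = 0
    right: int = 0
    upper: int = 0

    @property
    def horizontal_center(self) -> float:
        return (self.left + self.right) / 2

    def find(self, object_name: str) -> List["ImagePatch"]:
        """Return one patch per detected instance of object_name.
        In practice this would call an open-vocabulary object detector."""
        raise NotImplementedError

    def simple_query(self, question: str) -> str:
        """Answer a basic question about this patch.
        In practice this would call a visual question answering model."""
        raise NotImplementedError


# The kind of program a code-generation model might emit for the query
# "What color is the mug to the left of the laptop?". It composes the
# perception modules with ordinary Python control flow, so each reasoning
# step is explicit and inspectable.
def execute_command(image) -> str:
    image_patch = ImagePatch(image)
    laptops = image_patch.find("laptop")
    mugs = image_patch.find("mug")
    if not laptops or not mugs:
        # Fall back to answering the question directly if detection fails.
        return image_patch.simple_query("What color is the mug?")
    laptop = laptops[0]
    left_mugs = [m for m in mugs
                 if m.horizontal_center < laptop.horizontal_center]
    target = left_mugs[0] if left_mugs else mugs[0]
    return target.simple_query("What color is the mug?")
```

Because the composition is plain Python, intermediate results (the detected patches, the spatial comparison, the final sub-query) can be inspected directly, which is the interpretability benefit the abstract highlights over end-to-end models.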
Paper link: http://arxiv.org/pdf/2303.08128v1