Multimodal Image-Text Model Fine-Tuning Method Based on Fine-Grained Auxiliary Tasks
First published: 2025-02-28
Abstract: With the continued development of multimodal image-text models, their ability to recognize coarse-grained entities in images and text has improved substantially. However, current models remain relatively weak at processing fine-grained information. Many studies improve fine-grained capability by introducing fine-grained tasks, but most overlook the important role such tasks can play during the fine-tuning stage. This paper therefore introduces the fine-grained masked language modeling (MLM) task into the fine-tuning stage of multimodal image-text models as an auxiliary task, aiming to strengthen the models' fine-grained capability. Experimental results on vlm-probing, food-500-cap, and other datasets show that using MLM as an auxiliary task improves the fine-grained capability of multimodal image-text models and boosts image-text retrieval performance on datasets with rich descriptions, while retrieval performance on general-purpose datasets does not degrade.
Keywords: Intelligence Science and Technology, Multimodal Image-Text Models, Fine-tuning, Fine-grained Tasks
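To make the setup concrete, the sketch below illustrates in plain PyTorch how an auxiliary MLM loss can be added to a contrastive image-text retrieval objective during fine-tuning. It is a minimal sketch of the idea described in the abstract, not the paper's implementation: the placeholder model, the masking routine, and the loss weight lambda_mlm are all assumed choices for illustration.

```python
# Minimal sketch, not the paper's implementation: fine-tune an image-text model
# with a contrastive retrieval loss plus an auxiliary masked-language-modeling loss.
# The model, masking scheme, and loss weight below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyImageTextModel(nn.Module):
    """Placeholder dual encoder standing in for a real multimodal backbone."""

    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.image_encoder = nn.Linear(2048, dim)   # assumes pre-extracted image features
        self.mlm_head = nn.Linear(dim, vocab_size)  # predicts identities of masked tokens

    def forward(self, image_feats, token_ids):
        txt = self.text_encoder(self.token_emb(token_ids))  # (B, L, D)
        return self.image_encoder(image_feats), txt[:, 0], self.mlm_head(txt)


def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive (retrieval) loss over the batch."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


def mask_tokens(token_ids, mask_token_id=103, mask_prob=0.15):
    """Randomly mask tokens; only masked positions contribute to the MLM loss."""
    mask = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.masked_fill(~mask, -100)      # -100 is ignored by cross_entropy
    return token_ids.masked_fill(mask, mask_token_id), labels


model = ToyImageTextModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
lambda_mlm = 0.5                                      # auxiliary-task weight (assumed)

image_feats = torch.randn(8, 2048)                    # dummy batch
token_ids = torch.randint(999, 30000, (8, 32))
masked_ids, mlm_labels = mask_tokens(token_ids)

img_emb, txt_emb, _ = model(image_feats, token_ids)   # clean text for retrieval
_, _, mlm_logits = model(image_feats, masked_ids)     # masked text for the auxiliary task

loss = contrastive_loss(img_emb, txt_emb) + lambda_mlm * F.cross_entropy(
    mlm_logits.view(-1, mlm_logits.size(-1)), mlm_labels.view(-1), ignore_index=-100
)
loss.backward()
optimizer.step()
```

The key point of the setup is that the MLM term forces the text branch to predict individual masked tokens (fine-grained supervision) while the contrastive term preserves the model's image-text retrieval behavior; the weight lambda_mlm balances the two objectives.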