2024-07-29_FlexAttention for Efficient High-Resolution Vision-Language Models

| arXiv Fulltext PDF | 2024-07-29 | Junyan Li, Delin Chen, Tianle Cai, Peihao Chen, Yining Hong, Zhenfang Chen, Yikang Shen, Chuang Gan | URL | ECCV24

Core idea:

#question What training data is used? How is the mismatch between the last text token at training time and at inference time handled?

1 Method

First Nfa layers: plain self-attention, with only the low-resolution image tokens + text tokens as input.

Last Nsa layers: Hierarchical Self-Attention Module.

How the high-resolution features needed by the next layer are selected: take the attention scores of the last text feature over all low-resolution image tokens, filter them with a threshold to obtain the selected low-resolution tokens, then pick the high-resolution tokens of the corresponding regions (see the sketch below).
#idea My thought: interpolating the attention map instead would give roughly the same thing.
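A minimal sketch of the selection step as I read it. All names, the grid sizes (24x24 low-res, 72x72 high-res, i.e. 336 and 1008 at patch size 14), and the threshold value are my own assumptions, not taken from the paper or its code:

```python
import torch

def select_highres_tokens(attn_last_text, highres_tokens,
                          grid_low=24, ratio=3, threshold=0.005):
    """Attention-guided high-res token selection (sketch).

    attn_last_text: (N_low,) attention of the last text token over low-res image tokens
    highres_tokens: (N_high, D) high-res image tokens on a (grid_low*ratio)^2 grid
    threshold: arbitrary placeholder value
    Returns the high-res tokens whose low-res "parent" token exceeds the threshold.
    """
    keep = (attn_last_text > threshold).nonzero(as_tuple=True)[0]  # selected low-res token indices
    rows, cols = keep // grid_low, keep % grid_low                 # low-res grid coordinates
    grid_high = grid_low * ratio
    selected = []
    for r, c in zip(rows.tolist(), cols.tolist()):
        # each low-res token maps to a ratio x ratio block of the high-res grid
        for dr in range(ratio):
            for dc in range(ratio):
                selected.append((r * ratio + dr) * grid_high + (c * ratio + dc))
    return highres_tokens[torch.tensor(selected, dtype=torch.long)]

# toy usage
attn = torch.softmax(torch.randn(24 * 24), dim=0)
hi = torch.randn(72 * 72, 4096)
picked = select_highres_tokens(attn, hi)   # (num_selected_regions * 9, 4096)
```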

Pasted image 20240730155413.png

2 Experimental Setup

Resolution: 1008x1008

Baselines:

All three models are fine-tuned from LLaVA-1.5-7b.

Benchmarks: V* Bench [53], MagnifierBench [27], TextVQA [45] and RSVQA-HRBEN [38]. The first two benchmarks focus on evaluating the model's capability on general high-resolution VQA, while the last two benchmarks focus on evaluating the model's performance on domain-specific high-resolution VQA, such as TextVQA for text understanding and RSVQA-HRBEN for remote sensing.

Pasted image 20240730163608.png

Results: worse than our HD baseline! Falls short on both TextVQA and V* Bench! Why?

2.1 Ablations

Ablations on the advantage of attention-map-based selection, and on image size.
Pasted image 20240730170610.png

2.2 Inference Comparison

Pasted image 20240730171050.png

3 Analysis

Strength: saves compute with a cross-attention-like mechanism, and the overhead really does stay essentially unchanged!! (Rough cost arithmetic below.)
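Back-of-the-envelope estimate of why the cost barely grows. All numbers here are my own assumptions (336 low-res / 1008 high-res at patch size 14, ~64 text tokens, ~10% of low-res regions selected), not figures from the paper:

```python
# Attention cost approximated as (#queries * #keys).
n_low, n_high, n_text = 24 * 24, 72 * 72, 64   # low-res, high-res, text token counts (assumed)
select_frac = 0.10                              # assumed fraction of low-res regions selected

# naive HD: every high-res token is a query and a key
naive_hd = (n_high + n_text) ** 2
# FlexAttention-style: queries are low-res + text tokens; keys additionally include
# the selected high-res tokens (9 high-res tokens per selected low-res region)
flex = (n_low + n_text) * (n_low + n_text + select_frac * n_low * 9)

print(f"naive HD attention : {naive_hd:,.0f}")
print(f"FlexAttention-style: {flex:,.0f}  ({flex / naive_hd:.1%} of naive)")
```

Under these assumptions the hierarchical layers cost only a few percent of naive full-resolution self-attention, which matches the note above that the overhead is basically unchanged relative to a low-resolution-only model.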

Weaknesses:

Takeaway: the selling point is doing high resolution and then evaluating on V* Bench and MagnifierBench, which was enough to get it accepted; on the other benchmarks it only needs to avoid dropping much.

4 Abstract

Current high-resolution vision-language models encode images as high-resolution image tokens and exhaustively take all these tokens to compute attention, which significantly increases the computational cost. To address this problem, we propose FlexAttention, a flexible attention mechanism for efficient high-resolution vision-language models. Specifically, a high-resolution image is encoded both as high-resolution tokens and low-resolution tokens, where only the low-resolution tokens and a few selected high-resolution tokens are utilized to calculate the attention map, which greatly shrinks the computational cost. The high-resolution tokens are selected via a high-resolution selection module which could retrieve tokens of relevant regions based on an input attention map. The selected high-resolution tokens are then concatenated to the low-resolution tokens and text tokens, and input to a hierarchical self-attention layer which produces an attention map that could be used for the next-step high-resolution token selection. The hierarchical self-attention process and high-resolution token selection process are performed iteratively for each attention layer. Experiments on multimodal benchmarks prove that our FlexAttention outperforms existing high-resolution VLMs (e.g., relatively ~9% in V* Bench, ~7% in TextVQA), while also significantly reducing the computational cost by nearly 40%.
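Putting the abstract's iterative select-then-attend loop into pseudo-PyTorch, as I read it. The module and variable names, the top-k selection (instead of a threshold), the assumption that text tokens come after the low-res tokens, and the block-contiguous high-res layout are all my own simplifications, not the released code:

```python
import torch
import torch.nn as nn

class HierarchicalSelfAttention(nn.Module):
    """Stand-in for one hierarchical self-attention layer: low-res + text tokens
    attend over themselves plus the currently selected high-res tokens."""

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, hidden, selected_hi):
        kv = torch.cat([hidden, selected_hi], dim=1)            # keys/values: low-res + text + selected high-res
        out, attn_w = self.attn(hidden, kv, kv, need_weights=True)
        return out, attn_w                                       # attn_w: (B, N_q, N_kv), averaged over heads


class FlexAttentionStack(nn.Module):
    """Iterative loop over the later layers: select high-res tokens from the previous
    attention map, attend, and reuse the new map for the next layer's selection."""

    def __init__(self, n_layers, dim, n_low=576, topk=64, ratio=9):
        super().__init__()
        self.layers = nn.ModuleList([HierarchicalSelfAttention(dim) for _ in range(n_layers)])
        self.n_low, self.topk, self.ratio = n_low, topk, ratio   # ratio: high-res tokens per low-res region

    def forward(self, hidden, highres_tokens, attn_map):
        # hidden: (B, n_low + n_text, D) with text tokens last (assumed)
        # highres_tokens: (B, n_low * ratio, D), block-contiguous per low-res region (assumed)
        # attn_map: (B, n_low) attention of the last text token over low-res tokens
        for layer in self.layers:
            low_idx = attn_map.topk(self.topk, dim=-1).indices                    # selected low-res regions
            hi_idx = (low_idx.unsqueeze(-1) * self.ratio +
                      torch.arange(self.ratio, device=hidden.device)).flatten(1)  # their high-res token indices
            selected_hi = torch.gather(highres_tokens, 1,
                                       hi_idx.unsqueeze(-1).expand(-1, -1, hidden.size(-1)))
            hidden, attn_w = layer(hidden, selected_hi)
            attn_map = attn_w[:, -1, :self.n_low]                # last text token's attention over low-res tokens
        return hidden

# toy usage
B, D, n_low, n_text = 1, 256, 576, 16
stack = FlexAttentionStack(n_layers=2, dim=D, n_low=n_low, topk=32)
out = stack(torch.randn(B, n_low + n_text, D),
            torch.randn(B, n_low * 9, D),
            torch.softmax(torch.randn(B, n_low), dim=-1))
```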