LLaVA Series: Visual Instruction Tuning

project page, GitHub

Model comparison analysis

model zoo

Core points of LLaVA

data-efficient, 558K for pretrain, 665K for Instruction Tuning

Simple architecture: MLP connector

Splitting the image into sub-images for high resolution: significantly reduces hallucination
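A hedged sketch of the overall architecture these points refer to: a frozen CLIP vision encoder, a small connector projecting patch features into the LLM embedding space, and visual tokens prepended to the text embeddings. Module names and dimensions below are illustrative assumptions, not the repo's actual API.

```python
import torch
import torch.nn as nn

class LLaVALikeModel(nn.Module):
    """Illustrative skeleton of the LLaVA-style architecture (names/dims are assumptions)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. frozen CLIP ViT-L/14
        self.connector = nn.Linear(vision_dim, llm_dim)   # LLaVA-1.5 replaces this with a 2-layer MLP
        self.llm = llm                                    # assumes an HF-style LLM accepting inputs_embeds

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        patch_feats = self.vision_encoder(pixel_values)          # (B, num_patches, vision_dim)
        visual_tokens = self.connector(patch_feats)              # project into LLM embedding space
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)  # prepend visual tokens to text
        return self.llm(inputs_embeds=inputs)
```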

LLaVA-1.5

LLaVA-1.5 paper: Improved Baselines with Visual Instruction Tuning

data-efficient, 558K for pretrain, 665K for Instruction Tuning

Ablation

Academic benchmarks require short-form answers, and LLaVA performs poorly on them. The following improvements were made:

Response formatting prompts + finetuning the LLM

Analysis of 2023-06-15_InstructBLIP - Towards General-purpose Vision-Language Models with Instruction Tuning:
First, ambiguous prompts on the response format. For example, Q: {Question} A: {Answer}. Such prompts do not clearly indicate the desirable output format.

Second, not finetuning the LLM. The first issue is worsened by InstructBLIP only finetuning the Qformer for instruction-tuning. It requires the Qformer’s visual output tokens to control the length of the LLM’s output to be either long-form or short-form, as in prefix tuning [25], but Qformer may lack the capability of properly doing so.
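For contrast, LLaVA-1.5's fix is to state the format requirement explicitly at the end of short-answer questions. The single-sentence prompt below is the one reported in the paper; the helper function is just an illustrative sketch.

```python
# The format prompt used by LLaVA-1.5 for short-form (academic VQA) questions.
FORMAT_PROMPT = "Answer the question using a single word or phrase."

def build_question(question: str, short_answer: bool) -> str:
    # Illustrative helper: append the explicit format instruction instead of the
    # ambiguous "Q: {Question} A: {Answer}" style criticized above.
    return f"{question}\n{FORMAT_PROMPT}" if short_answer else question

print(build_question("What color is the bus?", short_answer=True))
```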

Changing the linear projection to an MLP connector
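A minimal sketch of this change, assuming the usual two-layer GELU MLP (`mlp2x_gelu`) and illustrative dimensions (1024-d CLIP ViT-L features, 4096-d LLM hidden size):

```python
import torch.nn as nn

# LLaVA: a single linear projection from vision features to the LLM embedding space.
linear_connector = nn.Linear(1024, 4096)

# LLaVA-1.5: a two-layer MLP with GELU ("mlp2x_gelu"): Linear -> GELU -> Linear.
mlp_connector = nn.Sequential(
    nn.Linear(1024, 4096),  # 1024 = CLIP ViT-L feature dim (illustrative)
    nn.GELU(),
    nn.Linear(4096, 4096),  # 4096 = LLM hidden size, e.g. a 7B Vicuna (illustrative)
)
```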

On top of LLaVA-Instruct, add academic-task-oriented data

Scaling: input image resolution (CLIP-ViT-L-336px), additional data, and LLM size (up to 13B)

LLaVA

LLaVA paper: Visual Instruction Tuning

The core idea is to feed captions & bounding boxes to (text-only) GPT-4 to construct visual instruction-tuning data

Instruction Tuning Data

Build instruction-following data from COCO (images + captions + bounding boxes)
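The image itself is never shown to GPT-4 (text-only); its captions and bounding boxes are serialized as a textual "symbolic representation", and GPT-4 is prompted to produce three response types: multi-turn conversation, detailed description, and complex reasoning Q&A. A hedged sketch follows; the prompt wording, field names, and sample values are illustrative, not the paper's exact prompts.

```python
# Illustrative sketch: build the textual context fed to text-only GPT-4.
def symbolic_representation(captions, boxes):
    cap_text = "\n".join(captions)
    box_text = "\n".join(
        f"{name}: [{x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}]"
        for name, (x1, y1, x2, y2) in boxes
    )
    return f"Captions:\n{cap_text}\n\nObjects (normalized boxes):\n{box_text}"

# Dummy COCO-style annotations (values made up for illustration).
context = symbolic_representation(
    captions=["A group of people standing outside of a black vehicle with luggage."],
    boxes=[("person", (0.10, 0.20, 0.35, 0.90)), ("suitcase", (0.40, 0.60, 0.55, 0.95))],
)
# `context` is then placed into a GPT-4 prompt that asks for one of the three
# response types (conversation / detailed description / complex reasoning Q&A).
```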

Model

CLIP ViT-L/14, linear layer for projection

Training

Alignment on 595K image-text pairs filtered from CC3M (CC-595K); only the projection layer is tuned

SFT: projection & LLM with LLaVA-Instruct-158K
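A minimal sketch of the two-stage parameter freezing; module names are assumptions, and the actual repo controls this with training flags rather than a helper like this.

```python
# Illustrative: which parameters are trainable in each stage (module names assumed).
def set_trainable(model, train_projector: bool, train_llm: bool):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False                  # CLIP encoder stays frozen in both stages
    for p in model.mm_projector.parameters():
        p.requires_grad = train_projector
    for p in model.llm.parameters():
        p.requires_grad = train_llm

# Stage 1 (alignment on CC-595K):        set_trainable(model, True, False)
# Stage 2 (SFT on LLaVA-Instruct-158K):  set_trainable(model, True, True)
```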

Eval

LLaVA-Bench (COCO), 90 questions, and LLaVA-Bench (In-the-Wild), 24 images with 60 questions

Ablations

We tried using the last layer feature from CLIP vision encoder, which yields 89.96% and is 0.96% lower than the feature before the last layer. We hypothesize that this is because CLIP’s last layer features may focus more on global and abstract image properties compared to the layer before it
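A minimal sketch of taking the penultimate-layer grid features with HuggingFace Transformers; the `-2` index and dropping the [CLS] token follow the ablation above, everything else is illustrative.

```python
import torch
from transformers import CLIPVisionModel

encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
pixel_values = torch.randn(1, 3, 224, 224)  # dummy batch; normally from CLIPImageProcessor

with torch.no_grad():
    out = encoder(pixel_values, output_hidden_states=True)

features = out.hidden_states[-2]  # layer before the last, per the ablation
patch_tokens = features[:, 1:]    # drop [CLS]; keep the 16x16 = 256 patch (grid) tokens
```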

LLaVA-NeXT / LLaVA-1.6

LLaVA-NeXT (LLaVA-1.6): improved reasoning, OCR, and world knowledge.
Paper: improved_llava.pdf; the content on LLaVA-1.6 starts at Section 3.4.

Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:

  1. Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
  2. Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
  3. Better visual conversation for more scenarios, covering different applications. Better world knowledge and logical reasoning.
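A rough token-count estimate for the 4x-resolution setting in item 1 above, assuming the 336px CLIP-ViT-L/14 backbone (24x24 = 576 tokens per 336x336 tile) and counting the downsampled full-image overview alongside the tiles; this is my reading of the AnyRes scheme, not an official figure.

```python
# Back-of-the-envelope image-token count for a 672x672 AnyRes input (assumptions above).
tile_tokens = (336 // 14) ** 2           # 24 * 24 = 576 tokens per 336x336 tile
rows, cols = 2, 2                        # 672x672 -> a 2x2 grid of tiles
total = (rows * cols + 1) * tile_tokens  # +1 for the downsampled full-image overview
print(total)                             # 2880 image tokens, vs. 576 for LLaVA-1.5
```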

Model

It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. This is not stated explicitly in the paper; however, from the model zoo, each model size has its own connector. My understanding is that since the pretraining-stage dataset is unchanged, stage one can simply reuse the pretraining of the same-size LLaVA-1.5 model.

High-Resolution

Split into sub-images to support arbitrary resolutions: 336 x [(2,2), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)]; the paper appendix additionally supports 1x5, 1x6, and 2x3.

Motivation
When provided with high-resolution images and representations that preserve these details, the model’s capacity to perceive intricate details in an image is significantly improved. It reduces the model hallucination that conjectures the imagined visual content when confronted with low-resolution images.

The ablation shows that high resolution significantly reduces hallucination!
Pasted image 20240421155254.png

Implementation details:
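A hedged sketch of the AnyRes preprocessing: pick the supported grid that best matches the input aspect ratio, resize into that canvas, cut it into 336x336 tiles, and encode each tile plus a downsampled overview image with CLIP separately. The grid list comes from the section above; the selection heuristic and function names are illustrative (the real code also pads to preserve aspect ratio rather than stretching).

```python
from PIL import Image

BASE = 336
# Supported (rows, cols) grids from above; the appendix adds (1,5), (1,6), (2,3).
GRIDS = [(2, 2), (1, 2), (2, 1), (1, 3), (3, 1), (1, 4), (4, 1)]

def pick_grid(width: int, height: int):
    """Illustrative heuristic: choose the grid whose aspect ratio is closest to the image's."""
    ar = width / height
    return min(GRIDS, key=lambda rc: abs(rc[1] / rc[0] - ar))

def split_into_tiles(img: Image.Image):
    rows, cols = pick_grid(*img.size)
    canvas = img.resize((cols * BASE, rows * BASE))   # real code pads instead of stretching
    tiles = [canvas.crop((c * BASE, r * BASE, (c + 1) * BASE, (r + 1) * BASE))
             for r in range(rows) for c in range(cols)]
    overview = img.resize((BASE, BASE))               # downsampled full-image view
    return [overview] + tiles                         # each sub-image is encoded by CLIP
```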

Training

Stage 1: alignment, train the connector only, 558K samples.
Stage 2: instruction tuning, full model, 760K samples.

Ablation

Vicuna-1.5 is the strongest base model, stronger than LLaMA-2-Chat.

Data efficiency: after sub-sampling the training data, results remain strong; even 50% of the data gives decent performance.

LIMA: less-is-more alignment

Hallucination is usually attributed to errors in the training data. However, high resolution significantly reduces hallucination! So besides outright annotation errors, hallucination can also arise when annotated content cannot be perceived by the model at the given resolution: the model simply cannot see what was annotated, which effectively acts like a label error.