2023-07-11_Generative Pretraining in Multimodality(Emu-1)

Input: Interleaved image, text and video

Output: Interleaved image, text

no videos

Core idea: regress continuous visual tokens!

Insights

#Idea Video training data: temporally aligned data? Use Temporal Alignment

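Why the generation quality is poor:

Images are represented as continuous tokens, which work less well than discrete tokens.
Using the embedding as the condition for diffusion is rather unnatural.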

On understanding performance: there is no comparison against MLLMs. Would it fall short of them?

Why Emu

Are the multi-scale operators supported on Huawei GPUs?
Resource requirements are easy to estimate

Model

Architecture

Image Encoder, Causal Transformer, LLM, Visual Decoder

The Causal Transformer converts the image tokens output by the encoder into a fixed-size set of embeddings. Its only difference from the [BLIP] Q-Former is that self-attention is causal, which better suits autoregressive training.
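
To make the "causal self-attention" point concrete, here is a minimal pure-Python sketch of single-head attention over a handful of query vectors, where query i may only attend to queries 0..i. It is an illustration under assumptions, not Emu's implementation: there are no learned projections and no cross-attention to image features, and the function names are made up for this note.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when query i may attend to query j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax over scores, giving zero weight where mask is False."""
    out = []
    for row, allowed in zip(scores, mask):
        peak = max(s for s, a in zip(row, allowed) if a)
        exps = [math.exp(s - peak) if a else 0.0 for s, a in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def causal_self_attention(queries):
    """Single-head attention where queries act as keys and values too.

    Assumption: no learned projections, to keep the sketch short.
    """
    dim = len(queries[0])
    scores = [
        [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in queries]
        for q in queries
    ]
    weights = masked_softmax(scores, causal_mask(len(queries)))
    mixed = [
        [sum(w * v[d] for w, v in zip(row, queries)) for d in range(dim)]
        for row in weights
    ]
    return mixed, weights
```

With this mask the first output depends only on the first query, the second on the first two, and so on, which is the property that lets the visual embeddings be predicted one after another.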

Visual Decoder：LLM output embedding as condition.

Pasted image 20240317115401.png

Training Objective

Stage 1: unified auto-regressive pretraining

cross-entropy loss for discrete text tokens
ℓ2 regression loss for continuous visual embeddings
Stage 2: Tune Visual Decoder([Stable Diffusion])
Stage 3: Instruction Tuning with [LoRA]
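
The Stage 1 objective above can be sketched in a few lines of plain Python: text positions contribute a cross-entropy term and visual positions contribute an ℓ2 regression term. The step format, the mean over embedding dimensions, and the equal weighting of the two terms are assumptions made for this sketch, not details taken from the paper.

```python
import math

def cross_entropy(logits, target_id):
    """Negative log-likelihood of target_id under softmax(logits)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target_id]

def l2_regression(predicted, target):
    """Mean squared error between predicted and target embeddings."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)

def unified_loss(steps):
    """Average per-position loss over an interleaved sequence.

    Each step is ("text", logits, target_id) or ("visual", predicted, target).
    Assumption: both loss terms are weighted equally.
    """
    total = 0.0
    for kind, prediction, target in steps:
        if kind == "text":
            total += cross_entropy(prediction, target)
        elif kind == "visual":
            total += l2_regression(prediction, target)
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return total / len(steps)
```

A call such as `unified_loss([("text", [0.0, 0.0], 0), ("visual", [1.0, 2.0], [1.0, 0.0])])` mixes one step of each kind in a single sequence.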

Training Data&Details

Pre-training: End-to-End

Image/Video-text Pairs
Interleaved Image/Video and Text

Interleaved Video and Text: thumbnails + subtitles, ordered by timestamp. This is essentially the same as interleaved image-text data; motion information is not modeled.
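
A minimal sketch of the timestamp ordering described above, assuming each thumbnail and each subtitle arrives as a `(timestamp, payload)` pair. The function name and the tie-breaking rule (image before text at equal timestamps) are assumptions made for illustration.

```python
def interleave_by_timestamp(frames, subtitles):
    """Merge (timestamp, payload) lists into one sequence ordered by time.

    Assumption: when timestamps tie, the image is placed before the text.
    """
    tagged = [(t, 0, "image", p) for t, p in frames]
    tagged += [(t, 1, "text", p) for t, p in subtitles]
    tagged.sort(key=lambda item: (item[0], item[1]))
    return [(kind, payload) for _, _, kind, payload in tagged]
```

The result is a flat image/text sequence with nothing linking one frame to the next, which is why the note treats it as equivalent to interleaved image-text data.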

Detail

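On the [IMG] and [/IMG] tokens

Is [IMG] part of the pretraining objective, so that the model decides for itself when to output it?
Since the number of image tokens is fixed, should [/IMG] be emitted automatically a fixed number of tokens after [IMG]?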
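
The behavior these two questions guess at can be sketched as a decoding loop: the model chooses when to emit [IMG], a fixed number of visual embeddings follow, and [/IMG] is appended without being predicted. This is one hypothetical reading, not something the paper is known to confirm here; `next_text_token` and `next_visual_embedding` are stand-in callables, not a real API.

```python
IMG, END_IMG = "[IMG]", "[/IMG]"

def decode(next_text_token, next_visual_embedding, num_visual_tokens, max_steps):
    """Generate a sequence in which [IMG] triggers a fixed-length visual span.

    next_text_token(sequence) returns a token string, or None to stop.
    next_visual_embedding(sequence) returns one regressed embedding.
    max_steps is a soft cap: a visual span that has started is always finished.
    """
    sequence = []
    while len(sequence) < max_steps:
        token = next_text_token(sequence)
        if token is None:
            break
        sequence.append(token)
        if token == IMG:
            # The span length is fixed, so no stop decision is needed here.
            for _ in range(num_visual_tokens):
                sequence.append(next_visual_embedding(sequence))
            sequence.append(END_IMG)  # inserted automatically, not predicted
    return sequence
```

Under this reading, only [IMG] would need to be learned as a prediction target, while [/IMG] would be purely a formatting token.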

Eval

Zero-shot

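Multimodal Understanding, two kinds of evaluation:

Uses Chain of Thought: first caption the image, then feed the caption and the prompt to the model
- The image itself is not used; everything is converted to text first. Is that reasonable?
Following Flamingo, to control the output format, use two text-only examples from the task as prompts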
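
The two evaluation setups above can be sketched as prompt-building helpers. Everything concrete here is an assumption: `generate(image, prompt)` is a hypothetical stand-in for the model call, and the prompt templates are invented for illustration rather than taken from the paper. The second step of the chain-of-thought variant passes no image, following the reading in the note above.

```python
def chain_of_thought_answer(generate, image, question):
    """Two-step answer: caption the image, then answer from the caption."""
    caption = generate(image, "Describe the image in detail.")
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    # Assumption: the second call sees only text, no image.
    return generate(None, prompt)

def text_only_few_shot_prompt(examples, question):
    """Prefix the question with two text-only (question, answer) examples."""
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:2]]
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])
```

The text-only examples carry no image, so they can only shape the answer format, which matches the stated purpose of controlling the output format.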

Text2Image Generation on MSCOCO val: FID is worse than that of models such as SD, Imagen, and Parti

Few-shot(in-context learning)

VQA VideoQA

Qualitative

real-world knowledge grounding, interleaved multi-image understanding, detailed video understanding, multimodal assistant, multi-turn dialogue, image blending, and (in-context) text-to-image generation

Abstract

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.