2023-07-11_Generative Pretraining in Multimodality(Emu-1)

| File | Date | Authors | Link | Source |
| --- | --- | --- | --- | --- |
| 2023_Sun_Generative Pretraining in Multimodality.pdf | 2023-07-11 | Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang | URL | arXiv |

Input: Interleaved image, text and video

Output: Interleaved image, text

Core idea: regress continuous visual tokens!

Insights

#Idea Video training data: temporally aligned data? Use temporal alignment.

Open question: why is the generation quality poor?

On understanding performance: there is no comparison against MLLMs; would it fall short?

Why Emu

Do Huawei GPUs support the multi-scale operators?
Resource requirements are easy to estimate.

Model

Architecture

Four components: Image Encoder (EVA-CLIP), Causal Transformer, LLM (LLaMA), Visual Decoder (initialized from Stable Diffusion)

The Causal Transformer converts the image tokens output by the encoder into a fixed number of embeddings. The only difference from the [BLIP] Q-Former is that its self-attention is causal, which better suits autoregressive training.
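A minimal sketch of that difference, assuming a single block and illustrative sizes (module names, dims, and the one-layer depth are my own, not the paper's code): learned queries cross-attend into the image tokens, while their self-attention uses a causal mask so query i only sees queries up to i.

```python
import torch
import torch.nn as nn

class CausalResampler(nn.Module):
    """Q-Former-style block, but with a causal mask on query self-attention.
    Hypothetical re-implementation; the real model stacks several such blocks."""
    def __init__(self, dim=1024, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ln1, self.ln2, self.ln3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        # Causal mask: True = blocked, so query i attends only to queries <= i.
        mask = torch.triu(torch.ones(n_queries, n_queries, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, image_tokens):  # image_tokens: (B, n_img_tokens, dim)
        B = image_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Causal self-attention over the learned queries (the only change vs. Q-Former).
        x = q + self.self_attn(self.ln1(q), self.ln1(q), self.ln1(q),
                               attn_mask=self.causal_mask)[0]
        # Cross-attention into the (frozen) image encoder's output tokens.
        x = x + self.cross_attn(self.ln2(x), image_tokens, image_tokens)[0]
        x = x + self.ffn(self.ln3(x))
        return x  # (B, n_queries, dim): a fixed number of visual embeddings
```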

Visual Decoder: a latent-diffusion decoder that takes the LLM's output visual embeddings as its condition (in place of text embeddings).
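A hedged sketch of what "as condition" means in practice, using the diffusers UNet interface: the regressed visual embeddings stand in for CLIP text embeddings in the UNet's cross-attention. The 5120-dim input and the linear projection are my assumptions for dimension matching; the paper instead fine-tunes the decoder to accept these embeddings.

```python
import torch
from diffusers import UNet2DConditionModel

# Stable Diffusion UNet; we only swap what feeds its cross-attention.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

# Hypothetical projection from an assumed LLM hidden size (5120, LLaMA-13B)
# down to the UNet's cross-attention dim (768 for SD 1.x).
proj = torch.nn.Linear(5120, unet.config.cross_attention_dim)

llm_visual_emb = torch.randn(1, 32, 5120)   # N regressed visual embeddings from the LLM
cond = proj(llm_visual_emb)                 # stands in for CLIP text embeddings

latents = torch.randn(1, 4, 64, 64)         # SD 1.x latent for a 512px image
t = torch.tensor([10])
noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
```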

![[Pasted image 20240317115401.png]] (architecture overview)

Training Objective

Stage 1: unified autoregressive pretraining. One next-step objective over the interleaved sequence: classify the next text token (cross-entropy) or regress the next visual embedding (ℓ2).
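A sketch of that unified objective under my own shape and head assumptions (the paper does not publish this exact function, and the loss weighting is a placeholder): one mask selects text positions for cross-entropy, another selects visual positions for regression.

```python
import torch
import torch.nn.functional as F

def unified_ar_loss(hidden, text_labels, visual_targets, is_text, is_visual,
                    lm_head, vis_head, vis_weight=1.0):
    """One objective over an interleaved sequence (shapes/heads are assumptions):
    - text positions: cross-entropy over the vocabulary for the next token
    - visual positions: L2 regression to the next visual embedding
    hidden:         (B, T, D)   LLM outputs at positions predicting step t+1
    text_labels:    (B, T)      next-token ids, valid where is_text
    visual_targets: (B, T, D_v) next visual embeddings, valid where is_visual
    """
    logits = lm_head(hidden)                            # (B, T, vocab)
    ce = F.cross_entropy(logits[is_text], text_labels[is_text])
    pred_emb = vis_head(hidden)                         # (B, T, D_v)
    reg = F.mse_loss(pred_emb[is_visual], visual_targets[is_visual])
    return ce + vis_weight * reg                        # vis_weight is an assumption
```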

Training Data & Details

Pre-training: end-to-end

Interleaved video and text: thumbnails + subtitles, ordered by timestamp. This is essentially interleaved image-text data; motion information is not modeled (toy construction sketch below).
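A toy construction sketch (the field names are my guesses at the data schema, not the paper's pipeline):

```python
def interleave_video(frames, subtitles):
    """frames: [(t_sec, thumbnail)], subtitles: [(t_sec, text)].
    Merge both streams by timestamp into one interleaved sequence,
    which is then handled exactly like an image-text webpage.
    Toy sketch; a real pipeline also samples and filters frames."""
    events = [(t, "image", img) for t, img in frames] + \
             [(t, "text", txt) for t, txt in subtitles]
    events.sort(key=lambda e: e[0])  # order by timestamp only
    # Only temporal adjacency is kept: no optical flow or motion
    # features, hence "motion is not modeled".
    return [(kind, payload) for _, kind, payload in events]
```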

Detail

Special [IMG] and [/IMG] tokens mark the start and end of each span of image embeddings (layout sketch below).
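A sketch of the resulting sequence layout (the token strings and N=32 are illustrative; the real model inserts continuous embeddings, not token ids, at the visual slots):

```python
IMG, IMG_END = "[IMG]", "[/IMG]"
N_VISUAL = 32                      # embeddings per image (illustrative)

def layout(chunks):
    """chunks: list of ("text", str) or ("image", _) pieces.
    Returns a flat symbolic sequence; <vis_i> marks a slot that is
    filled with the i-th continuous visual embedding."""
    seq = []
    for kind, payload in chunks:
        if kind == "text":
            seq.extend(payload.split())  # stand-in for a real tokenizer
        else:
            seq += [IMG] + [f"<vis_{i}>" for i in range(N_VISUAL)] + [IMG_END]
    return seq

print(layout([("text", "a cat"), ("image", None), ("text", "sits here")]))
```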

Eval

Zero-shot

Multimodal understanding, two evaluation protocols:

Text-to-image generation on the MS-COCO validation set (FID); it does not match models such as SD, Imagen, and Parti.
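For reference, a generic zero-shot FID harness with torchmetrics; this is not the paper's exact protocol (COCO zero-shot FID typically samples 30k captions), and the random tensors below stand in for real and generated images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # InceptionV3 pool features

# Dummy uint8 batches; in practice: COCO val images vs. model samples.
real_images = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
generated = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(generated, real=False)
print(f"FID: {fid.compute().item():.2f}")
```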

Few-shot (in-context learning)

VQA, VideoQA

Qualitative

real-world knowledge grounding, interleaved multi-image understanding, detailed video understanding, multimodal assistant, multi-turn dialogue, image blending, and (in-context) text-to-image generation

Abstract

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context. This omnivore model can take in any single-modality or multimodal data input indiscriminately (e.g., interleaved image, text and video) through a one-model-for-all autoregressive training process. First, visual signals are encoded into embeddings, and together with text tokens form an interleaved input sequence. Emu is then end-to-end trained with a unified objective of classifying the next text token or regressing the next visual embedding in the multimodal sequence. This versatile multimodality empowers the exploration of diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, as well as web-scale image-text pairs and video-text pairs. Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks, and supports in-context image and text generation. Across a broad range of zero-shot/few-shot tasks including image captioning, visual question answering, video question answering and text-to-image generation, Emu demonstrates superb performance compared to state-of-the-art large multimodal models. Extended capabilities such as multimodal assistants via instruction tuning are also demonstrated with impressive performance.