Technology Sharing

Vision-Language Models: The Future of the Fusion of Vision and Language

2024-07-11


1. Overview

Vision-Language Models (VLMs) are artificial intelligence models that can process and understand both visual (image) and linguistic (text) information. They combine techniques from computer vision and natural language processing, which allows them to perform well on complex tasks such as visual question answering, image caption generation, and text-to-image search. VLMs build on the successful application of the Transformer architecture to computer vision; in particular, they replace the global image feature extraction of traditional CNNs with attention mechanisms. Vision-language models have shown great potential in many fields, including image retrieval, generative AI, image segmentation, medical diagnosis, and robotics. These models not only improve the performance of AI systems but also open up new possibilities for building smarter and more efficient applications.

2. Vision Transformer

The Vision Transformer (ViT) works by splitting an image into small patches and feeding the embedded patches into a Transformer encoder to obtain a global image representation. Each image patch is treated as an independent "word" and processed through the self-attention mechanism. Compared with traditional convolutional neural networks (CNNs), Vision Transformers perform well on large datasets and high-resolution images, and they surpass many advanced CNN architectures on image classification tasks.
The structure of a simple Vision Transformer is shown below.
[Figure: architecture of a simple Vision Transformer]
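To make the patch-splitting step concrete, here is a minimal PyTorch sketch of a ViT-style front end: a strided convolution splits the image into patches and projects them, a class token and position embeddings are added, and the token sequence goes through a small Transformer encoder. All sizes (224x224 input, 16x16 patches, 768-dimensional embeddings, 2 encoder layers) are common illustrative choices, not values taken from this article.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits the image into patches and projects
        # each patch to the embedding dimension in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                      # x: (B, 3, 224, 224)
        x = self.proj(x)                       # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)       # (B, 196, 768) -- one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend the class token
        return x + self.pos_embed              # add learned position embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True), num_layers=2)
features = encoder(tokens)                     # (2, 197, 768) global image representation
```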

3. Architecture of Vision-Language Models
3.1 Contrastive Learning

Contrastive learning is a technique that learns representations of data points by contrasting them with one another. The method computes a similarity score between data instances and aims to minimize a contrastive loss. It is most useful in semi-supervised settings, where a small number of labeled samples guide the optimization process so the model can label unseen data points.
For example, one way to understand what a cat looks like is to compare it with similar images of cats and dogs. A contrastive learning model learns to distinguish cats from dogs by identifying features such as facial structure, body size, and fur. The model can determine which image is closer to the original image (called the "anchor") and predict its category. CLIP is a typical model trained with contrastive learning. It achieves zero-shot prediction by computing the similarity between text and image embeddings: it first trains a text encoder and an image encoder jointly, then converts the class names of the target dataset into captions and estimates the best caption for a given input image. Below is the architecture of the CLIP model:
[Figure: CLIP architecture]
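The training objective behind this setup can be written in a few lines. Below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive loss: image and text embeddings are normalized, a similarity matrix is built, and matching pairs on the diagonal are pulled together while mismatched pairs are pushed apart. The encoders are replaced by random stand-in embeddings, and the temperature value is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # i-th image matches i-th text
    # Symmetric cross-entropy over rows (image->text) and columns (text->image).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

image_emb = torch.randn(8, 512)   # stand-in image encoder output
text_emb = torch.randn(8, 512)    # stand-in text encoder output
loss = clip_contrastive_loss(image_emb, text_emb)
```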

3.2 Prefix Language Model (PrefixLM)

A prefix language model is pre-trained by taking a partial text (the prefix) as input and predicting the next words in the sequence. In vision-language models, PrefixLM enables the model to predict the next sequence of words given an image and its corresponding prefix text. It uses a Vision Transformer (ViT) to split the image into patches that are flattened into a 1D sequence, each patch representing a local image region. The model then applies a convolution or linear projection to these patches to produce contextualized visual embeddings. For the text modality, the prefix accompanying the image is converted into token embeddings. The Transformer's encoder-decoder blocks receive both the visual embeddings and the token embeddings. SimVLM is a popular architecture that uses the PrefixLM learning approach. Here is its architecture:
[Figure: SimVLM architecture]
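A rough PyTorch sketch of the PrefixLM masking idea follows. It is written in the spirit of the description above rather than as SimVLM's actual implementation: patch embeddings form a prefix that every position may attend to, while the text part is predicted autoregressively under a causal mask. All dimensions and the stand-in embeddings are illustrative assumptions.

```python
import torch
import torch.nn as nn

B, n_patches, n_text, d, vocab = 2, 196, 16, 512, 30000

patch_emb = torch.randn(B, n_patches, d)            # visual prefix (from a ViT-style encoder)
text_tokens = torch.randint(0, vocab, (B, n_text))
token_emb = nn.Embedding(vocab, d)(text_tokens)     # text continuation embeddings

x = torch.cat([patch_emb, token_emb], dim=1)        # [visual prefix ; text tokens]

# Prefix-LM attention mask: every position sees the full visual prefix,
# while text positions cannot see future text positions.
L = n_patches + n_text
mask = torch.zeros(L, L)
mask[:, n_patches:] = float("-inf")                  # block attention to text by default
mask[n_patches:, n_patches:] = torch.triu(           # causal mask within the text block
    torch.full((n_text, n_text), float("-inf")), diagonal=1)

layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
hidden = nn.TransformerEncoder(layer, num_layers=2)(x, mask=mask)

# Predict the next text token at each text position.
logits = nn.Linear(d, vocab)(hidden[:, n_patches:])  # (B, n_text, vocab)
```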

3.3 Frozen PrefixLM

The frozen PrefixLM approach keeps a pre-trained language model frozen and updates only the parameters of the image encoder. Typical examples are the Frozen and Flamingo architectures. The Frozen architecture uses a pre-trained language model together with a visual encoder; by fine-tuning the image encoder, its image representations are aligned with the language model's text embeddings. The Flamingo architecture combines a CLIP-like vision encoder with a large language model (LLM) and interleaves images with text, supporting rapid few-shot reasoning. Below is a typical Frozen PrefixLM network architecture.

[Figure: a typical Frozen PrefixLM architecture]
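The key training detail, freezing the language model and updating only the image encoder, can be sketched in a few lines of PyTorch. The modules below are simple stand-ins for a pre-trained LM and a visual encoder, not the actual Frozen or Flamingo code.

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained language model (kept frozen).
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
# Stand-in trainable visual encoder producing a sequence of "visual tokens".
image_encoder = nn.Conv2d(3, 512, kernel_size=32, stride=32)

# Freeze every LM parameter; gradients flow only into the image encoder.
for p in language_model.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(image_encoder.parameters(), lr=1e-4)

images = torch.randn(4, 3, 224, 224)
feat = image_encoder(images)                     # (4, 512, 7, 7)
visual_prefix = feat.flatten(2).transpose(1, 2)  # (4, 49, 512) visual tokens
text_emb = torch.randn(4, 16, 512)               # stand-in text embeddings
hidden = language_model(torch.cat([visual_prefix, text_emb], dim=1))
```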

3.4 Cross-Attention

Cross-attention is a method that fuses information from different modalities (such as text, images, and audio) through a cross-modal attention mechanism. The cross-attention fusion approach learns visual representations by adding cross-attention layers. Specifically, it lets the features of one modality (such as text) attend to the features of another modality (such as images), which improves the model's ability to understand and process several kinds of information together. This mechanism can significantly improve results on tasks that require handling multiple data types at the same time. Below is a schematic diagram of the cross-attention architecture:
[Figure: cross-attention architecture]
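As a concrete illustration, here is a minimal PyTorch sketch of a cross-attention fusion layer in which text features act as queries and image features act as keys and values, so each text token gathers the visual information most relevant to it. The dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # Query = text, Key/Value = image: text attends over image regions.
        fused, attn_weights = self.attn(text_feats, image_feats, image_feats)
        return self.norm(text_feats + fused), attn_weights   # residual connection

text_feats = torch.randn(2, 16, 512)    # (batch, text tokens, dim)
image_feats = torch.randn(2, 196, 512)  # (batch, image patches, dim)
fused, weights = CrossAttentionFusion()(text_feats, image_feats)
```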

4. Datasets for Vision-Language Models
4.1 LAION-5B

The LAION-5B dataset contains more than 5 billion CLIP-filtered image-text pairs and is used to train large pre-trained models.
https://laion.ai/blog/laion-5b/

4.2 PMD

The PMD (Public Multimodal Datasets) dataset combines several large public datasets and contains roughly 70 million image-text pairs.
https://huggingface.co/datasets/facebook/pmd

4.3 VQA

The VQA dataset is used for visual question answering and visual reasoning tasks; it contains more than 200,000 images, each paired with multiple questions and corresponding answers.
https://visualqa.org/

4.4 ImageNet

The ImageNet dataset contains more than 14 million annotated images suitable for image classification and object recognition tasks.
https://www.image-net.org/

5. Applications of Vision-Language Models
5.1 Image Retrieval

With a vision-language model, users can retrieve relevant images using natural-language queries.
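Below is a small sketch of language-driven image retrieval with a dual-encoder VLM. It assumes the Hugging Face transformers library and the public openai/clip-vit-base-patch32 checkpoint, and the image file names are placeholders; any CLIP-like model can be used the same way: embed the query and the candidate images, then rank the images by similarity.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder image files standing in for an image collection.
images = [Image.open(p).convert("RGB") for p in ["cat.jpg", "dog.jpg", "car.jpg"]]
query = "a photo of a sleeping cat"

inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the query to every candidate image.
scores = out.logits_per_text[0]
best = scores.argmax().item()
print(f"best match: image {best}, score {scores[best]:.2f}")
```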

5.2 Generative AI

Generative AI allows users to create images from text descriptions and is used in areas such as design and content creation; Stable Diffusion (SD) is a typical example of such a product.
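As a concrete example, the sketch below generates an image from a text prompt with Stable Diffusion. It assumes the diffusers library, a CUDA GPU, and the public runwayml/stable-diffusion-v1-5 checkpoint; the prompt and sampler settings are illustrative.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate one image from a text description.
image = pipe("a watercolor painting of a lighthouse at sunrise",
             num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```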

5.3 Image Segmentation

VLMs can be used for instance, panoptic, and semantic segmentation tasks, and can annotate images by understanding user prompts.
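One way to drive segmentation with text prompts is a CLIP-based model such as CLIPSeg. The sketch below assumes the Hugging Face transformers library and the public CIDAS/clipseg-rd64-refined checkpoint; the input image and the prompts are placeholders. The model predicts one mask per text prompt.

```python
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

image = Image.open("street.jpg").convert("RGB")  # placeholder input image
prompts = ["a car", "a pedestrian", "the road"]  # one mask is predicted per prompt

inputs = processor(text=prompts, images=[image] * len(prompts),
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits              # (num_prompts, H, W) mask logits

masks = torch.sigmoid(logits)                    # per-prompt soft masks in [0, 1]
```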