
BEiT V1 Paper Notes

paper reference

note reference 1: BEiT

note reference 2: Self-supervised learning

Self-supervised learning

  1. Overall impression: unsupervised pretraining followed by supervised fine-tuning
  2. Goal: learn a set of general feature representations during the pretraining phase, which are then refined with labeled data in the downstream tasks.
  3. Masked Image Modeling (MIM) methods are proposed for self-supervised visual representation learning

BEiT: BERT Pre-Training of Image Transformers

BEiT Architecture:

[Figure: BEiT pre-training architecture overview]

  1. Overall Approach

    1. Pre-training task: MIM (Masked Image Modeling) -> inspired by BERT

      During pre-training, a proportion of the image patches is randomly masked and the corrupted input is fed to the Transformer. The model learns to recover the visual tokens of the original image, rather than the raw pixels of the masked patches.

      MIM uses two views of each image

      • image patches
      • visual tokens
    2. The input of BEiT: the image is split into a grid of patches, which form the input representation of the backbone Transformer

    3. The approach to tokenize an image: latent codes of a discrete VAE (the tokenizer from DALL·E)

    4. The goal of pretraining: reinforce the model’s capacity to capture generic visual features.

  2. Introduction of two image representations

    1. Image patches

      [Figure: splitting an image into patches]

      1. The 2D image of size H×W×C is split into a sequence of patches of size P×P, so the number of patches is N = HW/P²

        image -> patches

      2. The image patches x^p are flattened into vectors and linearly projected, which is similar to word embeddings in BERT

        patches -> patch embeddings

    2. Visual tokens

      [Figure: representing an image as visual tokens]

      1. The original image is represented as a sequence of tokens obtained by an image tokenizer, instead of raw pixels

        Visual tokens are discrete token indices; the corresponding patches can be recovered by looking the indices up in a visual codebook

      2. The image tokenizer learned by a discrete variational autoencoder (dVAE) in DALL·E is used directly (an illustrative tokenizer sketch is given after this section).

        Learning process:

        The dVAE is trained with two modules: a tokenizer that maps the image to discrete visual tokens, and a decoder that reconstructs the image from those tokens.

        [Figure: dVAE tokenizer and decoder training]

        The visual codebook contains embedding vectors (codewords) that represent various image patches.

  3. ViT Backbone

    1. Following ViT, a standard Transformer is used as the backbone network

      ViT-Base is used: a 12-layer Transformer with hidden size 768 and 12 attention heads. The intermediate size of the feed-forward networks is 3072 (a code sketch of the backbone follows this section).

    2. The formula and architecture are shown below:

      [Formula: input representation H_0 and the layer-wise Transformer update]

      [Figure: ViT backbone architecture]
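
Putting the two input representations and the backbone together, below is a minimal PyTorch sketch of the patch-embedding step and a ViT-Base-sized encoder (patch size 16, hidden size 768, 12 layers, 12 heads, feed-forward size 3072). The class name `ViTBackbone` and the Conv2d patchify trick are illustrative assumptions, not BEiT's released code; the [CLS] token and learnable position embeddings follow the standard ViT recipe.

```python
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """Sketch of a ViT-Base-style backbone: patchify -> linear projection -> Transformer."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 dim=768, depth=12, heads=12, ffn_dim=3072):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2            # N = HW / P^2
        # Patch embedding: a Conv2d with kernel = stride = P flattens and projects
        # every PxPxC patch to a `dim`-dimensional vector in one shot.
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))        # e_[CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))  # E_pos
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=ffn_dim,
                                           activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.patch_embed(x)                  # (B, dim, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)         # (B, N, dim) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed   # H_0 = [e_[CLS]; patches] + E_pos
        return self.encoder(x)                   # H_L: final hidden states, (B, N+1, dim)
```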
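
The visual-token side can be illustrated in the same spirit. BEiT reuses DALL·E's tokenizer off the shelf, so the toy encoder, the codebook size of 8192, and the nearest-codeword lookup below are assumptions chosen only to show how an image becomes a grid of discrete indices into a visual codebook.

```python
import torch
import torch.nn as nn

class ToyImageTokenizer(nn.Module):
    """Illustrative tokenizer: encode the image into a latent grid, then map each
    latent vector to the index of its nearest codebook entry (the visual token)."""
    def __init__(self, vocab_size=8192, latent_dim=256):
        super().__init__()
        # Hypothetical convolutional encoder producing one latent per 16x16 patch position.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(128, latent_dim, kernel_size=4, stride=4),
        )
        # Visual codebook: vocab_size learnable codewords of dimension latent_dim.
        self.codebook = nn.Embedding(vocab_size, latent_dim)

    @torch.no_grad()
    def tokenize(self, x):                               # x: (B, 3, H, W)
        z = self.encoder(x)                              # (B, D, h, w)
        z = z.flatten(2).transpose(1, 2)                 # (B, h*w, D)
        codewords = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        dist = torch.cdist(z, codewords)                 # distance to every codeword
        return dist.argmin(dim=-1)                       # (B, h*w) discrete visual token ids
```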

BEiT Pretraining

  1. MIM

    [Figure: masked image modeling pre-training setup]

    1. As described above, approximately 40% of the image patches are randomly masked; the set of masked positions is denoted as M. The masked patches are replaced with a learnable embedding e[M].

    2. After an L-layer Transformer, a softmax classifier is used to predict the corresponding visual tokens

      [Formula: softmax classifier over the visual-token vocabulary]

    3. Pre-training objective: maximize the log-likelihood of the correct visual tokens z_i given the corrupted image (a code sketch of this objective is given at the end of this section)

      [Formula: MIM pre-training objective]

  2. Blockwise Masking

    Instead of masking patches independently, whole blocks of neighboring patches are masked together until roughly 40% of the patches are covered (see the masking sketch at the end of this section).

    [Figure: blockwise masking procedure]

  3. Why dVAE beats VAE

    [Figure: pixel-level recovery vs. discrete visual tokens]

    VAE: directly uses pixel-level auto-encoding (a continuous latent space) -> the model focuses on short-range dependencies and high-frequency details

    dVAE: predicts discrete visual tokens, which summarize the image at a higher level of abstraction and avoid wasting capacity on low-level pixel details
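
To tie the MIM objective above to code, here is a minimal sketch of the loss, assuming patch embeddings, target visual tokens, and a boolean mask produced by components like the sketches after the architecture section. The function name and argument layout are hypothetical, and the [CLS] token and position embeddings are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def mim_loss(patch_embeds, visual_tokens, mask, mask_embed, encoder, classifier):
    """patch_embeds: (B, N, D) linearly projected image patches
    visual_tokens: (B, N) target token ids produced by the image tokenizer
    mask:          (B, N) boolean, True at the ~40% masked positions
    mask_embed:    (D,) learnable embedding e[M]
    encoder:       L-layer Transformer over (B, N, D) sequences
    classifier:    nn.Linear(D, vocab_size), the softmax classifier head."""
    # Corrupt the input: masked positions are replaced by the mask embedding.
    x = torch.where(mask.unsqueeze(-1), mask_embed, patch_embeds)
    h = encoder(x)                                    # final hidden states (B, N, D)
    logits = classifier(h[mask])                      # predict only the masked positions
    # Maximizing the log-likelihood of the correct visual tokens == minimizing cross-entropy.
    return F.cross_entropy(logits, visual_tokens[mask])
```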
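
Blockwise masking can be sketched as repeatedly sampling rectangular blocks of patches (random area and aspect ratio) and accumulating them until roughly 40% of the h×w grid positions are masked. The bounds used below (minimum block of 16 patches, aspect ratio up to 3) follow the spirit of the paper's procedure, but the helper itself is an illustrative assumption rather than the released implementation.

```python
import math
import random

def blockwise_mask(h, w, mask_ratio=0.4, min_block=16, max_aspect=3.0):
    """Return an h x w boolean grid with roughly mask_ratio of patches masked in blocks."""
    target = int(mask_ratio * h * w)
    mask = [[False] * w for _ in range(h)]
    masked = 0
    while masked < target:
        area = random.randint(min_block, max(min_block, target - masked))  # block area in patches
        ratio = random.uniform(1.0 / max_aspect, max_aspect)               # block aspect ratio
        bh = max(1, min(h, int(round(math.sqrt(area * ratio)))))           # block height
        bw = max(1, min(w, int(round(math.sqrt(area / ratio)))))           # block width
        top = random.randint(0, h - bh)
        left = random.randint(0, w - bw)
        for i in range(top, top + bh):                                     # add the block to M
            for j in range(left, left + bw):
                if not mask[i][j]:
                    mask[i][j] = True
                    masked += 1
    return mask
```

For a 224×224 image with 16×16 patches, this would be called with h = w = 14, i.e. a grid of N = 196 patches.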