
BEiT V2 Paper Notes

References: the BEiT V2 paper; a paper-note blog post

BEiT V2 architecture

  1. Overall impression: two-stage training

    1. input image -> visual tokens -> semantic visual representation (approach: VQ-KD)
    2. masked image -> predict the stage-1 visual tokens (approach: MIM)
  2. VQ-KD (Vector-Quantized Knowledge Distillation) process

    [Figure 1: the VQ-KD training process]

    1. Visual codebook: given the visual features extracted from a dataset of images, we can cluster them (e.g. with K-means) into a number of prominent categories. The number of categories determines the size of the visual codebook, and each category's centroid represents one visual word in the dictionary.
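To make the clustering framing concrete, here is a minimal K-means sketch in numpy. Note this is only illustrative of the codebook idea described above: in BEiT V2 itself the codebook is learned end-to-end through VQ-KD, not built offline. All names and shapes below are made up.

```python
import numpy as np

def build_codebook(features, K, iters=10, seed=0):
    """Cluster feature vectors into K centroids; the centroids play
    the role of the visual words in the codebook (toy K-means only)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from K distinct feature vectors
    centroids = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest centroid
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        assign = dists.argmin(axis=1)
        # recompute each centroid as the mean of its assigned features
        for k in range(K):
            if (assign == k).any():
                centroids[k] = features[assign == k].mean(axis=0)
    return centroids

# toy usage: 100 random 16-dim "patch features", codebook of 8 visual words
feats = np.random.default_rng(1).normal(size=(100, 16))
codebook = build_codebook(feats, K=8)
print(codebook.shape)  # (8, 16)
```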

    2. Tokenizer encoder

      1. ViT: encode the image patches into feature-vector representations (in the figure above, N is the number of patches and D is the vector dimension)

      2. Nearest Neighbor Lookup: L2-normalize the encoder outputs $\{h_i\}_{i=1}^N$ and the codebook embeddings $\{e_j\}_{j=1}^K$, then assign each patch the code with the minimum distance between the normalized vectors (equivalently, the maximum cosine similarity).

        $z_i = \arg\min_j \lVert \ell_2(h_i) - \ell_2(e_j) \rVert_2$
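The lookup step can be sketched as follows. This is a numpy illustration only; the shapes (N=4 patches, K=16 codes, D=8 dims) are made up. Because both sides are unit vectors, minimizing the L2 distance is the same as maximizing cosine similarity ($\lVert a-b \rVert^2 = 2 - 2\,a\cdot b$ for unit $a,b$).

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def nearest_code(h, codebook):
    """Return the index z_i of the closest codebook embedding for each
    patch feature h_i, after L2-normalizing both sides."""
    hn = l2_normalize(h)            # (N, D)
    en = l2_normalize(codebook)     # (K, D)
    # squared distances between all (patch, code) pairs
    d = ((hn[:, None] - en[None]) ** 2).sum(-1)   # (N, K)
    return d.argmin(axis=1)                        # (N,)

h = np.random.default_rng(0).normal(size=(4, 8))
codebook = np.random.default_rng(1).normal(size=(16, 8))
z = nearest_code(h, codebook)
print(z.shape)  # (4,)
```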

    3. Decoder

      The decoder takes the nearest codebook embedding of each image patch as input, and its output is a semantic reconstruction of these visual tokens.

    4. Optimization target (teacher model)

      VQ-KD borrows the feature-learning strategy of model distillation: it uses CLIP or DINO as the teacher model, and the features produced by the teacher serve as the optimization objective when training the tokenizer.
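A common way to express "reconstruct the teacher's features" is to maximize the cosine similarity between the decoder outputs and the frozen teacher features. The sketch below is a hypothetical numpy version of such an objective (the function name `vqkd_recon_loss` and all shapes are assumptions, not the paper's code):

```python
import numpy as np

def l2n(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def vqkd_recon_loss(decoder_out, teacher_feats):
    """Sketch of a VQ-KD-style reconstruction objective: minimize the
    negative mean cosine similarity between decoder outputs and the
    frozen teacher (e.g. CLIP/DINO) features, per patch."""
    cos = (l2n(decoder_out) * l2n(teacher_feats)).sum(-1)  # (N,)
    return -cos.mean()

out = np.ones((4, 8))
loss_match = vqkd_recon_loss(out, out)  # identical features -> best case
print(round(float(loss_match), 6))  # -1.0
```

A perfect reconstruction gives cosine similarity 1 per patch, hence a loss of -1; orthogonal or opposite features are penalized accordingly.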

    5. Gradient backpropagation

      The arg min operation is non-differentiable. In order to backpropagate gradients to the encoder, VQ-KD adopts the approach proposed in VQ-VAE, which directly copies the gradients from the decoder's input to the encoder's output (indicated by the red arrow in Figure 1), since their optimization directions align.
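A minimal numerical illustration of this straight-through gradient copy, with toy numbers and no autograd framework (the manual backward pass below just demonstrates the copy, it is not the paper's implementation):

```python
import numpy as np

# toy setup: encoder output h, its nearest codebook embedding e
h = np.array([0.2, 0.9])
e = np.array([0.0, 1.0])          # quantize(h) -> e (argmin lookup)

# forward: the decoder sees the quantized vector
decoder_w = np.array([1.5, -0.5])
y = decoder_w @ e                  # scalar decoder output

# backward: with dL/dy = 1, the gradient at the decoder input is decoder_w
grad_decoder_input = decoder_w * 1.0

# straight-through estimator: the argmin lookup has no gradient, so the
# decoder-input gradient is copied directly to the encoder output, as if
# quantization were the identity function
grad_h = grad_decoder_input.copy()
print(np.allclose(grad_h, grad_decoder_input))  # True
```

In an autograd framework such as PyTorch, the same trick is commonly written as `quantized = h + (e - h).detach()`, so the forward pass uses `e` while gradients flow to `h`.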

  3. BEiT V2 pretraining (MIM) (summarized from the reference blog post)

    [Figure: BEiT V2 pretraining overview]

    1. MIM

      1. a [CLS] token is included and trained to aggregate global information

      2. The left part of Figure 3 shows the MIM pretraining process.

        [Figure 3: MIM pretraining]
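Since stage 1 produces a discrete token id per patch, the MIM objective reduces to a classification loss: for each masked patch, predict a distribution over the K visual tokens and compare it against the tokenizer's token id. A toy numpy sketch, assuming a logits head over the codebook (all names and shapes are hypothetical):

```python
import numpy as np

def mim_loss(logits, target_tokens, mask):
    """Cross-entropy over the visual-token vocabulary, computed only at
    the masked patch positions (unmasked patches contribute no loss)."""
    # numerically stable softmax over the codebook dimension
    ex = np.exp(logits - logits.max(-1, keepdims=True))
    probs = ex / ex.sum(-1, keepdims=True)          # (N, K)
    nll = -np.log(probs[np.arange(len(logits)), target_tokens])
    return nll[mask].mean()                          # masked patches only

logits = np.zeros((4, 8))          # uniform predictions over 8 tokens
targets = np.array([1, 2, 3, 4])
mask = np.array([True, True, False, False])
print(round(float(mim_loss(logits, targets, mask)), 4))  # log(8) ~= 2.0794
```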

    2. [CLS] token pretraining

      [Figure: [CLS] token pretraining]
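Roughly, the paper's idea here is to concatenate the final-layer [CLS] token with patch representations from an intermediate layer, feed the combined sequence to a shallow decoder, and apply a second MIM loss on that branch, which forces global information to flow through [CLS]. A toy sketch of the concatenation step only (shapes assumed: 14x14 = 196 patches, feature dim 32):

```python
import numpy as np

# assumed shapes: intermediate-layer patch features (N, D) and the
# final-layer [CLS] vector (D,)
patch_feats = np.random.default_rng(0).normal(size=(196, 32))
cls_vec = np.random.default_rng(1).normal(size=(32,))

# prepend [CLS] to the intermediate patch sequence; this combined
# sequence goes through a shallow decoder trained with a second MIM
# loss, so the [CLS] token must carry globally useful information
combined = np.concatenate([cls_vec[None, :], patch_feats], axis=0)
print(combined.shape)  # (197, 32)
```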