Short intro to VQ VAEs
VAEs
- Autoencoders, but with a “restricted” latent space
- We make sure that the latent space follows a known probability distribution, i.e. we place a prior on the latent space (typically a standard Gaussian)
- Why is this done? Without the prior, the encoder is free to scatter embeddings arbitrarily far away from each other. “Restricting” the latent space forces the encoder to group similar inputs close to each other
- The latent embeddings we get are smooth (continuous). Hence we can’t directly use them as tokens in generative transformer models
Specifically, a VAE learns to capture the high-level features explicitly, and then approximates the remaining variation with random sampling. A minimal sketch of this encode–sample–decode step is below.
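
A minimal sketch of the VAE bottleneck, assuming a PyTorch setup with hypothetical layer sizes (not from the references above): the encoder outputs a mean and log-variance, and the reparameterization trick lets us sample a latent while keeping gradients flowing.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Hypothetical sizes for illustration: 784-dim input (e.g. a flattened 28x28 image), 32-dim latent.
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z)
        # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior p(z) = N(0, I)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return x_hat, kl
```

The training loss is then a reconstruction term on `x_hat` plus the `kl` term, which is what keeps the latent space close to the prior.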
VQ VAEs
- This flavour implements a vocabulary (or codebook) for the latent space.
- Every embedding from the encoder is matched to its nearest neighbour in the codebook. The codebook itself is also learned (see the sketch after this list)
- Example: take an image which is encoded to a 32x32 grid of latents. Every cell in this grid is assigned one of 512 codebook vectors.
- In total, we would get $512^{32 \times 32}$ possible encodings
- This makes it possible to tokenize the whole image
- We train by maximizing the ELBO (Evidence Lower Bound). But what is the ELBO?
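
A minimal sketch of the quantization step referenced above, assuming a PyTorch setup with a hypothetical codebook of 512 entries (the 64-dim code size is an illustrative choice, not from the note): each encoder output is snapped to its nearest codebook vector, and the straight-through estimator copies gradients past the non-differentiable lookup.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    # Hypothetical sizes for illustration: 512 codebook entries, each a 64-dim vector.
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):
        # z_e: encoder outputs of shape (batch, num_cells, code_dim), e.g. 32*32 cells per image
        codes = self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1)
        dists = torch.cdist(z_e, codes)            # L2 distance to every codebook entry
        indices = dists.argmin(dim=-1)             # discrete token ids, shape (batch, num_cells)
        z_q = self.codebook(indices)               # quantized vectors
        # Straight-through estimator: forward pass uses z_q, backward passes gradients to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```

The `indices` tensor is what makes the image “tokenizable”: each image becomes a grid of integer code ids that a transformer can model the same way it models text.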
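As a quick pointer for the ELBO question above (this is the standard variational bound, not something specific to these notes): for an encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$,

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

i.e. a reconstruction term minus a KL term that pulls the latents toward the prior; maximizing this lower bound is what VAE training does.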
References
- https://www.compthree.com/blog/autoencoder/
- https://lilianweng.github.io/posts/2018-08-12-vae/
- https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- https://github.com/MishaLaskin/vqvae/tree/master
- https://github.com/adam-maj/deep-learning/blob/main/05-image-generation/02-vae/04-vae.ipynb