Short intro to VQ VAEs
VAEs
- Autoencoders, but with a “restricted” latent space
- We make sure that the latent space follows a known probability distribution, i.e. we place a prior on the latent space (typically a standard Gaussian)
- Why is this done? Without the prior, the encoder is free to scatter embeddings arbitrarily far away from each other. “Restricting” the latent space forces the encoder to group similar inputs close to each other
- The latent embeddings we get are smooth (continuous). Hence we can’t directly use them as tokens in generative transformer models
Specifically, a VAE learns to capture the high-level features explicitly, and then approximates the remaining variation with random sampling. A minimal sketch of this encode–sample–decode step is below.
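
A minimal sketch of the VAE bottleneck, assuming a PyTorch setup with hypothetical layer sizes (not from the references above): the encoder outputs a mean and log-variance, and the reparameterization trick lets us sample a latent while keeping gradients flowing.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    # Hypothetical sizes for illustration: 784-dim input (e.g. a flattened 28x28 image), 32-dim latent.
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.to_logvar = nn.Linear(256, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.decoder(z)
        # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and the prior p(z) = N(0, I)
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1).mean()
        return x_hat, kl
```

The training loss is then a reconstruction term on `x_hat` plus the `kl` term, which is what keeps the latent space close to the prior.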
VQ VAEs
- This flavour implements a vocabulary (or codebook) for the latent space.
- Every embedding from the encoder is matched to its nearest neighbour in the codebook. The codebook itself is also learned (see the sketch after this list)
- Example: take an image which is encoded to a 32x32 grid of latents. Every cell in this grid is assigned one of 512 codebook vectors.
- In total, we would get $512^{32 \times 32}$ possible encodings
- This makes it possible to tokenize the whole image
- We train by maximizing the ELBO (Evidence Lower Bound). But what is the ELBO?
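
A minimal sketch of the quantization step referenced above, assuming a PyTorch setup with a hypothetical codebook of 512 entries (the 64-dim code size is an illustrative choice, not from the note): each encoder output is snapped to its nearest codebook vector, and the straight-through estimator copies gradients past the non-differentiable lookup.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    # Hypothetical sizes for illustration: 512 codebook entries, each a 64-dim vector.
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e):
        # z_e: encoder outputs of shape (batch, num_cells, code_dim), e.g. 32*32 cells per image
        codes = self.codebook.weight.unsqueeze(0).expand(z_e.size(0), -1, -1)
        dists = torch.cdist(z_e, codes)            # L2 distance to every codebook entry
        indices = dists.argmin(dim=-1)             # discrete token ids, shape (batch, num_cells)
        z_q = self.codebook(indices)               # quantized vectors
        # Straight-through estimator: forward pass uses z_q, backward passes gradients to z_e
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices
```

The `indices` tensor is what makes the image “tokenizable”: each image becomes a grid of integer code ids that a transformer can model the same way it models text.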
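As a quick pointer for the ELBO question above (this is the standard variational bound, not something specific to these notes): for an encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and prior $p(z)$,

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

i.e. a reconstruction term minus a KL term that pulls the latents toward the prior; maximizing this lower bound is what VAE training does.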
References
- https://www.compthree.com/blog/autoencoder/
- https://lilianweng.github.io/posts/2018-08-12-vae/
- https://jaan.io/what-is-variational-autoencoder-vae-tutorial/
- https://github.com/MishaLaskin/vqvae/tree/master
- https://github.com/adam-maj/deep-learning/blob/main/05-image-generation/02-vae/04-vae.ipynb