2023-10-01 ~1 min read

Word Embeddings

Word2Vec

Softmax squashes an arbitrary range of real values into a probability distribution: every output lies in $[0, 1]$ and the outputs sum to 1
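A minimal sketch of softmax in plain Python, to make the "squashing" concrete. The function name `softmax` and the example scores are illustrative, not taken from any particular library.

```python
import math

def softmax(scores):
    """Map a list of real-valued scores to a probability distribution."""
    # Subtracting the max avoids overflow in exp(); it does not change the result.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative scores: larger scores get larger probabilities.
probs = softmax([2.0, 1.0, -1.0])
```

Each entry of `probs` is between 0 and 1, the entries sum to 1, and the ordering of the input scores is preserved.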
Word2Vec actually trains two vectors for each word: one used when the word appears as the centre word and one used when it appears as a context (outside) word. The final embedding is the average of the two
Word2Vec is a bag-of-words model -> it doesn't care about the order of the words in the context
Word2Vec has two variants
1. Skip-gram - Predict the context words given the centre word
2. Continuous bag of words (CBOW) - Predict the centre word given the context words
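The difference between the two variants is easiest to see in the training examples each one builds from the same text. The sketch below only generates those examples (no training); the helper names `skipgram_pairs` and `cbow_examples`, the toy sentence, and the window size are all made up for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs: the centre word predicts each neighbour."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """Return (context_words, centre) examples: the neighbours predict the centre word."""
    examples = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, centre))
    return examples

tokens = "the cat sat on the mat".split()
sg = skipgram_pairs(tokens, window=1)  # e.g. ("cat", "the"), ("cat", "sat"), ...
cb = cbow_examples(tokens, window=1)   # e.g. (["the", "sat"], "cat"), ...
```

In both cases the context is treated as an unordered set of neighbours, which is what makes the model bag-of-words.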

Why not just use a co-occurrence matrix with a small window length?
* Vectors grow with the vocabulary size, so they are very high dimensional. One possible solution is to reduce the dimension of the matrix with methods such as SVD
* Raw counts give disproportionate importance to high-frequency words
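To make the size problem concrete, here is a rough sketch that builds a window-based co-occurrence matrix for a toy corpus. The function name `cooccurrence_matrix` and the three sentences are assumptions for illustration. The sketch stops at the raw counts; a reduction such as SVD would be applied to `matrix` afterwards.

```python
from collections import Counter

def cooccurrence_matrix(sentences, window=1):
    """Count how often each pair of words appears within `window` positions of each other."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for s in sentences:
        for i, w in enumerate(s):
            lo = max(0, i - window)
            hi = min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(index[w], index[s[j]])] += 1
    size = len(vocab)
    # One row per word, one column per word: each vector has length |V|.
    matrix = [[counts[(r, c)] for c in range(size)] for r in range(size)]
    return vocab, matrix

# Toy corpus (illustrative only).
corpus = [
    "i like deep learning".split(),
    "i like nlp".split(),
    "i enjoy flying".split(),
]
vocab, matrix = cooccurrence_matrix(corpus, window=1)
```

Every row of `matrix` has one entry per vocabulary word, so the vectors get longer as the vocabulary grows, and the rows of frequent words such as `i` and `like` carry the largest counts.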

![[../assets/images/glove.png]]

Ratios of co-occurrence probabilities encode meaning
How can we learn just the meaning? The idea is to change the model's probability function so that it captures the co-occurrence statistics directly
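One way to write this down, following the usual presentation of GloVe (the symbols here are not from these notes: $w$ are word vectors, $\tilde{w}$ context vectors, $b$ biases, $X_{ij}$ co-occurrence counts, $f$ a weighting function that caps the influence of very frequent pairs). If dot products are made to model log co-occurrence probabilities, vector differences end up modelling the ratios:

$$
w_i \cdot w_j = \log P(i \mid j)
\quad\Longrightarrow\quad
w_x \cdot (w_a - w_b) = \log \frac{P(x \mid a)}{P(x \mid b)}
$$

The corresponding weighted least-squares objective over a vocabulary of size $V$ is:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$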