Word Embeddings
Word2Vec
Notes
- Softmax squashes a vector of arbitrary real values into a probability distribution: every output lies in $[0, 1]$ and the outputs sum to 1
- Word2Vec actually trains two vectors for each word: one used when the word appears as the centre word and one used when it appears as a context (outside) word. The final embedding is the average of the two
- Word2Vec is a bag-of-words model: it ignores the order of the words within the context window
- There are two variants of the Word2Vec model
  - Skip-gram: predict the context words given the centre word
  - Continuous bag of words (CBOW): predict the centre word given the context words
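
A minimal sketch of the points above, assuming a toy three-word vocabulary with made-up 2-dimensional vectors (the words, values, and function names are all hypothetical). It shows softmax turning dot-product scores into the skip-gram probability $P(o \mid c)$, and the averaging of the two vectors kept per word.

```python
import math

def softmax(scores):
    """Map arbitrary real scores to probabilities that sum to 1."""
    # Subtracting the max score avoids overflow and does not change the result.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical toy vectors: one set for centre words, one for outside words.
vocab = ["king", "queen", "apple"]
centre_vecs = {"king": [0.9, 0.1], "queen": [0.8, 0.2], "apple": [-0.5, 0.7]}
outside_vecs = {"king": [0.7, 0.0], "queen": [0.9, 0.1], "apple": [-0.6, 0.8]}

def skipgram_probs(centre_word):
    """P(outside word | centre word) for every word in the vocabulary."""
    scores = [dot(outside_vecs[w], centre_vecs[centre_word]) for w in vocab]
    return dict(zip(vocab, softmax(scores)))

def final_embedding(word):
    """Average of the centre vector and the outside vector of a word."""
    return [(c + o) / 2 for c, o in zip(centre_vecs[word], outside_vecs[word])]

probs = skipgram_probs("king")
```

Here `probs` is a distribution over the whole vocabulary, so the softmax denominator touches every word. This is fine for three words, but it is the cost that grows with vocabulary size.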
Training efficiency
- The loss function of the Word2Vec model, a negative log softmax probability, has two terms
  - Numerator: decreases the loss when the embeddings of the centre word and the outside word are similar
  - Denominator: decreases the loss when the embedding of the centre word is dissimilar to the other words in the vocabulary
- This computation is slow because the denominator has to be evaluated over every word in the vocabulary for every centre word
- To speed this up, the loss function is changed so that the centre word is only pushed to be dissimilar to a small sample of words drawn from the vocabulary
  - This is called negative sampling
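
A rough sketch of the negative-sampling loss for one (centre, outside) pair, $-\log \sigma(u_o \cdot v_c) - \sum_k \log \sigma(-u_k \cdot v_c)$. The function names and toy vectors are hypothetical, and the uniform sampler is a simplifying assumption (Word2Vec samples negatives from a smoothed unigram distribution instead).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negative_sampling_loss(centre_vec, outside_vec, negative_vecs):
    """Loss for one centre word, its true outside word, and k sampled negatives."""
    # First term: small when the centre and true outside vectors are similar.
    loss = -math.log(sigmoid(dot(outside_vec, centre_vec)))
    # Remaining terms: small when the centre vector is dissimilar to each negative.
    for neg_vec in negative_vecs:
        loss -= math.log(sigmoid(-dot(neg_vec, centre_vec)))
    return loss

def sample_negatives(vocab, exclude, k, rng):
    """Draw k negative words uniformly, skipping the true outside word.
    Uniform sampling is an assumption made to keep the sketch short."""
    candidates = [w for w in vocab if w != exclude]
    return [rng.choice(candidates) for _ in range(k)]

# Toy vectors (made-up values): the loss is low when the outside word is
# aligned with the centre word and the negative points the other way.
good = negative_sampling_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = negative_sampling_loss([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
negatives = sample_negatives(["a", "b", "c"], exclude="a", k=5, rng=random.Random(0))
```

The cost per training pair now depends on `k`, not on the vocabulary size.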
Co-occurrence matrix
- Why not just use a co-occurrence matrix built with a small window length?
  - Vectors grow in size with the vocabulary, so they are very high dimensional
  - One possible solution is to reduce the dimension of the matrix with methods such as SVD
  - Raw counts give disproportionate importance to high-frequency words
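
A small sketch of building a window-based co-occurrence matrix, assuming whitespace tokenisation and a made-up three-sentence corpus (both are illustrative choices, as is the function name). Each row is one word's vector, so the row length equals the vocabulary size.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical toy corpus.
corpus = ["I like deep learning", "I like NLP", "I enjoy flying"]
counts = cooccurrence_counts(corpus, window=1)

# Dense matrix: one row and one column per vocabulary word.
vocab = sorted({w for pair in counts for w in pair})
matrix = [[counts.get((row, col), 0) for col in vocab] for row in vocab]
```

Even in this toy case most entries of `matrix` are zero, and adding a new word adds a full row and column, which is why dimensionality reduction comes up.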
GloVe
![[../assets/images/glove.png]]
- Ratios of co-occurrence probabilities encode components of meaning
- How can these meaning components be learned? The idea is to change the training objective so that the dot product of two word vectors approximates the log of their co-occurrence count
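
A hedged sketch of the per-pair GloVe loss, $f(X_{ij})\,(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2$. The function names are hypothetical, and the defaults `x_max=100` and `alpha=0.75` are assumed here as commonly quoted settings rather than taken from these notes.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f: damps rare pairs and caps very frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j_tilde, b_i, b_j_tilde, x_ij):
    """Weighted squared error between the model score and log co-occurrence count."""
    score = sum(a * b for a, b in zip(w_i, w_j_tilde)) + b_i + b_j_tilde
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2

# Toy check with made-up vectors: the loss is zero when the dot product plus
# biases equals log(X_ij) exactly.
perfect = glove_pair_loss([1.0, 0.0], [math.log(50.0), 0.0], 0.0, 0.0, 50.0)
off = glove_pair_loss([1.0, 0.0], [0.0, 0.0], 0.0, 0.0, 50.0)
```

Only pairs with a non-zero count contribute, since $\log X_{ij}$ is undefined at zero. The cap in `glove_weight` is what keeps very frequent pairs from dominating the sum.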