Things read in 2023 OND
The last 3 months, my learning curve has been pretty steep. Probably the steepest in my entire career. I thought I'd take a break from consuming all the information, reflect on what I learned, and document it for the future.
ASR
Related to my work
NLP
I got to learn the field of NLP from the basics. I read a bunch of research papers and blogs. Here are my learnings from a few of them.
Attention is all you need
- Foundational architecture which is the basis for all the current SOTA LLMs
- Q, K, and V are matrices formed by linear projections (i.e., multiplying by weight matrices) of the input X
- LayerNorm - Mostly used for NLP tasks. For CV tasks, batch norm is used
- Multi-head attention consists of multiple attention layers (heads) in parallel, each with different linear transformations on the queries, keys, values, and outputs, so that different relationships between words are captured (a minimal sketch follows this list)
- The decoder block has cross-attention, where the K, V projections come from the encoder block's output, but the Q projection comes from the decoder itself
- The decoder block uses masking, and its outputs are offset by 1 position
- In a typical setting, the output dimension after the Q, K, and V projections is $d/h$, where $d$ is the input dimension and $h$ is the number of heads
- This is done so that after concatenating the outputs from the different heads horizontally, we get a $d$-dimensional output
- The concatenated output is again multiplied by a weight matrix ($W_{o}$)
- The FFN has two weight matrices: it projects the output of the attention sub-layer to $4d$ dimensions and then downsizes it back to $d$ dimensions (hence the two weight matrices)
- The paper adds label smoothing, which improves accuracy at the cost of perplexity (need to read more to understand why label smoothing hurts the perplexity score)
- Great resource: The illustrated transformer by Jay Alammar
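To make the $d/h$ bookkeeping concrete, here is a minimal NumPy sketch of multi-head self-attention. The variable names and shapes are my own, not from the paper; a real implementation would also add masking, dropout, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """Minimal multi-head self-attention over one sequence.

    X: (seq_len, d); Wq/Wk/Wv/Wo: (d, d); h: number of heads.
    Each head works in d/h dimensions; head outputs are concatenated
    back to d dimensions and projected through Wo.
    """
    seq_len, d = X.shape
    dk = d // h  # per-head dimension, d/h

    # Linear projections of the input X.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Split into h heads: (h, seq_len, dk).
    split = lambda M: M.reshape(seq_len, h, dk).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)

    # Scaled dot-product attention per head.
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(dk)  # (h, seq, seq)
    out = softmax(scores) @ Vh                         # (h, seq, dk)

    # Concatenate heads horizontally -> (seq, d), then project with Wo.
    concat = out.transpose(1, 0, 2).reshape(seq_len, d)
    return concat @ Wo

rng = np.random.default_rng(0)
d, h, seq_len = 16, 4, 5
X = rng.normal(size=(seq_len, d))
Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, h).shape)  # (5, 16)
```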
BERT
- Mostly a combination of learnings from word2vec and ELMo. Uses an encoder stack of vanilla transformers, masks words in between, and uses the network to predict the masked words
- Its speciality is that it uses context from both sides of a word, hence the "bi-directional" in BERT
- We can use the same model for multiple tasks. Why? Because the training is done in such a fashion that the model understands the complete context of the input sequence.
- The first token of the sequence is always the special `[CLS]` token. The corresponding output token aggregates the knowledge of the entire sentence
- If there is more than one sentence in a sequence, they are separated by the special `[SEP]` token. During the forward pass, separate segment embeddings are added to the sentences to indicate that these are two different sentences. This is different from the positional embedding, which is at a token level (see the tokenizer check after this list)
- Great resource again from Jay Alammar: The illustrated BERT
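A quick way to see the `[CLS]`/`[SEP]` layout and the segment ids, assuming the Hugging Face `transformers` package is installed (the sentence pair is my own example):

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tok("How are you?", "I am fine.")  # a two-sentence sequence

print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'how', 'are', 'you', '?', '[SEP]', 'i', 'am', 'fine', '.', '[SEP]']
print(enc["token_type_ids"])  # segment ids: 0 for sentence A, 1 for sentence B
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```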
GPTs
- Decoder-only architecture. Auto-regressive in nature: predicts one word, appends it to the input, and predicts the next word again
- `top_k` = how many words to take into account for every next-token generation. Picks the top $k$ words and, from them, selects a word according to the probability distribution (a toy sampling step follows this list)
- Masked self-attention is different in BERT and GPT. In BERT, masked tokens can take into account attention from "future" tokens (i.e., from both sides); in GPT, each token can only attend to the tokens before it
- Masking is applied after the query and key matrix multiplication but before applying the softmax (why? I need to understand this better; one hint, shown in the second sketch below, is that setting masked scores to $-\infty$ before the softmax drives their attention weights to exactly zero)
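Here is a toy `top_k` sampling step (my own illustration with made-up probabilities, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_sample(probs, k):
    """Keep the k likeliest tokens, renormalize, then sample among them."""
    idx = np.argsort(probs)[-k:]        # indices of the k highest-probability tokens
    p = probs[idx] / probs[idx].sum()   # renormalize over the survivors
    return rng.choice(idx, p=p)

vocab_probs = np.array([0.02, 0.40, 0.25, 0.03, 0.30])
print(top_k_sample(vocab_probs, k=3))   # samples only among tokens 1, 4, and 2
```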
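And a small sketch of causal masking, again in NumPy: the scores are set to $-\infty$ before the softmax, so "future" positions end up with exactly zero weight.

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))      # stand-in for Q @ K.T / sqrt(dk)
mask = np.triu(np.ones((seq_len, seq_len)), k=1)  # 1s above the diagonal = future
scores = np.where(mask == 1, -np.inf, scores)     # hide future tokens pre-softmax

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1, future = 0
print(weights.round(2))                           # lower-triangular matrix
```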
Beam search
- In beam search, we keep adding the score (the log probability, which is negative) of each token and, at each time step, keep the $k$ highest-scoring sequences
- The longer you go down a path (before getting an `<EOS>` token), the lower the score becomes (because each added log probability is negative)
- The solution is to normalize the score by the number of time steps (a toy sketch follows this list)
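A toy beam search, with my own made-up distribution standing in for a model (`next_logprobs` would normally come from the decoder):

```python
import math

# Fixed distribution over a 3-token vocabulary where token 0 is <EOS>.
def next_logprobs(seq):
    return [math.log(p) for p in (0.40, 0.45, 0.15)]

def beam_search(k=2, max_len=5, eos=0):
    beams = [([], 0.0)]   # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in enumerate(next_logprobs(seq)):
                candidates.append((seq + [tok], score + lp))
        # Keep only the k highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            if seq[-1] == eos:
                # Length-normalize so longer paths aren't unfairly punished.
                finished.append((seq, score / len(seq)))
            else:
                beams.append((seq, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1]) if finished else beams[0]

print(beam_search())  # the normalized score favors a longer completion
```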
LLMs
Started working on LLMs professionally. It is a very fast-paced space, which I have never experienced before. It is tough to figure out which things are noise and which will stick. I am slower than most at catching up in this field, but I am getting deep into a few fundamental areas.
Some innovations/experimentations in LLM space
- Multi-query attention: only a single K, V matrix for all the attention heads. This results in a reduced number of parameters, faster training, and faster inference
- Grouped-query attention: the K, V matrices are shared among groups of attention heads. E.g., if we have 16 attention heads, we can have 4 K, V matrices, with each K, V matrix shared among 4 attention heads. This was used in Llama 2 (a shape-level sketch follows this list)
- Needle-in-the-haystack attacks (ref). A pretty neat way to evaluate the retrieval performance of LLMs
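A shape-level sketch of grouped-query attention in NumPy (the numbers are arbitrary; real implementations fold this into batched tensor ops rather than an explicit repeat):

```python
import numpy as np

seq, d, n_heads, n_kv_heads = 8, 64, 16, 4
dk = d // n_heads              # per-head dimension
group = n_heads // n_kv_heads  # 4 query heads share each K/V head

rng = np.random.default_rng(0)
Q = rng.normal(size=(n_heads, seq, dk))
K = rng.normal(size=(n_kv_heads, seq, dk))
V = rng.normal(size=(n_kv_heads, seq, dk))

# Repeat each K/V head `group` times so every query head has a partner.
K_shared = np.repeat(K, group, axis=0)  # (16, 8, 4)
V_shared = np.repeat(V, group, axis=0)

scores = Q @ K_shared.transpose(0, 2, 1) / np.sqrt(dk)  # (16, 8, 8)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ V_shared  # (16, 8, 4)
print(out.shape)  # n_kv_heads == n_heads is vanilla MHA; == 1 is MQA
```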
Zephyr
- Zephyr adds 2 main things on top of the Mistral model: dSFT and dDPO
- We have two models: a student model (a much smaller, distilled version) and a teacher model (much larger, GPT-4)
- dSFT: Distilled SFT
  - The model is trained on GPT-4-generated prompts and responses. The dataset is of the format ($x_1$, $y_1$), where $x_1$ is the prompt and $y_1$ is the response (they iteratively improve the prompt as well)
  - However, these models did not turn out to be very aligned with how humans respond
- dDPO: Distilled DPO
  - To align the output of the model with human responses, Zephyr uses the method of DPO (Direct Preference Optimization)
  - They generate responses to a prompt ($x$) from multiple models (say, Mistral, Claude, etc.: $y_1$, $y_2$, and so on) and use GPT-4 as an evaluator to score them. The winning response is labeled $y_w$ and one of the losing responses is sampled and labeled $y_l$. Now we have a dataset represented as ($x$, $y_w$, $y_l$) (prompt, winning response, losing response)
  - The dataset generated above is used to train a preference model, where given $x$ the model learns to rank $y_w$ above $y_l$. The steps are as follows (paraphrased from the paper):
    - Compute the probability for ($x$, $y_w$) and ($x$, $y_l$) from the dSFT model (forward-only)
    - Compute the probability for ($x$, $y_w$) and ($x$, $y_l$) from the dDPO model
    - Compute the objective function for the preference model and backpropagate to update. Repeat.
- They skip creating an intermediate reward model and directly update the student model while doing alignment. This is different from RLHF (a sketch of the DPO objective follows this list)
- They also save the cost of a human evaluating the responses by using GPT-4 as the evaluator
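A minimal sketch of the DPO objective for a single ($x$, $y_w$, $y_l$) triple. The numbers and function names are mine; in practice the log-probabilities are sums of token log-probs from the policy (dDPO) model and the frozen dSFT reference:

```python
import math

def dpo_loss(logp_w_policy, logp_l_policy,  # from the model being trained (dDPO)
             logp_w_ref, logp_l_ref,        # from the frozen dSFT reference
             beta=0.1):
    # Implicit rewards: how far the policy has moved from the reference
    # on the winning vs. the losing response.
    reward_w = beta * (logp_w_policy - logp_w_ref)
    reward_l = beta * (logp_l_policy - logp_l_ref)
    # -log(sigmoid(margin)) is minimized when the winner's implicit
    # reward exceeds the loser's by a wide margin.
    margin = reward_w - reward_l
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy prefers y_w slightly more than the reference does.
print(dpo_loss(-12.0, -15.0, -12.5, -14.0))  # ~0.62, a bit below log 2
```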
The power of prompting
- Microsoft blog: they evaluated carefully crafted prompting techniques to beat Gemini Pro's benchmark score on MMLU. According to them, specially crafted prompts can perform way better than no prompting, steering the model's performance so much that sometimes foundation models (which are generalists) can beat specialized models trained for a particular task. They share an example of medical prompts used with GPT-4, which could beat a specialized model trained on medical data (here)
- Prompting also improved the performance of Anthropic’s Claude model against the needle in the haystack attack (ref)
Sarvam’s OpenHathi
- Current tokenizers do not prioritize Hindi characters, because Hindi tokens don't usually make it into the tokenizers' limited vocabularies
- This results in issues like expensive tokenization and slower response times, since models need to output more tokens. On an open-source translation dataset, I found that the same sentences required almost 4-5x as many tokens in Hindi as in English using tiktoken (see the snippet at the end of this section)
- The first step Sarvam took was to add 12k tokens for Hindi characters and "align" the tokenizer
  - A new SentencePiece tokenizer is trained from scratch with a vocabulary size of 16K. This is then merged with the existing Llama tokenizer
  - The embeddings for the new tokens are randomly initialized, and the Llama model is finetuned for the translation task. Notably, this step is done using LoRA to preserve the original model's weights (a rough sketch of vocabulary extension is at the end of this section)
- The second step was teaching the model a world model in Hindi. This was done on some original Hindi datasets plus some data translated from English to Hindi
  - The training here is a bit weird in the sense that the model learns to alternate between Hindi and English sentences (this is done purposefully). Due to this, the base model replies only in this format, alternating between English and Hindi
- In the final phase, they perform SFT for some tasks like translation, content moderation, and question-answering
- With all this finetuning, the model still performs on par with the base Llama on English-only benchmarks, and they claim that it outperforms GPT-4 on some translation tasks
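To reproduce the token-count gap mentioned above, here is a quick check with tiktoken (the sentence pair is my own example, not from the author's dataset):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
english = "How are you doing today?"
hindi = "आज आप कैसे हैं?"

print(len(enc.encode(english)))  # a handful of tokens
print(len(enc.encode(hindi)))    # typically several times more
```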
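And a rough sketch of what extending a tokenizer and resizing embeddings looks like with Hugging Face `transformers`. The token list is a two-word stand-in for the ~12k added tokens, the model name is a placeholder, and OpenHathi's actual pipeline (SentencePiece merging plus LoRA finetuning) is more involved:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stand-ins for the ~12k Hindi tokens added to the vocabulary.
num_added = tokenizer.add_tokens(["नमस्ते", "धन्यवाद"])

# New rows of the embedding matrix are randomly initialized.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size = {len(tokenizer)}")
```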