Things read 2025 JFM
Blogs
- The Bitter Lesson
    - “Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”
    - “researchers always tried to make systems that worked the way the researchers thought their own minds worked—they tried to put that knowledge in their systems—but it proved ultimately counterproductive, and a colossal waste of researcher’s time, when, through Moore’s law, massive computation became available and a means was found to put it to good use.”
- Machines of loving grace - A very nice article on an optimistic view of powerful AI and its impact on our lives
- You could have designed state of the art Positional Encoding
Knowledge distillation
3 types of KD:
- Response based -> Hard targets (the single sampled token) or soft targets (top-k teacher logits with a KL-divergence loss); see the sketch after this list
- Feature based -> Projector layers between student and teacher intermediate layers so that the student can “map” to the teacher’s intermediate layers
- Relation based -> Produce relationship matrices that capture the relationships between the teacher’s inputs and outputs. The student model learns to reproduce these relationship matrices.
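A minimal PyTorch sketch of the response-based variant with soft targets, assuming the teacher and student share a vocabulary; the function name, temperature, and tensor shapes are illustrative, and the full distribution is used instead of top-k logits for brevity:

```python
import torch
import torch.nn.functional as F

def soft_target_kd_loss(student_logits, teacher_logits, temperature=2.0):
    # Response-based KD: match the student's softened distribution to the teacher's.
    # The temperature**2 factor is the standard gradient scaling from Hinton et al.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # KL(teacher || student); kl_div expects log-probs as input and probs as target
    kd = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return kd * (temperature ** 2)

# Random logits stand in for real model outputs (batch=2, seq=8, vocab=32000)
student_logits = torch.randn(2, 8, 32000)
teacher_logits = torch.randn(2, 8, 32000)
loss = soft_target_kd_loss(student_logits, teacher_logits)
```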
The toughest part of knowledge distillation is distilling between two models that have different vocabularies. This can be handled by:
- Using the outputs of the teacher model as sequences and directly finetuning the student model on them. This is similar to synthetic data generation, but a weak strategy.
- Just aligning the intermediate layers (feature based KD) without logit based distillation; see the sketch after this list
- Vocabulary alignment - I need to read more on this
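For the feature-based option, a minimal sketch of a projector layer aligning one student hidden layer to one teacher hidden layer with an MSE loss; the layer choice, dimensions, and loss are assumptions for illustration, not from a specific paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    # Feature-based KD: a linear projector maps student hidden states into the
    # teacher's hidden size so they can be compared directly; no logits or shared
    # vocabulary are needed.
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        projected = self.proj(student_hidden)                  # (batch, seq, teacher_dim)
        return F.mse_loss(projected, teacher_hidden.detach())  # teacher stays frozen

# Illustrative dimensions: a 768-dim student layer aligned to a 4096-dim teacher layer
projector = FeatureProjector(student_dim=768, teacher_dim=4096)
loss = projector(torch.randn(2, 16, 768), torch.randn(2, 16, 4096))
```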
Videos
- Don’t teach. Incentivize
- Even though it seems like an introductory video on LLMs, it has quite good lessons
- General methods with less structure scale better than specialized structures
- Instead of explicitly training the model for a task (teaching), we incentivize it to perform better on next word prediction
- This signals the model to learn novel skills on its own
- Generalized models may be better than smaller specialized models because generalized models have not been taught a specific task
- They have also enjoyed more compute
- The bias that specialized models beat generalized models on a task comes from humans usually having limited time to spend on tasks (either generalize or specialize)
- But models don’t have that kind of limitation
- A quote that I liked from the video = “Teach a man the taste of fish and make him hungry” -> this will force him to learn far more skills than fishing, since now he is learning a lot of other skills on his own (there is no direct task like “learn to fish”)
- What a model chooses to learn emerges with model size. If the model is smaller, it may give up on learning higher-level skills.
Research papers
Deepseek r1
- OpenAI o1 introduced inference-time scaling with the help of a CoT reasoning process
- Other efforts like PRMs (process-based reward models), RL, and search algorithms (Monte Carlo tree search, beam search) have not scaled very well
- Deepseek r1 introduces pure RL based reasoning models
- Started with Deepseek V3 as base
- Deepseek r1 zero developed emergent reasoning behaviour, but it was not quite polished
- Distilled smaller dense models from Deepseek r1. Interestingly:
    - “Using Qwen2.5-32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it.”
Deepseek r1 zero
- Directly apply RL on the base model (Deepseek v3) instead of applying SFT first
- What does this mean? They use the GRPO algorithm as follows (see the sketch after this section):
- Generating a group of responses for each question using the current policy (LLM)
- Calculating their rewards, normalizing those rewards to get advantages. Advantages are just normalized rewards.
- Updating the policy to maximize the objective that balances getting higher advantages while not deviating too much from the old policy (due to clipping) and staying close to the reference policy (due to the KL term in the loss)
- The method that gives rewards for these outputs can be based on rules or models. Specifically, they have 2 types of rewards:
- Accuracy rewards: For objective questions if the answer is correct and in the correct format
- Format rewards: If the model is using `<think>` tokens properly
- This alone was enough to achieve a performance comparable to o1
- They mention that as the number of steps increases in training, the model also starts using more tokens for reasoning
    - “One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment.”
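A minimal sketch of the pieces above, assuming sequence-level log-probs and a toy rule-based reward; the reward values, the simplified KL term, and the function names are my own simplifications rather than the paper’s exact implementation:

```python
import re
import torch

def rule_based_reward(completion: str, reference_answer: str) -> float:
    # Toy version of the rule-based rewards; the 0.5/1.0 values are assumptions.
    reward = 0.0
    if re.search(r"<think>.*</think>", completion, flags=re.DOTALL):
        reward += 0.5                                  # format reward: <think> tags used
    if completion.strip().endswith(reference_answer):
        reward += 1.0                                  # accuracy reward: correct final answer
    return reward

def grpo_loss(new_logprobs, old_logprobs, ref_logprobs, rewards,
              clip_eps=0.2, kl_coef=0.04):
    # new/old/ref_logprobs and rewards are (group_size,) tensors, one entry per
    # sampled response to the same question.
    # Advantage = reward normalized within the group (no value network needed).
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the policy ratio vs. the old (sampling) policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)

    # KL penalty toward the frozen reference policy (simplified here to a log-prob
    # difference; the paper uses a per-token unbiased estimator).
    kl = new_logprobs - ref_logprobs

    return -(surrogate - kl_coef * kl).mean()
```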
- (M1): Instead of directly starting RL from the base model, they first apply SFT to the base model using CoT data
- (M2): Applies RL on the above model (M1) and, once it converges, generates SFT data from this model (M2). This time, they also add a reward for language consistency in the CoT (which degraded the performance slightly)
- Finally, they add a bunch of reward signals for helpfulness, harmlessness, and human preferences. They again start from the base model, finetune it using data generated from M2, and apply RL as in Deepseek r1 zero to get the final model, Deepseek r1
- Applies distillation to smaller models using SFT data curated from M2. The performance improves significantly.
- They do not apply RL to these distilled models with this data like they did for Deepseek r1 zero
- What if you apply RL directly to the small models (like Llama, Qwen)?
- Does not achieve good performance. Better to distill from a larger reasoning model instead of trying RL directly on smaller models
- They also explored MCTS and PRM algorithms, which did not work; I didn’t understand the MCTS part
Implementation
- From online softmax to Flash attention
- A very intuitive approach to deriving the flash attention implementation. Helped me implement online softmax and, in principle, flash attention in PyTorch
- Naive softmax
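To make the progression concrete, a small sketch (my own, not the blog’s code) contrasting the naive three-pass softmax with a one-pass online softmax that keeps a running max and a rescaled running sum, the trick flash attention builds on:

```python
import torch

def naive_softmax(x):
    # Three passes: max for numerical stability, sum of exponentials, normalize.
    m = x.max()
    e = torch.exp(x - m)
    return e / e.sum()

def online_softmax(x):
    # One pass over x keeps a running max m and a rescaled running denominator d;
    # this is the block-wise trick flash attention uses to avoid materializing the
    # full attention score matrix. Written element-by-element for clarity, not speed.
    m = torch.tensor(float("-inf"))
    d = torch.tensor(0.0)
    for xi in x:
        m_new = torch.maximum(m, xi)
        d = d * torch.exp(m - m_new) + torch.exp(xi - m_new)
        m = m_new
    return torch.exp(x - m) / d

x = torch.randn(16)
assert torch.allclose(naive_softmax(x), online_softmax(x), atol=1e-6)
```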