LLM Inference Part 1
This is an ever-evolving post on LLM inference (mostly on GPUs). It covers references and explanations around these topics:
Orthogonal improvements:
- Paged Attention
- Chunked prefill
- Prefix caching
- Flash decoding and Flash Attention
- KV Caching and compression
- Attention sinks
- Batching