
LLM Inference Part 1

This is an ever-evolving post on LLM inference (mostly on GPUs). It collects references and explanations on these topics:

  1. Orthogonal improvements
  2. References
  3. Links

Orthogonal improvements

  1. Paged Attention
  2. Chunked prefill
  3. Prefix caching
  4. Flash decoding and Flash Attention
  5. KV Caching and compression
  6. Attention sinks
  7. Batching 1

References

  1. Building Machine Learning Systems for a Trillion Trillion Floating Point Operations