2024-11-28

LLM Inference Part 1

This is an ever-evolving post on LLM inference (mostly on GPUs). It collects references and explanations for the following topics:

Orthogonal improvements
References
Links

Orthogonal improvements

PagedAttention
Chunked prefill
Prefix caching
Flash-Decoding and FlashAttention
KV Caching and compression
Attention sinks
Batching ¹
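
As a rough illustration of the KV caching item above, here is a minimal sketch in plain Python. It is a toy, not how any inference engine implements it: the `KVCache` class, the identity and doubling "projections", and the two-dimensional token vectors are all assumptions made up for this example. It only shows that storing each token's key and value once and reusing them gives the same attention outputs as recomputing them for the whole prefix at every decoding step.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def project(x):
    # Hypothetical stand-ins for the learned Q/K/V projections:
    # identity for query and key, doubling for value.
    return x, x, [2.0 * xi for xi in x]


class KVCache:
    """Toy per-sequence cache: one key and one value vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def decode_with_cache(tokens):
    """Each step projects only the new token and reuses cached keys/values."""
    cache = KVCache()
    outputs = []
    for x in tokens:
        q, k, v = project(x)
        cache.append(k, v)
        outputs.append(attend(q, cache.keys, cache.values))
    return outputs


def decode_without_cache(tokens):
    """Each step re-projects the whole prefix, repeating earlier work."""
    outputs = []
    for t in range(len(tokens)):
        projected = [project(x) for x in tokens[: t + 1]]
        keys = [k for _, k, _ in projected]
        values = [v for _, _, v in projected]
        q = projected[-1][0]
        outputs.append(attend(q, keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cached = decode_with_cache(tokens)
recomputed = decode_without_cache(tokens)
```

Both functions return the same outputs; the cached version simply avoids re-projecting earlier tokens. Real systems keep a cache like this per layer and per head in GPU memory, and managing that memory is what paged attention and prefix caching in the list above are concerned with.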

References

Building Machine Learning Systems for a Trillion Trillion Floating Point Operations

Links

¹ Scheduler policy in vLLM vs TensorRT