
What is Systems for LLMs?

Systems for LLMs

This year, I decided to focus on systems for LLMs. It was not a snap decision; it came out of reflecting on 2024 and on my career so far. But what do I mean by systems for LLMs?

Simply put, it means building frameworks that facilitate the training and deployment of large models. It spans three layers:

Hardware layer

Understand the hardware deeply and write hardware-aware programs. On Nvidia hardware, it is hard to compete with Nvidia’s own engineers: they know the hardware best and will always be well positioned to write the best general-purpose kernels. Still, there is room to do impactful work from the outside, specifically in algorithms and architecture. On other hardware (AMD, Intel, etc.), the potential gains are much larger. Examples of frameworks: TorchAO, Gemlite, Liger, Cutlass

Orchestration layer

Frameworks that handle the orchestration of tensors, meaning they make the best use of the hardware from a scheduling and optimization perspective. This is the crucial layer that bridges the hardware and the end goal (distributed training and inference). The outcome is an efficient framework for training, post-training, RL, or inference. To me, this is the area with the highest ROI and the one I find most interesting. Examples of frameworks: vLLM, SGLang, Torchtune, Unsloth, axolotl

Deployment layer

Deploying models in production using the frameworks above. This means managing instances of the deployed framework through autoscaling and building a monitoring and observability stack. In the LLM era, it also means building KV cache-aware routing, disaggregated serving, a KV cache store, etc. Examples of frameworks and tools: llm-d, Nvidia Dynamo
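
To make KV cache-aware routing a bit more concrete, here is a minimal sketch in Python. It is not how any of the tools named above implement it; the `Replica` class, the `route` function, and the scoring rule (longest cached prefix first, then lowest load) are hypothetical choices made only for illustration. The idea is that a request should land on the instance that already holds the KV cache for most of its prompt, so that less prefill work is repeated.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical serving instance and the token prefixes it has cached."""
    name: str
    cached_prefixes: list[tuple[int, ...]] = field(default_factory=list)
    active_requests: int = 0


def shared_prefix_len(a, b):
    """Count the leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt_tokens, replicas):
    """Pick the replica with the longest cached prefix; break ties by load."""
    def score(replica):
        best_hit = max(
            (shared_prefix_len(prompt_tokens, p) for p in replica.cached_prefixes),
            default=0,
        )
        # Assumption: a longer cache hit always beats a lighter load.
        return (best_hit, -replica.active_requests)

    chosen = max(replicas, key=score)
    # Record that this replica will now hold the KV cache for this prompt.
    chosen.cached_prefixes.append(tuple(prompt_tokens))
    chosen.active_requests += 1
    return chosen
```

A real router would also have to weigh cache hits against queueing delay, evict stale entries, and track prefixes at block granularity rather than comparing whole prompts; the sketch ignores all of that.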

Every layer is heavily influenced by the layer before it. And in the LLM era, every layer is being redefined. Right now, my interest lies mostly in the orchestration layer (2) and the hardware layer (1).

Why do I like this?

  1. I like programming and building stuff
  2. I like to optimize things, reduce waste, and run them at scale
  3. I don’t want to be just a recipe curator
  4. Architectures may change, but scale will not go backwards; optimizing, adapting, and improving systems will always be valuable

Possible research avenues

  1. Optimizations around the KV cache, for example: PagedAttention, KV cache compression
  2. Optimizations that reduce the memory footprint, for example: SmoothQuant, SwiftKV
  3. Efficient scheduling, speculative decoding, and batching, for example: chunked prefill, DistServe
  4. Hardware-aware algorithms, for example: FlashAttention, Flash-Decoding
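
To show why the KV cache gets so much attention, here is a toy sketch in Python of two ideas behind the first avenue: how large the cache is per token, and how a paged allocator hands out fixed-size blocks instead of reserving one contiguous region per sequence. This is an illustration only, not the vLLM implementation; `kv_bytes_per_token`, `BlockAllocator`, and the model dimensions used in the note below are assumptions chosen for the example.

```python
import math


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer and head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


class BlockAllocator:
    """Toy paged allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size          # tokens per block
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens stored so far

    def append_tokens(self, seq_id, n):
        """Grow a sequence by n tokens, allocating blocks only when needed."""
        table = self.tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0) + n
        needed = math.ceil(length / self.block_size) - len(table)
        if needed > len(self.free):
            raise MemoryError("out of KV cache blocks")
        for _ in range(needed):
            table.append(self.free.pop())
        self.lengths[seq_id] = length
        return table

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

For a hypothetical model with 32 layers, 32 KV heads, a head dimension of 128, and 2-byte elements, `kv_bytes_per_token(32, 32, 128, 2)` comes to 524288 bytes, or 0.5 MiB per token. With block-based allocation, each sequence wastes at most `block_size - 1` token slots in its last block, which is the kind of saving that makes larger batches fit in the same memory.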

The kind of impact I want to have

  1. The framework should facilitate critical flows like model training and large-scale inference, and its optimizations should translate into real cost savings
  2. The framework should help build real products whose edge comes specifically from those optimizations and from resilience