Disaggregated Inference for Scalable LLMs

LLM inference bottlenecks arise from coupling the prefill (prompt processing) and decode (token generation) phases on the same hardware. Disaggregation separates them so each phase can be scaled and optimized independently.


Key Concepts

  • Prefill Phase: Computes KV cache from input prompt (compute-intensive).
  • Decode Phase: Autoregressive token generation (memory-bound).
  • Disaggregation: Run prefill and decode on separate GPU pools, transferring the per-request KV cache between them (see the sketch after this list).
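
To make the split concrete, here is a minimal, framework-agnostic Python sketch of the flow: a prefill worker builds a per-request KV cache from the prompt, hands it across a transfer link (modeled here as a queue), and a decode worker generates tokens against that cache. The names (`PrefillWorker`, `DecodeWorker`, `KVCache`) are hypothetical stand-ins for illustration, not any real inference framework's API.

```python
# Conceptual sketch of prefill/decode disaggregation with a toy "model".
# All classes and values are illustrative placeholders.
from dataclasses import dataclass, field
from queue import Queue


@dataclass
class KVCache:
    """Per-request key/value state produced by the prefill phase."""
    request_id: int
    kv: list = field(default_factory=list)  # stand-in for layer-wise K/V tensors


class PrefillWorker:
    """Runs on the compute-optimized pool: processes the full prompt once."""

    def prefill(self, request_id: int, prompt_tokens: list) -> KVCache:
        # In a real system this is one large batched forward pass over the
        # whole prompt; here we just record a fake state per prompt token.
        return KVCache(request_id, kv=[t * 2 for t in prompt_tokens])


class DecodeWorker:
    """Runs on the memory-optimized pool: generates tokens autoregressively."""

    def decode(self, cache: KVCache, max_new_tokens: int) -> list:
        generated = []
        for _ in range(max_new_tokens):
            # Each step reads the whole KV cache (memory-bound) and appends
            # one new entry for the token it just produced.
            next_token = (sum(cache.kv) + len(generated)) % 50_000
            cache.kv.append(next_token)
            generated.append(next_token)
        return generated


if __name__ == "__main__":
    # The queue stands in for the KV-cache transfer link (e.g. RDMA or NVLink)
    # between the two GPU pools.
    transfer = Queue()

    prefill_pool = PrefillWorker()
    decode_pool = DecodeWorker()

    cache = prefill_pool.prefill(request_id=0, prompt_tokens=[101, 7592, 2088])
    transfer.put(cache)  # hand the KV cache off to the decode pool
    print(decode_pool.decode(transfer.get(), max_new_tokens=4))
```

The design point the sketch illustrates: the only state that crosses the boundary is the KV cache, so the compute-heavy prefill pool and the memory-bound decode pool can be sized and batched independently.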