The core trade-off in LLM inference is between cost and latency, primarily managed by adjusting batch size.
Larger batches are more cost-effective but increase latency, while smaller batches reduce latency but are far more expensive per token.
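As a rough illustration of why this trade-off exists, the sketch below models a decode step in the memory-bound regime: every step streams the model weights once (a fixed cost shared by the whole batch) plus each request's KV cache (a per-request cost). All numbers (model size, KV-cache size, bandwidth, GPU price) are assumptions chosen for illustration, not measurements from the source.

```python
# Back-of-the-envelope decode-step model in the memory-bound regime.
# All hardware and model numbers are illustrative assumptions.

WEIGHT_BYTES = 140e9        # ~70B dense parameters at 2 bytes each (assumed)
KV_BYTES_PER_REQUEST = 2e9  # KV cache per in-flight request (assumed)
HBM_BANDWIDTH = 3.35e12     # bytes/s of HBM bandwidth (assumed)
GPU_COST_PER_HOUR = 3.0     # USD, assumed rental price

def decode_step(batch_size: int):
    """Return (latency_s, cost_usd_per_token) for one decode step."""
    # Weights are read once per step; KV cache is read once per request.
    bytes_read = WEIGHT_BYTES + batch_size * KV_BYTES_PER_REQUEST
    latency = bytes_read / HBM_BANDWIDTH
    cost_per_step = GPU_COST_PER_HOUR / 3600 * latency
    # The fixed weight-fetch cost is shared across the whole batch.
    return latency, cost_per_step / batch_size

for batch in (1, 8, 64, 256):
    latency, cost = decode_step(batch)
    print(f"batch={batch:4d}  step latency={latency*1e3:7.2f} ms  "
          f"cost per token=${cost:.2e}")
```

Under these assumed numbers, going from batch 1 to batch 256 raises step latency a few-fold while cutting cost per token by well over an order of magnitude, which is the shape of the trade-off described above.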
LLM inference performance is well described by a "roofline" model that balances memory bandwidth (fetching weights and KV cache) against compute (matrix multiplications).
The optimal operating point is where the system is equally bound by both.
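That "equally bound by both" point can be solved directly in a toy roofline model: compute time per step grows with batch size, while the weight-fetch time is fixed, so the crossover batch size is where the two are equal. The sketch below uses the same illustrative assumptions as above; none of the figures come from the source.

```python
# Critical batch size at which a decode step stops being bandwidth-bound
# and becomes compute-bound, under illustrative assumptions.

N_PARAMS = 70e9              # assumed dense model size
WEIGHT_BYTES = 2 * N_PARAMS  # FP16/BF16 weights
HBM_BANDWIDTH = 3.35e12      # bytes/s (assumed)
PEAK_FLOPS = 1.0e15          # FLOP/s of dense matmul throughput (assumed)

memory_time = WEIGHT_BYTES / HBM_BANDWIDTH   # fixed cost per decode step
flops_per_token = 2 * N_PARAMS               # ~2 FLOPs per parameter per token

# Compute time equals weight-fetch time when
#   batch * flops_per_token / PEAK_FLOPS == memory_time,
# i.e. at the point where the system is equally bound by both.
critical_batch = memory_time * PEAK_FLOPS / flops_per_token
print(f"critical batch size ~ {critical_batch:.0f}")  # ~300 under these numbers
```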
The evolution of hardware, particularly NVIDIA's increasing "scale-up domain" size (from 8 GPUs in Hopper to 72 in Blackwell), is a critical enabler for larger and more efficient Mixture-of-Experts (MoE) models by providing massive aggregate memory bandwidth.
There's a crucial distinction between memory bandwidth (the primary bottleneck for inference) and memory capacity.
Modern systems like a Blackwell rack have ample capacity for trillion-parameter models, but the challenge lies in efficiently utilizing the available bandwidth.
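A quick check makes the capacity-versus-bandwidth distinction concrete. The figures below (per-GPU memory size and bandwidth, a hypothetical 1T-parameter MoE stored in FP8) are rough assumptions: fitting the model into a 72-GPU rack is easy, while the floor on each decode step is set by how fast the aggregate bandwidth can stream whatever weights the step touches.

```python
# Capacity vs. bandwidth for a hypothetical 1T-parameter MoE on a 72-GPU
# scale-up domain. All hardware figures are rough assumptions.

N_GPUS = 72
HBM_PER_GPU = 192e9   # bytes of HBM per GPU (assumed Blackwell-class)
BW_PER_GPU = 8e12     # bytes/s of HBM bandwidth per GPU (assumed)
MODEL_BYTES = 1.0e12  # 1T parameters at 1 byte each (FP8, assumed)

total_capacity = N_GPUS * HBM_PER_GPU
aggregate_bw = N_GPUS * BW_PER_GPU

# Capacity: the model uses a small fraction, leaving room for KV cache.
print(f"capacity used: {MODEL_BYTES / total_capacity:.1%} "
      f"of {total_capacity / 1e12:.1f} TB")

# Bandwidth: if a large batch ends up touching every expert each step,
# the floor on step time is reading the whole model once from HBM.
step_floor = MODEL_BYTES / aggregate_bw
print(f"minimum decode-step time if all weights are read: "
      f"{step_floor * 1e3:.2f} ms")
```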
Concerns Raised
The communication bandwidth of scale-out networks (inter-rack) is a major bottleneck, fundamentally limiting the size of Mixture-of-Experts layers.
Failing to batch user requests leads to extremely poor (up to 1000x worse) cost economics for inference.
Pipeline parallelism, while useful for managing memory capacity, introduces micro-batching, which can hurt efficiency because each weight fetch is amortized over fewer tokens (see the sketch after this list).
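As a rough illustration of the micro-batching concern, the sketch below splits a fixed global batch into micro-batches: each pipeline stage re-reads its slice of the weights for every micro-batch that passes through, so more micro-batches means more weight bytes streamed per generated token. Model size, bandwidth, and batch sizes are assumed values chosen only to show the trend.

```python
# Illustrative effect of micro-batching in pipeline parallelism on how well
# weight fetches are amortized. All figures are assumptions.

WEIGHT_BYTES = 2e12     # hypothetical 1T-parameter model in BF16
AGG_BANDWIDTH = 576e12  # bytes/s, aggregate HBM bandwidth of the domain (assumed)
GLOBAL_BATCH = 1024     # requests decoded together

def weight_traffic_per_token(num_microbatches: int) -> float:
    """Bytes of weights streamed from HBM per generated token.

    Each pipeline stage re-reads its weight slice for every micro-batch,
    so total traffic per decode step is WEIGHT_BYTES * num_microbatches,
    shared across GLOBAL_BATCH tokens.
    """
    return WEIGHT_BYTES * num_microbatches / GLOBAL_BATCH

for m in (1, 2, 4, 8, 16):
    traffic = weight_traffic_per_token(m)
    fetch_time = WEIGHT_BYTES * m / AGG_BANDWIDTH
    print(f"micro-batches={m:2d}  weight bytes/token={traffic / 1e9:6.2f} GB  "
          f"weight-fetch time per step={fetch_time * 1e3:6.2f} ms")
```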
Opportunities Identified
Larger scale-up domains (like NVIDIA's Blackwell) unlock more powerful and efficient MoE models by providing massive aggregate memory bandwidth.
Sparse attention architectures offer a path to much longer contexts by fundamentally changing how memory-fetch time scales with context length (see the sketch after this list).
Designing hardware with less memory capacity but optimized for bandwidth could be a viable strategy, especially if pipeline parallelism is used effectively for inference.
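To illustrate how sparse attention changes the scaling of memory-fetch time, the sketch below compares per-token KV-cache reads for dense attention, which must touch every cached token, against a hypothetical sparse scheme that attends to a fixed local window plus a sub-linear selection of earlier tokens. The per-token KV size and the specific sparsity rule are assumptions, not any particular published architecture.

```python
import math

# KV-cache bytes read per generated token: dense attention vs. a
# hypothetical sparse scheme. All figures are illustrative assumptions.

KV_BYTES_PER_TOKEN = 100e3  # K+V bytes per cached token across all layers (assumed)
HBM_BANDWIDTH = 3.35e12     # bytes/s (assumed)

def dense_fetch(context_len: int) -> float:
    """Dense attention reads the entire KV cache for every new token."""
    return context_len * KV_BYTES_PER_TOKEN / HBM_BANDWIDTH

def sparse_fetch(context_len: int, window: int = 4096) -> float:
    """Hypothetical sparse scheme: a local window plus O(sqrt(n)) selected tokens."""
    touched = min(context_len, window) + int(math.sqrt(context_len))
    return touched * KV_BYTES_PER_TOKEN / HBM_BANDWIDTH

for n in (8_192, 131_072, 1_048_576):
    print(f"context={n:>9,}  dense={dense_fetch(n) * 1e3:8.2f} ms  "
          f"sparse={sparse_fetch(n) * 1e3:6.2f} ms per token")
```

Under these assumptions, dense KV-cache fetch time grows linearly with context length, while the sparse scheme's stays nearly flat, which is why such architectures change the scaling story for long contexts.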