The communication bandwidth of scale-out networks (inter-rack) is a major bottleneck, fundamentally limiting the size of Mixture-of-Experts layers.
Failing to batch user requests leads to extremely poor cost economics for inference (up to 1000x worse per token).
Pipeline parallelism, while useful for managing memory capacity, introduces micro-batching, which can hurt efficiency because each micro-batch amortizes its weight fetches over fewer tokens.
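The batching and micro-batching points above can be illustrated with a rough roofline-style estimate. This is a minimal sketch, not a measurement: every hardware and model number is a hypothetical round figure, the function names are made up for illustration, and KV-cache traffic, network time, and the split of weights across pipeline stages are all ignored. The assumed model is that one decode step takes the larger of the weight-fetch time and the compute time, and that the step's cost is shared by every token in the batch.

```python
# Hypothetical numbers chosen for round arithmetic, not taken from any real chip or model.
WEIGHT_BYTES = 100e9           # assumed: 100B parameters stored at 1 byte each
MEM_BW = 4e12                  # assumed: memory bandwidth in bytes/s
FLOPS_PER_TOKEN = 2 * 100e9    # assumed: about 2 FLOPs per parameter per token
PEAK_FLOPS = 2e15              # assumed: peak compute in FLOP/s


def decode_step_time(batch):
    """Seconds for one decode step: bounded by weight fetch or by compute."""
    fetch = WEIGHT_BYTES / MEM_BW                    # paid once per step, whatever the batch
    compute = batch * FLOPS_PER_TOKEN / PEAK_FLOPS   # grows linearly with the batch
    return max(fetch, compute)


def time_per_token(batch):
    """Device seconds charged to each token when `batch` tokens share a step."""
    return decode_step_time(batch) / batch


def micro_batch_time_per_token(batch, micro_batches):
    """Same estimate when the batch is split into equal micro-batches.

    Simplification: each micro-batch is assumed to trigger its own full pass
    over the weights, so the fetch is shared by batch / micro_batches tokens.
    """
    return time_per_token(batch / micro_batches)


if __name__ == "__main__":
    for b in (1, 16, 250, 1000):
        print(f"batch={b:5d}  seconds/token={time_per_token(b):.6f}")
    print("256 tokens, 1 micro-batch :", micro_batch_time_per_token(256, 1))
    print("256 tokens, 4 micro-batches:", micro_batch_time_per_token(256, 4))
```

Under these assumptions the per-token cost falls roughly in proportion to the batch size until compute time overtakes fetch time, so the penalty for not batching is set by the hardware's ratio of peak FLOPs to memory bandwidth. Splitting a batch into micro-batches moves each weight pass back toward the small-batch end of that curve.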
Opportunities Identified
Larger scale-up domains (like NVIDIA's Blackwell) unlock more powerful and efficient MoE models by providing massive aggregate memory bandwidth.
Sparse attention architectures offer a path to much longer contexts by fundamentally changing how memory fetch time scales with context length.
Designing hardware with less memory capacity but optimized for bandwidth could be a viable strategy, especially if pipeline parallelism is used effectively for inference.
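To make the sparse attention point concrete, the sketch below compares KV-cache fetch time per decoded token under dense attention and under a hypothetical sparse scheme. The assumptions are loud ones: the per-token KV size and bandwidth are invented round numbers, the function names are illustrative, and the sparse scheme is modeled as reading a fixed budget of `tokens_read` cached tokens per step with no cost for selecting them. Real sparse attention designs differ in how the number of tokens read grows with context.

```python
# Hypothetical numbers for illustration only.
KV_BYTES_PER_TOKEN = 100e3   # assumed: KV-cache bytes stored per context token
MEM_BW = 4e12                # assumed: memory bandwidth in bytes/s


def dense_kv_fetch_time(context_len):
    """Dense attention reads the whole KV cache for every decoded token."""
    return context_len * KV_BYTES_PER_TOKEN / MEM_BW


def sparse_kv_fetch_time(context_len, tokens_read=2048):
    """Assumed sparse scheme: read at most `tokens_read` cached tokens per step.

    The cost of choosing which tokens to read is ignored in this sketch.
    """
    return min(context_len, tokens_read) * KV_BYTES_PER_TOKEN / MEM_BW


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(f"context={n:9d}  dense={dense_kv_fetch_time(n):.6f}s"
              f"  sparse={sparse_kv_fetch_time(n):.6f}s")
```

In this model the dense fetch time grows linearly with context length, while the sparse fetch time stops growing once the context exceeds the read budget. That change in scaling is what would let context length increase without memory bandwidth becoming the limit.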