“Sparse attention architectures scale better than dense attention in terms of memory fetch time relative to context length.”
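
A back-of-the-envelope sketch can make the claim concrete. It assumes autoregressive decoding, where each new token must read the cached keys and values it attends to, and it models "sparse" as a fixed token budget per step (for example a sliding window or a top-k selection). The model shape, the 1 TB/s bandwidth figure, the 4096-token budget, and all function names are hypothetical and chosen only for illustration. The sketch also ignores any cost of choosing which tokens to attend to, which some sparse schemes pay separately.

```python
def attended_tokens(context_len, sparse_budget=None):
    """Tokens whose keys/values are read for one decode step.

    Dense attention reads the whole context. The sparse variant modeled
    here (an assumption) reads at most `sparse_budget` tokens.
    """
    if sparse_budget is None:
        return context_len
    return min(context_len, sparse_budget)


def kv_bytes_fetched(context_len, sparse_budget=None, n_layers=32,
                     n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache read per decode step (hypothetical model shape)."""
    # Factor 2 covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return attended_tokens(context_len, sparse_budget) * per_token


def fetch_time_s(context_len, sparse_budget=None,
                 bandwidth_bytes_per_s=1e12, **model_shape):
    """Idealized fetch time: bytes read divided by an assumed bandwidth."""
    fetched = kv_bytes_fetched(context_len, sparse_budget, **model_shape)
    return fetched / bandwidth_bytes_per_s


if __name__ == "__main__":
    for n in (4_096, 32_768, 131_072):
        dense_ms = fetch_time_s(n) * 1e3
        sparse_ms = fetch_time_s(n, sparse_budget=4_096) * 1e3
        print(f"context={n:>7}  dense={dense_ms:8.3f} ms  sparse={sparse_ms:8.3f} ms")
```

Under these assumptions, the dense fetch time grows linearly with context length, while the sparse fetch time stops growing once the context exceeds the budget. Real systems can differ: a sparse scheme whose budget grows with context, or one that must scan all keys to pick its subset, would scale less favorably than this sketch suggests.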