“The attention mechanism in Transformer models does not map efficiently to large systolic arrays, whereas Mixture of Experts (MoE) layers do.”
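One line of reasoning behind a claim like this can be made concrete with a rough tile-padding estimate. The sketch below is not taken from the source. It assumes a simplified weight-stationary model in which one matmul operand is tiled onto a square array and any partial tile is zero-padded. The array size, head dimension, sequence length, and layer widths are hypothetical values chosen for illustration. The model ignores pipeline fill and drain, memory bandwidth, softmax, and the number of tokens routed to each expert, all of which affect real utilization.

```python
import math


def tile_utilization(k: int, n: int, array_dim: int) -> float:
    """Fraction of processing elements holding useful values when a k x n
    stationary operand is tiled onto an array_dim x array_dim systolic array.

    Simplified model: partial tiles are zero-padded to the full array size.
    """
    tiles_k = math.ceil(k / array_dim)
    tiles_n = math.ceil(n / array_dim)
    return (k * n) / (tiles_k * tiles_n * array_dim ** 2)


# All sizes below are hypothetical, chosen only for illustration.
ARRAY_DIM = 256   # side length of the systolic array
HEAD_DIM = 128    # per-head dimension in attention
SEQ_LEN = 2048    # sequence length
D_MODEL = 4096    # model (hidden) width
D_FF = 16384      # expert feed-forward width

# Attention scores Q @ K^T for one head: the contraction runs over HEAD_DIM,
# so the stationary operand K^T has shape HEAD_DIM x SEQ_LEN.
attn_scores = tile_utilization(HEAD_DIM, SEQ_LEN, ARRAY_DIM)

# Expert up-projection: the stationary weight has shape D_MODEL x D_FF.
expert_up = tile_utilization(D_MODEL, D_FF, ARRAY_DIM)

print(f"attention scores (per head): {attn_scores:.2f}")
print(f"MoE expert up-projection:    {expert_up:.2f}")
```

Under these assumptions the per-head contraction dimension fills only half of the array's rows, while the expert weight matrix tiles the array exactly. A smaller array, or a head dimension that matches the array side, would close the gap in this model, so the estimate illustrates the shape argument rather than settling the claim.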