“Dima from Fireworks states that Mixture-of-Experts (MoE) models are highly sensitive to numerical mismatches between inference and training, which can cause the model to try to update the weights of an expert that was not actually used during the forward pass.”
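
One way such a mismatch could arise is sketched below. It assumes a top-k router, which the statement above does not specify, and uses made-up logits for a single token. The small difference between the two logit vectors stands in for numerical drift between an inference stack and a training stack. The helper `top_k_experts` and all values are hypothetical, chosen only to illustrate the effect.

```python
def top_k_experts(logits, k):
    """Return the indices of the k largest router logits, highest first."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]


# Hypothetical router logits for one token over four experts,
# as computed by the inference stack.
inference_logits = [1.2000, 0.5001, 0.5000, -0.3000]

# The same token as recomputed by the training stack. The tiny change in
# experts 1 and 2 is an assumed stand-in for numerical mismatch.
training_logits = [1.2000, 0.5000, 0.5002, -0.3000]

inference_experts = top_k_experts(inference_logits, k=2)
training_experts = top_k_experts(training_logits, k=2)

# Experts the trainer would route to (and send gradients through)
# even though inference never used them for this token.
mismatched = sorted(set(training_experts) - set(inference_experts))

print("inference experts:", inference_experts)
print("training experts: ", training_experts)
print("updated but unused at inference:", mismatched)
```

In this toy case a difference in the fourth decimal place is enough to swap which expert is selected, because the two candidates are nearly tied. A dense layer would see only a slightly different output under the same perturbation, whereas a discrete routing decision flips entirely.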