Model architectures like Kimi's, which use residual connections where attention layers attend to ..., Sonic AI
“Model architectures like Kimi's, which use residual connections where attention layers attend to layers several steps back, are difficult to implement using pipeline parallelism.”
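One way to picture the difficulty described in the quote is to count how many layer-to-layer reads cross a pipeline stage boundary. The sketch below is a toy model, not a description of Kimi's actual architecture: it assumes layers are split into equal contiguous stages and that layer i reads the outputs of the previous `lookback` layers. The names `stage_of`, `cross_stage_reads`, `layers_per_stage`, and `lookback` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def stage_of(layer: int, layers_per_stage: int) -> int:
    """Pipeline stage that owns a layer (assumes an equal, contiguous split)."""
    return layer // layers_per_stage


def cross_stage_reads(
    num_layers: int, layers_per_stage: int, lookback: int
) -> List[Tuple[int, int, int]]:
    """List reads whose producer layer lives on a different stage.

    Assumption (hypothetical model): layer i reads the outputs of
    layers i-1, ..., i-lookback. Each entry is
    (consumer_layer, producer_layer, stage_gap).
    """
    reads = []
    for consumer in range(num_layers):
        for producer in range(max(0, consumer - lookback), consumer):
            gap = stage_of(consumer, layers_per_stage) - stage_of(
                producer, layers_per_stage
            )
            if gap > 0:
                reads.append((consumer, producer, gap))
    return reads


# Plain sequential model: each layer reads only the previous layer.
sequential = cross_stage_reads(num_layers=8, layers_per_stage=2, lookback=1)

# Toy "look several layers back" model: each layer reads the previous three.
deep_lookback = cross_stage_reads(num_layers=8, layers_per_stage=2, lookback=3)

print(len(sequential), len(deep_lookback))
print([r for r in deep_lookback if r[2] > 1])
```

In this toy setup, the sequential model only ever hands activations to the next stage, which is the communication pattern pipeline parallelism is built around. With a lookback of three, there are many more cross-stage reads, and some of them skip over an entire stage, so earlier activations would have to be kept alive and forwarded beyond the neighbouring stage.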