AlphaGo's policy network is trained to imitate the entire MCTS visit count distribution, which is..., Sonic AI
“AlphaGo's policy network is trained to imitate the entire MCTS visit count distribution, which is a soft target, rather than just the single best action, which would be a one-hot target.”