“On-policy self-distillation (OPSD) provides a denser supervision signal for model training compared to naive Reinforcement Learning (RL).”