“The on-policy self-distillation (OP-SD) technique for model training does not require an outer-loop verifiable reward signal.”
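
One way to see why no reward term is needed is to write out a toy version of the training loss. The source does not specify the objective, so the sketch below is an assumption: it supposes that OP-SD follows the common on-policy distillation recipe, in which the student samples its own tokens and is scored by a per-token reverse KL against a teacher, and that the "self" teacher is the same model given extra privileged context. The names `toy_logits`, `op_sd_loss`, and `hint` are hypothetical and stand in for a real language model.

```python
import math
import random

VOCAB = 4  # toy vocabulary size (assumption for illustration)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one next-token distribution."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def toy_logits(weights, last_token, hint=None):
    """Hypothetical stand-in for a model: logits depend on the last token.

    When `hint` is given, the SAME weights see privileged context that
    nudges them toward the hinted token. This plays the teacher role.
    """
    logits = list(weights[last_token])
    if hint is not None:
        logits[hint] += 2.0
    return logits


def sample(logits, rng):
    """Draw one token from the softmax distribution over `logits`."""
    probs = [math.exp(x) for x in log_softmax(logits)]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def op_sd_loss(weights, start_token, hint, steps, rng):
    """Average per-token reverse KL along a student-sampled rollout.

    Note what is absent: no verifier, no scalar reward, no answer check.
    The only training signal is the teacher's token distribution.
    """
    token, total = start_token, 0.0
    for _ in range(steps):
        student = toy_logits(weights, token)             # no privileged context
        teacher = toy_logits(weights, token, hint=hint)  # same weights, extra context
        total += reverse_kl(student, teacher)
        token = sample(student, rng)  # on-policy: continue from the student's own sample
    return total / steps


if __name__ == "__main__":
    weights = [[0.1 * (i + j) for j in range(VOCAB)] for i in range(VOCAB)]
    print(op_sd_loss(weights, start_token=0, hint=2, steps=5, rng=random.Random(0)))
```

In this sketch the loss is dense (one term per sampled token) and comes entirely from comparing two forward passes of the same weights, which is why an outcome-level verifier never enters the computation. A real implementation would backpropagate this loss through the student pass only; that detail is also an assumption, not something the quoted sentence states.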