Sonic AI:
“The GRPO algorithm assigns credit for a high-scoring output by up-weighting the less common (lower probability) tokens in the sequence, assuming they were responsible for the positive outcome.”
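
One way to see the effect the quote describes is a minimal sketch, assuming the commonly described GRPO setup: each sampled output gets a group-normalized advantage, and every token in that output shares it. Under a softmax policy, the gradient of `advantage * log p(token)` with respect to the sampled token's own logit is `advantage * (1 - p)`, so tokens the model considered unlikely receive the larger push. The function names, rewards, and token probabilities below are hypothetical, chosen only for illustration, and the sketch leaves out the clipping and KL terms of the full objective.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (reward - group mean) / group std.

    Assumption: one scalar reward per sampled output, normalized within
    the group of outputs sampled for the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sampled_logit_gradients(advantage, token_probs):
    """Gradient of advantage * log p(token) w.r.t. each sampled token's logit.

    For a softmax policy, d log p_k / d z_k = 1 - p_k, so the shared
    sequence-level advantage is scaled by (1 - p) at every position.
    """
    return [advantage * (1.0 - p) for p in token_probs]


# Hypothetical group of four outputs for one prompt; outputs 0 and 3 scored well.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Hypothetical probabilities the policy gave to the three tokens of output 0.
token_probs = [0.9, 0.2, 0.6]
grads = sampled_logit_gradients(advantages[0], token_probs)

for p, g in zip(token_probs, grads):
    print(f"token prob {p:.1f} -> logit gradient {g:.2f}")
```

In this toy group, the token sampled with probability 0.2 gets the largest gradient and the token sampled with probability 0.9 the smallest, even though all three share the same advantage. The same scaling applies with the sign flipped for low-scoring outputs, whose advantage is negative.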