“A key innovation of the GRPO algorithm was to completely discard the value model (or critic model) that was central to its predecessor, PPO.”
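To make the contrast with PPO concrete, here is a minimal sketch of what can stand in for the critic. It assumes the commonly described GRPO formulation, in which several responses are sampled for the same prompt and each response's advantage is its reward normalized against the group's mean and standard deviation. The function name `group_relative_advantages`, the `eps` term, and the choice of population standard deviation are illustrative assumptions, not details taken from the quoted text.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Estimate advantages without a learned value model.

    `rewards` holds one scalar reward per response sampled for the
    SAME prompt. The group mean plays the role of the baseline that a
    critic would otherwise predict in PPO.
    """
    mu = mean(rewards)
    # Assumption: population std; some implementations use sample std.
    sigma = pstdev(rewards)
    # eps (hypothetical default) avoids division by zero when every
    # response in the group receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by a reward function.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above the group average get a positive advantage and those below get a negative one, so no separate network has to be trained to supply a baseline.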