Sonic AI
“Reinforcement learning (RL) structurally optimizes for changing the minimum number of token log probabilities required to achieve a correct answer, whereas supervised fine-tuning (SFT) overrides the entire output sequence.”
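One way to make the quoted claim concrete is to score the same response under a base model and a fine-tuned model, then count the token positions whose log-probability moved noticeably. The claim predicts a small count after RL and a large count after SFT. The sketch below assumes you already have per-token log-probabilities for both models. The function name `count_changed_tokens`, the `tol` threshold of 0.1 nats, and the toy numbers in the usage block are hypothetical illustrations, not measured results.

```python
from typing import Sequence


def count_changed_tokens(
    base_logprobs: Sequence[float],
    tuned_logprobs: Sequence[float],
    tol: float = 0.1,  # hypothetical threshold, in nats
) -> int:
    """Count token positions whose log-probability shifted by more than `tol`.

    Both sequences must be log-probabilities of the *same* tokens, one value
    per position, scored under the base and the fine-tuned model.
    """
    if len(base_logprobs) != len(tuned_logprobs):
        raise ValueError("both models must score the same token sequence")
    return sum(
        1 for b, t in zip(base_logprobs, tuned_logprobs) if abs(t - b) > tol
    )


if __name__ == "__main__":
    # Toy, made-up values for illustration only (not real model outputs).
    base = [-0.01, -0.05, -2.3, -0.02]
    rl_like = [-0.01, -0.04, -0.2, -0.02]   # only one position moves a lot
    sft_like = [-0.5, -0.6, -0.3, -0.4]     # every position moves
    print(count_changed_tokens(base, rl_like))   # 1
    print(count_changed_tokens(base, sft_like))  # 4
```

Counting positions above a fixed threshold is only one possible metric; a per-token KL divergence between the two models' full next-token distributions would capture shifts that the log-probability of the chosen token alone can miss.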