“Reinforcement learning (RL) structurally optimizes for changing the minimum number of token log probabilities required to achieve a correct answer, whereas supervised fine-tuning (SFT) overrides the entire output sequence.”

Kyle Corbitt · LLMs
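One way to read the quote is in terms of which token positions each objective's gradient actually touches. The sketch below is a toy illustration, not the author's analysis, and everything in it is an assumption made for the example: each position has its own independent logits (a real LLM shares parameters across positions), the reward is 1 only when the final token matches a hypothetical reference, and the policy gradient is computed exactly by enumerating all sequences rather than by sampling.

```python
import itertools
import math


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sft_grad(logits, reference):
    """Gradient of sum_t log p_t(reference[t]) w.r.t. each position's logits."""
    grads = []
    for pos_logits, tok in zip(logits, reference):
        probs = softmax(pos_logits)
        # d log softmax(z)[tok] / dz[v] = 1[v == tok] - p[v]
        grads.append([(1.0 if v == tok else 0.0) - p for v, p in enumerate(probs)])
    return grads


def rl_grad(logits, reward_fn):
    """Exact policy gradient E[(r - b) * grad log p(y)] with b = E[r]."""
    probs = [softmax(pos_logits) for pos_logits in logits]
    vocab = len(logits[0])
    seqs = list(itertools.product(range(vocab), repeat=len(logits)))
    seq_prob = {s: math.prod(probs[t][tok] for t, tok in enumerate(s)) for s in seqs}
    baseline = sum(seq_prob[s] * reward_fn(s) for s in seqs)
    grads = [[0.0] * vocab for _ in logits]
    for s in seqs:
        adv = reward_fn(s) - baseline  # advantage of this whole sequence
        for t, tok in enumerate(s):
            for v in range(vocab):
                indicator = 1.0 if v == tok else 0.0
                grads[t][v] += seq_prob[s] * adv * (indicator - probs[t][v])
    return grads


def touched_positions(grads, tol=1e-9):
    """Positions whose logits receive a non-negligible gradient."""
    return [t for t, g in enumerate(grads) if max(abs(x) for x in g) > tol]


# Toy setup (all values hypothetical): 4 positions, vocabulary of 3 tokens.
logits = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [0.1, 0.1, -0.3], [-0.5, 0.2, 0.0]]
reference = (2, 1, 0, 1)


def reward(seq):
    # Assumed reward: only the final token decides correctness.
    return 1.0 if seq[-1] == reference[-1] else 0.0


sft_touched = touched_positions(sft_grad(logits, reference))
rl_touched = touched_positions(rl_grad(logits, reward))
print("SFT touches positions:", sft_touched)  # [0, 1, 2, 3]
print("RL touches positions:", rl_touched)    # [3]
```

In this toy the supervised gradient pushes every position toward the reference sequence, while the expected policy gradient is nonzero only at the position that determines the reward. A sampled estimator would add noise at the other positions, and shared parameters blur the separation in a real model, so the example shows the tendency the quote describes rather than a guarantee.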
