“To prevent a model from learning to evade a hallucination detection probe during RL training, Goodfire's technique runs the probe on a frozen, separate copy of the model, making it computationally easier for the student model to change its behavior than to evade the fixed detector.”

Tom McGrath
LLMs
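
The idea in the quote can be illustrated with a toy sketch. This is not Goodfire's implementation: `ToyModel`, `probe_score`, `shaped_reward`, the probe weights, and the penalty term are all hypothetical names and values chosen for illustration. The sketch assumes the probe is a linear classifier that reads activations from a frozen copy of the model run over the trained model's output tokens, so updates to the trained model's internals cannot move the probe's score. Only a change in the output itself can.

```python
import copy
import math
import random


class ToyModel:
    """Hypothetical stand-in for an LLM: one embedding vector per token id."""

    def __init__(self, seed, vocab=8, hidden=4):
        rng = random.Random(seed)
        self.embed = [[rng.gauss(0.0, 1.0) for _ in range(hidden)]
                      for _ in range(vocab)]

    def activations(self, tokens):
        # Mean-pool the token embeddings into a single hidden vector.
        hidden = len(self.embed[0])
        return [sum(self.embed[t][i] for t in tokens) / len(tokens)
                for i in range(hidden)]


def probe_score(frozen_model, probe_w, tokens):
    """Linear probe (assumed form) applied to the FROZEN copy's activations."""
    h = frozen_model.activations(tokens)
    logit = sum(w * x for w, x in zip(probe_w, h))
    return 1.0 / (1.0 + math.exp(-logit))


def shaped_reward(task_reward, frozen_model, probe_w, tokens, penalty=1.0):
    # Assumed reward shaping: subtract the probe's hallucination score.
    return task_reward - penalty * probe_score(frozen_model, probe_w, tokens)


policy = ToyModel(seed=0)          # the model being trained
frozen = copy.deepcopy(policy)     # separate copy that is never updated
probe_w = [0.5, -1.0, 0.25, 2.0]   # illustrative, fixed probe weights
tokens = [1, 3, 5]                 # an output sampled from the policy

score_before = probe_score(frozen, probe_w, tokens)

# Simulate an RL update that shifts the policy's internal representations.
for row in policy.embed:
    for i in range(len(row)):
        row[i] += 10.0

score_after = probe_score(frozen, probe_w, tokens)
```

In this toy setup `score_before` and `score_after` are identical even though the policy's own activations have shifted, which is the property the quote describes: the detector is a fixed target, so the cheapest way to lower the penalty is to change what the model outputs.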
