Goodfire's controversial training technique involved running an interpretability detector on a fr..., Sonic AI
“Goodfire's controversial training technique involved running an interpretability detector on a frozen copy of a model to generate a penalty signal, which was then used to train a separate, active version of the model.”