“Goodfire's controversial training technique involved running an interpretability detector on a frozen copy of a model to generate a penalty signal, which was then used to train a separate, active version of the model.”

AI Safety

Loading full analysis…