“The speaker predicts that Goodfire's "frozen model" technique for applying interpretability signals during training will fail to prevent catastrophic outcomes when applied to sufficiently large and advanced AI models.”

AI Safety

Loading full analysis…