“Techniques like pre-training data filtering and unlearning can remove dangerous capabilities from open-source models, but this only buys time before general capabilities allow the model to re-acquire the knowledge.”

Jeffrey Irving · AI Safety
