“Research has shown that fine-tuning large language models can lead to unpredictable and adverse emergent behaviors, such as a model fine-tuned on one negative task becoming generally malevolent.”

Nathan LabenzAI Safety

Loading full analysis…