“Interpretability techniques like sparse autoencoders have demonstrated that LLMs possess internal world models by allowing researchers to identify and causally intervene on specific concepts, such as the Golden Gate Bridge in an experiment with a Claude model.”

Nathan LabenzAI Safety

Loading full analysis…