Jeffrey Irving, Chief Scientist at the UK AI Security Institute (AISI), presents a concerning view on AI safety, arguing current techniques are unlikely to achieve high reliability and share common failure modes.
The AISI, a highly capable government body, is actively evaluating frontier models for catastrophic risks (bioweapons, cyberattacks, loss of control) and its red team has never failed to jailbreak a model.
AI capabilities are advancing rapidly, particularly in persuasion and complex reasoning, with some industry leaders predicting expert-level AI researchers within three years, outpacing safety progress.
There is a critical need for more robust, theoretically-grounded safety research, independent of AI developers, as current empirical approaches are insufficient to solve core problems like reward hacking.
12 quotes
Concerns Raised
Current AI safety techniques are unlikely to achieve high reliability and share common failure modes.
AI models are developing dangerous capabilities (persuasion, deception, cyber) faster than safety and control methods are improving.
The core problem of 'reward hacking' underlies many observed bad behaviors, and we lack a robust solution.
A lack of deep theoretical understanding of machine learning leads to overconfidence in unreliable mental models of AI behavior.
Opportunities Identified
The UK AI Security Institute provides a model for independent, government-led evaluation of frontier AI.
There is a significant opportunity to fund and develop more robust, theoretically-grounded safety research in fields like complexity and game theory.
Building a broader ecosystem of independent safety research in academia and nonprofits can diversify approaches and reduce reliance on developer-led safety efforts.