Jeffrey Irving, Sonic AI


Jeffrey Irving

Person · Tech

16 Mentions · 16 Claims

The UK AI Security Institute conducted a long-term red-teaming collaboration with Anthropic and OpenAI that found significantly more jailbreaks than a normal pre-deployment evaluation would have.

Official source · Jeffrey Irving · Apr 3

Many sophisticated bad behaviors recently observed in AI systems, such as sycophancy and deception, are different manifestations of the same underlying problem: reward hacking.

Expert perspective · Jeffrey Irving · Apr 3

A key failure mode for AI debate, discovered in human experiments by Beth Barnes, is for one party to steer the conversation into a confusing area where neither side knows the answer, which can fool t...

Expert perspective · Jeffrey Irving · Apr 3

Techniques like pre-training data filtering and unlearning can remove dangerous capabilities from open-source models, but this only buys time before general capabilities allow the model to re-acquire ...

Expert perspective · Jeffrey Irving · Apr 3

The UK AI Security Institute has evaluated more than 30 models or testing runs, and in every safeguard test it has conducted, it has successfully jailbroken the model.

Official source · Jeffrey Irving · Apr 3

Over the last year, despite various AI models exhibiting deceptive behaviors, the world's primary response has been to continue training stronger models.

Expert perspective · Jeffrey Irving · Apr 3

It is important to build up independent AI safety and security research at nonprofits, in academia, and in governments, rather than having most of the work happen at AI developer labs.

Expert perspective · Jeffrey Irving · Apr 3

The UK AI Security Institute (AISI) has approximately 100 technical experts on staff and around 200 total employees.

Official source · Jeffrey Irving · Apr 3

Current pragmatic AI safety approaches, such as AI control measures, monitoring, and honesty training, have correlated failure modes and could all fail for the same underlying reason.

Expert perspective · Jeffrey Irving · Apr 3

Reinforcement learning is being successfully used to improve AI model capabilities in non-verifiable domains, such as analyzing a photo of a biology experiment.

Expert perspective · Jeffrey Irving · Apr 3

A study by the UK AI Security Institute's human influence team found that AI models are highly persuasive on political questions, with newer models more persuasive than older ones.

Official source · Jeffrey Irving · Apr 3

The risk of AI loss of control is more strongly coupled with cybersecurity risks than with biological risks.

Expert perspective · Jeffrey Irving · Apr 3

The three main catastrophic risks the UK AI Security Institute focuses on are biological weapons, large-scale cyber attacks, and loss of control.

Official source · Jeffrey Irving · Apr 3

The UK AI Security Institute conducts its evaluations with Inspect, an open-source package that other governments and AI developers also use for model testing.

Official source · Jeffrey Irving · Apr 3

The UK AI Security Institute places significant probability on the idea that current AI methods will scale, and where they don't, more mundane algorithmic progress will fill the gaps.

Official source · Jeffrey Irving · Apr 3

The typical AI developer plan for addressing loss-of-control risk involves using pragmatic safety measures and monitoring to buy time until automated safety research can find better solutions.

Expert perspective · Jeffrey Irving · Apr 3
