Goodfire's first proof of concept for 'intentional design' is a technique that uses a probe train..., Sonic AI
“Goodfire's first proof of concept for 'intentional design' is a technique that uses a probe trained to detect hallucinations to both steer a model at runtime and provide a reward signal for reinforcement learning.”
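
To make the quoted idea concrete, here is a minimal sketch of how a single probe could serve both roles. It is not Goodfire's implementation: the synthetic "activations", the logistic-regression probe, the steering rule (shifting an activation against the probe direction), and every name and parameter below (`train_probe`, `steer`, `reward`, `alpha`) are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_probe(acts, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with plain gradient descent."""
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = sigmoid(acts @ w + b)
        grad = p - labels              # gradient of the log loss w.r.t. the logit
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_score(act, w, b):
    """Probe's estimated probability that an activation is 'hallucinated'."""
    return sigmoid(act @ w + b)

def steer(act, w, alpha=4.0):
    """Runtime steering (assumed rule): shift the activation against the probe direction."""
    direction = w / np.linalg.norm(w)
    return act - alpha * direction

def reward(act, w, b):
    """RL reward (assumed form): higher when the probe sees less hallucination."""
    return 1.0 - probe_score(act, w, b)

# Synthetic stand-in for model activations (hypothetical data, not real model internals).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
clean = rng.normal(size=(200, d)) - 2.0 * true_dir
halluc = rng.normal(size=(200, d)) + 2.0 * true_dir
acts = np.vstack([clean, halluc])
labels = np.concatenate([np.zeros(200), np.ones(200)])

w, b = train_probe(acts, labels)

sample = halluc[0]
steered = steer(sample, w)
print("probe score before steering:", probe_score(sample, w, b))
print("probe score after steering: ", probe_score(steered, w, b))
print("reward before / after:      ", reward(sample, w, b), reward(steered, w, b))
```

In this toy setup the same weight vector `w` is used twice: as a direction to move activations away from at inference time, and as a scorer whose output is turned into a scalar reward. A real system would read activations from a model's internals and feed the reward into an RL training loop, both of which are omitted here.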