“Notion's AI evaluation process uses two distinct sets of tests: a "golden set" where the model must achieve 100% accuracy, and a "challenging set" where it is expected to fail around 50% of the time to measure improvements.”

Ryan Nystrom
AI / ML
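
The two-set scheme described in the quote can be sketched roughly as follows. This is a minimal illustration, not Notion's actual tooling: `SuiteResult`, `golden_gate`, `challenging_delta`, and `evaluate` are hypothetical names, and the sketch assumes each set reduces to simple pass/fail counts. The golden set acts as a hard gate, while the challenging set is only compared against a baseline run to see whether a change helped.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    """Pass/fail counts for one run of a test set (hypothetical structure)."""
    name: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # An empty run counts as a 0.0 pass rate rather than dividing by zero.
        return self.passed / self.total if self.total else 0.0


def golden_gate(result: SuiteResult) -> bool:
    """Golden set: every case must pass, so a single failure blocks the change."""
    return result.total > 0 and result.passed == result.total


def challenging_delta(baseline: SuiteResult, candidate: SuiteResult) -> float:
    """Challenging set: no fixed bar, only the change in pass rate versus a baseline."""
    return candidate.pass_rate - baseline.pass_rate


def evaluate(golden: SuiteResult,
             baseline_hard: SuiteResult,
             candidate_hard: SuiteResult) -> dict:
    """Combine both signals: a hard gate plus an improvement measurement."""
    return {
        "golden_ok": golden_gate(golden),
        "challenging_pass_rate": candidate_hard.pass_rate,
        "challenging_delta": challenging_delta(baseline_hard, candidate_hard),
    }
```

A set that sits near a 50% pass rate leaves room to move in both directions, so regressions and improvements both show up in `challenging_delta`; a set that already passes at 100% can only reveal regressions.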
