Shreya Shankar — Sonic AI
Shreya Shankar
Person · Tech
11 Mentions · Episodes · 11 Claims
Claims
All (11) · Finance (0) · Healthcare (0) · Government (0) · Tech (8) · Energy (0) · Science (0) · Geopolitics (0)
Research shows that developers' criteria for 'good' and 'bad' LLM outputs evolve as they review more examples, a phenomenon known as 'criteria drift', making it impossible to define a complete evaluat...
Expert perspective · Shreya Shankar · Apr 3
Products like Anthropic's Claude Code are built upon foundational models that have been extensively evaluated on coding benchmarks, even if the application team itself claims to rely more on 'vibes'.
Expert perspective · Shreya Shankar · Apr 3
For most AI products, a small number of 'LLM as a judge' evals, typically between four and seven, is sufficient to cover the most critical failure modes.
Expert perspective · Shreya Shankar · Apr 3
LLM judges can be used both in offline unit tests or CI/CD pipelines and for online monitoring of real production traces to measure failure rates over time.
Expert perspective · Shreya Shankar · Apr 3
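As a minimal sketch of the online-monitoring half of this claim: the same binary judge used in offline tests can be run over production traces to track a failure rate over time. The `judge` callable and the trace data here are hypothetical stand-ins, not part of any specific product's pipeline.

```python
# Hypothetical sketch: running a binary LLM judge over logged production
# traces to measure failure rate per day. `judge` is a stand-in callable
# that returns True when a trace passes.

from collections import defaultdict
from datetime import date

def failure_rate_by_day(traces, judge):
    """traces: iterable of (day, output) pairs from production logs."""
    totals, failures = defaultdict(int), defaultdict(int)
    for day, output in traces:
        totals[day] += 1
        if not judge(output):
            failures[day] += 1
    return {day: failures[day] / totals[day] for day in totals}

# Made-up example traces; a trivial judge that flags the string "bad".
traces = [
    (date(2025, 4, 1), "good"), (date(2025, 4, 1), "bad"),
    (date(2025, 4, 2), "good"), (date(2025, 4, 2), "good"),
]
rates = failure_rate_by_day(traces, judge=lambda out: out != "bad")
print(rates)  # one failure rate per day
```

The same `judge` function can sit behind an assertion in a CI test suite, which is what makes the offline/online reuse cheap.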
The initial process of setting up a robust evaluation system for an AI product typically takes three to four days, followed by an ongoing maintenance cost of about 30 minutes per week.
Expert perspective · Shreya Shankar · Apr 3
LLMs often fail at automated error analysis because they lack the necessary product context to identify certain failures, such as hallucinating a feature that does not exist.
Expert perspective · Shreya Shankar · Apr 3
The acquisition of A/B testing company Statsig by OpenAI was a strategic move, potentially influenced by the fact that OpenAI's competitors were also using Statsig's platform.
Speculative · Shreya Shankar · Apr 3
An 'LLM as a judge' is most effective when scoped to evaluate a single, narrow failure mode with a binary pass/fail output.
Expert perspective · Shreya Shankar · Apr 3
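A minimal sketch of what such a narrowly scoped judge could look like. The prompt checks exactly one failure mode (hallucinated features) and forces a binary verdict; `call_llm`, the prompt wording, and the feature list are all hypothetical placeholders for whatever client and product the team actually uses.

```python
# Hypothetical sketch: an LLM judge scoped to one failure mode with a
# binary PASS/FAIL output. `call_llm` is a stand-in for a real LLM client.

JUDGE_PROMPT = """\
You are checking one thing only: does the response mention a product
feature that does not exist? The product has exactly these features:
{features}

Response to check:
{response}

Answer with the single word PASS or FAIL."""

def judge_hallucinated_feature(response: str, features: list[str], call_llm) -> bool:
    """Return True if the response passes this single check."""
    prompt = JUDGE_PROMPT.format(features=", ".join(features), response=response)
    verdict = call_llm(prompt).strip().upper()
    return verdict == "PASS"

# Usage with a fake model that always answers PASS:
ok = judge_hallucinated_feature(
    "You can export your notes as PDF.",
    ["export to PDF", "dark mode"],
    call_llm=lambda prompt: "PASS",
)
```

Keeping the output binary is what makes the judge easy to validate later: each verdict maps directly onto a pass/fail human label.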
To validate an 'LLM as a judge', teams should compare its outputs against human-labeled data using a confusion matrix to analyze false positives and false negatives, rather than relying on a simple ac...
Expert perspective · Shreya Shankar · Apr 3
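The validation step described here can be sketched in a few lines. The label lists below are made-up illustration data: `True` means a human (or the judge) marked the trace as a failure. The point of the matrix is visible in the example, where a decent-looking accuracy hides a missed real failure.

```python
# Hypothetical sketch: validating an LLM judge against human labels with
# a confusion matrix rather than a single accuracy number.

from collections import Counter

def confusion_matrix(human, judge):
    """Count (human, judge) label pairs; True = the trace failed."""
    counts = Counter(zip(human, judge))
    return {
        "true_positive": counts[(True, True)],    # judge caught a real failure
        "false_positive": counts[(False, True)],  # judge flagged a good trace
        "false_negative": counts[(True, False)],  # judge missed a failure
        "true_negative": counts[(False, False)],
    }

# Illustration data: 8 traces, 2 real failures.
human = [True, False, False, False, True, False, False, False]
judge = [True, False, True, False, False, False, False, False]

cm = confusion_matrix(human, judge)
accuracy = (cm["true_positive"] + cm["true_negative"]) / len(human)
# Accuracy looks fine (0.75), yet the matrix shows the judge missed
# one of the two real failures -- exactly what accuracy alone hides.
print(cm, accuracy)
```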
The concept of error analysis, including 'open coding' and 'axial coding', is a long-standing technique from machine learning and social science, not a new invention for LLMs.
Expert perspective · Shreya Shankar · Apr 3
OpenAI's evaluation methods include analyzing public sentiment from sources like Twitter and Reddit to identify product issues.
Expert perspective · Shreya Shankar · Apr 3