Rather than relying on a single, ever-larger model, future AI systems will be built as 'networks of networks' that compose calls to multiple models, often smaller and specialized. Using techniques like ensembling and routing, this approach can achieve better performance and reliability than any single model call.
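As a rough illustration of the two techniques named here, the sketch below treats a model as any function from prompt to answer. The names `ensemble`, `route`, `classify`, and the stub models are hypothetical and exist only to make the example runnable; a real system would replace the stubs with calls to hosted models.

```python
from collections import Counter
from typing import Callable, Dict, List

# A "model" is any function from prompt to answer (a stand-in for an API call).
Model = Callable[[str], str]


def ensemble(models: List[Model], prompt: str) -> str:
    """Ask every model and return the most common answer (majority vote)."""
    answers = [model(prompt) for model in models]
    return Counter(answers).most_common(1)[0][0]


def route(classify: Callable[[str], str], experts: Dict[str, Model], prompt: str) -> str:
    """Send the prompt to the single expert chosen by the classifier."""
    return experts[classify(prompt)](prompt)


# Hypothetical stub models, used only to make the sketch runnable.
def small_a(prompt: str) -> str:
    return "4"


def small_b(prompt: str) -> str:
    return "4"


def small_c(prompt: str) -> str:
    return "5"


def math_model(prompt: str) -> str:
    return "math:" + prompt


def code_model(prompt: str) -> str:
    return "code:" + prompt


def classify(prompt: str) -> str:
    """Toy router: any digit in the prompt sends it to the math expert."""
    return "math" if any(ch.isdigit() for ch in prompt) else "code"
```

Ensembling pays for several calls to get one more reliable answer, while routing pays for one call but picks which model receives it; the two can be combined.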
The discussion highlights the enormous spread in cost between AI models and the fact that compute is now a primary P&L line item for AI companies, often exceeding personnel costs. Compound systems offer a way to reach frontier-level performance at up to 1000x lower cost by relying on cheaper models.
The speaker repeatedly emphasizes that clever system design can break traditional trade-offs. Methods like 'laconic decoding' (parallel calls with early stopping) can be simultaneously faster, more accurate, and cheaper than a single model call.
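One possible reading of "parallel calls with early stopping" is sketched below: launch several identical calls at once, keep whichever finishes first, and cancel the rest so they stop consuming compute. This is an assumption about the technique, not the speaker's implementation; `first_to_finish` and `make_fake_model` are hypothetical names, and the fake model simply sleeps to simulate responses of different lengths.

```python
import asyncio
from typing import Awaitable, Callable, Iterable

AsyncModel = Callable[[str], Awaitable[str]]


async def first_to_finish(call: AsyncModel, prompt: str, n: int = 4) -> str:
    """Run n calls in parallel and return the result of the first to complete."""
    tasks = [asyncio.create_task(call(prompt)) for _ in range(n)]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    # Early stopping: cancel the calls that are still running.
    for task in pending:
        task.cancel()
    # Wait for the cancellations to settle so no task is left dangling.
    await asyncio.gather(*pending, return_exceptions=True)
    # Note: if the fastest call failed, .result() re-raises its exception.
    return next(iter(done)).result()


def make_fake_model(delays: Iterable[float]) -> AsyncModel:
    """Hypothetical stand-in for a model: each call takes the next delay in turn."""
    remaining = iter(delays)

    async def call(prompt: str) -> str:
        delay = next(remaining)
        await asyncio.sleep(delay)  # simulates a longer or shorter generation
        return f"answer after {delay}s"

    return call
```

Under this sketch, `asyncio.run(first_to_finish(make_fake_model([0.3, 0.05, 0.2]), "q", n=3))` returns the answer from the 0.05-second call. Latency is set by the fastest sample, and cancelling the others limits the extra cost of running in parallel.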
The rise of compound AI necessitates new infrastructure and programming models. Foundry's cloud is built for scheduling diverse ML workloads, while its Ember framework aims to be for 'networks of networks' what PyTorch was for neural networks.
Compound techniques are most powerful on tasks where verifying a correct answer is easier than generating one, such as coding or mathematics. This allows a 'judge' or 'verifier' model to effectively select the best output from multiple 'generator' models.
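The asymmetry between generating and verifying can be shown with a toy task: factoring an integer is hard to do, but a proposed factorization is checked with one multiplication. In the minimal sketch below, the generators are hypothetical stubs standing in for model calls, and `verify` plays the role of the judge; the names and the task are illustrative assumptions, not taken from the talk.

```python
from typing import Callable, List, Optional, Tuple

Candidate = Tuple[int, int]
Generator = Callable[[int], Candidate]


def verify(n: int, candidate: Candidate) -> bool:
    """Cheap check: both factors are non-trivial and multiply back to n."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def select_verified(n: int, generators: List[Generator]) -> Optional[Candidate]:
    """Return the first candidate that passes verification, or None if none do."""
    for generate in generators:
        candidate = generate(n)
        if verify(n, candidate):
            return candidate
    return None


# Hypothetical stub generators: two wrong guesses and one correct one.
def guess_wrong(n: int) -> Candidate:
    return (9, 10)


def guess_trivial(n: int) -> Candidate:
    return (1, n)


def guess_right(n: int) -> Candidate:
    return (7, 13)
```

Here `select_verified(91, [guess_wrong, guess_trivial, guess_right])` returns `(7, 13)`. Because the check is cheap and exact, adding more generators can only raise the chance of finding a verified answer; when the verifier is itself a fallible model, that guarantee weakens.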