Is It Time to Rethink LLM Pre-Training? with Aditi Raghunathan - #747
From TwiML AI Podcast
Aditi Raghunathan • Assistant Professor of Computer Science, Carnegie Mellon University
Executive Summary
Overtraining large language models on excessive data can paradoxically make them worse for downstream fine-tuning, a finding that challenges the "more data is always better" paradigm.
The standard next-token prediction objective fundamentally limits LLM creativity and long-range planning, as it encourages local, greedy decisions rather than global coherence.
New training techniques like "memorization sinks" offer a path toward more controllable and editable models by intentionally localizing specific information (e.g., facts, private data) into designated neurons that can be selectively ignored or updated.
Alternative training methods, such as multi-token prediction and diffusion-like processes, show promise for overcoming the creative limitations of current models by forcing them to plan entire sequences at once.
Concerns Raised
Overtraining models on excessive data degrades their fine-tuning potential.
The next-token prediction paradigm fundamentally limits LLM creativity and long-range planning (see the sketch after this list).
Benchmark performance is an increasingly poor proxy for real-world model utility and adaptability.
Retrieval-Augmented Generation (RAG) is not a complete solution for keeping models updated, as models often fail to let retrieved context override their parametric knowledge.
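To ground the concern about next-token prediction, here is a minimal sketch of the standard teacher-forcing objective in PyTorch (a generic illustration, not code from the episode): each position is supervised only on the single token that follows it, which is the local, greedy signal the discussion criticizes.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction (teacher forcing).

    logits: (batch, seq_len, vocab_size) model outputs
    tokens: (batch, seq_len) input token ids

    Position t is scored only against token t+1; nothing in the loss
    directly rewards global coherence or multi-step planning.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # drop last position
    target = tokens[:, 1:].reshape(-1)                     # shift targets by one
    return F.cross_entropy(pred, target)
```

Because the gradient at every step depends only on the immediately following token, any long-range structure has to emerge implicitly rather than being optimized for.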
Opportunities Identified
"Memorization sinks" can enable targeted information removal and updates, improving model control and safety.
Multi-token prediction and diffusion-like training can unlock greater creativity and structured generation (see the second sketch after this list).
Disentangling factual knowledge from reasoning abilities could make models more robust and easier to maintain.
Developing better evaluation metrics focused on adaptability could guide the creation of more useful models.
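First, a minimal sketch of the memorization-sink idea, assuming a per-document gating scheme over a reserved slice of MLP hidden units. The layout, the hash-based block assignment, and all names here are illustrative assumptions, not the exact published design: the point is only that document-specific memorization is steered into "sink" units that can later be silenced without disturbing the shared units.

```python
import torch
import torch.nn as nn

class SinkGatedMLP(nn.Module):
    """Illustrative 'memorization sink' MLP block (hypothetical layout).

    The last n_sink hidden units are gated per document: each document
    activates only its own sink sub-block, so verbatim, document-specific
    content tends to be stored there, while the always-on shared units
    learn general features.
    """

    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 n_sink: int = 512, n_blocks: int = 8):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.d_hidden, self.n_sink, self.n_blocks = d_hidden, n_sink, n_blocks

    def gate(self, doc_ids: torch.Tensor, forget: bool = False) -> torch.Tensor:
        """Build a (batch, d_hidden) mask: shared units always on, each
        document's sink sub-block on only for its own sequences."""
        mask = torch.ones(doc_ids.shape[0], self.d_hidden,
                          device=doc_ids.device)
        mask[:, -self.n_sink:] = 0.0             # sinks off by default
        if forget:
            return mask                          # silence all sinks: "unlearn"
        block = self.n_sink // self.n_blocks
        for i, d in enumerate(doc_ids.tolist()):
            # Hypothetical assignment: hash each doc id to a sink sub-block.
            start = self.d_hidden - self.n_sink + (d % self.n_blocks) * block
            mask[i, start:start + block] = 1.0   # owner's sink sub-block on
        return mask

    def forward(self, x: torch.Tensor, doc_ids: torch.Tensor,
                forget: bool = False) -> torch.Tensor:
        h = torch.relu(self.up(x))                        # (batch, seq, d_hidden)
        h = h * self.gate(doc_ids, forget).unsqueeze(1)   # per-document gating
        return self.down(h)
```

At removal time, running the block with forget=True zeros every sink unit, approximating deletion of the memorized content while leaving the shared pathway intact.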
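Second, a sketch of multi-token prediction, following the general k-parallel-heads variant from the literature (head count, shapes, and names are assumptions for illustration, not the speakers' exact method): each position must predict several future tokens at once, so the hidden state has to encode a short plan rather than just the next word.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """Sketch of a multi-token prediction objective (illustrative).

    Head i predicts the token i steps ahead from the same trunk hidden
    state, so the representation carries information about a window of
    future tokens, not just the next one.
    """

    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size)
                                   for _ in range(k))

    def loss(self, hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """hidden: (batch, seq_len, d_model) trunk states; tokens: (batch, seq_len)."""
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-i, :])    # position t predicts token t+i
            target = tokens[:, i:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1))
        return total / len(self.heads)
```

The diffusion-like alternative mentioned in the episode goes further by denoising an entire sequence jointly; the shared intuition is that the training signal spans more than one future token.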