“In data-constrained training regimes, using weight decay values up to 30 times larger than those used in compute-optimal pre-training can prevent overfitting and allow for continued performance gains with larger models.”

Kan Wu
AI / ML
