In data-constrained training regimes, using weight decay values up to 30 times larger than those ..., Sonic AI
“In data-constrained training regimes, using weight decay values up to 30 times larger than those used in compute-optimal pre-training can prevent overfitting and allow for continued performance gains with larger models.”
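To make the scale of that change concrete, here is a minimal sketch of how a 30x larger weight decay acts on parameters. It uses plain SGD with decoupled weight decay as a stand-in for whatever optimizer the original work used. The baseline of 0.1, the learning rate, the toy weights, and the helper names (`decayed_step`, `l2_norm`, `run`) are all assumptions chosen for illustration; only the 30x multiplier comes from the quote. Gradients are set to zero so that the shrinkage from decay is the only effect visible.

```python
# All constants below are illustrative assumptions, except the 30x multiplier,
# which is the upper end of the range named in the quote.
BASE_WEIGHT_DECAY = 0.1   # assumed baseline weight decay
MULTIPLIER = 30           # "up to 30 times larger"
LEARNING_RATE = 0.01      # assumed learning rate


def decayed_step(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay: w <- w * (1 - lr * wd) - lr * g."""
    shrink = 1.0 - lr * weight_decay
    return [w * shrink - lr * g for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a list of weights."""
    return sum(w * w for w in weights) ** 0.5


def run(weight_decay, steps=100):
    """Apply `steps` updates with zero gradient and return the final weight norm."""
    weights = [1.0, -2.0, 0.5]   # toy parameters, hypothetical
    grads = [0.0, 0.0, 0.0]      # zero gradient isolates the effect of decay
    for _ in range(steps):
        weights = decayed_step(weights, grads, LEARNING_RATE, weight_decay)
    return l2_norm(weights)


baseline_norm = run(BASE_WEIGHT_DECAY)
strong_norm = run(BASE_WEIGHT_DECAY * MULTIPLIER)

print("weight norm after 100 steps, baseline decay:", baseline_norm)
print("weight norm after 100 steps, 30x decay:     ", strong_norm)
```

With the larger setting the weights are pulled toward zero much faster, which is the regularizing pressure the quote credits with preventing overfitting. In a real training run the gradient term would push back against this shrinkage, so the sketch shows only the direction of the effect, not its size in practice.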