Kuan Wu

Person

Kuan Wu is mentioned 9 times across podcast episodes and expert conversations analyzed by Sonic.

What Kuan Wu has said

In data-constrained settings, it is more effective to train an ensemble of smaller models than to train a single large model with the same total parameter count.

neutral·Kan Wu
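
The claim above does not say how the ensemble members are combined. As a minimal sketch, assuming the members' next-token probabilities are simply averaged (an assumption, not something stated in the source), the following shows why an averaged prediction never has a higher negative log-likelihood than the members' average. The probability values and the function names `average_probs` and `nll` are hypothetical and chosen only for illustration.

```python
import math

def average_probs(member_probs):
    """Average per-token probabilities across ensemble members."""
    n = len(member_probs)
    vocab = len(member_probs[0])
    return [sum(p[i] for p in member_probs) / n for i in range(vocab)]

def nll(probs, target):
    """Negative log-likelihood of the target token."""
    return -math.log(probs[target])

# Hypothetical predictions of three small models over a 3-token vocabulary.
members = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.1, 0.3],
]
target = 0  # index of the correct token (illustrative)

ensemble = average_probs(members)
ensemble_nll = nll(ensemble, target)
mean_member_nll = sum(nll(p, target) for p in members) / len(members)

# By Jensen's inequality, the ensemble NLL is at most the mean member NLL.
print(f"ensemble NLL: {ensemble_nll:.4f}, mean member NLL: {mean_member_nll:.4f}")
```

This only illustrates the averaging step; whether an ensemble beats a single larger model at equal parameter count is the empirical claim made above, not something the sketch demonstrates.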

In data-constrained training regimes, using weight decay values up to 30 times larger than those used in compute-optimal pre-training can prevent overfitting and allow for continued performance gains with larger models.

neutral·Kan Wu

An 8-member ensemble model with 2.4 billion total parameters can be distilled into a single 300 million parameter model while retaining 83% of the loss improvement.

bullish·Kan Wu
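
The arithmetic behind the distillation claim can be sketched as follows. The sketch assumes the 8 members are equal-sized and that "retaining 83% of the loss improvement" means the student keeps 83% of the ensemble's loss reduction relative to some baseline; the source states neither explicitly. The loss values and the helper `retained_fraction` are hypothetical, chosen only to illustrate the ratio.

```python
def retained_fraction(baseline_loss: float, ensemble_loss: float, student_loss: float) -> float:
    """Share of the ensemble's loss improvement that the distilled student keeps."""
    return (baseline_loss - student_loss) / (baseline_loss - ensemble_loss)

MEMBERS = 8
TOTAL_PARAMS = 2.4e9     # from the claim above
STUDENT_PARAMS = 300e6   # from the claim above

# Assumption: all members have the same size.
params_per_member = TOTAL_PARAMS / MEMBERS
compression = TOTAL_PARAMS / STUDENT_PARAMS

# Hypothetical losses, not taken from the source.
baseline, ensemble, student = 3.50, 3.30, 3.334
kept = retained_fraction(baseline, ensemble, student)

print(f"params per member: {params_per_member:.0f}")
print(f"parameter reduction: {compression:.0f}x")
print(f"loss improvement retained: {kept:.2f}")
```

Under the equal-size assumption each member is the same size as the distilled student, so the student matches a single member's parameter count while keeping most of the ensemble's gain.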

Self-distillation, where a model is distilled into a new model of the same size, can significantly improve loss and outperform the asymptote of a heavily regularized single model.

Expert perspective·Kan Wu·May 31

A joint scaling recipe combining aggressive regularization and ensembling can achieve a 5x data efficiency win over standard pre-training methods.

Expert perspective·Kan Wu·May 31

In a continued pre-training scenario on math data, data-efficiency techniques like aggressive epoching and ensembling matched the performance of training on 73 billion tokens while using only 4 billio...

Expert perspective·Kan Wu·May 31

Public projections indicate that the amount of human-generated text on the internet is growing by approximately 3% per year.

Expert perspective·Kan Wu·May 31

The amount of compute spent per data point in pre-training will increase by roughly 4x year-over-year due to the disparity in growth rates between compute availability and data generation.

Speculative·Kan Wu·May 31

The amount of compute spent on pre-training large language models is growing by approximately 4x to 5x per year.

Expert perspective·Kan Wu·May 31
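
The three growth claims above fit together with simple arithmetic: if compute grows 4x to 5x per year while human-generated text grows about 3% per year, compute per data point grows by their ratio. A minimal sketch, assuming both rates compound annually and that all available data is used each year (the function name `compute_per_data_point_growth` is hypothetical):

```python
def compute_per_data_point_growth(compute_growth: float, data_growth: float) -> float:
    """Yearly growth factor of compute available per unit of data."""
    return compute_growth / data_growth

DATA_GROWTH = 1.03  # about 3% per year, from the claim above

# Compute growth of 4x and 5x per year, the range given in the claims above.
ratios = {
    growth: compute_per_data_point_growth(growth, DATA_GROWTH)
    for growth in (4.0, 5.0)
}

for growth, ratio in ratios.items():
    print(f"compute x{growth:.0f}/yr -> compute per data point x{ratio:.2f}/yr")
```

The ratio falls between roughly 3.9x and 4.9x per year, consistent with the "roughly 4x" figure in the speculative claim.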
