“High-quality data remains a significant bottleneck for improving large language models, even with the availability of synthetic data generation.”