“Research from Allen AI demonstrated that removing all duplicate data before training results in worse-performing language models.”