“The VO3 model generates audio only for video clips created from a text prompt, not for those generated from an image-to-video prompt.”