“The AI industry is moving toward ‘omni-models’ that handle multiple modalities, such as text, images, and video, within a single architecture.”