Multi-Modality is the way forward for efficient and scalable solutions in the Transformer Era. The Transformer [1] architecture, released in 2017, went on to break almost all SOTA benchmarks in Natural Language Processing. It was then adapted, without any major modifications, to other domains such as Image and Video classification, Object Detection, Audio classification, and even Generative networks. One of the best-known of these architectures, the Vision Transformer [2], achieved SOTA results on the ImageNet classification task and set the base for Transformers in the Computer Vision domain. Mainstream research on Transformers is now happening in every domain under the Deep Learning umbrella. The architectures and the results they achieve are exceptional, but they raise a simple question:
