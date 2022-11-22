Multimodal data has become increasingly popular for AI training in recent years. Voice-activated devices with screens, such as the Amazon Echo Show, are also gaining popularity because they open up more possibilities for multimodal interaction. Customers can refer to products shown on screen using spoken language, which makes it easier for them to express what they want. Multimodal coreference resolution (MCR) is the task of selecting the correct on-screen object based on a natural language understanding of what the customer said. Building the next generation of conversational bots involves resolving references across multiple modalities, such as text and visuals.

