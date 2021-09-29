“Fast, simple yet accurate video instance segmentation module based on transformers”. Video instance segmentation (VIS) is a recently introduced computer vision task that aims at joint detection, segmentation, and tracking of instances in the video domain. Recent methods rely on highly sophisticated, multi-stage networks that are difficult to use in practice, so simple yet effective approaches are needed. To fill this gap, we propose TT-SRN, an end-to-end transformer-based video instance segmentation module with Sinusoidal Representation Networks (SRN). TT-SRN treats the VIS task as a direct sequence prediction problem solved in a single stage, which enables us to aggregate temporal information together with spatial information. A set of video frame features is extracted by twin transformers and then propagated to the original transformer to produce a set of instance predictions. This instance-level information is then passed through modified SRNs to obtain the final instance-level class ids and bounding boxes, and through self-attended 3-D convolutions to obtain segmentation masks. At its core, TT-SRN is a natural paradigm that handles instance segmentation and tracking via similarity learning, which enables the system to produce a fast and accurate set of predictions. TT-SRN is trained end-to-end with a set-based global loss that forces unique predictions via bipartite matching. As a result, the overall complexity of the pipeline is significantly reduced without sacrificing the quality of the segmentation masks. For the first time, the VIS problem is addressed without implicit CNN architectures, thanks to twin transformers, while TT-SRN remains one of the fastest approaches.
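The set-based loss with bipartite matching mentioned above can be illustrated with a minimal sketch. It is not the TT-SRN implementation: the cost function (a class-mismatch penalty plus an L1 box distance), the dictionary fields `label` and `box`, and the helper names `match_cost` and `bipartite_match` are all assumptions made for illustration. The exhaustive search stands in for the Hungarian algorithm that would normally be used at realistic set sizes.

```python
from itertools import permutations


def match_cost(pred, target):
    """Hypothetical pairwise cost: class-mismatch penalty plus L1 box distance."""
    class_cost = 0.0 if pred["label"] == target["label"] else 1.0
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost


def bipartite_match(preds, targets):
    """Find the one-to-one assignment of predictions to targets with minimal cost.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    prediction matched to target j. Each prediction is used at most once, which
    is what pushes the model toward unique predictions. Exhaustive search is
    only feasible for tiny sets; it is used here to keep the sketch
    dependency-free.
    """
    assert len(preds) >= len(targets), "need at least as many predictions as targets"
    best_assignment, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return best_assignment, best_cost


# Toy example: three predictions, two ground-truth instances.
# Boxes are (center_x, center_y, width, height) in normalized coordinates.
preds = [
    {"label": "car", "box": (0.5, 0.5, 0.2, 0.2)},
    {"label": "person", "box": (0.1, 0.1, 0.1, 0.3)},
    {"label": "person", "box": (0.9, 0.9, 0.1, 0.1)},
]
targets = [
    {"label": "person", "box": (0.1, 0.1, 0.1, 0.3)},
    {"label": "car", "box": (0.5, 0.6, 0.2, 0.2)},
]

assignment, total = bipartite_match(preds, targets)
print(assignment, round(total, 3))
```

In this toy case the second prediction is matched to the first target and the first prediction to the second target, while the third prediction stays unmatched and would typically be supervised as "no object". In a training loop, the per-pair losses would then be computed only over the matched pairs.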

