Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation
By Anirudh S Chakravarthy, Won-Dong Jang, Zudi Lin, Donglai Wei, Song Bai, Hanspeter Pfister
5 days ago
Video instance segmentation aims to detect, segment, and track objects in a video. Current approaches extend image-level segmentation algorithms to the temporal domain. However, this results in temporally inconsistent masks. In this work,...
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query, i.e., a sentence. Without considering bias in moment annotations (e.g., start and end positions in a video), many models tend to capture statistical regularities of the moment annotations and fail to learn cross-modal reasoning between the video and the language query. In this paper, we propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions. Data debiasing oversamples data through video truncation to balance the temporal distribution of moments in the training set. Model debiasing leverages video-only and query-only models to capture the distribution bias, and forces the model to learn cross-modal interactions. Using VSLNet as the base model, we evaluate the impact of the two strategies on two datasets that contain out-of-distribution test instances. Results show that both strategies are effective in improving model generalization capability. Equipped with both debiasing strategies, VSLNet achieves the best results on both datasets.
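As a rough illustration of oversampling through video truncation, the sketch below cuts a shorter clip out of a video while keeping the annotated moment inside it, so the moment lands at a different relative position. The abstract does not say how truncation windows are chosen; the uniform sampling and the function name `truncate_around_moment` are assumptions made purely for illustration.

```python
import random
from typing import Tuple


def truncate_around_moment(
    duration: float, start: float, end: float, rng: random.Random
) -> Tuple[float, float, float, float]:
    """Sample a truncated clip that still contains the moment [start, end].

    Returns (clip_begin, clip_end, new_start, new_end), where new_start and
    new_end are the moment boundaries measured from the start of the clip.
    Uniform sampling of the clip boundaries is an assumption of this sketch.
    """
    if not 0.0 <= start < end <= duration:
        raise ValueError("expected 0 <= start < end <= duration")
    # Cut somewhere before the moment begins and somewhere after it ends.
    clip_begin = rng.uniform(0.0, start)
    clip_end = rng.uniform(end, duration)
    return clip_begin, clip_end, start - clip_begin, end - clip_begin


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        begin, stop, s, e = truncate_around_moment(100.0, 40.0, 55.0, rng)
        length = stop - begin
        # Normalised positions differ from sample to sample.
        print(f"clip=[{begin:.1f}, {stop:.1f}]  moment=({s / length:.2f}, {e / length:.2f})")
```

Each sampled clip keeps the moment's duration but changes its normalised start and end positions, which is the property an oversampling scheme of this kind would rely on.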
Learning object-centric scene representations is essential for attaining structural understanding and abstraction of complex scenes. Yet, as current approaches for unsupervised object-centric representation learning are built upon either a stationary observer assumption or a static scene assumption, they often: i) suffer from single-view spatial ambiguities, or ii) infer incorrect or inaccurate object representations from dynamic scenes. To address this, we propose Dynamics-aware Multi-Object Network (DyMON), a method that broadens the scope of multi-view object-centric representation learning to dynamic scenes. We train DyMON on multi-view-dynamic-scene data and show that DyMON learns -- without supervision -- to factorize the entangled effects of observer motions and scene object dynamics from a sequence of observations, and constructs scene object spatial representations suitable for rendering at arbitrary times (querying across time) and from arbitrary viewpoints (querying across space). We also show that the factorized scene representations (w.r.t. objects) support querying about a single object by space and time independently.
Spiking Neural Networks (SNNs) and the field of Neuromorphic Engineering have brought about a paradigm shift in how to approach Machine Learning (ML) and Computer Vision (CV) problems. This paradigm shift comes from the adoption of event-based sensing and processing. An event-based vision sensor produces sparse and asynchronous events that are dynamically related to the scene, capturing not only spatial information but also high-fidelity temporal information, while avoiding the extra overhead and redundancy of conventional high-frame-rate approaches. However, with this change in paradigm, many techniques from traditional CV and ML are not applicable to these event-based spatio-temporal visual streams. As such, only a limited number of recognition, detection and segmentation approaches exist. In this paper, we present a novel approach that can perform instance segmentation using just the weights of a Spike Time Dependent Plasticity trained Spiking Convolutional Neural Network that was trained for object recognition. This exploits the spatial and temporal aspects of the network's internal feature representations, adding this new discriminative capability. We highlight the new capability by successfully transforming a single-class unsupervised network for face detection into a multi-person face recognition and instance segmentation network.
Precise instrument segmentation aids surgeons in navigating the body more easily and increases patient safety. While accurate tracking of surgical instruments in real time plays a crucial role in minimally invasive computer-assisted surgeries, it is a challenging task to achieve, mainly due to 1) the complex surgical environment, and 2) the need for a model design with both optimal accuracy and speed. Deep learning gives us the opportunity to learn complex environments from large surgery scene environments and placements of these instruments in real-world scenarios. The Robust Medical Instrument Segmentation 2019 challenge (ROBUST-MIS) provides more than 10,000 frames with surgical tools in different clinical settings. In this paper, we use a light-weight single-stage instance segmentation model complemented with a convolutional block attention module to achieve both fast and accurate inference. We further improve accuracy through data augmentation and optimal anchor localisation strategies. To our knowledge, this is the first work that explicitly focuses on both real-time performance and improved accuracy. Our approach outperformed the top team performances in the ROBUST-MIS challenge with over 44% improvement on both the area-based metric MI_DSC and the distance-based metric MI_NSD. We also demonstrate real-time performance (> 60 frames per second) with different but competitive variants of our final approach.
The practical implementation of free-space quantum information tasks requires entanglement to be sustained over long distances and in the presence of turbulent and noisy environments. The transverse position-momentum entanglement of photon pairs produced by parametric down-conversion has found several uses in quantum information science; however, it is not suitable for applications involving long-distance propagation, as the entanglement decays very rapidly when photons propagate away from their source. Entanglement is lost after a few centimetres of propagation, and the effect becomes even more pronounced in turbulent environments. In contrast, in this article, we show that entanglement in the angle-orbital angular momentum (OAM) bases exhibits a remarkably different behaviour. As with the position-momentum case, initially, the angle-OAM entanglement decays with propagation, but as the photons continue to travel further from the source, the photons regain their strongly correlated behaviour, and the entanglement returns. We theoretically and experimentally demonstrate this behaviour and show that entanglement returns even in the presence of strong turbulence. The only effect of turbulence is to increase the propagation distance for revival, but once revived, the two photons remain entangled up to an arbitrary propagation distance. This work highlights the role that OAM-angle entanglement will play in applications where quantum information is shared over long distances.
Segmentation of head and neck (H&N) tumours and prediction of patient outcome are crucial for a patient's disease diagnosis and treatment monitoring. Current developments of robust deep learning models are hindered by the lack of large multi-centre, multi-modal data with quality annotations. The MICCAI 2021 HEad and neCK TumOR (HECKTOR) segmentation and outcome prediction challenge creates a platform for comparing segmentation methods of the primary gross target volume on fluoro-deoxyglucose (FDG)-PET and Computed Tomography images and prediction of progression-free survival in H&N oropharyngeal cancer. For the segmentation task, we proposed a new network based on an encoder-decoder architecture with full inter- and intra-skip connections to take advantage of low-level and high-level semantics at full scales. Additionally, we used Conditional Random Fields as a post-processing step to refine the predicted segmentation maps. We trained multiple neural networks for tumor volume segmentation, and these segmentations were ensembled, achieving an average Dice Similarity Coefficient of 0.75 in cross-validation and 0.76 on the challenge testing data set. For the patient progression-free survival prediction task, we propose a Cox proportional hazard regression combining clinical, radiomic, and deep learning features. Our survival prediction model achieved a concordance index of 0.82 in cross-validation and 0.62 on the challenge testing data set.
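For readers unfamiliar with the Dice Similarity Coefficient quoted above, the following is a minimal sketch of the standard definition, 2|A ∩ B| / (|A| + |B|), applied to flattened binary masks. It is not the challenge's evaluation code; the function name and the convention of returning 1.0 for two empty masks are assumptions of this sketch.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice similarity between two flattened binary masks (values 0 or 1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels marked as foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [1, 1, 0, 0]
    reference = [1, 0, 1, 0]
    print(dice_coefficient(predicted, reference))
```

A score of 1 means the predicted and reference masks coincide exactly, and 0 means they share no foreground voxel.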
Achieving strong coupling between light and matter excitations in hybrid systems is a benchmark for the implementation of quantum technologies. We recently proposed [arXiv:2110.02984] that strong single-particle coupling between magnons and light can be realized in a magnetized epsilon-near-zero (ENZ) medium, in which magneto-optical effects are enhanced. Here we present a detailed derivation of the magnon-photon coupling Hamiltonian in dispersive media both for degenerate and non-degenerate optical modes, and show the enhancement of the coupling near the ENZ frequency. Moreover, we show that the coupling of magnons to plane-wave non-degenerate Voigt modes vanishes at specific frequencies due to polarization selection rules tuned by dispersion. Finally, we present specific results using a Lorentz dispersion model. Our results pave the way for the design of dispersive optomagnonic systems, providing a general theoretical framework for describing and engineering ENZ-based optomagnonic systems.
Text tracking is to track multiple texts in a video, and construct a trajectory for each text. Existing methods tackle this task by utilizing the tracking-by-detection framework, i.e., detecting the text instances in each frame and associating the corresponding text instances in consecutive frames. We argue that the tracking accuracy of this paradigm is severely limited in more complex scenarios, e.g., owing to motion blur, etc., the missed detection of text instances causes the break of the text trajectory. In addition, different text instances with similar appearance are easily confused, leading to the incorrect association of the text instances. To this end, a novel spatio-temporal complementary text tracking model is proposed in this paper. We leverage a Siamese Complementary Module to fully exploit the continuity characteristic of the text instances in the temporal dimension, which effectively alleviates the missed detection of the text instances, and hence ensures the completeness of each text trajectory. We further integrate the semantic cues and the visual cues of the text instance into a unified representation via a text similarity learning network, which supplies a high discriminative power in the presence of text instances with similar appearance, and thus avoids the mis-association between them. Our method achieves state-of-the-art performance on several public benchmarks. The source code is available at this https URL.
Sparse training is a natural idea to accelerate the training of deep neural networks and reduce memory usage, especially since large modern neural networks are significantly over-parameterized. However, most of the existing methods cannot achieve this goal in practice because the chain rule based gradient (w.r.t. structure parameters) estimators adopted by previous methods require dense computation at least in the backward propagation step. This paper solves this problem by proposing an efficient sparse training method with completely sparse forward and backward passes. We first formulate the training process as a continuous minimization problem under a global sparsity constraint. We then separate the optimization process into two steps, corresponding to weight update and structure parameter update. For the former step, we use the conventional chain rule, which can be sparse via exploiting the sparse structure. For the latter step, instead of using the chain rule based gradient estimators as in existing methods, we propose a variance reduced policy gradient estimator, which only requires two forward passes without backward propagation, thus achieving completely sparse training. We prove that the variance of our gradient estimator is bounded. Extensive experimental results on real-world datasets demonstrate that compared to previous methods, our algorithm is much more effective in accelerating the training process, up to an order of magnitude faster.
We investigated the propagation of turbulent fronts in pipe flow at high Reynolds numbers by direct numerical simulation. We used a technique combining a moving frame of reference and an artificial damping to isolate the fronts in short periodic pipes, which enables us to explore the bulk Reynolds number up to $Re = 10^5$ with affordable computation power. We measured the propagation speed of the downstream front and observed that a fit of $1.971-(Re/1925)^{-0.825}$ (in units of bulk speed) captures this speed well above $Re\simeq 5000$. The speed increases monotonically as $Re$ increases, in stark contrast to the decreasing trend above $Re\simeq 10000$ reported by Wygnanski & Champagne (1973). The speed of the upstream front overall agrees with the former studies, and $0.024 + (Re/1936)^{-0.528}$ fits our data and those from the literature well. Based on our analysis of the front dynamics, we proposed that both front speeds would keep their respective monotonic trends as the Reynolds number increases further. We show that, at high Reynolds numbers, the local transition at the upstream front tip is via high-azimuthal-wavenumber structures in the high-shear region near the pipe wall, whereas that at the downstream front tip is via low-azimuthal-wavenumber structures in the low-shear region near the pipe center. This difference is possibly responsible for the asymmetric speed scalings between the upstream and downstream fronts.
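The two fits quoted in this abstract can be evaluated directly. The short sketch below only transcribes the formulas as stated (speeds in units of bulk speed); the function names are illustrative, and the abstract reports the downstream fit as valid above roughly Re = 5000.

```python
def downstream_front_speed(re: float) -> float:
    """Downstream front speed fit, 1.971 - (Re/1925)^(-0.825), in bulk-speed units."""
    return 1.971 - (re / 1925.0) ** -0.825


def upstream_front_speed(re: float) -> float:
    """Upstream front speed fit, 0.024 + (Re/1936)^(-0.528), in bulk-speed units."""
    return 0.024 + (re / 1936.0) ** -0.528


if __name__ == "__main__":
    for re in (5_000, 10_000, 50_000, 100_000):
        down = downstream_front_speed(re)
        up = upstream_front_speed(re)
        print(f"Re={re:>7}: downstream={down:.3f}, upstream={up:.3f}")
```

Under these fits the downstream speed rises towards 1.971 and the upstream speed falls towards 0.024 as Re grows, which matches the monotonic trends described in the abstract.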
Computer vision tasks can benefit from the estimation of the salient object regions and interactions between those object regions. Identifying the object regions involves utilizing pretrained models to perform object detection, object segmentation and/or object pose estimation. However, this is infeasible in practice for the following reasons: 1) the object categories of the pretrained models' training dataset may not exhaustively cover all the object categories needed for general computer vision tasks, 2) the domain gap between the pretrained models' training dataset and the target task's dataset may negatively impact performance, 3) the bias and variance present in pretrained models may leak into the target task, leading to an inadvertently biased target model. To overcome these downsides, we propose to utilize the common rationale that a sequence of video frames captures a set of common objects and interactions between them; thus, a notion of co-segmentation between the video frame features may equip the model with the ability to automatically focus on salient regions and improve the underlying task's performance in an end-to-end manner. In this regard, we propose a generic module called "Co-Segmentation Activation Module" (COSAM) that can be plugged into any CNN to promote the notion of co-segmentation based attention among a sequence of video frame features. We show the application of COSAM in three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification, and demonstrate that COSAM is able to capture salient regions in the video frames, thus leading to notable performance improvements along with interpretable attention maps.
In this paper we introduce an image-based person re-identification dataset collected across five non-overlapping camera views in the large and busy airport in Dublin, Ireland. Unlike all publicly available image-based datasets, our dataset contains timestamp information in addition to frame number, and camera and person IDs. Also, our dataset has been fully anonymized to comply with modern data privacy regulations. We apply state-of-the-art person re-identification models to our dataset and show that by leveraging the available timestamp information we are able to achieve a significant gain of 37.43% in mAP and a gain of 30.22% in Rank1 accuracy. We also propose a Bayesian temporal re-ranking post-processing step, which further adds a 10.03% gain in mAP and a 9.95% gain in Rank1 accuracy. This work on combining visual and temporal information is not possible on other image-based person re-identification datasets. We believe that the proposed new dataset will enable further development of person re-identification research for challenging real-world applications. The DAA dataset can be downloaded from this https URL.
Object detection has achieved promising performance on clean datasets, but how to achieve a better tradeoff between adversarial robustness and clean precision is still under-explored. Adversarial training is the mainstream method to improve robustness, but most works sacrifice clean precision, relative to standard training, to gain robustness. In this paper, we propose Unified Decoupled Feature Alignment (UDFA), a novel fine-tuning paradigm which achieves better performance than existing methods by fully exploring the combination of self-knowledge distillation and adversarial training for object detection. We first use decoupled fore/back-ground features to construct a self-knowledge distillation branch between the clean feature representation from the pretrained detector (serving as the teacher) and the adversarial feature representation from the student detector. Then we explore the self-knowledge distillation from a new angle by decoupling the original branch into a self-supervised learning branch and a new self-knowledge distillation branch. With extensive experiments on the PASCAL-VOC and MS-COCO benchmarks, the evaluation results show that UDFA can surpass standard training and state-of-the-art adversarial training methods for object detection. For example, compared with the teacher detector, our approach on GFLV2 with ResNet-50 improves clean precision by 2.2 AP on PASCAL-VOC; compared with SOTA adversarial training methods, our approach improves clean precision by 1.6 AP, while improving adversarial robustness by 0.5 AP. Our code will be available at this https URL.
We consider the problem of assigning appearing times to the edges of a digraph in order to maximize the (average) temporal reachability between pairs of nodes. Motivated by the application to public transit networks, where edges cannot be scheduled independently one of another, we consider the setting where the edges are grouped into certain walks (called trips) in the digraph and where assigning the appearing time to the first edge of a trip forces the appearing times of the subsequent edges. In this setting, we show that, quite surprisingly, it is NP-complete to decide whether there exists an assignment of times connecting a given pair of nodes. This result allows us to prove that the problem of maximising the temporal reachability cannot be approximated within a factor better than some polynomial term in the size of the graph. We thus focus on the case where, for each pair of nodes, there exists an assignment of times such that one node is reachable from the other. We call this property strong temporalisability. It is a very natural assumption for the application to public transit networks. On the negative side, the problem of maximising the temporal reachability remains hard to approximate within a factor $\sqrt{n}/12$ in that setting. Moreover, we show the existence of collections of trips that are strongly temporalisable but for which any assignment of starting times to the trips connects at most an $O(1/\sqrt{n})$ fraction of all pairs of nodes. On the positive side, we show that there must exist an assignment of times that connects a constant fraction of all pairs in the strongly temporalisable and symmetric case, that is, when the set of trips to be scheduled is such that, for each trip, there is a symmetric trip visiting the same nodes in reverse order. Keywords: edge labeling, edge scheduled network, network optimisation, temporal graph, temporal path, temporal reachability, time assignment.
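To make the objective concrete, the sketch below counts how many ordered pairs of nodes are temporally connected once each trip has been given a start time. It is an illustration under explicit assumptions that the abstract does not state: every edge takes one time unit, the i-th edge of a trip appears at its start time plus i, and a traveller can take an edge appearing at time t only after arriving at its tail by time t. The function and variable names are hypothetical.

```python
from math import inf
from typing import Dict, List, Tuple


def temporal_edges(
    trips: Dict[str, List[str]], start_times: Dict[str, int]
) -> List[Tuple[int, str, str]]:
    """Expand trips into (time, tail, head) edges, sorted by appearing time."""
    edges = []
    for name, walk in trips.items():
        start = start_times[name]
        for i in range(len(walk) - 1):
            # Assumption: the i-th edge of a trip appears at start + i.
            edges.append((start + i, walk[i], walk[i + 1]))
    edges.sort(key=lambda edge: edge[0])
    return edges


def reachable_pairs(trips: Dict[str, List[str]], start_times: Dict[str, int]) -> int:
    """Count ordered pairs (u, v), u != v, with a temporal path from u to v."""
    edges = temporal_edges(trips, start_times)
    nodes = {node for walk in trips.values() for node in walk}
    count = 0
    for source in nodes:
        # Earliest arrival time at every node when starting from `source`.
        arrival = {node: inf for node in nodes}
        arrival[source] = -inf
        for time, tail, head in edges:
            # Edge is usable if we are at its tail by `time`; we arrive at time + 1.
            if arrival[tail] <= time and time + 1 < arrival[head]:
                arrival[head] = time + 1
        count += sum(1 for node in nodes if node != source and arrival[node] < inf)
    return count


if __name__ == "__main__":
    trips = {"A": ["a", "b", "c"], "B": ["c", "d"]}
    print(reachable_pairs(trips, {"A": 0, "B": 2}))  # trip B leaves after A arrives at c
    print(reachable_pairs(trips, {"A": 0, "B": 0}))  # trip B leaves before A arrives at c
```

In the toy example, delaying trip B until trip A has reached node c lets passengers transfer, so more pairs become connected. Choosing start times to maximise this count is the optimisation problem the abstract studies.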
Spatiotemporal information about light pulse propagation obtained with femtosecond temporal resolution plays an important role in understanding transient phenomena and light-matter interactions. Although ultrafast optical imaging techniques have been developed, it is still difficult to capture light pulse propagation spatiotemporally. Furthermore, imaging through a three-dimensional (3-D) scattering medium is a longstanding challenge due to the optical scattering caused by the interaction between the light pulse and the 3-D scattering medium. Here, we propose a technique for ultrafast optical imaging of light pulses propagating inside a 3-D scattering medium. We record an image of the light pulse propagation using the ultrashort light pulse even when the interaction between the light pulse and the 3-D scattering medium causes optical scattering. We demonstrated our proposed technique by recording converging, refracted, and diffracted propagating light for 59 ps with femtosecond temporal resolution.
We propose a novel DNN-based framework called the Enhanced Correlation Matching based Video Frame Interpolation Network to support high-resolution video such as 4K, which has a large scale of motion and occlusion. Considering the extensibility of the network model according to resolution, the proposed scheme employs a recurrent pyramid architecture that shares parameters among the pyramid layers for optical flow estimation. In the proposed flow estimation, the optical flows are recursively refined by tracing the location with maximum correlation. The forward warping based correlation matching improves the accuracy of the flow update by excluding incorrectly warped features around the occlusion area. Based on the final bi-directional flows, the intermediate frame at an arbitrary temporal position is synthesized using the warping and blending network, and it is further improved by a refinement network. Experimental results demonstrate that the proposed scheme outperforms previous works on 4K video data as well as on low-resolution benchmark datasets in terms of objective and subjective quality, with the smallest number of model parameters.
