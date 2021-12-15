Supervised learning of analysis-sparsity priors with automatic differentiation
By Hashem Ghanem, Joseph Salmon, Nicolas Keriven, Samuel Vaiter
Sparsity priors are commonly used in denoising and image reconstruction. For analysis-type priors, a dictionary defines a representation of signals that is likely to be sparse. In most situations, this dictionary is not known, and is to be recovered from...
As a seminal tool in self-supervised representation learning, contrastive learning has gained unprecedented attention in recent years. In essence, contrastive learning aims to leverage pairs of positive and negative samples for representation learning, which relates to exploiting neighborhood information in a feature space. By investigating the connection between contrastive learning and neighborhood component analysis (NCA), we provide a novel stochastic nearest neighbor viewpoint of contrastive learning and subsequently propose a series of contrastive losses that outperform the existing ones. Under our proposed framework, we show a new methodology to design integrated contrastive losses that could simultaneously achieve good accuracy and robustness on downstream tasks. With the integrated framework, we achieve up to 6% improvement on the standard accuracy and 17% improvement on the adversarial accuracy.
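The stochastic nearest neighbor reading of a contrastive loss can be illustrated with a small sketch. It is the widely used InfoNCE form, not one of the losses proposed in the abstract above; the function name `info_nce`, the dot-product similarity, and the `temperature` default are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.5):
    """Negative log-probability that the anchor picks its positive as a
    stochastic nearest neighbour among {positive} + negatives.

    Assumption: similarity is a plain dot product scaled by a temperature.
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
    swapped = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], temperature=1.0)
    print(aligned, swapped)  # the aligned pair gives the smaller loss
```

The softmax over similarities is exactly the neighbor-selection probability used in NCA-style objectives, which is why the loss falls as the positive moves closer to the anchor than the negatives.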
Concept-oriented deep learning (CODL) is a general approach to meet the future challenges for deep learning: (1) learning with little or no external supervision, (2) coping with test examples that come from a different distribution than the training examples, and (3) integrating deep learning with symbolic AI. In CODL, as in human learning, concept representations are learned based on concept exemplars. Contrastive self-supervised learning (CSSL) provides a promising approach to do so, since it: (1) uses data-driven associations, to get away from semantic labels, (2) supports incremental and continual learning, to get away from (large) fixed datasets, and (3) accommodates emergent objectives, to get away from fixed objectives (tasks). We discuss major aspects of concept representation learning using CSSL. These include dual-level concept representations, CSSL for feature representations, exemplar similarity measures and self-supervised relational reasoning, incremental and continual CSSL, and contrastive self-supervised concept (class) incremental learning. The discussion leverages recent findings from cognitive neural science and CSSL.
In this article, we introduce the differentiable reinforcement learning framework. It is based on the fact that in many reinforcement learning applications, the environment reward and transition functions are not black boxes but known differentiable functions. Incorporating deep learning in this framework, we find more accurate and stable solutions than more generic actor-critic algorithms. We apply this deep differentiable reinforcement learning (DDRL) algorithm to the problem of optimal trading strategies in various environments where the market dynamics are known. Thanks to the stability of this method, we are able to efficiently find optimal strategies for complex multi-scale market models and for a wide range of environment parameters. This makes it applicable to real-life financial signals and portfolio optimization where the expected return has multiple time scales. In the case of a slow and a fast alpha signal, we find that the optimal trading strategy consists in using the fast signal to time the trades associated with the slow signal.
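A toy sketch of the idea that a known, differentiable reward lets the policy be trained by direct gradient ascent on the return, with no critic. This is not the DDRL algorithm from the abstract above: the decaying signal, the quadratic cost, the linear policy `position = theta * alpha`, and all parameter values are assumptions, and the gradient is written by hand rather than obtained from a deep learning framework. For simplicity the transition here does not depend on the action.

```python
def episode_return_and_grad(theta, alpha0=1.0, rho=0.9, cost=2.0, horizon=20):
    """Return the episode reward and its exact derivative w.r.t. theta.

    Assumed toy model: reward_t = alpha_t * position_t - 0.5 * cost * position_t**2,
    with a linear policy position_t = theta * alpha_t and a known signal
    decay alpha_{t+1} = rho * alpha_t.
    """
    alpha = alpha0
    total, grad = 0.0, 0.0
    for _ in range(horizon):
        position = theta * alpha
        total += alpha * position - 0.5 * cost * position ** 2
        # d(reward_t)/d(theta), using d(position)/d(theta) = alpha.
        grad += alpha * alpha - cost * position * alpha
        alpha *= rho  # known transition, independent of the action here
    return total, grad


def train(theta=0.0, learning_rate=0.05, steps=500):
    """Plain gradient ascent on the differentiable episode return."""
    for _ in range(steps):
        _, grad = episode_return_and_grad(theta)
        theta += learning_rate * grad
    return theta


if __name__ == "__main__":
    print(train())  # approaches 1 / cost for this toy model
```

In this toy model the optimum is available in closed form (`theta = 1 / cost`), which makes it easy to check that gradient ascent through the known reward converges to it.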
Audio representations for music information retrieval are typically learned via supervised learning in a task-specific fashion. Although effective at producing state-of-the-art results, this scheme lacks flexibility with respect to the range of applications a model can have and requires extensively annotated datasets. In this work, we pose the question of whether it may be possible to exploit weakly aligned text as the only supervisory signal to learn general-purpose music audio representations. To address this question, we design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. After pre-training, we transfer the audio backbone of the model to a set of music audio classification and regression tasks. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies and show that our pre-training method consistently achieves comparable or higher scores on all tasks and datasets considered. Our experiments also confirm that MuLaP effectively leverages audio-caption pairs to learn representations that are competitive with audio-only and cross-modal self-supervised methods in the literature.
Learning self-supervised video representation predominantly focuses on discriminating instances generated from simple data augmentation schemes. However, the learned representation often fails to generalize over unseen camera viewpoints. To this end, we propose ViewCLR, which learns self-supervised video representations invariant to camera viewpoint changes. We introduce a view-generator that can be considered a learnable augmentation for any self-supervised pretext task, to generate a latent viewpoint representation of a video. ViewCLR maximizes the similarity between the latent viewpoint representation and its representation from the original viewpoint, enabling the learned video encoder to generalize over unseen camera viewpoints. Experiments on cross-view benchmark datasets, including the NTU RGB+D dataset, show that ViewCLR stands as a state-of-the-art viewpoint-invariant self-supervised method.
Automatic evaluation of the retinal fundus image is emerging as one of the most important tools for early detection and treatment of progressive eye diseases like glaucoma. Glaucoma results in a progressive degeneration of vision and is characterized by the deformation of the shape of the optic cup and the degeneration of the blood vessels, resulting in the formation of a notch along the neuroretinal rim. In this paper, we propose a deep learning-based pipeline for automatic segmentation of optic disc (OD) and optic cup (OC) regions from Digital Fundus Images (DFIs), thereby extracting distinct features necessary for prediction of glaucoma. This methodology uses focal notch analysis of the neuroretinal rim along with cup-to-disc ratio values as classifying parameters to enhance the accuracy of computer-aided diagnosis (CAD) systems in analyzing glaucoma. A support vector-based machine learning algorithm classifies DFIs as glaucomatous or normal based on the extracted features. The proposed pipeline was evaluated on the freely available DRISHTI-GS dataset with a resultant accuracy of 93.33% for detecting glaucoma from DFIs.
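As a minimal sketch of how a cup-to-disc ratio could be derived from segmentation output, the example below computes a vertical ratio from two binary masks. The mask format (lists of 0/1 rows), the choice of the vertical diameter, and the function names are assumptions; the abstract above does not specify how its pipeline measures the ratio.

```python
def vertical_extent(mask):
    """Number of rows spanned by the foreground of a binary mask (0 if empty)."""
    rows = [i for i, row in enumerate(mask) if any(row)]
    return rows[-1] - rows[0] + 1 if rows else 0


def cup_to_disc_ratio(cup_mask, disc_mask):
    """Vertical cup diameter divided by vertical disc diameter.

    Assumption: both masks share the same image grid and the disc mask
    is non-empty.
    """
    disc = vertical_extent(disc_mask)
    if disc == 0:
        raise ValueError("disc mask is empty")
    return vertical_extent(cup_mask) / disc


if __name__ == "__main__":
    disc = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    cup = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    print(cup_to_disc_ratio(cup, disc))  # 2 rows over 4 rows
```

A ratio like this would be one entry in the feature vector passed to the classifier, alongside whatever notch features the pipeline extracts.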
In recent years, self-supervised representation learning for skeleton-based action recognition has been developed with the advance of contrastive learning methods. The existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits the ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns to improve the universality of the learned representations. Second, since directly using extreme augmentations may not boost performance due to the drastic changes in original identity, the Dual Distributional Divergence Minimization Loss (D$^3$M Loss) is proposed to minimize the distribution divergence in a gentler way. Third, the Nearest Neighbors Mining (NNM) is proposed to further expand positive samples to make the abundant information mining process more reasonable. Extensive experiments on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets verify that AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols, with observably higher-quality action representations. Our code is available at this https URL.
DeepMind Interactive Agents Team: Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Felix Fischer, Petko Georgiev, Alex Goldin, Tim Harley, Felix Hill, Peter C Humphreys, Alden Hung, Jessica Landon, Timothy Lillicrap, Hamza Merzic, Alistair Muldal, Adam Santoro, Guy Scully, Tamara von Glehn, Greg Wayne, Nathaniel Wong, Chen Yan, Rui Zhu.
A comprehensive and precise analysis of shale gas production performance is crucial for evaluating resource potential, designing field development plans, and making investment decisions. However, quantitative analysis can be challenging because production performance is dominated by a complex interaction among a series of geological and engineering factors. In this study, we propose a hybrid data-driven procedure for analyzing shale gas production performance, which consists of a complete workflow for dominant factor analysis, production forecast, and development optimization. More specifically, game theory and machine learning models are coupled to determine the dominating geological and engineering factors. The Shapley value, which has a definite physical meaning, is employed to quantitatively measure the effects of individual factors. A multi-model-fused stacked model is trained for production forecast, on the basis of which derivative-free optimization algorithms are introduced to optimize the development plan. The complete workflow is validated with actual production data collected from the Fuling shale gas field, Sichuan Basin, China. The validation results show that the proposed procedure can draw rigorous conclusions with quantified evidence and thereby provide specific and reliable suggestions for development plan optimization. Compared with traditional and experience-based approaches, the hybrid data-driven procedure is more advanced in terms of both efficiency and accuracy.
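To make the role of the Shapley value concrete, here is a small sketch that computes exact Shapley values by averaging marginal contributions over all orderings. The factor names and the toy value function are hypothetical; in the workflow described above the value function would come from a trained production model, and exact enumeration is only feasible for a handful of factors.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a number. Cost grows factorially
    with the number of players, so this is for illustration only.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_production(coalition):
    """Hypothetical value function: additive effects plus one interaction."""
    base = {"porosity": 1.0, "fracture_stages": 2.0, "proppant": 3.0}
    total = sum(base[p] for p in coalition)
    if {"porosity", "fracture_stages"} <= coalition:
        total += 2.0  # assumed synergy between the two factors
    return total


if __name__ == "__main__":
    phi = shapley_values(["porosity", "fracture_stages", "proppant"], toy_production)
    print(phi)  # the synergy is split equally between the two interacting factors
```

The attributions always sum to the value of the full coalition, which is what lets each factor's share be read as a quantified contribution.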
Training speaker-discriminative and robust speaker verification systems without speaker labels is still challenging and worthwhile to explore. In this study, we propose an effective self-supervised learning framework and a novel regularization strategy to facilitate self-supervised speaker representation learning. Different from contrastive learning-based self-supervised learning methods, the proposed self-supervised regularization (SSReg) focuses exclusively on the similarity between the latent representations of positive data pairs. We also explore the effectiveness of alternative online data augmentation strategies on both the time domain and frequency domain. With our strong online data augmentation strategy, the proposed SSReg shows the potential of self-supervised learning without using negative pairs and it can significantly improve the performance of self-supervised speaker representation learning with a simple Siamese network architecture. Comprehensive experiments on the VoxCeleb datasets demonstrate that our proposed self-supervised approach obtains a 23.4% relative improvement by adding the effective self-supervised regularization and outperforms other previous works.
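A loss that looks only at positive pairs, with no negatives, can be sketched as the mean negative cosine similarity between the two embeddings of each pair. This is a common choice in negative-free Siamese training and is used here as an assumption; the abstract above does not give the exact form of SSReg, and the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def positive_pair_loss(batch_a, batch_b):
    """Mean negative cosine similarity over positive pairs.

    batch_a[i] and batch_b[i] are embeddings of two augmented views of the
    same utterance (assumed pairing); no negative pairs are involved.
    """
    sims = [cosine_similarity(a, b) for a, b in zip(batch_a, batch_b)]
    return -sum(sims) / len(sims)


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 2.0]]
    view_b = [[2.0, 0.0], [0.0, 1.0]]
    print(positive_pair_loss(view_a, view_b))  # minimal when the views align
```

The loss reaches its minimum of -1 when every pair points in the same direction, regardless of vector length.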
Medical image segmentation is a fundamental and critical step in many clinical approaches. Semi-supervised learning has been widely applied to medical image segmentation tasks since it alleviates the heavy burden of acquiring expert-examined annotations and takes advantage of unlabeled data, which is much easier to acquire. Although consistency learning has been proven to be an effective approach by enforcing an invariance of predictions under different distributions, existing approaches cannot make full use of region-level shape constraints and boundary-level distance information from unlabeled data. In this paper, we propose a novel uncertainty-guided mutual consistency learning framework to effectively exploit unlabeled data by integrating intra-task consistency learning from up-to-date predictions for self-ensembling and cross-task consistency learning from task-level regularization to exploit geometric shape information. The framework is guided by the estimated segmentation uncertainty of models to select relatively certain predictions for consistency learning, so as to effectively exploit more reliable information from unlabeled data. We extensively validate our proposed method on two publicly available benchmark datasets: the Left Atrium Segmentation (LA) dataset and the Brain Tumor Segmentation (BraTS) dataset. Experimental results demonstrate that our method achieves performance gains by leveraging unlabeled data and outperforms existing semi-supervised segmentation methods.
Despite the great progress in video understanding made by deep convolutional neural networks, feature representations learned by existing methods may be biased toward static visual cues. To address this issue, we propose a novel method to suppress static visual cues (SSVC) based on probabilistic analysis for self-supervised video representation learning. In our method, video frames are first encoded to obtain latent variables under a standard normal distribution via normalizing flows. By modelling static factors in a video as a random variable, the conditional distribution of each latent variable becomes a shifted and scaled normal. Then, the latent variables that vary less along time are selected as static cues and suppressed to generate motion-preserved videos. Finally, positive pairs are constructed from motion-preserved videos for contrastive learning to alleviate the problem of representation bias toward static cues. The less-biased video representation generalizes better to various downstream tasks. Extensive experiments on publicly available benchmarks demonstrate that the proposed method outperforms the state of the art when only the RGB modality is used for pre-training.
The prosperous development of cloud computing and machine learning as a service has led to the widespread use of media software to process confidential media data. This paper explores an adversary's ability to launch side channel analyses (SCA) against media software to reconstruct confidential media inputs. Recent advances in representation learning and perceptual learning inspired us to consider the reconstruction of media inputs from side channel traces as a cross-modality manifold learning task that can be addressed in a unified manner with an autoencoder framework trained to learn the mapping between media inputs and side channel observations. We further enhance the autoencoder with attention to localize the program points that make the primary contribution to SCA, thus automatically pinpointing information-leakage points in media software. We also propose a novel and highly effective defensive technique called perception blinding that can perturb media inputs with perception masks and mitigate manifold learning-based SCA.
Weak supervision (WS) frameworks are a popular way to bypass hand-labeling large datasets for training data-hungry models. These approaches synthesize multiple noisy but cheaply-acquired estimates of labels into a set of high-quality pseudolabels for downstream training. However, the synthesis technique is specific to a particular kind of label, such as binary labels or sequences, and each new label type requires manually designing a new synthesis algorithm. Instead, we propose a universal technique that enables weak supervision over any label type while still offering desirable properties, including practical flexibility, computational efficiency, and theoretical guarantees. We apply this technique to important problems previously not tackled by WS frameworks including learning to rank, regression, and learning in hyperbolic manifolds. Theoretically, our synthesis approach produces a consistent estimator for learning a challenging but important generalization of the exponential family model. Experimentally, we validate our framework and show improvement over baselines in diverse settings including real-world learning-to-rank and regression problems along with learning on hyperbolic manifolds.
The DevOps Movement has many recommended practices for automation of processes and testing. However, across different market verticals, the requirements, practices, and cadence of releases vary widely. For example, in the more security- or safety-relevant software markets, development processes often also include compliance with coding standards, or other security and safety practices that must be “baked” into the process in order to achieve compliance.
Despite the outstanding success of self-supervised pretraining methods for video representation learning, they generalise poorly when the unlabelled dataset for pretraining is small or the domain difference between unlabelled data in the source task (pretraining) and labelled data in the target task (finetuning) is significant. To mitigate these issues, we propose a novel approach to complement self-supervised pretraining via an auxiliary pretraining phase, based on knowledge similarity distillation, auxSKD, for better generalisation with a significantly smaller amount of video data, e.g. Kinetics-100 rather than Kinetics-400. Our method deploys a teacher network that iteratively distils its knowledge to the student model by capturing the similarity information between segments of unlabelled video data. The student model then solves a pretext task by exploiting this prior knowledge. We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations. Our experiments show results superior to the state of the art on both UCF101 and HMDB51 datasets when pretraining on K100. Additionally, we show that our auxiliary pretraining, auxSKD, when added as an extra pretraining phase to recent state-of-the-art self-supervised methods (e.g. VideoPace and RSPNet), improves their results on UCF101 and HMDB51. Our code will be released soon.
Deep learning has made a profound contribution to biomedical image segmentation by automating the process of delineation in medical imaging. To accomplish such a task, the models must be trained on a huge amount of annotated or labelled data that highlights the region of interest with a binary mask. However, efficient generation of the annotations for such huge data requires expert biomedical analysts and extensive manual effort. It is a tedious and expensive task, while also being vulnerable to human error. To address this problem, a self-supervised learning framework, BT-Unet, is proposed that uses the Barlow Twins approach to pre-train the encoder of a U-Net model via redundancy reduction in an unsupervised manner to learn data representation. Later, the complete network is fine-tuned to perform the actual segmentation. The BT-Unet framework can be trained with a limited number of annotated samples and a high number of unannotated samples, which is mostly the case in real-world problems. This framework is validated over multiple U-Net models and diverse datasets by generating scenarios with a limited number of labelled samples and using standard evaluation metrics. With exhaustive experimental trials, it is observed that the BT-Unet framework enhances the performance of the U-Net models by a significant margin under such circumstances.
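The redundancy-reduction objective behind Barlow Twins can be sketched in a few lines: standardize each embedding dimension over the batch, build the cross-correlation matrix between the two views, and push it toward the identity. This is a plain-Python illustration of the general Barlow Twins loss, not the BT-Unet code; the function name and the off-diagonal weight `lam=0.005` are assumptions.

```python
import math


def barlow_twins_loss(z_a, z_b, lam=0.005):
    """Barlow Twins-style loss for two batches of embeddings (lists of rows).

    Assumption: every embedding dimension has non-zero variance over the
    batch, otherwise the standardization below divides by zero.
    """
    n, d = len(z_a), len(z_a[0])

    def standardized_columns(z):
        cols = []
        for j in range(d):
            col = [row[j] for row in z]
            mean = sum(col) / n
            std = math.sqrt(sum((x - mean) ** 2 for x in col) / n)
            cols.append([(x - mean) / std for x in col])
        return cols

    a, b = standardized_columns(z_a), standardized_columns(z_b)
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(x * y for x, y in zip(a[i], b[j])) / n
            # Diagonal terms pull correlations to 1 (invariance);
            # off-diagonal terms pull them to 0 (redundancy reduction).
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss


if __name__ == "__main__":
    z = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
    swapped = [[row[1], row[0]] for row in z]
    print(barlow_twins_loss(z, z), barlow_twins_loss(z, swapped))
```

Identical views with decorrelated dimensions give a loss of zero, while swapping the dimensions of one view is penalized on both the diagonal and the off-diagonal terms.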
A modern self-supervised learning algorithm typically enforces persistency of the representations of an instance across views. While being very effective on learning holistic image and video representations, such an approach becomes sub-optimal for learning spatio-temporally fine-grained features in videos, where scenes and instances evolve through space and time. In this paper, we present the Contextualized Spatio-Temporal Contrastive Learning (ConST-CL) framework to effectively learn spatio-temporally fine-grained representations using self-supervision. We first design a region-based self-supervised pretext task which requires the model to learn to transform instance representations from one view to another guided by context features. Further, we introduce a simple network design that effectively reconciles the simultaneous learning process of both holistic and local representations. We evaluate our learned representations on a variety of downstream tasks and ConST-CL achieves state-of-the-art results on four datasets. For spatio-temporal action localization, ConST-CL achieves 39.4% mAP with ground-truth boxes and 30.5% mAP with detected boxes on the AVA-Kinetics validation set. For object tracking, ConST-CL achieves 78.1% precision and 55.2% success scores on OTB2015. Furthermore, ConST-CL achieves 94.8% and 71.9% top-1 fine-tuning accuracy on video action recognition datasets, UCF101 and HMDB51 respectively. We plan to release our code and models to the public.
We propose GAN-Supervised Learning, a framework for learning discriminative models and their GAN-generated training data jointly end-to-end. We apply our framework to the dense visual alignment problem. Inspired by the classic Congealing method, our GANgealing algorithm trains a Spatial Transformer to map random samples from a GAN trained on unaligned data to a common, jointly-learned target mode. We show results on eight datasets, all of which demonstrate our method successfully aligns complex data and discovers dense correspondences. GANgealing significantly outperforms past self-supervised correspondence algorithms and performs on-par with (and sometimes exceeds) state-of-the-art supervised correspondence algorithms on several datasets -- without making use of any correspondence supervision or data augmentation and despite being trained exclusively on GAN-generated data. For precise correspondence, we improve upon state-of-the-art supervised methods by as much as $3\times$. We show applications of our method for augmented reality, image editing and automated pre-processing of image datasets for downstream GAN training.
In light of the finite nature of the wireless spectrum and the increasing demand for spectrum use arising from recent technological breakthroughs in wireless communication, the problem of interference continues to persist. Despite recent advancements in resolving interference issues, interference still presents a difficult challenge to effective usage of the spectrum. This is partly due to the rise in the use of license-free and managed shared bands for Wi-Fi, long term evolution (LTE) unlicensed (LTE-U), LTE licensed assisted access (LAA), 5G NR, and other opportunistic spectrum access solutions. As a result of this, the need for efficient spectrum usage schemes that are robust against interference has never been more important. In the past, most solutions to interference have addressed the problem by using avoidance techniques as well as non-AI mitigation approaches (for example, adaptive filters). The key downside to non-AI techniques is the need for domain expertise in the extraction or exploitation of signal features such as cyclostationarity, bandwidth and modulation of the interfering signals. More recently, researchers have successfully explored AI/ML enabled physical (PHY) layer techniques, especially deep learning which reduces or compensates for the interfering signal instead of simply avoiding it. The underlying idea of ML based approaches is to learn the interference or the interference characteristics from the data, thereby sidelining the need for domain expertise in suppressing the interference. In this paper, we review a wide range of techniques that have used deep learning to suppress interference. We provide comparison and guidelines for many different types of deep learning techniques in interference suppression. In addition, we highlight challenges and potential future research directions for the successful adoption of deep learning in interference suppression.
