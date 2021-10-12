
Robust Glare Detection: Review, Analysis, and Dataset Release

By Mahdi Abolfazli Esfahani, Han Wang
arxiv.org
 10 days ago

Sun Glare widely exists in the images captured by unmanned ground and aerial vehicles performing in outdoor environments. The existence of such artifacts in images will result in wrong feature extraction and failure of autonomous systems. Humans will try to adapt their view once they observe a

arxiv.org

Robust Performance Analysis of Source-Seeking Dynamics with Integral Quadratic Constraints

We analyze the performance of source-seeking dynamics involving either a single vehicle or multiple flocking-vehicles embedded in an underlying strongly convex scalar field with gradient based forcing terms. For multiple vehicles under flocking dynamics embedded in quadratic fields, we show that the dynamics of the center of mass are equivalent to the dynamics of a single agent. We leverage the recently developed framework of $\alpha$-integral quadratic constraints (IQCs) to obtain convergence rate estimates. We first present a derivation of \textit{hard} Zames-Falb (ZF) $\alpha$-IQCs involving general non-causal multipliers based on purely time-domain arguments and show that a parameterization of the ZF multiplier, suggested in the literature for the standard version of the ZF IQCs, can be adapted to the $\alpha$-IQCs setting to obtain quasi-convex programs for estimating convergence rates. Owing to the time-domain arguments, we can seamlessly extend these results to linear parameter varying (LPV) vehicles possibly opening the doors to non-linear vehicle models with quasi-LPV representations. We illustrate the theoretical results on a linear time invariant (LTI) model of a quadrotor, a non-minimum phase LTI plant and two LPV examples which show a clear benefit of using general non-causal dynamic multipliers to drastically reduce conservatism.
CARS
arxiv.org

Self-supervised Learning is More Robust to Dataset Imbalance

Self-supervised learning (SSL) is a scalable way to learn general visual representations since it learns without labels. However, large-scale unlabeled datasets in the wild often have long-tailed label distributions, where we know little about the behavior of SSL. In this work, we systematically investigate self-supervised learning under dataset imbalance. First, we find out via extensive experiments that off-the-shelf self-supervised representations are already more robust to class imbalance than supervised representations. The performance gap between balanced and imbalanced pre-training with SSL is significantly smaller than the gap with supervised learning, across sample sizes, for both in-domain and, especially, out-of-domain evaluation. Second, towards understanding the robustness of SSL, we hypothesize that SSL learns richer features from frequent data: it may learn label-irrelevant-but-transferable features that help classify the rare classes and downstream tasks. In contrast, supervised learning has no incentive to learn features irrelevant to the labels from frequent examples. We validate this hypothesis with semi-synthetic experiments and theoretical analyses on a simplified setting. Third, inspired by the theoretical insights, we devise a re-weighted regularization technique that consistently improves the SSL representation quality on imbalanced datasets with several evaluation criteria, closing the small gap between balanced and imbalanced datasets with the same number of examples.
COMPUTERS
arxiv.org

Dataset Condensation with Distribution Matching

The computational cost of training state-of-the-art deep models in many learning problems is rapidly increasing due to more sophisticated models and larger datasets. A recent promising direction to reduce training time is dataset condensation, which aims to replace the original large training set with a significantly smaller learned synthetic set while preserving its information. While training deep models on the small set of condensed images can be extremely fast, their synthesis remains computationally expensive due to the complex bi-level optimization and second-order derivative computation. In this work, we propose a simple yet effective dataset condensation technique that requires significantly lower training cost with comparable performance by matching feature distributions of the synthetic and original training images in sampled embedding spaces. Thanks to its efficiency, we apply our method to more realistic and larger datasets with sophisticated neural architectures and achieve a significant performance boost while using a larger synthetic training set. We also show various practical benefits of our method in continual learning and neural architecture search.
CODING & PROGRAMMING
arxiv.org

Interactive Analysis of CNN Robustness

While convolutional neural networks (CNNs) have found wide adoption as state-of-the-art models for image-related tasks, their predictions are often highly sensitive to small input perturbations, which the human vision is robust against. This paper presents Perturber, a web-based application that allows users to instantaneously explore how CNN activations and predictions evolve when a 3D input scene is interactively perturbed. Perturber offers a large variety of scene modifications, such as camera controls, lighting and shading effects, background modifications, object morphing, as well as adversarial attacks, to facilitate the discovery of potential vulnerabilities. Fine-tuned model versions can be directly compared for qualitative evaluation of their robustness. Case studies with machine learning experts have shown that Perturber helps users to quickly generate hypotheses about model vulnerabilities and to qualitatively compare model behavior. Using quantitative analyses, we could replicate users' insights with other CNN architectures and input images, yielding new insights about the vulnerability of adversarially trained models.
SOFTWARE
arxiv.org

Sharing FANCI Features: A Privacy Analysis of Feature Extraction for DGA Detection

The goal of Domain Generation Algorithm (DGA) detection is to recognize infections with bot malware and is often done with help of Machine Learning approaches that classify non-resolving Domain Name System (DNS) traffic and are trained on possibly sensitive data. In parallel, the rise of privacy research in the Machine Learning world leads to privacy-preserving measures that are tightly coupled with a deep learning model's architecture or training routine, while non deep learning approaches are commonly better suited for the application of privacy-enhancing methods outside the actual classification module. In this work, we aim to measure the privacy capability of the feature extractor of feature-based DGA detector FANCI (Feature-based Automated Nxdomain Classification and Intelligence). Our goal is to assess whether a data-rich adversary can learn an inverse mapping of FANCI's feature extractor and thereby reconstruct domain names from feature vectors. Attack success would pose a privacy threat to sharing FANCI's feature representation, while the opposite would enable this representation to be shared without privacy concerns. Using three real-world data sets, we train a recurrent Machine Learning model on the reconstruction task. Our approaches result in poor reconstruction performance and we attempt to back our findings with a mathematical review of the feature extraction process. We thus reckon that sharing FANCI's feature representation does not constitute a considerable privacy leakage.
SOFTWARE
arxiv.org

Robustness of Quantum Systems Subject to Decoherence: Structured Singular Value Analysis?

We study the problem of robust performance of quantum systems under structured uncertainties. A specific feature of closed (Hamiltonian) quantum systems is that their poles lie on the imaginary axis and that neither a coherent controller nor physically relevant structured uncertainties can alter this situation. This changes for open systems where decoherence ensures asymptotic stability and creates a unique landscape of pure performance robustness, with the distinctive feature that closed-loop stability is secured by the underlying physics and need not be enforced. This stability, however, is often detrimental to quantum-enhanced performance, and additive perturbations of the Hamiltonian give rise to dynamic generators that are nonlinear in the perturbed parameters, invalidating classical paradigms to assess robustness to structured perturbations such as singular value analysis. This problem is addressed using a fixed-point iteration approach to determine a maximum perturbation strength $\delta_{\max}$ that ensures that the transfer function remains bounded, $||T_\delta||<\delta^{-1}$ for $\delta<\delta_{\max}$.
SCIENCE
qualys.com

October 2021 Release: CVE ID Detection and Reporting

The Qualys Cloud Platform October 2021 release includes Qualys Cloud Suite 10.15.0.0, which contains new features and important enhancements in the Qualys Cloud Platform. Option to include CVEs in the host-based scan reports. Starting this release, Qualys introduces a powerful new capability to generate vulnerability reports based on CVEs and...
SOFTWARE
arxiv.org

Information-Theoretic Measures of Dataset Difficulty

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of $\textit{usable information}$. Measuring usable information is as easy as measuring performance, but has certain theoretical advantages. While the latter only allows us to compare different models w.r.t the same dataset, the former also allows us to compare different datasets w.r.t the same model. We then introduce $\textit{pointwise}$ $\mathcal{V}-$$\textit{information}$ (PVI) for measuring the difficulty of individual instances, where instances with higher PVI are easier for model $\mathcal{V}$. By manipulating the input before measuring usable information, we can understand $\textit{why}$ a dataset is easy or difficult for a given model, which we use to discover annotation artefacts in widely-used benchmarks.
COMPUTERS
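
To make the PVI idea in the abstract above concrete, here is a minimal sketch. The abstract gives no formula, so this assumes the commonly cited definition: the log-probability (in bits) of the gold label from a model that sees the input, minus the same quantity from a model that sees only an empty input. The function name `pvi` and the probabilities are hypothetical illustration values, not results from the paper.

```python
import math


def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information in bits (assumed definition).

    p_with_input: probability the model gives the gold label when shown the input.
    p_null_input: probability a model trained on empty inputs gives the gold label.
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


# Hypothetical instances: the input helps in the first case and hurts in the second.
easy_instance = pvi(p_with_input=0.9, p_null_input=0.5)
hard_instance = pvi(p_with_input=0.2, p_null_input=0.5)

print(f"easy instance PVI: {easy_instance:.3f} bits")
print(f"hard instance PVI: {hard_instance:.3f} bits")
```

Under this assumed definition, a positive value means the input made the gold label more likely than the label prior alone, which matches the abstract's statement that higher-PVI instances are easier for the model.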
arxiv.org

MReD: A Meta-Review Dataset for Controllable Text Generation

When directly using existing text generation datasets for controllable generation, we face the problem of lacking domain knowledge, and thus the aspects that can be controlled are limited. A typical example is the CNN/Daily Mail dataset for controllable text summarization, where there is no guiding information on the emphasis of summary sentences. A more useful text generator should leverage both the input text and control variables to guide the generation, which can only be built with deep understanding of the domain knowledge. Motivated by this vision, our paper introduces a new text generation dataset, named MReD. Our new dataset consists of 7,089 meta-reviews, and all its 45k meta-review sentences are manually annotated as one of 9 carefully defined categories, including abstract, strength, decision, etc. We present experimental results on state-of-the-art summarization models, and propose methods for controlled generation on both extractive and abstractive models using our annotated data. By exploring various settings and analysing the model behavior with respect to the control inputs, we demonstrate the challenges and values of our dataset. MReD allows us to have a better understanding of the meta-review corpora and enlarges the research room for controllable text generation.
COMPUTERS
arxiv.org

Wind-robust sound event detection and denoising for bioacoustics

Sound recordings are used in various ecological studies, including acoustic wildlife monitoring. Such surveys require automatic detection of target sound events. However, current detectors, especially those relying on band-limited energy, are severely impacted by wind. The rapid dynamics of this noise invalidate standard noise estimators, and no satisfactory method for dealing with it exists in bioacoustics, where simple training and generalization between conditions are important. We propose to estimate the transient noise level by fitting short-term spectrum models to a wavelet packet representation. This estimator is then combined with log-spectral subtraction to stabilize the background level. The resulting adjusted wavelet series can be analysed by standard energy detectors. We use real monitoring data to tune this workflow, and test it on two acoustic surveys of birds. Additionally, we show how the estimator can be incorporated in a denoising method to restore sound. The proposed noise-robust workflow greatly reduced the number of false alarms in the surveys, compared to unadjusted energy detection. As a result, the survey efficiency (precision of the estimated call density) improved for both species. Denoising was also more effective when using the short-term estimate, whereas standard wavelet shrinkage with a constant noise estimate struggled to remove the effects of wind. In contrast to existing methods, the proposed estimator can adjust for transient broadband noises without requiring additional hardware or extensive tuning to each species. It improved the detection workflow based on very little training data, making it particularly attractive for detection of rare species.
SCIENCE
arxiv.org

Span Detection for Aspect-Based Sentiment Analysis in Vietnamese

Aspect-based sentiment analysis plays an essential role in natural language processing and artificial intelligence. Recent research has focused only on aspect detection and sentiment classification while ignoring the sub-task of detecting user opinion spans, which has enormous potential in practical applications. In this paper, we present a new Vietnamese dataset (UIT-ViSD4SA) consisting of 35,396 human-annotated spans on 11,122 feedback comments for evaluating span detection in aspect-based sentiment analysis. We also propose a novel system using Bidirectional Long Short-Term Memory (BiLSTM) with a Conditional Random Field (CRF) layer (BiLSTM-CRF) for the span detection task in Vietnamese aspect-based sentiment analysis. The best result is a 62.76% F1 score (macro) for span detection using BiLSTM-CRF with embedding fusion of syllable embedding, character embedding, and contextual embedding from XLM-RoBERTa. In future work, span detection will be extended to many NLP tasks such as constructive detection, emotion recognition, complaint analysis, and opinion mining. Our dataset is freely available at this https URL for research purposes.
TECHNOLOGY
arxiv.org

MAAD: A Model and Dataset for "Attended Awareness" in Driving

We propose a computational model to estimate a person's attended awareness of their environment. We define attended awareness to be those parts of a potentially dynamic scene which a person has attended to in recent history and which they are still likely to be physically aware of. Our model takes as input scene information in the form of a video and noisy gaze estimates, and outputs visual saliency, a refined gaze estimate, and an estimate of the person's attended awareness. In order to test our model, we capture a new dataset with a high-precision gaze tracker including 24.5 hours of gaze sequences from 23 subjects attending to videos of driving scenes. The dataset also contains third-party annotations of the subjects' attended awareness based on observations of their scan path. Our results show that our model is able to reasonably estimate attended awareness in a controlled setting, and in the future could potentially be extended to real egocentric driving data to help enable more effective ahead-of-time warnings in safety systems and thereby augment driver performance. We also demonstrate our model's effectiveness on the tasks of saliency, gaze calibration, and denoising, using both our dataset and an existing saliency dataset. We make our model and dataset available at this https URL.
TECHNOLOGY
arxiv.org

Efficient Fully-Coherent Hamiltonian Simulation

Hamiltonian simulation is a fundamental problem at the heart of quantum computation, and the associated simulation algorithms are useful building blocks for designing larger quantum algorithms. In order to be successfully concatenated into a larger quantum algorithm, a Hamiltonian simulation algorithm must succeed with arbitrarily high success probability $1-\delta$ while only requiring a single copy of the initial state, a property which we call fully-coherent. Although optimal Hamiltonian simulation has been achieved by quantum signal processing (QSP), with query complexity linear in time $t$ and logarithmic in inverse error $\ln(1/\epsilon)$, the corresponding algorithm is not fully-coherent as it only succeeds with probability close to $1/4$. While this simulation algorithm can be made fully-coherent by employing amplitude amplification at the expense of appending a $\ln(1/\delta)$ multiplicative factor to the query complexity, here we develop a new fully-coherent Hamiltonian simulation algorithm that achieves a query complexity additive in $\ln(1/\delta)$: $\Theta\big( \|\mathcal{H}\| |t| + \ln(1/\epsilon) + \ln(1/\delta)\big)$. We accomplish this by compressing the spectrum of the Hamiltonian with an affine transformation, and applying to it a QSP polynomial that approximates the complex exponential only over the range of the compressed spectrum. We further numerically analyze the complexity of this algorithm and demonstrate its application to the simulation of the Heisenberg model in constant and time-dependent external magnetic fields. We believe that this efficient fully-coherent Hamiltonian simulation algorithm can serve as a useful subroutine in quantum algorithms where maintaining coherence is paramount.
COMPUTERS
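
The difference between the multiplicative and additive $\ln(1/\delta)$ scaling described in the abstract above can be illustrated with a small calculation. This is only a sketch: the functions drop all constant factors hidden by the $\Theta$ notation, and the parameter values are hypothetical rather than taken from the paper.

```python
import math


def multiplicative_cost(norm_h: float, t: float, eps: float, delta: float) -> float:
    """QSP plus amplitude amplification: ln(1/delta) multiplies the base cost."""
    base = norm_h * abs(t) + math.log(1 / eps)
    return base * math.log(1 / delta)


def additive_cost(norm_h: float, t: float, eps: float, delta: float) -> float:
    """Fully-coherent scaling from the abstract: ln(1/delta) is only added."""
    return norm_h * abs(t) + math.log(1 / eps) + math.log(1 / delta)


# Hypothetical setting: ||H|| = 1, t = 100, error and failure probability both 1e-6.
params = dict(norm_h=1.0, t=100.0, eps=1e-6, delta=1e-6)
mult = multiplicative_cost(**params)
add = additive_cost(**params)

print(f"multiplicative scaling: {mult:.1f}")
print(f"additive scaling:       {add:.1f}")
print(f"ratio:                  {mult / add:.1f}x")
```

For small $\delta$ the additive form grows far more slowly, which is the advantage the abstract claims; the numbers printed here are unitless scaling estimates, not query counts.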
arxiv.org

Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory

Schrödinger Bridge (SB) is an optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Score-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing parameterized log-likelihood objectives. This raises questions about the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory -- a mathematical methodology from stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalize the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality without losing the applicability of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10.
CODING & PROGRAMMING
arxiv.org

Presenting a Larger Up-to-date Movie Dataset and Investigating the Effects of Pre-released Attributes on Gross Revenue

Movie-making has become one of the most costly and risky endeavors in the entertainment industry. Continuous change in audience preferences makes it harder to predict what kind of movie will be financially successful at the box office. So, it is no wonder that cautious, intelligent stakeholders and large production houses will always want to know the probable revenue that will be generated by a movie before making an investment. Researchers have been working on finding an optimal strategy to help investors make the right decisions, but the lack of a large, up-to-date dataset makes their work harder. In this work, we introduce an up-to-date, richer, and larger dataset that we have prepared by scraping IMDb for researchers and data analysts to work with. The compiled dataset contains the summary data of 7.5 million titles and detailed information on more than 200K movies. Additionally, we apply different statistical analysis approaches to our dataset to find out how a movie's revenue is affected by different pre-released attributes such as budget, runtime, release month, content rating, genre, etc. In our analysis, we have found that having a star cast/director has a positive impact on generated revenue. We introduce a novel approach for calculating the star power of a movie. Based on our analysis, we select a set of attributes as features and train different machine learning algorithms to predict a movie's expected revenue. Based on generated revenue, we classified the movies into 10 categories and achieved a one-class-away accuracy rate of almost 60% (bingo accuracy of 30%). All the generated datasets and analysis code are available online. We also made the source code of our scraper bots public, so that researchers interested in extending this work can easily modify these bots as they need and prepare their own up-to-date datasets.
MOVIES
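
The two accuracy figures quoted in the abstract above can be read as follows. The abstract does not define them, so this sketch assumes the usual convention for ordered revenue classes: "bingo" counts exact class matches, and "one-class-away" also accepts predictions that miss by a single adjacent class. The class labels below are hypothetical and only illustrate the arithmetic.

```python
def bingo_accuracy(y_true, y_pred):
    """Fraction of predictions that hit the exact revenue class."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def one_away_accuracy(y_true, y_pred):
    """Fraction of predictions within one class of the true class (assumed definition)."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= 1)
    return hits / len(y_true)


# Hypothetical revenue classes on a 0-9 scale, matching the 10 categories in the abstract.
true_classes = [0, 3, 5, 9, 7]
predicted_classes = [0, 4, 2, 9, 5]

bingo = bingo_accuracy(true_classes, predicted_classes)
one_away = one_away_accuracy(true_classes, predicted_classes)

print(f"bingo accuracy:    {bingo:.2f}")
print(f"one-away accuracy: {one_away:.2f}")
```

Because every exact match is also within one class, the one-away figure can never be lower than the bingo figure, consistent with the roughly 60% versus 30% reported in the abstract.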
arxiv.org

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. Lastly, we show baseline results on image retrieval and audio retrieval tasks. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting.
CODING & PROGRAMMING
arxiv.org

Datasets are not Enough: Challenges in Labeling Network Traffic

In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent the network behavior just for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work is focused on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current methods for labeling suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for helping in the acceptance of novel detection approaches based on statistical and machine learning techniques.
COMPUTERS
arxiv.org

Adversarial Attack across Datasets

It has been observed that Deep Neural Networks (DNNs) are vulnerable to transfer attacks in the query-free black-box setting. However, all previous studies on transfer attacks assume that the white-box surrogate models possessed by the attacker and the black-box victim models are trained on the same dataset, which means the attacker implicitly knows the label set and the input size of the victim model. This assumption is usually unrealistic, as the attacker may not know the dataset used by the victim model, and further, the attacker needs to attack any randomly encountered images that may not come from the same dataset. Therefore, in this paper we define a new Generalized Transferable Attack (GTA) problem where we assume the attacker has a set of surrogate models trained on different datasets (with different label sets and image sizes), and none of them is equal to the dataset used by the victim model. We then propose a novel method called Image Classification Eraser (ICE) to erase classification information for any encountered images from arbitrary datasets. Extensive experiments on Cifar-10, Cifar-100, and TieredImageNet demonstrate the effectiveness of the proposed ICE on the GTA problem. Furthermore, we show that existing transfer attack methods can be modified to tackle the GTA problem, but with significantly worse performance compared with ICE.
COMPUTERS
towardsdatascience.com

A PySpark Example for Dealing with Larger than Memory Datasets

A step-by-step tutorial on how to use Spark to perform exploratory data analysis on larger than memory datasets. Analyzing datasets that are larger than the available RAM using Jupyter notebooks and Pandas DataFrames is a challenging issue. This problem has already been addressed (for instance here or here) but my objective here is a little different. I will be presenting a method for performing exploratory analysis on a large data set with the purpose of identifying and filtering out the unnecessary data. The hope is that at the end the filtered data set can be handled by Pandas for the rest of the computations.
SOFTWARE
arxiv.org

Spectral Analysis of Solar Radio Type III Bursts from 20 kHz to 410 MHz

K. Sasikumar Raja, Milan Maksimovic, Eduard P. Kontar, Xavier Bonnin, Philippe Zarka, Laurent Lamy, Hamish Reid, Nicole Vilmer, Alain Lecacheux, Vratislav Krupar, Baptiste Cecconi, Lahmiti Nora, Laurent Denis. We present the statistical analysis of the spectral response of solar radio type III bursts over the wide frequency range between 20...
ASTRONOMY

