Date: 2021-10-12
AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition

By Arijit Dasgupta, Jiafei Duan, Marcelo H. Ang Jr, Cheston Tan
arxiv.org
 10 days ago

Recent work in cognitive reasoning and computer vision has engendered an increasing popularity for the Violation-of-Expectation (VoE) paradigm in synthetic datasets. Inspired by work in infant psychology, researchers have started evaluating a

ScienceAlert

A Physicist Quantified The Amount of Information in The Entire Observable Universe

In attempts to understand the very nature of our reality, physicists sure have some mind-bending theories. Like what if information is a tangible and fundamental aspect of physical reality itself – alongside matter and energy? Or, alternatively, what if information is the fifth state of matter? Information is, after all, something all matter and energy measurably possess. The rules that govern their existence, like their mass, speed, or charge, are all bits of information they contain. So to allow experimental probing of such ideas, physicist Melvin Vopson from the University of Portsmouth in the UK estimated how much information a single elementary...
ASTRONOMY
arxiv.org

Dataset Condensation with Distribution Matching

Computational cost to train state-of-the-art deep models in many learning problems is rapidly increasing due to more sophisticated models and larger datasets. A recent promising direction to reduce training time is dataset condensation that aims to replace the original large training set with a significantly smaller learned synthetic set while preserving its information. While training deep models on the small set of condensed images can be extremely fast, their synthesis remains computationally expensive due to the complex bi-level optimization and second-order derivative computation. In this work, we propose a simple yet effective dataset condensation technique that requires significantly lower training cost with comparable performance by matching feature distributions of the synthetic and original training images in sampled embedding spaces. Thanks to its efficiency, we apply our method to more realistic and larger datasets with sophisticated neural architectures and achieve a significant performance boost while using a larger synthetic training set. We also show various practical benefits of our method in continual learning and neural architecture search.
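As a rough illustration of "matching feature distributions in sampled embedding spaces", the sketch below computes one possible matching loss: the squared distance between the mean features of a real batch and a synthetic batch under a randomly sampled embedding. The abstract does not state the distance used, so the mean-feature distance, the linear-plus-ReLU embedding, and all function names here are assumptions for illustration, not the authors' implementation. A full method would also update the synthetic images by gradient descent on this loss across many sampled embeddings, which this sketch omits.

```python
import random


def random_embedding(in_dim, out_dim, rng):
    """Sample a random linear embedding (hypothetical stand-in for a sampled network)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def embed(weights, x):
    """Project x with the sampled weights, then apply a ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def mean_feature(weights, batch):
    """Average the embedded features over a non-empty batch."""
    feats = [embed(weights, x) for x in batch]
    return [sum(col) / len(feats) for col in zip(*feats)]


def distribution_matching_loss(real, synthetic, weights):
    """Squared distance between mean features of the real and synthetic batches."""
    mean_real = mean_feature(weights, real)
    mean_syn = mean_feature(weights, synthetic)
    return sum((a - b) ** 2 for a, b in zip(mean_real, mean_syn))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = random_embedding(in_dim=4, out_dim=8, rng=rng)
    real = [[rng.random() for _ in range(4)] for _ in range(10)]
    synthetic = [[rng.random() for _ in range(4)] for _ in range(2)]
    print(distribution_matching_loss(real, synthetic, weights))
```

The loss needs only forward passes through the embedding, which is consistent with the abstract's point that the approach avoids bi-level optimization and second-order derivatives.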
CODING & PROGRAMMING
comptia.org

Artificial Intelligence in Cybersecurity Operations

It’s a well-known fact that the cybersecurity industry faces a dramatic shortage of talented professionals with the necessary knowledge and skills. Due to a limited pool of cybersecurity resources, organizations are applying artificial intelligence (AI) to automate routine tasks. Leveraging AI for cybersecurity operations frees up cybersecurity professionals to focus...
SOFTWARE
TheConversationAU

Facebook wants AI to find your keys and understand your conversations

Facebook has announced a research project that aims to push the “frontier of first-person perception”, and in the process help you remember where you left your keys. The Ego4D project provides a huge collection of first-person video and related data, plus a set of challenges for researchers to teach computers to understand the data and gather useful information from it. In September, the social media giant launched a line of “smart glasses” called Ray-Ban Stories, which carry a digital camera and other features. Much like the Google Glass project, which met mixed reviews in 2013, this one has prompted complaints of...
SOFTWARE
arxiv.org

Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset

Visually-grounded spoken language datasets can enable models to learn cross-modal correspondences with very weak supervision. However, modern audio-visual datasets contain biases that undermine the real-world performance of models trained on that data. We introduce Spoken ObjectNet, which is designed to remove some of these biases and provide a way to better evaluate how effectively models will perform in real-world scenarios. This dataset expands upon ObjectNet, which is a bias-controlled image dataset that features similar image classes to those present in ImageNet. We detail our data collection pipeline, which features several methods to improve caption quality, including automated language model checks. Lastly, we show baseline results on image retrieval and audio retrieval tasks. These results show that models trained on other datasets and then evaluated on Spoken ObjectNet tend to perform poorly due to biases in other datasets that the models have learned. We also show evidence that the performance decrease is due to the dataset controls, and not the transfer setting.
CODING & PROGRAMMING
arxiv.org

Robust Glare Detection: Review, Analysis, and Dataset Release

Sun glare is common in images captured by unmanned ground and aerial vehicles operating in outdoor environments. Such artifacts can cause incorrect feature extraction and failures in autonomous systems. Humans will try to adapt their view once they observe a glare (especially when driving), and this behavior is an essential requirement for the next generation of autonomous vehicles. The source of glare is not limited to the sun: glare can also appear in images captured at night and in indoor environments because of other light sources, and reflective surfaces also contribute to such artifacts. The glare's visual characteristics differ across images captured by various cameras and depend on several factors such as the camera's shutter speed and exposure level. Hence, it is challenging to introduce a general (robust and accurate) algorithm for glare detection that can perform well in various captured images. This research aims to introduce the first dataset for glare detection, which includes images captured by different cameras. Besides, the effect of multiple image representations and their combination in glare detection is examined using the proposed deep network architecture. The released dataset is available at this https URL.
TECHNOLOGY
arxiv.org

Datasets are not Enough: Challenges in Labeling Network Traffic

In contrast to previous surveys, the present work is not focused on reviewing the datasets used in the network security field. The fact is that many of the available public labeled datasets represent the network behavior just for a particular time period. Given the rate of change in malicious behavior and the serious challenge of labeling and maintaining these datasets, they quickly become obsolete. Therefore, this work is focused on the analysis of current labeling methodologies applied to network-based data. In the field of network security, the process of labeling a representative network traffic dataset is particularly challenging and costly since very specialized knowledge is required to classify network traces. Consequently, most of the current traffic labeling methods are based on the automatic generation of synthetic network traces, which hides many of the essential aspects necessary for a correct differentiation between normal and malicious behavior. Alternatively, a few other methods incorporate non-expert users in the labeling process of real traffic with the help of visual and statistical tools. However, after conducting an in-depth analysis, it seems that all current methods for labeling suffer from fundamental drawbacks regarding the quality, volume, and speed of the resulting dataset. This lack of consistent methods for continuously generating a representative dataset with an accurate and validated methodology must be addressed by the network security research community. Moreover, a consistent labeling methodology is a fundamental condition for helping in the acceptance of novel detection approaches based on statistical and machine learning techniques.
COMPUTERS
arxiv.org

MReD: A Meta-Review Dataset for Controllable Text Generation

When directly using existing text generation datasets for controllable generation, we are facing the problem of not having the domain knowledge and thus the aspects that could be controlled are limited. A typical example is when using the CNN/Daily Mail dataset for controllable text summarization, there is no guided information on the emphasis of summary sentences. A more useful text generator should leverage both the input text and control variables to guide the generation, which can only be built with deep understanding of the domain knowledge. Motivated by this vision, our paper introduces a new text generation dataset, named MReD. Our new dataset consists of 7,089 meta-reviews and all its 45k meta-review sentences are manually annotated as one of the carefully defined 9 categories, including abstract, strength, decision, etc. We present experimental results on state-of-the-art summarization models, and propose methods for controlled generation on both extractive and abstractive models using our annotated data. By exploring various settings and analysing the model behavior with respect to the control inputs, we demonstrate the challenges and values of our dataset. MReD allows us to have a better understanding of the meta-review corpora and enlarge the research room for controllable text generation.
COMPUTERS
arxiv.org

Adversarial Attack across Datasets

It has been observed that Deep Neural Networks (DNNs) are vulnerable to transfer attacks in the query-free black-box setting. However, all the previous studies on transfer attack assume that the white-box surrogate models possessed by the attacker and the black-box victim models are trained on the same dataset, which means the attacker implicitly knows the label set and the input size of the victim model. However, this assumption is usually unrealistic as the attacker may not know the dataset used by the victim model, and further, the attacker needs to attack any randomly encountered images that may not come from the same dataset. Therefore, in this paper we define a new Generalized Transferable Attack (GTA) problem where we assume the attacker has a set of surrogate models trained on different datasets (with different label sets and image sizes), and none of them is equal to the dataset used by the victim model. We then propose a novel method called Image Classification Eraser (ICE) to erase classification information for any encountered images from an arbitrary dataset. Extensive experiments on Cifar-10, Cifar-100, and TieredImageNet demonstrate the effectiveness of the proposed ICE on the GTA problem. Furthermore, we show that existing transfer attack methods can be modified to tackle the GTA problem, but with significantly worse performance compared with ICE.
COMPUTERS
arxiv.org

NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels

Deep learning has shown remarkable progress in a wide range of problems. However, efficient training of such models requires large-scale datasets, and getting annotations for such datasets can be challenging and costly. In this work, we explore the use of user-generated freely available labels from web videos for video understanding. We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information. We utilize the collected dataset for action classification and demonstrate its usefulness with existing small-scale annotated datasets, UCF101 and HMDB51. We study different loss functions and two pretraining strategies, simple and self-supervised learning. We also show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets. We present this as a benchmark dataset in noisy learning for video understanding. The dataset, code, and trained models will be publicly available for future research.
COMPUTERS
EETimes.com

On the Verge of Artificial Vision

On the Weekly Briefing podcast: Prosthetic vision, a common concept in science-fiction, has long been out of reach in reality – but perhaps for not much longer. Researchers are about to start experiments to see if they can restore vision to the blind using prosthetics based on advanced sensor technology. Our guest is Philip Troyk, head of the Pritzker Institute of Biomedical Science and Engineering at Illinois Tech and the CEO of semiconductor supplier Sigenics.
TECHNOLOGY
arxiv.org

Quantifying Cognitive Factors in Lexical Decline

We adopt an evolutionary view on language change in which cognitive factors (in addition to social ones) affect the fitness of words and their success in the linguistic ecosystem. Specifically, we propose a variety of psycholinguistic factors -- semantic, distributional, and phonological -- that we hypothesize are predictive of lexical decline, in which words greatly decrease in frequency over time. Using historical data across three languages (English, French, and German), we find that most of our proposed factors show a significant difference in the expected direction between each curated set of declining words and their matched stable words. Moreover, logistic regression analyses show that semantic and distributional factors are significant in predicting declining words. Further diachronic analysis reveals that declining words tend to decrease in the diversity of their lexical contexts over time, gradually narrowing their 'ecological niches'.
SCIENCE
arxiv.org

HumBugDB: A Large-scale Acoustic Mosquito Dataset

Ivan Kiskin, Marianne Sinka, Adam D. Cobb, Waqas Rafique, Lawrence Wang, Davide Zilli, Benjamin Gutteridge, Rinita Dam, Theodoros Marinos, Yunpeng Li, Dickson Msaky, Emmanuel Kaindoa, Gerard Killeen, Eva Herreros-Moya, Kathy J. Willis, Stephen J. Roberts. This paper presents the first large-scale multi-species dataset of acoustic recordings of mosquitoes tracked continuously...
ANIMALS
arxiv.org

Information-Theoretic Measures of Dataset Difficulty

Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. Not only is this framework informal, but it also provides little understanding of how difficult each instance is, or what attributes make it difficult for a given model. To address these problems, we propose an information-theoretic perspective, framing dataset difficulty as the absence of $\textit{usable information}$. Measuring usable information is as easy as measuring performance, but has certain theoretical advantages. While the latter only allows us to compare different models w.r.t. the same dataset, the former also allows us to compare different datasets w.r.t. the same model. We then introduce $\textit{pointwise}$ $\mathcal{V}$-$\textit{information}$ (PVI) for measuring the difficulty of individual instances, where instances with higher PVI are easier for model $\mathcal{V}$. By manipulating the input before measuring usable information, we can understand $\textit{why}$ a dataset is easy or difficult for a given model, which we use to discover annotation artefacts in widely-used benchmarks.
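The abstract does not spell out how PVI is computed. A minimal sketch, assuming PVI is the gain in log-probability (in bits) that a model assigns to the correct label when it sees the input, compared with a model of the same family that sees only an empty (null) input. The function name and both probability arguments are hypothetical, and in practice the two probabilities would come from two separately fine-tuned models.

```python
import math


def pointwise_v_information(p_label_given_input: float, p_label_given_null: float) -> float:
    """PVI in bits, under the assumed definition:
    log2 p(label | input) - log2 p(label | null input).
    Both probabilities must be strictly positive."""
    return math.log2(p_label_given_input) - math.log2(p_label_given_null)


if __name__ == "__main__":
    # The input raises the probability of the correct label: positive PVI (easier instance).
    print(pointwise_v_information(0.5, 0.25))
    # The input lowers the probability of the correct label: negative PVI (harder instance).
    print(pointwise_v_information(0.25, 0.5))
```

Under this assumed definition, averaging PVI over a dataset gives a dataset-level estimate of usable information, which matches the abstract's statement that higher PVI means an easier instance.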
COMPUTERS
arxiv.org

MAAD: A Model and Dataset for "Attended Awareness" in Driving

We propose a computational model to estimate a person's attended awareness of their environment. We define attended awareness to be those parts of a potentially dynamic scene which a person has attended to in recent history and which they are still likely to be physically aware of. Our model takes as input scene information in the form of a video and noisy gaze estimates, and outputs visual saliency, a refined gaze estimate, and an estimate of the person's attended awareness. In order to test our model, we capture a new dataset with a high-precision gaze tracker including 24.5 hours of gaze sequences from 23 subjects attending to videos of driving scenes. The dataset also contains third-party annotations of the subjects' attended awareness based on observations of their scan path. Our results show that our model is able to reasonably estimate attended awareness in a controlled setting, and in the future could potentially be extended to real egocentric driving data to help enable more effective ahead-of-time warnings in safety systems and thereby augment driver performance. We also demonstrate our model's effectiveness on the tasks of saliency, gaze calibration, and denoising, using both our dataset and an existing saliency dataset. We make our model and dataset available at this https URL.
TECHNOLOGY
loc.gov

Artificial Intelligence: The Copyright Connection

The following is a guest post by Whitney Levandusky, Supervisory Copyright Claims Attorney in the Office of the General Counsel. Artificial intelligence (AI)—machine learning systems set to accomplish tasks—has captivated the public, filled headlines, and prompted new and broad policy discussions. AI, however, is nothing new, with the term “artificial intelligence” coined in the 1950s. Research and investment in AI rises and falls through “summers” and “winters,” but you can see the pull of AI problems and solutions in such wide-ranging applications as government administration—the U.S. Postal Service implemented a machine learning system to read handwritten mailing addresses in 1997—and entertainment—an AI system has won Jeopardy, and the topic is a central concern in a Steven Spielberg movie.
TECHNOLOGY
VentureBeat

Facebook quietly acquires synthetic data startup AI.Reverie

Facebook has quietly acquired AI.Reverie, a New York-based startup creating synthetic data to train machine learning models, VentureBeat has learned. In an apparent nod to the HBO show Westworld, where visitors to a theme park encounter hordes of artificially intelligent robots, the purchase was made through a holding company called Dolores Acquisition Sub, Inc., after a character in the show.
BUSINESS
arxiv.org

A ground-truth dataset of real security patches

Training machine learning approaches for vulnerability identification and producing reliable tools to assist developers in implementing quality software -- free of vulnerabilities -- is challenging due to the lack of large datasets and real data. Researchers have been looking at these issues and building datasets. However, these datasets usually miss natural language artifacts and programming language diversity. We scraped the entire CVE details database for GitHub references and augmented the data with 3 security-related datasets. We used the data to create a ground-truth dataset of natural language artifacts (such as commit messages, commit comments, and summaries), meta-data and code changes. Our dataset integrates a total of 8057 security-relevant commits -- equivalent to 5942 security patches -- from 1339 different projects spanning 146 different types of vulnerabilities and 20 languages. A dataset of 110k non-security-related commits is also provided. Data and scripts are all available on GitHub. Data is stored in a .CSV file. Codebases can be downloaded using our scripts. Our dataset is a valuable asset to answer research questions on different topics such as the identification of security-relevant information using NLP models; software engineering and security best practices; vulnerability detection and patching; and security program analysis.
SOFTWARE
arxiv.org

Eigenbehaviour as an Indicator of Cognitive Abilities

With growing usage of machine learning algorithms and big data in health applications, digital biomarkers have become an important key feature to ensure the success of those applications. In this paper, we focus on one important use-case, the long-term continuous monitoring of the cognitive ability of older adults. The cognitive ability is a factor both for long-term monitoring of people living alone as well as an outcome in clinical studies. In this work, we propose a new digital biomarker for cognitive abilities based on location eigenbehaviour obtained from contactless ambient sensors. Indoor location information obtained from passive infrared sensors is used to build a location matrix covering several weeks of measurement. Based on the eigenvectors of this matrix, the reconstruction error is calculated for various numbers of used eigenvectors. The reconstruction error is used to predict cognitive ability scores collected at baseline, using linear regression. Additionally, classification of normal versus pathological cognition level is performed using a support vector machine. Prediction performance is strong for high levels of cognitive ability, but grows weaker for low levels of cognitive ability. Classification into normal versus pathological cognitive ability level reaches high accuracy with an AUC = 0.94. Due to the unobtrusive method of measurement based on contactless ambient sensors, this digital biomarker of cognitive ability is easily obtainable. The usage of the reconstruction error is a strong digital biomarker for the binary classification and, to a lesser extent, for more detailed prediction of interindividual differences in cognition.
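The reconstruction-error step described above can be sketched as follows. This is only an illustration under stated assumptions: the location matrix is taken to have one row per day and one column per (hypothetical) location/time feature, the eigenvectors are obtained from an SVD of the mean-centered matrix, and the error is reported as a relative Frobenius norm. The abstract does not specify the matrix layout or the error normalization, and the function name is invented for this example.

```python
import numpy as np


def reconstruction_error(location_matrix: np.ndarray, k: int) -> float:
    """Relative error when rebuilding the centered matrix from its top-k eigenvectors."""
    centered = location_matrix - location_matrix.mean(axis=0)
    total = np.linalg.norm(centered)
    if total == 0.0:
        return 0.0  # constant behaviour: nothing left to reconstruct
    # Rows of vt are the eigenvectors of the feature covariance, ordered by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]
    reconstructed = centered @ basis.T @ basis
    return float(np.linalg.norm(centered - reconstructed) / total)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for sensor data: 28 days, 12 location/time features.
    days = rng.random((28, 12))
    for k in (1, 3, 6, 12):
        print(k, reconstruction_error(days, k))
```

The error curve over several values of `k` would then serve as the input features for a regression or classification model, as the abstract describes; that downstream step is not shown here.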
HEALTH
arxiv.org

A Dataset for Discourse Structure in Peer Review Discussions

At the foundation of scientific evaluation is the labor-intensive process of peer review. This critical task requires participants to consume and interpret vast amounts of highly technical text. We show that discourse cues from rebuttals can shed light on the quality and interpretation of reviews. Further, an understanding of the argumentative strategies employed by the reviewers and authors provides useful signal for area chairs and other decision makers.
SCIENCE

