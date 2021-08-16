

Dealing with Large Datasets: the Present Conundrum

By Editors' Picks
towardsdatascience.com
 6 days ago

With the ever-growing power of today’s machine learning algorithms, there’s no denying that something has changed over time to drive this revolution. The reason is that there’s a lot more data for learning algorithms to learn from as time progresses. This is good. The more the merrier. In...

IN THIS ARTICLE
#Small Data #Datasets #Data Collection #Data Processing #Conundrum #Stochastic #Theta
Related
Environment | arxiv.org

Reconciling high resolution climate datasets using KrigR

There is an increasing need for high spatial and temporal resolution climate data for the wide community of researchers interested in climate change and its consequences. Currently, there is a large mismatch between the spatial resolutions of global climate model and reanalysis datasets (at best around 0.25° and 0.1° respectively) and the resolutions needed by many end-users of these datasets, which are typically on the scale of 30 arcseconds (~900m). This need for improved spatial resolution in climate datasets has motivated several groups to statistically downscale various combinations of observational or reanalysis datasets. However, the variety of downscaling methods and inputs used makes it difficult to reconcile the resultant differences between these high-resolution datasets. Here we make use of the KrigR R-package to statistically downscale the world-leading ERA5(-Land) reanalysis data using kriging. We show that kriging can accurately recover spatial heterogeneity of climate data given strong relationships with covariates; that by preserving the uncertainty associated with the statistical downscaling, one can investigate and account for confidence in high-resolution climate data; and that the statistical uncertainty provided by KrigR can explain much of the difference between widely used high resolution climate datasets (CHELSA, TerraClimate, and WorldClim2) depending on variable, timescale, and region. This demonstrates the advantages of using KrigR to generate customized high spatial and/or temporal resolution climate data.
Technology | arxiv.org

Organization and Understanding of a Tactile Information Dataset TacAct For Physical Human-Robot Interaction

Humans touching a robot to convey intentions or emotions is an essential communication pathway during physical Human-Robot Interaction (pHRI). Therefore, advanced service robots require superior tactile intelligence to guarantee naturalness and safety when making physical contact with human subjects. Tactile intelligence is the capability to perceive and recognize tactile information from touch behaviors, in which understanding the physical meaning of touching actions is crucial. For this purpose, this report introduces a recently collected and organized dataset, "TacAct", that contains real-time tactile information recorded while human subjects touched a test device mimicking a robot forearm. The dataset contains 24,000 touch actions of 12 types from 50 subjects. The dataset details are described, the data are preliminarily analyzed, and the validity of the dataset is tested with a convolutional neural network, LeNet-5, which classifies different types of touch actions. We believe that the TacAct dataset would be beneficial for the community to understand the touch intention under various circumstances and to develop learning-based intelligent algorithms for different applications.
Coding & Programming | arxiv.org

How Nonconformity Functions and Difficulty of Datasets Impact the Efficiency of Conformal Classifiers

The property of conformal predictors to guarantee the required accuracy rate makes this framework attractive in various practical applications. However, this property is achieved at the price of a reduction in precision. In the case of conformal classification, the systems can output multiple class labels instead of one. It is also known from the literature that the choice of nonconformity function has a major impact on the efficiency of conformal classifiers. Recently, it was shown that different model-agnostic nonconformity functions result in conformal classifiers with different characteristics. For a Neural Network-based conformal classifier, the inverse probability (or hinge loss) allows minimizing the average number of predicted labels, and margin results in a larger fraction of singleton predictions. In this work, we aim to further extend this study. We perform an experimental evaluation using 8 different classification algorithms and discuss when the previously observed relationship holds or not. Additionally, we propose a successful method to combine the properties of these two nonconformity functions. The experimental evaluation is done using 11 real and 5 synthetic datasets.
Science | arxiv.org

A Dataset for Answering Time-Sensitive Questions

Time is an important dimension in our physical world. Lots of facts can evolve with respect to time. For example, the U.S. President might change every four years. Therefore, it is important to consider the time dimension and empower the existing QA models to reason over time. However, the existing QA datasets contain rather few time-sensitive questions, hence are not suitable for diagnosing or benchmarking the model's temporal reasoning capability. In order to promote research in this direction, we propose to construct a time-sensitive QA dataset. The dataset is constructed by 1) mining time-evolving facts from WikiData and aligning them to their corresponding Wikipedia page, 2) employing crowd workers to verify and calibrate these noisy facts, 3) generating question-answer pairs based on the annotated time-sensitive facts. Our dataset poses two novel challenges: 1) the model needs to understand both explicit and implicit mention of time information in the long document, 2) the model needs to perform temporal reasoning like comparison, addition, subtraction. We evaluate different SoTA long-document QA systems like BigBird and FiD on our dataset. The best-performing model FiD can only achieve 46% accuracy, still far behind the human performance of 87%. We demonstrate that these models are still lacking the ability to perform robust temporal understanding and reasoning. Therefore, we believe that our dataset could serve as a benchmark to empower future studies in temporal reasoning. The dataset and code are released at this https URL.
Agriculture | arxiv.org

Presenting an extensive lab- and field-image dataset of crops and weeds for computer vision tasks in agriculture

Michael A. Beck, Chen-Yi Liu, Christopher P. Bidinosti, Christopher J. Henry, Cara M. Godee, Manisha Ajmani. We present two large datasets of labelled plant images that are suited to the training of machine learning and computer vision models. The first dataset encompasses, as of the day of writing, over 1.2 million images of indoor-grown crops and weeds common to the Canadian Prairies and many US states. The second dataset consists of over 540,000 images of plants imaged in farmland. All indoor plant images are labelled by species and we provide rich metadata on the level of individual images. This comprehensive database allows filtering the datasets under user-defined specifications such as the crop type or the age of the plant. Furthermore, the indoor dataset contains images of plants taken from a wide variety of angles, including profile shots, top-down shots, and angled perspectives. The images taken from plants in fields are all from a top-down perspective and usually contain multiple plants per image. For these images metadata is also available. In this paper we describe both datasets' characteristics with respect to plant variety, plant age, and number of images. We further introduce an open-access sample of the indoor dataset that contains 1,000 images of each species covered in our dataset. These 14,000 images in total were selected such that they form a representative sample with respect to plant age and individual plants per species. This sample serves as a quick entry point for new users to the dataset, allowing them to explore the data on a small scale and find the parameters of data most useful for their application without having to deal with hundreds of thousands of individual images.
Economy | aithority.com

Infutor Adds Total Property Profiles Dataset on SafeGraph Shop

Enhanced, Highly Comprehensive Property Signals Include 200+ Attributes for Geospatial Mapping, Urban Planning, Utility & Energy Consumption and Prospecting Borrowers. Consumer identity management expert Infutor announced that it has added its Total Property Files dataset to the new SafeGraph Shop which includes energy consumption and county assessor energy level data assets. The partnership with SafeGraph, a data company that specializes in providing high-quality data on places, enables easy and reliable access to Infutor’s comprehensive property data, a critical component to geospatial mapping, urban planning, and prospecting borrowers.
Career Development & Advice | CMSWire

The Eminence Conundrum

“Eminence building is not a process — it is a way of life”. Eminence may not be the first word that springs to mind when looking to further our careers. But professional recognition and eminence are inextricably linked. We achieve professional recognition by creating value. Eminence not only creates value, it also ensures value is communicated through journals, conferences, social media and other external forums, so that our ideas can inspire others, spark innovation and extend beyond our immediate sphere of influence.
Software | winbuzzer.com

Microsoft Releases Public Dataset from First SimuLand Experiment

Microsoft is this week launching a public dataset taken from its first SimuLand event. If you are unfamiliar with the SimuLand initiative, it provides researchers with tools to test how services like Azure Defender, Microsoft 365 Defender, and Azure Sentinel handle attacks. During the first open-source SimuLand event last month...
Coding & Programming | arxiv.org

QDataset: Quantum Datasets for Machine Learning

The availability of large-scale datasets on which to train, benchmark and test algorithms has been central to the rapid development of machine learning as a discipline and its maturity as a research discipline. Despite considerable advancements in recent years, the field of quantum machine learning (QML) has thus far lacked a set of comprehensive large-scale datasets upon which to benchmark the development of algorithms for use in applied and theoretical quantum settings. In this paper, we introduce such a dataset, the QDataSet, a quantum dataset designed specifically to facilitate the training and development of QML algorithms. The QDataSet comprises 52 high-quality publicly available datasets derived from simulations of one- and two-qubit systems evolving in the presence and/or absence of noise. The datasets are structured to provide a wealth of information to enable machine learning practitioners to use the QDataSet to solve problems in applied quantum computation, such as quantum control, quantum spectroscopy and tomography. Accompanying the datasets on the associated GitHub repository are a set of workbooks demonstrating the use of the QDataSet in a range of optimisation contexts.
Coding & Programming | towardsdatascience.com

Generating/Expanding your datasets with synthetic data

This article aims to address the need for augmenting/expanding your existing datasets using an open-source library involving GANs. As an ML practitioner or a data scientist, you may have found yourself in an “if only we had more data” situation. Often the dataset we have is very limited, and we aren’t sure whether our machine learning model would have performed better or worse given more statistically similar data. We could of course mine more data from the same source as our existing data, but that may not be possible every time. What if there was a way to create more data from the data that we already have?
Software | vmware.com

API does not return when large volume of data present

https://developer.vmware.com/docs/vsphere-automation/latest/vcenter/api/vcenter/vm/get/. Although we have fewer than 4,000 VMs in our production environment in total, this API call just returns an error message. The web service does not paginate, and I haven't been able to find any workarounds. I have filtered by power state and can get the VMs that are powered off and those that are suspended, but I only get an error message for those that are powered_on. When filtering by power state, this call works fine in our development environment, where we have only 100 VMs powered on. We can't filter any further (by CPU, for example), so I'm a little stuck.
Computers | game-debate.com

Quantum Conundrum System Requirements

OS: Windows XP 32-bit Processor: Intel Core 2 Duo E4500 2.2GHz / AMD Athlon 64 X2 Dual Core 3800+. Graphics: AMD Radeon HD 2900 GT or NVIDIA GeForce 9500 GT. Processor: Intel Core 2 Duo E8400 3.0GHz / AMD Phenom II X2 565. Graphics: AMD Radeon HD 2900 XT 512MB...
Electronics | arxiv.org

TUM-VIE: The TUM Stereo Visual-Inertial Event Dataset

Event cameras are bio-inspired vision sensors which measure per-pixel brightness changes. They offer numerous benefits over traditional, frame-based cameras, including low latency, high dynamic range, high temporal resolution and low power consumption. Thus, these sensors are suited for robotics and virtual reality applications. To foster the development of 3D perception and navigation algorithms with event cameras, we present the TUM-VIE dataset. It consists of a large variety of handheld and head-mounted sequences in indoor and outdoor environments, including rapid motion during sports and high dynamic range scenarios. The dataset contains stereo event data, stereo grayscale frames at 20Hz as well as IMU data at 200Hz. Timestamps between all sensors are synchronized in hardware. The event cameras contain a large sensor of 1280x720 pixels, which is significantly larger than the sensors used in existing stereo event datasets (at least by a factor of ten). We provide ground truth poses from a motion capture system at 120Hz during the beginning and end of each sequence, which can be used for trajectory evaluation. TUM-VIE includes challenging sequences where state-of-the-art visual SLAM algorithms either fail or result in large drift. Hence, our dataset can help to push the boundary of future research on event-based visual-inertial perception algorithms.
Computers | VentureBeat

AI datasets are prone to mismanagement, study finds

Public datasets like Duke University’s DukeMTMC are often used to train, test, and fine-tune machine learning algorithms that make their way into production, sometimes with controversial results. It’s an open secret that biases in these datasets could negatively impact the predictions made by an algorithm, for example causing a facial recognition system to misidentify a person. But a recent study coauthored by researchers at Princeton reveals that computer vision datasets, particularly those containing images of people, present a range of ethical problems.
Software | AUTOCAR.co.uk

Health check: the UK car industry's recruitment conundrum

Britain’s got talent, but attracting it to the automotive sector can be a challenge. “Probably the closest thing you can make to a battery cell in the UK at the moment is a KitKat.” It’s an eye-widening comparison from Steve Doyle, CEO of Evera Recruitment, and it illustrates the extent...
Computers | arxiv.org

HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation

Tables are often created with hierarchies, but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables. Hierarchical tables challenge existing methods by hierarchical indexing, as well as implicit relationships of calculation and semantics. This work presents HiTab, a free and open dataset for the research community to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports and Wikipedia pages, and has unique characteristics: (1) nearly all tables are hierarchical; (2) both target sentences for NLG and questions for QA are revised from high-quality descriptions in statistical reports that are meaningful and diverse; and (3) HiTab provides fine-grained annotations on both entity and quantity alignment. Targeting hierarchical structure, we devise a novel hierarchy-aware logical form for symbolic reasoning over tables, which shows high effectiveness. Then, given annotations of entity and quantity alignment, we propose partially supervised training, which helps models to largely reduce spurious predictions in the QA task. In the NLG task, we find that entity and quantity alignment also helps NLG models to generate better results in a conditional generation setting. Experiment results of state-of-the-art baselines suggest that this dataset presents a strong challenge and a valuable benchmark for future research.
Science | docwirenews.com

A merged microarray meta-dataset for transcriptionally profiling colorectal neoplasm formation and progression

Sci Data. 2021 Aug 11;8(1):214. doi: 10.1038/s41597-021-00998-5. Transcriptional profiling of pre- and post-malignant colorectal cancer (CRC) lesions enables temporal monitoring of molecular events underlying neoplastic progression. However, the most widely used transcriptomic dataset for CRC, TCGA-COAD, is devoid of adenoma samples, which increases reliance on an assortment of disparate microarray studies and hinders consensus building. To address this, we developed a microarray meta-dataset comprising 231 healthy, 132 adenoma, and 342 CRC tissue samples from twelve independent studies. Utilizing a stringent analytic framework, select datasets were downloaded from the Gene Expression Omnibus, normalized by frozen robust multiarray averaging and subsequently merged. Batch effects were then identified and removed by empirical Bayes estimation (ComBat). Finally, the meta-dataset was filtered for low variant probes, enabling downstream differential expression as well as quantitative and functional validation through cross-platform correlation and enrichment analyses, respectively. Overall, our meta-dataset provides a robust tool for investigating colorectal adenoma formation and malignant transformation at the transcriptional level with a pipeline that is modular and readily adaptable for similar analyses in other cancer types.
Science | arxiv.org

TFRD: A Benchmark Dataset for Research on Temperature Field Reconstruction of Heat-Source Systems

Heat management plays an important role in engineering. Temperature field reconstruction of heat source systems (TFR-HSS) with limited monitoring tensors plays an essential role in heat management. However, prior methods with common interpolations usually cannot provide accurate reconstruction. In addition, there exists no public dataset for broad research on reconstruction methods to further advance field reconstruction in engineering. To overcome this problem, this work constructs a specific dataset, namely TFRD, for the TFR-HSS task, with commonly used methods, including the interpolation methods and the surrogate model based methods, as baselines to advance the research on temperature field reconstruction. First, the TFR-HSS task is mathematically modelled from a real-world engineering problem, and three types of numerical models have been constructed to transform the problem into discrete mapping forms. Besides, this work selects four typical reconstruction problems with different heat source information and boundary conditions and generates the standard samples as training and testing samples for further research. Finally, a comprehensive review of the prior methods for the TFR-HSS task as well as recent widely used deep learning methods is given, and we provide a performance analysis of typical methods on TFRD, which can serve as the baseline results on this benchmark.
Computers | arxiv.org

SURFNet: Super-resolution of Turbulent Flows with Transfer Learning using Small Datasets

Deep Learning (DL) algorithms are emerging as a key alternative to computationally expensive CFD simulations. However, state-of-the-art DL approaches require large and high-resolution training data to learn accurate models. The size and availability of such datasets are a major limitation for the development of next-generation data-driven surrogate models for turbulent flows. This paper introduces SURFNet, a transfer learning-based super-resolution flow network. SURFNet primarily trains the DL model on low-resolution datasets and transfer learns the model on a handful of high-resolution flow problems - accelerating the traditional numerical solver independent of the input size. We propose two approaches to transfer learning for the task of super-resolution, namely one-shot and incremental learning. Both approaches entail transfer learning on only one geometry to account for fine-grid flow fields requiring 15x less training data on high-resolution inputs compared to the tiny resolution (64x256) of the coarse model, significantly reducing the time for both data collection and training. We empirically evaluate SURFNet's performance by solving the Navier-Stokes equations in the turbulent regime on input resolutions up to 256x larger than the coarse model. On four test geometries and eight flow configurations unseen during training, we observe a consistent 2-2.1x speedup over the OpenFOAM physics solver independent of the test geometry and the resolution size (up to 2048x2048), demonstrating both resolution-invariance and generalization capabilities. Our approach addresses the challenge of reconstructing high-resolution solutions from coarse grid models trained using low-resolution inputs (super-resolution) without loss of accuracy and requiring limited computational resources.
Religion | arxiv.org

WikiChurches: A Fine-Grained Dataset of Architectural Styles with Real-World Challenges

We introduce a novel dataset for architectural style classification, consisting of 9,485 images of church buildings. Both images and style labels were sourced from Wikipedia. The dataset can serve as a benchmark for various research fields, as it combines numerous real-world challenges: fine-grained distinctions between classes based on subtle visual features, a comparatively small sample size, a highly imbalanced class distribution, a high variance of viewpoints, and a hierarchical organization of labels, where only some images are labeled at the most precise level. In addition, we provide 631 bounding box annotations of characteristic visual features for 139 churches from four major categories. These annotations can, for example, be useful for research on fine-grained classification, where additional expert knowledge about distinctive object parts is often available. Images and annotations are available at: this https URL.