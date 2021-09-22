

Two-level Bayesian interaction analysis for survival data incorporating pathway information

By Xing Qin, Shuangge Ma, Mengyun Wu
arxiv.org
 6 days ago

Genetic interactions play an important role in the progression of complex diseases, providing explanation of variations in disease phenotype missed by main genetic effects. Comparatively, there are fewer investigations on prognostic survival time, given its challenging characteristics such as censoring. In recent biomedical research, two-level analysis of both genes and their involved pathways has received much attention and been demonstrated to be more effective than single-level analysis; however, such analysis has been limited to main effects. Pathways are not isolated, and their interactions have also been suggested to have important contributions to the prognosis of complex diseases. In this article, we develop a novel two-level Bayesian interaction analysis approach for survival data. This approach is the first to conduct the analysis of lower-level gene-gene interactions and higher-level pathway-pathway interactions simultaneously. Significantly advancing from existing Bayesian studies based on the Markov Chain Monte Carlo (MCMC) technique, we propose a variational inference framework based on the accelerated failure time model with favourable priors to account for two-level selection as well as censoring. Its computational efficiency is highly desirable for high-dimensional interaction analysis. We examine the performance of the proposed approach using extensive simulation. Application to TCGA melanoma and lung adenocarcinoma data leads to biologically sensible findings with satisfactory prediction accuracy and selection stability.
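
To make the role of censoring in an accelerated failure time (AFT) model concrete, here is a minimal sketch of a right-censored log-likelihood. It assumes a log-normal AFT, log T = x·beta + sigma·eps with standard normal eps. The error distribution, the function name, and the argument layout are illustrative assumptions, not details taken from the paper, and the sketch contains no priors, no two-level selection, and no variational inference.

```python
import math

def aft_lognormal_loglik(times, events, X, beta, sigma):
    """Log-likelihood of a log-normal AFT model with right censoring.

    times  : observed times (event or censoring time), all > 0
    events : 1 if the event was observed, 0 if right-censored
    X      : list of covariate rows
    beta   : regression coefficients (same length as each row)
    sigma  : scale of the error term, > 0
    (All names here are hypothetical, chosen for illustration.)
    """
    total = 0.0
    for t, d, row in zip(times, events, X):
        mu = sum(b * x for b, x in zip(beta, row))
        z = (math.log(t) - mu) / sigma
        if d == 1:
            # Observed event: normal density of log T evaluated at log t.
            total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        else:
            # Censored: we only know T > t, so use the survival function 1 - Phi(z).
            total += math.log(0.5 * math.erfc(z / math.sqrt(2)))
    return total
```

An observed subject contributes a density term, while a censored subject contributes only the probability of surviving past the censoring time, which is why censored records still carry information about the coefficients.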


Related
arxiv.org

Identifying Untrustworthy Samples: Data Filtering for Open-domain Dialogues with Bayesian Optimization

Being able to reply with a related, fluent, and informative response is an indispensable requirement for building high-quality conversational agents. In order to generate better responses, some approaches have been proposed, such as feeding extra information by collecting large-scale datasets with human annotations, designing neural conversational models (NCMs) with complex architectures and loss functions, or filtering out untrustworthy samples based on a dialogue attribute, e.g., Relatedness or Genericness. In this paper, we follow the third research branch and present a data filtering method for open-domain dialogues, which identifies untrustworthy samples from training data with a quality measure that linearly combines seven dialogue attributes. The attribute weights are obtained via Bayesian Optimization (BayesOpt) that aims to optimize an objective function for dialogue generation iteratively on the validation set. Then we score training samples with the quality measure, sort them in descending order, and filter out those at the bottom. Furthermore, to accelerate the "filter-train-evaluate" iterations involved in BayesOpt on large-scale datasets, we propose a training framework that integrates maximum likelihood estimation (MLE) and a negative training method (NEG). The training method updates the parameters of a trained NCM on two small sets with newly maintained and removed samples, respectively. Specifically, MLE is applied to maximize the log-likelihood of newly maintained samples, while NEG is used to minimize the log-likelihood of newly removed ones. Experimental results on two datasets show that our method can effectively identify untrustworthy samples, and NCMs trained on the filtered datasets achieve better performance.
COMPUTERS
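
The score-sort-filter step described in the dialogue-filtering abstract above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the `keep_fraction` parameter, and the data layout are hypothetical, and the weights are passed in by hand, whereas the abstract obtains them with Bayesian Optimization over seven attributes.

```python
def filter_by_quality(samples, weights, keep_fraction):
    """Score samples by a weighted sum of attributes and keep the top fraction.

    samples       : list of (sample_id, attribute_values) pairs
    weights       : one weight per attribute (hand-set here; hypothetical)
    keep_fraction : share of the ranked samples to keep, in (0, 1]
    """
    scored = [
        (sum(w * a for w, a in zip(weights, attrs)), sample_id)
        for sample_id, attrs in samples
    ]
    # Sort by score, highest first (ties keep input order), then drop the bottom.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    n_keep = int(len(scored) * keep_fraction)
    return [sample_id for _, sample_id in scored[:n_keep]]
```

In a full "filter-train-evaluate" loop, an outer optimizer would propose new `weights`, call a function like this one, retrain, and score the result on validation data.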
arxiv.org

Bayesian model-based outlier detection in network meta-analysis

In a network meta-analysis, some of the collected studies may deviate markedly from the others, for example having very unusual effect sizes. These deviating studies can be regarded as outlying with respect to the rest of the network and can be influential on the pooled results. Thus, it could be inappropriate to synthesize those studies without further investigation. In this paper, we propose two Bayesian methods to detect outliers in a network meta-analysis via: (a) a mean-shifted outlier model and (b) posterior predictive p-values constructed from ad-hoc discrepancy measures. The former method uses Bayes factors to formally test each study against outliers, while the latter provides a score of outlyingness for each study in the network, which allows the uncertainty associated with being an outlier to be quantified numerically. Furthermore, we present a simple method based on informative priors as part of the network meta-analysis model to down-weight the detected outliers. We conduct extensive simulations to evaluate the effectiveness of the proposed methodology while comparing it to some alternative, available outlier diagnostic tools. Two real networks of interventions are then used to demonstrate our methods in practice.
SCIENCE
arxiv.org

Data Privacy Protection and Utility Preservation through Bayesian Data Synthesis: A Case Study on Airbnb Listings

When releasing record-level data containing sensitive information to the public, the data disseminator is responsible for protecting the privacy of every record in the dataset while simultaneously preserving important features of the data for users' analysis. These goals can be achieved by data synthesis, where confidential data are replaced with synthetic data that are simulated based on statistical models estimated on the confidential data. In this paper, we present a data synthesis case study, where synthetic values of price and the number of available days in a sample of the New York Airbnb Open Data are created for privacy protection. One sensitive variable, the number of available days of an Airbnb listing, has a large number of zero-valued records and is also truncated at both ends. We propose a novel zero-inflated truncated Poisson regression model for its synthesis. We utilize a sequential synthesis approach to further synthesize the sensitive price variable. The resulting synthetic data are evaluated for their utility preservation and privacy protection, the latter in the form of disclosure risks. Furthermore, we propose methods to investigate how uncertainties in the intruder's knowledge would influence the identification disclosure risks of the synthetic data. In particular, we explore several realistic scenarios of uncertainties in the intruder's knowledge of available information and evaluate their impacts on the resulting identification disclosure risks.
TECHNOLOGY
arxiv.org

A Bayesian Hidden Semi-Markov Model with Covariate-Dependent State Duration Parameters for High-Frequency Environmental Data

By Shirley Rojas-Salazar, Erin M. Schliep, Christopher K. Wikle, Emily H. Stanley, Stephen R. Carpenter, Noah R. Lottig

Environmental time series data observed at high frequencies can be studied with approaches such as hidden Markov and semi-Markov models (HMM and HSMM). HSMMs extend the HMM by explicitly modeling the time spent in each state. In a discrete-time HSMM, the duration in each state can be modeled with a zero-truncated Poisson distribution, where the duration parameter may be state-specific but constant in time. We extend the HSMM by allowing the state-specific duration parameters to vary in time and model them as a function of known covariates observed over a period of time leading up to a state transition. In addition, we propose a data subsampling approach given that high-frequency data can violate the conditional independence assumption of the HSMM. We apply the model to high-frequency data collected by an instrumented buoy in Lake Mendota. We model the phycocyanin concentration, which is used in aquatic systems to estimate the relative abundance of blue-green algae, and identify important time-varying effects associated with the duration in each state.
SCIENCE
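
For the hidden semi-Markov abstract above, the zero-truncated Poisson duration distribution can be written down directly: P(D = d) = lam^d e^(-lam) / (d! (1 - e^(-lam))) for d = 1, 2, ... The sketch below computes it, and adds a covariate-dependent rate as one possible way to let the duration parameter vary in time. The log link and both function names are assumptions made for illustration; the abstract does not state which link the authors use.

```python
import math

def zero_truncated_poisson_pmf(d, lam):
    """P(D = d) for a Poisson(lam) variable conditioned on D >= 1."""
    if d < 1:
        return 0.0
    # Work on the log scale; -expm1(-lam) equals 1 - exp(-lam), the
    # probability that an untruncated Poisson draw is non-zero.
    log_p = (d * math.log(lam) - lam - math.lgamma(d + 1)
             - math.log(-math.expm1(-lam)))
    return math.exp(log_p)

def duration_rate(covariates, coeffs):
    """Hypothetical covariate-dependent rate; the log link keeps it positive."""
    return math.exp(sum(c * x for c, x in zip(coeffs, covariates)))
```

A state's dwell-time probability at a given moment would then be `zero_truncated_poisson_pmf(d, duration_rate(x_t, coeffs))`, with `x_t` summarizing the covariates observed before the transition.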
arxiv.org

Incorporating Data Uncertainty in Object Tracking Algorithms

Methodologies for incorporating the uncertainties characteristic of data-driven object detectors into object tracking algorithms are explored. Object tracking methods rely on measurement error models, typically in the form of measurement noise, false positive rates, and missed detection rates. Each of these quantities, in general, can be dependent on object or measurement location. However, for detections generated from neural-network processed camera inputs, these measurement error statistics are not sufficient to represent the primary source of errors, namely a dissimilarity between run-time sensor input and the training data upon which the detector was trained. To this end, we investigate incorporating data uncertainty into object tracking methods so as to improve the ability to track objects, particularly those which are out-of-distribution w.r.t. training data. The proposed methodologies are validated on an object tracking benchmark as well as on experiments with a real autonomous aircraft.
SCIENCE
arxiv.org

Estimation of Measures for Two-Way Contingency Tables Using the Bayesian Estimators

In the analysis of two-way contingency tables, the measures for representing the degree of departure from independence, symmetry or asymmetry are often used. These measures in contingency tables are expressed as functions of the probability structure of the tables. Hence, the value of a measure is estimated. Plug-in estimators of measures with sample proportions are used to estimate the measures, but without sufficient sample size, the bias and mean squared error (MSE) of the estimators become large. This study proposes an estimator that can reduce the bias and MSE, even without a sufficient sample size, using the Bayesian estimators of cell probabilities. We asymptotically evaluate the MSE of the estimator of the measure plugging in the posterior means of the cell probabilities when the prior distribution of the cell probabilities is the Dirichlet distribution. As a result, we can derive the Dirichlet parameter that asymptotically minimizes the MSE of the estimator. Numerical experiments show that the proposed estimator has a smaller bias and MSE than the plug-in estimator with sample proportions, uniform prior, and Jeffreys prior. Another advantage of our approach is the construction of credible intervals for measures using Monte Carlo simulations.
SCIENCE
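
The Bayesian cell-probability estimates mentioned in the contingency-table abstract above have a simple closed form under a symmetric Dirichlet prior: with count n_ij, total N, and K cells, the posterior mean is (n_ij + alpha) / (N + alpha·K). The sketch below is a minimal illustration of that standard result; the function name is hypothetical, and the abstract's MSE-minimizing choice of alpha is not reproduced here.

```python
def posterior_mean_cells(counts, alpha):
    """Posterior mean cell probabilities under a symmetric Dirichlet(alpha) prior.

    counts : two-way table as a list of rows of non-negative counts
    alpha  : common Dirichlet parameter (1.0 is the uniform prior,
             0.5 is the Jeffreys prior for the multinomial)
    """
    n_cells = sum(len(row) for row in counts)
    total = sum(sum(row) for row in counts)
    denom = total + alpha * n_cells
    return [[(n + alpha) / denom for n in row] for row in counts]
```

A measure of departure from independence or symmetry can then be evaluated on these smoothed probabilities instead of on raw sample proportions, which avoids zero cells in small samples.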
World Health Organization

Snakebite envenoming: an interactive data platform to support the 2030 targets

Geospatial tracking and convergent technology can greatly contribute to accurate information and improved awareness about venomous snakes. Having this information available can accelerate the implementation of life-saving interventions, such as improved planning and delivery of antivenoms, and identification of high-risk communities and locations where treatment and antivenom centres should be prioritized.
TECHNOLOGY
Popular Science

Become an expert in data analysis with this $20 training bundle

When big companies make big decisions, they usually turn to data. But before the fat cats can look through the charts and graphs, someone needs to crunch the numbers. The Complete Microsoft Data Analysis Expert Bundle helps you acquire these lucrative skills, with six courses and over 31 hours of video training on key data tools. You can grab the bundle now for only $19.99.
MARKETING
YOU MAY ALSO LIKE
arxiv.org

A survey of Bayesian Network structure learning

Bayesian Networks (BNs) have become increasingly popular over the last few decades as a tool for reasoning under uncertainty in fields as diverse as medicine, biology, epidemiology, economics and the social sciences. This is especially true in real-world areas where we seek to answer complex questions based on hypothetical evidence to determine actions for intervention. However, determining the graphical structure of a BN remains a major challenge, especially when modelling a problem under causal assumptions. Solutions to this problem include the automated discovery of BN graphs from data, constructing them based on expert knowledge, or a combination of the two. This paper provides a comprehensive review of combinatoric algorithms proposed for learning BN structure from data, describing 61 algorithms including prototypical, well-established and state-of-the-art approaches. The basic approach of each algorithm is described in consistent terms, and the similarities and differences between them highlighted. Methods of evaluating algorithms and their comparative performance are discussed including the consistency of claims made in the literature. Approaches for dealing with data noise in real-world datasets and incorporating expert knowledge into the learning process are also covered.
SCIENCE
arxiv.org

Quantification of empirical determinacy: the impact of likelihood weighting on posterior location and spread in Bayesian meta-analysis estimated with JAGS and INLA

The popular Bayesian meta-analysis expressed by the Bayesian normal-normal hierarchical model (NNHM) synthesizes knowledge from several studies and is highly relevant in practice. Moreover, NNHM is the simplest Bayesian hierarchical model (BHM), which illustrates problems typical in more complex BHMs. Until now, it has been unclear to what extent the data determines the marginal posterior distributions of the parameters in NNHM. To address this issue we computed the second derivative of the Bhattacharyya coefficient with respect to the weighted likelihood, defined the total empirical determinacy (TED), the proportion of the empirical determinacy of location to TED (pEDL), and the proportion of the empirical determinacy of spread to TED (pEDS). We implemented this method in the R package ed4bhm and considered two case studies and one simulation study. We quantified TED, pEDL and pEDS under different modeling conditions such as model parametrization, the primary outcome, and the prior. This clarified to what extent the location and spread of the marginal posterior distributions of the parameters are determined by the data. Although these investigations focused on Bayesian NNHM, the method proposed is applicable more generally to complex BHMs.
SCIENCE
arxiv.org

Using Physiological Information to Classify Task Difficulty in Human-Swarm Interaction

By Joseph P. Distefano, Hemanth Manjunatha, Souma Chowdhury, Karthik Dantu, David Doermann, Ehsan T. Esfahani

Human-swarm interaction has recently gained attention due to its plethora of new applications in disaster relief, surveillance, rescue, and exploration. However, if the task difficulty increases, the performance of the human operator decreases, thereby decreasing the overall efficacy of the human-swarm team. Thus, it is critical to identify the task difficulty and adaptively allocate the task to the human operator to maintain optimal performance. In this direction, we study the classification of task difficulty in a human-swarm interaction experiment performing a target search mission. The human may control platoons of unmanned aerial vehicles (UAVs) and unmanned ground vehicles (UGVs) to search a partially observable environment during the target search mission. The mission complexity is increased by introducing adversarial teams that humans may only see when the environment is explored. While the human is completing the mission, their brain activity is recorded using an electroencephalogram (EEG), which is used to classify the task difficulty. We have used two different approaches for classification: a feature-based approach using coherence values as input and a deep learning-based approach using raw EEG as input. Both approaches can classify the task difficulty well above chance. The results showed the importance of the occipital lobe (O1 and O2) coherence feature with the other brain regions. Moreover, we also study individual differences (expert vs. novice) in the classification results. The analysis revealed that the temporal lobe in experts (T4 and T3) is predominant for task difficulty classification compared with novices.
SCIENCE
arxiv.org

Training on Test Data with Bayesian Adaptation for Covariate Shift

When faced with distribution shift at test time, deep neural networks often make inaccurate predictions with unreliable uncertainty estimates. While improving the robustness of neural networks is one promising approach to mitigate this issue, an appealing alternative to robustifying networks against all possible test-time shifts is to instead directly adapt them to unlabeled inputs from the particular distribution shift we encounter at test time. However, this poses a challenging question: in the standard Bayesian model for supervised learning, unlabeled inputs are conditionally independent of model parameters when the labels are unobserved, so what can unlabeled data tell us about the model parameters at test time? In this paper, we derive a Bayesian model that provides for a well-defined relationship between unlabeled inputs under distributional shift and model parameters, and show how approximate inference in this model can be instantiated with a simple regularized entropy minimization procedure at test time. We evaluate our method on a variety of distribution shifts for image classification, including image corruptions, natural distribution shifts, and domain adaptation settings, and show that our method improves both accuracy and uncertainty estimation.
COMPUTERS
arxiv.org

Learning Transport Processes with Machine Intelligence

We present a machine learning based approach to address the study of transport processes, ubiquitous in continuous mechanics, with particular attention to those phenomena ruled by complex micro-physics, impractical to theoretical investigation, yet exhibiting emergent behavior describable by a closed mathematical expression. Our machine learning model, built using simple components and following a few well established practices, is capable of learning latent representations of the transport process substantially closer to the ground truth than expected from the nominal error characterising the data, leading to sound generalisation properties. This is demonstrated through an idealized study of the long standing problem of heat flux suppression under conditions relevant for fusion and cosmic plasmas. A simple analysis shows that the result applies beyond those case specific assumptions and that, in particular, the accuracy of the learned representation is controllable through knowledge of the data quality (error properties) and a suitable choice of the dataset size. While the learned representation can be used as a plug-in for numerical modeling purposes, it can also be leveraged with the above error analysis to obtain reliable mathematical expressions describing the transport mechanism and of great theoretical value.
COMPUTERS
arxiv.org

Stabilizing Preparation of Quantum Gaussian States via Continuous Measurement

This paper provides a stabilizing preparation method for quantum Gaussian states by utilizing continuous measurement. The stochastic evolution of the open quantum system is described in terms of the quantum stochastic master equation. We present necessary and sufficient conditions for the system to have a unique stabilizing steady Gaussian state. The conditions are much weaker than those in existing results on preparing Gaussian states through environment engineering. Parametric conditions for preparing an arbitrary pure Gaussian state are provided. This approach provides more degrees of freedom to choose the system Hamiltonian and the system-environment coupling operators, as compared with the case where the dissipation-induced approach is employed. The stabilizing conditions for the case of imperfect measurement efficiency are also presented. These results may benefit practical experimental implementation in preparing quantum Gaussian states.
SCIENCE
arxiv.org

Objective metrics for language lateralization of fMRI examinations: a new model for the classification of hemispheric dominance in healthy subjects and epileptic patients

By M. Stroppi, D. Lizio, L. Berta, A. Citterio, C. Regna-Gladin, M. Rizzi, I. Sartori, P. E. Colombo, P. Arosio, A. Torresin

Purpose: to compare different methods of calculating the Laterality Index (LI), a metric for evaluating hemispheric language dominance in functional MRI (fMRI) examinations. Methods: Two methods were...
SCIENCE
arxiv.org

Generalized Ising Model on a Scale-Free Network: An Interplay of Power Laws

We consider a recently introduced generalization of the Ising model in which individual spin strength can vary. The model is intended for analysis of ordering in systems comprising agents which, although matching in their binarity (i.e., maintaining the iconic Ising features of '+' or '−', 'up' or 'down', 'yes' or 'no'), differ in their strength. To investigate the interplay between variable properties of nodes and interactions between them, we study the model on a complex network where both the spin strength and degree distributions are governed by power laws. We show that in the annealed network approximation, thermodynamic functions of the model are self-averaging and we obtain an exact solution for the partition function. This allows us to derive the leading temperature and field dependencies of thermodynamic functions, their critical behavior, and logarithmic corrections at the interface of different phases. We find the delicate interplay of the two power laws leads to new universality classes.
SCIENCE
arxiv.org

Text to Insight: Accelerating Organic Materials Knowledge Extraction via Deep Learning

Scientific literature is one of the most significant resources for sharing knowledge. Researchers turn to scientific literature as a first step in designing an experiment. Given the extensive and growing volume of literature, the common approach of reading and manually extracting knowledge is too time consuming, creating a bottleneck in the research cycle. This challenge spans nearly every scientific domain. For materials science, experimental data distributed across millions of publications are extremely helpful for predicting materials properties and designing novel materials. However, only recently have researchers explored computational approaches for knowledge extraction, primarily for inorganic materials. This study aims to explore knowledge extraction for organic materials. We built a research dataset composed of 855 annotated and 708,376 unannotated sentences drawn from 92,667 abstracts. We used named-entity recognition (NER) with a BiLSTM-CNN-CRF deep learning model to automatically extract key knowledge from the literature. Early-phase results show a high potential for automated knowledge extraction. The paper presents our findings and a framework for supervised knowledge extraction that can be adapted to other scientific domains.
CHEMISTRY
arxiv.org

Exactness of Parrilo's conic approximations for copositive matrices and associated low order bounds for the stability number of a graph

De Klerk and Pasechnik (2002) introduced the bounds $\vartheta^{(r)}(G)$ ($r\in \mathbb{N}$) for the stability number $\alpha(G)$ of a graph $G$ and conjectured exactness at order $\alpha(G)-1$: $\vartheta^{(\alpha(G)-1)}(G)=\alpha(G)$. These bounds rely on the conic approximations $\mathcal{K}_n^{(r)}$ by Parrilo (2000) for the copositive cone $\text{COP}_n$. A difficulty in the convergence analysis of $\vartheta^{(r)}$ is the bad behaviour of the cones $\mathcal{K}_n^{(r)}$ under adding a zero row/column: when applied to a matrix not in $\mathcal{K}^{(0)}_n$ this gives a matrix not in any ${\mathcal{K}}^{(r)}_{n+1}$, thereby showing strict inclusion $\bigcup_{r\ge 0}{\mathcal{K}}^{(r)}_n\subset \text{COP}_n$ for $n\ge 6$. We investigate the graphs with $\vartheta^{(r)}(G)=\alpha(G)$ for $r=0,1$: we algorithmically reduce testing exactness of $\vartheta^{(0)}$ to acritical graphs, we characterize critical graphs with $\vartheta^{(0)}$ exact, and we exhibit graphs for which exactness of $\vartheta^{(1)}$ is not preserved under adding an isolated node. This disproves a conjecture by Gvozdenović and Laurent (2007) which, if true, would have implied the above conjecture by de Klerk and Pasechnik.
MATHEMATICS
arxiv.org

Towards an extended taxonomy of information dynamics via Integrated Information Decomposition

By Pedro A.M. Mediano, Fernando E. Rosas, Andrea I Luppi, Robin L. Carhart-Harris, Daniel Bor, Anil K. Seth, Adam B. Barrett

Complex systems, from the human brain to the global economy, are made of multiple elements that interact in such ways that the behaviour of the 'whole' often seems to be more than what is readily explainable in terms of the 'sum of the parts.' Our ability to understand and control these systems remains limited, one reason being that we still don't know how best to describe -- and quantify -- the higher-order dynamical interactions that characterise their complexity. To address this limitation, we combine principles from the theories of Information Decomposition and Integrated Information into what we call Integrated Information Decomposition, or $\Phi$ID. $\Phi$ID provides a comprehensive framework to reason about, evaluate, and understand the information dynamics of complex multivariate systems. $\Phi$ID reveals the existence of previously unreported modes of collective information flow, providing tools to express well-known measures of information transfer and dynamical complexity as aggregates of these modes. Via computational and empirical examples, we demonstrate that $\Phi$ID extends our explanatory power beyond traditional causal discovery methods -- with profound implications for the study of complex systems across disciplines.
SCIENCE
arxiv.org

Accuracy and speed of elongation in a minimal model of DNA replication

Being a dual-purpose enzyme, the DNA polymerase is responsible for elongation of the newly formed DNA strand as well as for cleaving the erroneous growth in case of a misincorporation. The efficiency of replication depends on the coordination of the polymerization and exonuclease activity of DNA polymerase. Here we propose and analyze a minimal kinetic model of DNA replication and determine exact expressions for the velocity of elongation and the accuracy of replication. We first analyze the case without exonuclease activity. In that case, accuracy is determined by a kinetic competition between stepping and unbinding, with discrimination between correct and incorrect nucleotides in both transitions. We then include exonuclease activity and ask how different modes of additional discrimination in the exonuclease pathway can improve the accuracy while limiting the detrimental effect of exonuclease on the speed of replication. In this way, we ask how the kinetic parameters of the model have to be set to coordinate the two activities of the enzyme for high accuracy and high speed. The analysis also shows that the design of a replication system does not universally have to follow the speed-accuracy trade-off rule, although it does in the biologically realized parameter range. The accuracy of the process is mainly controlled by the crucial role of stepping after erroneous incorporation, which has an impact on both the polymerase and exonuclease activities of DNA polymerase.
SCIENCE
