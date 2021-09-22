
Causal Inference in Non-linear Time-series using Deep Networks and Knockoff Counterfactuals

By Wasim Ahmad, Maha Shadaydeh, Joachim Denzler
arxiv.org
 6 days ago

Estimating causal relations is vital in understanding the complex interactions in multivariate time series. Non-linear coupling of variables is one of the major challenges in accurate estimation of cause-effect relations. In this paper, we propose to use deep autoregressive networks (DeepAR) in tandem with counterfactual analysis to infer nonlinear causal relations in multivariate time series. We extend the concept of Granger causality using probabilistic forecasting with DeepAR. Since deep networks can neither handle missing input nor out-of-distribution intervention, we propose to use the Knockoffs framework (Barber and Candès, 2015) for generating intervention variables and consequently counterfactual probabilistic forecasting. Knockoff samples are independent of their output given the observed variables and exchangeable with their counterpart variables without changing the underlying distribution of the data. We test our method on synthetic as well as real-world time series datasets. Overall our method outperforms the widely used vector autoregressive Granger causality and PCMCI in detecting nonlinear causal dependency in multivariate time series.
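
For orientation, the linear autoregressive Granger baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's DeepAR and knockoff method: it asks whether adding lagged values of x lowers the residual sum of squares when predicting y from its own lags. The function names and the default lag of 2 are assumptions made for this example.

```python
import numpy as np

def lagged(series, lag):
    """Design columns series[t-1], ..., series[t-lag] for t = lag .. n-1."""
    n = len(series)
    return np.column_stack([series[lag - k : n - k] for k in range(1, lag + 1)])

def granger_variance_ratio(x, y, lag=2):
    """Ratio RSS(restricted) / RSS(full) for predicting y.

    The restricted model uses only lags of y; the full model adds lags of x.
    A ratio well above 1 suggests that x helps predict y (linear Granger sense).
    """
    target = y[lag:]
    ones = np.ones((len(target), 1))
    restricted = np.hstack([ones, lagged(y, lag)])
    full = np.hstack([restricted, lagged(x, lag)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return resid @ resid

    return rss(restricted) / rss(full)
```

Because the two models are nested, the ratio is never below 1; in practice it would be turned into an F-statistic before drawing conclusions.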

arxiv.org

Direct estimation of differential Granger causality between two high-dimensional time series

Differential Granger causality, that is, understanding how Granger causal relations differ between two related time series, is of interest in many scientific applications. Modeling each time series by a vector autoregressive (VAR) model, we propose a new method to directly learn the difference between the corresponding transition matrices in high dimensions. Key to the new method is an estimating equation constructed based on the Yule-Walker equation that links the difference in transition matrices to the difference in the corresponding precision matrices. In contrast to separately estimating each transition matrix and then calculating the difference, the proposed direct estimation method only requires sparsity of the difference of the two VAR models, and hence allows hub nodes in each high-dimensional time series. The direct estimator is shown to be consistent in estimation and support recovery under mild assumptions. These results also lead to novel consistency results with potentially faster convergence rates for estimating differences between precision matrices of i.i.d. observations under weaker assumptions than existing results. We evaluate the finite sample performance of the proposed method using simulation studies and an application to electroencephalogram (EEG) data.
SCIENCE
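
The Yule-Walker relation referred to in the abstract above can be illustrated for a single VAR(1) series x_t = A x_{t-1} + e_t, where the lag-1 autocovariance satisfies Gamma(1) = A Gamma(0), so A = Gamma(1) Gamma(0)^{-1}. The sketch below is a low-dimensional illustration only; it estimates one transition matrix and is not the paper's direct estimator of the difference between two matrices. The function names and the unit-variance Gaussian noise are assumptions.

```python
import numpy as np

def simulate_var1(A, n_steps, rng):
    """Simulate x_t = A x_{t-1} + e_t with standard normal noise e_t."""
    p = A.shape[0]
    x = np.zeros((n_steps, p))
    for t in range(1, n_steps):
        x[t] = A @ x[t - 1] + rng.standard_normal(p)
    return x

def yule_walker_var1(x):
    """Estimate A from sample autocovariances via A = Gamma(1) Gamma(0)^{-1}."""
    x = x - x.mean(axis=0)
    n = x.shape[0]
    gamma0 = x[:-1].T @ x[:-1] / (n - 1)  # estimate of E[x_{t-1} x_{t-1}^T]
    gamma1 = x[1:].T @ x[:-1] / (n - 1)   # estimate of E[x_t x_{t-1}^T]
    return gamma1 @ np.linalg.inv(gamma0)
```

With a stable A (all eigenvalues inside the unit circle) and a long enough series, the estimate should approach the true transition matrix.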
arxiv.org

Non-linear Independent Dual System (NIDS) for Discretization-independent Surrogate Modeling over Complex Geometries

Numerical solutions of partial differential equations (PDEs) require expensive simulations, limiting their application in design optimization routines, model-based control, or solution of large-scale inverse problems. Existing Convolutional Neural Network-based frameworks for surrogate modeling require lossy pixelization and data-preprocessing, which is not suitable for realistic engineering applications. Therefore, we propose non-linear independent dual system (NIDS), which is a deep learning surrogate model for discretization-independent, continuous representation of PDE solutions, and can be used for prediction over domains with complex, variable geometries and mesh topologies. NIDS leverages implicit neural representations to develop a non-linear mapping between problem parameters and spatial coordinates to state predictions by combining evaluations of a case-wise parameter network and a point-wise spatial network in a linear output layer. The input features of the spatial network include physical coordinates augmented by a minimum distance function evaluation to implicitly encode the problem geometry. The form of the overall output layer induces a dual system, where each term in the map is non-linear and independent. Further, we propose a minimum distance function-driven weighted sum of NIDS models using a shared parameter network to enforce boundary conditions by construction under certain restrictions. The framework is applied to predict solutions around complex, parametrically-defined geometries on non-parametrically-defined meshes with solutions obtained many orders of magnitude faster than the full order models. Test cases include a vehicle aerodynamics problem with complex geometry and data scarcity, enabled by a training method in which more cases are gradually added as training progresses.
COMPUTERS
arxiv.org

How Does Counterfactually Augmented Data Impact Models for Social Computing Constructs?

As NLP models are increasingly deployed in socially situated settings such as online abusive content detection, it is crucial to ensure that these models are robust. One way of improving model robustness is to generate counterfactually augmented data (CAD) for training models that can better learn to distinguish between core features and data artifacts. While models trained on this type of data have shown promising out-of-domain generalizability, it is still unclear what the sources of such improvements are. We investigate the benefits of CAD for social NLP models by focusing on three social computing constructs -- sentiment, sexism, and hate speech. Assessing the performance of models trained with and without CAD across different types of datasets, we find that while models trained on CAD show lower in-domain performance, they generalize better out-of-domain. We unpack this apparent discrepancy using machine explanations and find that CAD reduces model reliance on spurious features. Leveraging a novel typology of CAD to analyze their relationship with model performance, we find that CAD which acts on the construct directly or a diverse set of CAD leads to higher performance.
SOFTWARE
arxiv.org

Anomaly Attribution of Multivariate Time Series using Counterfactual Reasoning

There are numerous methods for detecting anomalies in time series, but that is only the first step to understanding them. We aim to go beyond detection by explaining those anomalies. Thus we develop a novel attribution scheme for multivariate time series relying on counterfactual reasoning. We aim to answer the counterfactual question of whether the anomalous event would have occurred if the subset of the involved variables had been distributed more similarly to the data outside of the anomalous interval. Specifically, we detect anomalous intervals using the Maximally Divergent Interval (MDI) algorithm, replace a subset of variables with their in-distribution values within the detected interval, and observe whether the interval has become less anomalous by re-scoring it with MDI. We evaluate our method on multivariate temporal and spatio-temporal data and confirm the accuracy of our anomaly attribution of multiple well-understood extreme climate events such as heatwaves and hurricanes.
SCIENCE
arxiv.org

Proximal Causal Inference for Complex Longitudinal Studies

A standard assumption for causal inference about the joint effects of time-varying treatment is that one has measured sufficient covariates to ensure that within covariate strata, subjects are exchangeable across observed treatment values, also known as "sequential randomization assumption (SRA)". SRA is often criticized as it requires one to accurately measure all confounders. Realistically, measured covariates can rarely capture all confounders with certainty. Often covariate measurements are at best proxies of confounders, thus invalidating inferences under SRA. In this paper, we extend the proximal causal inference (PCI) framework of Miao et al. (2018) to the longitudinal setting under a semiparametric marginal structural mean model (MSMM). PCI offers an opportunity to learn about joint causal effects in settings where SRA based on measured time-varying covariates fails, by formally accounting for the covariate measurements as imperfect proxies of underlying confounding mechanisms. We establish nonparametric identification with a pair of time-varying proxies and provide a corresponding characterization of regular and asymptotically linear estimators of the parameter indexing the MSMM, including a rich class of doubly robust estimators, and establish the corresponding semiparametric efficiency bound for the MSMM. Extensive simulation studies and a data application illustrate the finite sample behavior of proposed methods.
SCIENCE
arxiv.org

Graphical models for nonstationary time series

We propose NonStGGM, a general nonparametric graphical modeling framework for studying dynamic associations among the components of a nonstationary multivariate time series. It builds on the framework of Gaussian Graphical Models (GGM) and stationary time series Gaussian Graphical model (StGGM), and complements existing works on parametric graphical models based on change point vector autoregressions (VAR). Analogous to StGGM, the proposed framework captures conditional noncorrelations (both intertemporal and contemporaneous) in the form of an undirected graph. In addition, to describe the more nuanced nonstationary relationships among the components of the time series, we introduce the new notion of conditional nonstationarity/stationarity and incorporate it within the graph architecture. This allows one to distinguish between direct and indirect nonstationary relationships among system components, and can be used to search for small subnetworks that serve as the "source" of nonstationarity in a large system. Together, the two concepts of conditional noncorrelation and nonstationarity/stationarity provide a parsimonious description of the dependence structure of the time series.
MATHEMATICS
arxiv.org

Causal Effects with Hidden Treatment Diffusion on Observed or Partially Observed Networks

In randomized experiments, interactions between units might generate a treatment diffusion process. This is common when the treatment of interest is an actual object or product that can be shared among peers (e.g., flyers, booklets, videos). For instance, if the intervention of interest is an information campaign realized through the distribution of a video to targeted individuals, some of these treated individuals might share the video they received with their friends. Such a phenomenon is usually unobserved, causing a misallocation of individuals in the two treatment arms: some of the initially untreated units might have actually received the treatment by diffusion. Treatment misclassification can, in turn, introduce a bias in the estimation of the causal effect. Inspired by a recent field experiment on the effect of different types of school incentives aimed at encouraging students to attend cultural events, we present a novel approach to deal with a hidden diffusion process on observed or partially observed networks. Specifically, we develop a simulation-based sensitivity analysis that assesses the robustness of the estimates against the possible presence of a treatment diffusion. We simulate several diffusion scenarios within a plausible range of sensitivity parameters and we compare the treatment effect which is estimated in each scenario with the one that is obtained while ignoring the diffusion process. Results suggest that even a treatment diffusion parameter of small size may lead to a significant bias in the estimation of the treatment effect.
SCIENCE
arxiv.org

Interpretable Additive Recurrent Neural Networks For Multivariate Clinical Time Series

Time series models with recurrent neural networks (RNNs) can have high accuracy but are unfortunately difficult to interpret as a result of feature-interactions, temporal-interactions, and non-linear transformations. Interpretability is important in domains like healthcare, where constructing models that provide insight into the relationships they have learned is required to validate and trust model predictions. We want accurate time series models where users can understand the contribution of individual input features. We present the Interpretable-RNN (I-RNN) that balances model complexity and accuracy by forcing the relationship between variables in the model to be additive. Interactions are restricted between hidden states of the RNN and additively combined at the final step. I-RNN specifically captures the unique characteristics of clinical time series, which are unevenly sampled in time, asynchronously acquired, and have missing data. Importantly, the hidden state activations represent feature coefficients that correlate with the prediction target and can be visualized as risk curves that capture the global relationship between individual input features and the outcome. We evaluate the I-RNN model on the Physionet 2012 Challenge dataset to predict in-hospital mortality, and on a real-world clinical decision support task: predicting hemodynamic interventions in the intensive care unit. I-RNN provides explanations in the form of global and local feature importances comparable to highly intelligible models like decision trees trained on hand-engineered features while significantly outperforming them. I-RNN remains intelligible while providing accuracy comparable to state-of-the-art decay-based and interpolation-based recurrent time series models. The experimental results on real-world clinical datasets refute the myth that there is a tradeoff between accuracy and interpretability.
HEALTH
cell.com

Causal assumptions and causal inference in ecological experiments

Causal inferences require causal assumptions. To formalize the assumptions required to draw causal inferences from experimental data, scholars have leveraged insights about causal inference in observational settings. Even carefully designed experiments may face challenges in satisfying four important causal assumptions. Ecologists sometimes acknowledge and address these challenges but do not...
SCIENCE
arxiv.org

Hypocoercivity for non-linear infinite-dimensional degenerate stochastic differential equations

The aim of the article is to construct solutions to second order in time stochastic partial differential equations and to show hypocoercivity of their corresponding transition semigroups. More generally, we analyze infinite-dimensional non-linear stochastic differential equations in terms of their infinitesimal generators. In the first part of this article we use resolvent methods developed by Beznea, Boboc and Röckner to construct $\mu^{\Phi}$-standard right processes with infinite lifetime and weakly continuous paths providing weak solutions to infinite-dimensional Langevin dynamics with invariant measure $\mu^{\Phi}$. The second part deals with the general abstract Hilbert space hypocoercivity method, first described by Dolbeault, Mouhot and Schmeiser and made rigorous in the Kolmogorov backwards setting by Grothaus and Stilgenbauer. In order to apply the method to infinite-dimensional Langevin dynamics we use an essential m-dissipativity statement for infinite-dimensional Ornstein-Uhlenbeck operators, perturbed by the gradient of a potential, with possible unbounded diffusion operators as coefficients and corresponding regularity estimates. Furthermore, essential m-dissipativity of a non-sectorial Kolmogorov backward operator associated to the dynamic and Poincaré inequalities for measures with densities w.r.t. infinite-dimensional non-degenerate Gaussian measures are substantial. Deriving a stochastic representation of the semigroup generated by the Kolmogorov backward operator as the transition semigroup of the $\mu^{\Phi}$-standard right process enables us to show an $L^2$-exponential ergodic result for the weak solution. In the end we apply our results to explicit infinite-dimensional degenerate diffusion equations.
MATHEMATICS
arxiv.org

Non-equilibrium time-dependent solution to discrete choice with social interactions

We solve the binary decision model of Brock and Durlauf (2001) in time using methods often employed in studies of stochastic complex systems. This solution is valid when not at equilibrium and can be used to exemplify path-dependent behaviours of their model. The solution is computationally fast and is indistinguishable from Monte Carlo simulation. Lock-in effects are observed in some regions of the model's parameter space, and we calculate the time scale of the lock-ins. Curiously, we find that although altruistic agents coalesce more strongly on a particular decision, increasing their utility in the short-term, they are also more prone to being stuck in a non-optimal decision lock-in as compared to selfish agents. Finally, we construct a likelihood function that can be used on non-equilibrium data for model calibration. Even with a well-defined likelihood function, model calibration is difficult unless one has access to data representative of the underlying model.
SCIENCE
towardsdatascience.com

A Practical Guide to Linear Regression

From EDA to Feature Engineering to Model Evaluation. Linear regression is a typical regression algorithm that is used for numerous prediction tasks. It is distinct from classification models such as decision trees, support vector machines, or neural networks. In a nutshell, a linear regression finds the optimal linear relationship between independent variables and dependent variables, then makes predictions accordingly.
SCIENCE
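
As a small illustration of the idea summarized above, the following sketch fits a linear regression by ordinary least squares and uses it for prediction. It takes "optimal" to mean least-squares optimal, which is an assumption about the teaser; it is not code from the linked article, and the function names are hypothetical.

```python
import numpy as np

def fit_linear_regression(X, y):
    """Return [intercept, slope_1, ..., slope_p] minimizing squared error."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted linear relationship to new rows of X."""
    X = np.asarray(X, dtype=float)
    return coef[0] + X @ coef[1:]
```

For example, fitting on points that lie exactly on y = 1 + 2x recovers an intercept near 1 and a slope near 2, and predict then extrapolates along that line.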
arxiv.org

Equivalent one-dimensional first-order linear hyperbolic systems and range of the minimal null control time with respect to the internal coupling matrix

In this paper, we are interested in the minimal null control time of one-dimensional first-order linear hyperbolic systems by one-sided boundary controls. Our main result is an explicit characterization of the smallest and largest values that this minimal null control time can take with respect to the internal coupling matrix. In particular, we obtain a complete description of the situations where the minimal null control time is invariant with respect to all the possible choices of internal coupling matrices. The proof relies on the notion of equivalent systems, in particular the backstepping method, a canonical $LU$-decomposition for boundary coupling matrices and a compactness-uniqueness method adapted to the null controllability property.
MATHEMATICS
arxiv.org

Unifying Design-based Inference: On Bounding and Estimating the Variance of any Linear Estimator in any Experimental Design

This paper provides a design-based framework for variance (bound) estimation in experimental analysis. Results are applicable to virtually any combination of experimental design, linear estimator (e.g., difference-in-means, OLS, WLS) and variance bound, allowing for unified treatment and a basis for systematic study and comparison of designs using matrix spectral analysis. A proposed variance estimator reproduces Eicker-Huber-White (a.k.a. "robust", "heteroskedastic consistent", "sandwich", "White", "Huber-White", "HC", etc.) standard errors and "cluster-robust" standard errors as special cases. While past work has shown algebraic equivalences between design-based and the so-called "robust" standard errors under some designs, this paper motivates them for a wide array of design-estimator-bound triplets. In so doing, it provides a clearer and more general motivation for variance estimators.
SCIENCE
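
For readers unfamiliar with the Eicker-Huber-White standard errors named in the abstract above, the textbook HC0 "sandwich" form is (X'X)^{-1} X' diag(e_i^2) X (X'X)^{-1}, where e_i are the OLS residuals. The sketch below implements only that standard formula; it is not the paper's design-based variance estimator, and the function name is an assumption.

```python
import numpy as np

def ols_with_hc0(X, y):
    """OLS coefficients and HC0 (Eicker-Huber-White) standard errors.

    X must already contain an intercept column if one is wanted.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: X' diag(resid^2) X, built without an n x n matrix.
    meat = X.T @ (X * resid[:, None] ** 2)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))
```

HC0 applies no small-sample correction; variants such as HC1 rescale the same quantity by n / (n - k).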
arxiv.org

Causal Discovery in High-Dimensional Point Process Networks with Hidden Nodes

Thanks to technological advances leading to near-continuous time observations, emerging multivariate point process data offer new opportunities for causal discovery. However, a key obstacle in achieving this goal is that many relevant processes may not be observed in practice. Naive estimation approaches that ignore these hidden variables can generate misleading results because of the unadjusted confounding. To plug this gap, we propose a deconfounding procedure to estimate high-dimensional point process networks with only a subset of the nodes being observed. Our method allows flexible connections between the observed and unobserved processes. It also allows the number of unobserved processes to be unknown and potentially larger than the number of observed nodes. Theoretical analyses and numerical studies highlight the advantages of the proposed method in identifying causal interactions among the observed processes.
SCIENCE
towardsdatascience.com

Causal Inference using Natural Language Processing

Estimating causal effects from text variables by applying NLP methods, and its application to social science research. Recently, I was honoured to be interviewed for an author spotlight by TDS editor Ben Huberman. I took the opportunity to highlight my connectionist approach to learning data science. In particular, I discussed my desire to continuously connect ideas; that inclination is responsible for this article, which combines two of my interests: natural language processing (NLP) and causal inference. I was inspired by a computational linguistics survey paper published at the beginning of this month, which provided a comprehensive review of the use of NLP for causal inference and, conversely, the use of causality in improving NLP models. The reverse relationship, applying causality to NLP research, carries implications for improving the reliability and fairness of AI models; I will explore that connection in a later article. Here, I focus on the first relationship with the primary objective of translating recent research into applications for social science research.
SCIENCE
arxiv.org

On the representation of non-holonomic power series

Holonomic functions play an essential role in Computer Algebra since they allow the application of many symbolic algorithms. Among all algorithmic attempts to find formulas for power series, the holonomic property remains the most important requirement to be satisfied by the function under consideration. The targeted functions mainly summarize that of meromorphic functions. However, expressions like $\tan(z)$, $z/(\exp(z)-1)$, $\sec(z)$, etc. are not holonomic, therefore their power series are inaccessible by non-pattern matching implementations like the current Maple \texttt{convert/FormalPowerSeries}. From the mathematical dictionaries, one can observe that most of the known closed-form formulas of non-holonomic power series involve another sequence whose evaluation depends on some finite summations. In the case of $\tan(z)$ and $\sec(z)$ the corresponding sequences are the Bernoulli and Euler numbers, respectively. Thus providing a symbolic approach that yields complete representations when linear summations for power series coefficients of non-holonomic functions appear, might be seen as a step forward towards the representation of non-holonomic power series.
CODING & PROGRAMMING
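
The link between $\tan(z)$ and the Bernoulli numbers mentioned in the abstract above can be made concrete with the classical formula $\tan(z) = \sum_{n \ge 1} (-1)^{n-1} 2^{2n} (2^{2n} - 1) B_{2n} z^{2n-1} / (2n)!$. The sketch below computes exact coefficients from that known identity; it is not the symbolic algorithm the paper proposes, and the function names are assumptions.

```python
from fractions import Fraction
from math import comb, factorial

def bernoulli_numbers(n_max):
    """Return [B_0, ..., B_n_max] with the convention B_1 = -1/2.

    Uses the recurrence sum_{k=0}^{m} C(m+1, k) B_k = 0 for m >= 1.
    """
    B = [Fraction(1)]
    for m in range(1, n_max + 1):
        s = sum(comb(m + 1, k) * B[k] for k in range(m))
        B.append(-s / (m + 1))
    return B

def tan_series_coefficients(n_terms):
    """Coefficients of z, z^3, z^5, ... in the Maclaurin series of tan(z)."""
    B = bernoulli_numbers(2 * n_terms)
    coeffs = []
    for n in range(1, n_terms + 1):
        scale = Fraction((-1) ** (n - 1) * 4 ** n * (4 ** n - 1), factorial(2 * n))
        coeffs.append(scale * B[2 * n])
    return coeffs
```

Each coefficient depends on a finite sum through the Bernoulli recurrence, which is the kind of structure the abstract describes for non-holonomic series.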
arxiv.org

CounterNet: End-to-End Training of Counterfactual Aware Predictions

This work presents CounterNet, a novel end-to-end learning framework which integrates the predictive model training and counterfactual (CF) explanation generation into a single end-to-end pipeline. Counterfactual explanations attempt to find the smallest modification to the feature values of an instance that changes the prediction of the ML model to a predefined output. Prior CF explanation techniques rely on solving separate time-intensive optimization problems for every single input instance to find CF examples, and also suffer from the misalignment of objectives between model predictions and explanations, which leads to significant shortcomings in the quality of CF explanations. CounterNet, on the other hand, integrates both prediction and explanation in the same framework, which enables the optimization of the CF example generation only once together with the predictive model. We propose a novel variant of back-propagation which can help in effectively training CounterNet's network. Finally, we conduct extensive experiments on multiple real-world datasets. Our results show that CounterNet generates high-quality predictions, and corresponding CF examples (with high validity) for any new input instance significantly faster than existing state-of-the-art baselines.
CODING & PROGRAMMING
arxiv.org

Post-selection inference for linear mixed model parameters using the conditional Akaike information criterion

We investigate the issue of post-selection inference for a fixed and a mixed parameter in a linear mixed model using a conditional Akaike information criterion as a model selection procedure. Within the framework of linear mixed models we develop complete theory to construct confidence intervals for regression and mixed parameters under three frameworks: nested and general model sets as well as misspecified models. Our theoretical analysis is accompanied by a simulation experiment and a post-selection examination on mean income across Galicia's counties. Our numerical studies confirm a good performance of our new procedure. Moreover, they reveal a startling robustness to the model misspecification of a naive method to construct the confidence intervals for a mixed parameter which is in contrast to our findings for the fixed parameters.
SCIENCE
arxiv.org

Joint Estimation and Inference for Multi-Experiment Networks of High-Dimensional Point Processes

Modern high-dimensional point process data, especially those from neuroscience experiments, often involve observations from multiple conditions and/or experiments. Networks of interactions corresponding to these conditions are expected to share many edges, but also exhibit unique, condition-specific ones. However, the degree of similarity among the networks from different conditions is generally unknown. Existing approaches for multivariate point processes do not take these structures into account and do not provide inference for jointly estimated networks. To address these needs, we propose a joint estimation procedure for networks of high-dimensional point processes that incorporates easy-to-compute weights in order to data-adaptively encourage similarity between the estimated networks. We also propose a powerful hierarchical multiple testing procedure for edges of all estimated networks, which takes into account the data-driven similarity structure of the multi-experiment networks. Compared to conventional multiple testing procedures, our proposed procedure greatly reduces the number of tests and results in improved power, while tightly controlling the family-wise error rate. Unlike existing procedures, our method is also free of assumptions on dependency between tests, offers flexibility on p-values calculated along the hierarchy, and is robust to misspecification of the hierarchical structure. We verify our theoretical results via simulation studies and demonstrate the application of the proposed procedure using neuronal spike train data.
SCIENCE

