Decomposing Natural Logic Inferences in Neural NLI

By Julia Rozanova, Deborah Ferreira, Marco Valentino, Mokanrarangan Thayaparan, Andre Freitas
 4 days ago

In the interest of interpreting neural NLI models and their reasoning strategies, we carry out a systematic probing study which investigates whether these models capture the crucial semantic features central to natural logic: monotonicity and concept...

Scaling Structured Inference with Randomization

The scale of the state space of discrete graphical models is crucial for model capacity in the era of deep learning. Existing dynamic programming (DP) based inference typically works with a small number of states (usually less than hundreds). In this work, we propose a family of randomized dynamic programming (RDP) algorithms for scaling structured models to tens of thousands of latent states. Our method is widely applicable to classical DP-based inference (partition, marginal, reparameterization, entropy, etc.) and different graph structures (chains, trees, and more general hypergraphs). It is also compatible with automatic differentiation, so it can be integrated with neural networks seamlessly and learned with gradient-based optimizers. Our core technique is randomization, which is to restrict and reweight DP on a small selected subset of nodes, leading to computation reduction by orders of magnitude. We further achieve low bias and variance with Rao-Blackwellization and importance sampling. Experiments on different inferences over different graphs demonstrate the accuracy and efficiency of our methods. Furthermore, when using RDP to train a scaled structured VAE, it outperforms baselines in terms of test likelihood and successfully prevents posterior collapse.
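
The "restrict and reweight" idea can be illustrated on a chain partition function. The sketch below is not the paper's algorithm: it assumes uniform sampling of k states per step with an n/k reweighting (the abstract describes importance sampling and Rao-Blackwellization instead), and the names `exact_partition` and `randomized_partition` are hypothetical.

```python
import random


def exact_partition(init, trans, steps):
    """Forward recursion over all n states: O(n^2) work per step."""
    n = len(init)
    alpha = list(init)
    for _ in range(steps):
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) for j in range(n)]
    return sum(alpha)


def randomized_partition(init, trans, steps, k, rng):
    """Same recursion, but each inner sum runs over k sampled states.

    Assumption: states are sampled uniformly without replacement and the
    partial sum is scaled by n / k, which keeps each step unbiased.
    """
    n = len(init)
    alpha = list(init)
    for _ in range(steps):
        subset = rng.sample(range(n), k)  # restrict to k states
        scale = n / k                     # reweight to stay unbiased
        alpha = [scale * sum(alpha[i] * trans[i][j] for i in subset)
                 for j in range(n)]
    return sum(alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 50
    init = [1.0] * n
    trans = [[rng.random() for _ in range(n)] for _ in range(n)]
    print("exact     :", exact_partition(init, trans, 3))
    print("randomized:", randomized_partition(init, trans, 3, 10, rng))
```

With k equal to the number of states the estimate reduces to the exact recursion; smaller k trades variance for an O(nk) cost per step.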
Boosting Mobile CNN Inference through Semantic Memory

Human brains are known to be capable of speeding up visual recognition of repeatedly presented objects through faster memory encoding and accessing procedures on activated neurons. For the first time, we borrow and distill such a capability into a semantic memory design, namely SMTM, to improve on-device CNN inference. SMTM employs a hierarchical memory architecture to leverage the long-tail distribution of objects of interest, and further incorporates several novel techniques to put it into effect: (1) it encodes high-dimensional feature maps into low-dimensional, semantic vectors for low-cost yet accurate cache and lookup; (2) it uses a novel metric in determining the exit timing considering different layers' inherent characteristics; (3) it adaptively adjusts the cache size and semantic vectors to fit the scene dynamics. SMTM is prototyped on a commodity CNN engine and runs on both mobile CPU and GPU. Extensive experiments on large-scale datasets and models show that SMTM can significantly speed up the model inference over the standard approach (up to 2X) and prior cache designs (up to 1.5X), with acceptable accuracy loss.
Projection methods for Neural Field equations

Neural field models are nonlinear integro-differential equations for the evolution of neuronal activity, and they are a prototypical large-scale, coarse-grained neuronal model in continuum cortices. Neural fields are often simulated heuristically and, in spite of their popularity in mathematical neuroscience, their numerical analysis is not yet fully established. We introduce generic projection methods for neural fields, and derive a-priori error bounds for these schemes. We extend an existing framework for stationary integral equations to the time-dependent case, which is relevant for neuroscience applications. We find that the convergence rate of a projection scheme for a neural field is determined to a great extent by the convergence rate of the projection operator. This abstract analysis, which unifies the treatment of collocation and Galerkin schemes, is carried out in operator form, without resorting to quadrature rules for the integral term, which are introduced only at a later stage, and whose choice is enslaved by the choice of the projector. Using an elementary timestepper as an example, we demonstrate that the error in a time stepper has two separate contributions: one from the projector, and one from the time discretisation. We give examples of concrete projection methods: two collocation schemes (piecewise-linear and spectral collocation) and two Galerkin schemes (finite elements and spectral Galerkin); for each of them we derive error bounds from the general theory, introduce several discrete variants, provide implementation details, and present reproducible convergence tests.
Pairwise Learning for Neural Link Prediction

In this paper, we aim at providing an effective Pairwise Learning Neural Link Prediction (PLNLP) framework. The framework treats link prediction as a pairwise learning to rank problem and consists of four main components, i.e., neighborhood encoder, link predictor, negative sampler and objective function. The framework is flexible in that any generic graph neural convolution or link prediction specific neural architecture could be employed as neighborhood encoder. For link predictor, we design different scoring functions, which could be selected based on different types of graphs. In negative sampler, we provide several sampling strategies, which are problem specific. As for objective function, we propose to use an effective ranking loss, which approximately maximizes the standard ranking metric AUC. We evaluate the proposed PLNLP framework on 4 link property prediction datasets of Open Graph Benchmark, including ogbl-ddi, ogbl-collab, ogbl-ppa and ogbl-citation2. PLNLP achieves Top 1 performance on ogbl-ddi, and Top 2 performance on ogbl-collab and ogbl-citation2 with only a basic neural architecture. The performance demonstrates the effectiveness of PLNLP.
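
To make the link between a pairwise ranking loss and AUC concrete, here is a minimal sketch. The abstract does not name the loss, so the squared hinge surrogate below is an assumption chosen for illustration, and the function names are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Exact AUC: share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


def pairwise_squared_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Differentiable stand-in for 1 - AUC (assumed surrogate).

    Each pair is penalised by max(0, margin - (p - n))^2, so the loss is
    zero only when every positive outscores every negative by the margin.
    """
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            total += max(0.0, margin - (p - n)) ** 2
    return total / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    pos = [2.0, 1.5]   # scores of observed links
    neg = [0.0, 0.5]   # scores of sampled non-links
    print(auc(pos, neg), pairwise_squared_hinge_loss(pos, neg))
```

Driving the surrogate to zero forces every positive pair above every negative pair, which is the condition under which AUC equals 1.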
Julia 1.7 Extends its Threading Capabilities, Improves Type Inference, and More

Julia 1.7 brings a number of significant enhancements, including new threading capabilities, new Package Manager features, improved type inference, and new syntactic features. It is also the first release to run natively on Apple Silicon. Julia has been progressing a lot in recent releases to better support multithreaded programs. Julia...
Logics of upsets of De Morgan lattices

We study logics determined by matrices consisting of a De Morgan lattice with an upward closed set of designated values, such as the logic of non-falsity preservation in a given finite Boolean algebra and Shramko's logic of non-falsity preservation in the four-element subdirectly irreducible De Morgan lattice. The key tool in the study of these logics is the lattice-theoretic notion of an $n$-filter. We study the logics of all (complete, consistent, and classical) $n$-filters on De Morgan lattices, which are non-adjunctive generalizations of the four-valued logic of Belnap and Dunn (of the three-valued logics of Priest and Kleene, and of classical logic). We then show how to find a finite Hilbert-style axiomatization of any logic determined by a finite family of prime upsets of finite De Morgan lattices and a finite Gentzen-style axiomatization of any logic determined by a finite family of filters on finite De Morgan lattices. As an application, we axiomatize Shramko's logic of anything but falsehood.
Machine Learning Inference for Point Processes: A General, Simple Introduction with Simulations

In part 1, we introduced an original class of point processes, generalizing the Poisson process which is a limiting case, in a simple and intuitive way. We started with one-dimensional processes, and then discussed the two-dimensional case, when the two coordinates X and Y are paired - the equivalent of paired time series. In part 2 (this article), we continue to investigate the two-dimensional case, with unpaired coordinates: this creates a very rich class of processes, with many potential applications. We also introduce cluster processes, and statistical inference to estimate some quantities associated with these processes (granularity, radiality, variance). We also develop a general framework to identify the best model to fit with a particular data set, using non-parametric statistics, even though in many instances, we face non-identifiability issues. We use a machine learning approach, as opposed to classic statistical methodology. It also includes the design of simple, intuitive, model-free confidence intervals. Sections 1 and 2 are found in part 1. Numerous simulations are provided in our interactive spreadsheet, for replication purposes and to allow you to play with the parameters to create your own point processes, or create cluster processes to test clustering algorithms, or analyze (for instance) nearest neighbor empirical distributions. The spreadsheet will also teach you how to create scatterplots in Excel with multiple groups of data, each with a different color. This article is written for people with limited exposure to probability theory, yet goes deep into the methodology without lengthy discussions, allowing the busy practitioner or executive to grasp the concepts in a minimum amount of time. This part 2 of my article can be read independently of part 1.
Disentangled Counterfactual Recurrent Networks for Treatment Effect Inference over Time

Choosing the best treatment-plan for each individual patient requires accurate forecasts of their outcome trajectories as a function of the treatment, over time. While large observational data sets constitute rich sources of information to learn from, they also contain biases as treatments are rarely assigned randomly in practice. To provide accurate and unbiased forecasts, we introduce the Disentangled Counterfactual Recurrent Network (DCRN), a novel sequence-to-sequence architecture that estimates treatment outcomes over time by learning representations of patient histories that are disentangled into three separate latent factors: a treatment factor, influencing only treatment selection; an outcome factor, influencing only the outcome; and a confounding factor, influencing both. With an architecture that is completely inspired by the causal structure of treatment influence over time, we advance forecast accuracy and disease understanding, as our architecture allows for practitioners to infer which patient features influence which part in a patient's trajectory, contrasting other approaches in this domain. We demonstrate that DCRN outperforms current state-of-the-art methods in forecasting treatment responses, on both real and simulated data.
Implicit Neural Representations for Image Compression

Recently Implicit Neural Representations (INRs) gained attention as a novel and effective representation for various data types. Thus far, prior work mostly focused on optimizing their reconstruction performance. This work investigates INRs from a novel perspective, i.e., as a tool for image compression. To this end, we propose the first comprehensive compression pipeline based on INRs including quantization, quantization-aware retraining and entropy coding. Encoding with INRs, i.e. overfitting to a data sample, is typically orders of magnitude slower. To mitigate this drawback, we leverage meta-learned initializations based on MAML to reach the encoding in fewer gradient updates which also generally improves rate-distortion performance of INRs. We find that our approach to source compression with INRs vastly outperforms similar prior work, is competitive with common compression algorithms designed specifically for images and closes the gap to state-of-the-art learned approaches based on Rate-Distortion Autoencoders. Moreover, we provide an extensive ablation study on the importance of individual components of our method which we hope facilitates future research on this novel approach to image compression.
Adaptive Kernel Graph Neural Network

Graph neural networks (GNNs) have demonstrated great success in representation learning for graph-structured data. The layer-wise graph convolution in GNNs is shown to be powerful at capturing graph topology. During this process, GNNs are usually guided by pre-defined kernels such as Laplacian matrix, adjacency matrix, or their variants. However, the adoption of pre-defined kernels may restrain the generality to different graphs: a mismatch between graph and kernel would entail sub-optimal performance. For example, GNNs that focus on low-frequency information may not achieve satisfactory performance when high-frequency information is significant for the graphs, and vice versa. To solve this problem, in this paper, we propose a novel framework, namely Adaptive Kernel Graph Neural Network (AKGNN), which learns to adapt to the optimal graph kernel in a unified manner at the first attempt. In the proposed AKGNN, we first design a data-driven graph kernel learning mechanism, which adaptively modulates the balance between all-pass and low-pass filters by modifying the maximal eigenvalue of the graph Laplacian. Through this process, AKGNN learns the optimal threshold between high and low frequency signals to relieve the generality problem. Later, we further reduce the number of parameters by a parameterization trick and enhance the expressive power by a global readout function. Extensive experiments are conducted on acknowledged benchmark datasets and promising results demonstrate the outstanding performance of our proposed AKGNN by comparison with state-of-the-art GNNs. The source code is publicly available.
Logical Boltzmann Machines

The idea of representing symbolic knowledge in connectionist systems has been a long-standing endeavour which has attracted much attention recently with the objective of combining machine learning and scalable sound reasoning. Early work has shown a correspondence between propositional logic and symmetrical neural networks which nevertheless did not scale well with the number of variables and whose training regime was inefficient. In this paper, we introduce Logical Boltzmann Machines (LBM), a neurosymbolic system that can represent any propositional logic formula in strict disjunctive normal form. We prove equivalence between energy minimization in LBM and logical satisfiability thus showing that LBM is capable of sound reasoning. We evaluate reasoning empirically to show that LBM is capable of finding all satisfying assignments of a class of logical formulae by searching fewer than 0.75% of the possible (approximately 1 billion) assignments. We compare learning in LBM with a symbolic inductive logic programming system, a state-of-the-art neurosymbolic system and a purely neural network-based system, achieving better learning performance in five out of seven data sets.
Programming with Neural Surrogates of Programs

Surrogates, models that mimic the behavior of programs, form the basis of a variety of development workflows. We study three surrogate-based design patterns, evaluating each in case studies on a large-scale CPU simulator. With surrogate compilation, programmers develop a surrogate that mimics the behavior of a program to deploy to...
Quantum processor swapped in for a neural network

It's become increasingly clear that quantum computers won't have a single moment when they become clearly superior to classical hardware. Instead, we're likely to see them becoming useful for a narrow set of problems and then gradually expand out from there to an increasing range of computations. The question obviously becomes one of where the utility will be seen first.
Scalable subsampling: computation, aggregation and inference

Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic $\hat \theta _n$ in order to conduct nonparametric inference such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era where the standard, full-resample size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size $b$ can be computationally challenging with both $b$ and the sample size $n$ being very large. In the paper at hand, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective -- and computationally feasible -- distribution estimation via subsampling. Further, we show how the same set of subsamples can be used to yield a procedure for subsampling aggregation -- also known as subagging -- that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as compared to $\hat \theta _n$. The paper is concluded by showing how to conduct inference, e.g., confidence intervals, based on the scalable subagging estimator instead of the original $\hat \theta _n$.
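
A minimal sketch of subagging over non-random subsamples follows. It assumes the subsamples are contiguous blocks of size b taken every h observations; the abstract only says they are "appropriately chosen" and non-random, so this block layout and the function names are illustrative assumptions.

```python
from statistics import mean


def block_subsamples(data, b, h):
    """Deterministic subsamples: blocks of length b starting every h items.

    h == b gives non-overlapping blocks; h < b gives overlapping ones.
    """
    return [data[start:start + b] for start in range(0, len(data) - b + 1, h)]


def subagging(data, statistic, b, h):
    """Average the statistic over all block subsamples."""
    values = [statistic(block) for block in block_subsamples(data, b, h)]
    return mean(values)


if __name__ == "__main__":
    data = list(range(1, 9))             # 1, 2, ..., 8
    print(block_subsamples(data, 4, 4))  # two non-overlapping blocks
    print(subagging(data, mean, 4, 4))
```

Each block can be processed independently, so the statistic never has to be computed on the full sample at once.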
Automated Customization of On-Thing Inference for Quality-of-Experience Enhancement

The rapid uptake of intelligent applications is pushing deep learning (DL) capabilities to Internet-of-Things (IoT). Despite the emergence of new tools for embedding deep neural networks (DNNs) into IoT devices, providing satisfactory Quality of Experience (QoE) to users is still challenging due to the heterogeneity in DNN architectures, IoT devices, and user preferences. This paper studies automated customization for DL inference on IoT devices (termed as on-thing inference), and our goal is to enhance user QoE by configuring the on-thing inference with an appropriate DNN for users under different usage scenarios. The core of our method is a DNN selection module that learns user QoE patterns on-the-fly and identifies the best-fit DNN for on-thing inference with the learned knowledge. It leverages a novel online learning algorithm, NeuralUCB, that has excellent generalization ability for handling various user QoE patterns. We also embed the knowledge transfer technique in NeuralUCB to expedite the learning process. However, NeuralUCB frequently solicits QoE ratings from users, which incurs non-negligible inconvenience. To address this problem, we design feedback solicitation schemes to reduce the number of QoE solicitations while maintaining the learning efficiency of NeuralUCB. A pragmatic problem, aggregated QoE, is further investigated to improve the practicality of our framework. We conduct experiments on both synthetic and real-world data. The results indicate that our method efficiently learns the user QoE pattern with few solicitations and provides drastic QoE enhancement for IoT devices.
Domain Prompts: Towards memory and compute efficient domain adaptation of ASR systems

Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains. Since domain-specific systems perform better than their generic counterparts on in-domain evaluation, the need for memory and compute-efficient domain adaptation is obvious. Particularly, adapting parameter-heavy transformer-based language models used for rescoring ASR hypotheses is challenging. In this work, we introduce domain-prompts, a methodology that trains a small number of domain token embedding parameters to prime a transformer-based LM to a particular domain. With just a handful of extra parameters per domain, we achieve 7-14% WER improvement over the baseline of using an unadapted LM. Despite being parameter-efficient, these improvements are comparable to those of fully-fine-tuned models with hundreds of millions of parameters. With ablations on prompt-sizes, dataset sizes, initializations and domains, we provide evidence for the benefits of using domain-prompts in ASR systems.
AvatarMe++: Facial Shape and BRDF Inference with Photorealistic Rendering-Aware GANs

Alexandros Lattas, Stylianos Moschoglou, Stylianos Ploumpis, Baris Gecer, Abhijeet Ghosh, Stefanos Zafeiriou. Over the last years, many face analysis tasks have accomplished astounding performance, with applications including face generation and 3D face reconstruction from a single "in-the-wild" image. Nevertheless, to the best of our knowledge, there is no method which can produce render-ready high-resolution 3D faces from "in-the-wild" images, and this can be attributed to: (a) the scarcity of available data for training, and (b) the lack of robust methodologies that can successfully be applied on very high-resolution data. In this work, we introduce the first method that is able to reconstruct photorealistic render-ready 3D facial geometry and BRDF from a single "in-the-wild" image. We capture a large dataset of facial shape and reflectance, which we have made public. We define a fast facial photorealistic differentiable rendering methodology with accurate facial skin diffuse and specular reflection, self-occlusion and subsurface scattering approximation. With this, we train a network that disentangles the facial diffuse and specular BRDF components from a shape and texture with baked illumination, reconstructed with a state-of-the-art 3DMM fitting method. Our method outperforms existing methods by a significant margin and reconstructs high-resolution 3D faces from a single low-resolution image that can be rendered in various applications and bridge the uncanny valley.
Inferring Invariants with Quantifier Alternations: Taming the Search Space Explosion

We present a PDR/IC3 algorithm for finding inductive invariants with quantifier alternations. We tackle scalability issues that arise due to the large search space of quantified invariants by combining a breadth-first search strategy and a new syntactic form for quantifier-free bodies. The breadth-first strategy prevents inductive generalization from getting stuck in regions of the search space that are expensive to search and focuses instead on lemmas that are easy to discover. The new syntactic form is well-suited to lemmas with quantifier alternations by allowing both limited conjunction and disjunction in the quantifier-free body, while carefully controlling the size of the search space. Combining the breadth-first strategy with the new syntactic form results in useful inductive bias by prioritizing lemmas according to: (i) well-defined syntactic metrics for simple quantifier structures and quantifier-free bodies, and (ii) the empirically useful heuristic of preferring lemmas that are fast to discover. On a benchmark suite of primarily distributed protocols and complex Paxos variants, we demonstrate that our algorithm can solve more of the most complicated examples than state-of-the-art techniques.
Progressive Feature Transmission for Split Inference at the Wireless Edge

In edge inference, an edge server provides remote-inference services to edge devices. This requires the edge devices to upload high-dimensional features of data samples over resource-constrained wireless channels, which creates a communication bottleneck. The conventional solution of feature pruning requires that the device has access to the inference model, which is unavailable in the current scenario of split inference. To address this issue, we propose the progressive feature transmission (ProgressFTX) protocol, which minimizes the overhead by progressively transmitting features until a target confidence level is reached. The optimal control policy of the protocol to accelerate inference is derived and it comprises two key operations. The first is importance-aware feature selection at the server, for which it is shown to be optimal to select the most important features, characterized by the largest discriminant gains of the corresponding feature dimensions. The second is transmission-termination control by the server for which the optimal policy is shown to exhibit a threshold structure. Specifically, the transmission is stopped when the incremental uncertainty reduction by further feature transmission is outweighed by its communication cost. The indices of the selected features and transmission decision are fed back to the device in each slot. The optimal policy is first derived for the tractable case of linear classification and then extended to the more complex case of classification using a convolutional neural network. Both Gaussian and fading channels are considered. Experimental results are obtained for both a statistical data model and a real dataset. It is seen that ProgressFTX can substantially reduce the communication latency compared to conventional feature pruning and random feature transmission.
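
The stopping rule for the linear-classification case can be sketched as follows. This is a simplified illustration, not the ProgressFTX protocol itself: it assumes a binary logistic classifier, takes the per-feature discriminant gains as a given input, and stops only on the target confidence (the cost-based termination and the channel model are left out). All names are hypothetical.

```python
import math


def progressive_transmit(features, weights, gains, target_confidence, batch_size=1):
    """Send features in order of decreasing gain until confident enough.

    features, weights: per-dimension values of a linear binary classifier.
    gains: assumed per-dimension importance (stand-in for discriminant gain).
    Returns the indices sent and the confidence reached.
    """
    order = sorted(range(len(features)), key=lambda i: gains[i], reverse=True)
    score = 0.0
    sent = []
    confidence = 0.5  # no features received yet: both classes equally likely
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        for i in batch:
            score += weights[i] * features[i]  # server accumulates the partial score
        sent.extend(batch)
        p = 1.0 / (1.0 + math.exp(-score))     # logistic posterior of class 1
        confidence = max(p, 1.0 - p)
        if confidence >= target_confidence:    # threshold-type stopping rule
            break
    return sent, confidence


if __name__ == "__main__":
    sent, conf = progressive_transmit(
        features=[1.0, 1.0, 1.0],
        weights=[0.1, 3.0, 0.2],
        gains=[0.1, 5.0, 0.3],
        target_confidence=0.9,
    )
    print(sent, round(conf, 4))
```

In this toy run the single most important feature already pushes the confidence past the target, so the remaining two dimensions are never transmitted.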
Semiparametric Conditional Factor Models: Estimation and Inference

This paper introduces a simple and tractable sieve estimation of semiparametric conditional factor models with latent factors. We establish large-$N$-asymptotic properties of the estimators and the tests without requiring large $T$. We also develop a simple bootstrap procedure for conducting inference about the conditional pricing errors as well as the shapes of the factor loadings functions. These results enable us to estimate conditional factor structure of a large set of individual assets by utilizing arbitrary nonlinear functions of a number of characteristics without the need to pre-specify the factors, while allowing us to disentangle the characteristics' role in capturing factor betas from alphas (i.e., undiversifiable risk from mispricing). We apply these methods to the cross-section of individual U.S. stock returns and find strong evidence of large nonzero pricing errors that combine to produce arbitrage portfolios with Sharpe ratios above 3.
