Rethinking the Spatial Route Prior in Vision-and-Language Navigation

By Xinzhe Zhou, Wei Liu, Yadong Mu
 10 days ago

Vision-and-language navigation (VLN) is a trending topic which aims to navigate an intelligent agent to an expected position through natural language instructions. This work addresses the task of VLN from a previously-ignored aspect, namely the spatial route prior of the navigation scenes. A critically enabling innovation of this work is explicitly

A Good Prompt Is Worth Millions of Parameters? Low-resource Prompt-based Learning for Vision-Language Models

Large pretrained vision-language (VL) models can learn a new task with a handful of examples or generalize to a new task without fine-tuning. However, these gigantic VL models are hard to deploy for real-world applications due to their impractically huge model size and slow inference speed. In this work, we propose FewVLM, a few-shot prompt-based learner on vision-language tasks. We pretrain a sequence-to-sequence Transformer model with both prefix language modeling (PrefixLM) and masked language modeling (MaskedLM), and introduce simple prompts to improve zero-shot and few-shot performance on VQA and image captioning. Experimental results on five VQA and captioning datasets show that FewVLM outperforms Frozen, which is 31 times larger than ours, by 18.2 percentage points on zero-shot VQAv2 and achieves results comparable to those of PICa, a model 246 times larger. We observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) MaskedLM helps few-shot VQA tasks while PrefixLM boosts captioning performance, and (3) performance significantly increases when training set size is small.
COMPUTERS
Benchmarking the Robustness of Spatial-Temporal Models Against Corruptions

State-of-the-art deep neural networks are vulnerable to common corruptions (e.g., input data degradations, distortions, and disturbances caused by weather changes, system error, and processing). While much progress has been made in analyzing and improving the robustness of models in image understanding, robustness in video understanding is largely unexplored. In this paper, we establish a corruption robustness benchmark, Mini Kinetics-C and Mini SSV2-C, which considers temporal corruptions beyond spatial corruptions in images. We make the first attempt to conduct an exhaustive study on the corruption robustness of established CNN-based and Transformer-based spatial-temporal models. The study provides some guidance on robust model design and training: Transformer-based models perform better than CNN-based models on corruption robustness; the generalization ability of spatial-temporal models implies robustness against temporal corruptions; model corruption robustness (especially robustness in the temporal domain) improves with computational cost and model capacity, which may contradict the current trend of improving the computational efficiency of models. Moreover, we find that robustness interventions for image-related tasks (e.g., training models with noise) may not work for spatial-temporal models.
COMPUTERS
Affordable 4K spatial AI computer vision kit raises over $700,000 via Kickstarter

Developers searching for an affordable 4K computer vision spatial artificial intelligence kit may be interested in the Oak D Lite OpenCV AI Kit, which has raised over $700,000 thanks to nearly 7,000 backers via Kickstarter. Now in its final week of funding, the system can be easily set up on your Raspberry Pi, Apple Mac, Windows or Linux PC. “3D object detection is what humans do. We know where objects are – and where they are in physical space. It’s why we can pick up a coffee cup, or catch a ball.”
ADVOCACY
Spatial-Temporal Transformer for 3D Point Cloud Sequences

Effective learning of spatial-temporal information within a point cloud sequence is highly important for many downstream tasks such as 4D semantic segmentation and 3D action recognition. In this paper, we propose a novel framework named Point Spatial-Temporal Transformer (PST2) to learn spatial-temporal representations from dynamic 3D point cloud sequences. Our PST2 consists of two major modules: a Spatio-Temporal Self-Attention (STSA) module and a Resolution Embedding (RE) module. Our STSA module is introduced to capture the spatial-temporal context information across adjacent frames, while the RE module is proposed to aggregate features across neighbors to enhance the resolution of feature maps. We test the effectiveness of our PST2 with two different tasks on point cloud sequences, i.e., 4D semantic segmentation and 3D action recognition. Extensive experiments on three benchmarks show that our PST2 outperforms existing methods on all datasets. The effectiveness of our STSA and RE modules has also been justified with ablation experiments.
CODING & PROGRAMMING
Pathologies in priors and inference for Bayesian transformers

In recent years, the transformer has established itself as a workhorse in many applications ranging from natural language processing to reinforcement learning. Similarly, Bayesian deep learning has become the gold-standard for uncertainty estimation in safety-critical applications, where robustness and calibration are crucial. Surprisingly, no successful attempts to improve transformer models in terms of predictive uncertainty using Bayesian inference exist. In this work, we study this curiously underpopulated area of Bayesian transformers. We find that weight-space inference in transformers does not work well, regardless of the approximate posterior. We also find that the prior is at least partially at fault, but that it is very hard to find well-specified weight priors for these models. We hypothesize that these problems stem from the complexity of obtaining a meaningful mapping from weight-space to function-space distributions in the transformer. Therefore, moving closer to function-space, we propose a novel method based on the implicit reparameterization of the Dirichlet distribution to apply variational inference directly to the attention weights. We find that this proposed method performs competitively with our baselines.
COMPUTERS
CLIP-Adapter: Better Vision-Language Models with Feature Adapters

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in \cite{radford2021learning} to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or the language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
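The bottleneck-plus-residual idea described above can be illustrated with a minimal sketch. It assumes a two-layer ReLU bottleneck and a scalar blending ratio; the names `adapter_blend`, `W_down`, `W_up` and `alpha`, the dimensions, and the random weights are all hypothetical and are not taken from the abstract.

```python
import random


def relu(v):
    """Element-wise ReLU on a plain Python list."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def adapter_blend(feature, W_down, W_up, alpha):
    """Bottleneck adapter followed by residual-style blending.

    alpha (hypothetical) weights the adapted feature; (1 - alpha) weights
    the original pre-trained feature.
    """
    hidden = relu(matvec(W_down, feature))  # project d -> d // r
    adapted = relu(matvec(W_up, hidden))    # project d // r -> d
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(adapted, feature)]


# Toy data: an 8-dimensional "pre-trained" feature and a 4x reduction.
random.seed(0)
d, r = 8, 4
feature = [random.gauss(0, 1) for _ in range(d)]
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d // r)]
W_up = [[random.gauss(0, 0.1) for _ in range(d // r)] for _ in range(d)]

blended = adapter_blend(feature, W_down, W_up, alpha=0.2)
unchanged = adapter_blend(feature, W_down, W_up, alpha=0.0)
```

With `alpha=0.0` the output reduces to the original feature, which is one way to see why a residual blend can preserve pre-trained knowledge when few training examples are available.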
COMPUTERS
Co-clustering of Spatially Resolved Transcriptomic Data

Spatial transcriptomics is a modern sequencing technology that allows the measurement of the activity of thousands of genes in a tissue sample and the mapping of where that activity occurs. This technology has enabled the study of the so-called spatially expressed genes, i.e., genes which exhibit spatial variation across the tissue. Comprehending their functions and their interactions in different areas of the tissue is of great scientific interest, as it might lead to a deeper understanding of several key biological mechanisms. However, adequate statistical tools that exploit the newly available spatial mapping information to reach more specific conclusions are still lacking.
SCIENCE
Spatial Censored Regression Models in R: The CensSpatial package

CensSpatial is an R package for analyzing spatial censored data through linear models. It offers a set of tools for simulating, estimating, making predictions, and performing local influence diagnostics for outlier detection. The package provides four algorithms for estimation and prediction. One of them is based on the stochastic approximation of the EM (SAEM) algorithm, which allows easy and fast estimation of the parameters of linear spatial models when censoring is present. The package provides worthy measures to perform diagnostic analysis using the Hessian matrix of the completed log-likelihood function. This work is divided into two parts. The first part discusses and illustrates the utilities that the package offers for estimating and predicting spatial censored data. The second one describes the valuable tools to perform diagnostic analysis. Several examples in spatial environmental data are also provided.
CODING & PROGRAMMING
Technology
Spatial Transformer Networks — Backpropagation

Spatial Transformer modules, introduced by Max Jaderberg et al., are a popular way to increase spatial invariance of a model against spatial transformations such as translation, scaling, rotation, cropping, as well as non-rigid deformations. They achieve spatial invariance by adaptively transforming their input to a canonical, expected pose, thus leading to a better classification performance.
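As a rough illustration of how such a module warps its input, the sketch below builds an affine sampling grid in normalized [-1, 1] coordinates and reads the input with bilinear interpolation. It covers only the forward pass on a single-channel image; the function names and the toy 3x3 image are hypothetical, and in a real module the matrix `theta` would be predicted by a small localization network rather than fixed by hand.

```python
import math


def affine_grid(theta, height, width):
    """For each output pixel, compute the source location (normalized coords).

    theta is a 2x3 affine matrix; height and width must both exceed 1.
    """
    grid = []
    for i in range(height):
        y = -1.0 + 2.0 * i / (height - 1)
        row = []
        for j in range(width):
            x = -1.0 + 2.0 * j / (width - 1)
            xs = theta[0][0] * x + theta[0][1] * y + theta[0][2]
            ys = theta[1][0] * x + theta[1][1] * y + theta[1][2]
            row.append((xs, ys))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample image at the grid locations; out-of-range pixels count as 0."""
    h, w = len(image), len(image[0])
    out = []
    for row in grid:
        out_row = []
        for xs, ys in row:
            # Convert normalized coordinates back to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            value = 0.0
            for yy in (y0, y0 + 1):
                for xx in (x0, x0 + 1):
                    if 0 <= xx < w and 0 <= yy < h:
                        weight = max(0.0, 1 - abs(px - xx)) * max(0.0, 1 - abs(py - yy))
                        value += weight * image[yy][xx]
            out_row.append(value)
        out.append(out_row)
    return out


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
shift_one_pixel = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # 1.0 normalized = 1 px when width is 3

same = bilinear_sample(image, affine_grid(identity, 3, 3))
shifted = bilinear_sample(image, affine_grid(shift_one_pixel, 3, 3))
```

Because the bilinear weights are piecewise-linear in the sampling coordinates, the output is (sub-)differentiable with respect to `theta`, which is what lets gradients flow back through the sampler during training.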
CODING & PROGRAMMING
Leveraging Spatial and Temporal Correlations in Sparsified Mean Estimation

We study the problem of estimating at a central server the mean of a set of vectors distributed across several nodes (one vector per node). When the vectors are high-dimensional, the communication cost of sending entire vectors may be prohibitive, and it may be imperative for them to use sparsification techniques. While most existing work on sparsified mean estimation is agnostic to the characteristics of the data vectors, in many practical applications such as federated learning, there may be spatial correlations (similarities in the vectors sent by different nodes) or temporal correlations (similarities in the data sent by a single node over different iterations of the algorithm) in the data vectors. We leverage these correlations by simply modifying the decoding method used by the server to estimate the mean. We provide an analysis of the resulting estimation error as well as experiments for PCA, K-Means and Logistic Regression, which show that our estimators consistently outperform more sophisticated and expensive sparsification methods.
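For context, here is a minimal sketch of a correlation-agnostic baseline of the kind the abstract contrasts against: each node sends k randomly chosen coordinates, and the server rescales by d/k so the estimate is unbiased. This is not the authors' estimator, which changes the decoding step to exploit correlations; the function names and toy vectors are hypothetical.

```python
import random


def rand_k_sparsify(vector, k, rng):
    """Node side: keep k uniformly random coordinates as {index: value}."""
    indices = rng.sample(range(len(vector)), k)
    return {i: vector[i] for i in indices}


def decode_mean(messages, d, k):
    """Server side: unbiased mean estimate from sparsified messages.

    Each coordinate survives with probability k / d, so scaling the
    received values by d / k makes the estimate correct in expectation.
    """
    n = len(messages)
    estimate = [0.0] * d
    for message in messages:
        for i, value in message.items():
            estimate[i] += (d / k) * value / n
    return estimate


rng = random.Random(0)
d, n = 6, 4
vectors = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
true_mean = [sum(v[i] for v in vectors) / n for i in range(d)]

# Sending 2 of 6 coordinates per node cuts communication to one third.
sparse_messages = [rand_k_sparsify(v, 2, rng) for v in vectors]
sparse_estimate = decode_mean(sparse_messages, d, 2)

# With k == d nothing is dropped and the estimate equals the true mean.
full_messages = [rand_k_sparsify(v, d, rng) for v in vectors]
full_estimate = decode_mean(full_messages, d, d)
```

The node-side encoding stays the same in the approach the abstract describes; only a function playing the role of `decode_mean` would change.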
SCIENCE
The woes of navigating technology

I have never envied anyone for having a grandchild – one old enough to understand how to work new television technology. That lack of envy came to a screeching, and I mean screeching, halt this ….
TECHNOLOGY
RoQNN: Noise-Aware Training for Robust Quantum Neural Networks

Quantum Neural Network (QNN) is a promising application towards quantum advantage on near-term quantum hardware. However, due to large quantum noise (errors), the performance of QNN models degrades severely on real quantum devices. For example, the accuracy gap between noise-free simulation and noisy results on IBMQ-Yorktown for MNIST-4 classification is over 60%. Existing noise mitigation methods are general ones that do not leverage the unique characteristics of QNNs and are only applicable to inference; on the other hand, existing QNN work does not consider noise effects. To this end, we present RoQNN, a QNN-specific framework to perform noise-aware optimizations in both training and inference stages to improve robustness. We analytically deduce and experimentally observe that the effect of quantum noise on the QNN measurement outcome is a linear map from the noise-free outcome with a scaling and a shift factor. Motivated by that, we propose post-measurement normalization to mitigate the feature distribution differences between noise-free and noisy scenarios. Furthermore, to improve the robustness against noise, we propose noise injection into the training process by inserting quantum error gates into the QNN according to realistic noise models of quantum hardware. Finally, post-measurement quantization is introduced to quantize the measurement outcomes to discrete values, achieving a denoising effect. Extensive experiments on 8 classification tasks using 6 quantum devices demonstrate that RoQNN improves accuracy by up to 43%, and achieves over 94% 2-class, 80% 4-class, and 34% 10-class MNIST classification accuracy measured on real quantum computers. We also open-source our PyTorch library for construction and noise-aware training of QNNs.
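The claim that a scale-and-shift noise model can be cancelled by normalization is easy to check numerically. The sketch below assumes normalization means subtracting the mean and dividing by the standard deviation over a batch of outcomes for one qubit; that choice, the factor names `gamma` and `beta`, and all numbers are hypothetical illustrations rather than details from the abstract.

```python
import statistics


def normalize(outcomes):
    """Zero-mean, unit-variance normalization over a batch of outcomes."""
    mean = statistics.fmean(outcomes)
    std = statistics.pstdev(outcomes)
    return [(x - mean) / std for x in outcomes]


# Noise-free expectation values for one qubit across a batch of inputs.
clean = [0.8, -0.3, 0.1, -0.6]

# Hypothetical noise model: a linear map with scaling gamma > 0 and shift beta.
gamma, beta = 0.4, 0.05
noisy = [gamma * x + beta for x in clean]

clean_normalized = normalize(clean)
noisy_normalized = normalize(noisy)
```

After normalization the two lists agree up to floating-point error, because subtracting the mean removes `beta` and dividing by the standard deviation removes `gamma`.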
CODING & PROGRAMMING
Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory

Schrödinger Bridge (SB) is an optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Score-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing parameterized log-likelihood objectives. This raises questions about the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded in Forward-Backward Stochastic Differential Equations Theory -- a mathematical methodology that appeared in stochastic optimal control and that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalize the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality without losing the applicability of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10.
CODING & PROGRAMMING
Efficient Fully-Coherent Hamiltonian Simulation

Hamiltonian simulation is a fundamental problem at the heart of quantum computation, and the associated simulation algorithms are useful building blocks for designing larger quantum algorithms. In order to be successfully concatenated into a larger quantum algorithm, a Hamiltonian simulation algorithm must succeed with arbitrarily high success probability $1-\delta$ while only requiring a single copy of the initial state, a property which we call fully-coherent. Although optimal Hamiltonian simulation has been achieved by quantum signal processing (QSP), with query complexity linear in time $t$ and logarithmic in inverse error $\ln(1/\epsilon)$, the corresponding algorithm is not fully-coherent as it only succeeds with probability close to $1/4$. While this simulation algorithm can be made fully-coherent by employing amplitude amplification at the expense of appending a $\ln(1/\delta)$ multiplicative factor to the query complexity, here we develop a new fully-coherent Hamiltonian simulation algorithm that achieves a query complexity additive in $\ln(1/\delta)$: $\Theta\big( \|\mathcal{H}\| |t| + \ln(1/\epsilon) + \ln(1/\delta)\big)$. We accomplish this by compressing the spectrum of the Hamiltonian with an affine transformation, and applying to it a QSP polynomial that approximates the complex exponential only over the range of the compressed spectrum. We further numerically analyze the complexity of this algorithm and demonstrate its application to the simulation of the Heisenberg model in constant and time-dependent external magnetic fields. We believe that this efficient fully-coherent Hamiltonian simulation algorithm can serve as a useful subroutine in quantum algorithms where maintaining coherence is paramount.
COMPUTERS
Micromagnetic simulations of clusters of nanoparticles with internal structure: Application to magnetic hyperthermia

Micromagnetic simulation results on dynamic hysteresis loops of clusters of iron oxide nanoparticles (NPs) with internal structure composed of nanorods are compared with the widely used macrospin approximation. Such calculations, which allow for nanorod-composed NPs, are facilitated by a previously developed coarse-graining method based on the renormalization group approach. With a focus on applications to magnetic hyperthermia, we show that magnetostatic interactions improve the heating performance of NPs in chains and triangles, and reduce heating performance in fcc arrangements. Hysteresis loops of triangular and fcc systems of complex NPs are not recovered within the macrospin approximation, especially at smaller interparticle distances. For triangular arrangements, the macrospin approximation predicts that magnetostatic interactions reduce loop area, in contrast to the complex NP case. An investigation of the local hysteresis loops of individual NPs and macrospins in clusters reveals the impact of the geometry of their neighbours on individual versus collective magnetic response, inhomogeneous heating within clusters, and further differences between simulating NPs with internal structure and the use of the macrospin approximation. Capturing the internal physical and magnetic structure of NPs is thus important for some applications.
CHEMISTRY
Quantum Chaos and Trotterisation Thresholds in Digital Quantum Simulations

By Cahit Kargi, Juan Pablo Dehollain, Fabio Henriques, Lukas M. Sieberer, Tobias Olsacher, Philipp Hauke, Markus Heyl, Peter Zoller, Nathan K. Langford

Digital quantum simulation (DQS) is one of the most promising paths for achieving first useful real-world applications for quantum processors. Yet even assuming rapid progress in device engineering and development of fault-tolerant quantum processors, algorithmic resource optimisation will long remain crucial to exploit their full power. Currently, Trotterisation provides state-of-the-art DQS resource scaling. Moreover, recent theoretical studies of Trotterised Ising models suggest it also offers feasible performance for unexpectedly large step sizes up to a sharp breakdown threshold, but demonstrations and characterisation have been limited, and the question of whether this behaviour applies as a general principle has remained open. Here, we study a set of paradigmatic and experimentally realisable DQS models, and show that a range of Trotterisation performance behaviours, including the existence of a sharp threshold, are remarkably universal. Carrying out a detailed characterisation of a range of performance signatures, we demonstrate that it is the onset of digitisation-induced quantum chaos at this threshold that underlies the breakdown of Trotterisation. Specifically, combining analysis of detailed dynamics with conclusive, global static signatures based on random matrix theory, we observe clear signatures of regular behaviour pre-threshold, and conclusive, initial-state-independent evidence for the onset of quantum chaotic dynamics beyond the threshold. We also show how this behaviour consistently emerges as a function of system size for sizes and times already relevant for current experimental DQS platforms. The advances in this work open up many important questions about the algorithm performance and general shared features of sufficiently complex Trotterisation-based DQS. Answering these will be crucial for extracting the maximum simulation power from future quantum processors.
COMPUTERS
Local Existence and Uniqueness of Spatially Quasi-Periodic Solutions to the Generalized KdV Equation

In this paper, we study the existence and uniqueness of spatially quasi-periodic solutions to the generalized KdV equation (gKdV for short) on the real line with quasi-periodic initial data whose Fourier coefficients are exponentially decaying. In order to solve for the Fourier coefficients of the solution, we first reduce the nonlinear dispersive partial differential equation to a nonlinear infinite system of coupled ordinary differential equations, and then construct the Picard sequence to approximate them. However, we meet, and have to deal with, the difficulty of studying the higher dimensional discrete convolution operation for several functions: \[\underbrace{c\times\cdots\times c}_{\mathfrak p~\text{times}}~(\text{total distance}):=\sum_{\substack{\clubsuit_1,\cdots,\clubsuit_{\mathfrak p}\in\mathbb Z^\nu\\ \clubsuit_1+\cdots+\clubsuit_{\mathfrak p}=~\text{total distance}}}\prod_{j=1}^{\mathfrak p}c(\clubsuit_j).\] In order to overcome it, we apply a combinatorial method to reformulate the Picard sequence as a tree. Based on this form, we prove that the Picard sequence is exponentially decaying and fundamental (i.e., a Cauchy sequence). We first give a detailed discussion of the proof of the existence and uniqueness result in the case $\mathfrak p=3$. Next, we prove existence and uniqueness in the general case $\mathfrak p\geq 2$, which then covers the remaining cases $\mathfrak p\geq 4$. As a byproduct, we recover the local result from \cite{damanik16}. We exhibit the most important combinatorial index $\sigma$ and obtain a relationship with other indices, which is essential to our proofs in the case of general $\mathfrak p$.
SCIENCE
Quantum field theories, Markov random fields and machine learning

The transition to Euclidean space and the discretization of quantum field theories on spatial or space-time lattices opens up the opportunity to investigate probabilistic machine learning from the perspective of quantum field theory. Here, we will discuss how discretized Euclidean field theories can be recast within the mathematical framework of Markov random fields, which is a notable class of probabilistic graphical models with applications in a variety of research areas, including machine learning. Specifically, we will demonstrate that the $\phi^{4}$ scalar field theory on a square lattice satisfies the Hammersley-Clifford theorem, therefore recasting it as a Markov random field from which neural networks are additionally derived. We will then discuss applications pertinent to the minimization of an asymmetric distance between the probability distribution of the $\phi^{4}$ machine learning algorithms and that of target probability distributions.
COMPUTERS
Stability against large perturbations of invertible, frustration-free ground states

A gapped ground state of a quantum spin system has a natural length scale set by the gap. This length scale governs the decay of correlations. A common intuition is that this length scale also controls the spatial relaxation towards the ground state away from impurities or boundaries. The aim of this article is to take a step towards a proof of this intuition. To make the problem more tractable, we assume that there is a unique ground state that is frustration-free and invertible (i.e. no long-range entanglement). Moreover, we assume the property that we are aiming to prove for one specific kind of boundary condition; namely open boundary conditions. With these assumptions we can prove stretched exponential decay away from boundaries for any boundary conditions or (large) perturbations and for all ground states of the perturbed system. In particular, the perturbed system itself can certainly have long-range entanglement.
SCIENCE

