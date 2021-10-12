
Topic Scene Graph Generation by Attention Distillation from Caption

By W. Wang, R. Wang, X. Chen
 10 days ago

If an image tells a story, the image caption is the briefest narrator. Generally, a scene graph prefers to be an omniscient generalist, while the image caption is more willing to be a specialist, which outlines the gist. Lots of previous studies have found that a scene graph is not as

Neural Attention-Aware Hierarchical Topic Model

Neural topic models (NTMs) apply deep neural networks to topic modelling. Despite their success, NTMs generally ignore two important aspects: (1) only document-level word count information is utilized for training, while more fine-grained sentence-level information is ignored, and (2) external semantic knowledge regarding documents, sentences and words is not exploited for training. To address these issues, we propose a variational autoencoder (VAE) NTM model that jointly reconstructs the sentence and document word counts using combinations of bag-of-words (BoW) topical embeddings and pre-trained semantic embeddings. The pre-trained embeddings are first transformed into a common latent topical space to align their semantics with the BoW embeddings. Our model also features hierarchical KL divergence to leverage embeddings of each document to regularize those of their sentences, thereby paying more attention to semantically relevant sentences. Both quantitative and qualitative experiments have shown the efficacy of our model in (1) lowering the reconstruction errors at both the sentence and document levels, and (2) discovering more coherent topics from real-world datasets.
GCN-SE: Attention as Explainability for Node Classification in Dynamic Graphs

Graph Convolutional Networks (GCNs) are a popular method from graph representation learning that has proved effective for tasks like node classification. Although typical GCN models focus on classifying nodes within a static graph, several recent variants propose node classification in dynamic graphs whose topologies and node attributes change over time, e.g., social networks with dynamic relationships, or literature citation networks with changing co-authorships. These works, however, do not fully address the challenge of flexibly assigning different importance to snapshots of the graph at different times, which depending on the graph dynamics may have more or less predictive power on the labels. We address this challenge by proposing a new method, GCN-SE, that attaches a set of learnable attention weights to graph snapshots at different times, inspired by Squeeze and Excitation Net (SE-Net). We show that GCN-SE outperforms previously proposed node classification methods on a variety of graph datasets. To verify the effectiveness of the attention weight in determining the importance of different graph snapshots, we adapt perturbation-based methods from the field of explainable machine learning to graphical settings and evaluate the correlation between the attention weights learned by GCN-SE and the importance of different snapshots over time. These experiments demonstrate that GCN-SE can in fact identify different snapshots' predictive power for dynamic node classification.
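The snapshot-weighting idea can be sketched as a Squeeze-and-Excitation style gate over time steps. This is only an illustrative sketch, not the authors' architecture: the function name `snapshot_attention`, the weight matrices `w1` and `w2`, the mean-pooling squeeze and the sigmoid gate are all assumptions made for the example.

```python
import numpy as np

def snapshot_attention(snapshots, w1, w2):
    """SE-style gate over graph snapshots (hypothetical sketch).

    snapshots: array of shape (T, N, F), node embeddings for T time steps.
    w1: (H, T) bottleneck weights; w2: (T, H) expansion weights (both assumed).
    Returns per-snapshot weights (T,) and the weighted sum of snapshots (N, F).
    """
    squeezed = snapshots.mean(axis=(1, 2))            # squeeze: one scalar per snapshot
    hidden = np.maximum(0.0, w1 @ squeezed)           # excitation, ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate, one weight per snapshot
    fused = np.tensordot(weights, snapshots, axes=1)  # weighted sum over the time axis
    return weights, fused
```

In a trained model `w1` and `w2` would be learned jointly with the GCN; the resulting `weights` are what one would inspect to see which snapshots matter most.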
From Multimodal to Unimodal Attention in Transformers using Knowledge Distillation

Multimodal Deep Learning has garnered much interest, and transformers have triggered novel approaches, thanks to the cross-attention mechanism. Here we propose an approach to deal with two key existing challenges: high computational resource demands and the issue of missing modalities. We introduce for the first time the concept of knowledge distillation in transformers to use only one modality at inference time. We report a full study analyzing multiple student-teacher configurations, levels at which distillation is applied, and different methodologies. With the best configuration, we improved the state-of-the-art accuracy by 3%, reduced the number of parameters by 2.5 times and cut the inference time by 22%. Such a performance-computation tradeoff can be exploited in many applications, and we aim at opening a new research area where the deployment of complex models with limited resources is demanded.
Creating Generative Art NFTs from Genomic Data

In this post I outline my journey creating a dynamic NFT on the Ethereum blockchain with IPFS and discuss the possible use cases for scientific data. I do not cover algorithmic generation of static images (you should read Albert Sanchez Lafuente’s neat step-by-step for that) but instead demonstrate how I used Cytoscape.js, Anime.js and genomic feature data to dynamically generate visualizations/art at run time when NFTs are viewed from a browser. I will also not be providing an overview of Blockchain but I highly recommend reading Yifei Huang’s recent post: Why every data scientist should pay attention to crypto.
Asymmetric Graph Representation Learning

Despite the enormous success of graph neural networks (GNNs), most existing GNNs are only applicable to undirected graphs where relationships among connected nodes are two-way symmetric (i.e., information can be passed back and forth). However, there is a vast number of applications where the information flow is asymmetric, leading to directed graphs where information can only be passed in one direction. For example, a directed edge indicates that the information can only be conveyed forward from the start node to the end node, but not backward. To accommodate such an asymmetric structure of directed graphs within the framework of GNNs, we propose a simple yet remarkably effective framework for directed graph analysis to incorporate such one-way information passing. We define an incoming embedding and an outgoing embedding for each node to model its receiving and sending features, respectively. We further develop two steps in our directed GNN model, with the first one to aggregate/update the incoming features of nodes and the second one to aggregate/update the outgoing features. By imposing the two roles for each node, the likelihood of a directed edge can be calculated based on the outgoing embedding of the start node and the incoming embedding of the end node. The log-likelihood of all edges plays a natural role of regularization for the proposed model, which can alleviate the over-smoothing problem of deep GNNs. Extensive experiments on multiple real-world directed graphs demonstrate outstanding performance of the proposed model in both node-level and graph-level tasks.
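To make the two-embedding idea concrete, here is a minimal sketch of an edge likelihood built from an outgoing embedding of the start node and an incoming embedding of the end node. The abstract does not give the functional form, so the sigmoid of an inner product used here is an assumption, as are the names `edge_probability` and `edge_log_likelihood`.

```python
import numpy as np

def edge_probability(out_emb, in_emb, u, v):
    """Probability of a directed edge u -> v (assumed form: sigmoid of a dot product)."""
    score = out_emb[u] @ in_emb[v]
    return 1.0 / (1.0 + np.exp(-score))

def edge_log_likelihood(out_emb, in_emb, edges):
    """Sum of log-probabilities over a list of directed edges (u, v)."""
    return float(sum(np.log(edge_probability(out_emb, in_emb, u, v)) for u, v in edges))

# Toy embeddings: node 0 is a strong sender, node 1 a strong receiver.
out_emb = np.array([[2.0, 0.0], [0.0, 0.0]])
in_emb = np.array([[0.0, 0.0], [2.0, 0.0]])
p_forward = edge_probability(out_emb, in_emb, 0, 1)   # edge 0 -> 1
p_backward = edge_probability(out_emb, in_emb, 1, 0)  # edge 1 -> 0
```

Because each node has separate sending and receiving vectors, the score for 0 -> 1 differs from the score for 1 -> 0, which a single shared embedding with a symmetric score could not express.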
Cavity optomechanics with Anderson-localized optical modes

Guillermo Arregui, Ryan Cecil Ng, Marcus Albrechtsen, Søren Stobbe, Clivia Marfa Sotomayor Torres, Pedro David García. Confining photons in cavities enables the modification of the interaction between light and different forms of matter. In optomechanics, cavities are used to enhance the radiation pressure, which in turn enables a wealth of phenomena ranging from optomechanically induced transparency to macroscopic objects cooled to their motional ground state. Previous work on cavity optomechanics employed devices where the ubiquitous structural disorder played no role beyond perturbing resonance frequencies and quality factors. More generally, the interplay between disorder, which must be described by statistical physics, and optomechanical effects has thus far been unexplored. Here we show that sidewall roughness in air-slot photonic-crystal waveguides can induce sufficiently strong backscattering of slot-guided light to create tightly confined Anderson-localized modes with quality factors above $10^5$ and estimated mode volumes below the diffraction limit. The interaction between these disorder-induced optical modes and in-plane mechanical modes of the slotted membrane is governed by a distribution of coupling rates, which can exceed $g_{\text{o}}/2\pi\sim 200$ kHz, leading to mechanical amplification up to self-sustained oscillations via optomechanical backaction. Our work contributes to understanding optomechanics in the multiple-scattering regime, opening new perspectives for exploring complex systems with a multitude of mutually coupled degrees of freedom.
Semi-Autoregressive Image Captioning

Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner, i.e., generating descriptions word by word, which suffers from slow decoding and becomes a bottleneck in real-time applications. Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence in sentence generation, can achieve comparable performance to the autoregressive counterparts with a considerable acceleration. Nevertheless, based on a well-designed experiment, we empirically show that the number of iterations can be effectively reduced when providing sufficient prior knowledge for the language decoder. Towards that end, we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed. The proposed SAIC model maintains the autoregressive property globally but relaxes it locally. Specifically, the SAIC model first generates an intermittent sequence in an autoregressive manner, that is, it predicts the first word of every word group in order. Then, with the help of the partially deterministic prior information and image features, the SAIC model non-autoregressively fills in all the skipped words with one iteration. Experimental results on the MS COCO benchmark demonstrate that our SAIC model outperforms the preceding non-autoregressive image captioning models while obtaining a competitive inference speedup. Code is available at this https URL.
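The two-stage schedule described above can be illustrated by listing which caption positions are produced in which stage. This sketch only computes the position layout, with no model involved; the function name `decoding_plan` and the fixed `group_size` are assumptions for illustration.

```python
def decoding_plan(length, group_size):
    """Split caption positions into the two SAIC-style stages (illustrative only).

    Stage 1 positions are the first word of each group, produced one by one.
    Stage 2 positions are all skipped words, filled in a single parallel pass.
    """
    anchors = list(range(0, length, group_size))
    skipped = [i for i in range(length) if i % group_size != 0]
    return anchors, skipped

anchors, skipped = decoding_plan(length=7, group_size=3)
# Sequential decoder calls: one per anchor, plus one parallel fill-in pass.
sequential_steps = len(anchors) + 1
```

For a 7-word caption with groups of 3, this layout needs 4 sequential decoder calls instead of the 7 a fully autoregressive decoder would make.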
Attention-guided Generative Models for Extractive Question Answering

We propose a novel method for applying Transformer models to extractive question answering (QA) tasks. Recently, pretrained generative sequence-to-sequence (seq2seq) models have achieved great success in question answering. Contributing to the success of these models are internal attention mechanisms such as cross-attention. We propose a simple strategy to obtain an extractive answer span from the generative model by leveraging the decoder cross-attention patterns. Viewing cross-attention as an architectural prior, we apply joint training to further improve QA performance. Empirical results show that on open-domain question answering datasets like NaturalQuestions and TriviaQA, our method approaches state-of-the-art performance on both generative and extractive inference, all while using much fewer parameters. Furthermore, this strategy allows us to perform hallucination-free inference while conferring significant improvements to the model's ability to rerank relevant passages.
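One way to picture span extraction from cross-attention is to aggregate the decoder's attention over passage tokens into one score per token and then pick the contiguous window where the scores stand out. The abstract does not specify the selection rule, so the heuristic below (maximum-sum window over mean-centered scores, via Kadane's algorithm) and the name `best_span` are assumptions, not the authors' procedure.

```python
def best_span(scores):
    """Return (start, end) of the contiguous token span with the largest
    total attention above the mean score (assumed heuristic)."""
    baseline = sum(scores) / len(scores)
    best_sum, best = float("-inf"), (0, 0)
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        centered = s - baseline
        if cur_sum <= 0:
            # A non-positive running sum cannot help: restart the window here.
            cur_sum, cur_start = centered, i
        else:
            cur_sum += centered
        if cur_sum > best_sum:
            best_sum, best = cur_sum, (cur_start, i)
    return best

# Hypothetical per-token attention mass over a 7-token passage.
attention = [0.01, 0.02, 0.30, 0.35, 0.25, 0.03, 0.04]
span = best_span(attention)
```

Since the returned span is always a slice of the input passage, an answer read off this way cannot contain tokens that are absent from the passage.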
Symbolic Knowledge Distillation: from General Language Models to Commonsense Models

Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi. The common practice for training commonsense models has gone from-human-to-corpus-to-machine: humans author commonsense knowledge graphs in order to train commonsense models. In this work, we investigate an alternative, from-machine-to-corpus-to-machine: general language models author these commonsense knowledge graphs to train commonsense models. Our study leads to a new framework, Symbolic Knowledge Distillation. As with prior art in Knowledge Distillation (Hinton et al., 2015), our approach uses larger models to teach smaller models. A key difference is that we distill knowledge symbolically, as text, in addition to the neural model. We also distill only one aspect, the commonsense of a general language model teacher, allowing the student to be a different type, a commonsense model. Altogether, we show that careful prompt engineering and a separately trained critic model allow us to selectively distill high-quality causal commonsense from GPT-3, a general language model. Empirical results demonstrate that, for the first time, a human-authored commonsense knowledge graph is surpassed by our automatically distilled variant in all three criteria: quantity, quality, and diversity. In addition, it results in a neural commonsense model that surpasses the teacher model's commonsense capabilities despite its 100x smaller size. We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
Visual-aware Attention Dual-stream Decoder for Video Captioning

Video captioning is a challenging task that captures different visual parts and describes them in sentences, for it requires visual and linguistic coherence. The attention mechanism in the current video captioning method learns to assign weight to each frame, promoting the decoder dynamically. This may not explicitly model the correlation and the temporal coherence of the visual features extracted in the sequence frames. To generate semantically coherent sentences, we propose a new Visual-aware Attention (VA) model, which concatenates dynamic changes of temporal sequence frames with the words at the previous moment, as the input of the attention mechanism to extract sequence features. In addition, the prevalent approaches widely use teacher-forcing (TF) learning during training, where the next token is generated conditioned on the previous ground-truth tokens. The semantic information in the previously generated tokens is lost. Therefore, we design a self-forcing (SF) stream that takes the semantic information in the probability distribution of the previous token as input to enhance the current token. The Dual-stream Decoder (DD) architecture unifies the TF and SF streams, generating sentences to promote the annotated captioning for both streams. Meanwhile, with the Dual-stream Decoder utilized, the exposure bias problem, caused by the discrepancy between training and testing in TF learning, is alleviated. The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated through the results of experimental studies on the Microsoft video description (MSVD) corpus and the MSR-Video to text (MSR-VTT) dataset.
A Dual-Perception Graph Neural Network with Multi-hop Graph Generator

Graph neural networks (GNNs) have drawn increasing attention in recent years and achieved remarkable performance in many graph-based tasks, especially in semi-supervised learning on graphs. However, most existing GNNs rely excessively on topological structures and aggregate multi-hop neighborhood information by simply stacking network layers, which may introduce superfluous noise, limit the expressive power of GNNs and ultimately lead to the over-smoothing problem. In light of this, we propose a novel Dual-Perception Graph Neural Network (DPGNN) to address these issues. In DPGNN, we utilize node features to construct a feature graph, and perform node representation learning based on the original topology graph and the constructed feature graph simultaneously, which helps capture both the structural neighborhood information and the feature-related information. Furthermore, we design a Multi-Hop Graph Generator (MHGG), which applies a node-to-hop attention mechanism to aggregate node-specific multi-hop neighborhood information adaptively. Finally, we apply self-ensembling to form a consistent prediction for unlabeled node representations. Experimental results on five datasets with different topological structures demonstrate that our proposed DPGNN achieves competitive performance across all datasets and outperforms the latest state-of-the-art models on four of them. The source code of our model is available at this https URL.
Molecular Graph Generation via Geometric Scattering

Graph neural networks (GNNs) have been used extensively for addressing problems in drug design and discovery. Both ligand and target molecules are represented as graphs with node and edge features encoding information about atomic elements and bonds respectively. Although existing deep learning models perform remarkably well at predicting physicochemical properties and binding affinities, the generation of new molecules with optimized properties remains challenging. Inherently, most GNNs perform poorly in whole-graph representation due to the limitations of the message-passing paradigm. Furthermore, step-by-step graph generation frameworks that use reinforcement learning or other sequential processing can be slow and result in a high proportion of invalid molecules with substantial post-processing needed in order to satisfy the principles of stoichiometry. To address these issues, we propose a representation-first approach to molecular graph generation. We guide the latent representation of an autoencoder by capturing graph structure information with the geometric scattering transform and apply penalties that structure the representation also by molecular properties. We show that this highly structured latent space can be directly used for molecular graph generation by the use of a GAN. We demonstrate that our architecture learns meaningful representations of drug datasets and provides a platform for goal-directed drug synthesis.
Residual2Vec: Debiasing graph embedding with random graphs

Graph embedding maps a graph into a convenient vector-space representation for graph analysis and machine learning applications. Many graph embedding methods hinge on a sampling of context nodes based on random walks. However, random walks can be a biased sampler due to the structural properties of graphs. Most notably, random walks are biased by the degree of each node, where a node is sampled proportionally to its degree. The implication of such biases has not been clear, particularly in the context of graph representation learning. Here, we investigate the impact of the random walks' bias on graph embedding and propose residual2vec, a general graph embedding method that can debias various structural biases in graphs by using random graphs. We demonstrate that this debiasing not only improves link prediction and clustering performance but also allows us to explicitly model salient structural properties in graph embedding.
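The degree bias mentioned above can be checked numerically: on a connected, non-bipartite undirected graph, the long-run visit frequency of a simple random walk is proportional to node degree. The sketch below verifies this by power iteration on a small hand-made graph; the graph and the function name `stationary_distribution` are illustrative assumptions, and the code does not implement residual2vec itself.

```python
import numpy as np

def stationary_distribution(adj, steps=1000):
    """Long-run visit frequencies of a simple random walk, by power iteration."""
    transition = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic matrix
    pi = np.full(len(adj), 1.0 / len(adj))             # start from uniform
    for _ in range(steps):
        pi = pi @ transition
    return pi

# Undirected toy graph with edges 0-1, 0-2, 0-3, 1-2 (contains a triangle).
adj = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

degrees = adj.sum(axis=1)
pi = stationary_distribution(adj)
```

Here `pi` matches `degrees / degrees.sum()`, so the degree-3 hub is sampled three times as often as the degree-1 leaf, regardless of any other property of the nodes.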
Reeb Graph Metrics from the Ground Up

The Reeb graph has been utilized in various applications including the analysis of scalar fields. Recently, research has been focused on using topological signatures such as the Reeb graph to compare multiple scalar fields by defining distance metrics on the topological signatures themselves. Here we survey five existing metrics that have been defined on Reeb graphs: the bottleneck distance, the interleaving distance, functional distortion distance, the Reeb graph edit distance, and the universal edit distance. Our goal is to (1) provide definitions and concrete examples of these distances in order to develop the intuition of the reader, (2) visit previously proven results of stability, universality, and discriminativity, (3) identify and complete any remaining properties which have only been proven (or disproven) for a subset of these metrics, (4) expand the taxonomy of the bottleneck distance to better distinguish between variations which have been commonly miscited, and (5) reconcile the various definitions and requirements on the underlying spaces for these metrics to be defined and properties to be proven.
Generalized Borsuk Graphs

Given a finite group $G$ acting freely on a compact metric space $M$, and $\epsilon>0$, we define the $G$-Borsuk graph on $M$ by drawing edges $x\sim y$ whenever there is a non-identity $g\in G$ such that $d(x,gy)\leq\epsilon$. We show that when $\epsilon$ is small, its chromatic number is determined by the topology of $M$ via its $G$-covering number, which is the minimum $k$ such that there is a closed cover $M=F_1\cup\dots\cup F_k$ with $F_i\cap g(F_i)=\emptyset$ for all $g\in G\setminus\{1\}$. We are interested in bounding this number. We give lower bounds using $G$-actions on Hom-complexes, and upper bounds using a recursive formula on the dimension of $M$. We conjecture that the true chromatic number coincides with the lower bound, and give computational evidence. We also study random $G$-Borsuk graphs, which are random induced subgraphs. For these, we compute thresholds for $\epsilon$ that guarantee that the chromatic number is still that of the whole $G$-Borsuk graph. Our results are tight (up to a constant) when the $G$-index and dimension of $M$ coincide.
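As a concrete instance of the definition, take $M$ to be the unit circle and $G=\mathbb{Z}_2$ acting by the antipodal map, so the only non-identity element sends $y$ to $-y$. The sketch below builds the induced graph on a finite sample of points, using the Euclidean distance in the plane; the choice of space, group, metric and the function name `borsuk_graph_circle` are assumptions made for illustration.

```python
import numpy as np

def borsuk_graph_circle(angles, eps):
    """Adjacency matrix of the Z_2-Borsuk graph on sampled points of the circle.

    Points x, y are joined when d(x, -y) <= eps, where -y is the antipode of y.
    """
    pts = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # |x - (-y)| = |x + y| for every pair of sample points.
    dist_to_antipode = np.linalg.norm(pts[:, None, :] + pts[None, :, :], axis=2)
    adj = dist_to_antipode <= eps
    np.fill_diagonal(adj, False)  # no self-loops
    return adj

angles = np.array([0.0, np.pi, np.pi / 2])
adj = borsuk_graph_circle(angles, eps=0.1)
```

Points at angles 0 and $\pi$ are antipodal and therefore adjacent, while the point at $\pi/2$ is far from both antipodes and stays isolated for this small $\epsilon$.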
Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory

Schrödinger Bridge (SB) is an optimal transport problem that has received increasing attention in deep generative modeling for its mathematical flexibility compared to the Score-based Generative Model (SGM). However, it remains unclear whether the optimization principle of SB relates to the modern training of deep generative models, which often rely on constructing parameterized log-likelihood objectives. This raises questions on the suitability of SB models as a principled alternative for generative applications. In this work, we present a novel computational framework for likelihood training of SB models grounded on Forward-Backward Stochastic Differential Equations Theory -- a mathematical methodology from stochastic optimal control that transforms the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be used to construct the likelihood objectives for SB that, surprisingly, generalize the ones for SGM as special cases. This leads to a new optimization principle that inherits the same SB optimality yet without losing applications of modern generative training techniques, and we show that the resulting training algorithm achieves comparable results on generating realistic images on MNIST, CelebA, and CIFAR10.
Joint 3D Human Shape Recovery from a Single Image with Bilayer-Graph

The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power. Information from nodes in the graph is passed to connected neighbors, and propagation of information requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human's 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances compared to dense graphs. In addition, information about pose can guide the recovery of local shape detail and vice versa. We recognize that the connection between coarse and dense is itself a graph, and introduce graph fusion blocks to exchange information between graphs with different scales. We train our model end-to-end and show that we can achieve state-of-the-art results for several evaluation datasets.
Adaptive Distillation: Aggregating Knowledge from Multiple Paths for Efficient Distillation

Knowledge Distillation is becoming one of the primary trends among neural network compression algorithms to improve the generalization performance of a smaller student model with guidance from a larger teacher model. This momentous rise in applications of knowledge distillation is accompanied by the introduction of numerous algorithms for distilling the knowledge such as soft targets and hint layers. Despite this advancement in different techniques for distilling the knowledge, the aggregation of different paths for distillation has not been studied comprehensively. This is of particular significance, not only because different paths have different importance, but also due to the fact that some paths might have negative effects on the generalization performance of the student model. Hence, we need to adaptively adjust the importance of each path to maximize the impact of distillation on the student model. In this paper, we explore different approaches for aggregating these different paths and introduce our proposed adaptive approach based on multitask learning methods. We empirically demonstrate the effectiveness of the proposed approach over other baselines on the applications of knowledge distillation in classification, semantic segmentation, and object detection tasks.
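For orientation, the soft-target path mentioned above is commonly written as a temperature-scaled KL divergence between teacher and student class distributions, and several such path losses can be combined with per-path weights. The sketch below shows both pieces; the fixed weights in `aggregate_paths` are a stand-in assumption, whereas the abstract proposes adapting them with multitask learning methods that are not reproduced here.

```python
import numpy as np

def softmax(logits, temperature):
    """Numerically stable softmax of logits / temperature along the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Temperature-scaled KL(teacher || student), averaged over the batch."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p) - np.log(q)), axis=-1)
    return float(temperature ** 2 * kl.mean())

def aggregate_paths(path_losses, path_weights):
    """Weighted average of per-path distillation losses (weights assumed fixed here)."""
    losses = np.asarray(path_losses, dtype=float)
    weights = np.asarray(path_weights, dtype=float)
    return float((weights * losses).sum() / weights.sum())
```

A path that hurts the student would ideally receive a weight close to zero, which is the behavior an adaptive weighting scheme is meant to learn.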
Learning to Learn Graph Topologies

Learning a graph topology to reveal the underlying relationship between data entities plays an important role in various machine learning and data analysis tasks. Under the assumption that structured data vary smoothly over a graph, the problem can be formulated as a regularised convex optimisation over a positive semidefinite cone and solved by iterative algorithms. Classic methods require an explicit convex function to reflect generic topological priors, e.g. the $\ell_1$ penalty for enforcing sparsity, which limits the flexibility and expressiveness in learning rich topological structures. We propose to learn a mapping from node data to the graph structure based on the idea of learning to optimise (L2O). Specifically, our model first unrolls an iterative primal-dual splitting algorithm into a neural network. The key structural proximal projection is replaced with a variational autoencoder that refines the estimated graph with enhanced topological properties. The model is trained in an end-to-end fashion with pairs of node data and graph samples. Experiments on both synthetic and real-world data demonstrate that our model is more efficient than classic iterative algorithms in learning a graph with specific topological properties.
Asymptotics for Pull on the Complete Graph

We study the randomized rumor spreading algorithm \emph{pull} on complete graphs with $n$ vertices. Starting with one informed vertex and proceeding in rounds, each vertex yet uninformed connects to a neighbor chosen uniformly at random and receives the information, if the vertex it connected to is informed. The goal is to study the number of rounds needed to spread the information to everybody, also known as the \emph{runtime}.
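The process is easy to simulate, which gives a feel for the runtime before any asymptotic analysis. The sketch below follows the description above with synchronous rounds; the function name `pull_rounds` and the `seed` parameter are choices made for this example.

```python
import random

def pull_rounds(n, seed=None):
    """Simulate pull on the complete graph K_n; return rounds until all are informed."""
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True  # one initially informed vertex
    rounds = 0
    while not all(informed):
        newly_informed = []
        for v in range(n):
            if informed[v]:
                continue
            # Pick a neighbor uniformly among the other n - 1 vertices.
            u = rng.randrange(n - 1)
            if u >= v:
                u += 1
            # Only vertices informed at the start of the round can pass it on.
            if informed[u]:
                newly_informed.append(v)
        for v in newly_informed:
            informed[v] = True
        rounds += 1
    return rounds
```

Averaging `pull_rounds(n)` over many seeds for growing `n` gives an empirical runtime curve to compare against the asymptotic results.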
