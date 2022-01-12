
Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning

By Baicen Xiao, Bhaskar Ramasubramanian, Radha Poovendran
arxiv.org
 3 days ago

This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn...

Learning Operators with Coupled Attention

Georgios Kissas, Jacob Seidman, Leonardo Ferreira Guilhoto, Victor M. Preciado, George J. Pappas, Paris Perdikaris. Supervised operator learning is an emerging machine learning paradigm with applications to modeling the evolution of spatio-temporal dynamical systems and approximating general black-box relationships between functional data. We propose a novel operator learning method, LOCA (Learning Operators with Coupled Attention), motivated by the recent success of the attention mechanism. In our architecture, the input functions are mapped to a finite set of features which are then averaged with attention weights that depend on the output query locations. By coupling these attention weights together with an integral transform, LOCA is able to explicitly learn correlations in the target output functions, enabling us to approximate nonlinear operators even when the number of output function measurements in the training set is very small. Our formulation is accompanied by rigorous approximation theoretic guarantees on the universal expressiveness of the proposed model. Empirically, we evaluate the performance of LOCA on several operator learning scenarios involving systems governed by ordinary and partial differential equations, as well as a black-box climate prediction problem. Through these scenarios we demonstrate state of the art accuracy, robustness with respect to noisy input data, and a consistently small spread of errors over testing data sets, even for out-of-distribution prediction tasks.
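
The averaging step described in the abstract above (features combined with attention weights that depend on the output query location) can be illustrated with a minimal sketch. This is not the LOCA architecture: the score function `query_scores`, its `centers` and `sharpness` parameters, and the toy feature vectors are all hypothetical, and the integral-transform coupling of the weights is omitted entirely.

```python
import math


def softmax(scores):
    """Turn arbitrary real scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_scores(y, centers, sharpness=10.0):
    """Hypothetical score function: larger for centers close to the query y."""
    return [-sharpness * (y - c) ** 2 for c in centers]


def attention_average(features, scores):
    """Average the feature vectors using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Two toy feature vectors, each tied to an assumed location on [0, 1].
features = [[0.0, 2.0], [2.0, 4.0]]
centers = [0.0, 1.0]

# A query near 0 weights the first feature more heavily than a query near 1.
near_zero = attention_average(features, query_scores(0.1, centers))
near_one = attention_average(features, query_scores(0.9, centers))
```

With equal scores the weights are uniform and the result reduces to a plain mean of the features, which is a quick sanity check on the weighting.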
Asymptotic Convergence of Deep Multi-Agent Actor-Critic Algorithms

We present sufficient conditions that ensure convergence of the multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm. It is an example of one of the most popular paradigms of Deep Reinforcement Learning (DeepRL) for tackling continuous action spaces: the actor-critic paradigm. In the setting considered herein, each agent observes a part of the global state space in order to take local actions, for which it receives local rewards. For every agent, DDPG trains a local actor (policy) and a local critic (Q-function). The analysis shows that multi-agent DDPG using neural networks to approximate the local policies and critics converges to limits with the following properties: The critic limits minimize the average squared Bellman loss; the actor limits parameterize a policy that maximizes the local critic's approximation of $Q_i^*$, where $i$ is the agent index. The averaging is with respect to a probability distribution over the global state-action space. It captures the asymptotics of all local training processes. Finally, we extend the analysis to a fully decentralized setting where agents communicate over a wireless network prone to delays and losses; a typical scenario in, e.g., robotic applications.
Modelling Cournot Games as Multi-agent Multi-armed Bandits

We investigate the use of a multi-agent multi-armed bandit (MA-MAB) setting for modeling repeated Cournot oligopoly games, where the firms acting as agents choose from the set of arms representing production quantity (a discrete value). Agents interact with separate and independent bandit problems. In this formulation, each agent makes sequential choices among arms to maximize its own reward. Agents do not have any information about the environment; they can only see their own rewards after taking an action. However, the market demand is a stationary function of total industry output, and random entry or exit from the market is not allowed. Given these assumptions, we found that an $\epsilon$-greedy approach offers a more viable learning mechanism than other traditional MAB approaches, as it does not require any additional knowledge of the system to operate. We also propose two novel approaches that take advantage of the ordered action space: $\epsilon$-greedy+HL and $\epsilon$-greedy+EL. These new approaches help firms to focus on more profitable actions by eliminating less profitable choices and hence are designed to optimize the exploration. We use computer simulations to study the emergence of various equilibria in the outcomes and perform an empirical analysis of joint cumulative regrets.
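
The setting in the abstract above, in which each firm runs its own $\epsilon$-greedy bandit over discrete production quantities and observes only its own profit, can be sketched as follows. The linear inverse demand, the cost and demand parameters, the arm grid, and the names `cournot_profit`, `EpsilonGreedyFirm` and `simulate` are illustrative assumptions and are not taken from the paper. The $\epsilon$-greedy+HL and $\epsilon$-greedy+EL variants are not shown.

```python
import random


def cournot_profit(q_i, total_q, a=100.0, b=1.0, c=10.0):
    """Profit of a firm producing q_i when industry output is total_q.

    Assumes linear inverse demand P = max(a - b*Q, 0) and unit cost c.
    """
    price = max(a - b * total_q, 0.0)
    return (price - c) * q_i


class EpsilonGreedyFirm:
    """One firm: an independent epsilon-greedy bandit over quantity arms."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = [0] * len(self.arms)
        self.values = [0.0] * len(self.arms)  # running mean profit per arm

    def choose(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.arms))
        return self.values.index(max(self.values))

    def update(self, arm, reward):
        # Incremental mean update; the firm sees only its own reward.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


def simulate(n_firms=2, arms=range(0, 50, 5), rounds=5000, seed=0):
    """Repeated game: all firms choose simultaneously, then observe profit."""
    rng = random.Random(seed)
    firms = [EpsilonGreedyFirm(arms, rng=rng) for _ in range(n_firms)]
    for _ in range(rounds):
        picks = [f.choose() for f in firms]
        quantities = [f.arms[p] for f, p in zip(firms, picks)]
        total = sum(quantities)
        for f, p, q in zip(firms, picks, quantities):
            f.update(p, cournot_profit(q, total))
    return firms


firms = simulate()
favoured = [f.arms[f.values.index(max(f.values))] for f in firms]
```

After the run, `favoured` holds the quantity each firm currently estimates as most profitable. Whether these settle near a Cournot equilibrium depends on the assumed parameters and the seed.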
Conditional Imitation Learning for Multi-Agent Games

While advances in multi-agent learning have enabled the training of increasingly complex agents, most existing techniques produce a final policy that is not designed to adapt to a new partner's strategy. However, we would like our AI agents to adjust their strategy based on the strategies of those around them. In this work, we study the problem of conditional multi-agent imitation learning, where we have access to joint trajectory demonstrations at training time, and we must interact with and adapt to new partners at test time. This setting is challenging because we must infer a new partner's strategy and adapt our policy to that strategy, all without knowledge of the environment reward or dynamics. We formalize this problem of conditional multi-agent imitation learning, and propose a novel approach to address the difficulties of scalability and data scarcity. Our key insight is that variations across partners in multi-agent games are often highly structured, and can be represented via a low-rank subspace. Leveraging tools from tensor decomposition, our model learns a low-rank subspace over ego and partner agent strategies, then infers and adapts to a new partner strategy by interpolating in the subspace. We experiment with a mix of collaborative tasks, including bandits, particle, and Hanabi environments. Additionally, we test our conditional policies against real human partners in a user study on the Overcooked game. Our model adapts better to new partners compared to baselines, and robustly handles diverse settings ranging from discrete/continuous actions and static/online evaluation with AI/human partners.
Reinforcement Online Learning to Rank with Unbiased Reward Shaping

Online learning to rank (OLTR) aims to learn a ranker directly from implicit feedback derived from users' interactions, such as clicks. Clicks, however, are a biased signal: specifically, top-ranked documents are likely to attract more clicks than documents further down the ranking (position bias). In this paper, we propose a novel learning algorithm for OLTR that uses reinforcement learning to optimize rankers: Reinforcement Online Learning to Rank (ROLTR). In ROLTR, the gradients of the ranker are estimated based on the rewards assigned to clicked and unclicked documents. In order to remove the position bias contained in the reward signals, we introduce unbiased reward shaping functions that exploit inverse propensity scoring for clicked and unclicked documents. The fact that our method can also model unclicked documents provides a further advantage in that fewer user interactions are required to effectively train a ranker, thus providing gains in efficiency. Empirical evaluation on standard OLTR datasets shows that ROLTR achieves state-of-the-art performance, and provides significantly better user experience than other OLTR approaches. To facilitate the reproducibility of our experiments, we make all experiment code available at this https URL.
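
Inverse propensity scoring, which the abstract above uses to correct position bias, can be shown with a generic sketch. This is the textbook IPS weighting for clicks only, not ROLTR's reward shaping functions, and the treatment of unclicked documents is not reproduced. The examination model `(1 / rank) ** eta` and the function names are assumptions chosen for illustration.

```python
def position_propensity(rank, eta=1.0):
    """Assumed probability that a user examines the document at 1-based rank."""
    return (1.0 / rank) ** eta


def ips_click_rewards(clicks, eta=1.0):
    """Weight each click by the inverse of its position propensity.

    clicks is a list of 0/1 values in ranked order. A click at a rarely
    examined low position counts for more than a click at the top.
    """
    return [
        click / position_propensity(rank, eta)
        for rank, click in enumerate(clicks, start=1)
    ]


# Clicks at ranks 1 and 3: the rank-3 click is up-weighted by roughly 3.
rewards = ips_click_rewards([1, 0, 1])
```

Under the assumed examination model, the expected IPS-weighted click at any rank equals the document's relevance probability, which is why the weighting removes the position effect on average.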
Semi-global Periodic Event-triggered Output Regulation for Nonlinear Multi-agent Systems

This study focuses on the periodic event-triggered (PET) cooperative output regulation problem for a class of nonlinear multi-agent systems. The key feature of the PET mechanism is that event-triggered conditions are required to be monitored only periodically. This approach is beneficial for Zeno behavior exclusion and saving of battery energy of onboard sensors. First, new PET distributed observers are proposed to estimate the leader information. We show that the estimation error converges to zero exponentially with a known convergence rate under asynchronous PET communication. Second, a novel PET output feedback controller is designed for the underlying strict feedback nonlinear multi-agent systems. Based on a state transformation technique and a local PET state observer, the cooperative semi-global output regulation problem can be solved by the proposed new control design technique. Simulation results of multiple Lorenz systems illustrate that the developed control scheme is effective.
The Introspective Agent: Interdependence of Strategy, Physiology, and Sensing for Embodied Agents

The last few years have witnessed substantial progress in the field of embodied AI where artificial agents, mirroring biological counterparts, are now able to learn from interaction to accomplish complex tasks. Despite this success, biological organisms still hold one large advantage over these simulated agents: adaptation. While both living and simulated agents make decisions to achieve goals (strategy), biological organisms have evolved to understand their environment (sensing) and respond to it (physiology). The net gain of these factors depends on the environment, and organisms have adapted accordingly. For example, in a low-vision aquatic environment some fish have evolved specific neurons which offer a predictable, but incredibly rapid, strategy to escape from predators. Mammals have lost these reactive systems, but they have much larger fields of view and brain circuitry capable of understanding many future possibilities. While traditional embodied agents manipulate an environment to best achieve a goal, we argue for an introspective agent, which considers its own abilities in the context of its environment. We show that different environments yield vastly different optimal designs, and increasing long-term planning is often far less beneficial than other improvements, such as increased physical ability. We present these findings to broaden the definition of improvement in embodied AI past increasingly complex models. Just as in nature, we hope to reframe strategy as one tool, among many, to succeed in an environment. Code is available at: this https URL.
Deep Learning for Partial MIMO CSI Feedback by Exploiting Channel Temporal Correlation

Accurate estimation of downlink (DL) channel state information (CSI) is required to achieve high spectrum and energy efficiency in massive MIMO systems. Previous works have developed learning-based CSI feedback frameworks within FDD systems for efficient CSI encoding and recovery with demonstrated benefits. However, downlink pilots for CSI estimation by receiving terminals may occupy an excessively large number of resource elements for a massive number of antennas and compromise spectrum efficiency. To overcome this problem, we propose a new learning-based feedback architecture for efficient encoding of partial CSI feedback of interleaved non-overlapped antenna subarrays by exploiting CSI temporal correlation. For ease of encoding, we further design an IFFT approach to decouple partial CSI of antenna subarrays and to preserve partial CSI sparsity. Our results show superior performance in indoor/outdoor scenarios by the proposed model for CSI recovery at significantly reduced computation power and storage needs.
Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Temporal sentence grounding (TSG) is crucial and fundamental for video understanding. Although the existing methods train well-designed deep networks with a large amount of data, we find that they can easily forget rarely appearing cases in the training stage due to the off-balance data distribution, which influences the model generalization and leads to undesirable performance. To tackle this issue, we propose a memory-augmented network, called Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes the rarely appearing content in TSG tasks. Specifically, MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module. We first align the given video-query pair by a cross-modal graph convolutional network, and then utilize a memory module to record the cross-modal shared semantic features in the domain-specific persistent memory. During training, the memory slots are dynamically associated with both common and rare cases, alleviating the forgetting issue. In testing, the rare cases can thus be enhanced by retrieving the stored memories, resulting in better generalization. Finally, the heterogeneous attention module is utilized to integrate the enhanced multi-modal features in both video and query domains. Experimental results on three benchmarks show the superiority of our method on both effectiveness and efficiency, which substantially improves the accuracy not only on the entire dataset but also on rare cases.
Hybrid intelligence for dynamic job-shop scheduling with deep reinforcement learning and attention mechanism

The dynamic job-shop scheduling problem (DJSP) is a class of scheduling tasks that specifically consider the inherent uncertainties such as changing order requirements and possible machine breakdown in realistic smart manufacturing settings. Since traditional methods cannot dynamically generate effective scheduling strategies in the face of environmental disturbances, we formulate the DJSP as a Markov decision process (MDP) to be tackled by reinforcement learning (RL). For this purpose, we propose a flexible hybrid framework that takes disjunctive graphs as states and a set of general dispatching rules as the action space with minimum prior domain knowledge. The attention mechanism is used as the graph representation learning (GRL) module for the feature extraction of states, and the double dueling deep Q-network with prioritized replay and noisy networks (D3QPN) is employed to map each state to the most appropriate dispatching rule. Furthermore, we present Gymjsp, a public benchmark based on the well-known OR-Library, to provide a standardized off-the-shelf facility for RL and DJSP research communities. Comprehensive experiments on various DJSP instances confirm that our proposed framework is superior to baseline algorithms with smaller makespan across all instances and provide empirical justification for the validity of the various components in the hybrid framework.
Review of Reinforcement Learning Papers #13

I present 4 publications from my research area: reinforcement learning. Let’s discuss them! Ye, W., Liu, S., Kurutach, T., Abbeel, P., & Gao, Y. (2021). Mastering Atari Games with Limited Data. arXiv preprint arXiv:2111.00210. EfficientZero is the name given by the authors to their new reinforcement learning algorithm. What...
Finding General Equilibria in Many-Agent Economic Simulations Using Deep Reinforcement Learning

Real economies can be seen as a sequential imperfect-information game with many heterogeneous, interacting strategic agents of various agent types, such as consumers, firms, and governments. Dynamic general equilibrium models are common economic tools to model the economic activity, interactions, and outcomes in such systems. However, existing analytical and computational methods struggle to find explicit equilibria when all agents are strategic and interact, while joint learning is unstable and challenging. Amongst others, a key reason is that the actions of one economic agent may change the reward function of another agent, e.g., a consumer's expendable income changes when firms change prices or governments change taxes. We show that multi-agent deep reinforcement learning (RL) can discover stable solutions that are epsilon-Nash equilibria for a meta-game over agent types, in economic simulations with many agents, through the use of structured learning curricula and efficient GPU-only simulation and training. Conceptually, our approach is more flexible and does not need unrealistic assumptions, e.g., market clearing, that are commonly used for analytical tractability. Our GPU implementation enables training and analyzing economies with a large number of agents within reasonable time frames, e.g., training completes within a day. We demonstrate our approach in real-business-cycle models, a representative family of DGE models, with 100 worker-consumers, 10 firms, and a government who taxes and redistributes. We validate the learned meta-game epsilon-Nash equilibria through approximate best-response analyses, show that RL policies align with economic intuitions, and that our approach is constructive, e.g., by explicitly learning a spectrum of meta-game epsilon-Nash equilibria in open RBC models.
Solving Dynamic Graph Problems with Multi-Attention Deep Reinforcement Learning

Graph problems such as the traveling salesman problem or finding minimal Steiner trees are widely studied and used in data engineering and computer science. Typically, in real-world applications, the features of the graph tend to change over time, so finding a solution to the problem becomes challenging. The dynamic versions of many graph problems are key to a plethora of real-world problems in transportation, telecommunication, and social networks. In recent years, using deep learning techniques to find heuristic solutions for NP-hard graph combinatorial problems has gained much interest as these learned heuristics can find near-optimal solutions efficiently. However, most of the existing methods for learning heuristics focus on static graph problems. The dynamic nature makes NP-hard graph problems much more challenging to learn, and the existing methods fail to find reasonable solutions.
Handling Trust in A Cloud Based Multi Agent System

Cloud computing is an open and distributed network that guarantees access to a large amount of data and IT infrastructure at several levels (software, hardware...). With increasing demand, handling clients' needs is getting increasingly challenging. Responding to all requesting clients could lead to security breaches, and since it is the provider's responsibility to secure not only the offered cloud services but also the data, it is important to ensure clients' reliability. Although filtering clients in the cloud is not so common, it is required to assure cloud safety.
Performance Analysis of Event-Triggered Consensus Control for Multi-agent Systems under Cyber-Physical Attacks

This work presents a rigorous analysis of the adverse effects of cyber-physical attacks on the performance of multi-agent consensus with event-triggered control protocols. It is shown how a strategic malicious attack on sensors and actuators can deceive the triggering condition of both the state-based event-triggered mechanism and the combinational state-based event-triggered mechanism, which are commonplace and widely used in the literature. More precisely, it is first shown that a deception attack in the case of the combinational state-based event-triggered mechanism can result in a non-triggering misbehavior, in the sense that the compromised agent does not trigger any event and consequently causes partial feedback disconnectivity by preventing information from reaching the local neighbors of the compromised agent. This indicates that the combinational state-based event-triggered mechanism can be leveraged by the attacker to harm the network connectivity by rendering the recent data unavailable to agents. It is then shown that the deception attack in the case of the state-based event-triggered mechanism can result in a continuous-triggering misbehavior, in the sense that the event-triggered mechanism continuously generates triggering events, resulting in the undesirable phenomenon of Zeno behavior. Finally, numerical simulations are presented to illustrate the theoretical findings.
Visual Attention Prediction Improves Performance of Autonomous Drone Racing Agents

Humans race drones faster than neural networks trained for end-to-end autonomous flight. This may be related to the ability of human pilots to select task-relevant visual information effectively. This work investigates whether neural networks capable of imitating human eye gaze behavior and attention can improve neural network performance for the challenging task of vision-based autonomous drone racing. We hypothesize that gaze-based attention prediction can be an efficient mechanism for visual information selection and decision making in a simulator-based drone racing task. We test this hypothesis using eye gaze and flight trajectory data from 18 human drone pilots to train a visual attention prediction model. We then use this visual attention prediction model to train an end-to-end controller for vision-based autonomous drone racing using imitation learning. We compare the drone racing performance of the attention-prediction controller to those using raw image inputs and image-based abstractions (i.e., feature tracks). Our results show that attention-prediction based controllers outperform the baselines and are able to complete a challenging race track consistently with a success rate of up to 88%. Furthermore, visual attention-prediction and feature-track based models showed better generalization performance than image-based models when evaluated on hold-out reference trajectories. Our results demonstrate that human visual attention prediction improves the performance of autonomous vision-based drone racing agents and provides an essential step towards vision-based, fast, and agile autonomous flight that eventually can reach and even exceed human performance.
Pavlovian Signalling with General Value Functions in Agent-Agent Temporal Decision Making

Andrew Butcher, Michael Bradley Johanson, Elnaz Davoodi, Dylan J. A. Brenneis, Leslie Acker, Adam S. R. Parker, Adam White, Joseph Modayil, Patrick M. Pilarski. In this paper, we contribute a multi-faceted study into Pavlovian signalling -- a process by which learned, temporally extended predictions made by one agent inform decision-making by another agent. Signalling is intimately connected to time and timing. In service of generating and receiving signals, humans and other animals are known to represent time, determine time since past events, predict the time until a future stimulus, and both recognize and generate patterns that unfold in time. We investigate how different temporal processes impact coordination and signalling between learning agents by introducing a partially observable decision-making domain we call the Frost Hollow. In this domain, a prediction learning agent and a reinforcement learning agent are coupled into a two-part decision-making system that works to acquire sparse reward while avoiding time-conditional hazards. We evaluate two domain variations: machine agents interacting in a seven-state linear walk, and human-machine interaction in a virtual-reality environment. Our results showcase the speed of learning for Pavlovian signalling, the impact that different temporal representations do (and do not) have on agent-agent coordination, and how temporal aliasing impacts agent-agent and human-agent interactions differently. As a main contribution, we establish Pavlovian signalling as a natural bridge between fixed signalling paradigms and fully adaptive communication learning between two agents. We further show how to computationally build this adaptive signalling process out of a fixed signalling process, characterized by fast continual prediction learning and minimal constraints on the nature of the agent receiving signals. Our results therefore suggest an actionable, constructivist path towards communication learning between reinforcement learning agents.
Data-Driven Modeling and Prediction of Non-Linearizable Dynamics via Spectral Submanifolds

We develop a methodology to construct low-dimensional predictive models from data sets representing essentially nonlinear (or non-linearizable) dynamical systems with a hyperbolic linear part that are subject to external forcing with finitely many frequencies. Our data-driven, sparse, nonlinear models are obtained as extended normal forms of the reduced dynamics on low-dimensional, attracting spectral submanifolds (SSMs) of the dynamical system. We illustrate the power of data-driven SSM reduction on high-dimensional numerical data sets and experimental measurements involving beam oscillations, vortex shedding and sloshing in a water tank. We find that SSM reduction trained on unforced data also predicts nonlinear response accurately under additional external forcing.
Non-Markovian anti-parity-time symmetric systems: theory and experiment

Open systems with anti parity-time (anti $\mathcal{PT}$-) or $\mathcal{PT}$ symmetry exhibit a rich phenomenology absent in their Hermitian counterparts. To date all model systems and their diverse realizations across classical and quantum platforms have been local in time, i.e. Markovian. Here we propose a non-Markovian system with anti-$\mathcal{PT}$-symmetry where a single time-delay encodes the memory, and experimentally demonstrate its consequences with two time-delay coupled semiconductor lasers. A transcendental characteristic equation with infinitely many eigenvalue pairs sets our model apart. We show that a sequence of amplifying-to-decaying dominant mode transitions is induced by the time delay in our minimal model. The signatures of these transitions quantitatively match results obtained from four, coupled, nonlinear rate equations for laser dynamics, and are experimentally observed as constant-width sideband oscillations in the laser intensity profiles. Our work introduces a new paradigm of non-Hermitian systems with memory, paves the way for their realization in classical systems, and may apply to time-delayed feedback-control for quantum systems.
