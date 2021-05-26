

The intuition behind Reinforcement Learning

By Editors' Picks
towardsdatascience.com
 3 days ago

A short and simplistic introduction to Reinforcement Learning. At first glance, Reinforcement Learning may appear incredibly overwhelming. State, Action, Environment, Rewards, value function, Q, and a plethora of other such terms and definitions may be quite a burden for the beginner. But what if I told you that we have been implementing this concept, quite literally, all our lives? Yes. You read that right! Almost every day we employ RL techniques. Through this article I hope to show you just how intuitive Reinforcement Learning really is and how commonplace it is in our lives. Perhaps this approach might help demystify the concept a bit. I hope to focus on the intuition rather than the implementation, so I have tried to use mathematical formulas as sparingly as possible.

YOU MAY ALSO LIKE
Python · towardsdatascience.com

Maths behind Supervised Learning for Dummies: The theory in plain words (Part I).

"Machine learning underlies the coolest technologies of today." This is how Ethem Alpaydin starts the preface of his book, “Introduction to Machine Learning”, which I had to read some years ago when I started out in the data science world. Today, the book is still considered a bible in academia. The difference from many of the books you can find on Amazon is that it does not talk about technology; it only covers the algorithmic side of machine learning. This is the point. Most people who work with machine learning today just use the technology but have no idea about the mathematical basis of the algorithms that this technology encapsulates. It is very simple to call the ‘LDA’ algorithm in Python on a pandas dataframe, and it is even easier to run a linear regression with caret on an R data frame. This is fine if you need to build a model with some accuracy, but if you want to really exploit your resources, you need to completely understand the algorithms and maths behind the library or package you employ.
Computers · arxiv.org

Adaptive ABAC Policy Learning: A Reinforcement Learning Approach

With rapid advances in computing systems, there is an increasing demand for more effective and efficient access control (AC) approaches. Recently, Attribute Based Access Control (ABAC) approaches have been shown to be promising in fulfilling the AC needs of such emerging complex computing environments. An ABAC model grants access to a requester based on attributes of entities in a system and an authorization policy; however, its generality and flexibility come with a higher cost. Further, increasing complexities of organizational systems and the need for federated accesses to their resources make the task of AC enforcement and management much more challenging. In this paper, we propose an adaptive ABAC policy learning approach to automate the authorization management task. We model ABAC policy learning as a reinforcement learning problem. In particular, we propose a contextual bandit system, in which an authorization engine adapts an ABAC model through a feedback control loop; it relies on interacting with users/administrators of the system to receive their feedback that assists the model in making authorization decisions. We propose four methods for initializing the learning model and a planning approach based on attribute value hierarchy to accelerate the learning process. We focus on developing an adaptive ABAC policy learning model for a home IoT environment as a running example. We evaluate our proposed approach over real and synthetic data. We consider both complete and sparse datasets in our evaluations. Our experimental results show that the proposed approach achieves performance that is comparable to ones based on supervised learning in many scenarios and even outperforms them in several situations.
Coding & Programming · arxiv.org

Generic Itemset Mining Based on Reinforcement Learning

One of the biggest problems in itemset mining is the requirement of developing a data structure or algorithm, every time a user wants to extract a different type of itemsets. To overcome this, we propose a method, called Generic Itemset Mining based on Reinforcement Learning (GIM-RL), that offers a unified framework to train an agent for extracting any type of itemsets. In GIM-RL, the environment formulates iterative steps of extracting a target type of itemsets from a dataset. At each step, an agent performs an action to add or remove an item to or from the current itemset, and then obtains from the environment a reward that represents how relevant the itemset resulting from the action is to the target type. Through numerous trial-and-error steps where various rewards are obtained by diverse actions, the agent is trained to maximise cumulative rewards so that it acquires the optimal action policy for forming as many itemsets of the target type as possible. In this framework, an agent for extracting any type of itemsets can be trained as long as a reward suitable for the type can be defined. The extensive experiments on mining high utility itemsets, frequent itemsets and association rules show the general effectiveness and one remarkable potential (agent transfer) of GIM-RL. We hope that GIM-RL opens a new research direction towards learning-based itemset mining.
Coding & Programming · arxiv.org

Enriching Query Semantics for Code Search with Reinforcement Learning

Code search is a common practice for developers during software implementation. The challenges of accurate code search mainly lie in the knowledge gap between source code and natural language (i.e., queries). Due to the limited code-query pairs and large code-description pairs available, the prior studies based on deep learning techniques focus on learning the semantic matching relation between source code and corresponding description texts for the task, and hypothesize that the semantic gap between descriptions and user queries is marginal. In this work, we found that the code search models trained on code-description pairs may not perform well on user queries, which indicates the semantic distance between queries and code descriptions. To mitigate the semantic distance for more effective code search, we propose QueCos, a Query-enriched Code search model. QueCos learns to generate semantic enriched queries to capture the key semantics of given queries with reinforcement learning (RL). With RL, the code search performance is considered as a reward for producing accurate semantic enriched queries. The enriched queries are finally employed for code search. Experiments on the benchmark datasets show that QueCos can significantly outperform the state-of-the-art code search models.
Coding & Programming · arxiv.org

Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning

Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration. However, existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states. We hypothesize that a key missing ingredient from the existing methods is a proper treatment of uncertainty in the offline setting. We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly. Implementation-wise, we adopt a practical and effective dropout-based uncertainty estimation method that introduces very little overhead over existing RL algorithms. Empirically, we observe that UWAC substantially improves model stability during training. In addition, UWAC out-performs existing offline RL methods on a variety of competitive tasks, and achieves significant performance gains over the state-of-the-art baseline on datasets with sparse demonstrations collected from human experts.
Computers · arxiv.org

Ensemble Quantile Networks: Uncertainty-Aware Reinforcement Learning with Applications in Autonomous Driving

Reinforcement learning (RL) can be used to create a decision-making agent for autonomous driving. However, previous approaches provide only black-box solutions, which do not offer information on how confident the agent is about its decisions. An estimate of both the aleatoric and epistemic uncertainty of the agent's decisions is fundamental for real-world applications of autonomous driving. Therefore, this paper introduces the Ensemble Quantile Networks (EQN) method, which combines distributional RL with an ensemble approach, to obtain a complete uncertainty estimate. The distribution over returns is estimated by learning its quantile function implicitly, which gives the aleatoric uncertainty, whereas an ensemble of agents is trained on bootstrapped data to provide a Bayesian estimation of the epistemic uncertainty. A criterion for classifying which decisions have an unacceptable uncertainty is also introduced. The results show that the EQN method can balance risk and time efficiency in different occluded intersection scenarios, by considering the estimated aleatoric uncertainty. Furthermore, it is shown that the trained agent can use the epistemic uncertainty information to identify situations that the agent has not been trained for and thereby avoid making unfounded, potentially dangerous, decisions outside of the training distribution.
Computers · arxiv.org

Room Clearance with Feudal Hierarchical Reinforcement Learning

Reinforcement learning (RL) is a general framework that allows systems to learn autonomously through trial-and-error interaction with their environment. In recent years, combining RL with expressive, high-capacity neural network models has led to impressive performance in a diverse range of domains. However, dealing with the large state and action spaces often required for problems in the real world remains a significant challenge. In this paper we introduce a new simulation environment, "Gambit", designed as a tool to build scenarios that can drive RL research in a direction useful for military analysis. Using this environment we focus on an abstracted and simplified room clearance scenario, where a team of blue agents has to make its way through a building and ensure that all rooms are cleared of (and remain clear of) enemy red agents. We implement a multi-agent version of feudal hierarchical RL that introduces a command hierarchy where a commander at the higher level sends orders to multiple agents at the lower level, who simply have to learn to follow these orders. We find that breaking the task down in this way allows us to solve a number of non-trivial floorplans that require the coordination of multiple agents much more efficiently than the standard baseline RL algorithms we compare with. We then go on to explore how qualitatively different behaviour can emerge depending on what we prioritise in the agent's reward function (e.g. clearing the building quickly vs. prioritising rescuing civilians).
Video Games · The Next Web

Understanding Transformers, the machine learning model behind GPT-3

You know that expression, "When you have a hammer, everything looks like a nail"? Well, in machine learning, it seems like we really have discovered a magical hammer for which everything is, in fact, a nail, and they’re called Transformers. Transformers are models that can be designed to translate text, write poems and op-eds, and even generate computer code. In fact, lots of the amazing research I write about on daleonai.com is built on Transformers, like AlphaFold 2, the model that predicts the structures of proteins from their genetic sequences, as well as powerful natural language processing (NLP) models like GPT-3, BERT, T5, Switch, Meena, and others. You might say they’re more than meets the… ugh, forget it.
Coding & Programming · arxiv.org

A Stochastic Composite Augmented Lagrangian Method For Reinforcement Learning

In this paper, we consider the linear programming (LP) formulation for deep reinforcement learning. The number of constraints depends on the size of the state and action spaces, which makes the problem intractable in large or continuous environments. The general augmented Lagrangian method suffers from the double-sampling obstacle in solving the LP. Namely, the conditional expectations originating from the constraint functions and the quadratic penalties in the augmented Lagrangian function make sampling and evaluation difficult. Motivated by the updates of the multipliers, we overcome the obstacles in minimizing the augmented Lagrangian function by replacing the intractable conditional expectations with the multipliers. Therefore, a deep parameterized augmented Lagrangian method is proposed. Furthermore, the replacement provides a promising breakthrough to integrate the two steps in the augmented Lagrangian method into a single constrained problem. A general theoretical analysis shows that the solutions generated from a sequence of the constrained optimizations converge to the optimal solution of the LP if the error is controlled properly. A theoretical analysis of the quadratic penalty algorithm under the neural tangent kernel setting shows the residual can be arbitrarily small if the parameters of the network and the optimization algorithm are chosen suitably. Preliminary experiments illustrate that our method is competitive with other state-of-the-art algorithms.
Engineering · arxiv.org

Towards a Sample Efficient Reinforcement Learning Pipeline for Vision Based Robotics

Deep Reinforcement Learning holds the promise of enabling autonomous robots to master large repertoires of behavioral skills with minimal human intervention. The improvements brought by this technique enable robots to perform difficult tasks such as grabbing or reaching targets. Nevertheless, the training process is still time consuming and tedious, especially when learning policies only with RGB camera information. This way of learning is essential for transferring the task from simulation to the real world, since the only external source of information for the robot in real life is video. In this paper, we study how to limit the time taken to train a robotic arm with 6 Degrees Of Freedom (DOF) to reach a ball from scratch by assembling a pipeline that is as efficient as possible. The pipeline is divided into two parts: the first captures the relevant information from the RGB video with a Computer Vision algorithm. The second studies how to train a Deep Reinforcement Learning algorithm faster in order to make the robotic arm reach the target in front of it. Follow this link to find videos and plots in higher resolution: this https URL
Public Health · Phys.org

A machine learning model behind COVID-19 vaccine development

When starting a vaccine program, scientists generally have anecdotal understanding of the disease they're aiming to target. When COVID-19 surfaced over a year ago, there were so many unknowns about the fast-moving virus that scientists had to act quickly and rely on new methods and techniques just to even begin understanding the basics of the disease.
Coding & Programming · arxiv.org

OpenGraphGym-MG: Using Reinforcement Learning to Solve Large Graph Optimization Problems on MultiGPU Systems

Large-scale graph optimization problems arise in many fields. This paper presents an extensible, high-performance framework (named OpenGraphGym-MG) that uses deep reinforcement learning and graph embedding to solve large graph optimization problems with multiple GPUs. The paper uses a common RL algorithm (deep Q-learning) and a representative graph embedding (structure2vec) to demonstrate the extensibility of the framework and, most importantly, to illustrate the novel optimization techniques, such as spatial parallelism, graph-level and node-level batched processing, distributed sparse graph storage, efficient parallel RL training and inference algorithms, repeated gradient descent iterations, and adaptive multiple-node selections. This study performs a comprehensive performance analysis of parallel efficiency and memory cost that shows the parallel RL training and inference algorithms are efficient and highly scalable on a number of GPUs. This study also conducts a range of large graph experiments, with both generated graphs (over 30 million edges) and real-world graphs, using a single compute node (with six GPUs) of the Summit supercomputer. Good scalability in both RL training and inference is achieved: as the number of GPUs increases from one to six, the time of a single step of RL training and a single step of RL inference on large graphs with more than 30 million edges is reduced from 316.4s to 54.5s, and from 23.8s to 3.4s, respectively. The research results on a single node lay a solid foundation for future work to address graph optimization problems with a large number of GPUs across multiple nodes of Summit.
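To put the reported timings in perspective, the short sketch below converts them into speedup and parallel efficiency. The helper name `speedup_and_efficiency` is made up for this illustration, and the definitions used (speedup = single-GPU time / multi-GPU time, efficiency = speedup / number of GPUs) are the usual textbook ones, not necessarily the exact metrics used in the paper.

```python
from typing import Tuple


def speedup_and_efficiency(t_single: float, t_parallel: float, n_workers: int) -> Tuple[float, float]:
    """Return (speedup, parallel efficiency) for a run on n_workers devices.

    Hypothetical helper for illustration; uses the textbook definitions.
    """
    speedup = t_single / t_parallel
    return speedup, speedup / n_workers


# Step times quoted in the abstract (seconds), one GPU vs. six GPUs.
train = speedup_and_efficiency(316.4, 54.5, 6)
infer = speedup_and_efficiency(23.8, 3.4, 6)
print(f"training:  {train[0]:.2f}x speedup, {train[1]:.0%} efficiency")
print(f"inference: {infer[0]:.2f}x speedup, {infer[1]:.0%} efficiency")
```

Under these definitions the training step comes out at roughly 5.8x on six GPUs. The inference figures give about 7x, which is above the ideal 6x; the abstract does not say why, and rounding of the reported times could account for part of it.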
Science · arxiv.org

Trajectory Modeling via Random Utility Inverse Reinforcement Learning

We consider the problem of modeling trajectories of drivers in a road network from the perspective of inverse reinforcement learning. As rational agents, drivers are trying to maximize some reward function unknown to an external observer as they make up their trajectories. We apply the concept of random utility from microeconomic theory to model the unknown reward function as a function of observable features plus an error term which represents features known only to the driver. We develop a parameterized generative model for the trajectories based on a random utility Markov decision process formulation of drivers' decisions. We show that maximum entropy inverse reinforcement learning is a particular case of our proposed formulation when we assume a Gumbel density function for the unobserved reward error terms. We illustrate Bayesian inference on model parameters through a case study with real trajectory data from a large city obtained from sensors placed on sparsely distributed points on the street network.
Cars · arxiv.org

Objective-aware Traffic Simulation via Inverse Reinforcement Learning

Traffic simulators act as an essential component in the operation and planning of transportation systems. Conventional traffic simulators usually employ a calibrated physical car-following model to describe vehicles' behaviors and their interactions with the traffic environment. However, there is no universal physical model that can accurately predict the pattern of vehicles' behaviors in different situations. A fixed physical model tends to be less effective in a complicated environment given the non-stationary nature of traffic dynamics. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem, and propose a parameter-sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate a vehicle's trajectories in the real world while simultaneously recovering the reward function that reveals the vehicle's true objective, which is invariant to different dynamics. Extensive experiments on synthetic and real-world datasets show the superior performance of our approach compared to state-of-the-art methods and its robustness to varying traffic dynamics.
Thrive Global

Using Intuition and Facts For Better Decisions

When organizations are making decisions, they are highly likely to rely on prior data that informs them of what has been happening in the past. From this, they are able to predict the future and therefore make wise decisions that are based upon facts. Today, the advent of data science...
Computers · arxiv.org

Reinforcement Learning for Adaptive Video Compressive Sensing

We apply reinforcement learning to video compressive sensing to adapt the compression ratio. Specifically, this work considers video snapshot compressive imaging (SCI), which captures high-speed video using a low-speed camera and in which multiple (B) video frames can be reconstructed from a snapshot measurement. One research gap in previous studies is how to adapt B in the video SCI system for different scenes. In this paper, we fill this gap using reinforcement learning (RL). An RL model, as well as various convolutional neural networks for reconstruction, are learned to achieve adaptive sensing of video SCI systems. Furthermore, the performance of an object detection network that uses the video SCI measurements directly, without reconstruction, is also used to perform RL-based adaptive video compressive sensing. Our proposed adaptive SCI method can thus be implemented at low cost and in real time. Our work takes the technology one step further towards real applications of video SCI.
Markets · arxiv.org

Deep Reinforcement Learning for Optimal Stopping with Application in Financial Engineering

Optimal stopping is the problem of deciding the right time at which to take a particular action in a stochastic system, in order to maximize an expected reward. It has many applications in areas such as finance, healthcare, and statistics. In this paper, we employ deep Reinforcement Learning (RL) to learn optimal stopping policies in two financial engineering applications: namely option pricing and optimal option exercise. We present for the first time a comprehensive empirical evaluation of the quality of optimal stopping policies identified by three state-of-the-art deep RL algorithms: double deep Q-learning (DDQN), categorical distributional RL (C51), and Implicit Quantile Networks (IQN). In the case of option pricing, our findings indicate that in a theoretical Black-Scholes environment, IQN successfully identifies nearly optimal prices. On the other hand, it is slightly outperformed by C51 when confronted with real stock data movements in a put option exercise problem that involves assets from the S&P500 index. More importantly, the C51 algorithm is able to identify an optimal stopping policy that achieves 8% more out-of-sample returns than the best of four natural benchmark policies. We conclude with a discussion of our findings which should pave the way for relevant future research.
Coding & Programming · arxiv.org

Incentivized Bandit Learning with Self-Reinforcing User Preferences

In this paper, we investigate a new multi-armed bandit (MAB) online learning model that considers real-world phenomena in many recommender systems: (i) the learning agent cannot pull the arms by itself and thus has to offer rewards to users to incentivize arm-pulling indirectly; and (ii) if users with specific arm preferences are well rewarded, they induce a "self-reinforcing" effect in the sense that they will attract more users of similar arm preferences. Besides addressing the tradeoff of exploration and exploitation, another key feature of this new MAB model is to balance reward and incentivizing payment. The goal of the agent is to maximize the total reward over a fixed time horizon $T$ with a low total payment. Our contributions in this paper are two-fold: (i) We propose a new MAB model with random arm selection that considers the relationship of users' self-reinforcing preferences and incentives; and (ii) We leverage the properties of a multi-color Polya urn with nonlinear feedback model to propose two MAB policies termed "At-Least-$n$ Explore-Then-Commit" and "UCB-List". We prove that both policies achieve $O(\log T)$ expected regret with $O(\log T)$ expected payment over a time horizon $T$. We conduct numerical simulations to demonstrate and verify the performances of these two policies and study their robustness under various settings.
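For readers new to bandits, the exploration/exploitation trade-off mentioned above can be illustrated with the classic UCB1 rule. This is a minimal sketch of textbook UCB1 only; it is not the paper's "UCB-List" or "At-Least-$n$ Explore-Then-Commit" policy, and it ignores incentives and self-reinforcing preferences entirely. The function name `ucb1`, the `pull` callback, and the arm probabilities are all assumptions made for the example.

```python
import math
import random
from typing import Callable, List


def ucb1(pull: Callable[[int], float], n_arms: int, horizon: int) -> List[int]:
    """Run textbook UCB1 for `horizon` rounds; return how often each arm was pulled.

    `pull(arm)` is a hypothetical callback that returns a reward in [0, 1].
    """
    counts = [0] * n_arms    # number of pulls per arm
    totals = [0.0] * n_arms  # summed reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once to initialise its estimate
        else:
            # Empirical mean plus an exploration bonus that shrinks as an arm is pulled more.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(arm)
        counts[arm] += 1
        totals[arm] += reward
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.2, 0.5, 0.8]  # made-up Bernoulli success probabilities
    pulls = ucb1(lambda a: 1.0 if rng.random() < means[a] else 0.0, len(means), 2000)
    print(pulls)  # the last arm should receive most of the pulls
```

The bonus term makes rarely tried arms look attractive (exploration), while the empirical mean favours arms that have paid off so far (exploitation). Classic UCB1 is known to pull each suboptimal arm only logarithmically often in the horizon, which is the same order of growth as the $O(\log T)$ regret quoted in the abstract.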
Coding & Programming · arxiv.org

Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning

We propose a successive convex approximation based off-policy optimization (SCAOPO) algorithm to solve the general constrained reinforcement learning problem, which is formulated as a constrained Markov decision process (CMDP) in the context of average cost. SCAOPO is based on solving a sequence of convex objective/feasibility optimization problems obtained by replacing the objective and constraint functions in the original problems with convex surrogate functions. At each iteration, the convex surrogate problem can be efficiently solved by the Lagrange dual method, even when the policy is parameterized by a high-dimensional function. Moreover, SCAOPO enables the reuse of old experiences from previous updates, thereby significantly reducing the implementation cost when deployed in real-world engineering systems that need to learn the environment online. In spite of the time-varying state distribution and the stochastic bias incurred by off-policy learning, SCAOPO with a feasible initial point can still provably converge to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely.
Internet · arxiv.org

De-Biased Modelling of Search Click Behavior with Reinforcement Learning

Users' clicks on Web search results are one of the key signals for evaluating and improving web search quality and have been widely used as part of current state-of-the-art Learning-To-Rank (LTR) models. With a large volume of search logs available for major search engines, effective models of searcher click behavior have emerged to evaluate and train LTR models. However, when modeling users' click behavior, considering the bias of the behavior is imperative. In particular, when a search result is not clicked, the user has not necessarily judged it irrelevant; it could simply have been missed, especially for lower-ranked results. These kinds of biases in the click log data can be incorporated into the click models, propagating the errors to the resulting LTR ranking models or evaluation metrics. In this paper, we propose the De-biased Reinforcement Learning Click model (DRLC). The DRLC model relaxes previously made assumptions about the users' examination behavior and resulting latent states. To implement the DRLC model, convolutional neural networks are used as the value networks for reinforcement learning, trained to learn a policy to reduce bias in the click logs. To demonstrate the effectiveness of the DRLC model, we first compare performance with the previous state-of-the-art approaches using established click prediction metrics, including log-likelihood and perplexity. We further show that DRLC also leads to improvements in ranking performance. Our experiments demonstrate the effectiveness of the DRLC model in learning to reduce bias in click logs, leading to improved modeling performance and showing the potential for using DRLC to improve Web search quality.