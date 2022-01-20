
Sim-to-Lab-to-Real: Safe Reinforcement Learning with Shielding and Generalization Guarantees

By Kai-Chieh Hsu, Allen Z. Ren, Duy Phuong Nguyen, Anirudha Majumdar, Jaime F. Fisac
 4 days ago

Safety is a critical component of autonomous systems and remains a challenge for learning-based policies to be utilized in the real world. In particular, policies learned using reinforcement learning often fail to generalize to novel environments due to unsafe behavior. In this...

Related
Solving Dynamic Graph Problems with Multi-Attention Deep Reinforcement Learning

Graph problems such as the traveling salesman problem or finding minimal Steiner trees are widely studied and used in data engineering and computer science. In real-world applications, the features of the graph typically change over time, which makes finding a solution challenging. The dynamic versions of many graph problems are key to a plethora of real-world problems in transportation, telecommunication, and social networks. In recent years, using deep learning techniques to find heuristic solutions for NP-hard graph combinatorial problems has gained much interest, as these learned heuristics can find near-optimal solutions efficiently. However, most existing methods for learning heuristics focus on static graph problems. The dynamic nature makes NP-hard graph problems much more challenging to learn, and the existing methods fail to find reasonable solutions.
Opportunities of Hybrid Model-based Reinforcement Learning for Cell Therapy Manufacturing Process Development and Control

Driven by the key challenges of cell therapy manufacturing, including high complexity, high uncertainty, and very limited process data, we propose a stochastic optimization framework named "hybrid-RL" to efficiently guide process development and control. We first create a bioprocess probabilistic knowledge graph, a hybrid model that characterizes the understanding of biomanufacturing process mechanisms and quantifies inherent stochasticity, such as batch-to-batch variation and bioprocess noise. It can capture the key features, including nonlinear reactions, time-varying kinetics, and partially observed bioprocess state. This hybrid model can leverage existing mechanistic models and facilitate learning from process data. Given limited process data, a computational sampling approach is used to generate posterior samples quantifying the model estimation uncertainty. Then, we introduce hybrid model-based Bayesian reinforcement learning (RL), accounting for both inherent stochasticity and model uncertainty, to guide optimal, robust, and interpretable decision making, which can overcome the key challenges of cell therapy manufacturing. In the empirical study, cell therapy manufacturing examples are used to demonstrate that the proposed hybrid-RL framework can outperform classical process optimization assisted by a deterministic mechanistic model.
Learning Robust Policies for Generalized Debris Capture with an Automated Tether-Net System

A tether-net launched from a chaser spacecraft provides a promising method to capture and dispose of large space debris in orbit. This tether-net system is subject to several sources of uncertainty in sensing and actuation that affect the performance of its net launch and closing control. Earlier reliability-based optimization approaches to designing control actions, however, remain challenging and computationally prohibitive to generalize over varying launch scenarios and target (debris) states relative to the chaser. To search for a general and reliable control policy, this paper presents a reinforcement learning framework that integrates a proximal policy optimization (PPO2) approach with net dynamics simulations. The latter allow evaluating the episodes of net-based target capture and estimating the capture quality index that serves as the reward feedback to PPO2. Here, the learned policy is designed to model the timing of the net closing action based on the state of the moving net and the target, under any given launch scenario. A stochastic state transition model is considered in order to incorporate synthetic uncertainties in state estimation and launch actuation. Along with notable reward improvement during training, the trained policy demonstrates capture performance (over a wide range of launch/target scenarios) that is close to that obtained with reliability-based optimization run over an individual scenario.
Multi-echelon Supply Chains with Uncertain Seasonal Demands and Lead Times Using Deep Reinforcement Learning

We address the problem of production planning and distribution in multi-echelon supply chains. We consider uncertain demands and lead times, which make the problem stochastic and non-linear. A Markov Decision Process formulation and a Non-linear Programming model are presented. Because this is a sequential decision-making problem, Deep Reinforcement Learning (RL) is a possible solution approach. This type of technique has gained a lot of attention from the Artificial Intelligence and Optimization communities in recent years. Given the good results obtained with Deep RL approaches in different areas, there is growing interest in applying them to problems from the Operations Research field. We have used a Deep RL technique, namely Proximal Policy Optimization (PPO2), to solve the problem considering uncertain, regular and seasonal demands and constant or stochastic lead times. Experiments are carried out in different scenarios to better assess the suitability of the algorithm. An agent based on a linearized model is used as a baseline. Experimental results indicate that PPO2 is a competitive and adequate tool for this type of problem. The PPO2 agent is better than the baseline in all scenarios with stochastic lead times (7.3-11.2%), regardless of whether demands are seasonal or not. In scenarios with constant lead times, the PPO2 agent is better when uncertain demands are non-seasonal (2.2-4.7%). The results show that the greater the uncertainty of the scenario, the greater the viability of this type of approach.
Automated Reinforcement Learning (AutoRL): A Survey and Open Problems

Jack Parker-Holder, Raghu Rajan, Xingyou Song, André Biedenkapp, Yingjie Miao, Theresa Eimer, Baohe Zhang, Vu Nguyen, Roberto Calandra, Aleksandra Faust, Frank Hutter, Marius Lindauer. The combination of Reinforcement Learning (RL) with deep learning has led to a series of impressive feats, with many believing (deep) RL provides a path...
Non-Asymptotic Guarantees for Robust Statistical Learning under $(1+\varepsilon)$-th Moment Assumption

There has been a surge of interest in developing robust estimators for models with heavy-tailed data in statistics and machine learning. This paper proposes a log-truncated M-estimator for a large family of statistical regressions and establishes its excess risk bound under the condition that the data have $(1+\varepsilon)$-th moment with $\varepsilon \in (0,1]$. With an additional assumption on the associated risk function, we obtain an $\ell_2$-error bound for the estimation. Our theorems are applied to establish robust M-estimators for concrete regressions. Besides convex regressions such as quantile regression and generalized linear models, many non-convex regressions also fit into our theorems; we focus on robust deep neural network regressions, which can be solved by stochastic gradient descent algorithms. Simulations and real data analysis demonstrate the superiority of log-truncated estimations over standard estimations.
Hybrid Reinforcement Learning-Based Eco-Driving Strategy for Connected and Automated Vehicles at Signalized Intersections

Taking advantage of both vehicle-to-everything (V2X) communication and automated driving technology, connected and automated vehicles are quickly becoming one of the transformative solutions to many transportation problems. However, in a mixed traffic environment at signalized intersections, it is still a challenging task to improve overall throughput and energy efficiency considering the complexity and uncertainty in the traffic system. In this study, we propose a hybrid reinforcement learning (HRL) framework which combines a rule-based strategy and deep reinforcement learning (deep RL) to support connected eco-driving at signalized intersections in mixed traffic. Vision-perceptive methods are integrated with vehicle-to-infrastructure (V2I) communications to achieve higher mobility and energy efficiency in mixed connected traffic. The HRL framework has three components: a rule-based driving manager that coordinates the rule-based policies and the RL policy; a multi-stream neural network that extracts the hidden features of vision and V2I information; and a deep RL-based policy network that generates both longitudinal and lateral eco-driving actions. In order to evaluate our approach, we developed a Unity-based simulator and designed a mixed-traffic intersection scenario. Moreover, several baselines were implemented to compare with our new design, and numerical experiments were conducted to test the performance of the HRL model. The experiments show that our HRL method can reduce energy consumption by 12.70% and save 11.75% travel time when compared with a state-of-the-art model-based Eco-Driving approach.
Benchmarking Deep Reinforcement Learning Algorithms for Vision-based Robotics

This paper presents a benchmarking study of some of the state-of-the-art reinforcement learning algorithms used for solving two simulated vision-based robotics problems. The algorithms considered in this study include soft actor-critic (SAC), proximal policy optimization (PPO), interpolated policy gradients (IPG), and their variants with Hindsight Experience Replay (HER). The performance of these algorithms is compared on two of PyBullet's simulation environments, KukaDiverseObjectEnv and RacecarZEDGymEnv. The state observations in these environments are available in the form of RGB images and the action space is continuous, making them difficult to solve. A number of strategies are suggested to provide the intermediate hindsight goals required for implementing the HER algorithm on these problems, which are essentially single-goal environments. In addition, a number of feature extraction architectures are proposed to incorporate spatial and temporal attention in the learning process. Through rigorous simulation experiments, the improvements achieved with these components are established. To the best of our knowledge, such a benchmarking study is not available for the above two vision-based robotics problems, making it a novel contribution in the field.
Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning

This paper considers multi-agent reinforcement learning (MARL) tasks where agents receive a shared global reward at the end of an episode. The delayed nature of this reward affects the ability of the agents to assess the quality of their actions at intermediate time-steps. This paper focuses on developing methods to learn a temporal redistribution of the episodic reward to obtain a dense reward signal. Solving such MARL problems requires addressing two challenges: identifying (1) relative importance of states along the length of an episode (along time), and (2) relative importance of individual agents' states at any single time-step (among agents). In this paper, we introduce Agent-Temporal Attention for Reward Redistribution in Episodic Multi-Agent Reinforcement Learning (AREL) to address these two challenges. AREL uses attention mechanisms to characterize the influence of actions on state transitions along trajectories (temporal attention), and how each agent is affected by other agents at each time-step (agent attention). The redistributed rewards predicted by AREL are dense, and can be integrated with any given MARL algorithm. We evaluate AREL on challenging tasks from the Particle World environment and the StarCraft Multi-Agent Challenge. AREL results in higher rewards in Particle World, and improved win rates in StarCraft compared to three state-of-the-art reward redistribution methods. Our code is available at this https URL.
Reinforcement Learning in Time-Varying Systems: an Empirical Study

Recent research has turned to Reinforcement Learning (RL) to solve challenging decision problems, as an alternative to hand-tuned heuristics. RL can learn good policies without the need for modeling the environment's dynamics. Despite this promise, RL remains an impractical solution for many real-world systems problems. A particularly challenging case occurs when the environment changes over time, i.e. it exhibits non-stationarity. In this work, we characterize the challenges introduced by non-stationarity and develop a framework for addressing them to train RL agents in live systems. Such agents must explore and learn new environments, without hurting the system's performance, and remember them over time. To this end, our framework (1) identifies different environments encountered by the live system, (2) explores and trains a separate expert policy for each environment, and (3) employs safeguards to protect the system's performance. We apply our framework to two systems problems: straggler mitigation and adaptive video streaming, and evaluate it against a variety of alternative approaches using real-world and synthetic data. We show that each component of our framework is necessary to cope with non-stationarity.
Criticality-Based Varying Step-Number Algorithm for Reinforcement Learning

In the context of reinforcement learning we introduce the concept of criticality of a state, which indicates the extent to which the choice of action in that particular state influences the expected return. That is, a state in which the choice of action is more likely to influence the final outcome is considered as more critical than a state in which it is less likely to influence the final outcome.
Reinforcement Learning based Air Combat Maneuver Generation

The advent of artificial intelligence technology has paved the way for much research in the air combat sector. Academics and other researchers have studied a prominent research direction, the autonomous maneuver decision of UAVs. This research has produced some results, but decision methods that include Reinforcement Learning (RL) have turned out to be more efficient. Much research and experimentation has aimed at making an agent reach its target in an optimal way; the most prominent techniques are Genetic Algorithms (GA), A*, RRT, and various other optimization techniques. Reinforcement Learning, however, is the one best known for its success. In the DARPA AlphaDogfight Trials, reinforcement learning prevailed against a real veteran F-16 human pilot who was trained by Boeing. This successful model was developed by Heron Systems. After this accomplishment, reinforcement learning attracted tremendous attention. In this research, we aim to move our UAV, which has Dubins vehicle dynamics, to the target in two-dimensional space along an optimal path using Twin Delayed Deep Deterministic Policy Gradients (TD3), with Hindsight Experience Replay (HER) as the experience replay. We ran tests on two different environments using simulations.
Learning from Atypical Behavior: Temporary Interest Aware Recommendation Based on Reinforcement Learning

Traditional robust recommendation methods view atypical user-item interactions as noise and aim to reduce their impact with some kind of noise filtering technique, which often suffers from two challenges. First, in the real world, atypical interactions may signal a user's temporary interest that differs from their general preference. Therefore, simply filtering out the atypical interactions as noise may be inappropriate and degrade the personalization of recommendations. Second, it is hard to acquire the temporary interest since there are no explicit supervision signals to indicate whether an interaction is atypical or not. To address these challenges, we propose a novel model called Temporary Interest Aware Recommendation (TIARec), which can distinguish atypical interactions from normal ones without supervision and capture the temporary interest as well as the general preference of users. In particular, we propose a reinforcement learning framework containing a recommender agent and an auxiliary classifier agent, which are jointly trained with the objective of maximizing the cumulative return of the recommendations made by the recommender agent. During the joint training process, the classifier agent can judge whether the interaction with an item recommended by the recommender agent is atypical, and the knowledge about learning temporary interest from atypical interactions can be transferred to the recommender agent, which enables the recommender agent to make recommendations on its own that balance the general preference and temporary interest of users. Finally, experiments conducted on real-world datasets verify the effectiveness of TIARec.
An Improved Reinforcement Learning Algorithm for Learning to Branch

Most combinatorial optimization problems can be formulated as mixed integer linear programming (MILP), in which branch-and-bound (B&B) is a general and widely used method. Recently, learning to branch has become a hot research topic at the intersection of machine learning and combinatorial optimization. In this paper, we propose a novel reinforcement learning-based B&B algorithm. Similar to offline reinforcement learning, we initially train on the demonstration data to accelerate learning massively. As training improves, the agent gradually starts to interact with the environment using its learned policy. Determining the mixing ratio between demonstration and self-generated data is critical to the performance of the algorithm. Thus, we propose a prioritized storage mechanism to control this ratio automatically. In order to improve the robustness of the training process, a superior network is additionally introduced based on Double DQN, which always serves as a Q-network with competitive performance. We evaluate the performance of the proposed algorithm over three public research benchmarks and compare it against strong baselines, including three classical heuristics and one state-of-the-art imitation learning-based branching algorithm. The results show that the proposed algorithm achieves the best performance among the compared algorithms and has the potential to improve B&B algorithm performance continuously.
Reward Relabelling for combined Reinforcement and Imitation Learning on sparse-reward tasks

During recent years, deep reinforcement learning (DRL) has made successful incursions into complex decision-making applications such as robotics, autonomous driving or video games. In the search for more sample-efficient algorithms, a promising direction is to leverage as much external off-policy data as possible. One staple of this data-driven approach is to learn from expert demonstrations. In the past, multiple ideas have been proposed to make good use of the demonstrations added to the replay buffer, such as pretraining on demonstrations only or minimizing additional cost functions. We present a new method, able to leverage demonstrations and episodes collected online in any sparse-reward environment with any off-policy algorithm. Our method is based on a reward bonus given to demonstrations and successful episodes, encouraging expert imitation and self-imitation. First, we give a reward bonus to the transitions coming from demonstrations to encourage the agent to match the demonstrated behaviour. Then, upon collecting a successful episode, we relabel its transitions with the same bonus before adding them to the replay buffer, encouraging the agent to also match its previous successes. Our experiments focus on manipulation robotics, specifically on three tasks for a 6 degrees-of-freedom robotic arm in simulation. We show that our method based on reward relabeling improves the performance of the base algorithm (SAC and DDPG) on these tasks, even in the absence of demonstrations. Furthermore, integrating into our method two improvements from previous works allows our approach to outperform all baselines.
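The relabelling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Transition` and `ReplayBuffer` names are hypothetical, and the sketch assumes the bonus is added to the environment reward, which the abstract does not specify.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool


def relabel(episode: List[Transition], bonus: float) -> List[Transition]:
    """Return copies of the transitions with the bonus added to each reward."""
    return [replace(t, reward=t.reward + bonus) for t in episode]


class ReplayBuffer:
    """Hypothetical buffer: demonstrations and successful episodes get the bonus."""

    def __init__(self, bonus: float):
        self.bonus = bonus
        self.storage: List[Transition] = []

    def add_demonstration(self, episode: List[Transition]) -> None:
        # Expert transitions always receive the bonus (expert imitation).
        self.storage.extend(relabel(episode, self.bonus))

    def add_online_episode(self, episode: List[Transition], success: bool) -> None:
        # Only successful online episodes are relabelled (self-imitation).
        self.storage.extend(relabel(episode, self.bonus) if success else episode)
```

Any off-policy learner can then sample from `storage` as usual; how success is detected is left to the environment.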
Reinforcement Learning to Solve NP-hard Problems: an Application to the CVRP

In this paper, we evaluate the use of Reinforcement Learning (RL) to solve a classic combinatorial optimization problem: the Capacitated Vehicle Routing Problem (CVRP). We formalize this problem in the RL framework and compare two of the most promising RL approaches with traditional solving techniques on a set of benchmark instances. We measure the different approaches by the quality of the solution returned and the time required to return it. We found that despite not returning the best solution, the RL approach has many advantages over traditional solvers. First, the versatility of the framework allows the resolution of more complex combinatorial problems. Moreover, instead of trying to solve a specific instance of the problem, the RL algorithm learns the skills required to solve the problem. The trained policy can then almost instantly provide a solution to an unseen problem without having to solve it from scratch. Finally, the use of trained models makes the RL solver by far the fastest, and therefore makes this approach better suited for commercial use, where the user experience is paramount. Techniques like Knowledge Transfer can also be used to improve the training efficiency of the algorithm and help solve bigger and more complex problems.
A Reliable Reinforcement Learning for Resource Allocation in Uplink NOMA-URLLC Networks

In this paper, we propose a deep state-action-reward-state-action (SARSA) $\lambda$ learning approach for optimising the uplink resource allocation in non-orthogonal multiple access (NOMA) aided ultra-reliable low-latency communication (URLLC). To reduce the mean decoding error probability in time-varying network environments, this work designs a reliable learning algorithm for providing a long-term resource allocation, where the reward feedback is based on the instantaneous network performance. With the aid of the proposed algorithm, this paper addresses three main challenges of reliable resource sharing in NOMA-URLLC networks: 1) user clustering; 2) instantaneous feedback system; and 3) optimal resource allocation. All of these designs interact with the considered communication environment. Lastly, we compare the performance of the proposed algorithm with conventional Q-learning and SARSA Q-learning algorithms. The simulation outcomes show that: 1) compared with traditional Q-learning algorithms, the proposed solution is able to converge within 200 episodes while providing a long-term mean error as low as $10^{-2}$; 2) NOMA-assisted URLLC outperforms traditional OMA systems in terms of decoding error probabilities; and 3) the proposed feedback system is efficient for the long-term learning process.
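For readers unfamiliar with SARSA($\lambda$), the sketch below shows the tabular version with accumulating eligibility traces. It is a simplification swapped in for illustration, not the paper's deep variant: the `step` callback, the function names, and all hyperparameter defaults are assumptions.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])


def sarsa_lambda_episode(q, step, start_state, n_actions, alpha=0.1,
                         gamma=0.9, lam=0.8, epsilon=0.1, rng=None,
                         max_steps=1000):
    """Run one episode, updating the table q in place.

    q is a defaultdict(float) keyed by (state, action); step(state, action)
    is a hypothetical environment callback returning (next_state, reward, done).
    """
    rng = rng or random.Random(0)
    traces = defaultdict(float)  # eligibility trace per (state, action)
    state = start_state
    action = epsilon_greedy(q, state, n_actions, epsilon, rng)
    for _ in range(max_steps):
        next_state, reward, done = step(state, action)
        next_action = epsilon_greedy(q, next_state, n_actions, epsilon, rng)
        # On-policy TD target: bootstrap from the action actually chosen next.
        target = reward if done else reward + gamma * q[(next_state, next_action)]
        delta = target - q[(state, action)]
        traces[(state, action)] += 1.0  # accumulating trace
        for key in list(traces):
            q[key] += alpha * delta * traces[key]  # credit recent pairs
            traces[key] *= gamma * lam             # decay all traces
        if done:
            break
        state, action = next_state, next_action
    return q
```

The trace decay `gamma * lam` is what lets a reward at the end of an episode update earlier state-action pairs in the same step, rather than only the most recent one.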
Dynamic Cooperative Vehicle Platoon Control Considering Longitudinal and Lane-changing Dynamics

This paper presents a distributed cascade Proportional Integral Derivative (DCPID) control algorithm for the connected and automated vehicle (CAV) platoon considering the heterogeneity of CAVs in terms of the inertial lag. Furthermore, a real-time dynamic cooperative lane-changing model for CAVs, which can seamlessly combine the DCPID algorithm and the improved sine function, is developed. The DCPID algorithm determines the appropriate longitudinal acceleration and speed of the lane-changing vehicle considering the speed fluctuations of the front vehicle on the target lane (TFV). In the meantime, the sine function plans a reference trajectory which is further updated in real time using model predictive control (MPC) to avoid potential collisions until lane-changing is completed. Both the local and the asymptotic stability conditions of the DCPID algorithm are mathematically derived, and the sensitivity of the DCPID control parameters under different states is analyzed. Simulation experiments are conducted to assess the performance of the proposed model, and the results indicate that the DCPID algorithm can provide robust control for tracking and adjusting the desired spacing and velocity for all 400 scenarios, even in a relatively extreme initial state. In addition, the proposed dynamic cooperative lane-changing model can guarantee effective and safe lane-changing at different speeds and even in emergency situations (such as the sudden deceleration of the TFV).
Recursive Constraints to Prevent Instability in Constrained Reinforcement Learning

We consider the challenge of finding a deterministic policy for a Markov decision process that uniformly (in all states) maximizes one reward subject to a probabilistic constraint over a different reward. Existing solutions do not fully address our precise problem definition, which nevertheless arises naturally in the context of safety-critical robotic systems. This class of problem is known to be hard, but the combined requirements of determinism and uniform optimality can create learning instability. In this work, after describing and motivating our problem with a simple example, we present a suitable constrained reinforcement learning algorithm that prevents learning instability, using recursive constraints. Our proposed approach admits an approximative form that improves efficiency and is conservative w.r.t. the constraint.
Conservative Distributional Reinforcement Learning with Safety Constraints

Safe exploration can be regarded as a constrained Markov decision problem where the expected long-term cost is constrained. Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique. However, the cost function of the above algorithms provides inaccurate estimations and causes instability in the learning of the Lagrange multiplier. In this paper, we present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO). First, to accurately judge whether the current situation satisfies the constraints, CDMPO adapts a distributional reinforcement learning method to estimate the Q-function and C-function. Then, CDMPO uses a conservative value function loss to reduce the number of constraint violations during the exploration process. In addition, we utilize Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably. Empirical results show that the proposed method has fewer constraint violations in the early exploration process. The final test results also illustrate that our method has better risk control.
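To illustrate the idea of updating a Lagrange multiplier with a PID controller, here is a minimal sketch of a plain PID rule driven by the constraint violation. It is not the paper's WAPID variant, whose weighting scheme the abstract does not describe; the class name, gains, and clamping choices are all assumptions.

```python
class PIDLagrangeMultiplier:
    """Hypothetical plain PID update for a non-negative Lagrange multiplier."""

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.0):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_violation = 0.0

    def update(self, episode_cost: float) -> float:
        # Positive violation means the measured cost exceeds the limit.
        violation = episode_cost - self.cost_limit
        # Integral term accumulates violations but is clamped at zero.
        self.integral = max(0.0, self.integral + violation)
        # Derivative term only reacts to a growing violation.
        derivative = max(0.0, violation - self.prev_violation)
        self.prev_violation = violation
        multiplier = (self.kp * violation
                      + self.ki * self.integral
                      + self.kd * derivative)
        return max(0.0, multiplier)  # the multiplier must stay non-negative
```

The returned multiplier would weight the cost term in the policy objective, so it grows while the constraint is violated and falls back to zero once the cost is under the limit.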
