Coding & Programming

Automatic tuning of hyper-parameters of reinforcement learning algorithms using Bayesian optimization with behavioral cloning

By Juan Cruz Barsce, Jorge A. Palombarini, Ernesto C. Martínez
arxiv.org
 4 days ago

Optimal setting of several hyper-parameters in machine learning algorithms is key to make the most of available data. To this aim, several methods such as evolutionary strategies, random search, Bayesian optimization and heuristic rules of thumb have been proposed. In reinforcement learning (RL), the information content of data gathered by the...

DIY Photography

Cloning lens elements using silicone and epoxy resin

I’ve been following the YouTube channel Breaking Taps for quite a long time. It’s a fun mix of science, engineering and technology covering physics and the occasional bit of chemistry with some just flat out cool experiments, crazy tech and fascinating slow motion in between – like that time when he made his own scanning laser microscope (it’s pretty cool, you should watch it).
SCIENCE
arxiv.org

Recent Advances in Reinforcement Learning in Finance

The rapid changes in the finance industry due to the increasing amount of data have revolutionized the techniques for data processing and data analysis and brought new theoretical and computational challenges. In contrast to classical stochastic control theory and other analytical approaches for solving financial decision-making problems that rely heavily on model assumptions, new developments from reinforcement learning (RL) are able to make full use of the large amount of financial data with fewer model assumptions and to improve decisions in complex financial environments. This survey paper aims to review the recent developments and use of RL approaches in finance. We give an introduction to Markov decision processes, which is the setting for many of the commonly used RL approaches. Various algorithms are then introduced with a focus on value and policy based methods that do not require any model assumptions. Connections are made with neural networks to extend the framework to encompass deep RL algorithms. Our survey concludes by discussing the application of these RL algorithms in a variety of decision-making problems in finance, including optimal execution, portfolio optimization, option pricing and hedging, market making, smart order routing, and robo-advising.
MARKETS
arxiv.org

Bayesian Optimal Two-sample Tests in High-dimension

We propose optimal Bayesian two-sample tests for testing equality of high-dimensional mean vectors and covariance matrices between two populations. In many applications including genomics and medical imaging, it is natural to assume that only a few entries of two mean vectors or covariance matrices are different. Many existing tests that rely on aggregating the difference between empirical means or covariance matrices are not optimal or yield low power under such setups. Motivated by this, we develop Bayesian two-sample tests employing a divide-and-conquer idea, which is powerful especially when the difference between two populations is sparse but large. The proposed two-sample tests manifest closed forms of Bayes factors and allow scalable computations even in high-dimensions. We prove that the proposed tests are consistent under relatively mild conditions compared to existing tests in the literature. Furthermore, the testable regions from the proposed tests turn out to be optimal in terms of rates. Simulation studies show clear advantages of the proposed tests over other state-of-the-art methods in various scenarios. Our tests are also applied to the analysis of the gene expression data of two cancer data sets.
SCIENCE
arxiv.org

A simulation driven optimization algorithm for scheduling sorting center operations

Parcel sorting operations in logistics enterprises aim to achieve a high throughput of parcels through sorting centers. These sorting centers are composed of large circular conveyor belts on which incoming parcels are placed, with multiple arms known as chutes for sorting the parcels by destination, followed by packing into roller cages and loading onto outbound trucks. Modern sorting systems need to complement their hardware innovations with sophisticated algorithms and software to map destinations and workforce to specific chutes. While state of the art systems operate with fixed mappings, we propose an optimization approach that runs before every shift, and uses real-time forecast of destination demand and labor availability in order to maximize throughput. We use simulation to improve the performance and robustness of the optimization solution to stochasticity in the environment, through closed-loop tuning of the optimization parameters.
INDUSTRY
arxiv.org

High-Dimensional Stock Portfolio Trading with Deep Reinforcement Learning

This paper proposes a Deep Reinforcement Learning algorithm for financial portfolio trading based on Deep Q-learning. The algorithm is capable of trading high-dimensional portfolios from cross-sectional datasets of any size which may include data gaps and non-unique history lengths in the assets. We sequentially set up environments by sampling one asset for each environment while rewarding investments with the resulting asset's return and cash reservation with the average return of the set of assets. This forces the agent to strategically assign capital to assets that it predicts to perform above-average. We apply our methodology in an out-of-sample analysis to 48 US stock portfolio setups, varying in the number of stocks from ten up to 500 stocks, in the selection criteria and in the level of transaction costs. The algorithm on average outperforms all considered passive and active benchmark investment strategies by a large margin using only one hyperparameter setup for all portfolios.
MARKETS
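
The reward rule described in the portfolio-trading abstract above (investing earns the sampled asset's return, holding cash earns the average return of the asset set) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `step_reward`, the action labels and the example returns are all hypothetical.

```python
from statistics import mean
from typing import Sequence


def step_reward(action: str, asset_return: float,
                universe_returns: Sequence[float]) -> float:
    """Reward for one step in a single-asset environment (hypothetical sketch).

    "invest" is rewarded with the sampled asset's own return;
    "cash" is rewarded with the average return of the whole asset set,
    so investing only pays off for assets that beat the average.
    """
    if action == "invest":
        return asset_return
    if action == "cash":
        return mean(universe_returns)
    raise ValueError(f"unknown action: {action!r}")


# Made-up one-period returns for a three-asset universe.
universe = [0.02, -0.01, 0.05]
for r in universe:
    # The advantage of investing over holding cash is positive only
    # for above-average assets.
    advantage = step_reward("invest", r, universe) - step_reward("cash", r, universe)
    print(f"return={r:+.2f}  advantage of investing={advantage:+.4f}")
```

Under this rule, choosing "invest" is preferable exactly when the asset's return exceeds the cross-sectional average, which is the behavior the abstract says the reward design is meant to encourage.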
arxiv.org

Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach

A main research goal in various studies is to use an observational data set and provide a new set of counterfactual guidelines that can yield causal improvements. Dynamic Treatment Regimes (DTRs) are widely studied to formalize this process. However, available methods in finding optimal DTRs often rely on assumptions that are violated in real-world applications (e.g., medical decision-making or public policy), especially when (a) the existence of unobserved confounders cannot be ignored, and (b) the unobserved confounders are time-varying (e.g., affected by previous actions). When such assumptions are violated, one often faces ambiguity regarding the underlying causal model that needs to be assumed to obtain an optimal DTR. This ambiguity is inevitable, since the dynamics of unobserved confounders and their causal impact on the observed part of the data cannot be understood from the observed data. Motivated by a case study of finding superior treatment regimes for patients who underwent transplantation in our partner hospital and faced a medical condition known as New Onset Diabetes After Transplantation (NODAT), we extend DTRs to a new class termed Ambiguous Dynamic Treatment Regimes (ADTRs), in which the causal impact of treatment regimes is evaluated based on a "cloud" of potential causal models. We then connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) proposed by Saghafian (2018), and develop two Reinforcement Learning methods termed Direct Augmented V-Learning (DAV-Learning) and Safe Augmented V-Learning (SAV-Learning), which enable using the observed data to efficiently learn an optimal treatment regime. We establish theoretical results for these learning methods, including (weak) consistency and asymptotic normality. We further evaluate the performance of these learning methods both in our case study and in simulation experiments.
HEALTH
towardsdatascience.com

12 Useful Algorithms for 12 Days of Christmas

Really cool algorithms that all Data Scientists should know. Introduction. It’s that time of the year again! This time, instead of 12 Data Science Projects for 12 Days of...
TECHNOLOGY
arxiv.org

A Fully Single Loop Algorithm for Bilevel Optimization without Hessian Inverse

In this paper, we propose a new Hessian inverse free Fully Single Loop Algorithm (FSLA) for bilevel optimization problems. Classic algorithms for bilevel optimization admit a double loop structure which is computationally expensive. Recently, several single loop algorithms have been proposed that optimize the inner and outer variables alternately. However, these algorithms are not yet fully single loop, as they overlook the loop needed to evaluate the hyper-gradient for a given inner and outer state. In order to develop a fully single loop algorithm, we first study the structure of the hyper-gradient and identify a general approximation formulation of hyper-gradient computation that encompasses several previous common approaches, e.g. back-propagation through time, conjugate gradient, etc. Based on this formulation, we introduce a new state variable to maintain the historical hyper-gradient information. Combining our new formulation with the alternating update of the inner and outer variables, we propose an efficient fully single loop algorithm. We theoretically show that the error generated by the new state can be bounded and our algorithm converges with the rate of $O(\epsilon^{-2})$. Finally, we verify the efficacy of our algorithm empirically through multiple bilevel optimization based machine learning tasks.
MATHEMATICS
arxiv.org

Optimizing Write Fidelity of MRAMs via Iterative Water-filling Algorithm

Magnetic random-access memory (MRAM) is a promising memory technology due to its high density, non-volatility, and high endurance. However, achieving high memory fidelity incurs significant write-energy costs, which should be reduced for large-scale deployment of MRAMs. In this paper, we formulate a biconvex optimization problem to optimize write fidelity given energy and latency constraints. The basic idea is to allocate non-uniform write pulses depending on the importance of each bit position. The fidelity measure we consider is mean squared error (MSE), for which we optimize write pulses via alternating convex search (ACS). By using Karush-Kuhn-Tucker (KKT) conditions, we derive analytic solutions and propose an iterative water-filling-type algorithm by leveraging the analytic solutions. Hence, the proposed iterative water-filling algorithm is computationally more efficient than the original ACS while their solutions are identical. Although the original ACS and the proposed iterative water-filling algorithm do not guarantee global optimality, the MSEs obtained by the proposed algorithm are comparable to the MSEs by complicated global nonlinear programming solvers. Furthermore, we prove that the proposed algorithm can reduce the MSE exponentially with the number of bits per word. For an 8-bit accessed word, the proposed algorithm reduces the MSE by a factor of 21. We also evaluate the proposed algorithm for MNIST dataset classification supposing that the model parameters of deep neural networks are stored in MRAMs. The numerical results show that the optimized write pulses can achieve 40% write energy reduction for a given classification accuracy.
COMPUTERS
arxiv.org

Learning Reinforced Dynamic Representations for Sequential Recommendation

Recently, sequential recommendation systems have become important in addressing information overload in many online services. Current methods in sequential recommendation focus on learning a fixed number of representations for each user at any time, with a single representation or multi-interest representations for the user. However, when a user is exploring items on an e-commerce recommendation system, the number of this user's interests may change over time (e.g. increase or decrease by one interest), affected by the user's evolving needs. Moreover, different users may have different numbers of interests. In this paper, we argue that it is meaningful to explore a personalized dynamic number of user interests, and learn a dynamic group of user interest representations accordingly. We propose a Reinforced sequential model with dynamic number of interest representations for recommendation systems (RDRSR). Specifically, RDRSR is composed of a dynamic interest discriminator (DID) module and a dynamic interest allocator (DIA) module. The DID module explores the number of a user's interests by learning the overall sequential characteristics with bi-directional self-attention and Gumbel-Softmax. The DIA module allocates the historical clicked items into a group of sub-sequences and constructs the user's dynamic interest representations. We formalize the allocation problem as a Markov Decision Process (MDP), and sample an action from policy pi for each item to determine which sub-sequence it belongs to. Additionally, experiments on real-world datasets demonstrate our model's effectiveness.
INTERNET
arxiv.org

A Reinforcement Learning-based Adaptive Control Model for Future Street Planning, An Algorithm and A Case Study

With the emerging technologies in Intelligent Transportation System (ITS), the adaptive operation of road space is likely to be realised within decades. An intelligent street can learn and improve its decision-making on the right-of-way (ROW) for road users, liberating more active pedestrian space while maintaining traffic safety and efficiency. However, there is a lack of effective controlling techniques for these adaptive street infrastructures. To fill this gap in existing studies, we formulate this control problem as a Markov Game and develop a solution based on the multi-agent Deep Deterministic Policy Gradient (MADDPG) algorithm. The proposed model can dynamically assign ROW for sidewalks, autonomous vehicles (AVs) driving lanes and on-street parking areas in real-time. Integrated with the SUMO traffic simulator, this model was evaluated using the road network of the South Kensington District against three cases of divergent traffic conditions: pedestrian flow rates, AVs traffic flow rates and parking demands. Results reveal that our model can achieve an average reduction of 3.87% and 6.26% in street space assigned for on-street parking and vehicular operations. Combined with space gained by limiting the number of driving lanes, the average proportion of sidewalks to total widths of streets can significantly increase by 10.13%.
TECHNOLOGY
arxiv.org

Reinforcement Learning for Navigation of Mobile Robot with LiDAR

This paper presents a technique for navigation of a mobile robot using a Deep Q-Network (DQN) combined with a Gated Recurrent Unit (GRU). The DQN integrated with the GRU allows action skipping for improved navigation performance. The technique aims at efficient navigation of mobile robots such as autonomous parking robots. A reinforcement learning framework can be applied to the DQN combined with the GRU in a real environment, which can be modeled as a Partially Observable Markov Decision Process (POMDP). By allowing action skipping, the ability of the DQN combined with the GRU to learn key actions can be improved. The proposed algorithm is evaluated in the ROS-Gazebo simulator to explore the feasibility of the solution in a real environment, and the simulation results show that it achieves improved navigation and collision avoidance compared to DQN alone and to DQN combined with GRU without action skipping.
COMPUTERS
arxiv.org

Greedy-based Value Representation for Optimal Coordination in Multi-agent Reinforcement Learning

Due to the representation limitation of the joint Q value function, multi-agent reinforcement learning (MARL) methods with linear or monotonic value decomposition suffer from relative overgeneralization. As a result, they cannot ensure optimal coordination. Existing methods address relative overgeneralization by achieving complete expressiveness or learning a bias, which is insufficient to solve the problem. In this paper, we propose optimal consistency, a criterion to evaluate the optimality of coordination. To achieve optimal consistency, we introduce the True-Global-Max (TGM) principle for linear and monotonic value decomposition, where the TGM principle can be ensured when the optimal stable point is the unique stable point. Therefore, we propose the greedy-based value representation (GVR) to ensure the optimal stable point via inferior target shaping and eliminate the non-optimal stable points via superior experience replay. Theoretical proofs and empirical results demonstrate that our method can ensure optimal consistency under sufficient exploration. In experiments on various benchmarks, GVR significantly outperforms state-of-the-art baselines.
COMPUTERS
arxiv.org

Organ localisation using supervised and semi supervised approaches combining reinforcement learning with imitation learning

Computer aided diagnostics often requires analysis of a region of interest (ROI) within a radiology scan, and the ROI may be an organ or a suborgan. Although deep learning algorithms have the ability to outperform other methods, they rely on the availability of a large amount of annotated data. Motivated by the need to address this limitation, an approach to localisation and detection of multiple organs based on supervised and semi-supervised learning is presented here. It draws upon previous work by the authors on localising the thoracic and lumbar spine region in CT images. The method generates six bounding boxes of organs of interest, which are then fused to a single bounding box. The results of experiments on localisation of the spleen and the left and right kidneys in CT images using supervised and semi-supervised learning (SSL) demonstrate the ability to address data limitations with a much smaller data set and fewer annotations, compared to other state-of-the-art methods. The SSL performance was evaluated using three different mixes of labelled and unlabelled data (i.e., 30:70, 35:65, 40:60) for each of the lumbar spine, spleen, and left and right kidneys, respectively. The results indicate that SSL provides a workable alternative, especially in medical imaging where it is difficult to obtain annotated data.
SCIENCE
towardsdatascience.com

Bayesian Price Optimization with PyMC3

In this article, we’re going to explore price optimization from the Bayesian perspective. So what is price optimization? It’s optimizing the price of a good or service given costs and revenue. Revenue is generally subject to the “demand curve,” which is simply a relation between price and units demanded by consumers. A price that is too low would attract anyone and everyone in the market but wouldn’t bring in enough revenue to outweigh the costs. But the inverse is also true: if the price is too high, few customers (if any) will be attracted, so costs could outweigh revenue all the same. Hence the task of optimization.
MARKETS
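
The trade-off in the price-optimization summary above (too low a price fails to cover costs, too high a price drives customers away) can be made concrete with a toy calculation. This is a deterministic sketch only, not the article's PyMC3 model: the linear demand curve, its coefficients, the unit cost and the function names are all assumptions made for illustration, and no Bayesian inference is performed here.

```python
def demand(price: float, intercept: float = 100.0, slope: float = 2.0) -> float:
    """Assumed linear demand curve: units sold fall as price rises (never below 0)."""
    return max(intercept - slope * price, 0.0)


def profit(price: float, unit_cost: float = 10.0) -> float:
    """Profit = (price - unit cost) * units demanded at that price."""
    return (price - unit_cost) * demand(price)


def best_price(candidates):
    """Return the candidate price with the highest profit (simple grid search)."""
    return max(candidates, key=profit)


# Candidate prices from 0.0 to 50.0 in steps of 0.5.
prices = [p / 2 for p in range(0, 101)]
p_star = best_price(prices)
print(p_star, profit(p_star))
```

With these assumed numbers, profit is negative below the unit cost, zero once demand vanishes at a price of 50, and peaks in between. A Bayesian treatment would replace the fixed demand coefficients with a posterior distribution estimated from sales data and optimize expected profit instead.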
arxiv.org

Federated Deep Reinforcement Learning for the Distributed Control of NextG Wireless Networks

Next Generation (NextG) networks are expected to support demanding tactile internet applications such as augmented reality and connected autonomous vehicles. Whereas recent innovations bring the promise of larger link capacity, their sensitivity to the environment and erratic performance defy traditional model-based control rationales. Zero-touch data-driven approaches can improve the ability of the network to adapt to the current operating conditions. Tools such as reinforcement learning (RL) algorithms can build optimal control policy solely based on a history of observations. Specifically, deep RL (DRL), which uses a deep neural network (DNN) as a predictor, has been shown to achieve good performance even in complex environments and with high dimensional inputs. However, the training of DRL models requires a large amount of data, which may limit its adaptability to ever-evolving statistics of the underlying environment. Moreover, wireless networks are inherently distributed systems, where centralized DRL approaches would require excessive data exchange, while fully distributed approaches may result in slower convergence rates and performance degradation. In this paper, to address these challenges, we propose a federated learning (FL) approach to DRL, which we refer to as federated DRL (F-DRL), where base stations (BS) collaboratively train the embedded DNN by only sharing models' weights rather than training data. We evaluate two distinct versions of F-DRL, value and policy based, and show the superior performance they achieve compared to distributed and centralized DRL.
COMPUTERS
arxiv.org

Data-driven forward-inverse problems and modulational instability for Yajima-Oikawa system using deep learning with parameter regularization

We investigate data-driven forward-inverse problems for the Yajima-Oikawa (YO) system by employing two techniques that improve the performance of deep physics-informed neural networks (PINNs), namely neuron-wise locally adaptive activation functions and L2 norm parameter regularization. In particular, we not only recover three different forms of vector rogue waves (RWs) in the forward problem of the YO system, including bright-bright RWs, intermediate-bright RWs and dark-bright RWs, but also study the inverse problem of the YO system using data-driven methods with noise of different intensities. Compared with the PINN method using only locally adaptive activation functions, the PINN method with both strategies shows strong robustness when studying the inverse problem of the YO system with noisy training data; that is, the improved PINN model has excellent noise immunity. The asymptotic analysis of the wavenumber k and the modulational instability (MI) analysis for the YO system with unknown parameters are derived systematically by applying linearized instability analysis to the plane wave.
COMPUTERS
arxiv.org

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning which captures underlying relations between action and representation, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.
VIDEO GAMES
arxiv.org

Application of Deep Reinforcement Learning to Payment Fraud

The large variety of digital payment choices available to consumers today has been a key driver of e-commerce transactions in the past decade. Unfortunately, this has also given rise to cybercriminals and fraudsters who are constantly looking for vulnerabilities in these systems by deploying increasingly sophisticated fraud attacks. A typical fraud detection system employs standard supervised learning methods where the focus is on maximizing the fraud recall rate. However, we argue that such a formulation can lead to sub-optimal solutions. The design requirements for these fraud models require that they be robust to the high class imbalance in the data, adaptive to changes in fraud patterns, able to maintain a balance between the fraud rate and the decline rate to maximize revenue, and amenable to asynchronous feedback, since usually there is a significant lag between the transaction and the fraud realization. To achieve this, we formulate fraud detection as a sequential decision-making problem by including the utility maximization within the model in the form of the reward function. The historical decline rate and fraud rate define the state of the system with a binary action space composed of approving or declining the transaction. In this study, we primarily focus on utility maximization and explore different reward functions to this end. The performance of the proposed Reinforcement Learning system has been evaluated for two publicly available fraud datasets using Deep Q-learning and compared with different classifiers. We aim to address the rest of the issues in future work.
TECHNOLOGY
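
The formulation in the payment-fraud abstract above (state given by the historical decline rate and fraud rate, a binary approve/decline action, and utility encoded in the reward) might be sketched as follows. The abstract does not specify the reward functions it explores, so the reward below, the `margin` parameter, the class and function names, and the choice to measure the fraud rate over approved transactions are all hypothetical assumptions, and the delayed (asynchronous) fraud feedback is not modelled.

```python
from dataclasses import dataclass

APPROVE, DECLINE = 1, 0  # binary action space


@dataclass
class RateState:
    """Running counts from which the two state features are derived."""
    approved: int = 0
    declined: int = 0
    frauds: int = 0  # approved transactions that turned out fraudulent

    def update(self, action: int, is_fraud: bool) -> None:
        if action == APPROVE:
            self.approved += 1
            self.frauds += int(is_fraud)
        else:
            self.declined += 1

    @property
    def decline_rate(self) -> float:
        total = self.approved + self.declined
        return self.declined / total if total else 0.0

    @property
    def fraud_rate(self) -> float:
        # Assumption: fraud rate is measured over approved transactions only.
        return self.frauds / self.approved if self.approved else 0.0


def reward(action: int, amount: float, is_fraud: bool, margin: float = 0.02) -> float:
    """Hypothetical utility: earn a margin on approved genuine payments,
    lose the full amount on approved fraud, earn nothing when declining."""
    if action == DECLINE:
        return 0.0
    return -amount if is_fraud else margin * amount


state = RateState()
for action, is_fraud in [(APPROVE, False), (APPROVE, True), (DECLINE, False)]:
    state.update(action, is_fraud)
print(state.decline_rate, state.fraud_rate)
```

A Deep Q-learning agent would take `(decline_rate, fraud_rate)` as its observation and learn which action maximizes the cumulative reward; changing `reward` is where different utility trade-offs would be expressed.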
arxiv.org

Control Parameters Considered Harmful: Detecting Range Specification Bugs in Drone Configuration Modules via Learning-Guided Search

In order to support a variety of missions and deal with different flight environments, drone control programs typically provide configurable control parameters. However, such flexibility introduces vulnerabilities. One such vulnerability, referred to as range specification bugs, has been recently identified. The vulnerability originates from the fact that even though each individual parameter receives a value in the recommended value range, certain combinations of parameter values may affect the drone's physical stability. In this paper we develop a novel learning-guided search system to find such combinations, which we refer to as incorrect configurations. Our system applies metaheuristic search algorithms mutating configurations to detect the configuration parameters that have values driving the drone to unstable physical states. To guide the mutations, our system leverages a machine learning predictor as the fitness evaluator. Finally, by utilizing multi-objective optimization, our system returns the feasible ranges based on the mutation search results. Because in our system the mutations are guided by a predictor, evaluating the parameter configurations does not require realistic/simulation executions. Therefore, our system supports a comprehensive and yet efficient detection of incorrect configurations. We have carried out an experimental evaluation of our system. The evaluation results show that the system successfully reports potentially incorrect configurations, of which over 85% lead to actual unstable physical states.
ELECTRONICS

