DNN gradient lossless compression: Can GenNorm be the answer?

By Zhong-Jing Chen, Eduin E. Hernandez, Yu-Chih Huang, Stefano Rini
 5 days ago

In this paper, the problem of optimal gradient lossless compression in Deep Neural Network (DNN) training is considered. Gradient compression is relevant in many distributed DNN training scenarios, including the recently popular federated learning (FL) scenario in which each remote user is connected to the parameter server (PS) through...

A Framework for Routing DNN Inference Jobs over Distributed Computing Networks

Ubiquitous artificial intelligence (AI) is considered one of the key services in 6G systems. AI services typically rely on deep neural networks (DNNs), which require heavy computation. Hence, in order to support ubiquitous AI, it is crucial to provide a solution for offloading or distributing the computational burden due to DNNs, especially at end devices with limited resources. We develop a framework for assigning the computation tasks of DNN inference jobs to the nodes with computing resources in the network, so as to reduce the inference latency in the presence of limited computing power at end devices. To this end, we propose a layered graph model that makes it possible to solve the problem of assigning computation tasks of a single DNN inference job via simple conventional routing. Using this model, we develop algorithms for routing DNN inference jobs over the distributed computing network. We show through numerical evaluations that our algorithms can select nodes and paths adaptively to the computational attributes of given DNN inference jobs in order to reduce the end-to-end latency.
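
The layered graph idea can be illustrated with a small sketch. This is not the authors' algorithm, only one plausible reading of the abstract: a state `(node, layer)` means "the job is at `node` with `layer` DNN layers already computed", moving between nodes costs transmission time, and staying on a node while advancing one layer costs compute time. The function name `route_inference_job`, the cost model, and all numbers below are assumptions made for illustration.

```python
import heapq

def route_inference_job(links, compute_time, output_size, source, dest):
    """Run Dijkstra on a layered graph (hypothetical cost model).

    links:        dict (u, v) -> bandwidth of the directed link u -> v
    compute_time: dict node -> list of per-layer compute times
                  (nodes absent from the dict cannot compute)
    output_size:  list of length L + 1; output_size[l] is the data size
                  after l layers have been computed (index 0 = raw input)
    Returns (latency, path) where path is a list of (node, layer) states.
    """
    num_layers = len(output_size) - 1
    start, goal = (source, 0), (dest, num_layers)
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if state == goal:
            break
        if d > dist.get(state, float("inf")):
            continue  # stale heap entry
        node, layer = state
        neighbors = []
        # Communication edges: same layer index, different node.
        for (u, v), bandwidth in links.items():
            if u == node:
                neighbors.append(((v, layer), output_size[layer] / bandwidth))
        # Computation edge: same node, next layer index.
        if layer < num_layers and node in compute_time:
            neighbors.append(((node, layer + 1), compute_time[node][layer]))
        for nxt, cost in neighbors:
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = state
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return None, []
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]

# Toy network (made-up numbers): a slow end device and a faster edge server.
links = {("device", "edge"): 10.0, ("edge", "device"): 10.0}
compute_time = {"device": [5.0, 5.0], "edge": [1.0, 1.0]}
output_size = [20.0, 10.0, 1.0]  # input, after layer 1, after layer 2
latency, path = route_inference_job(links, compute_time, output_size,
                                    "device", "device")
print(latency, path)
```

In this toy instance, the shortest path offloads both layers to the edge server and returns the result to the device, which is cheaper than computing either layer locally.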
Training Generative Adversarial Networks with Adaptive Composite Gradient

The wide application of generative adversarial networks (GANs) benefits from successful training methods, which guarantee that an objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method is still a challenging task due to the cyclic behaviors of some gradient-based methods and the expensive computational cost of methods based on the Hessian matrix. This paper proposes the Adaptive Composite Gradient (ACG) method, which is linearly convergent in bilinear games under suitable settings. Theory and toy-function experiments suggest that our approach can alleviate the cyclic behaviors and converge faster than recently proposed algorithms. Significantly, the ACG method can be used to find stable fixed points not only in bilinear games but also in general games. The ACG method is a novel semi-gradient-free algorithm since it does not need to calculate the gradient at each step, reducing the computational cost of gradient and Hessian evaluations by utilizing the predictive information in future iterations. We conducted two mixture-of-Gaussians experiments by integrating ACG into existing algorithms with Linear GANs. Results show ACG is competitive with the previous algorithms. Realistic experiments on four prevalent data sets (MNIST, Fashion-MNIST, CIFAR-10, and CelebA) with DCGANs show that our ACG method outperforms several baselines, which illustrates the superiority and efficacy of our method.
Time Discretization-Invariant Safe Action Repetition for Policy Gradient Methods

In reinforcement learning, continuous time is often discretized by a time scale $\delta$, to which the resulting performance is known to be highly sensitive. In this work, we seek to find a $\delta$-invariant algorithm for policy gradient (PG) methods, which performs well regardless of the value of $\delta$. We first identify the underlying reasons that cause PG methods to fail as $\delta \to 0$, proving that the variance of the PG estimator can diverge to infinity in stochastic environments under a certain assumption of stochasticity. While durative actions or action repetition can be employed to have $\delta$-invariance, previous action repetition methods cannot immediately react to unexpected situations in stochastic environments. We thus propose a novel $\delta$-invariant method named Safe Action Repetition (SAR) applicable to any existing PG algorithm. SAR can handle the stochasticity of environments by adaptively reacting to changes in states during action repetition. We empirically show that our method is not only $\delta$-invariant but also robust to stochasticity, outperforming previous $\delta$-invariant approaches on eight MuJoCo environments with both deterministic and stochastic settings. Our code is available at this https URL.
How To Compress Videos on iPhone To Save Storage

iPhones are known for their high-quality cameras, which can shoot and record high-end videos. Users who own iPhones with ample storage generally do not run out of space, but those who own iPhones with standard storage capacity do face this issue. Because they have to manage the storage...
Double Control Variates for Gradient Estimation in Discrete Latent Variable Models

Stochastic gradient-based optimisation for discrete latent variable models is challenging due to the high variance of gradients. We introduce a variance reduction technique for score function estimators that makes use of double control variates. These control variates act on top of a main control variate, and try to further reduce the variance of the overall estimator. We develop a double control variate for the REINFORCE leave-one-out estimator using Taylor expansions. For training discrete latent variable models, such as variational autoencoders with binary latent variables, our approach adds no extra computational cost compared to standard training with the REINFORCE leave-one-out estimator. We apply our method to challenging high-dimensional toy examples and training variational autoencoders with binary latent variables. We show that our estimator can have lower variance compared to other state-of-the-art estimators.
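
For context, the base REINFORCE leave-one-out estimator mentioned in the abstract can be sketched as follows. This shows only the standard leave-one-out baseline, not the double control variate the paper proposes. The factorised Bernoulli model with logits `theta`, the function name `rloo_gradient`, and the toy objective are assumptions for illustration.

```python
import numpy as np

def rloo_gradient(theta, f, num_samples, rng):
    """Leave-one-out REINFORCE estimate of
    d/dtheta E_{b ~ Bernoulli(sigmoid(theta))}[f(b)].

    theta: 1-D array of logits; f: maps a binary vector to a scalar.
    Requires num_samples >= 2.
    """
    probs = 1.0 / (1.0 + np.exp(-theta))
    # Draw K independent binary vectors.
    b = (rng.random((num_samples, theta.size)) < probs).astype(float)
    fvals = np.array([f(row) for row in b], dtype=float)
    # Baseline for sample k: mean of f over the other K - 1 samples.
    baseline = (fvals.sum() - fvals) / (num_samples - 1)
    # Score function of a Bernoulli with logit theta: b - sigmoid(theta).
    score = b - probs
    return ((fvals - baseline)[:, None] * score).mean(axis=0)

rng = np.random.default_rng(0)
theta = np.zeros(3)
# Toy objective f(b) = sum(b); the exact gradient is sigmoid'(0) = 0.25 per logit.
print(rloo_gradient(theta, lambda b: b.sum(), 20000, rng))
```

Because each sample's baseline is built only from the other samples, it is independent of that sample's score, so the estimator stays unbiased while its variance is reduced.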
Estimating High Order Gradients of the Data Distribution by Denoising

The first order derivative of a data density can be estimated efficiently by denoising score matching, and has become an important component in many applications, such as image generation and audio synthesis. Higher order derivatives provide additional local information about the data distribution and enable new applications. Although they can be estimated via automatic differentiation of a learned density model, this can amplify estimation errors and is expensive in high dimensional settings. To overcome these limitations, we propose a method to directly estimate high order derivatives (scores) of a data density from samples. We first show that denoising score matching can be interpreted as a particular case of Tweedie's formula. By leveraging Tweedie's formula on higher order moments, we generalize denoising score matching to estimate higher order derivatives. We demonstrate empirically that models trained with the proposed method can approximate second order derivatives more efficiently and accurately than via automatic differentiation. We show that our models can be used to quantify uncertainty in denoising and to improve the mixing speed of Langevin dynamics via Ozaki discretization for sampling synthetic data and natural images.
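
The link between denoising and derivatives of the log-density can be written out for the common Gaussian-noise case. The notation below is an assumption and not taken from the paper: $\tilde{x} = x + \sigma z$ with $z \sim \mathcal{N}(0, I)$, and $q_\sigma$ is the density of the noisy variable $\tilde{x}$. The first line is Tweedie's formula; the second is the standard second-order analogue relating the posterior covariance to the Hessian of $\log q_\sigma$.

```latex
% Assumed noise model: \tilde{x} = x + \sigma z, \quad z \sim \mathcal{N}(0, I),
% with q_\sigma the marginal density of \tilde{x}.
\begin{align}
\mathbb{E}[x \mid \tilde{x}]
  &= \tilde{x} + \sigma^2 \, \nabla_{\tilde{x}} \log q_\sigma(\tilde{x}), \\
\operatorname{Cov}[x \mid \tilde{x}]
  &= \sigma^2 I + \sigma^4 \, \nabla_{\tilde{x}}^2 \log q_\sigma(\tilde{x}).
\end{align}
```

As a one-dimensional sanity check, if $x \sim \mathcal{N}(0, \tau^2)$ then $q_\sigma = \mathcal{N}(0, \tau^2 + \sigma^2)$, so $\nabla^2 \log q_\sigma = -1/(\tau^2 + \sigma^2)$ and the second line gives $\sigma^2 \tau^2 / (\tau^2 + \sigma^2)$, which matches the known Gaussian posterior variance.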
Self-Compression in Bayesian Neural Networks

Machine learning models have achieved human-level performance on various tasks. This success comes at a high cost of computation and storage overhead, which makes machine learning algorithms difficult to deploy on edge devices. Typically, one has to partially sacrifice accuracy in favor of an increased performance quantified in terms of reduced memory usage and energy consumption. Current methods compress the networks by reducing the precision of the parameters or by eliminating redundant ones. In this paper, we propose a new insight into network compression through the Bayesian framework. We show that Bayesian neural networks automatically discover redundancy in model parameters, thus enabling self-compression, which is linked to the propagation of uncertainty through the layers of the network. Our experimental results show that the network architecture can be successfully compressed by deleting parameters identified by the network itself while retaining the same level of accuracy.
ResNEsts and DenseNEsts: Block-based DNN Models with Improved Representation Guarantees

Models recently used in the literature proving residual networks (ResNets) are better than linear predictors are actually different from standard ResNets that have been widely used in computer vision. In addition to the assumptions such as scalar-valued output or single residual block, these models have no nonlinearities at the final residual representation that feeds into the final affine layer. To codify such a difference in nonlinearities and reveal a linear estimation property, we define ResNEsts, i.e., Residual Nonlinear Estimators, by simply dropping nonlinearities at the last residual representation from standard ResNets. We show that wide ResNEsts with bottleneck blocks can always guarantee a very desirable training property that standard ResNets aim to achieve, i.e., adding more blocks does not decrease performance given the same set of basis elements. To prove that, we first recognize ResNEsts are basis function models that are limited by a coupling problem in basis learning and linear prediction. Then, to decouple prediction weights from basis learning, we construct a special architecture termed augmented ResNEst (A-ResNEst) that always guarantees no worse performance with the addition of a block. As a result, such an A-ResNEst establishes empirical risk lower bounds for a ResNEst using corresponding bases. Our results demonstrate ResNEsts indeed have a problem of diminishing feature reuse; however, it can be avoided by sufficiently expanding or widening the input space, leading to the above-mentioned desirable property. Inspired by the DenseNets that have been shown to outperform ResNets, we also propose a corresponding new model called Densely connected Nonlinear Estimator (DenseNEst). We show that any DenseNEst can be represented as a wide ResNEst with bottleneck blocks. Unlike ResNEsts, DenseNEsts exhibit the desirable property without any special architectural re-design.
This Week’s Unboxing: Gradient Boosted Models’ “Black Box”

Gradient boosted models are often labelled as black box examples of machine learning algorithms. A black box model is defined on Wikipedia as a “system which can be viewed in terms of its inputs and outputs […] without any knowledge of its internal workings”. Is this actually an accurate depiction of GBMs? Hardly, I would say. You can look into every single line of code in open-source packages like XGBoost, and export detailed tree structures.
Bolstering Stochastic Gradient Descent with Model Building

The stochastic gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates for solving machine learning problems. These rates are obtained especially when these algorithms are fine-tuned for the application at hand. Although this tuning process can require large computational costs, recent work has shown that these costs can be reduced by line search methods that iteratively adjust the stepsize. We propose an alternative approach to stochastic line search by using a new algorithm based on forward step model building. This model building step incorporates second-order information that allows adjusting not only the stepsize but also the search direction. Noting that deep learning model parameters come in groups (layers of tensors), our method builds its model and calculates a new step for each parameter group. This novel diagonalization approach makes the selected step lengths adaptive. We provide convergence rate analysis, and experimentally show that the proposed algorithm achieves faster convergence and better generalization in most problems. Moreover, our experiments show that the proposed method is quite robust as it converges for a wide range of initial stepsizes.
Predictive coding, precision and natural gradients

There is an increasing convergence between biologically plausible computational models of inference and learning with local update rules and the global gradient-based optimization of neural network models employed in machine learning. One particularly exciting connection is the correspondence between the locally informed optimization in predictive coding networks and the error backpropagation algorithm that is used to train state-of-the-art deep artificial neural networks. Here we focus on the related, but still largely under-explored connection between precision weighting in predictive coding networks and the Natural Gradient Descent algorithm for deep neural networks. Precision-weighted predictive coding is an interesting candidate for scaling up uncertainty-aware optimization, particularly for models with large parameter spaces, due to the distributed nature of its optimization process and the underlying local approximation of the Fisher information metric, the adaptive learning rate that is central to Natural Gradient Descent. Here, we show that hierarchical predictive coding networks with learnable precision are indeed able to solve various supervised and unsupervised learning tasks with performance comparable to global backpropagation with natural gradients, and that they outperform their classical gradient descent counterpart on tasks where high amounts of noise are embedded in data or label inputs. When applied to unsupervised auto-encoding of image inputs, the deterministic network produces hierarchically organized and disentangled embeddings, hinting at the close connections between predictive coding and hierarchical variational inference.
Optimal control of PDEs using physics-informed neural networks

Physics-informed neural networks (PINNs) have recently become a popular method for solving forward and inverse problems governed by partial differential equations (PDEs). By incorporating the residual of the PDE into the loss function of a neural network-based surrogate model for the unknown state, PINNs can seamlessly blend measurement data with physical constraints. Here, we extend this framework to PDE-constrained optimal control problems, for which the governing PDE is fully known and the goal is to find a control variable that minimizes a desired cost objective. We provide a set of guidelines for obtaining a good optimal control solution; first by ensuring that the PDE remains well satisfied during the training process, second by assessing rigorously the quality of the computed optimal control. We then validate the performance of the PINN framework by comparing it to adjoint-based nonlinear optimal control, which performs gradient descent on the discretized control variable while satisfying the discretized PDE. This comparison is carried out on several distributed control examples based on the Laplace, Burgers, Kuramoto-Sivashinsky, and Navier-Stokes equations. Finally, we discuss the advantages and caveats of using the PINN and adjoint-based approaches for solving optimal control problems constrained by nonlinear PDEs.
An Inexact Riemannian Proximal Gradient Method

This paper considers the problem of minimizing the sum of a differentiable function and a nonsmooth function on a Riemannian manifold. In recent years, the proximal gradient method and its variants have been generalized to the Riemannian setting for solving such problems. Different approaches to generalizing the proximal mapping to the Riemannian setting lead to different versions of Riemannian proximal gradient methods. However, their convergence analyses all rely on solving their Riemannian proximal mapping exactly, which is either too expensive or impracticable. In this paper, we study the convergence of an inexact Riemannian proximal gradient method. It is proven that if the proximal mapping is solved sufficiently accurately, then the global convergence and local convergence rate based on the Riemannian Kurdyka-Łojasiewicz property can be guaranteed. Moreover, practical conditions on the accuracy for solving the Riemannian proximal mapping are provided. As a byproduct, the proximal gradient method on the Stiefel manifold proposed in [CMSZ2020] can be viewed as the inexact Riemannian proximal gradient method provided the proximal mapping is solved to certain accuracy. Finally, numerical experiments on sparse principal component analysis are conducted to test the proposed practical conditions.
Practical Timing Side Channel Attacks on Memory Compression

Compression algorithms are widely used as they save memory without losing data. However, elimination of redundant symbols and sequences in data leads to a compression side channel. So far, compression attacks have only focused on the compression-ratio side channel, i.e., the size of compressed data, and largely targeted HTTP traffic and website content.
QCD Static Force in Gradient Flow

We compute the QCD static force and potential using gradient flow at next-to-leading order in the strong coupling. The static force is the spatial derivative of the static potential: it encodes the QCD interaction at both short and long distances. While on the one side the static force has the advantage of being free of the $O(\Lambda_{\rm QCD})$ renormalon affecting the static potential when computed in perturbation theory, on the other side its direct lattice QCD computation suffers from poor convergence. The convergence can be improved by using gradient flow, where the gauge fields in the operator definition of a given quantity are replaced by flowed fields at flow time $t$, which effectively smear the gauge fields over a distance of order $\sqrt{t}$, while they reduce to the QCD fields in the limit $t \to 0$. Based on our next-to-leading order calculation, we explore the properties of the static force for arbitrary values of $t$, as well as in the $t \to 0$ limit, which may be useful for lattice QCD studies.
Distribution Compression in Near-Linear Time

In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.
Moment Transform-Based Compressive Sensing in Image Processing

Over the last decades, images have become an important source of information in many domains, so high image quality has become necessary for acquiring better information. One of the important issues that arise is image denoising, which means recovering a signal from inaccurately and/or partially measured samples. This interpretation is closely related to compressive sensing theory, a revolutionary technology which implies that if a signal is sparse, then the original signal can be obtained from a few measured values, far fewer than those suggested by other commonly used theories such as Shannon's sampling theorem. A strong factor in Compressive Sensing (CS) theory for achieving the sparsest solution and removing noise from the corrupted image is the selection of the basis dictionary. In this paper, the Discrete Cosine Transform (DCT) and moment transforms (Tchebichef, Krawtchouk) are compared in order to achieve image denoising of additive white Gaussian noise based on compressive sensing and sparse approximation theory. The experimental results revealed that the basis dictionaries constructed by the moment transform perform competitively with the traditional DCT. The latter transform shows a higher PSNR of 30.82 dB and the same 0.91 SSIM value as the Tchebichef transform. Moreover, from the sparsity point of view, Krawtchouk moments provide approximately 20-30% more sparse results than DCT.
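
The PSNR figure quoted above follows the standard definition, $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below assumes 8-bit images with peak value 255; the abstract does not state the peak value used, so `max_value` is an assumption, and the function name `psnr` is illustrative.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Toy example: a flat grey image corrupted by additive white Gaussian noise.
rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)
noisy = clean + rng.normal(0.0, 10.0, size=clean.shape)
print(psnr(clean, noisy))
```

A denoiser is then judged by how much it raises the PSNR of its output relative to the noisy input, measured against the clean reference.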
Using Convolutional Neural Networks to Detect Compression Algorithms

Machine learning is penetrating virtually every domain and producing excellent results. It has also found an outlet in digital forensics, where it is becoming the prime driver of computational efficiency. A prominent capability that demonstrates the effectiveness of ML algorithms is feature extraction, which can be instrumental in digital forensics applications. Convolutional Neural Networks are further used to identify parts of a file. To this end, we observed that the literature does not include sufficient information about the identification of the algorithms used to compress file fragments. With this research, we attempt to address this gap, since compression algorithms produce comparatively higher-entropy output as they make the data more compact. We used a base dataset, compressed every file with various algorithms, and designed a model based on that. The model was able to accurately identify files compressed using compress, lzip and bzip2.
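
The entropy remark can be made concrete by measuring the Shannon entropy of the byte histogram of a fragment, in bits per byte. The sketch below is not the paper's pipeline. It uses Python's standard `bz2` module for the bzip2 format, while `zlib` and `lzma` are stand-ins for illustration rather than the compress and lzip tools named in the abstract. The helper name `byte_entropy` and the sample data are assumptions.

```python
import bz2
import lzma
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0 to 8)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical sample: numbered text lines, so the input is not one repeated block.
sample = b"".join(b"record %d: the quick brown fox\n" % i for i in range(5000))
fragments = {
    "raw": sample,
    "zlib": zlib.compress(sample),
    "bz2": bz2.compress(sample),
    "lzma": lzma.compress(sample),
}
for name, blob in fragments.items():
    print(name, len(blob), round(byte_entropy(blob), 3))
```

Byte entropy alone does not distinguish one compressor from another, which is presumably why the paper trains a classifier on the fragments instead.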
Artificial Compressibility Approaches in Flux Reconstruction for Incompressible Viscous Flow Simulations

Several competing artificial compressibility methods for the incompressible flow equations are examined using the high-order flux reconstruction method. The established artificial compressibility method (ACM) of Chorin (1967) is compared to the alternative entropically damped (EDAC) method of Clausen (2013), as well as an ACM formulation with hyperbolised diffusion. While the former requires the solution to be converged to a divergence free state at each physical time step through pseudo iterations, the latter can be applied explicitly. We examine the sensitivity of both methods to the parameterisation for a series of test cases over a range of Reynolds numbers. As the compressibility is reduced, EDAC is found to give linear improvements in divergence whereas ACM yields diminishing returns. For the Taylor-Green vortex, EDAC is found to perform well; however, on the more challenging circular cylinder at $Re=3900$, EDAC gives rise to early transition of the free shear-layer and over-production of the turbulence kinetic energy. This is attributed to the spatial pressure fluctuations of the method. Similar behaviour is observed for an aerofoil at $Re=60,000$ with an attached transitional boundary layer. It is concluded that hyperbolic diffusion of ACM can be beneficial but at the cost of case setup time, and EDAC can be an efficient method for incompressible flow. However, care must be taken as pressure fluctuations can have a significant impact on physics and the remedy causes the governing equation to become overly stiff.
