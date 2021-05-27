

Towards Understanding Knowledge Distillation

By Mary Phuong, Christoph H. Lampert
arxiv.org
 22 days ago

Knowledge distillation, i.e., one classifier being trained on the outputs of another classifier, is an empirically very successful technique for knowledge transfer between classifiers. It has even been observed that classifiers learn much faster and more reliably if trained with the outputs of another classifier as soft labels, instead of from ground truth data. So far, however, there is no satisfactory theoretical explanation of this phenomenon. In this work, we provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers. Specifically, we prove a generalization bound that establishes fast convergence of the expected risk of a distillation-trained linear classifier. From the bound and its proof we extract three key factors that determine the success of distillation:
* data geometry -- geometric properties of the data distribution, in particular class separation, have a direct influence on the convergence speed of the risk;
* optimization bias -- gradient descent optimization finds a very favorable minimum of the distillation objective; and
* strong monotonicity -- the expected risk of the student classifier always decreases when the size of the training set grows.
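As a rough illustration of the setting described above, the following sketch distills a linear binary classifier: a student is fitted by plain gradient descent to the soft labels of a fixed linear teacher. It is a minimal sketch and not the paper's construction; the function name `distill_linear`, the sigmoid cross-entropy objective, the zero initialization, the step size, and the toy Gaussian data are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def distill_linear(teacher_w, xs, steps=2000, lr=0.5):
    """Fit student weights to the teacher's soft labels by gradient descent.

    Hypothetical helper: minimizes the average cross-entropy between the
    teacher's soft labels and the student's sigmoid outputs.
    """
    d = len(teacher_w)
    soft = [sigmoid(dot(teacher_w, x)) for x in xs]  # teacher soft labels
    w = [0.0] * d                                    # student starts at zero (assumption)
    for _ in range(steps):
        grad = [0.0] * d
        for x, y in zip(xs, soft):
            err = sigmoid(dot(w, x)) - y             # derivative of the loss w.r.t. the logit
            for j in range(d):
                grad[j] += err * x[j] / len(xs)
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

random.seed(0)
teacher_w = [2.0, -1.0]                              # toy teacher (assumption)
xs = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(50)]
student_w = distill_linear(teacher_w, xs)

# Fraction of training points on which student and teacher predict the same class.
agree = sum((dot(student_w, x) > 0) == (dot(teacher_w, x) > 0) for x in xs) / len(xs)
print(student_w, agree)
```

Because the soft labels carry the teacher's logits and not just its decisions, the student in this toy setup can recover the teacher's weight vector itself, which hard 0/1 labels alone would not pin down.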

Related
Science | towardsdatascience.com

On the Gap between Adoption and Understanding

Federico Bianchi and Dirk Hovy (2021). On the Gap between Adoption and Understanding in NLP. Findings of the Association for Computational Linguistics. Associations of Computational Linguistics (to appear). The main focus of this work is to describe issues that currently affect NLP research and hinder scientific development. NLP is driven...
Computers | arxiv.org

ERNIE-Tiny : A Progressive Distillation Framework for Pretrained Transformer Compression

Pretrained language models (PLMs) such as BERT adopt a training paradigm that first pretrains the model on general data and then finetunes it on task-specific data, and have recently achieved great success. However, PLMs are notorious for their enormous parameter counts and are hard to deploy in real-life applications. Knowledge distillation has become a prevailing way to address this problem by transferring knowledge from a large teacher to a much smaller student over a set of data. We argue that the selection of the three key components, namely teacher, training data, and learning objective, is crucial to the effectiveness of distillation. We therefore propose a four-stage progressive distillation framework, ERNIE-Tiny, to compress PLMs, which varies the three components gradually from the general level to the task-specific level. Specifically, the first stage, General Distillation, performs distillation with guidance from a pretrained teacher, general data, and a latent distillation loss. Then, General-Enhanced Distillation changes the teacher model from the pretrained teacher to a finetuned teacher. After that, Task-Adaptive Distillation shifts the training data from general data to task-specific data. In the end, Task-Specific Distillation adds two additional losses, namely Soft-Label and Hard-Label loss, onto the last stage. Empirical results demonstrate the effectiveness of our framework and the generalization gain brought by ERNIE-Tiny. In particular, experiments show that a 4-layer ERNIE-Tiny maintains over 98.0% of the performance of its 12-layer teacher BERT base on the GLUE benchmark, surpassing the state of the art (SOTA) by 1.0% GLUE score with the same number of parameters. Moreover, ERNIE-Tiny achieves a new compression SOTA on five Chinese NLP tasks, outperforming BERT base by 0.4% accuracy with 7.5x fewer parameters and 9.4x faster inference speed.
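The abstract names a Soft-Label and a Hard-Label loss without giving their form. The sketch below shows one common way such a pair is combined in knowledge distillation; it is not taken from ERNIE-Tiny. The function names, the temperature of 2.0, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the tempered teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def hard_label_loss(student_logits, label):
    """Standard cross-entropy against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])

def combined_loss(student_logits, teacher_logits, label, alpha=0.5):
    """Weighted sum of the two losses; alpha is a hypothetical mixing weight."""
    soft = soft_label_loss(student_logits, teacher_logits)
    hard = hard_label_loss(student_logits, label)
    return alpha * soft + (1.0 - alpha) * hard

loss = combined_loss([1.0, 0.5], [2.0, 0.0], label=0)
print(loss)
```

The soft term is smallest when the student's tempered distribution matches the teacher's, while the hard term keeps the student anchored to the labeled data.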
Computers | arxiv.org

Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative normalization techniques, these properties need to be generalized so that any given layer's success/failure can be accurately predicted. In this work, we take a first step towards this goal by extending known properties of BatchNorm in randomly initialized deep neural networks (DNNs) to nine recently proposed normalization layers. Our primary findings follow: (i) Similar to BatchNorm, activations-based normalization layers can avoid exploding activations in ResNets; (ii) Use of GroupNorm ensures rank of activations is at least $\Omega(\sqrt{\frac{\text{width}}{\text{Group Size}}})$, thus explaining why LayerNorm witnesses slow optimization speed; (iii) Small group sizes result in large gradient norm in earlier layers, hence justifying training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals several general mechanisms that explain the success of normalization techniques in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
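For readers unfamiliar with the layers being compared, here is a minimal sketch of group normalization for a single sample, without the learnable scale and shift. The `group_norm` helper and the channels-by-positions list layout are assumptions made for the example. With one group it behaves like LayerNorm, and with one group per channel like Instance Normalization.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample; x is a list of channels, each a list of activations.

    Hypothetical helper: statistics are shared by all channels in a group.
    """
    channels = len(x)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    size = channels // num_groups                    # channels per group
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        values = [v for ch in group for v in ch]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in ch] for ch in group])
    return out

x = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
layer_like = group_norm(x, num_groups=1)      # one group: LayerNorm-style statistics
grouped = group_norm(x, num_groups=2)         # two channels per group
instance_like = group_norm(x, num_groups=4)   # one channel per group: InstanceNorm-style
print(grouped)
```

The group size is the knob the abstract refers to: it sets how many activations share one mean and variance.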
Science | arxiv.org

Marginal Utility Diminishes: Exploring the Minimum Knowledge for BERT Knowledge Distillation

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher's soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student's performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher's hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher's hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analyses. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student's performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of the student. For two kinds of student models and computing devices, the proposed KD paradigm gives rise to a training speedup of 2.7x ~ 3.4x.
Computers | arxiv.org

Knowledge distillation: A good teacher is patient and consistent

There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. In this paper we address this issue and significantly bridge the gap between these two types of models. Throughout our empirical investigation we do not aim to necessarily propose a new method, but strive to identify a robust and effective recipe for making state-of-the-art large scale models affordable in practice. We demonstrate that, when performed correctly, knowledge distillation can be a powerful tool for reducing the size of large models without compromising their performance. In particular, we uncover that there are certain implicit design choices, which may drastically affect the effectiveness of distillation. Our key contribution is the explicit identification of these design choices, which were not previously articulated in the literature. We back up our findings by a comprehensive empirical study, demonstrate compelling results on a wide range of vision datasets and, in particular, obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
Economy | writeoutofla.com

Study: My Understanding of Photo

Business Consulting Services: Are They Right for You? Business consulting services help a company in numerous ways. To name a few, a business consultant has the following necessary expertise: fundamental skills and knowledge of the market they operate in, and an understanding of the demands of their customers. A business consulting service therefore helps an individual or company in these three important respects. A business consultant can also study marketing trends and understand the needs of customers. Consultants help you identify your business's key issues. When a new product is released, it will usually go through numerous testing stages, and only if it is deemed a success will it be rolled out through the remaining launch stages. If you are a business consulting services provider, you will be assisting the company with these essential decisions. Consultants also help identify the most cost-effective products and help produce a marketing strategy based on the customer's needs and wants. Most companies across industries today use a wide range of consulting services to get a better idea of their market and improve their methods. In addition, most business owners hire business consultants because their services help identify hidden profit margins and revenues, as well as the areas where the business needs improvement. For example, if a particular marketing approach is not getting the desired response, a business consultant could suggest a change in marketing strategy to get a better response. Business consulting is an extensive service, but it needs to be tailor-made to meet the specific needs of a particular company.
Another benefit of business consulting services is that they help identify the right kind of specialists and skilled professionals to join the business. There are various kinds of consultants, and depending on the needs of a client, a specific type of consultant may be required for a particular task. The business can therefore recognize the specific benefits of hiring a consulting firm and can choose the right kind of specialist or specialists to do the job. Many companies provide business consulting services in the UK, and many of them offer a wide variety of services to their clients. Some firms offer services such as strategic planning, business analysis, market research, financial consulting and project management consulting, so a business owner can look for a consultant who meets his requirements. Most of these firms help their clients by providing affordable solutions to boost their efficiency and streamline their operations. Most of these consultants or firms charge a nominal fee for providing the service. This fee is typically agreed upon before the work begins, mostly on a yearly basis. A few of them also charge a contingency fee, which means that if a mistake is made after the work is done, they would pay you for it. However, this should be the last option, as the consultant will not be happy about it and might take legal action against you. It would also be good to look for independent business consulting services, which would use the most advanced tools and technologies to get the job done faster and at lower cost.
Software | proformacolorpress.com

Understanding The Recognition Pattern Of AI

Of the seven patterns of AI that represent the ways in which AI is being implemented, one of the most common is the recognition pattern. The main idea of the recognition pattern of AI is that we’re using machine learning and cognitive technology to help identify and categorize unstructured data into specific classifications. This unstructured data could be images, video, text, or even quantitative data. The power of this pattern is that we’re enabling machines to do the thing that our brains seem to do so easily: identify what we’re perceiving in the real world around us.
Coding & Programming | arxiv.org

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation

While deep and large pre-trained models are the state-of-the-art for various natural language processing tasks, their huge size poses significant challenges for practical uses in resource constrained settings. Recent works in knowledge distillation propose task-agnostic as well as task-specific methods to compress these models, with task-specific ones often yielding higher compression rate. In this work, we develop a new task-agnostic distillation framework XtremeDistilTransformers that leverages the advantage of task-specific methods for learning a small universal model that can be applied to arbitrary tasks and languages. To this end, we study the transferability of several source tasks, augmentation resources and model architecture for distillation. We evaluate our model performance on multiple tasks, including the General Language Understanding Evaluation (GLUE) benchmark, SQuAD question answering dataset and a massive multi-lingual NER dataset with 41 languages.
Coding & Programming | arxiv.org

Learning by Distillation: A Self-Supervised Learning Framework for Optical Flow Estimation

We present DistillFlow, a knowledge distillation approach to learning optical flow. DistillFlow trains multiple teacher models and a student model, where challenging transformations are applied to the input of the student model to generate hallucinated occlusions as well as less confident predictions. Then, a self-supervised learning framework is constructed: confident predictions from teacher models serve as annotations to guide the student model to learn optical flow for those less confident predictions. The self-supervised learning framework enables us to effectively learn optical flow from unlabeled data, not only for non-occluded pixels, but also for occluded pixels. DistillFlow achieves state-of-the-art unsupervised learning performance on both KITTI and Sintel datasets. Our self-supervised pre-trained model also provides an excellent initialization for supervised fine-tuning, suggesting an alternate training paradigm in contrast to current supervised learning methods that highly rely on pre-training on synthetic data. At the time of writing, our fine-tuned models ranked 1st among all monocular methods on the KITTI 2015 benchmark, and outperformed all published methods on the Sintel Final benchmark. More importantly, we demonstrate the generalization capability of DistillFlow in three aspects: framework generalization, correspondence generalization and cross-dataset generalization.
Science | arxiv.org

Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation

Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On the one hand, it can relieve radiologists from the heavy burden of report writing; on the other hand, it can remind radiologists of abnormalities and help avoid misdiagnosis and missed diagnosis. Yet, this task remains a challenging job for data-driven neural networks, due to the serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who will first examine the abnormal regions and assign the disease topic tags to the abnormal regions, and then rely on years of prior medical knowledge and accumulated working experience to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on the MIMIC-CXR and IU-Xray datasets, our method outperforms previous state-of-the-art models on both.
Science | arxiv.org

Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation

In recent years the ubiquitous deployment of AI has raised serious concerns regarding algorithmic bias, discrimination, and fairness. Compared to traditional forms of bias or discrimination caused by humans, algorithmic bias generated by AI is more abstract and unintuitive, and therefore more difficult to explain and mitigate. A clear gap exists in the current literature on evaluating and mitigating bias in pruned neural networks. In this work, we strive to tackle the challenging issues of evaluating, mitigating, and explaining induced bias in pruned neural networks. Our paper makes three contributions. First, we propose two simple yet effective metrics, Combined Error Variance (CEV) and Symmetric Distance Error (SDE), to quantitatively evaluate the induced bias prevention quality of pruned models. Second, we demonstrate that knowledge distillation can mitigate induced bias in pruned neural networks, even with unbalanced datasets. Third, we reveal that model similarity has strong correlations with pruning induced bias, which provides a powerful method to explain why bias occurs in pruned neural networks. Our code is available at this https URL.
Books & Literature | econlib.org

Knowledge, Reality, and Value

The Book Club on Mike Huemer’s Knowledge, Reality, and Value continues. Today, I cover Part 2 of the book. To repeat, though I’m a huge fan of the book, I’m focusing almost entirely on disagreements. 1. One of Huemer’s preferred responses to the classic Brain-in-a-Vat (BIV) scenario is that –...
Coding & Programming | mathworks.com

Knowledge Based Neural Networks

Knowledge Based Neural Networks (KBaNNs) are a bi-fidelity machine learning architecture that allows the outputs of a coarse-scale model to be augmented by the predictions of a neural network. Having been trained using a dataset comprising outputs of a high-fidelity model, Fe(x), the KBaNN corrects the outputs of the coarse model to emulate the output of the high-fidelity model.
Coding & Programming | arxiv.org

Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation

Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples. However, task-specific unlabeled data can be challenging to find. We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data, helping advance SSL and KD on different tasks. To obtain strong task-specific generative models, we adopt generic generative models, pretrained on open-domain data, and fine-tune them on inputs from specific tasks. Then, we use existing classifiers to annotate generated unlabeled examples with soft pseudo labels, which are used for additional training. When self-training is combined with samples generated from GPT2-large, fine-tuned on the inputs of each GLUE task, we outperform a strong RoBERTa-large baseline on the GLUE benchmark. Moreover, KD on GPT-2 samples yields a new state-of-the-art for 6-layer transformers on the GLUE leaderboard. Finally, self-training with GAL offers significant gains on image classification on CIFAR-10 and four tabular tasks from the UCI repository.
Computers | arxiv.org

Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data

Most existing works in few-shot learning rely on meta-learning the network on a large base dataset which is typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domain. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature. STARTUP was the first method to tackle this problem using self-training. However, it uses a fixed teacher pretrained on a labeled base dataset to create soft labels for the unlabeled target samples. As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal. We propose a simple dynamic distillation-based approach to facilitate unlabeled images from the novel/base dataset. We impose consistency regularization by calculating predictions from the weakly-augmented versions of the unlabeled images from a teacher network and matching them with the strongly augmented versions of the same images from a student network. The parameters of the teacher network are updated as an exponential moving average of the parameters of the student network. We show that the proposed network learns a representation that can be easily adapted to the target domain even though it has not been trained with target-specific classes during the pretraining phase. Our model outperforms the current state-of-the-art method by 4.4% for 1-shot and 3.6% for 5-shot classification in the BSCD-FSL benchmark, and also shows competitive performance on traditional in-domain few-shot learning tasks. Our code will be available at: this https URL.
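The teacher update mentioned above, an exponential moving average (EMA) of the student's parameters, can be sketched in a few lines. This is a minimal illustration on flat parameter lists; the `ema_update` name and the momentum value of 0.99 are assumptions and are not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

teacher = [0.0, 0.0]     # toy teacher parameters (assumption)
student = [1.0, -2.0]    # held fixed here; in training the student changes every step

for _ in range(500):
    teacher = ema_update(teacher, student)

print(teacher)
```

A momentum close to 1 makes the teacher a slowly moving, smoothed copy of the student, which is what lets it supply comparatively stable targets for the consistency loss.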
Technology | techinvestornews.com

The Pure Power of Knowledge

In this interview, brought to you by Pure Storage, Arjan Timmerman had the opportunity to sit with James Gallegos, Pure Storage’s Director of Product Marketing, and discuss what technology can offer humanity. The two discuss what Pure provides to its customers, including its Pure1 Meta, an AIOps solution, and the new options announced during the annual Pure //Accelerate event, as well as where the future for these solutions will go to.
Mathematics | arxiv.org

Experimental Demonstrations of Native Implementation of Boolean Logic Hamiltonian in a Superconducting Quantum Annealer

Daisuke Saida, Yuki Yamanashi, Mutsuo Hidaka, Fuminori Hirayama, Kentaro Imafuku, Shuichi Nagasawa, Siro Kawabata. Experimental demonstrations of quantum annealing with native implementation of Boolean logic Hamiltonians are reported. As a superconducting integrated circuit, a problem Hamiltonian whose set of ground states is consistent with a given truth table is implemented for quantum annealing with no redundant qubits. As examples of the truth table, NAND and NOR are successfully fabricated as an identical circuit. Similarly, a native implementation of a multiplier comprising six superconducting flux qubits is also demonstrated. These native implementations of Hamiltonians consistent with Boolean logic provide an efficient and scalable way of applying annealing computation to so-called circuit satisfiability problems that aim to find a set of inputs consistent with a given output over any Boolean logic functions, especially those like factorization through a multiplier Hamiltonian. A proof-of-concept demonstration of a hybrid computing architecture for domain-specific quantum computing is described.
Computers | arxiv.org

Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning

Random features are a central technique for scalable learning algorithms based on kernel methods. A recent work has shown that an algorithm for machine learning by quantum computer, quantum machine learning (QML), can exponentially speed up sampling of optimized random features, even without imposing restrictive assumptions on sparsity and low-rankness of matrices that had limited applicability of conventional QML algorithms; this QML algorithm makes it possible to significantly reduce and provably minimize the required number of features for regression tasks. However, a major interest in the field of QML is how widely the advantages of quantum computation can be exploited, not only in the regression tasks. We here construct a QML algorithm for a classification task accelerated by the optimized random features. We prove that the QML algorithm for sampling optimized random features, combined with stochastic gradient descent (SGD), can achieve state-of-the-art exponential convergence speed of reducing classification error in a classification task under a low-noise condition; at the same time, our algorithm with optimized random features can take advantage of the significant reduction of the required number of features so as to accelerate each iteration in the SGD and evaluation of the classifier obtained from our algorithm. These results discover a promising application of QML to significant acceleration of the leading classification algorithm based on kernel methods, without ruining its applicability to a practical class of data sets and the exponential error-convergence speed.
Science | arxiv.org

Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation

We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction of the points are arbitrary. The goal is to output a small list of vectors at least one of which is close to the mean of $\mathcal{D}$. As our main contribution, we develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $n^{1 + o(1)} d$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades.
Mathematics | arxiv.org

Spectral dimensions of Kreĭn-Feller operators and $L^{q}$-spectra

We study the spectral dimensions and spectral asymptotics of Kreĭn-Feller operators for arbitrary finite Borel measures on $(0,1)$. Connections between the spectral dimension, the $L^{q}$-spectrum, the partition entropy and the optimised coarse multifractal dimension are established. In particular, we show that the upper spectral dimension always corresponds to the fixed point of the $L^{q}$-spectrum of the corresponding measure. Natural bounds reveal intrinsic connections to the Minkowski dimension of the support of the associated Borel measure. Further, we give a sufficient condition on the $L^{q}$-spectrum to guarantee the existence of the spectral dimension. As an application, we confirm the existence of the spectral dimension of self-conformal measures with and without overlap as well as of certain measures of pure point type. We construct a simple example for which the spectral dimension does not exist and determine explicitly its upper and lower spectral dimension.