date 2021-06-22
Software-based neural video decoder leverages AI accelerator on Snapdragon 888

By Jean-Luc Aufranc (CNXSoft)
cnx-software.com
 9 days ago

Sometimes hardware blocks get to work on tasks they were not initially designed to handle. For example, AI inference used to be mostly offloaded to the GPU before neural network accelerators became more common in SoCs. Qualcomm AI Research has now showcased a software-based neural video decoder that leverages both...

Computers | arxiv.org

NAX: Co-Designing Neural Network and Hardware Architecture for Memristive Xbar based Computing Systems

In-Memory Computing (IMC) hardware using Memristive Crossbar Arrays (MCAs) is gaining popularity for accelerating Deep Neural Networks (DNNs) since it alleviates the "memory wall" problem associated with the von Neumann architecture. The hardware efficiency (energy, latency and area) as well as the application accuracy (considering device and circuit non-idealities) of DNNs mapped to such hardware are co-dependent on network parameters, such as kernel size and depth, and hardware architecture parameters such as crossbar size. However, co-optimization of both network and hardware parameters presents a challenging search space comprising different kernel sizes mapped to varying crossbar sizes. To that effect, we propose NAX -- an efficient neural architecture search engine that co-designs neural network and IMC based hardware architecture. NAX explores the aforementioned search space to determine kernel and corresponding crossbar sizes for each DNN layer to achieve optimal tradeoffs between hardware efficiency and application accuracy. Our results from NAX show that the networks have heterogeneous crossbar sizes across different network layers and achieve optimal hardware efficiency and accuracy considering the non-idealities in crossbars. On CIFAR-10 and Tiny ImageNet, our models achieve 0.8% and 0.2% higher accuracy, and 17% and 4% lower EDAP (energy-delay-area product), compared to baseline ResNet-20 and ResNet-18 models, respectively.
Engineering | arxiv.org

A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each optimization technique used. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers with an overview of design choices for implementing efficient low power neural network accelerators.
Software | arxiv.org

Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks

Object recognition from live video streams comes with numerous challenges such as the variation in illumination conditions and poses. Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition. Yet, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets. To address this problem, we propose a new CNN method based on orientation fusion for visual object recognition. The proposed cloud-based video analytics system pioneers the use of bi-dimensional empirical mode decomposition to split a video frame into intrinsic mode functions (IMFs). We further propose that these IMFs undergo a Riesz transform to produce monogenic object components, which are in turn used for the training of CNNs. Past works have demonstrated how the object orientation component may be used to reach accuracy levels as high as 93%. Herein we demonstrate how a feature-fusion strategy of the orientation components further improves visual recognition accuracy to 97%. We also assess the scalability of our method, looking at both the number and the size of the video streams under scrutiny. We carry out extensive experimentation on the publicly available Yale dataset, as well as on a self-generated video dataset, finding significant improvements (both in accuracy and scale) in comparison to AlexNet, LeNet and SE-ResNeXt, which are the three most commonly used deep learning models for visual object recognition and classification.
Computers | arxiv.org

Encoder-Decoder Based Attractor Calculation for End-to-End Neural Diarization

This paper investigates an end-to-end neural diarization (EEND) method for an unknown number of speakers. In contrast to the conventional pipeline approach to speaker diarization, EEND methods are better in terms of speaker overlap handling. However, EEND still has a disadvantage in that it cannot deal with a flexible number of speakers. To remedy this problem, we introduce an encoder-decoder-based attractor calculation (EDA) module to EEND. Once frame-wise embeddings are obtained, EDA sequentially generates speaker-wise attractors on the basis of a sequence-to-sequence method using an LSTM encoder-decoder. The attractor generation continues until a stopping condition is satisfied; thus, the number of attractors can be flexible. Diarization results are then estimated as dot products of the attractors and embeddings. The embeddings from speaker overlaps result in larger dot product values with multiple attractors; thus, this method can deal with speaker overlaps. Because the maximum number of output speakers is still limited by the training set, we also propose an iterative inference method to remove this restriction. Further, we propose a method that aligns the estimated diarization results with the results of an external speech activity detector, which enables fair comparison against pipeline approaches. Extensive evaluations on simulated and real datasets show that EEND-EDA outperforms the conventional pipeline approach.
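The dot-product step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function name `diarize`, the sigmoid squashing, the 0.5 threshold, and the toy attractor and embedding values are all assumptions made for illustration, and the attractors here are hard-coded rather than generated by an LSTM encoder-decoder.

```python
import math


def sigmoid(x):
    """Squash a raw dot-product score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def diarize(embeddings, attractors, threshold=0.5):
    """Return a frames x speakers matrix of 0/1 speech activity.

    Each entry is the thresholded sigmoid of the dot product between
    one frame embedding and one speaker attractor (assumed scoring rule).
    """
    activity = []
    for emb in embeddings:
        row = []
        for att in attractors:
            score = sum(e * a for e, a in zip(emb, att))
            row.append(1 if sigmoid(score) > threshold else 0)
        activity.append(row)
    return activity


# Hypothetical toy data: two speaker attractors in a 2-D embedding space.
attractors = [[4.0, -1.0], [-1.0, 4.0]]
embeddings = [
    [1.0, 0.0],    # speaker 0 only
    [0.0, 1.0],    # speaker 1 only
    [1.0, 1.0],    # overlap: large dot product with both attractors
    [-1.0, -1.0],  # silence: negative score for both
]
result = diarize(embeddings, attractors)
```

In this toy setup the overlap frame scores highly against both attractors, which is the property the abstract relies on for handling overlapping speech.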
Business | HPCwire

HPE Acquires Determined AI to Accelerate Machine Learning Training

June 21, 2021 — Hewlett Packard Enterprise today announced that it has acquired Determined AI, a San Francisco-based startup that delivers a powerful and robust software stack to train AI models faster, at any scale, using its open source machine learning (ML) platform. HPE will combine Determined AI’s unique software...
Coding & Programming | arxiv.org

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
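One common way to build multi-bit arithmetic from 1-bit primitives is bit-plane decomposition: each operand is split into binary planes, planes are combined with AND plus a population count, and the partial sums are shifted by their bit weights. The sketch below shows that general idea in plain Python for unsigned values only. It is an assumption-laden illustration, not the APNN-TC emulation algorithm or its Tensor Core kernels, and the helper names `bit_planes` and `emulated_dot` are hypothetical.

```python
def bit_planes(values, bits):
    """Pack bit i of every value into one integer per plane.

    Element k of `values` contributes to bit position k of each plane.
    """
    planes = []
    for i in range(bits):
        plane = 0
        for k, v in enumerate(values):
            plane |= ((v >> i) & 1) << k
        planes.append(plane)
    return planes


def emulated_dot(weights, activations, w_bits, a_bits):
    """Dot product of unsigned integers using only AND, popcount and shifts."""
    w_planes = bit_planes(weights, w_bits)
    a_planes = bit_planes(activations, a_bits)
    total = 0
    for i, wp in enumerate(w_planes):
        for j, ap in enumerate(a_planes):
            # popcount(wp AND ap) is a 1-bit dot product of two planes;
            # shifting by i + j restores the bit weights 2^i and 2^j.
            total += bin(wp & ap).count("1") << (i + j)
    return total


weights = [1, 0, 1, 1]      # 1-bit weights
activations = [3, 2, 0, 1]  # 2-bit activations
result = emulated_dot(weights, activations, w_bits=1, a_bits=2)
reference = sum(w * a for w, a in zip(weights, activations))
```

Signed values and the XOR-based variants mentioned in the abstract would need additional handling that this sketch leaves out.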
Software | Times Union

AI-Based Contract Intelligence Software from Scry Analytics

“Contract intelligence capabilities developed by Scry Analytics is one of the most innovative and transformative solutions that can help enterprises realize tangible value in short order. This solution was particularly liked by my organization for its user friendliness and for its ability to make accessible such advanced AI /ML capabilities to non-tech finance professionals to transform processes. The accuracy rate of this solution on some of the key tasks exceeded normal human operations accuracy. The performance of this solution just kept getting better with every new contract,” said Rohit Amberkar, Global Procure-to-Pay Leader for Microsoft Corporation.
Technology | arxiv.org

Robust EMRAN based Neural Aided Learning Controller for Autonomous Vehicles

This paper presents an online evolving neural network-based inverse dynamics learning controller for an autonomous vehicle's longitudinal and lateral control under model uncertainties and disturbances. The inverse dynamics of the vehicle is approximated using a feedback error learning mechanism that utilizes a dynamic Radial Basis Function neural network, referred to as the Extended Minimal Resource Allocating Network (EMRAN). EMRAN uses an extended Kalman filter approach for learning, and a growing/pruning condition helps keep the number of hidden neurons to a minimum. The online learning algorithm helps in handling uncertainties, dynamic variations and unknown disturbances on the road. The proposed control architecture employs two coupled conventional controllers aided by the EMRAN inverse dynamics controller: a conventional PID controller for cruise control and a Stanley controller for path-tracking. The performance of both the longitudinal and lateral controllers is compared with existing control methods, and the results clearly indicate that the proposed control scheme handles disturbances and parametric uncertainties better and also provides better tracking performance in autonomous vehicles.
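For readers unfamiliar with the Stanley controller mentioned in this abstract, its commonly cited steering law adds the heading error to the arctangent of a gain-scaled cross-track error divided by speed. The sketch below shows only that textbook form. The function name, the default gain, the small `eps` softening term and the sign convention are assumptions, and nothing here reflects the paper's EMRAN-aided controller.

```python
import math


def stanley_steering(heading_error, cross_track_error, speed, gain=1.0, eps=1e-6):
    """Textbook Stanley steering angle in radians.

    heading_error:      path heading minus vehicle heading (rad)
    cross_track_error:  signed lateral offset from the path (m)
    speed:              forward speed (m/s); eps avoids division by zero
    """
    return heading_error + math.atan2(gain * cross_track_error, speed + eps)


# Hypothetical example: 0.1 rad heading error, 1 m lateral offset at 1 m/s.
delta = stanley_steering(0.1, 1.0, 1.0, gain=1.0, eps=0.0)
```

Because the cross-track term is divided by speed, the same lateral offset produces a gentler correction at higher speeds.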
Business | ZDNet

Intel forms Accelerated Computing, Software business units

Chip giant Intel this afternoon said it will create two new business units, an Accelerated Computing Systems and Graphics Group, and a Software and Advanced Technology Group. The former will focus on high-performance computing and also graphics technology, while the latter will "drive Intel's vision for software," the company said.
Technology | thefastmode.com

Cellwize, Intel to Accelerate Deployment of AI-driven 5G vRAN Networks

Cellwize Wireless on Monday announced a collaboration with Intel to help operators deploy 5G virtual RAN (vRAN) more quickly, with a fully automated process. The collaboration will enable Cellwize’s CHIME technology on Intel Xeon Scalable processors with built-in AI acceleration and Intel FlexRAN reference software in order to propel deployment of automated AI-driven 5G vRAN networks.
Software | HPCwire

Quadric Announces Unified Silicon and Software Platform Optimized for On-Device AI

BURLINGAME, Calif., June 22, 2021 — Quadric (quadric.io), an innovator in high-performance edge processing, has introduced a unified silicon and software platform that unlocks the power of on-device AI. Built to accelerate computation speeds while reducing power consumption, Quadric’s new general-purpose processor platform meets the computing needs of today’s increasingly autonomous world of smart sensors, IoT devices, factory automation, robots, 5G infrastructure and medical imaging. The platform is designed to handle any AI algorithm, as well as classic algorithms used for tasks such as digital signal processing, high-performance computing and image processing.
Coding & Programming | arxiv.org

Tensor-based framework for training flexible neural networks

Activation functions (AFs) are an important part of the design of neural networks (NNs), and their choice plays a predominant role in the performance of an NN. In this work, we are particularly interested in the estimation of flexible activation functions using tensor-based solutions, where the AFs are expressed as a weighted sum of predefined basis functions. To do so, we propose a new learning algorithm which solves a constrained coupled matrix-tensor factorization (CMTF) problem. This technique fuses the first and zeroth order information of the NN, where the first-order information is contained in a Jacobian tensor, following a constrained canonical polyadic decomposition (CPD). The proposed algorithm can handle different decomposition bases. The goal of this method is to compress large pretrained NN models by replacing subnetworks, i.e., one or multiple layers of the original network, with a new flexible layer. The approach is applied to a pretrained convolutional neural network (CNN) used for character classification.
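The phrase "weighted sum of predefined basis functions" can be made concrete with a short sketch. The basis chosen here (constant, linear, quadratic, tanh) and the coefficient values are illustrative assumptions, and the function name `flexible_activation` is hypothetical. The paper's CMTF-based learning of the coefficients is not shown.

```python
import math


def flexible_activation(x, coeffs, basis):
    """Evaluate f(x) = sum_k coeffs[k] * basis[k](x)."""
    return sum(c * g(x) for c, g in zip(coeffs, basis))


# Assumed basis: constant, linear, quadratic and tanh terms.
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x, math.tanh]

# Hypothetical coefficients: an even blend of the linear and tanh terms.
coeffs = [0.0, 0.5, 0.0, 0.5]

y0 = flexible_activation(0.0, coeffs, basis)
y1 = flexible_activation(1.0, coeffs, basis)
```

Changing only the coefficient vector changes the shape of the activation, which is what makes the function "flexible" while the basis itself stays fixed.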
Software | Electronic Engineering Times

Quadric Accelerator Takes On AI, Computer Vision

Quadric has optimized its architecture for both AI and standard computer vision algorithms aimed at edge applications. Silicon Valley startup Quadric has built an accelerator designed to speed both AI and standard computer vision algorithm workloads for edge devices such as robots, factory automation and medical imaging. The company’s hardware architecture is a novel hybrid data-flow and Von Neumann design which can handle workloads including neural networks, machine learning, computer vision, DSP and basic linear algebra subprograms.
Technology | Android Headlines

Samsung Is Testing Its Upcoming Exynos SoC With AMD GPU

Samsung seems to be testing its upcoming Exynos SoC with AMD GPU. A well-known tipster, Ice Universe, has revealed that the SoC appeared in the ‘Wild Life’ test. He also included a screenshot, which you can see below the article. Samsung is testing its Exynos SoC which is backed by...
Science | arxiv.org

Accelerating Recurrent Neural Networks for Gravitational Wave Experiments

This paper presents novel reconfigurable architectures for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves. Gravitational interferometers such as the LIGO detectors capture cosmic events such as black hole mergers, which happen at unknown times and have varying durations, producing time-series data. We have developed a new architecture capable of accelerating RNN inference for analyzing time-series data from LIGO detectors. This architecture is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network by identifying appropriate reuse factors for each layer. A customizable template for this architecture has been designed, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. The proposed approach has been evaluated based on two LSTM models, targeting a ZYNQ 7045 FPGA and a U250 FPGA. Experimental results show that with balanced II, the number of DSPs can be reduced by up to 42% while achieving the same IIs. When compared to other FPGA-based LSTM designs, our design can achieve about 4.92 to 12.4 times lower latency.
Cell Phones | SamMobile

The Snapdragon 888+ is here, but will it power a Samsung phone?

Qualcomm has finally unveiled the Snapdragon 888+ 5G mobile chipset. The unveiling was done during the first day of the MWC (Mobile World Congress) Barcelona 2021 event. The new processor is a slightly higher binned version of the Snapdragon 888 processor, which was launched earlier this year, and comes with higher clock speeds.
Software | arxiv.org

Efficient Document Image Classification Using Region-Based Graph Neural Network

Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks have lent document image classification many tools. However, using large pre-trained models usually requires substantial computing resources, which could defeat the cost-saving advantages of automatic document image classification. In this paper we propose an efficient document image classification framework that uses graph convolution neural networks and incorporates textual, visual and layout information of the document. We have rigorously benchmarked our proposed algorithm against several state-of-the-art vision and language models on both a publicly available dataset and a real-life insurance document classification dataset. Empirical results on both publicly available and real-world data show that our methods achieve near-SOTA performance yet require much less computing resources and time for model training and inference. This results in solutions that offer better cost advantages, especially in scalable deployment for enterprise applications. We also provide comprehensive comparisons of computing resources, model sizes, and training and inference time between our proposed methods and baselines. In addition, we delineate the cost per image using our method and other baselines.
Computers | arxiv.org

Robust Pose Transfer with Dynamic Details using Neural Video Rendering

Pose transfer of human videos aims to generate a high fidelity video of a target person imitating actions of a source person. A few studies have made great progress either through image translation with deep latent features or neural rendering with explicit 3D features. However, both of them rely on large amounts of training data to generate realistic results, and the performance degrades on more accessible internet videos due to insufficient training frames. In this paper, we demonstrate that the dynamic details can be preserved even when training on short monocular videos. Overall, we propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net), which fully utilizes both the stability of explicit 3D features and the capacity of learning components. To be specific, a novel texture representation is presented to encode both the static and pose-varying appearance characteristics, which is then mapped to the image space and rendered as a detail-rich frame in the neural rendering stage. Moreover, we introduce a concise temporal loss in the training stage to suppress the detail flickering that is made more visible by the high-quality dynamic details generated by our method. Through extensive comparisons, we demonstrate that our neural human video renderer is capable of achieving both clearer dynamic details and more robust performance, even on accessible short videos with only 2k - 4k frames.