DOI: 10.1109/TPAMI.2024.3355495
Published on 18 January 2024 in IEEE Transactions on Pattern Analysis and Machine Intelligence

AIfES: A Next-Generation Edge AI Framework

L. Krupp, P. Gembaczka, Lars Wulfert + 4 authors

Abstract

Edge Artificial Intelligence (AI) relies on the integration of Machine Learning (ML) into even the smallest embedded devices, thus enabling local intelligence in real-world applications, e.g. for image or speech processing. Traditional Edge AI frameworks fall short in important aspects required to keep up with recent and upcoming ML innovations. These shortcomings include low flexibility concerning the target hardware and limited support for integrating custom hardware accelerators. The Artificial Intelligence for Embedded Systems Framework (AIfES) aims to overcome these challenges faced by traditional Edge AI frameworks. In this paper, we give a detailed overview of the architecture of AIfES and the applied design principles. Finally, we compare AIfES with TensorFlow Lite for Microcontrollers (TFLM) on an ARM Cortex-M4-based System-on-Chip (SoC) using fully connected neural networks (FCNNs) and convolutional neural networks (CNNs). AIfES outperforms TFLM in both execution time and memory consumption for the FCNNs. Additionally, using AIfES reduces memory consumption by up to 54% when using CNNs. Furthermore, we show the performance of AIfES during the training of FCNNs as well as CNNs and demonstrate the feasibility of training a CNN on a resource-constrained device with a memory usage of slightly more than 100 kB of RAM.
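To make the on-device setting concrete, the sketch below shows the kind of statically allocated, fully connected forward pass that a framework such as AIfES executes on a Cortex-M4 class device. It is an illustrative example only, not the actual AIfES API; the layer sizes, weights, and function names are hypothetical.

```cpp
// Illustrative only: a statically allocated dense layer with ReLU, the core
// operation an Edge AI framework runs on a microcontroller. All buffers live
// in fixed-size arrays (no heap), matching devices with ~100 kB of RAM.
// Sizes and weight values are made up for the demo.
#include <cstddef>
#include <cstdio>

constexpr std::size_t kIn = 4;   // hypothetical input width
constexpr std::size_t kOut = 3;  // hypothetical output width

// Weights and biases would normally come from offline or on-device training;
// here they are placeholder values.
static const float kWeights[kOut][kIn] = {
    {0.1f, -0.2f, 0.3f, 0.0f},
    {0.5f, 0.1f, -0.4f, 0.2f},
    {-0.3f, 0.2f, 0.1f, 0.4f}};
static const float kBias[kOut] = {0.01f, -0.02f, 0.03f};

// Dense layer followed by ReLU, written without dynamic allocation.
void dense_relu(const float (&in)[kIn], float (&out)[kOut]) {
  for (std::size_t o = 0; o < kOut; ++o) {
    float acc = kBias[o];
    for (std::size_t i = 0; i < kIn; ++i) acc += kWeights[o][i] * in[i];
    out[o] = acc > 0.0f ? acc : 0.0f;  // ReLU activation
  }
}

int main() {
  const float input[kIn] = {1.0f, 0.5f, -1.0f, 2.0f};
  float output[kOut];
  dense_relu(input, output);
  for (float v : output) std::printf("%f\n", v);
  return 0;
}
```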

Related Scientific Articles

TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

Meghna Natraj, Jared Duke, Shlomi Regev + 10 others

17 October 2020

Deep learning inference on embedded devices is a burgeoning field with myriad applications because tiny embedded devices are omnipresent. But we must overcome major challenges before we can benefit from this opportunity. Embedded processors are severely resource constrained. Their nearest mobile counterparts exhibit at least a 100x to 1,000x difference in compute capability, memory availability, and power consumption. As a result, the machine-learning (ML) models and associated ML inference framework must not only execute efficiently but also operate in a few kilobytes of memory. Also, the embedded devices' ecosystem is heavily fragmented. To maximize efficiency, system vendors often omit many features that commonly appear in mainstream systems, including dynamic memory allocation and virtual memory, that allow for cross-platform interoperability. The hardware comes in many flavors (e.g., instruction-set architecture and FPU support, or lack thereof). We introduce TensorFlow Lite Micro (TF Micro), an open-source ML inference framework for running deep-learning models on embedded systems. TF Micro tackles the efficiency requirements imposed by embedded-system resource constraints and the fragmentation challenges that make cross-platform interoperability nearly impossible. The framework adopts a unique interpreter-based approach that provides flexibility while overcoming these challenges. This paper explains the design decisions behind TF Micro and describes its implementation details. Also, we present an evaluation to demonstrate its low resource requirement and minimal run-time performance overhead.
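The interpreter-based approach described above typically looks like the following sketch: a model flatbuffer is compiled into the firmware, the operators it needs are registered in a resolver, and all tensors are planned into a statically sized arena. This is a minimal sketch in the spirit of the TFLM C++ API; exact header paths and constructor signatures vary across library versions, error handling is omitted, and the model symbol g_model_data is an assumption.

```cpp
// Minimal sketch of interpreter-based inference in the style of TensorFlow
// Lite Micro. Header paths and signatures vary by version; g_model_data is
// assumed to be a model flatbuffer compiled into the firmware image.
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];  // assumption: model baked into flash

namespace {
// All tensors live in this statically sized arena: no heap, no virtual memory.
constexpr int kArenaSize = 10 * 1024;
alignas(16) uint8_t tensor_arena[kArenaSize];
}  // namespace

float RunInference(float x) {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register only the operators this model actually uses.
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddFullyConnected();
  resolver.AddRelu();

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
  interpreter.AllocateTensors();        // plans all buffers inside the arena

  interpreter.input(0)->data.f[0] = x;  // write the single input value
  interpreter.Invoke();                 // run the graph
  return interpreter.output(0)->data.f[0];
}
```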

Embedded Development Boards for Edge-AI: A Comprehensive Report

Kiran Mehmood, Usama Latif, Saad Wazir + 2 others

2 September 2020

The use of Deep Learning and Machine Learning is becoming pervasive day by day, opening doors to new opportunities in every aspect of technology. Its applications range from health care to self-driving cars, home automation to smart agriculture, and Industry 4.0. Traditionally, the majority of the processing for IoT applications has been done on a central cloud, but that has its issues, including latency, security, bandwidth, and privacy. It is estimated that there will be around 20 million IoT devices by 2020, which will increase the problems of sending data to the cloud and doing the processing there. A new trend of processing the data at the edge of the network is emerging. The idea is to do the processing as near to the point of data production as possible. Processing on the nodes generating the data is called Edge Computing, and processing on a layer between the cloud and the point of data production is called Fog Computing. There are no standard definitions for either, so they are often used interchangeably. In this paper, we review the development boards available for running Artificial Intelligence algorithms on the edge.

From Near-Sensor to In-Sensor: A State-of-the-Art Review of Embedded AI Vision Systems

Gilles Sicard, Vincent Lorrain, William Fabre + 2 others

1 August 2024

In modern cyber-physical systems, the integration of AI into vision pipelines is now a standard practice for applications ranging from autonomous vehicles to mobile devices. Traditional AI integration often relies on cloud-based processing, which faces challenges such as data access bottlenecks, increased latency, and high power consumption. This article reviews embedded AI vision systems, examining the diverse landscape of near-sensor and in-sensor processing architectures that incorporate convolutional neural networks. We begin with a comprehensive analysis of the critical characteristics and metrics that define the performance of AI-integrated vision systems. These include sensor resolution, frame rate, data bandwidth, computational throughput, latency, power efficiency, and overall system scalability. Understanding these metrics provides a foundation for evaluating how different embedded processing architectures impact the entire vision pipeline, from image capture to AI inference. Our analysis delves into near-sensor systems that leverage dedicated hardware accelerators and commercially available components to efficiently process data close to their source, minimizing data transfer overhead and latency. These systems offer a balance between flexibility and performance, allowing for real-time processing in constrained environments. In addition, we explore in-sensor processing solutions that integrate computational capabilities directly into the sensor. This approach addresses the rigorous demand constraints of embedded applications by significantly reducing data movement and power consumption while also enabling in-sensor feature extraction, pre-processing, and CNN inference. By comparing these approaches, we identify trade-offs related to flexibility, power consumption, and computational performance. Ultimately, this article provides insights into the evolving landscape of embedded AI vision systems and suggests new research directions for the development of next-generation machine vision systems.
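As a rough illustration of why data movement dominates these metrics, the short calculation below estimates the raw bandwidth of a hypothetical sensor and compares it with the size of a compact feature vector produced in-sensor. All numbers are assumptions chosen only to make the trade-off concrete.

```cpp
// Back-of-the-envelope comparison of off-sensor data movement:
// streaming raw frames vs. transmitting an in-sensor feature vector.
// All parameters are hypothetical and only illustrate the order of magnitude.
#include <cstdio>

int main() {
  // Assumed sensor: 1920x1080, 30 fps, 8-bit RGB (3 bytes per pixel).
  const double width = 1920, height = 1080, fps = 30, bytes_per_pixel = 3;
  const double raw_bytes_per_s = width * height * fps * bytes_per_pixel;

  // Assumed in-sensor CNN output: a 256-element float feature vector per frame.
  const double feature_bytes_per_s = 256 * 4 * fps;

  std::printf("raw stream      : %.1f MB/s\n", raw_bytes_per_s / 1e6);
  std::printf("feature stream  : %.3f MB/s\n", feature_bytes_per_s / 1e6);
  std::printf("reduction factor: %.0fx\n", raw_bytes_per_s / feature_bytes_per_s);
  return 0;
}
```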

SECDA: Efficient Hardware/Software Co-Design of FPGA-based DNN Accelerators for Edge Inference

D. Kaeli, José Cano, Nicolas Bohm Agostini + 2 others

1 October 2021

Edge computing devices inherently face tight resource constraints, which is especially apparent when deploying Deep Neural Networks (DNN) with high memory and compute demands. FPGAs are commonly available in edge devices. Since these reconfigurable circuits can achieve higher throughput and lower power consumption than general purpose processors, they are especially well-suited for DNN acceleration. However, existing solutions for designing FPGA-based DNN accelerators for edge devices come with high development overheads, given the cost of repeated FPGA synthesis passes, reimplementation in a Hardware Description Language (HDL) of the simulated design, and accelerator system integration. In this paper we propose SECDA, a new hardware/software co-design methodology to reduce design time of optimized DNN inference accelerators on edge devices with FPGAs. SECDA combines cost-effective SystemC simulation with hardware execution, streamlining design space exploration and the development process via reduced design evaluation time. As a case study, we use SECDA to efficiently develop two different DNN accelerator designs on a PYNQ-Z1 board, a platform that includes an edge FPGA. We quickly and iteratively explore the system's hardware/software stack, while identifying and mitigating performance bottlenecks. We evaluate the two accelerator designs with four common DNN models, achieving an average performance speedup across models of up to 3.5× with a 2.9× reduction in energy consumption over CPU-only inference. Our code is available at https://github.com/gicLAB/SECDA
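To illustrate the simulation side of such a co-design flow, the sketch below models a single multiply-accumulate (MAC) stage as a SystemC module, the kind of component a DNN accelerator datapath is built from. It is a generic, hypothetical example and not SECDA's actual accelerator design; the port widths and testbench values are assumptions.

```cpp
// Generic SystemC sketch of a MAC stage such as a DNN accelerator datapath
// might contain. Hypothetical example, not the SECDA accelerators themselves.
#include <systemc.h>
#include <iostream>

SC_MODULE(MacUnit) {
  sc_in<bool> clk;
  sc_in<bool> clear;            // resets the accumulator
  sc_in<sc_int<8>> a, b;        // quantized operands (e.g. weight and activation)
  sc_out<sc_int<32>> acc;       // running partial sum

  sc_int<32> acc_reg;

  void step() {
    if (clear.read()) acc_reg = 0;
    else acc_reg += a.read() * b.read();
    acc.write(acc_reg);
  }

  SC_CTOR(MacUnit) : acc_reg(0) {
    SC_METHOD(step);
    sensitive << clk.pos();     // evaluate on every rising clock edge
  }
};

int sc_main(int, char*[]) {
  sc_clock clk("clk", 10, SC_NS);
  sc_signal<bool> clear;
  sc_signal<sc_int<8>> a, b;
  sc_signal<sc_int<32>> acc;

  MacUnit mac("mac");
  mac.clk(clk);
  mac.clear(clear);
  mac.a(a);
  mac.b(b);
  mac.acc(acc);

  // Drive constant operands and let the MAC accumulate for a few cycles.
  clear = false;
  a = 3;
  b = 4;
  sc_start(50, SC_NS);
  std::cout << "acc = " << acc.read() << std::endl;
  return 0;
}
```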

Hardware-aware Partitioning of Convolutional Neural Network Inference for Embedded AI Applications

Tim Hotfilter, J. Becker, Fabian Kreß + 4 others

1 May 2022

Embedded image processing applications like multicamera-based object detection or semantic segmentation are often based on Convolutional Neural Networks (CNNs) to provide precise and reliable results. The deployment of CNNs in embedded systems, however, imposes additional constraints such as latency restrictions and limited energy consumption in the sensor platform. These requirements have to be considered during hardware/software co-design of embedded Artificial Intelligence (AI) applications. In addition, the transmission of uncompressed image data from the sensor to a central edge node requires large bandwidth on the link, which must also be taken into account during the design phase. Therefore, we present a simulation toolchain for fast evaluation of hardware-aware CNN partitioning for embedded AI applications. This approach explores an efficient workload distribution between sensor nodes and a central edge node. Neither processing all layers close to the sensor nor transmitting all uncompressed raw data to the edge node is an optimal solution for every use case. Hence, our proposed simulation toolchain evaluates power and performance metrics for each reasonable partitioning point in a CNN. In contrast to the state of the art, our approach does not only consider the neural network architecture. In the evaluation, our simulation toolchain additionally takes into account hardware components such as special accelerators and memories that are implemented in the sensor node. As an example, we show the simulation results for three commonly used CNNs in embedded systems. Thereby, we identify advantageous partitioning points regarding inference latency and energy consumption. With the support of the toolchain, we are able to identify three beneficial partitioning points for FCN ResNet-50 and two for GoogLeNet as well as for SqueezeNet V1.1.
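The core idea of evaluating every reasonable partitioning point can be sketched as a loop over layer boundaries that tallies sensor-side compute time and the data volume that would cross the link. The sketch below is a hypothetical simplification of such a toolchain; the layer list, per-layer cost numbers, and link bandwidth are invented for illustration.

```cpp
// Hypothetical sketch of hardware-aware CNN partitioning: for every layer
// boundary, estimate sensor-side compute time plus the time to transmit the
// intermediate feature map to the edge node, then compare the candidates.
// Layer sizes, per-layer compute times, and the link bandwidth are invented.
#include <cstddef>
#include <cstdio>
#include <vector>

struct Layer {
  const char* name;
  double sensor_compute_ms;   // time to run this layer on the sensor node
  double output_bytes;        // size of the feature map it produces
};

int main() {
  const std::vector<Layer> layers = {
      {"conv1", 4.0, 400000.0},
      {"conv2", 6.0, 200000.0},
      {"conv3", 8.0, 50000.0},
      {"fc", 2.0, 4000.0},
  };
  const double link_bytes_per_ms = 1250.0;   // assumed ~10 Mbit/s link
  const double raw_input_bytes = 1000000.0;  // assumed raw image size

  // Partition point k = number of layers executed on the sensor node.
  for (std::size_t k = 0; k <= layers.size(); ++k) {
    double compute_ms = 0.0;
    for (std::size_t i = 0; i < k; ++i) compute_ms += layers[i].sensor_compute_ms;
    const double tx_bytes = (k == 0) ? raw_input_bytes : layers[k - 1].output_bytes;
    const double link_ms = tx_bytes / link_bytes_per_ms;
    std::printf("split after %-6s: sensor %.1f ms + link %.1f ms = %.1f ms\n",
                k == 0 ? "input" : layers[k - 1].name, compute_ms, link_ms,
                compute_ms + link_ms);
  }
  return 0;
}
```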

Reference List

1 reference

TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

Meghna Natraj, Jared Duke + 11 others

17 October 2020

Citing Articles

0 citations

No citing articles yet.