The NXP eIQ ML (edge intelligence machine learning) software environment provides tools to perform inference on embedded systems using neural network models. The software includes optimizations that leverage the hardware capabilities of the i.MX93 family for improved performance. Examples of applications that typically use neural network inference include object/pattern recognition, gesture control, voice processing, and sound monitoring.

eIQ includes support for the following inference engine:

  • TensorFlow Lite (with NPU acceleration through the Ethos-U delegate)

Include eIQ packages in Digi Embedded Yocto

Add the meta-multimedia layer to your conf/bblayers.conf configuration file if it isn’t there already:

conf/bblayers.conf
   /usr/local/dey-4.0/sources/meta-digi/meta-digi-arm \
   /usr/local/dey-4.0/sources/meta-digi/meta-digi-dey \
+  /usr/local/dey-4.0/sources/meta-openembedded/meta-multimedia \
"

Edit your conf/local.conf file to include the eIQ package group in your Digi Embedded Yocto image:

conf/local.conf
IMAGE_INSTALL:append = " packagegroup-imx-ml"

This package group contains all of NXP’s eIQ packages compatible with the ConnectCore 93.

Including this package group increases the size of the rootfs image significantly. To minimize the increase in image size, select a subset of its packages depending on your needs. See the package group’s recipe for more information on the packages it contains.

NPU: Ethos-U software architecture

The software for Ethos-U support includes three main components.

  • Vela model compiler: an offline tool that compiles the TFLite model graph for Ethos-U. The compiler replaces supported operators in the model with a custom "ethos-u" operator containing the command stream for the Ethos-U NPU. The output of the compiler is a modified TFLite model graph for the TFLite/TFLite-Micro inference engines.

  • Cortex-A software stack for Linux: contains the MPU inference engine (TensorFlow Lite), the driver library, and the kernel-side device driver for the Linux kernel.

  • Cortex-M software stack: contains the MCU inference engine software (TFLite-Micro, CMSIS-NN) and the NPU driver.

The typical inference workflow is as follows:

  1. The Vela model compiler converts the TFLite model into a Vela model and generates the optimized version for Ethos-U NPU.

  2. The optimized model is fed into one of the following:

    1. The TFLite inference engine, which recognizes the custom "ethos-u" operator, allocates the buffers for the input/output feature maps (IFM/OFM), and executes the operator via the Ethos-U Linux driver.

    2. The Inference API, which allocates the buffers for the input/output feature maps and sends the entire model via the Ethos-U driver.

  3. The Ethos-U driver composes the inference task message and sends it over RPMSG to the Cortex-M core.

  4. The Ethos-U Runner on the Cortex-M core dispatches the task to TFLite-Micro or directly to the Ethos-U driver, depending on the task type.

    1. If the task accelerates the "ethos-u" operator (using TFLite), the runner calls the Ethos-U driver directly.

    2. If the task accelerates the entire model (using the Inference API), the runner dispatches the model to TFLite-Micro, which in turn calls the Ethos-U driver for processing.

  5. After the Ethos-U driver completes the inference task, it writes the result into the OFM buffer and sends the response back to Cortex-A via RPMSG.
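
If you want to drive step 1 from a script rather than from the command line, the following minimal sketch invokes the Vela compiler as a subprocess. It assumes the vela command-line tool is installed (it is invoked the same way in the NPU example below); the model name and output directory are placeholders:

# Sketch: run the Vela compiler from Python (step 1 of the workflow above).
# Assumes the "vela" CLI tool is available in the PATH.
import subprocess

model = "mobilenet_v1_1.0_224_quant.tflite"   # input TFLite model (placeholder)
out_dir = "vela_out"                          # destination for the compiled model

# Equivalent to: vela <model> --output-dir <out_dir>
subprocess.run(["vela", model, "--output-dir", out_dir], check=True)

# Vela writes <model name>_vela.tflite into out_dir; supported operators are
# replaced with the custom "ethos-u" operator containing the NPU command stream.
print(f"Compiled model: {out_dir}/mobilenet_v1_1.0_224_quant_vela.tflite")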

TensorFlow

TensorFlow support

TensorFlow Lite is a set of tools that enables on-device machine learning by helping developers run their models on mobile, embedded, and edge devices. TensorFlow Lite supports computation on the following hardware units:

  • CPU: Arm Cortex-A cores

  • NPU: hardware acceleration using the Ethos-U Delegate

Ethos-U Delegate is an external delegate on i.MX93 Linux platforms. It enables inference to be accelerated by the on-chip hardware accelerator. The Ethos-U Delegate uses the hardware accelerator driver (the Ethos-U driver stack) directly to fully utilize the accelerator's capabilities.
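
As an illustration of how an application chooses between these two hardware units, the following minimal Python sketch creates a TFLite interpreter either on the CPU or through the Ethos-U external delegate. It assumes the tflite_runtime Python bindings are installed (the eIQ demos described below use the same API); the model file names are placeholders taken from the examples that follow:

# Sketch: create a TFLite interpreter on the Cortex-A CPU or on the Ethos-U NPU.
import tflite_runtime.interpreter as tflite

def make_interpreter(model_path, delegate_path=None):
    # Without a delegate the model runs on the Cortex-A cores; with the Ethos-U
    # external delegate, a Vela-compiled model is offloaded to the NPU.
    delegates = [tflite.load_delegate(delegate_path)] if delegate_path else None
    return tflite.Interpreter(model_path=model_path, experimental_delegates=delegates)

# CPU: pass the regular model and no delegate.
cpu = make_interpreter("mobilenet_v1_1.0_224_quant.tflite")
# NPU: pass the Vela-compiled model and the Ethos-U delegate library.
npu = make_interpreter("mobilenet_v1_1.0_224_quant_vela.tflite",
                       "/usr/lib/libethosu_delegate.so")
npu.allocate_tensors()
print(npu.get_input_details()[0]["shape"])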

TensorFlow example with CPU

The following example shows how to use TensorFlow Lite, performing the inference on the CPU.

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# ./label_image -i grace_hopper.bmp
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 142.956 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

The output displays the time taken to process the sample, with an average time of 142.956 ms.

TensorFlow example with NPU

First, compile the model from the previous example for Ethos-U using the Vela tool:

# cd /usr/bin/ethosu/examples/
# vela ../../tensorflow-lite-2.12.1/examples/mobilenet_v1_1.0_224_quant.tflite --output-dir /usr/bin/tensorflow-lite-2.12.1/examples/

Then, run the example, specifying the converted model (-m option) and the Ethos-U delegate library (--external_delegate_path option):

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# ./label_image -i grace_hopper.bmp -m mobilenet_v1_1.0_224_quant_vela.tflite \
     --external_delegate_path=/usr/lib/libethosu_delegate.so
INFO: Loaded model mobilenet_v1_1.0_224_quant_vela.tflite
INFO: resolved reporter
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
EXTERNAL delegate created.
INFO: EthosuDelegate: 1 nodes delegated out of 1 nodes with 1 partitions.
INFO: Applied EXTERNAL delegate.
INFO: invoked
INFO: average time: 3.842 ms
INFO: 0.780392: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

The output displays the time taken to process the sample, with an average time of 3.842 ms.

TensorFlow NPU vs CPU performance

The TensorFlow Lite examples folder includes a benchmark tool, benchmark_model, that measures inference time. The NPU is used when you run a Vela-converted model through the Ethos-U delegate; otherwise, inference runs on the CPU. To compare performance, run the tool with the models used in the previous examples. Begin by running the benchmark without Ethos-U support:

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# time ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --num_runs=1000 --num_threads=2
STARTING!
Log parameter values verbosely: [0]
Min num runs: [1000]
Num threads: [2]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
#threads used for CPU inference: [2]
#threads used for CPU inference: [2]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 11.997ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=15 first=57921 curr=33998 min=33586 max=57921 avg=35436.3 std=6014

Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=1000 first=33801 curr=33601 min=33422 max=47372 avg=33759.5 std=682

Inference timings in us: Init: 11997, First inference: 57921, Warmup (avg): 35436.3, Inference (avg): 33759.5
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=4.45703 overall=12.7148

real    0m34.424s
user    1m7.911s
sys     0m0.036s

The test results indicate that the benchmark took approximately 34.4 seconds to complete, with an average inference time of 33759.5 microseconds using two CPU cores, and that CPU utilization stayed close to 100%. To compare, repeat the test with the Vela-converted model, instructing it to use the Ethos-U NPU through the delegate:

# cd /usr/bin/tensorflow-lite-2.12.1/examples
# time ./benchmark_model --graph=mobilenet_v1_1.0_224_quant_vela.tflite --num_runs=1000 \
     --external_delegate_path=/usr/lib/libethosu_delegate.so
STARTING!
Log parameter values verbosely: [0]
Min num runs: [1000]
Num threads: [2]
Graph: [mobilenet_v1_1.0_224_quant_vela.tflite]
#threads used for CPU inference: [2]
#threads used for CPU inference: [2]
External delegate path: [/usr/lib/libethosu_delegate.so]
Loaded model mobilenet_v1_1.0_224_quant_vela.tflite
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
EXTERNAL delegate created.
INFO: EthosuDelegate: 1 nodes delegated out of 1 nodes with 1 partitions.
Explicitly applied EXTERNAL delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 3.35866
Initialized session in 20.346ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=130 first=3934 curr=3818 min=3805 max=3934 avg=3819.06 std=16

Running benchmark for at least 1000 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=1000 first=3819 curr=3818 min=3803 max=3952 avg=3817.89 std=11

Inference timings in us: Init: 20346, First inference: 3934, Warmup (avg): 3819.06, Inference (avg): 3817.89
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=8.04297 overall=8.40625

real    0m4.417s
user    0m0.153s
sys     0m0.101s

The test results indicate that the benchmark took approximately 4.4 seconds to complete, with an average inference time of 3817.89 microseconds on the NPU, and that CPU utilization dropped to approximately 1%.
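
If you prefer to measure from Python instead of using benchmark_model, the following sketch reproduces a rough CPU-versus-NPU latency comparison. It assumes the tflite_runtime and numpy Python packages are installed and reuses the model and delegate paths from the examples above; the absolute numbers will not match benchmark_model exactly:

# Sketch: compare average inference latency on the CPU and on the Ethos-U NPU.
import time
import numpy as np
import tflite_runtime.interpreter as tflite

def average_latency_ms(model_path, delegate_path=None, runs=100):
    delegates = [tflite.load_delegate(delegate_path)] if delegate_path else None
    interpreter = tflite.Interpreter(model_path=model_path,
                                     experimental_delegates=delegates)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # Feed a dummy input with the model's shape and data type.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()  # warm-up run, not measured
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) * 1000 / runs

examples = "/usr/bin/tensorflow-lite-2.12.1/examples/"
print("CPU:", average_latency_ms(examples + "mobilenet_v1_1.0_224_quant.tflite"))
print("NPU:", average_latency_ms(examples + "mobilenet_v1_1.0_224_quant_vela.tflite",
                                 "/usr/lib/libethosu_delegate.so"))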

NXP eIQ examples

Overview

The image generated with packagegroup-imx-ml contains the eIQ demos provided by NXP in the eiq-examples package.

The eIQ examples and their source code are provided by NXP, so the exact commands in the following steps may need to be altered slightly. Use them as a reference.

The eIQ examples available in the image are inside the /usr/bin/eiq-examples-git folder:

#  ls -l /usr/bin/eiq-examples-git/
drwxr-xr-x    2 root     root          4096 Mar  9  2018 dms
-rw-r--r--    1 root     root          4069 Mar  9  2018 download_models.py
drwxr-xr-x    2 root     root          4096 Mar  9  2018 face_recognition
drwxr-xr-x    2 root     root          4096 Mar  9  2018 gesture_detection
drwxr-xr-x    2 root     root          4096 Mar  9  2018 image_classification
drwxr-xr-x    2 root     root          4096 Mar  9  2018 object_detection

That folder contains:

  • download_models.py: A script that downloads the required TensorFlow Lite models and creates Vela-converted copies of those models for use with the NPU.

  • Demo directories: There are multiple demos, and each demo folder contains a Python script to run it.

Setup

The sequence to work with the demos is:

  1. Download the required models. You only need to do this once. (The download script also converts the downloaded models with Vela.)

    To download the models, the device must have network connectivity.
  2. Run the download_models.py script.

    # cd /usr/bin/eiq-examples-git
    # python3 download_models.py
    Downloading  gesture recognition  model(s) file(s) from https://drive.google.com/
    uc?export=download&&id=1yjWyXsac5CbGWYuHWYhhnr_9cAwg3uNI
    ...
    Downloading  dms iris landmark  model(s) file(s) from https://s3.ap-northeast-2.w
    asabisys.com/pinto-model-zoo/049_iris_landmark/resources.tar.gz
    Converting facenet_512_int_quantized.tflite
    ...
    Batch Inference time                 3.10 ms,  322.73 inferences/s (batch size 1)

    Some relevant notes about the download process:

    • The download is quite large and may take approximately an hour to complete.

    • The script converts the downloaded models with Vela, increasing the script’s duration. Your device remains busy during this process.

    • The device requires extra space to store all the models.

    • If the script is stopped or fails, it starts the download from the beginning, ignoring any previously downloaded data.

      Consider editing the script to reuse previously downloaded data and avoid restarting the full download when the script stops or fails.

      Once the process is completed, you’ll see the following folders:

    • models: Downloaded models.

    • vela_models: Converted Vela models.

      The downloaded models can be reused. Digi recommends backing up these folders so you can reuse the models on other devices without downloading them again.
  3. Choose a demo and run it. Each demo folder contains a Python script with its own parameters. Run the script with the -h option to list the available parameters.

Running an example (using the CPU)

As a general rule, enter one of the demo folders, check the help output of the main Python script, and run it. For instance, to run the object_detection demo, which identifies objects in the camera input:

# cd /usr/bin/eiq-examples-git/object_detection
# python3 main.py -h
usage: main.py [-h] [-i INPUT] [-d DELEGATE]

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        input to be classified
  -d DELEGATE, --delegate DELEGATE
                        delegate path
# python3 main.py -i /dev/video0
[ WARN:0@0.513] global cap_gstreamer.cpp:2784 handleMessage OpenCV | GStreamer warning: Embedded video playback halted; module source reported: Could not read from resource.
[ WARN:0@0.517] global cap_gstreamer.cpp:1679 open OpenCV | GStreamer warning: unable to start pipeline
[ WARN:0@0.517] global cap_gstreamer.cpp:1164 isPipelinePlaying OpenCV | GStreamer warning: GStreamer: pipeline have not been created
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
rectangle:(223,200),(621,475) label:person
rectangle:(389,387),(544,480) label:chair

Running an example (using the NPU)

To run the example using the NPU, delegate the inference to the Ethos-U delegate library and provide the Vela-converted model (typically via a -m parameter). If the script does not provide a parameter for the model, you must modify the script manually so that it uses the Vela-converted model instead of the regular one: inspect the script to find where the model path is set, and replace that path with the path to the converted model.

The following example demonstrates how to do this for the object_detection application:

# cd /usr/bin/eiq-examples-git/object_detection
# grep MODEL_PATH *
main.py:MODEL_PATH = "../models/ssd_mobilenet_v1_quant.tflite"
main.py:    interpreter = tflite.Interpreter(model_path=MODEL_PATH, experimental_delegates=ext_delegate)
main.py:    interpreter = tflite.Interpreter(model_path=MODEL_PATH)
# find /usr/bin/eiq-examples-git/ -name ssd_mobilenet_v1_quant*tflite
/usr/bin/eiq-examples-git/vela_models/ssd_mobilenet_v1_quant_vela.tflite
/usr/bin/eiq-examples-git/models/ssd_mobilenet_v1_quant.tflite
# sed -i 's/models\//vela_models\//g' main.py
# sed -i 's/quant.tflite/quant_vela.tflite/g' main.py
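
Before launching the full demo, you can optionally sanity-check the change with a short script such as the following sketch. It loads the Vela-converted model through the Ethos-U delegate, mirroring what the edited main.py does at startup, using the tflite_runtime API and the paths from the find output above:

# Sketch: verify that the Vela model loads through the Ethos-U delegate,
# as the edited main.py will do when the demo starts.
import tflite_runtime.interpreter as tflite

MODEL_PATH = "/usr/bin/eiq-examples-git/vela_models/ssd_mobilenet_v1_quant_vela.tflite"
DELEGATE_PATH = "/usr/lib/libethosu_delegate.so"

interpreter = tflite.Interpreter(
    model_path=MODEL_PATH,
    experimental_delegates=[tflite.load_delegate(DELEGATE_PATH)],
)
interpreter.allocate_tensors()
print("Input shape:", interpreter.get_input_details()[0]["shape"])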

After this change, run the demo, adding the parameter that delegates the inference to the NPU:

# python3 main.py -i /dev/video0 --delegate=/usr/lib/libethosu_delegate.so
INFO: Ethosu delegate: device_name set to /dev/ethosu0.
INFO: Ethosu delegate: cache_file_path set to .
INFO: Ethosu delegate: timeout set to 60000000000.
INFO: Ethosu delegate: enable_cycle_counter set to 0.
INFO: Ethosu delegate: enable_profiling set to 0.
INFO: Ethosu delegate: profiling_buffer_size set to 2048.
INFO: Ethosu delegate: pmu_event0 set to 0.
INFO: Ethosu delegate: pmu_event1 set to 0.
INFO: Ethosu delegate: pmu_event2 set to 0.
INFO: Ethosu delegate: pmu_event3 set to 0.
INFO: EthosuDelegate: 1 nodes delegated out of 2 nodes with 1 partitions.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
rectangle:(74,240),(599,474) label:person

If you use the regular models for inference on the NPU instead of the ones converted with Vela, the demo converts the model automatically before running the inference, adding a significant delay to the demo's execution time.

More information

See NXP’s i.MX Machine Learning User’s Guide for more information on eIQ.