Understanding ONNX: Enhancing Interoperability and Performance in Machine Learning Models

Allan Kouidri
-
5/23/2024
[Figure: ONNX illustration]

What is ONNX?

ONNX stands for Open Neural Network Exchange [1]. It is an open-source format created to represent machine learning models. Originally developed by Facebook and Microsoft, and later backed by partners such as Amazon, ONNX is designed to make models portable and interoperable across different AI frameworks and hardware. This interoperability is crucial for developers who want to use the best tools available during different stages of their workflow, from training to deployment.

The Core Features of ONNX

1. Framework Interoperability: ONNX acts as a bridge among frameworks like TensorFlow and PyTorch. This feature allows developers to move models between these frameworks without recoding the entire model. For instance, you might train a model in PyTorch because of its dynamic computational graph, and then convert it to ONNX for easy deployment in a production environment that favors TensorFlow.

[Figure: Framework interoperability with ONNX]

2. Hardware Compatibility: ONNX models can run on various hardware platforms. This flexibility is facilitated by ONNX Runtime, an engine developed for efficiently running ONNX models. It supports diverse platforms, including CPUs, GPUs, and even edge devices, ensuring that developers can deploy their AI solutions broadly.

[Figure: Hardware compatibility with ONNX]

3. Performance Optimizations: The ONNX Runtime is optimized to provide high performance for model inferencing. It uses graph optimizations, operator fusions, and kernel tuning to speed up the execution.

It's important to note that frameworks like PyTorch and TensorFlow continuously optimize how models execute, so converting a model to ONNX does not always guarantee improved inference speed. In fact, in some cases it can even result in a slower model, because each framework has its own highly optimized ways of executing operations, especially on specific types of hardware.

When a model is converted to ONNX, these optimizations may not always translate perfectly, leading to potential inefficiencies. Therefore, while ONNX offers great benefits in terms of interoperability and deployment flexibility, its impact on performance can vary and should be evaluated on a case-by-case basis.

How ONNX Works

1. Model: An ONNX model is essentially a serialized snapshot of a machine learning model. Stored in the .onnx file format, it encapsulates the complete architecture of the model, including the weights and metadata necessary for execution. This file format is designed for portability, allowing the model to be used across different machine learning frameworks and deployment environments without compatibility issues.

2. Graph: The core structure of an ONNX model is its computational graph. This graph is a visual and functional representation of the model’s operations, illustrating how data flows and transforms through the model. In the graph:

  • Nodes represent operations or computations. Each node is an instance where a mathematical function is applied to the data.
  • Edges represent tensors, the multi-dimensional data arrays that flow between these nodes. Edges carry the output from one node to the input of another, chaining the operations together to achieve the desired end result.

3. Nodes and Tensors: Nodes in an ONNX graph define specific operations, such as additions, multiplications, or more complex functions like convolution or batch normalization. Tensors, on the other hand, are the data elements that pass through these nodes. They carry the numerical data (such as input features, weights, or intermediate results) that nodes manipulate. The characteristics of tensors (like shape and type) are crucial as they need to match the requirements of the operations they are involved in.

4. Operators: Operators in ONNX are pre-defined computational building blocks for the nodes. Each operator specifies a particular operation that can be performed on tensors. ONNX provides a comprehensive library of standard operators, which simplifies the task of model conversion from different frameworks since these operators are widely supported and optimized across various platforms. This standardization is particularly beneficial for hardware accelerators and runtime environments, allowing them to optimize these operations specifically, resulting in faster and more efficient model execution.
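
To make these concepts concrete, here is a minimal sketch that uses the onnx Python package to load a model and walk its graph, printing each node's operator type and the tensors it consumes and produces. The file name model.onnx is a placeholder for any exported model.

```python
import onnx

# Load a serialized .onnx file into an in-memory ModelProto
model = onnx.load("model.onnx")  # placeholder path

# Verify that the model is structurally valid
onnx.checker.check_model(model)

# The graph holds the nodes (operations) and the tensors flowing between them
graph = model.graph
for node in graph.node:
    print(f"{node.op_type}: inputs={list(node.input)} -> outputs={list(node.output)}")

# Initializers are the stored weights; inputs/outputs describe the model's interface
print("weights:", [init.name for init in graph.initializer][:5])
print("inputs:", [inp.name for inp in graph.input])
print("outputs:", [out.name for out in graph.output])
```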

Conversion to ONNX

Switching models over to the ONNX format is a key move for developers who want to tap into the benefits of ONNX, like cross-platform portability and potential efficiency gains. But the process isn't always straightforward and can vary a lot depending on the machine learning framework you start with.

Let's dig into how some popular frameworks handle the switch to ONNX and check out the tools and tricks that make it happen.

PyTorch to ONNX

PyTorch is known for its dynamic computational graph and user-friendly interface, making it a favorite among researchers and developers. When it comes to converting PyTorch models to ONNX, the framework offers a relatively straightforward method:

  • Using torch.onnx.export(): This function takes a PyTorch model, along with sample input data (to trace the model operations), and outputs an ONNX model. The function needs parameters like the model, the model's input, and the filename to save the ONNX file. It also allows for setting dynamic axes and other advanced configurations to handle models that use variable input sizes and batch numbers.

This built-in support simplifies the conversion process, but developers still need to be mindful of specific PyTorch features such as dynamic loops or custom autograd functions, which might not translate directly to ONNX.
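
As an illustration, here is a minimal sketch of exporting a model with torch.onnx.export(); the model, input shape, file names, and axis names are placeholders to adapt to your own setup.

```python
import torch
import torchvision

# Example model; replace with your own trained model
model = torchvision.models.resnet18(weights=None)
model.eval()

# A sample input is required so the exporter can trace the model's operations
dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",                      # output file name (placeholder)
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"},  # allow variable batch size
                  "output": {0: "batch"}},
    opset_version=17,
)
```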

TensorFlow to ONNX

TensorFlow's static graph nature aligns well with the ONNX paradigm. However, converting TensorFlow models to ONNX is not always direct and may require additional tools:

  • Using tf2onnx: This is a popular tool for converting TensorFlow models to the ONNX format. The tool supports command-line interfaces and Python APIs, making it versatile for various use cases. Users need to provide the TensorFlow model and define the input and output nodes for the conversion process.

While tf2onnx supports a wide range of TensorFlow operations, there can still be challenges, especially with newer or less common TensorFlow functionalities, which might require manual intervention or custom code to handle.
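
For example, a Keras model can be converted with the tf2onnx Python API roughly as follows; the model, input signature, and file names are assumptions to adapt to your own case (tf2onnx also ships a command-line converter for SavedModel directories).

```python
import tensorflow as tf
import tf2onnx

# Example model; replace with your own trained Keras model
model = tf.keras.applications.MobileNetV2(weights=None)

# Describe the expected input tensor so the converter knows the graph's entry point
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name="input"),)

# Convert and write the ONNX file in one call
model_proto, _ = tf2onnx.convert.from_keras(
    model,
    input_signature=spec,
    opset=13,
    output_path="mobilenetv2.onnx",  # placeholder output path
)
```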

Leveraging ONNX Runtime for Model Acceleration

ONNX Runtime is a performance-focused engine for running ONNX models [3]. It optimizes and accelerates model inference across a variety of platforms and hardware configurations, which makes it a central component of the ONNX ecosystem for developers looking to maximize the execution speed and efficiency of their machine learning models.

Key Aspects of ONNX Runtime

Cross-Platform and Hardware Support: ONNX Runtime supports a wide array of environments, from cloud servers to edge devices, and is compatible with multiple operating systems like Windows, macOS, and Linux. It handles a variety of hardware with specific optimizations for CPUs, GPUs, and other accelerators, ensuring broad deployment capabilities.

Optimization Techniques: To enhance performance, ONNX Runtime applies several graph optimizations before model execution. These optimizations include layer fusion, operation elimination, and efficient memory utilization. By rearranging computations and merging layers, ONNX Runtime ensures optimal use of hardware resources, leading to faster inference.

Execution Providers (EPs): ONNX Runtime introduces the concept of Execution Providers, which are interfaces to various hardware accelerators. EPs allow ONNX Runtime to take full advantage of specific hardware features to accelerate model execution. Examples include:

  • CPUExecutionProvider
  • CUDAExecutionProvider for NVIDIA GPUs
  • MKLDNNExecutionProvider for Intel CPUs
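
In the Python API, the execution provider is chosen when creating an inference session. A minimal sketch (assuming a CUDA-enabled build and a placeholder model.onnx) might look like this; ONNX Runtime falls back to the next provider in the list when one is unavailable.

```python
import numpy as np
import onnxruntime as ort

# Prefer the GPU provider, fall back to CPU if CUDA is not available
session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

print("Active providers:", session.get_providers())

# Run inference: feed a dict mapping input names to numpy arrays
input_name = session.get_inputs()[0].name
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: dummy})
print(outputs[0].shape)
```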

Advanced Model Acceleration Techniques

ONNX Runtime is equipped with various advanced techniques designed to optimize and accelerate model performance across different platforms. Here’s an in-depth look at these techniques:

1. Graph Optimizations

Graph optimizations in ONNX Runtime are designed to streamline the computational graph of a model to improve efficiency and speed [2]. These optimizations are applied at three distinct levels:

  • Basic Optimizations: Basic optimizations simplify the computational graph before execution, removing unnecessary calculations without changing the model's results. They include:

           - Constant Folding: This optimization pre-computes parts of the graph that depend only on constant initializers, thus eliminating the need for these computations during runtime. By resolving these expressions during the compilation phase, we save valuable computational resources.

           - Node Elimination: This involves removing nodes that do not affect the final output of the model, thereby simplifying the computational graph. Examples include dropping Identity nodes and removing Dropout nodes that are inactive at inference time.

           - Semantics-Preserving Node Fusions: These optimizations combine multiple nodes into a single, more efficient node without altering the overall functionality.

  • Extended Optimizations: This level focuses on more complex optimizations tailored to specific execution providers, such as CPUs and GPUs. It includes advanced fusions of operations, where multiple operations are combined into a single, more efficient operation. For example, combining convolution and batch normalization into one operation can reduce the overhead of multiple reads and writes of intermediate tensors, thereby speeding up processing time.

[Figure: Illustration of ONNX Runtime graph optimization]

  • Layout Optimizations: These optimizations involve rearranging the data layouts (like changing from NCHW to NHWC format or vice versa) to better suit the hardware architecture. This can significantly improve data access patterns and computational efficiency, especially on hardware that is optimized for a specific data format. These optimizations help in maximizing data throughput and reducing latency.
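
The optimization level is configurable per session. As a sketch, ONNX Runtime's SessionOptions exposes the levels described above (basic, extended, and all, the last of which includes layout optimizations), and can optionally save the optimized graph for inspection; file paths here are placeholders.

```python
import onnxruntime as ort

opts = ort.SessionOptions()

# Choose how aggressively the graph is optimized before execution:
# ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED, or ORT_ENABLE_ALL (includes layout optimizations)
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED

# Optionally write the optimized model to disk to inspect the applied fusions
opts.optimized_model_filepath = "model_optimized.onnx"  # placeholder path

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```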

2. Hardware-Specific Kernels

ONNX Runtime leverages hardware-specific kernels to further enhance performance. These kernels are highly optimized for particular types of hardware:

  • For CPUs: ONNX Runtime might use libraries like Intel MKL-DNN to optimize operations for Intel CPUs, utilizing the best available instruction sets like AVX or AVX-512.
  • For GPUs: On NVIDIA GPUs, ONNX Runtime can utilize CUDA kernels to accelerate operations, taking advantage of the parallel processing capabilities of GPUs. Similarly, for AMD GPUs, it might use ROCm libraries.

These specialized kernels are designed to make the most out of the underlying hardware, significantly boosting the performance of model inference.

3. Parallel Execution

ONNX Runtime supports parallel execution of model operations, which is a powerful way to reduce inference time:

  • Inter-Op Parallelism: This involves executing different operations in parallel across multiple threads. If a model's graph has branches that are independent of each other, these can be processed simultaneously on different cores of a CPU or different CUDA streams on GPUs.
  • Intra-Op Parallelism: Within a single operation, especially for large operations like matrix multiplications or convolutions, ONNX Runtime can perform parts of these operations in parallel. This approach takes advantage of multi-core CPUs and the massive thread parallelism provided by GPUs.
  • Data Parallelism: Particularly useful in GPUs, this involves splitting the data into smaller batches and processing them in parallel, which can dramatically speed up the training and inference processes for large datasets.
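
Thread-level parallelism is also configurable through SessionOptions. The following sketch (with arbitrary thread counts chosen purely for illustration) shows how intra-op and inter-op parallelism can be tuned.

```python
import onnxruntime as ort

opts = ort.SessionOptions()

# Threads used *within* a single operator (e.g., a large matrix multiplication)
opts.intra_op_num_threads = 4

# Threads used to run *independent* operators (graph branches) concurrently
opts.inter_op_num_threads = 2
opts.execution_mode = ort.ExecutionMode.ORT_PARALLEL  # enable inter-op parallelism

session = ort.InferenceSession("model.onnx", sess_options=opts,
                               providers=["CPUExecutionProvider"])
```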

Our Journey in Converting a Transformer Model to ONNX at Ikomia

At Ikomia, we took on the task of developing a text extraction model for ID cards using the Donut transformer model. Realizing the potential benefits, we decided to convert our PyTorch model to ONNX to boost its portability and speed up its inference times. However, the road was anything but smooth.

Transformers, with their intricate encoder-decoder structures, bring a unique set of challenges when moving to ONNX, largely due to their complexity and the sophisticated mechanisms they use. Here’s a quick rundown of the major hurdles we encountered:

    1. Variable Sequence Lengths: Transformers are designed to handle texts of varying lengths, which complicates their conversion due to the need for dynamic axis management and the precise translation of padding and masking techniques.

    2. Encoder-Decoder Interaction: Setting up cross-attention layers and making sure that state transfer between the encoder and decoder is accurate during conversion is crucial for the model to work properly. Additionally, managing loops in the model's logic presented its own set of challenges, as ONNX expects static graph definitions.

    3. Versioning and Compatibility: Staying current with ONNX versions while ensuring our model remained compatible with new transformer architectures and operator versions introduced more layers of complexity.

Despite these challenges, we managed to push through and even provided an open-source package to help others streamline their conversion process. 

It took a deep dive into the inner workings of our model and several days of intense effort. Ultimately, it was worth it for the model's improved usability across different platforms, even though, unfortunately, it did not improve inference performance.

Conclusion

ONNX revolutionizes AI development by ensuring interoperability and portability of machine learning models across different frameworks and hardware platforms. Its core features include framework interoperability, hardware compatibility, and performance optimizations via ONNX Runtime, making it a valuable tool for seamless model deployment.

Although converting complex models can in some cases be challenging, the benefits of ONNX in enhancing model usability and deployment flexibility often outweigh these difficulties. As ONNX continues to evolve, it will become even more essential in AI workflows, enabling broader and more efficient use of AI models.

References

[1] https://onnx.ai/

[2] https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

[3] https://onnxruntime.ai/

[4] https://onnxruntime.ai/docs/performance/mobile-performance-tuning.html
