ONNX stands for Open Neural Network Exchange [1]. It is an open format created to represent machine learning models. Originally developed by Microsoft and Facebook, and later backed by partners such as Amazon, ONNX is designed to make models portable and interoperable across different AI frameworks and hardware. This interoperability is crucial for developers who want to use the best tools available during different stages of their workflow, from training to deployment.
1. Framework Interoperability: ONNX acts as a bridge among frameworks like TensorFlow and PyTorch. This feature allows developers to move models between these frameworks without recoding the entire model. For instance, you might train a model in PyTorch because of its dynamic computational graph, and then convert it to ONNX for easy deployment in a production environment that favors TensorFlow.
2. Hardware Compatibility: ONNX models can run on various hardware platforms. This flexibility is facilitated by ONNX Runtime, an engine developed for efficiently running ONNX models. It supports diverse platforms, including CPUs, GPUs, and even edge devices, ensuring that developers can deploy their AI solutions broadly.
3. Performance Optimizations: The ONNX Runtime is optimized to provide high performance for model inferencing. It uses graph optimizations, operator fusions, and kernel tuning to speed up the execution.
It's important to note that while frameworks like PyTorch and TensorFlow continuously optimize how they execute models, converting a model to ONNX does not always guarantee improved inference speed. In some cases, it can even result in a slower model. This is because each framework has its own highly optimized ways of executing operations, especially for specific types of hardware.
When a model is converted to ONNX, these optimizations may not always translate perfectly, leading to potential inefficiencies. Therefore, while ONNX offers great benefits in terms of interoperability and deployment flexibility, its impact on performance can vary and should be evaluated on a case-by-case basis.
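As a rough way to check this for a given model, a simple side-by-side timing comparison can help. The sketch below uses a tiny, purely illustrative model; the file name, shapes, and iteration count are arbitrary, and real results depend heavily on the model, hardware, and library versions.

```python
import time
import torch
import torch.nn as nn
import onnxruntime as ort

# A tiny illustrative model; the timing procedure is the same for larger ones
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()
dummy = torch.randn(1, 512)
torch.onnx.export(model, dummy, "tiny.onnx", input_names=["x"], output_names=["y"])

session = ort.InferenceSession("tiny.onnx", providers=["CPUExecutionProvider"])

def mean_latency(fn, runs=200):
    fn()  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

with torch.no_grad():
    torch_s = mean_latency(lambda: model(dummy))
onnx_s = mean_latency(lambda: session.run(None, {"x": dummy.numpy()}))
print(f"PyTorch: {torch_s * 1e6:.1f} us | ONNX Runtime: {onnx_s * 1e6:.1f} us")
```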
1. Model: An ONNX model is essentially a serialized snapshot of a machine learning model. Stored in the .onnx file format, it encapsulates the complete architecture of the model, including the weights and metadata necessary for execution. This file format is designed for portability, allowing the model to be used across different machine learning frameworks and deployment environments without compatibility issues.
2. Graph: The core structure of an ONNX model is its computational graph. This graph is a visual and functional representation of the model’s operations, illustrating how data flows and transforms through the model. In the graph, nodes represent the model’s operations, and the edges between them carry the tensors that flow from one operation to the next.
3. Nodes and Tensors: Nodes in an ONNX graph define specific operations, such as additions, multiplications, or more complex functions like convolution or batch normalization. Tensors, on the other hand, are the data elements that pass through these nodes. They carry the numerical data (such as input features, weights, or intermediate results) that nodes manipulate. The characteristics of tensors (like shape and type) are crucial as they need to match the requirements of the operations they are involved in.
4. Operators: Operators in ONNX are pre-defined computational building blocks for the nodes. Each operator specifies a particular operation that can be performed on tensors. ONNX provides a comprehensive library of standard operators, which simplifies the task of model conversion from different frameworks since these operators are widely supported and optimized across various platforms. This standardization is particularly beneficial for hardware accelerators and runtime environments, allowing them to optimize these operations specifically, resulting in faster and more efficient model execution.
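To make these concepts concrete, here is a minimal sketch that uses the `onnx` Python package to open a serialized model and walk its graph; the file name `model.onnx` is a placeholder for any exported model.

```python
import onnx

# Load a serialized .onnx file and validate it against the ONNX specification
model = onnx.load("model.onnx")
onnx.checker.check_model(model)

# The graph holds the nodes (operations), initializers (weights), and the model's I/O
graph = model.graph
print("Inputs: ", [i.name for i in graph.input])
print("Outputs:", [o.name for o in graph.output])

# Each node references an operator type and the tensors it consumes and produces
for node in graph.node[:5]:
    print(node.op_type, list(node.input), "->", list(node.output))
```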
Switching models over to the ONNX format is a key move for developers who want to tap into the benefits of ONNX, like its ability to work across different platforms and boost efficiency. But the process isn't always straightforward and can vary a lot depending on the machine learning framework you start with.
Let's dig into how some popular frameworks handle the switch to ONNX and check out the tools and tricks that make it happen.
PyTorch is known for its dynamic computational graph and user-friendly interface, making it a favorite among researchers and developers. When it comes to converting PyTorch models to ONNX, the framework offers a relatively straightforward method:
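The built-in `torch.onnx.export` function traces the model with a dummy input and writes the resulting graph to disk. The sketch below uses an untrained torchvision ResNet purely for illustration; the file name, input shape, and opset version are placeholders to adapt to your own model.

```python
import torch
import torchvision

# Any nn.Module works here; an untrained ResNet-18 is used only as an example
model = torchvision.models.resnet18(weights=None)
model.eval()

# A dummy input defining the shape the exported graph will expect
dummy_input = torch.randn(1, 3, 224, 224)

# Export to ONNX; dynamic_axes lets the batch dimension vary at inference time
torch.onnx.export(
    model,
    dummy_input,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)
```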
This built-in support simplifies the conversion process, but developers still need to be mindful of specific PyTorch features such as dynamic loops or custom autograd functions, which might not translate directly to ONNX.
TensorFlow's static graph nature aligns well with the ONNX paradigm. However, converting TensorFlow models to ONNX is not always direct and may require additional tools:
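The most commonly used tool is the tf2onnx converter, available both as a command-line utility and as a Python API. The sketch below, assuming a small illustrative Keras model, uses the Python API; the layer sizes, file name, and opset are placeholders.

```python
import tensorflow as tf
import tf2onnx

# A small Keras model used purely as an illustration
model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

# Describe the expected input signature, then convert and save the ONNX file
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(
    model, input_signature=spec, opset=13, output_path="model.onnx"
)
```

For SavedModel directories, the equivalent command-line entry point is `python -m tf2onnx.convert --saved-model <dir> --output model.onnx`.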
While tf2onnx supports a wide range of TensorFlow operations, there can still be challenges, especially with newer or less common TensorFlow functionalities, which might require manual intervention or custom code to handle.
ONNX Runtime is a performance-focused engine for running ONNX models, designed to optimize and accelerate inference across diverse platforms and hardware configurations [3]. This makes it a major component of the ONNX ecosystem, especially for developers looking to maximize the execution speed and efficiency of their machine learning models.
Cross-Platform and Hardware Support: ONNX Runtime supports a wide array of environments, from cloud servers to edge devices, and is compatible with multiple operating systems like Windows, macOS, and Linux. It handles a variety of hardware with specific optimizations for CPUs, GPUs, and other accelerators, ensuring broad deployment capabilities.
Optimization Techniques: To enhance performance, ONNX Runtime applies several graph optimizations before model execution. These optimizations include layer fusion, operation elimination, and efficient memory utilization. By rearranging computations and merging layers, ONNX Runtime ensures optimal use of hardware resources, leading to faster inference.
Execution Providers (EPs): ONNX Runtime introduces the concept of Execution Providers, which are interfaces to various hardware accelerators. EPs allow ONNX Runtime to take full advantage of specific hardware features to accelerate model execution. Examples include the CUDA and TensorRT execution providers for NVIDIA GPUs, OpenVINO for Intel hardware, DirectML on Windows, CoreML on Apple devices, NNAPI on Android, and the default CPU execution provider.
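As a rough sketch, the execution provider is chosen when creating an inference session; the provider list, model file, and input name below are illustrative, and ONNX Runtime falls back to the next provider in the list (ultimately the CPU) when an accelerator is unavailable.

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order: CUDA if available, otherwise the CPU fallback
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Run inference; the input name and shape must match the exported graph
inputs = {"input": np.random.randn(1, 3, 224, 224).astype(np.float32)}
outputs = session.run(None, inputs)
print(session.get_providers(), outputs[0].shape)
```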
ONNX Runtime is equipped with various advanced techniques designed to optimize and accelerate model performance across different platforms. Here’s an in-depth look at these techniques:
1. Graph Optimizations
Graph optimizations in ONNX Runtime are designed to streamline the computational graph of a model to improve efficiency and speed [2]. They are applied at three distinct levels: basic, extended, and layout optimizations. The most common basic, semantics-preserving rewrites include the following (a short configuration sketch follows the list):
- Constant Folding: This optimization pre-computes parts of the graph that depend only on constant initializers, thus eliminating the need for these computations during runtime. By resolving these expressions during the compilation phase, we save valuable computational resources.
- Node Elimination: This involves removing nodes that do not affect the final output of the model, thereby simplifying the computational graph. Typical examples are removing Identity nodes, dropping Dropout nodes at inference time, and eliminating Slice or Unsqueeze operations that have no effect.
- Semantics-Preserving Node Fusions: These optimizations combine multiple nodes into a single, more efficient node without altering the overall functionality, for example folding a BatchNormalization node into the preceding Conv node or fusing Conv and Add.
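As a rough illustration, the optimization level is controlled through session options, and the optimized graph can optionally be written to disk for inspection; the file names below are placeholders.

```python
import onnxruntime as ort

options = ort.SessionOptions()
# Apply all available graph optimizations (constant folding, eliminations, fusions, ...)
options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally save the optimized graph so the applied rewrites can be inspected
options.optimized_model_filepath = "model_optimized.onnx"

session = ort.InferenceSession("model.onnx", options, providers=["CPUExecutionProvider"])
```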
2. Hardware-Specific Kernels
ONNX Runtime leverages hardware-specific kernels to further enhance performance. These kernels are highly optimized for particular types of hardware: on CPUs, for instance, it ships kernels that exploit vectorized instruction sets such as AVX2 and AVX-512, while GPU execution providers build on vendor libraries like cuDNN and TensorRT for NVIDIA hardware.
These specialized kernels are designed to make the most out of the underlying hardware, significantly boosting the performance of model inference.
3. Parallel Execution
ONNX Runtime supports parallel execution of model operations, which is a powerful way to reduce inference time. It can parallelize work both within a single operator (intra-op parallelism, splitting one computation across multiple threads) and across independent operators (inter-op parallelism, running separate branches of the graph concurrently).
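A minimal sketch of these settings through session options is shown below; the thread counts are arbitrary and should be tuned for the target machine.

```python
import onnxruntime as ort

options = ort.SessionOptions()
# Intra-op: number of threads used to parallelize a single operator
options.intra_op_num_threads = 4
# Inter-op: number of threads used to run independent operators concurrently
options.inter_op_num_threads = 2
# Allow independent branches of the graph to execute in parallel
options.execution_mode = ort.ExecutionMode.ORT_PARALLEL

session = ort.InferenceSession("model.onnx", options, providers=["CPUExecutionProvider"])
```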
At Ikomia, we took on the task of developing a text extraction model for ID cards using the Donut transformer model. Realizing the potential benefits, we decided to convert our PyTorch model to ONNX to boost its portability and speed up its inference times. However, the road was anything but smooth.
Transformers, with their intricate encoder-decoder structures, bring a unique set of challenges when moving to ONNX, largely due to their complexity and the sophisticated mechanisms they use. Here’s a quick rundown of the major hurdles we encountered:
1. Variable Sequence Lengths: Transformers are designed to handle texts of varying lengths, which complicates their conversion due to the need for dynamic axis management and the precise translation of padding and masking techniques (the sketch after this list shows how dynamic axes are declared at export time).
2. Encoder-Decoder Interaction: Setting up cross-attention layers and making sure that state transfer between the encoder and decoder is accurate during conversion is crucial for the model to work properly. Additionally, managing loops in the model's logic presented its own set of challenges, as ONNX expects static graph definitions.
3. Versioning and Compatibility: Staying current with ONNX versions while ensuring our model remained compatible with new transformer architectures and operator versions introduced more layers of complexity.
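To illustrate the first challenge, the sketch below exports a deliberately tiny text model with both the batch and sequence dimensions marked as dynamic; it is not our Donut pipeline, just a minimal example of how variable sequence lengths are declared at export time (all names and sizes are placeholders).

```python
import torch
import torch.nn as nn

class TinyTextModel(nn.Module):
    # Illustrative model: embeds token ids and mean-pools over the sequence
    def __init__(self, vocab_size=1000, dim=64, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, input_ids):
        hidden = self.embed(input_ids)   # (batch, seq, dim)
        pooled = hidden.mean(dim=1)      # collapse the sequence axis
        return self.classifier(pooled)

model = TinyTextModel().eval()
dummy_ids = torch.randint(0, 1000, (1, 16))

# Mark both the batch and sequence dimensions as dynamic so the exported
# graph accepts texts of varying lengths at inference time
torch.onnx.export(
    model,
    dummy_ids,
    "text_model.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"}},
    opset_version=17,
)
```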
Despite these challenges, we managed to push through and even provided an open-source package to help others streamline their conversion process.
It took a deep dive into the inner workings of our model and several days of intense effort. Ultimately, the conversion was worth it for the model's improved usability across platforms, although, unfortunately, it did not improve inference performance.
ONNX revolutionizes AI development by ensuring interoperability and portability of machine learning models across different frameworks and hardware platforms. Its core features include framework interoperability, hardware compatibility, and performance optimizations via ONNX Runtime, making it a valuable tool for seamless model deployment.
Although converting complex models can in some cases be challenging, the benefits of ONNX in enhancing model usability and deployment flexibility often outweigh these difficulties. As ONNX continues to evolve, it will become even more essential in AI workflows, enabling broader and more efficient use of AI models.
[1] https://onnx.ai/
[2] https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/
[3] https://onnxruntime.ai/
[4] https://onnxruntime.ai/docs/performance/mobile-performance-tuning.html