YOLACT: Pioneering Real-Time Instance Segmentation

Allan Kouidri
YOLACT instance segmentation on zebra

YOLACT, which stands for "You Only Look At CoefficienTs," is a real-time instance segmentation method introduced in the paper "YOLACT: Real-time Instance Segmentation." The paper reports 29.8 mAP on MS COCO at 33.5 fps on a single Titan Xp, which made it considerably faster than earlier approaches of comparable accuracy. Let's dive deeper into what makes YOLACT stand out in the field of computer vision.

YOLACT instance segmentation on a group of elephants

How does YOLACT work?

Instance segmentation is a demanding computer vision task: it requires detecting every object instance in an image and delineating each one with a pixel-level mask. Two-stage methods such as Mask R-CNN, which extends the Faster R-CNN detector with a mask branch, are accurate but too slow for real-time use. YOLACT takes a lighter, one-stage route that runs in real time while keeping accuracy competitive.

YOLACT divides the instance segmentation process into two parallel tasks: (1) generating prototype masks and (2) predicting per-instance mask coefficients. The ingenious part of YOLACT lies in its ability to separate these two tasks, thereby simplifying the overall process:

  • Prototype Masks (1): A fully convolutional branch produces a set of prototype masks that span the entire image. The prototypes are learned during training and are not tied to any particular object or class; together they form a shared basis from which the mask of any instance can be built.

  • Per-Instance Coefficients (2): In parallel, YOLACT predicts a vector of mask coefficients for each detected instance, one coefficient per prototype. These coefficients determine how strongly each prototype contributes, positively or negatively, to that instance's mask.

  • Final Instance Masks: The final instance masks are obtained by linearly combining the prototype masks with the per-instance coefficients and passing the result through a sigmoid. Because this amounts to a single matrix multiplication, mask assembly adds very little computation on top of detection.
YOLACT architecture
YOLACT model overview [1]
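
To make the combination step concrete, here is a minimal NumPy sketch of mask assembly, following the formulation in the YOLACT paper (masks = sigmoid of prototypes times transposed coefficients). The function name `assemble_masks` and the toy 2x2 inputs are illustrative assumptions, not part of any library.

```python
import numpy as np

def assemble_masks(prototypes, coefficients):
    """Combine prototypes (H, W, k) with coefficients (n, k) into (H, W, n) soft masks."""
    logits = prototypes @ coefficients.T      # linear combination, shape (H, W, n)
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid maps values into (0, 1)

# Toy example (hypothetical values): 2 prototypes on a 2x2 grid, 2 instances.
prototypes = np.zeros((2, 2, 2))
prototypes[0, :, 0] = 1.0   # prototype 0 is active on the top row
prototypes[1, :, 1] = 1.0   # prototype 1 is active on the bottom row

coefficients = np.array([
    [5.0, -5.0],    # instance 0: add prototype 0, subtract prototype 1
    [-5.0, 5.0],    # instance 1: the opposite
])

masks = assemble_masks(prototypes, coefficients)
binary = masks > 0.5        # threshold the soft masks
```

In this toy case instance 0 ends up covering the top row and instance 1 the bottom row. In the paper, the assembled masks are additionally cropped with the predicted bounding box at inference time, which this sketch leaves out.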

The Architecture of YOLACT

YOLACT's architecture is built from a small number of components that work together to deliver fast and accurate real-time instance segmentation.

Here's a closer look at its key features:

Backbone network

  • Base Feature Extraction: YOLACT utilizes a backbone network, typically a variant of a standard architecture like ResNet or DarkNet, to extract fundamental features from the input image. This network is crucial for capturing the underlying patterns and structures in the visual data.

  • Feature Pyramid Network (FPN): On top of the backbone, YOLACT employs an FPN. This network enhances the feature extraction process by integrating information across multiple scales. It effectively captures both high-level semantic information and low-level details, which are crucial for accurate segmentation.
Feature pyramid network
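
The multi-scale merging done by an FPN can be sketched in a few lines. The example below is a simplified illustration only: it assumes nearest-neighbour upsampling and omits the 1x1 lateral and 3x3 smoothing convolutions that a real FPN applies, and the function names are hypothetical.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_merge(laterals):
    """Merge feature maps ordered fine -> coarse, each half the size of the previous.

    Starting from the coarsest map, each level is upsampled and added to the
    next finer one, so semantic information flows down to high resolutions.
    """
    outputs = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        outputs.insert(0, lateral + upsample2x(outputs[0]))
    return outputs

# Toy pyramid (hypothetical sizes): three single-channel maps filled with ones.
pyramid = top_down_merge([
    np.ones((1, 4, 4)),
    np.ones((1, 2, 2)),
    np.ones((1, 1, 1)),
])
```

Each output keeps the resolution of its input level, while the finest level accumulates contributions from every coarser level above it.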


Protonet

  • Prototype Masks Generation: The Protonet in YOLACT is a small fully convolutional network that creates the set of prototype masks. These masks are general representations of shapes and regions found in the image, and they are shared by all instances. The Protonet is attached to a high-resolution layer of the FPN, which allows it to generate detailed prototype masks.

Prediction Heads

  • Classification and box regression: Alongside the Protonet, YOLACT includes prediction heads for object classification and bounding box regression. These heads are applied to each level of the FPN and predict, for every anchor, the object category and the box location.

  • Mask coefficients: Another crucial component is the prediction of mask coefficients. For each detected object, a set of coefficients is predicted. These coefficients are key to customizing the prototype masks for individual instances.

Illustration YOLACT head architecture
Head Architecture. c: classes; a: anchors; Pi: feature layer; k: prototypes [1]
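
As a rough illustration of the caption above, the number of output channels of each head branch follows directly from c, a, and k. The helper below is a hypothetical sketch; the example values (81 classes, 3 anchors, 32 prototypes) are only illustrative.

```python
def head_output_channels(num_classes, num_anchors, num_prototypes):
    """Output channels of each branch at one feature-map location."""
    return {
        "class": num_anchors * num_classes,     # c confidences per anchor
        "box": num_anchors * 4,                 # 4 box regressors per anchor
        "mask": num_anchors * num_prototypes,   # k mask coefficients per anchor
    }

# Illustrative values: c = 81, a = 3, k = 32.
sizes = head_output_channels(num_classes=81, num_anchors=3, num_prototypes=32)
```

Each anchor therefore carries c + 4 + k values, and the mask branch is the only addition compared with a plain one-stage detector head.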

Assembly of final masks

  • Linear combination: The final step in YOLACT’s architecture is the assembly of instance-specific masks. This is achieved by linearly combining the prototype masks with the predicted coefficients for each object. This process, unlike traditional segmentation approaches, is computationally efficient and allows for real-time performance.

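  • Fast Non-Maximum Suppression (NMS): To remove duplicate detections, YOLACT applies Fast NMS, a variant of NMS that decides which detections to keep or discard in parallel using matrix operations instead of a sequential loop. It is slightly more aggressive than standard NMS, because a detection that has itself been suppressed can still suppress others, but it is considerably faster.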
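
Fast NMS, as described in the YOLACT paper, sorts detections by score, computes a pairwise IoU matrix, keeps only its upper triangle, and discards any detection whose maximum IoU with a higher-scoring detection exceeds a threshold. The NumPy sketch below handles a single class for simplicity; the function names and toy boxes are assumptions made for illustration.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for boxes given as an (n, 4) array of [x1, y1, x2, y2]."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def fast_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the kept boxes, highest score first."""
    order = np.argsort(-scores)                      # descending score
    iou = np.triu(pairwise_iou(boxes[order]), k=1)   # IoU with higher-scored boxes only
    max_iou = iou.max(axis=0)                        # column-wise maximum
    return order[max_iou <= iou_threshold]

# Toy example: B overlaps A, and C overlaps B but barely overlaps A.
boxes = np.array([
    [0.0, 0.0, 10.0, 10.0],   # A, score 0.9
    [3.0, 0.0, 13.0, 10.0],   # B, score 0.8, IoU with A is about 0.54
    [6.0, 0.0, 16.0, 10.0],   # C, score 0.7, IoU with B is about 0.54, with A 0.25
])
scores = np.array([0.9, 0.8, 0.7])
keep = fast_nms(boxes, scores)
```

Here only A survives. Standard sequential NMS would also keep C, because B is removed before it gets a chance to suppress C; Fast NMS accepts this extra suppression in exchange for a fully vectorized computation.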

Optimization Strategies

  • Anchor optimization: YOLACT includes an anchor optimization strategy that improves the quality of the bounding box predictions. This optimization is critical for ensuring that the generated masks align accurately with the objects in the image.

  • Loss function: The model's training is guided by a composite loss function, a weighted sum of a classification loss, a box regression loss, and a mask loss (pixel-wise binary cross-entropy between the assembled and ground-truth masks). The paper also adds an auxiliary semantic segmentation loss that is used only during training, so it improves accuracy without slowing down inference.
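
As a sketch of how such a weighted sum might look, the snippet below combines the three main terms using the weights reported in the YOLACT paper (1, 1.5, and 6.125 for classification, box, and mask loss). The function names and the flat-list mask representation are simplifying assumptions; a real training loop would use tensors and include any auxiliary terms.

```python
import math

def mask_bce(pred, target, eps=1e-7):
    """Mean pixel-wise binary cross-entropy; pred and target are flat lists."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def total_loss(l_cls, l_box, l_mask, w_cls=1.0, w_box=1.5, w_mask=6.125):
    """Weighted sum of the three main loss terms."""
    return w_cls * l_cls + w_box * l_box + w_mask * l_mask

# A maximally uncertain mask prediction (0.5 everywhere) costs ln(2) per pixel.
l_mask = mask_bce([0.5, 0.5], [1, 0])
loss = total_loss(l_cls=1.0, l_box=1.0, l_mask=1.0)
```

The comparatively large mask weight reflects that mask quality is the main thing this model adds on top of a standard detector.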

Advantages of YOLACT

  • Speed: YOLACT's architecture allows for real-time performance, making it suitable for applications requiring immediate processing, like autonomous vehicles or real-time video analysis.
  • Accuracy: Despite its speed, YOLACT maintains a high level of accuracy, comparable to more complex and slower instance segmentation methods.
  • Simplicity: The separation of mask generation into prototypes and coefficients simplifies the network, reducing the computational overhead and making it easier to train and deploy.


Applications of YOLACT

  • Autonomous vehicles: YOLACT can be used for real-time object and obstacle detection, crucial for autonomous navigation.
  • Medical imaging: In medical diagnostics, YOLACT aids in segmenting specific structures in medical scans for more accurate analysis.
  • Agriculture: For precision agriculture, YOLACT assists in identifying and segmenting individual plants, enabling targeted treatment or harvesting.

Easily Implement YOLACT for image instance segmentation

Utilize the Ikomia API for a simplified and low-effort approach to instance segmentation with YOLACT. This method effectively minimizes the usual complexities involved in coding and setting up dependencies, offering a more user-friendly experience.


Before diving into the capabilities of the Ikomia API, it's crucial to establish a proper working environment. Begin by setting up the API within a virtual environment [3]. This initial step is key to ensuring an efficient and hassle-free experience with the API’s comprehensive functionalities.

pip install ikomia

Run YOLACT with a few lines of code

You can also directly open the notebook we have prepared.

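from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
yolact = wf.add_task(name="infer_yolact", auto_connect=True)

# Set parameters (values are passed as strings)
yolact.set_parameters({
    "conf_thres": "0.5",
    "top_k": "10",
})

# Run on your image (replace the placeholder with the path to your own image)
wf.run_on(path="path/to/your/image.jpg")

# Inspect your result
display(yolact.get_image_with_mask_and_graphics())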

YOLACT instance segmentation on a group of zebra
  • conf_thres (float) - default '0.15': Minimum confidence score, in the range [0, 1], for a detection to be considered valid.
  • top_k (int) - default '15': Maximum number of detections to keep in an image.
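
To illustrate how these two parameters interact, here is a small, library-independent sketch of score filtering. It is only an assumption about the general idea (threshold first, then keep the best top_k); the helper `filter_detections` is hypothetical and the exact behavior inside infer_yolact may differ.

```python
def filter_detections(scores, conf_thres=0.15, top_k=15):
    """Return indices of the detections kept, best score first."""
    kept = [i for i, s in enumerate(scores) if s >= conf_thres]   # confidence filter
    kept.sort(key=lambda i: scores[i], reverse=True)              # rank by score
    return kept[:top_k]                                           # cap the count

scores = [0.9, 0.1, 0.6, 0.3]
strict = filter_detections(scores, conf_thres=0.5, top_k=10)   # high threshold
default = filter_detections(scores)                             # default settings
capped = filter_detections(scores, top_k=2)                     # low cap
```

Raising conf_thres trades recall for fewer false positives, while top_k simply bounds how many instances are returned in crowded scenes.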

Building Custom Workflows with Ikomia

In this tutorial, we've delved into the creation of an instance segmentation workflow utilizing YOLACT. 

In the world of object detection, fine-tuning your model for specific needs and integrating it with other advanced models is often a critical step. 

Explore fine-tuning your own YOLOv8 instance segmentation model →  

For more detailed information on utilizing the API, refer to our comprehensive documentation.

Additionally, you can explore a wide range of state-of-the-art algorithms available on the Ikomia HUB. Don’t miss out on Ikomia STUDIO, which provides an intuitive interface, offering the same extensive functionalities as the API but in a more user-friendly environment.


References

[1] Deep Residual Learning for Image Recognition

[2] Mask R-CNN by Renjith Ms

[3] How to create a virtual environment in Python
