Mastering Mask R-CNN: Insights into Its Architecture and Diverse Uses

Allan Kouidri
Mask R-CNN instance segmentation on living room furniture

In the dynamic field of computer vision, Mask R-CNN is a pivotal framework, developed by He et al. in 2017. It excels in object detection and instance segmentation, enabling precise identification and outlining of objects in images.

This blog post explores Mask R-CNN’s architecture, functionality, applications, and implementation details.

What is Mask R-CNN?

Mask R-CNN (Mask Region-based Convolutional Neural Network) is an extension of Faster R-CNN [3], a popular object detection model. While Faster R-CNN efficiently locates objects in an image, Mask R-CNN goes a step further by generating a high-quality segmentation mask for each instance.

This model is thus not only able to pinpoint the location of objects within an image but also to precisely outline the shape of each object.

How does Mask R-CNN work?

Mask R-CNN is ingeniously designed with two stages: the region proposal network (RPN) for locating objects, and a network head for both classifying the objects and predicting the segmentation mask. Here’s a breakdown:

The Mask R-CNN framework for instance segmentation [1]

1. Backbone Network with FPN:

Pre-Trained Convolutional Neural Network:
  • Foundation: The backbone network, typically a pre-trained CNN such as ResNet or ResNeXt, is the first step in processing the input image.
  • Role: This backbone extracts high-level features from the image, which are crucial for understanding the complex patterns necessary for object detection.

Integration of Feature Pyramid Network (FPN):
Feature Pyramid Network [2]

  • Enhancing feature extraction: An FPN is added on top of the backbone network to enhance the capability of the model to handle objects of varying sizes and scales.
  • FPN architecture: The FPN creates a multi-scale feature pyramid by merging features from different layers of the backbone. This structure includes features with diverse spatial resolutions, encompassing both high-resolution features that provide precise spatial details and low-resolution features rich in semantic information.

Steps in FPN Functioning:
  • Feature extraction: Initially, the backbone network processes the input image to extract high-level features.
  • Feature fusion: The FPN establishes connections between various levels of the backbone. It creates a top-down pathway that fuses high-level semantic information with lower-level feature maps. This fusion enables the model to reuse and refine features at different scales, enhancing the detection capability.
  • Creation of Feature Pyramid: The fusion process results in a multi-scale feature pyramid. Each level of this pyramid corresponds to features of a different resolution. The finest levels hold high-resolution features suited to small objects, while the coarsest levels hold low-resolution features that cover larger regions of the image.
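
The top-down fusion described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the FPN implementation: it assumes single-channel feature maps stored as nested lists, each level exactly half the size of the previous one, and it leaves out the 1x1 lateral and 3x3 smoothing convolutions used in [2]. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2-D map by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add_maps(a, b):
    """Element-wise sum of two 2-D maps of identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_pyramid(laterals):
    """Fuse backbone maps ordered from fine to coarse.

    Starts at the coarsest map and repeatedly upsamples the merged result,
    adding it to the next finer backbone map.
    """
    merged = [laterals[-1]]
    for lateral in reversed(laterals[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    return merged[::-1]  # back to fine -> coarse order


# Toy backbone outputs: 4x4 (fine), 2x2, 1x1 (coarse)
c3 = [[1.0] * 4 for _ in range(4)]
c4 = [[10.0] * 2 for _ in range(2)]
c5 = [[100.0]]
p3, p4, p5 = top_down_pyramid([c3, c4, c5])
```

In this toy example every level of the resulting pyramid carries the contribution of all coarser levels, which is how semantic information reaches the high-resolution maps.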

Impact on Object Detection:
  • Handling objects of various sizes: The feature pyramid generated by the FPN enables Mask R-CNN to effectively detect objects of diverse sizes within an image.
  • Contextual information and accuracy: This multi-scale representation aids the model in capturing essential contextual information, allowing for more accurate and reliable detection and segmentation of objects at various scales.
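
To make the link between object size and pyramid level concrete, here is a small sketch of the level-assignment rule given in the FPN paper [2]: a region of width w and height h is mapped to level k = floor(k0 + log2(sqrt(w*h) / 224)), with k0 = 4. The function name and the clamping range of levels 2 to 5 are assumptions chosen for illustration.

```python
import math


def roi_to_level(width, height, k0=4, canonical_size=224, k_min=2, k_max=5):
    """Pick the pyramid level for a region of the given size (in pixels).

    Small regions go to fine, high-resolution levels; large regions go to
    coarse, low-resolution levels.
    """
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical_size))
    return max(k_min, min(k_max, k))  # clamp to the available levels


levels = [roi_to_level(s, s) for s in (20, 112, 224, 448)]
```

A 224x224 region lands on level 4, while a region half that size moves one level finer.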

2. Region Proposal Network (RPN):

  • Role: The RPN, introduced with Faster R-CNN [3], locates potential objects within the image. It slides a small network over the backbone feature maps and, at each position, scores a set of reference boxes (anchors) for how likely they are to contain an object.
  • Functionality: By generating candidate bounding boxes, known as proposals, the RPN narrows down the regions of interest. These proposals are then refined and used in subsequent stages for more detailed analysis.
  • Efficiency: Because the RPN shares convolutional features with the detection heads, generating proposals adds little computational overhead compared with external region proposal methods such as selective search.
Region proposal network illustration
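
As a rough illustration of the sliding-window idea, the sketch below generates the reference boxes (anchors) that an RPN would score at every feature-map position. The scale, aspect ratios and stride are illustrative values, and the function name is hypothetical; the scoring and regression layers themselves are not shown.

```python
import math


def generate_anchors(feat_h, feat_w, stride, scales=(32,), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) in image coordinates.

    One anchor per (position, scale, ratio); ratio is height / width and
    each anchor has an area of scale ** 2.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of the feature-map cell, projected back onto the image
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


# A 2x3 feature map with stride 16 and 3 ratios gives 2 * 3 * 3 anchors
anchors = generate_anchors(2, 3, stride=16)
```

The RPN then predicts, for each anchor, an objectness score and a set of box offsets; the highest-scoring boxes become the proposals.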

3. ROI Align:

  • Advancement over ROI Pooling: ROI Align is a significant enhancement over ROI Pooling. It addresses the issue of spatial misalignments caused by the quantization process in ROI Pooling.
  • Methodology: ROI Align utilizes bilinear interpolation to accurately extract feature maps for each proposed object region. This method ensures that the extracted features align precisely with the objects, leading to more accurate segmentation and classification.
ROI Align illustration
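
The following is a minimal sketch of the ROI Align idea, assuming a single-channel feature map stored as nested lists, feature values located at integer coordinates, and a box given as (y1, x1, y2, x2) in feature-map coordinates. The function names and the choice of 2x2 sampling points per bin are illustrative. The key point is that box coordinates are never rounded: each sampling point is evaluated by bilinear interpolation.

```python
import math


def bilinear_sample(fmap, y, x):
    """Interpolate the feature map at a fractional (y, x) location."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, box, out_size=2, samples=2):
    """Pool a box into an out_size x out_size grid without quantization."""
    y1, x1, y2, x2 = box
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            # Average regularly spaced sampling points inside the bin
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(fmap, y, x)
            row.append(total / samples ** 2)
        out.append(row)
    return out


# Feature map whose value equals the column index, so results are easy to check
fmap = [[float(x) for x in range(4)] for _ in range(4)]
pooled = roi_align(fmap, box=(0.0, 0.5, 2.0, 2.5))
```

With ROI Pooling, the box edge at x = 0.5 would first be rounded to a whole cell, shifting the extracted features; here the fractional position is preserved.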

4. Classification and Bounding Box Regression Head:

  • Dual function: This component of Mask R-CNN simultaneously handles object classification and bounding box refinement.
  • Classification: For each region proposed by the RPN, the network predicts the object's class, distinguishing between different types of objects in the image.
  • Bounding box regression: Alongside classification, this head adjusts the coordinates of each bounding box proposal, refining their size and position to more accurately encompass the object.
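
Bounding box refinement is commonly expressed as four offsets (dx, dy, dw, dh) applied to a proposal. The sketch below uses the standard R-CNN parameterization, where the centre is shifted proportionally to the box size and the width and height are scaled exponentially. The function name is hypothetical, and details such as delta normalization or clipping to the image are omitted.

```python
import math


def apply_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with predicted (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    # Shift the centre relative to the box size, rescale width and height
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


proposal = (0.0, 0.0, 10.0, 20.0)
shifted = apply_deltas(proposal, (0.1, 0.0, 0.0, 0.0))           # moves right by 10% of the width
widened = apply_deltas(proposal, (0.0, 0.0, math.log(2.0), 0.0))  # doubles the width
```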

5. Mask Prediction Head:

Mask R-CNN head architecture
Head Architecture [4]
  • Segmentation function: This branch is dedicated to predicting the segmentation mask of each object at a pixel level.
  • Fully convolutional network (FCN): A small FCN is applied to each Region of Interest (ROI). It predicts one binary mask per class, and the mask of the class chosen by the classification head is kept as the outline of the object.
  • Precision: The mask prediction is per-pixel, allowing for detailed and accurate segmentation, which is crucial for tasks requiring fine-grained object outlines.
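
A minimal sketch of how the mask head output can be turned into a binary mask, assuming the head returns one small grid of raw scores (logits) per class for a given ROI. The function names, the toy 2x2 grids and the 0.5 threshold are illustrative; resizing the mask to the bounding box is left out.

```python
import math


def sigmoid(v):
    """Squash a raw score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def select_binary_mask(mask_logits, class_id, threshold=0.5):
    """Keep the mask of the predicted class and binarize it per pixel.

    mask_logits: list indexed by class, each entry a 2-D grid of logits.
    """
    probs = [[sigmoid(v) for v in row] for row in mask_logits[class_id]]
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


# Two classes, 2x2 mask resolution (real mask heads use larger grids)
mask_logits = [
    [[-2.0, 2.0], [3.0, -1.0]],  # class 0
    [[5.0, 5.0], [5.0, 5.0]],    # class 1
]
mask = select_binary_mask(mask_logits, class_id=0)
```

Because each class has its own mask and each pixel is scored independently, masks of different classes do not compete with one another.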


Applications of Mask R-CNN

Mask R-CNN has been widely adopted across various applications:

  • Medical imaging: It’s used for segmentation tasks like tumor detection and organ delineation.
  • Autonomous vehicles: Helps in understanding the environment by segmenting and identifying different objects like pedestrians, vehicles, and road signs.
  • Agriculture: Used for crop and weed detection, aiding in precision agriculture.
  • Augmented reality: Enhances AR experiences by allowing better interaction of virtual objects with the real world.

Challenges and Limitations

Despite its prowess, Mask R-CNN faces challenges such as:

  • Computational intensity: It requires substantial computational resources, making real-time applications challenging.
  • Complexity in small object detection: Mask R-CNN sometimes struggles with detecting and segmenting small objects.

Run Mask R-CNN with a few lines of code

Simplify instance segmentation with Mask R-CNN via the Ikomia API. This approach minimizes coding complexity and setup dependencies.

Get started by installing the Ikomia API in a virtual environment [5].

pip install ikomia

Alternatively, access our notebook for direct use.

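from ikomia.dataprocess.workflow import Workflow
from ikomia.utils import ik
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add the Mask R-CNN inference algorithm (parameter values are passed as strings)
algo = wf.add_task(ik.infer_torchvision_mask_rcnn(
    conf_thres="0.5",
    iou_thres="0.5"
), auto_connect=True)

# Run on your image (placeholder path: replace it with your own image)
wf.run_on(path="path/to/your/image.jpg")

# Inspect your result
display(algo.get_image_with_mask_and_graphics())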
Mask R-CNN on living room furniture

The algorithm accepts the following parameters:
  • conf_thres (float, default=0.5): Confidence threshold for predicted boxes, in [0, 1].
  • iou_thres (float, default=0.5): Intersection over Union (IoU) threshold, i.e. the degree of overlap between two boxes, in [0, 1].
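
To show what these two thresholds typically control, here is a plain-Python sketch of confidence filtering followed by greedy non-maximum suppression. It is an illustration of the general technique, not the code the Ikomia algorithm runs; the assumption that iou_thres is used to suppress overlapping boxes, and the function names, are ours.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, conf_thres=0.5, iou_thres=0.5):
    """Return indices of boxes kept after thresholding and greedy NMS."""
    # 1. Drop low-confidence boxes, then visit the rest by decreasing score
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    # 2. Keep a box only if it does not overlap too much with a kept one
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = filter_detections(boxes, scores)
```

Here the second box is suppressed because it overlaps the first one heavily, and the last box is dropped for low confidence. Raising conf_thres keeps fewer, more certain detections; lowering iou_thres removes more overlapping boxes.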

Optionally, you can load your own Mask R-CNN model if it was trained with the train_torchvision_mask_rcnn algorithm:

  • model_weight_file (str, optional, default=''): Path to the model weights file (.pth). If not provided, the pretrained weights from torchvision are used.
  • class_file (str, optional): Path to the class names file. Defaults to the COCO 2017 classes.

Create your custom workflows with Ikomia

Optimize your instance segmentation models with Ikomia's flexible and user-friendly tools:

Fine-tune your model for peak performance →  

  • Learn more about the Ikomia API in our detailed documentation.
  • Discover a range of state-of-the-art algorithms on Ikomia HUB.
  • Utilize Ikomia STUDIO for a seamless no-code experience, offering the same capabilities as the API.


[1] Mask R-CNN

[2] Feature Pyramid Networks for Object Detection 

[3] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks 

[4] Mask R-CNN 

[5] How to create a virtual environment in Python
