Understanding SegFormer: The Future of Semantic Segmentation

Allan Kouidri
SegFormer - Illustration of a bird's-eye view of a vineyard with a robot

In the rapidly advancing field of Computer Vision, semantic segmentation serves as a foundational technique, employed in a variety of applications ranging from autonomous driving to precision agriculture. 

Traditional models have long relied on convolutional neural networks (CNNs) to process images. But as the demand for flexibility and adaptability grows, the industry seeks a paradigm shift. Enter SegFormer—a semantic segmentation model that harnesses the prowess of transformers, renowned for processing non-grid structured data. 

This article explores the nuances of SegFormer, examining its essential components and highlighting its distinct benefits.

Moreover, we'll guide you through training this powerful model on a custom vineyard dataset using the Ikomia STUDIO, a no-code platform.

Why consider Ikomia STUDIO?

  • Streamlined workflows: Bypass the hassles of manual setup, virtual environments, and dependency management.
  • User-friendly: Ideal for both experts and beginners, thanks to its no-code approach.
  • Empowerment: Explore advanced Computer Vision models without facing technical complexities, while retaining full customization capabilities.
  • Stay updated: Use the latest SOTA algorithms.

Whether you're an expert in Computer Vision or just venturing into the field, this guide offers an in-depth look at SegFormer's transformative impact on semantic segmentation. Dive in and discover the next chapter of image segmentation.

What is SegFormer?

SegFormer is a semantic segmentation model that applies transformers to images. Unlike traditional convolutional neural networks (CNNs), which build up context through local filters sliding over the pixel grid, transformers split the image into patches and use self-attention to relate each patch to all the others. This gives the model a global view of the scene and makes the architecture adaptable to diverse data types and tasks.

SegFormer, introduced in the paper "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers" [2], has been designed to efficiently and effectively address the challenges of semantic segmentation.

What is the structure of the SegFormer?

SegFormer architecture

1. Hybrid transformer backbone

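SegFormer extracts features from the input image with a hybrid transformer backbone, called Mix Transformer (MiT) in the paper. Convolutional layers first split the image into overlapping patches, then transformer blocks capture the global context of the image. The backbone is hierarchical: it repeats this pattern over four stages, each producing a feature map at a coarser resolution than the previous one.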

2. Multi-Scale feature integration

To handle objects and features of varying scales in an image, SegFormer amalgamates multi-scale feature maps derived from different transformer layers. This multi-scale feature integration enables the model to recognize and accurately segment objects of different sizes and shapes.
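To make the multi-scale idea concrete, here is a minimal sketch of the feature map sizes produced by the four backbone stages. It assumes the downsampling factors reported in the SegFormer paper (1/4, 1/8, 1/16 and 1/32 of the input resolution) and the channel widths of the MiT-b2 backbone; the function name is hypothetical and only illustrates the arithmetic.

```python
from typing import List, Tuple

# Downsampling factor of each of the four backbone stages (from the SegFormer paper).
STAGE_STRIDES = (4, 8, 16, 32)
# Channel width of each stage for the MiT-b2 backbone.
MIT_B2_CHANNELS = (64, 128, 320, 512)


def feature_pyramid_shapes(height: int, width: int) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) of the feature map produced by each stage.

    Assumes the input height and width are divisible by 32.
    """
    shapes = []
    for stride, channels in zip(STAGE_STRIDES, MIT_B2_CHANNELS):
        shapes.append((channels, height // stride, width // stride))
    return shapes


# Example: the 320-pixel input size used later in this tutorial.
for channels, h, w in feature_pyramid_shapes(320, 320):
    print(f"{channels:>3} channels at {h}x{w}")
```

The early, high-resolution maps keep fine details such as thin trunks, while the late, low-resolution maps summarize larger regions such as sky or soil.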

3. All-MLP decoder head

The decoder is a distinct component of SegFormer: a lightweight head built only from multilayer perceptron (MLP) layers. It projects the feature maps from each backbone stage to a common channel size, upsamples them to the same resolution, and fuses them, so that the segmentation model can use features from all scales. This is crucial for maintaining high-resolution details and recognizing small objects, enhancing the model's segmentation performance.
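The fusion of feature maps from different levels can be sketched in a few lines of plain Python. This is a simplified illustration in the spirit of the all-MLP decoder described in the SegFormer paper, not the actual implementation: it uses nearest-neighbor upsampling where the paper uses bilinear interpolation, it stops after concatenation (the real head applies further MLP layers to fuse and classify), and all function names and weights are hypothetical.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]
Weights = List[List[float]]           # indexed as [out_channel][in_channel]


def project(feature: FeatureMap, weights: Weights) -> FeatureMap:
    """Apply the same linear layer to the channel vector of every pixel."""
    rows, cols = len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(w * feature[c][r][k] for c, w in enumerate(w_out)) for k in range(cols)]
            for r in range(rows)
        ]
        for w_out in weights
    ]


def upsample_nearest(feature: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every pixel `factor` times along both spatial axes."""
    return [
        [[v for v in row for _ in range(factor)] for row in channel for _ in range(factor)]
        for channel in feature
    ]


def fuse(features: List[FeatureMap], projections: List[Weights]) -> FeatureMap:
    """Project each level, upsample it to the finest level's size, concatenate channels."""
    target_rows = len(features[0][0])  # the first level is assumed to be the finest
    fused: FeatureMap = []
    for feature, weights in zip(features, projections):
        projected = project(feature, weights)
        factor = target_rows // len(projected[0])
        fused.extend(upsample_nearest(projected, factor))
    return fused


# Toy example: a fine 1-channel 4x4 map and a coarse 2-channel 2x2 map.
fine = [[[1.0] * 4 for _ in range(4)]]
coarse = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
fused = fuse([fine, coarse], [[[1.0]], [[1.0, 1.0]]])
print(len(fused), "channels at", len(fused[0]), "x", len(fused[0][0]))
```

After fusion, every pixel of the high-resolution grid carries information from both the fine and the coarse level, which is what lets the final classifier combine detail with context.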

Advantages of SegFormer

- Simplicity and efficiency: SegFormer introduces a simple yet effective architecture that doesn't require intricate designs or complex auxiliary training strategies, which are often used in conventional semantic segmentation models.

- Versatility: The model handles a variety of segmentation tasks across numerous domains without architectural changes. Adapting it to a new domain only requires fine-tuning on representative data, as this tutorial does with vineyard images.

- Scalability: SegFormer demonstrates remarkable scalability, performing efficiently on images of different resolutions and scales.

Benchmark performance

SegFormer benchmark

At the time of its publication, SegFormer set new state-of-the-art results on several benchmark datasets, such as ADE20K, Cityscapes, and COCO-Stuff, underlining its efficacy and robustness in semantic segmentation.

Automatic vineyards segmentation

This guide illustrates how to train SegFormer, a high-performing semantic segmentation model, on a custom dataset without engaging in any coding. 

We aim to develop a model for a viticultural robot capable of autonomous driving in vineyards, enabling it to traverse and recognize obstacles within the inter-row spacing.

The development of an autonomous viticultural robot for vineyards could offer enhanced efficiency in farming by providing precision agriculture, labor savings, and sustainable practices. With capabilities like obstacle detection, data collection, and modular design, it promises to revolutionize vineyard management while reducing costs and environmental impact.

Vineyards dataset

For this tutorial, we're using a small vineyard dataset [1] from Roboflow with 71 images to illustrate the training of our custom segmentation model. While this compact dataset is ideal for demonstration purposes, a production-level application would demand a larger, more varied dataset to reach reliable accuracy and robustness. The dataset contains four labels: plant, trunk, sky, and soil.

SegFormer results

How to train SegFormer on a custom dataset?

Training a custom semantic segmentation model has never been easier. The Ikomia HUB provides all the necessary building blocks for our training pipeline, and it's ready to use with no code required. You'll be able to start your training in just a few steps.

Train without code with Ikomia STUDIO

To get started, you need to install the STUDIO desktop app for Computer Vision available for both Windows and Linux users. 

STUDIO offers a no-code interface, making computer vision tasks accessible without compromising on depth or performance. While it's lightweight on system resources, users still maintain detailed control over hyper-parameters. Professionals and beginners alike can navigate its features with ease.

Plus, its open-source foundation ensures transparency, allowing users to inspect, modify, and adapt the tool to their evolving needs.

Install algorithms from the HUB 

To build a custom training workflow, we only need two algorithms.

Dataset loader

The initial step involves converting the dataset into the Ikomia format to ensure compatibility with all training algorithms.

The annotations of the vineyard dataset are stored in the Common Objects in Context (COCO) format. This format allows for detailed annotations of objects within each image, using a JSON file to hold this information.
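For readers curious about what such a file contains, the sketch below parses a tiny COCO-style document and counts annotations per label. The top-level keys (images, categories, annotations) follow the COCO format; the file name, image size, and polygon coordinates are made up for illustration, and the helper function is hypothetical rather than part of any Ikomia algorithm.

```python
import json
from collections import Counter
from typing import Dict


def count_annotations_per_category(coco: dict) -> Dict[str, int]:
    """Count how many annotations each category name has in a COCO-style dict."""
    id_to_name = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter(id_to_name[ann["category_id"]] for ann in coco["annotations"])
    return dict(counts)


# Tiny in-memory example; a real file would be read with json.load(open(path)).
example = json.loads("""
{
  "images": [{"id": 1, "file_name": "row_001.jpg", "width": 640, "height": 480}],
  "categories": [{"id": 1, "name": "plant"}, {"id": 2, "name": "trunk"}],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "segmentation": [[10, 10, 200, 10, 200, 300, 10, 300]]},
    {"id": 2, "image_id": 1, "category_id": 2,
     "segmentation": [[50, 300, 80, 300, 80, 470, 50, 470]]},
    {"id": 3, "image_id": 1, "category_id": 1,
     "segmentation": [[300, 20, 600, 20, 600, 280, 300, 280]]}
  ]
}
""")

print(count_annotations_per_category(example))
```

A quick count like this is a cheap sanity check before training: a label with very few annotations is likely to be segmented poorly.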

Training algorithm

Next, we train our semantic segmentation model, SegFormer, using its Hugging Face implementation.

This algorithm not only supports the SegFormer model but also includes other model architectures, such as BEiT and Data2VecVision, offering flexibility to select the model that aligns best with your requirements.

For this tutorial, we've opted for the SegFormer-b2 model. The SegFormer-b2 balances both accuracy and computational demands, providing strong segmentation results without taxing system resources excessively. If you prioritize maximum accuracy and can accommodate higher computational needs, consider exploring the "b4" or "b5" variants.

Installation steps 

1. Navigate to the HUB within Ikomia STUDIO.

2. Search for and install, in order, the algorithms dataset_coco and train_hf_semantic_seg.

Load the dataset in Ikomia STUDIO using the “dataset_coco” algorithm

The COCO dataset algorithm in STUDIO enables loading of any dataset in the COCO format and seamlessly integrates with your training algorithm. Additionally, once the dataset is loaded, STUDIO provides a visual inspection tool for your annotations, ensuring their accuracy and correctness.

  1. Locate the algorithm:

  • Search for the recently installed plugin within the process library, found in the left pane.

  2. Configure parameters:

  • COCO json file: Specify the path to the COCO JSON file (e.g., ‘path/to/dataset/train/_annotations.coco.json’).
  • Image folder: Define the path to the image folder (e.g., ‘path/to/dataset/train’).
  • Task: Identify the task being performed by the training algorithm, in this case, semantic_segmentation.
  • Output folder: The COCO format stores instance segmentation annotations rather than semantic segmentation masks, so the masks must be computed from the instance annotations. Specify the folder where these generated masks will be stored.

  3. Apply settings:

  • Click on the "Apply" button to load the dataset with the configured settings.
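The conversion from instance masks to a semantic mask mentioned in the output folder parameter can be pictured as follows. This is only a conceptual sketch under stated assumptions: the article does not describe how dataset_coco performs the conversion, so the function name, the background value of 0, and the rule that later instances overwrite earlier ones are all hypothetical choices.

```python
from typing import List, Tuple

Mask = List[List[int]]  # indexed as [row][col]


def instances_to_semantic(
    height: int,
    width: int,
    instances: List[Tuple[int, Mask]],
    background: int = 0,
) -> Mask:
    """Merge per-instance binary masks into one mask of class ids.

    Each instance is a (class_id, binary_mask) pair. Where instances overlap,
    the one listed last wins (an assumption made for this sketch).
    """
    semantic = [[background] * width for _ in range(height)]
    for class_id, mask in instances:
        for r in range(height):
            for c in range(width):
                if mask[r][c]:
                    semantic[r][c] = class_id
    return semantic


# Two overlapping instances on a 2x3 image: class 1 (plant) and class 2 (trunk).
plant = (1, [[1, 1, 0], [0, 0, 0]])
trunk = (2, [[0, 1, 1], [0, 0, 1]])
for row in instances_to_semantic(2, 3, [plant, trunk]):
    print(row)
```

The key difference from instance segmentation is visible in the result: the output no longer distinguishes individual objects, only which class each pixel belongs to.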

Setting up the SegFormer training algorithm

With the dataset loaded, it's time to incorporate the SegFormer training job into the workflow.

  1. Locating the algorithm:

Navigate to the process library (left pane) and search for the recently installed train_hf_semantic_seg algorithm.

  2. Adjusting parameters before training:

Before starting the training, review and configure the available parameters to get the most out of your model:

  • Model name: Name of the model to train. train_hf_semantic_seg includes several model architectures from Hugging Face.
  • Batch size: Number of samples processed before the model is updated.
  • Epochs: Number of complete passes through the training dataset.
  • Image size: Size of the input image.
  • Learning rate: Step size at which the model's parameters are updated during training.
  • Test image percentage: Share of the dataset set aside for evaluation; the remaining images are used for training.
  • Advanced YAML config (optional): path to the training config file .yaml.
  • Output folder (optional): path to where the fine-tuned model will be saved.
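To see how batch size, epochs, and the test image percentage interact, the following sketch computes the number of model updates for a given configuration. The 71 images and 10 epochs come from this tutorial; the 10% evaluation share and the batch size of 4 are hypothetical values chosen for illustration, and the rounding rule is an assumption that may differ from what the training algorithm does internally.

```python
import math
from typing import Dict


def training_schedule(
    num_images: int, eval_fraction: float, batch_size: int, epochs: int
) -> Dict[str, int]:
    """Estimate the train/eval split and the number of optimizer steps."""
    num_eval = round(num_images * eval_fraction)  # assumed rounding rule
    num_train = num_images - num_eval
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_train / batch_size)
    return {
        "train_images": num_train,
        "eval_images": num_eval,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


print(training_schedule(num_images=71, eval_fraction=0.1, batch_size=4, epochs=10))
```

With a dataset this small, each epoch consists of only a handful of updates, which is why several epochs are needed even for a quick demonstration.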

Start training

Click the 'Apply' button to add the train_hf_semantic_seg algorithm to the current workflow; the training process will begin immediately. 

With the smooth integration of MLflow, you can monitor the training progress in real-time. Parameters and metrics, including mean IoU, accuracy, and loss value, are automatically reported and can be viewed through the MLflow dashboard.

Below are the results obtained from a run using a SegFormer b2 model trained for 10 epochs with an input size of 320 pixels: 

  • Evaluation loss: 0.15
  • Evaluation mean IoU: 0.76
  • Evaluation mean accuracy: 0.82
  • Evaluation overall accuracy: 0.94
  • Training time: 15 minutes (single GPU – NVidia RTX 3060)
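For reference, the metrics reported above can all be derived from a confusion matrix of pixel counts. The sketch below uses the standard definitions of mean IoU, mean accuracy, and overall accuracy; it is not the evaluation code used by the training algorithm, and the toy 2-class matrix is invented for illustration.

```python
from typing import Dict, List


def segmentation_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Compute metrics from confusion[i][j] = pixels of true class i predicted as j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    ious, accuracies = [], []
    for i in range(n):
        true_positive = confusion[i][i]
        true_pixels = sum(confusion[i])                       # row sum
        predicted_pixels = sum(confusion[j][i] for j in range(n))  # column sum
        union = true_pixels + predicted_pixels - true_positive
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(true_positive / union)
        if true_pixels > 0:  # skip classes absent from the labels
            accuracies.append(true_positive / true_pixels)

    return {
        "mean_iou": sum(ious) / len(ious) if ious else 0.0,
        "mean_accuracy": sum(accuracies) / len(accuracies) if accuracies else 0.0,
        "overall_accuracy": correct / total if total else 0.0,
    }


# Toy example with two classes and 20 pixels.
metrics = segmentation_metrics([[8, 2], [1, 9]])
print({name: round(value, 3) for name, value in metrics.items()})
```

Overall accuracy is dominated by large classes such as sky or soil, which is why it is typically higher than mean IoU; mean IoU weighs every class equally and penalizes both missed and spurious pixels.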

At this stage, Ikomia STUDIO offers several options:

  • Modify training parameters to initiate a new run and perform comparisons.
  • Save the current workflow for future training sessions.

Once your custom model is trained, you can easily test it within Ikomia STUDIO. Close your previous training workflow and follow these steps:

  1. Download infer_hf_semantic_seg from the HUB.
  2. Open a vineyard image. 
  3. Select the newly installed infer_hf_semantic_seg in the process library (left pane).
  4. Fill in the parameters:

       a. Check ‘Model from checkpoint’.

       b. Browse to your custom model weight folder (user-folder/Ikomia/Plugins/Python/YoloTrain/data/models/train_hf_semantic_seg/outputs/nvidia/mit-b2/[timestamp]/checkpoint-120).

  5. Press Apply.

Crafting production-ready Computer Vision applications with ease

This article has delved into the mechanics and advantages of SegFormer, showcasing its simplicity, versatility, and scalability. Beyond just theory, we provided a hands-on guide on how to train this potent model on a custom vineyard dataset using the no-code Ikomia STUDIO platform. 

Ikomia STUDIO's user-friendly and streamlined workflows make it an ideal platform for both novices and experts to explore and harness the latest advancements in Computer Vision. With no-code platforms like Ikomia, the future of semantic segmentation, and by extension, Computer Vision, is not only promising but also accessible to all.

With the Ikomia tools you can chain algorithms from different frameworks like TorchVision, YOLO, and OpenMMLab. This enables you to effortlessly construct more sophisticated and potent workflows. Explore the ID card information extraction solution we crafted, which leverages five state-of-the-art algorithms from diverse frameworks.

[1] STL Dataset from NeyestaniSetelCo

[2] doi.org/10.1609/aaai.v37i11.26477
