In the dynamic field of computer vision, object detection has always been a critical area, particularly in applications like autonomous driving, surveillance, and face recognition. Among the various approaches developed over the years, Faster R-CNN has emerged as a notable milestone.
It not only localizes objects within an image but also classifies them, offering a combination of speed and accuracy that earlier R-CNN models could not match.
What is Faster R-CNN?
Faster R-CNN, an abbreviation for "Faster Region-based Convolutional Neural Network," is an enhanced object detection model within the R-CNN (Region-based Convolutional Neural Network) family, which also encompasses Fast R-CNN.
Developed by Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun in 2015, Faster R-CNN made region proposal part of the network itself, which removed the main speed bottleneck of earlier R-CNN models.
The Evolution of R-CNNs
R-CNN (2014): Pioneered the use of CNNs for object detection, but was slow because it processed each of the roughly 2,000 region proposals per image separately.
Fast R-CNN (2015): Improved upon R-CNN by processing the entire image with a CNN only once, but still relied on external methods for region proposal, which was a bottleneck.
Faster R-CNN (2015): Addressed this bottleneck by introducing a Region Proposal Network (RPN), fully integrating the region proposal step with the detection network.
Building on the foundational concepts of Faster R-CNN, it's important to delve deeper into its predecessors - R-CNN and Fast R-CNN - as well as a key component of Faster R-CNN, the Region Proposal Network (RPN).
Understanding these elements provides a clearer picture of the evolution and functioning of Faster R-CNN in object detection.
R-CNN (Region-based Convolutional Neural Network)
R-CNN, introduced by Ross Girshick et al. in 2014, was a groundbreaking step in using Convolutional Neural Networks (CNNs) for object detection. Here's how it works:
Selective Search: R-CNN begins by using a technique called selective search to generate region proposals - potential bounding boxes in an image that might contain objects.
CNN Features: Each proposed region is then warped to a fixed size and passed through a CNN, which extracts a feature vector for each region.
Classification: These feature vectors are then fed into a set of classifiers (like SVMs) to determine the object class within each proposed region.
Post-processing: Finally, bounding box regression is used to refine the boxes, followed by non-maximum suppression to eliminate redundant overlaps.
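The non-maximum suppression step in the post-processing stage can be illustrated with a minimal, pure-Python sketch. The function names `iou` and `nms`, the (x1, y1, x2, y2) box format, and the 0.5 overlap threshold are assumptions chosen for illustration, not details taken from the R-CNN implementation.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box (illustrative values).
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```

In this example the second box overlaps the first almost entirely and has a lower score, so only the first and third boxes survive.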
R-CNN is computationally expensive, primarily due to the need to run a CNN for each region proposal separately.
The training process is multi-stage and complex.
Building upon the foundational R-CNN model, the Fast R-CNN model was developed to address several of its limitations. An overview of the Fast R-CNN architecture and its improvements over the traditional R-CNN will be provided in the following section.
Fast R-CNN, conceptualized by Ross Girshick in 2015, marked a significant evolution in the field of object detection, specifically addressing the inefficiencies of its predecessor, the R-CNN model.
Core Components of Fast R-CNN
Single CNN for Feature Extraction
Function: Fast R-CNN transforms the entire input image into a comprehensive feature map using a single convolutional neural network (CNN).
Benefit: This approach contrasts sharply with R-CNN, which requires running a CNN separately for each region proposal, thus making Fast R-CNN significantly more efficient in terms of computational resources and time.
Region of Interest (ROI) Pooling Layer
After the generation of the feature map, the model still utilizes selective search to propose regions. However, for each of these proposals, the ROI pooling layer extracts a fixed-size feature vector directly from the feature map. The ROI Pooling layer operates by dividing each region proposal into a fixed grid of cells.
Within each cell of this grid, a max pooling operation is executed, which selects the maximum activation among the feature-map values that fall inside that cell. These maximum values, extracted from each cell, collectively form the feature vector.
For instance, if the grid is configured to a size of 2x2, there would be four cells in total. Consequently, the resulting feature vector would contain four values per feature-map channel, one maximum from each of the four cells.
This process ensures that the features extracted are both representative of the region and consistent in dimensionality, regardless of the original size of the region proposal.
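The 2x2 case described above can be sketched in a few lines of pure Python. This is a simplified illustration for a single feature-map channel: the function name `roi_max_pool`, the integer cell boundaries, and the assumption that the region is at least as large as the output grid are choices made for this example, not the exact procedure of any particular library.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one channel of a feature map inside `roi` to an out_h x out_w grid.

    feature_map: 2D list (rows x cols) holding a single channel.
    roi: (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    Assumes the region spans at least out_h rows and out_w columns.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(out_h):
        # Integer row boundaries of this grid cell.
        r0 = y1 + (gy * h) // out_h
        r1 = y1 + ((gy + 1) * h) // out_h
        row = []
        for gx in range(out_w):
            # Integer column boundaries of this grid cell.
            c0 = x1 + (gx * w) // out_w
            c1 = x1 + ((gx + 1) * w) // out_w
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
roi = (1, 0, 5, 4)  # columns 1..4, rows 0..3
pooled = roi_max_pool(feature_map, roi)
feature_vector = [v for row in pooled for v in row]
print(feature_vector)
```

Whatever the size of the region, the output always has `out_h * out_w` values per channel, which is what lets the fully connected layers that follow accept proposals of any shape.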
Classifier and Bounding Box Regressor
Functionality: The fixed-size feature vectors are then passed through a series of fully connected layers. These layers perform two key functions: classifying the object within the proposal and predicting the bounding box coordinates.
Output: The network simultaneously provides the class probabilities for the object in the region and the precise coordinates for the bounding box, ensuring accurate localization and classification.
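To make the two outputs concrete, here is a small sketch of how raw network outputs are commonly turned into class probabilities and a refined box. It assumes the center/size offset parameterization (dx, dy, dw, dh) described in the R-CNN family of papers; the function names `softmax` and `decode_box` and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_box(proposal, deltas):
    """Apply predicted offsets (dx, dy, dw, dh) to a proposal (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    # Shift the center proportionally to the box size, scale size in log space.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


probs = softmax([2.0, 0.5, 0.1])  # illustrative scores for three classes
refined = decode_box((100, 100, 200, 200), (0.1, 0.0, 0.0, 0.0))
print(probs, refined)
```

Here the box is shifted right by 10% of its width while its size is unchanged, and the first class receives the highest probability.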
Improvements Over R-CNN
Fast R-CNN introduced several key improvements over the original R-CNN model:
Speed and Efficiency: By processing the entire image only once, Fast R-CNN drastically reduces the time required for feature extraction. This single-pass approach negates the need to run a CNN for each region proposal, leading to significant speed improvements.
Simplified Training Process: Fast R-CNN allows for the joint optimization of the classifier and the bounding box regressor. This unified approach simplifies the training process, as it avoids the multi-stage, cumbersome training procedure required by R-CNN.
Memory Efficiency: Fast R-CNN is more memory efficient, as it does not require storing a large number of individual region features.
Higher Accuracy: The integrated nature of the network, along with improvements in feature extraction and region proposal processing, often leads to higher overall accuracy in object detection.
Building on the foundation laid by Fast R-CNN, the next section introduces Faster R-CNN. This model takes the concept a step further by integrating a network specifically designed for generating region proposals, thereby addressing one of the last remaining bottlenecks in the R-CNN series.
Faster R-CNN, an advancement over Fast R-CNN, is known for its efficiency and accuracy in object detection.
How Faster R-CNN Works
Faster R-CNN combines a Region Proposal Network (RPN) with a detection network, making the process more efficient than its predecessor, Fast R-CNN, which used selective search for region proposal. This integrated approach makes object detection in Faster R-CNN both fast and effective.
Integration of Region Proposal Network (RPN)
The RPN is pivotal in Faster R-CNN. It replaces selective search, streamlining the process. The RPN, a fully convolutional network, predicts object bounds and objectness scores across the image. It guides the Fast R-CNN detection module towards areas with potential objects, enhancing detection efficiency.
The Role of Anchors
Anchors are crucial in Faster R-CNN. An anchor is a predefined reference box with a fixed scale and aspect ratio, centered on a sliding-window position; the RPN scores each anchor and refines it. For instance, at one image position (320, 320), there might be nine anchors, one for each combination of three sizes (e.g., 128x128, 256x256, 512x512) and three aspect ratios (1:1, 1:2, 2:1).
These anchors, covering thousands of positions across an image, enable the RPN to narrow down the number of possible regions for object detection.
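The nine anchors from the example above can be generated with a short sketch. It assumes that each aspect ratio is expressed as height divided by width and that the anchor area stays fixed at the square of the scale; the function name `make_anchors` is hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centered on (cx, cy).

    Each ratio is height / width; the area of every anchor stays at scale ** 2.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(320, 320)
for a in anchors:
    print(tuple(round(v, 1) for v in a))
```

Three scales times three ratios give nine boxes per position. Repeating this at every sliding-window position is what produces the large pool of candidates that the RPN then scores and filters.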
Shared Features for Efficiency
Faster R-CNN allows the RPN and the detection network to share convolutional features, a step forward in efficiency. This shared feature extraction means the network performs this computationally expensive process only once for both region proposal and object detection.
Training the RPN
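The RPN is trained to classify anchors as background or foreground based on their overlap with ground-truth boxes, measured by Intersection over Union (IoU), and to regress offsets that refine the foreground anchors. In the original paper, an anchor is labeled foreground if its IoU with a ground-truth box exceeds 0.7 or if it is the best match for that box, background if its IoU stays below 0.3 for every box, and is otherwise ignored during training. This process involves labeling anchors, extracting features, and understanding the influence of the receptive field.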
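Anchor labeling can be sketched as follows, assuming the 0.7 and 0.3 IoU thresholds reported in the Faster R-CNN paper. The function names, the label encoding (1, 0, -1), and the example boxes are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = foreground, 0 = background, -1 = ignored."""
    ious = [[iou(a, g) for g in gt_boxes] for a in anchors]
    labels = []
    for row in ious:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # ambiguous overlap, not used for training
    # The best-matching anchor of each ground-truth box is also foreground.
    for g in range(len(gt_boxes)):
        best_a = max(range(len(anchors)), key=lambda i: ious[i][g])
        if ious[best_a][g] > 0:
            labels[best_a] = 1
    return labels


gt_boxes = [(100, 100, 200, 200)]
anchors = [
    (100, 100, 200, 200),  # identical to the ground truth
    (150, 150, 250, 250),  # small overlap
    (120, 100, 220, 200),  # overlap between the two thresholds
    (400, 400, 500, 500),  # no overlap
]
labels = label_anchors(anchors, gt_boxes)
print(labels)
```

Ignoring the anchors that fall between the two thresholds keeps ambiguous examples out of the classification loss.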
ROI Pooling to Fast R-CNN Detector
Once the RPN proposes regions, these are reshaped using ROI pooling, similar to Fast R-CNN. This step ensures the regions are a fixed size, suitable for classification and bounding box regression.
Loss Function and ROI Pooling
The overall loss of the RPN combines classification and regression losses. After RPN, proposed regions of varying sizes are standardized through ROI Pooling, which divides the input feature map into fixed regions, each subjected to Max-Pooling. This uniformity allows for a flexible architecture in the final classifier and regressor.
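For reference, the combined RPN loss can be written in the form given in the Faster R-CNN paper. The symbols are not defined elsewhere in this article: p_i is the predicted objectness probability of anchor i, p_i^* its ground-truth label (1 for foreground, 0 for background), t_i and t_i^* the predicted and target box offsets, N_cls and N_reg normalization terms, and lambda a balancing weight.

```latex
L(\{p_i\}, \{t_i\}) =
  \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*)
  + \lambda \, \frac{1}{N_{reg}} \sum_i p_i^* \, L_{reg}(t_i, t_i^*)
```

Here L_cls is a log loss over the two classes and L_reg is a smooth L1 loss. The factor p_i^* means the regression term only contributes for foreground anchors.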
Applications and Impact
Faster R-CNN has been influential in numerous fields:
Autonomous Vehicles: For real-time detection of pedestrians, vehicles, and other obstacles.
Medical Imaging: Helps in detecting anomalies and diseases.
Surveillance Systems: For monitoring and identifying activities and objects.
Limitations
Complex Training Process: The multi-step training process can be challenging to optimize.
Struggles with Small Objects: Due to its reliance on region proposals, detecting very small objects can be difficult.
Simplified object detection with Faster R-CNN via Ikomia API
Run Faster R-CNN on Ikomia API with ease, bypassing the usual coding complexities.
Create a virtual environment: Start by setting up the Ikomia API in a virtual environment to ensure a smooth and efficient workflow. 
Install Ikomia using a simple command: pip install ikomia
Run Faster R-CNN with a few lines of code
You can also directly load the notebook we have prepared.
from ikomia.dataprocess.workflow import Workflow
from ikomia.utils import ik
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(ik.infer_torchvision_faster_rcnn(conf_thres='0.5'), auto_connect=True)

# Run on your image (placeholder path; replace with your own file)
wf.run_on(path="path/to/your/image.jpg")

# Inspect your result
display(algo.get_image_with_graphics())
By default, the algorithm will use the Faster R-CNN model trained on the COCO 2017 dataset.
conf_thres (float), default '0.5': Confidence threshold for the predicted boxes, in the range [0, 1].