What is YOLO? An In-Depth Introduction to Object Detection in Computer Vision

Guillaume Demarcq
YOLO v7 applied on main artistic image

Let's step into the world of YOLO, an algorithm that has reshaped the landscape of object detection within the domain of computer vision. This journey invites curiosity and offers hands-on tutorials on the freshest iterations, YOLO v8 and YOLO v9. These guided experiences aim to help enthusiasts and professionals alike unlock YOLO's capabilities for their projects.

The Advent of YOLO

At the heart of YOLO's creation was the ambition to achieve real-time object detection that doesn't skimp on precision. Pre-YOLO methods often needed multiple stages or passes over an image, for example proposing candidate regions first and classifying each one afterwards, which made them less than ideal for applications needing quick, accurate results. YOLO introduced a paradigm shift by producing its detections in a single pass over the image.

YOLO: The Pinnacle of Computer Vision

1. Speed and Efficiency: YOLO runs at real-time processing speeds, and later versions such as YOLO v8 and YOLO v9 have extended its capabilities even further.

2. Precision Personified: Because it reasons over the full image, YOLO makes comparatively few background false positives, although the original model was less precise than slower detectors at localizing small objects.

3. A Holistic View: Unlike segmented methods, YOLO assesses the entire image in one go, capturing the essence of the scene.

YOLO detection in action: A bustling beach scene at sunset, capturing people, boats, and birds seamlessly.

Decoding the YOLO Algorithm

- A Singular Approach: YOLO reframes object detection, transitioning from image pixels directly to bounding box coordinates and class probabilities in one seamless step.

- Grid-Based Detection: The algorithm segments images into grids, predicting bounding boxes and their confidence scores within each segment.

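- Intersection Over Union (IoU): This metric measures how well a predicted bounding box overlaps the actual object, as the area of their intersection divided by the area of their union. It ranges from 0 (no overlap) to 1 (perfect match) and is used both to evaluate predictions and to filter duplicate boxes.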
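To make the IoU metric concrete, here is a minimal sketch in plain Python. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; that format and the function name `iou` are choices made for this illustration, not something prescribed by YOLO itself.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format (assumed)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0
```

Two identical boxes give an IoU of 1, and boxes that do not touch give 0.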

How does YOLO work? YOLO Architecture

YOLO (You Only Look Once) simplifies object detection by using a single deep convolutional neural network (CNN) pass to predict bounding boxes and class probabilities. This streamlined process, encompassed in its name, allows for simultaneous object localization and classification. Here's a breakdown of how the original YOLO model operates:

YOLO architecture

Image Division

  • Input Process: YOLO conceptually divides the input image into an S×S grid. A grid cell is responsible for detecting an object if the center of that object falls inside the cell.
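As a rough illustration of the grid assignment, the sketch below finds which cell is responsible for an object given the pixel coordinates of its center. The function name `locate_center`, the default S=7 and the 448×448 image size used in the example are assumptions for illustration (they match the original paper's setup, but this article does not state them).

```python
def locate_center(cx, cy, img_w, img_h, S=7):
    """Return (row, col, x_offset, y_offset) of the cell containing (cx, cy).

    Offsets are the center's position inside the cell, in the range 0..1.
    """
    # Rescale pixel coordinates to grid units (0..S).
    gx = cx / img_w * S
    gy = cy / img_h * S

    # Clamp so a center on the right or bottom edge stays in the last cell.
    col = min(int(gx), S - 1)
    row = min(int(gy), S - 1)

    return row, col, gx - col, gy - row


# An object centered in a 448x448 image lands in the middle cell of a 7x7 grid.
print(locate_center(224, 224, 448, 448))
```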

Feature Extraction

  • Convolutional Layers: The whole image passes through a series of convolutional layers, which learn to detect features in it (the grid only determines how the output is organized; the image itself is not cut into pieces). Simpler features like edges are detected in the early layers, while complex features, such as object parts or shapes, are identified in deeper layers.

Prediction Phase

  • Bounding Box and Confidence Scores: Every grid cell predicts multiple bounding boxes and assigns confidence scores to these boxes. This score is a measure of the model's certainty that a box indeed contains an object and the accuracy of the predicted box.
  • Class Probability: In addition to bounding box predictions, each grid cell estimates conditional class probabilities for the object classes the model recognizes. The final score for a box is computed by multiplying the object's confidence score by the conditional class probability.
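The multiplication described above can be sketched in a few lines. The function names and the example numbers are made up for illustration.

```python
def class_scores(box_confidence, class_probs):
    """Class-specific scores: box confidence times each conditional class probability."""
    return [box_confidence * p for p in class_probs]


def best_class(box_confidence, class_probs):
    """Return (class_index, score) for the highest-scoring class of one box."""
    scores = class_scores(box_confidence, class_probs)
    index = max(range(len(scores)), key=lambda i: scores[i])
    return index, scores[index]


# Hypothetical box: 80% confident it holds an object, three candidate classes.
print(best_class(0.8, [0.1, 0.7, 0.2]))
```

A box with high confidence but an ambiguous class distribution therefore ends up with a lower final score than a box that is both confident and clearly of one class.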

Architecture Specifics

  • Structure: The core YOLO architecture is made up of 24 convolutional layers followed by 2 fully connected layers. Among these, 1x1 reduction layers are employed to compress feature map depths, facilitating feature integration from preceding layers. The network's output is a tensor encoding bounding box coordinates, object confidence, and class probabilities.
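To see how the output tensor is sized, the sketch below uses the settings from the original YOLO paper (S=7 grid, B=2 boxes per cell, C=20 classes), which this article does not state explicitly. Each box contributes five numbers (x, y, width, height, confidence), so each cell holds B×5+C values. The ordering of values inside a cell vector used in `split_cell_vector` is an assumption for illustration; real implementations may lay the values out differently.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the original YOLO output tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def split_cell_vector(vec, B=2, C=20):
    """Split one cell's prediction vector into B boxes and C class probabilities.

    Assumed layout: [box_1 (5 values), ..., box_B (5 values), class probabilities].
    """
    boxes = [vec[i * 5:(i + 1) * 5] for i in range(B)]
    class_probs = vec[B * 5:B * 5 + C]
    return boxes, class_probs


print(yolo_output_shape())
```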

Refinements & Output

  • Non-max Suppression: To address the issue of multiple overlapping boxes for the same object, non-max suppression is applied. This process keeps the box with the highest confidence score and discards any other box that overlaps it by more than a chosen IoU threshold, eliminating redundancies.
  • Thresholding: Boxes scoring below a preset confidence threshold are discarded. The model then presents the surviving boxes as its final object detections.
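The two post-processing steps can be sketched together as follows. This is a simplified, single-class version written for illustration: the function names, the (score, box) input format and the default thresholds are assumptions, and production implementations typically run suppression per class and in vectorized form.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format (assumed)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def filter_detections(detections, score_thresh=0.25, iou_thresh=0.5):
    """Apply confidence thresholding, then greedy non-max suppression.

    detections: list of (score, (x1, y1, x2, y2)) tuples.
    """
    # Step 1: drop low-confidence boxes, then sort the rest best-first.
    candidates = sorted(
        (d for d in detections if d[0] >= score_thresh),
        key=lambda d: d[0],
        reverse=True,
    )

    # Step 2: keep a box only if it does not overlap an already-kept box too much.
    kept = []
    for score, box in candidates:
        if all(iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

Here thresholding runs first so that suppression only has to compare the boxes that survive it.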

This overview provides a simplified glimpse into the original YOLO model's design and functionality, highlighting its efficiency and effectiveness in object detection tasks.

YOLO series: From YOLO v1 to YOLO v9

The YOLO timeline traces the development of a series of influential computer vision models designed for real-time object detection. Here's a concise overview of its evolution:

YOLO series timeline
YOLO timeline from 2015 to 2024

YOLOv1 (2015)

Introduced by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in their paper "You Only Look Once: Unified, Real-Time Object Detection." YOLOv1 was groundbreaking for its novel approach of treating object detection as a single regression problem, directly predicting bounding boxes and class probabilities from full images in one evaluation. This approach significantly increased the speed of object detection, making real-time processing possible.

YOLOv2 (YOLO9000, 2016)

In their paper "YOLO9000: Better, Faster, Stronger," Redmon and Farhadi improved YOLO's speed and accuracy. YOLOv2 introduced various concepts, such as batch normalization, high-resolution classifiers, and a new network architecture called Darknet-19. It also introduced the concept of anchor boxes to predict more accurate bounding boxes. YOLOv2 was notable for its ability to detect over 9000 object categories by jointly training on both detection and classification datasets.

YOLOv3 (2018)

With the paper "YOLOv3: An Incremental Improvement," Redmon and Farhadi presented further improvements. YOLOv3 incorporated several enhancements, such as using a deeper network architecture called Darknet-53, employing multi-scale predictions, and improving the detection of smaller objects. Despite these advancements, it managed to maintain high speed and accuracy.

YOLOv4 (2020)

Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao released YOLOv4 in the paper "YOLOv4: Optimal Speed and Accuracy of Object Detection." YOLOv4 focused on optimizing the speed and accuracy further, making it accessible for a wider range of devices, including those with limited computational power.

It introduced improvements like the use of the CSPDarknet53 backbone, spatial pyramid pooling, and PANet path-aggregation. YOLOv4 was also designed to be more user-friendly in terms of training and deploying on different platforms.

YOLOv5 (2020)

Despite the naming convention, YOLOv5 is not an official continuation by the original authors but was developed and released by Ultralytics. It is a significant departure in terms of framework, being implemented in PyTorch instead of Darknet. YOLOv5 has been controversial due to its naming but has seen widespread adoption due to its ease of use, performance, and continuous updates. It introduced several improvements and optimizations over YOLOv4, including model scalability (with versions from YOLOv5s to YOLOv5x), automated hyperparameter tuning, and enhanced training procedures.

YOLOv6 (2022)

YOLOv6, introduced in 2022 by Li et al. at Meituan, targets industrial applications. This iteration distinguishes itself from YOLOv5 primarily through its underlying convolutional neural network (CNN) architecture. It uses a re-parameterizable backbone called EfficientRep, together with a Rep-PAN neck and a decoupled detection head, whereas YOLOv5 is built on a CSP-based backbone. The authors report a better trade-off between speed and accuracy than YOLOv5 on standard benchmarks.

YOLOv7 (2022)

Released shortly after YOLO v6, YOLO v7 introduces significant advancements in real-time object detection. Its authors report state-of-the-art speed and accuracy across a wide range of frame rates (5 to 160 FPS) on a V100 GPU, and the highest accuracy among real-time detectors running at 30 FPS or higher, outperforming both transformer-based and convolution-based detectors.

YOLOv7's design focuses on optimizing the training process without increasing inference cost, introducing trainable bag-of-freebies methods for enhanced detection accuracy. It effectively reduces parameters and computational requirements compared to prior models, maintaining fast inference speeds and high detection precision without additional datasets or pre-trained weights.

YOLOv8 (2023)

YOLO v8, released by Ultralytics in 2023, marks a significant evolution in the YOLO series, expanding its capabilities to cover object detection, image classification, and instance segmentation. Noteworthy for its precision and streamlined model size, YOLO v8 introduces anchor-free detection: it predicts object centers directly instead of offsets from predefined anchor boxes.

This advancement simplifies the model architecture and enhances efficiency in post-processing tasks like Non-Maximum Suppression. The model architecture draws inspiration from ResNet, featuring novel convolution types and module configurations, enhancing its usability and effectiveness in computer vision projects.

YOLOv9 (2024)

YOLOv9 represents the most current and advanced model in the YOLO series, setting a new benchmark for state-of-the-art (SOTA) performance in object detection.

YOLO series performance

YOLO v9 introduces Programmable Gradient Information (PGI) and a new network architecture called Generalized Efficient Layer Aggregation Network (GELAN), aimed at overcoming the loss of information that occurs as data passes through the layers of a deep network. PGI supplies more complete input information to the target task's objective function, which leads to more reliable gradient updates.

GELAN, based on gradient path planning, focuses on parameter efficiency and computational simplicity, outperforming previous methods in parameter utilization and supporting a wide range of models. These innovations significantly enhance object detection performance on the MS COCO dataset.

YOLO in Action: Real-World Applications

- Autonomous Vehicles: YOLO's real-time detection capabilities are steering the future of self-driving cars, identifying obstacles and ensuring safe navigation.

- Wildlife Conservation: From tracking endangered species to monitoring habitats, YOLO is the conservationist's tech companion.

- Security and Surveillance: Enhancing security protocols, YOLO can detect breaches and unauthorized activities with unmatched precision.

Urban dynamics through YOLO's lens: A vibrant city street teeming with pedestrians, cyclists, dogs, and the rhythm of traffic.


YOLO is more than just an algorithm; it's a paradigm shift in computer vision. As you embark on your journey with YOLO, our comprehensive tutorials on YOLO v7, YOLO v8 and YOLO v9 are here to assist you every step of the way. Harness the power of YOLO and elevate your projects to new heights.

Frequently asked questions:

What is YOLO in computer vision?

YOLO, which stands for "You Only Look Once," is a revolutionary algorithm in computer vision used for real-time object detection. Unlike traditional methods that require multiple passes to detect objects, YOLO accomplishes this in a single pass, making it significantly faster without compromising on accuracy.

Why is YOLO better than CNN?

YOLO is itself built on Convolutional Neural Networks (CNNs), so the fair comparison is with detection pipelines that apply a CNN in several stages or to many image regions separately. YOLO divides an image into a grid and predicts bounding boxes and class probabilities for every grid cell simultaneously, in one forward pass. This approach enables YOLO to detect objects in real time, making it faster than those multi-stage methods.

Is YOLO based on CNN?

Yes, YOLO is based on Convolutional Neural Networks (CNN). It utilizes a CNN architecture to extract features from images and predict object bounding boxes and class probabilities.
