YOLO: Algorithm for Object Detection Explained [+Examples] (V7Labs)

A comprehensive review of YOLO architectures in Computer Vision: From YOLOV1 to YOLOV8 and YOLO-NAS (arXiv)

You Only Look Once (YOLO) achieved state-of-the-art (SOTA) results among real-time detectors in 2015. A family of detectors has emerged since then.

The YOLO architecture is simple:

  • 24 convolutional layers followed by two fully connected (FC) layers.

  • The first 20 convolutional layers are pre-trained on ImageNet with half-resolution images (224 × 224)

  • The full network is then trained on detection with full-resolution images (448 × 448)

  • YOLO divides the input image into an S × S grid

  • For each grid element, it predicts B bounding boxes, each with its own confidence score, plus one set of C class probabilities shared by all B boxes (so every box in a grid element gets the same class)

  • Each bbox prediction consists of five values: Pc, bx, by, bh, bw (Pc is the confidence score; bx, by are the box center; bh, bw are its height and width)

  • It uses 1 × 1 convolutional layers to reduce the number of feature maps and keep the number of parameters relatively low

  • The output of YOLO is a tensor of shape S × S × (B × 5 + C), optionally followed by non-maximum suppression (NMS) to remove duplicate detections.

  • In the original YOLO paper, the authors used

    • the PASCAL VOC dataset that contains 20 classes (C = 20)

    • a grid of 7 × 7 (S = 7)

    • and at most 2 bounding boxes per grid element (B = 2), giving a 7 × 7 × 30 output tensor

  • NMS (non-maximum suppression) improves accuracy by keeping only the highest-confidence box among heavily overlapping detections of the same object.
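
To make the S × S × (B × 5 + C) output concrete, here is a minimal NumPy sketch that splits such a tensor into per-box values and per-cell class probabilities, using the PASCAL VOC settings from the notes (S = 7, B = 2, C = 20). The channel layout (all box values first, then the class probabilities) and the function names are assumptions for illustration; real implementations may order the channels differently. The per-box class score is taken as box confidence times the cell's class probability, as described in the original YOLO paper.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC settings)

def split_prediction(pred):
    """Split an (S, S, B*5 + C) tensor into box values and class probabilities."""
    assert pred.shape == (S, S, B * 5 + C)
    # Assumed layout: B groups of (Pc, bx, by, bh, bw), then C class probabilities.
    boxes = pred[..., : B * 5].reshape(S, S, B, 5)
    class_probs = pred[..., B * 5:]
    return boxes, class_probs

def class_scores(boxes, class_probs):
    """Per-box class score: box confidence times the cell's shared class probabilities."""
    conf = boxes[..., 0]                                 # shape (S, S, B)
    return conf[..., None] * class_probs[:, :, None, :]  # shape (S, S, B, C)

# Random values stand in for a real network output.
pred = np.random.default_rng(0).random((S, S, B * 5 + C))
boxes, class_probs = split_prediction(pred)
scores = class_scores(boxes, class_probs)
print(pred.shape, boxes.shape, class_probs.shape, scores.shape)
```

Because the class probabilities have no box dimension, both boxes in a cell necessarily share the same class distribution, which is where the "same class per grid element" restriction comes from.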

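The NMS step mentioned above can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. This is a minimal single-class sketch, not the authors' implementation; the corner format (x1, y1, x2, y2), the function names, and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
import numpy as np

def iou(box, others):
    """IoU between one box and an array of boxes, all given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Discard boxes that overlap the current best box above the threshold.
        overlaps = iou(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 50, 50],
                  [12, 12, 52, 52],      # near-duplicate of the first box
                  [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

In a multi-class setting, this loop would typically be run separately for each class so that overlapping boxes of different classes do not suppress each other.
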
Pros & Cons

  • It was much faster than existing object detectors, allowing real-time performance.

  • However, its localization error was larger than that of SOTA methods such as Fast R-CNN.

    • It could detect at most two objects of the same class per grid cell, limiting its ability to detect nearby objects.

    • It struggled to predict objects with aspect ratios not seen in the training data.

    • It learned from coarse object features due to the down-sampling layers.
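
The nearby-objects limitation follows from how objects are assigned to grid cells: in YOLO, the cell that contains an object's center is responsible for detecting it. The sketch below, with a hypothetical helper `responsible_cell` and center coordinates assumed to be normalized to [0, 1], shows two close objects landing in the same cell of a 7 × 7 grid, where they must compete for that cell's B = 2 boxes and single class prediction.

```python
def responsible_cell(cx, cy, S=7):
    """Return (row, col) of the grid cell containing a normalized center (cx, cy)."""
    # Clamp so that a coordinate of exactly 1.0 still maps to the last cell.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    return row, col

a = responsible_cell(0.50, 0.50)  # object A
b = responsible_cell(0.55, 0.52)  # object B, close to A
c = responsible_cell(0.80, 0.52)  # object C, farther away

print(a, b, c)  # (3, 3) (3, 3) (3, 5)
print(a == b)   # True: A and B share one cell
```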

