YOLO
You Only Look Once (YOLO) achieved state-of-the-art (SoTA) results among real-time object detectors in 2015. A family of detectors has emerged from it since then.
The YOLO architecture is simple:
24 convolutional layers followed by two fully connected (FC) layers.
The first 20 convolutional layers are pre-trained for classification on ImageNet with half-resolution (224 × 224) images
The full network is then trained for detection on full-resolution (448 × 448) images
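As a rough illustration of this layout, here is a minimal PyTorch-style sketch. It is not the paper's exact 24-layer network; the layer sizes and the S, B, C values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative settings (grid size, boxes per cell, number of classes)
S, B, C = 7, 2, 20

class TinyYOLO(nn.Module):
    def __init__(self):
        super().__init__()
        # Small convolutional backbone (stand-in for the 24-layer network)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),          # force an S x S spatial grid
        )
        # Two FC layers mapping the grid features to the detection tensor
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * S * S, 496), nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        return self.head(self.backbone(x)).view(-1, S, S, B * 5 + C)

out = TinyYOLO()(torch.zeros(1, 3, 448, 448))
print(out.shape)   # torch.Size([1, 7, 7, 30])
```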
YOLO divides the input image into an S × S grid
For each grid cell, it predicts B bounding boxes, each with its own confidence score, plus one set of probabilities over C classes shared by all B boxes in that cell
Each bbox prediction consists of five values: Pc, bx, by, bh, bw (Pc is the confidence score)
It uses 1 × 1 convolutional layers to reduce the number of feature maps and keep the parameter count relatively low
The output of YOLO is a tensor of shape S × S × (B × 5 + C), optionally followed by non-maximum suppression (NMS) to remove duplicate detections.
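The per-cell layout assumed below (B blocks of [Pc, bx, by, bh, bw] followed by C class probabilities) is an assumption for illustration; this NumPy sketch shows how such a tensor could be decoded into per-cell detections.

```python
import numpy as np

# Illustrative settings matching the sketch above
S, B, C = 7, 2, 20

def decode(pred, conf_thresh=0.25):
    """Turn an S x S x (B*5 + C) tensor into (row, col, class, score, box) tuples."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                 # C class scores shared by all B boxes
            cls = int(np.argmax(class_probs))
            for b in range(B):
                pc, bx, by, bh, bw = cell[b * 5:(b + 1) * 5]
                score = pc * class_probs[cls]          # class-specific confidence
                if score >= conf_thresh:
                    detections.append((row, col, cls, float(score), (bx, by, bh, bw)))
    return detections

pred = np.random.rand(S, S, B * 5 + C).astype(np.float32)
print(len(decode(pred)))
```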
In the original YOLO paper, the authors used:
the PASCAL VOC dataset that contains 20 classes (C = 20)
a grid of 7 × 7 (S = 7)
and 2 bounding boxes per grid cell (B = 2)
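With these settings, the output tensor has shape 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30, i.e. 1,470 values per image.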
YOLO applies NMS (non-maximum suppression) to the raw predictions to further improve accuracy.
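A minimal greedy NMS sketch; the corner-format boxes [x1, y1, x2, y2] and the 0.5 IoU threshold are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest-scoring boxes first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        # drop remaining boxes that overlap the kept box too much
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 9, 9], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```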
It was much faster than existing object detectors, allowing real-time performance.
However, its localization error was larger than that of SoTA methods such as Fast R-CNN.
Each grid cell could predict only two boxes and a single class, limiting its ability to detect nearby objects.
It struggled to predict objects with aspect ratios not seen in the training data.
It learned relatively coarse object features because of its multiple down-sampling layers.