YOLO
You Only Look Once (YOLO) achieved state-of-the-art (SoTA) results among real-time object detectors in 2015. A family of detectors has emerged from it since then.
The YOLO architecture is simple:
24 convolutional layers followed by two fully connected (FC) layers.
The first 20 convolutional layers are pre-trained for classification on ImageNet with half-resolution (224 × 224) images
The full network is then trained for detection on full-resolution (448 × 448) images
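As a rough illustration of this layout, here is a minimal PyTorch-style sketch. It is not the paper's exact 24-layer network; the layer sizes and the S, B, C values below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative settings (grid size, boxes per cell, number of classes)
S, B, C = 7, 2, 20

class TinyYOLO(nn.Module):
    def __init__(self):
        super().__init__()
        # Small convolutional backbone (stand-in for the 24-layer network)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=2, padding=3), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU(0.1),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.LeakyReLU(0.1),
            nn.AdaptiveAvgPool2d((S, S)),          # force an S x S spatial grid
        )
        # Two FC layers mapping the grid features to the detection tensor
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * S * S, 496), nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (B * 5 + C)),
        )

    def forward(self, x):
        return self.head(self.backbone(x)).view(-1, S, S, B * 5 + C)

out = TinyYOLO()(torch.zeros(1, 3, 448, 448))
print(out.shape)   # torch.Size([1, 7, 7, 30])
```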
YOLO divides the input image into an S × S grid
For each grid cell, it predicts B bounding boxes, each with its own confidence score, plus one set of probabilities over C classes shared by all B boxes in that cell
Each bbox prediction consists of five values: Pc, bx, by, bh, bw (Pc is the confidence score)
It uses 1 × 1 convolutional layers to reduce the number of feature maps and keep the parameter count relatively low
The output of YOLO is a tensor of shape S × S × (B × 5 + C), optionally followed by non-maximum suppression (NMS) to remove duplicate detections.
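The per-cell layout assumed below (B blocks of [Pc, bx, by, bh, bw] followed by C class probabilities) is an assumption for illustration; this NumPy sketch shows how such a tensor could be decoded into per-cell detections.

```python
import numpy as np

# Illustrative settings matching the sketch above
S, B, C = 7, 2, 20

def decode(pred, conf_thresh=0.25):
    """Turn an S x S x (B*5 + C) tensor into (row, col, class, score, box) tuples."""
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                 # C class scores shared by all B boxes
            cls = int(np.argmax(class_probs))
            for b in range(B):
                pc, bx, by, bh, bw = cell[b * 5:(b + 1) * 5]
                score = pc * class_probs[cls]          # class-specific confidence
                if score >= conf_thresh:
                    detections.append((row, col, cls, float(score), (bx, by, bh, bw)))
    return detections

pred = np.random.rand(S, S, B * 5 + C).astype(np.float32)
print(len(decode(pred)))
```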
In the original YOLO paper, the authors used:
the PASCAL VOC dataset that contains 20 classes (C = 20)
a grid of 7 × 7 (S = 7)
and 2 bounding boxes per grid cell (B = 2)
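With these settings, the output tensor has shape 7 × 7 × (2 × 5 + 20) = 7 × 7 × 30, i.e. 1,470 values per image.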
YOLO applies NMS (non-maximum suppression) to the raw predictions to further improve accuracy.
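A minimal greedy NMS sketch; the corner-format boxes [x1, y1, x2, y2] and the 0.5 IoU threshold are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest-scoring boxes first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        # drop remaining boxes that overlap the kept box too much
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 9, 9], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2]
```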
It was much faster than existing object detectors, allowing real-time performance.
However, its localization error was larger than that of SoTA methods such as Fast R-CNN.
Each grid cell could predict only two boxes and a single class, limiting its ability to detect nearby objects.
It struggled to predict objects with aspect ratios not seen in the training data.
It learned relatively coarse object features because of its multiple down-sampling layers.