Reference
Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779-788.
Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242, 2016.
Summary
YOLO (You Only Look Once) is a proposal-free object detection framework. Its single-network design makes real-time object detection possible.
Main idea of the new framework
Suppose we are trying to detect objects of C different classes.
The key idea is to divide the image into an S x S grid of cells. Each cell predicts (1) C conditional class probabilities, and (2) B bounding boxes, each with four coordinates and a confidence score (B is a tunable hyperparameter). As explained in the paper, “If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.”
That’s it! The framework simply takes the (resized) raw image as input and directly makes predictions with a single neural network.
Network Macro-Architecture
The network has two main parts: a convolutional feature extractor and two fully connected layers (the last two layers). The convolutional part is adapted from GoogLeNet.
The input is 3 x 448 x 448 (channels x height x width), and the CNN output is 1024 x 7 x 7. After flattening into a 50176-dimensional vector, the features pass through two fc layers, 50176 -> 4096 -> S x S x (B x 5 + C) = 1470 (e.g., S=7, B=2, C=20), and are finally reshaped into 30 x 7 x 7.
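A minimal PyTorch sketch of this head, with the GoogLeNet-style backbone stubbed out as a single placeholder layer (the class name, the placeholder backbone, and the activation choices are my own illustration, not the exact YOLO architecture):

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

class YoloHead(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder for the convolutional feature extractor: 3 x 448 x 448 -> 1024 x 7 x 7.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 1024, kernel_size=3, stride=64, padding=1),  # stand-in, not the real backbone
            nn.LeakyReLU(0.1),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                           # 1024 x 7 x 7 -> 50176
            nn.Linear(1024 * S * S, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),   # 4096 -> 1470
        )

    def forward(self, x):
        out = self.fc(self.backbone(x))
        return out.view(-1, B * 5 + C, S, S)        # reshape to 30 x 7 x 7 per image

x = torch.randn(1, 3, 448, 448)
print(YoloHead()(x).shape)  # torch.Size([1, 30, 7, 7])
```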
Parametrization
The bounding box width and height are relative to the image size, i.e., normalized by the image width and height. The (x, y) coordinates are relative to the bounds of the grid cell, i.e., offsets from the grid cell location.
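As a quick illustration of this parametrization, here is a plain-Python sketch (the function name and the example numbers are hypothetical, not taken from the paper) that encodes an absolute ground-truth box into the normalized targets:

```python
def encode_box(cx, cy, w, h, img_size=448, S=7):
    """Encode a box (center cx, cy and size w, h in pixels) into YOLO targets."""
    cell_size = img_size / S
    col, row = int(cx // cell_size), int(cy // cell_size)  # responsible grid cell
    # (x, y): offset of the center inside its grid cell, in [0, 1).
    x = cx / cell_size - col
    y = cy / cell_size - row
    # (w, h): box size relative to the whole image.
    w_rel = w / img_size
    h_rel = h / img_size
    return (row, col), (x, y, w_rel, h_rel)

# A 100 x 150 box centered at (100, 300) falls into cell (4, 1).
print(encode_box(100, 300, 100, 150))
```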
v2: YOLO9000
Better:
- Add batch normalization after the convolutional layers (meanwhile, all dropout layers are removed).
- Switch to a fully convolutional network. A key step toward being fully convolutional is to replace YOLO's final fc layers with global average pooling. (Global average pooling: compute one feature map for each output entry (e.g., a class). To predict 1000 classes, the final conv layer produces 1000 x 7 x 7 and the global average pooling layer computes the average of each feature map, producing a 1000 x 1 output.)
- Location prediction. Inspired by the anchors in Faster R-CNN, the new model uses anchor boxes to predict bounding boxes. The regression from anchor boxes to ground-truth bounding boxes is similar to Faster R-CNN, but with a different parametrization (relative to the grid cell, not the whole image, to constrain the offsets). The sizes of the anchor boxes are determined by k-means clustering on the ground-truth boxes.
- Add a skip connection. The feature maps before the last pooling layer are concatenated with the final feature maps via a skip connection, so that the final prediction can take advantage of fine-grained features. Suppose the feature maps before the last pooling have size N x 2W x 2H and, therefore, the last feature maps have size M x W x H. The concatenation is done in two steps. First, the N x 2W x 2H features are reshaped into 4N x W x H by stacking adjacent spatial positions into different channels (e.g., each 2 x 2 patch is stacked as 1 x 4); a small sketch of this reshaping appears after this list. Then, the features can be concatenated as (4N + M) x W x H.
- Multi-scale training. Due to the new FCN-like design, the model can take images of any size, so the input image size is periodically changed (by rescaling) during training.
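Here is a minimal sketch of the 2 x 2-to-channels reshaping used by that skip connection (often called a "reorg" or space-to-depth step). The implementation, names, and the 26 x 26 / 13 x 13 feature-map sizes are my own illustrative choices, not the official code:

```python
import torch

def space_to_depth(x, block=2):
    """Reshape (N, C, 2H, 2W) features into (N, 4C, H, W) by moving each
    2x2 spatial patch into the channel dimension."""
    n, c, h, w = x.shape
    x = x.view(n, c, h // block, block, w // block, block)   # split H and W into blocks
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()             # bring each 2x2 patch next to the channels
    return x.view(n, c * block * block, h // block, w // block)

fine = torch.randn(1, 512, 26, 26)      # fine-grained features before the last pooling
coarse = torch.randn(1, 1024, 13, 13)   # final feature maps
merged = torch.cat([space_to_depth(fine), coarse], dim=1)
print(merged.shape)  # torch.Size([1, 3072, 13, 13]) = (4*512 + 1024) x 13 x 13
```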
Faster:
A new network macro-architecture (Darknet-19) is used, which borrows ideas from Network-in-Network (1 x 1 convolutions) and global average pooling.
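A small sketch of how 1 x 1 convolutions plus global average pooling can replace fully connected layers in a classification head (the shapes and the 1000-class setting follow the description above; the exact layer layout here is an assumption for illustration):

```python
import torch
import torch.nn as nn

# Suppose the backbone ends with 1024 x 7 x 7 feature maps.
features = torch.randn(1, 1024, 7, 7)

head = nn.Sequential(
    nn.Conv2d(1024, 1000, kernel_size=1),   # 1x1 conv: one feature map per class -> 1000 x 7 x 7
    nn.AdaptiveAvgPool2d(1),                # global average pooling -> 1000 x 1 x 1
    nn.Flatten(),                           # -> 1000 class scores
)
print(head(features).shape)  # torch.Size([1, 1000])
```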
Stronger:
The key idea is to train the model on both detection datasets (mostly general labels) and classification datasets (a wider and deeper range of labels). First, a method called hierarchical classification is designed to merge the labels from the different datasets (e.g., “dog” and “Yorkshire terrier” are not mutually exclusive and need special treatment before merging). Second, after preparing the datasets, when the model sees an image labelled for detection, the error is backpropagated based on the full YOLOv2 loss; when it sees a classification image, only the classification part of the loss is backpropagated.
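To illustrate the hierarchical treatment, here is a toy sketch of how a leaf label's probability is obtained by multiplying conditional probabilities along its path in the label tree. The tree and the numbers below are made up for illustration; YOLO9000's actual label tree (WordTree) is built from WordNet:

```python
# Hypothetical label tree: each node stores its parent (None for the root).
parent = {"physical object": None, "dog": "physical object", "Yorkshire terrier": "dog"}

# Hypothetical conditional probabilities P(node | parent), e.g., from per-group softmaxes.
cond_prob = {"physical object": 1.0, "dog": 0.8, "Yorkshire terrier": 0.6}

def path_probability(label):
    """P(label) = product of P(node | parent) from the label up to the root."""
    p = 1.0
    node = label
    while node is not None:
        p *= cond_prob[node]
        node = parent[node]
    return p

print(path_probability("Yorkshire terrier"))  # 0.6 * 0.8 * 1.0 = 0.48
```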