YOLO: KING OF COMPUTER VISION APPLICATIONS

Pride Chocolate
Feb 4, 2021

What is YOLO?

YOLO (You Only Look Once) is a deep learning algorithm for object detection. It uses Convolutional Neural Networks to detect objects in images.

YOLO can detect multiple objects in a single image. In addition to predicting the classes of these objects, it also predicts their locations in the image.

YOLO applies a single neural network to the whole image. This network divides the image into regions and produces probabilities for every region. YOLO then predicts a number of bounding boxes that cover regions of the image and chooses the best ones according to these probabilities.

Among the three boxes, it chooses the pink ones

To fully understand how YOLO v3 works, the following terminology needs to be known:

  • Convolutional Neural Networks (CNN)
  • Residual Blocks
  • Skip Connections
  • Up-Sampling
  • Leaky ReLU
  • Intersection Over Union (IoU)
  • Non-Maximum Suppression

Architecture of YOLO v3:

YOLO is built from convolutional layers. The feature extractor of YOLO v3 consists of 53 convolutional layers, a backbone also called Darknet-53.

For the detection task, the original backbone is stacked with 53 more layers, giving YOLO v3 a full architecture of 106 layers.

That is why, when you run any command in the Darknet framework, you will see it load an architecture consisting of 106 layers.

The detections are made at three layers: 82, 94 and 106.

Version 3 incorporates some of the most essential elements of modern architectures: residual blocks, skip connections and up-sampling.

Each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation function.
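As a rough illustration, a PyTorch sketch (not the original Darknet C implementation) of such a convolutional block might look like this:

```python
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel_size, stride=1):
    """Convolution + batch norm + Leaky ReLU, the basic block of Darknet-53."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, stride=stride,
                  padding=kernel_size // 2, bias=False),  # bias is redundant before batch norm
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),                  # Darknet uses a 0.1 negative slope
    )
```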

Comparison to other Detectors

YOLOv3 is extremely fast and accurate. In mAP measured at 0.5 IoU, YOLOv3 is on par with RetinaNet (Focal Loss) but about 4x faster. Moreover, you can easily trade off between speed and accuracy simply by changing the size of the model; no retraining is required.

Why are there no pooling layers, and why are convolutional layers with stride 2 used instead to down-sample the feature maps?

Because down-sampling with strided convolutions prevents the loss of low-level features that pooling layers simply discard. Capturing these low-level features improves the ability to detect small objects.

A good example of this is shown in the images, where pooling discards some of the values, while a strided convolution takes all of the values into account.
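A minimal comparison of the two down-sampling options (a PyTorch sketch; the channel counts are arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 52, 52)                      # a feature map of spatial size 52x52

pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)   # -> (1, 64, 26, 26): keeps only the max of each 2x2 window
strided = nn.Conv2d(64, 128, kernel_size=3,
                    stride=2, padding=1)(x)         # -> (1, 128, 26, 26): a learned combination of all values

print(pooled.shape, strided.shape)
```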

What does the input to the Network look like?

The input is a batch of images of shape (n, 416, 416, 3), where n is the number of images, (416, 416) are the width and height, and 3 is the number of channels: Red, Green and Blue.

The middle two numbers, width and height, can be changed to 608 or any other value (832, 1024, ...) that is divisible by 32 without leaving a remainder.

Why change this number, and why must it be divisible by exactly 32?

The divisibility requirement comes from the network's strides, explained below. As for changing the resolution at all: increasing the input resolution might improve the model's accuracy after training. In the current scenario, we will assume an input of size 416 x 416. These numbers are also called the network input size.

The input images themselves can be of any size; there is no need to resize them before feeding them to the network. They will all be resized to the network input size. There is also the possibility to experiment with keeping or not keeping the aspect ratio by adjusting parameters when training and testing in the original Darknet framework, in TensorFlow, Keras or any other framework. You can then compare and choose the approach that best suits your custom model.
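A small sketch of how such an input batch could be prepared (using OpenCV for resizing; the file names are hypothetical and the aspect ratio is not kept here):

```python
import cv2
import numpy as np

net_size = 416                          # network input size; must be divisible by 32
assert net_size % 32 == 0

paths = ["lemon.jpg", "street.jpg"]     # hypothetical images of arbitrary sizes
batch = []
for p in paths:
    img = cv2.imread(p)                             # BGR, shape (H, W, 3)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)      # the network expects RGB
    img = cv2.resize(img, (net_size, net_size))     # simple resize to the network input size
    batch.append(img.astype(np.float32) / 255.0)    # scale pixel values to [0, 1]

batch = np.stack(batch)                 # shape (n, 416, 416, 3)
print(batch.shape)
```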

Detection at 3 scales refers to the three places in the Network where detections are made.

How does the Network detect objects?

YOLO v3 makes detections at three different scales and at three separate places in the Network. These separate places are layers 82, 94 and 106. The Network downsamples the input image by factors of 32, 16 and 8 at these places, respectively.

These three numbers are called the strides of the network, and they show how much smaller the output at the three separate places in the Network is than the input. For instance, with a stride of 32 and a network input size of 416x416, the output is 13x13. Consequently, for stride 16 the output is 26x26, and for stride 8 the output is 52x52. The 13x13 output is responsible for detecting large objects, 26x26 for medium objects, and 52x52 for small objects.

That is why the input to the Network must be divisible by 32 without a remainder: if it is divisible by 32, then it is divisible by 16 and 8 as well.
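The relationship between the network input size, the strides and the output grids can be checked with a couple of lines:

```python
net_size = 416
for stride in (32, 16, 8):
    assert net_size % stride == 0          # guaranteed once net_size is divisible by 32
    grid = net_size // stride
    print(f"stride {stride:2d} -> {grid}x{grid} grid")
# stride 32 -> 13x13, stride 16 -> 26x26, stride 8 -> 52x52
```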

Detection Kernels:

Detection kernels are the filters that are trained to produce the output. YOLO v3 applies 1x1 detection kernels at the three separate places in the Network. These 1x1 convolutions are applied to the downsampled feature maps of sizes 13x13, 26x26 and 52x52, so the resulting feature maps keep the same spatial dimensions.

The detection kernel also has a depth, calculated as b x (5 + c). Here, "b" represents the number of bounding boxes that each cell of the produced feature map can predict. YOLO v3 predicts 3 bounding boxes for every cell of these feature maps, so "b" equals 3. Each bounding box has (5 + c) attributes: the centre coordinates of the bounding box; its width and height; an objectness score; and a list of confidences for every class this bounding box might belong to.

We will assume that YOLO v3 is trained on the COCO dataset, which has 80 classes. Then "c" equals 80 and the total number of attributes for each bounding box is 85. The resulting depth is 3 multiplied by 85, which gives us 255.

Now we can say that each feature map produced by the detection kernels at the three separate places in the Network has one more dimension, a depth of 255, which encodes the bounding box attributes for the COCO dataset. The shapes of these feature maps are (13x13x255), (26x26x255) and (52x52x255).
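The same arithmetic in code, for COCO with 80 classes and 3 anchors per scale:

```python
num_classes = 80                              # COCO
boxes_per_cell = 3                            # anchors per scale
depth = boxes_per_cell * (5 + num_classes)    # 3 * 85 = 255 attributes per cell

for grid in (13, 26, 52):
    print(f"feature map shape: ({grid}, {grid}, {depth})")
```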

Grid Cells:

Grid cells are the detection cells. We already know that YOLO v3 predicts 3 bounding boxes for every cell of the feature map. Each cell, in turn, predicts an object through one of its bounding boxes if the centre of the object falls into the receptive field of this cell. This is the task of YOLO v3 during training: identify the cell into which the centre of the object falls. Again, this is one of the cells of the feature maps produced by the detection kernels, as discussed before.

Training YOLO v3:

When YOLO v3 is training, it has one ground truth bounding box that is responsible for detecting one object. That is why, first of all, we need to define which cells this bounding box belongs to. To do that, let's consider the first detection scale, where the stride of the Network is 32. The input image of 416x416 is downsampled into a 13x13 grid of cells, as calculated above. This grid now represents the produced output feature map. Once all the cells that the ground truth bounding box covers are identified, YOLO v3 assigns the centre cell to be responsible for predicting this object, and the objectness score for this cell is set to 1. Again, this is the cell of the corresponding feature map that is now responsible for detecting the lemon. But during training, every cell, including this one, predicts 3 bounding boxes. Which one should we choose? Which one should be assigned as the best predicted bounding box for the lemon? Here comes the concept of anchor boxes.
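A minimal sketch of that assignment, with hypothetical ground-truth centre coordinates:

```python
# Hypothetical ground truth: object centre at (x=250, y=180) in a 416x416 image
x_center, y_center = 250, 180
stride = 32                              # first detection scale, 13x13 grid

cell_x = int(x_center // stride)         # 250 // 32 = 7
cell_y = int(y_center // stride)         # 180 // 32 = 5
print(f"cell ({cell_x}, {cell_y}) is responsible for this object")
```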

Anchor Boxes:

Anchor boxes are predefined priors. To predict bounding boxes, YOLO v3 uses predefined default bounding boxes called anchors or priors. These anchors are used later to calculate the predicted bounding box's real width and real height. In total, 9 anchor boxes are used, three for each scale. It means that at each scale, every grid cell of the feature map can predict 3 bounding boxes by using 3 anchors.

To calculate these anchors, k-means clustering is applied in YOLO v3. The widths and heights of the 9 anchors for the COCO dataset are grouped according to the scale at the three separate places in the Network, as listed below. Let's consider an example of how one of the 3 anchor boxes is chosen to later calculate the real width and real height of a predicted bounding box.
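For reference, here are the anchor (width, height) pairs from the public yolov3.cfg for the COCO dataset, grouped by detection scale (the grouping follows the anchor masks in that config):

```python
# (width, height) in pixels of the 416x416 network input, from the public yolov3.cfg
anchors = {
    "13x13 (large objects)":  [(116, 90), (156, 198), (373, 326)],
    "26x26 (medium objects)": [(30, 61), (62, 45), (59, 119)],
    "52x52 (small objects)":  [(10, 13), (16, 30), (33, 23)],
}
```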

We have an input image of shape (416x416x3). The image goes through the YOLO v3 deep CNN architecture until the first separate place, which has stride 32. The input image is downsampled by stride 32 to a 13x13 feature map with a depth of 255, i.e. (13, 13, 255), produced by the detection kernels at scale 1. Since we have 3 anchor boxes, each cell encodes information about 3 predicted bounding boxes. Each predicted bounding box has the following attributes: centre coordinates; predicted width and predicted height; objectness score; and a list of confidences for every class this bounding box might belong to. As we use the COCO dataset in this example, this list has 80 class confidences.

Now we need to extract the probabilities for the 3 predicted bounding boxes of this cell to identify which class each box contains. To do that, we compute the elementwise product of the objectness score and the list of class confidences. Then we find the maximum probability and can say, for example, that this box detected the class lemon with probability 0.55.
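In code, the per-box score could be computed like this (the numbers are illustrative, not actual network outputs):

```python
import numpy as np

objectness = 0.7                          # p0 for one predicted box (illustrative)
class_confidences = np.random.rand(80)    # 80 COCO class confidences (illustrative)

scores = objectness * class_confidences   # elementwise product
best_class = int(np.argmax(scores))
best_score = float(scores[best_class])
print(f"class {best_class} with score {best_score:.2f}")
```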

These calculations are applied to all 13x13 cells, across 3 predicted boxes and 80 classes. The number of predicted boxes at this first scale of the Network is 507. The same calculations are also applied at the other scales of the Network, giving 2028 and 8112 predicted boxes. In total, YOLO v3 predicts 10,647 boxes, which are then filtered with the non-maximum suppression technique.
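The box counts follow directly from the grid sizes:

```python
total = 0
for grid in (13, 26, 52):
    boxes = grid * grid * 3          # 3 boxes per cell
    total += boxes
    print(f"{grid}x{grid}: {boxes} boxes")
print(f"total: {total} boxes")       # 507 + 2028 + 8112 = 10647
```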

Predicted Bounding Boxes:

Predicted bounding boxes are the resulting boxes. We already know that anchors are bounding box priors and that they were calculated using k-means clustering; for the COCO dataset they are the ones listed above.

To predict the real width and real height of the bounding boxes, YOLO v3 calculates offsets to the predefined anchors. This offset is also called a log-space transform. To predict the centre coordinates of the bounding boxes, YOLO v3 passes the outputs through a sigmoid function. The following equations are used to obtain the predicted bounding box's width, height and centre coordinates.
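For reference, the decoding equations from the YOLOv3 paper are:

bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * exp(tw)
bh = ph * exp(th)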

bx, by, bw, bh are the centre coordinates, width and height of the predicted bounding box. tx, ty, tw and th are outputs of the Network after training.

To better understand these outputs, let's again have a look at how YOLO v3 is trained. It has one ground truth bounding box and one centre cell responsible for this object. The weights of the Network are trained to predict this centre cell and the bounding box's coordinates as accurately as possible.

After training, a forward pass of the Network outputs the coordinates tx, ty, tw and th. Next, cx and cy are the coordinates of the top left corner of the responsible grid cell. Finally, pw and ph are the width and height of the appropriate anchor box.

YOLO v3 doesn’t predict absolute values of width and height. Instead, it predicts offsets to anchors. Why?

Because it helps to eliminate unstable gradients during training. For the same reason, the values cx, cy, pw and ph are normalized to the real image width and height, and the centre offsets tx, ty are passed through a sigmoid function, which gives values between 0 and 1.

Consequently, to get absolute values after prediction, we simply need to multiply them by the real image width and height.
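A minimal NumPy sketch of this decoding step, following the normalized convention described above (cell offsets and anchor sizes scaled to [0, 1], then multiplied by the real image size); the input values are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, grid_size, img_w, img_h):
    """Turn raw network outputs into an absolute box centre and size in pixels.

    cx, cy    -- top-left corner of the responsible grid cell (in cell units)
    pw, ph    -- anchor width/height, normalized to the image size
    grid_size -- 13, 26 or 52, depending on the detection scale
    """
    bx = (sigmoid(tx) + cx) / grid_size * img_w   # absolute centre x
    by = (sigmoid(ty) + cy) / grid_size * img_h   # absolute centre y
    bw = pw * np.exp(tw) * img_w                  # log-space offset to the anchor width
    bh = ph * np.exp(th) * img_h                  # log-space offset to the anchor height
    return bx, by, bw, bh

# Hypothetical outputs for cell (7, 5) on the 13x13 grid with a normalized anchor:
print(decode_box(0.2, -0.1, 0.3, 0.1, cx=7, cy=5,
                 pw=116 / 416, ph=90 / 416,
                 grid_size=13, img_w=416, img_h=416))
```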

Objectness Score:

The objectness score is the probability that a cell is the centre cell. We know that for every cell, YOLO v3 outputs bounding boxes with their attributes. These attributes are tx, ty, tw, th, p0 and the 80 confidences for every class this bounding box might belong to. These outputs are later used to score the boxes, choose anchor boxes, and calculate each predicted bounding box's real width and height using the chosen anchors.

Here, p0 is the so-called objectness score. We remember that during training, YOLO v3 assigns the centre cell of the ground truth bounding box to be responsible for predicting the object. Consequently, this cell and its neighbours have an objectness score of nearly 1, whereas corner cells have an objectness score of almost 0. In other words, the objectness score represents the probability that this cell is a centre cell responsible for predicting one particular object and that the corresponding bounding box contains an object.

The difference between the objectness score and the 80 class confidences is that the class confidences represent the probabilities that the detected object belongs to a particular class, such as person, car or cat, whereas the objectness score represents the probability that the bounding box contains an object at all.

Mathematically, the objectness score can be represented as P(object) x IoU(predicted box, ground truth box), where P(object) is the predicted probability that the bounding box contains an object and IoU is the intersection over union between the predicted bounding box and the ground truth bounding box.

The result is passed through a sigmoid function, which gives values between 0 and 1.
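Since IoU appears in this definition (and again in non-maximum suppression), here is a minimal sketch of how it is computed for two axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])

    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((50, 50, 150, 150), (100, 100, 200, 200)))  # ~0.14
```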

Conclusion:

YOLO v3 applies convolutional neural networks to the input image.

To predict bounding boxes, it downsamples the image at three separate places in the Network, called scales.

During training, it uses 1x1 detection kernels that are applied to the grid of cells at these three separate places in the Network.

The Network is trained to assign only one cell to be responsible for detecting an object: the cell into which the centre of this object falls.

Nine predefined bounding boxes are used to calculate the spatial dimensions and coordinates of the predicted bounding boxes.

These predefined boxes are called anchors or priors, with 3 anchor boxes for each scale.

In total, YOLO v3 predicts 10,647 bounding boxes, which are filtered with the non-maximum suppression technique, leaving only the best ones.

References:

YOLO, FPS Comparison, Darknet
