# YOLO for self-driving cars: detecting cars, motorcycles & pedestrians

We will discuss two methods developed to identify the objects cars usually encounter on the road. Motorcycles (bikes), pedestrians and cars are the main items to identify here, but there is more, as you will read later in the article. YOLO (You Only Look Once) and its derivatives are the main focus of this article.

The world has become more machine-dependent in recent years, but machines that only perform one fixed task have become insufficient and increasingly unsatisfactory to man, and this is where machine learning and machine intelligence came in. Now, man wants the machine to learn how to do its task and improve itself at it.

Self-driving vehicles are really an old topic, but what has made them so hot in the last couple of years are the advancements in data and processing. The processing power of the machines that run these algorithms has grown enormously, relying largely on GPUs, which gave them a very powerful leap. On the other hand, the availability of training data and huge datasets has made the problem much easier. Now you can easily download gigabytes, even terabytes, of video and images of vehicles in different environments; you can even run your algorithms on simulators and learn from virtual environments.

As you can see, most automotive companies nowadays are forming teams and hiring heavily for self-driving projects; Tesla, NVIDIA, Valeo and Volvo are the leaders, among others of course.

So, let us go ahead and get introduced to some of the most famous industry algorithms and underlying techniques of such technologies.

Let us say you want to build a car detection algorithm. Here is what you can do: first, create a labeled training set (X, Y) with closely cropped examples of cars, as shown in the following images.

X will be the image and Y will be a Boolean, zero or one (car/no-car); of course, we will not write the Boolean number on the image itself, the numbers below are for illustration only. You can start with an image of a street view containing one or multiple cars, or with the closely cropped images, meaning that x is only the car. Given this labeled training set, you can then train a system that takes an image as input, like one of these closely cropped images, and outputs y, zero or one: is there a car or not.

Once you have trained this system, you can then use it in Sliding Windows Detection.

### But what is Sliding Windows Detection?

If you have a test image like this, you start by picking a certain window size, shown down there, and then feed small rectangular regions of the image into the system. So, take just the square below, input it into the system, and have the system make a prediction. For the region in that square, it will say that it does not contain a car. Oh, found you! The red square does contain a car. This red square is called a bounding box.

Has anyone noticed that the previous image is a mirror of the one before? I did that intentionally because I wanted to stress an important concept here: data augmentation, a powerful technique that helps when data is scarce. Look at the image below. What I did here is actually two things: changing the size of the image, and in turn the number of pixels, and changing the orientation (mirroring the image). This gives a completely different image, different to the neural network, not to us of course. Therefore, I can easily double or even triple my training set using augmentation.
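To show how little code these two augmentations take, here is a minimal NumPy sketch; the naive every-other-pixel subsampling stands in for a proper resize:

```python
import numpy as np

def augment(image):
    """Return two simple augmented variants of an H x W x 3 image array:
    a horizontal mirror and a crudely 2x-downscaled copy."""
    mirrored = image[:, ::-1]        # flip left-to-right
    downscaled = image[::2, ::2]     # keep every other pixel (crude resize)
    return mirrored, downscaled

# toy 4x4 "image" just to show the shapes involved
img = np.arange(48).reshape(4, 4, 3)
mirrored, downscaled = augment(img)
print(mirrored.shape, downscaled.shape)   # (4, 4, 3) (2, 2, 3)
```

Each variant is a new training example to the network at essentially zero labeling cost.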

Back to the real topic: you keep going until you have slid the window across every position in the image. You can adjust the stride of your algorithm to move faster, of course; the stride is the number of pixels the window moves by at each step.

Having done this is called sliding the window through the image. You can then repeat it with a larger window, taking a slightly larger region each time, resizing that region to whatever input size the system expects, feeding it to the system, and having it output zero or one.
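The whole loop can be sketched in a few lines. This is only an illustration: the `classifier` callable is a hypothetical stand-in for the trained car/no-car system described above.

```python
import numpy as np

def sliding_windows(image, window, stride):
    """Yield (row, col, crop) for every window position in the image."""
    h, w = image.shape[:2]
    for r in range(0, h - window + 1, stride):
        for c in range(0, w - window + 1, stride):
            yield r, c, image[r:r + window, c:c + window]

def detect(image, classifier, window=64, stride=16):
    """Run a car/no-car classifier on every crop; collect positive boxes."""
    boxes = []
    for r, c, crop in sliding_windows(image, window, stride):
        if classifier(crop):                      # classifier returns 0 or 1
            boxes.append((r, c, window, window))  # top-left corner + size
    return boxes

# toy example: the "classifier" fires when the crop's mean intensity is high
img = np.zeros((128, 128))
img[32:96, 32:96] = 1.0                           # a bright 64x64 "car"
found = detect(img, lambda crop: crop.mean() > 0.9)
print(found)   # [(32, 32, 64, 64)]
```

Note how many crops even this tiny image generates; with a real convnet as the classifier, each crop is a full forward pass, which is exactly the cost problem discussed next.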

There is a huge disadvantage to Sliding Windows Detection: the computational cost. You are cropping out so many different square regions in the image and running each of them independently through the system. If you use a very big stride, you reduce the number of windows you need to pass through the system, but that coarse granularity may hurt performance. Whereas if you use a very fine granularity (a very small stride), the huge number of little regions you are passing through the system means a very high computational cost.

We used to run simple linear classifiers on images to perform object detection, but that is history now. Neural networks, and convolutional networks in particular, have made detection so much easier and more effective.

Sliding Windows Detection was not a bad method, but we can’t afford the computational cost.

What if we made the sliding window size dynamic; in other words, what if we were not obliged to slide a fixed-size window over the whole image? Is there a way to get this algorithm to output more accurate bounding boxes?

### YOLO

A good way to get more accurate bounding boxes is the YOLO algorithm. YOLO stands for "You Only Look Once".

Let's say you have a 100 by 100 input image. You lay a 3x3 grid on this image; the idea is that you run an image classification and localization algorithm on each of those nine grid cells, so each of your outputs describes what is in that box (cell) and where it is located, defined by its center. (It is usually many more than nine cells, but we assume a 3x3 grid for simplicity.)

Now, you have to define the labels you use for training. For each of the nine grid cells, you specify a label Y, where Y is an eight-dimensional vector. So what are the eight variables? P is a probability (a real number from 0 to 1) for the certainty of an object being present. X and Y denote the center of the object, and H and W are the height and width of the bounding box to be drawn around the object, guided by that center. That leaves the three Cs: these are the classes of the detection system we want to build; here I chose cars, pedestrians and bikes/motorcycles. You can build a system that works on 5 classes (add traffic lights and signs) or any number of classes you want, which accordingly increases the length of the vector. The classes are one-hot encoded, meaning that if there is a car, then C1, C2, C3 will be 1, 0, 0.

So, in the previous image we have nine grid cells, and you have a vector like this for each of them.

Let's start with the upper left grid cell, this one up here. For that one, there is no object, so the label vector Y for the upper left grid cell has P equal to zero, and "don't cares" for the rest of its components. The output label Y would be the same for this grid cell, and this grid cell, and all the grid cells with nothing, with no interesting object in them. How about the cells that do contain objects? To give a bit more detail, this image has two objects. What the YOLO algorithm does is take the midpoint of each of the two objects and assign each object to the grid cell containing its midpoint. So the upper car is assigned to this grid cell, and the car at the bottom, whose midpoint lies in a different cell, is assigned to that one, and so on.

And then you write BX, BY, BH, BW (the X, Y, H, W components above) to specify the position of the bounding box. So, for each of these nine grid cells you end up with an eight-dimensional output vector, and because you have a 3 by 3 grid, nine cells in total, the total volume of the output is 3 by 3 by 8.
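Building these training labels is mechanical. Here is a minimal sketch, assuming hypothetical annotations where each object is given by its normalized midpoint, box size, and class id:

```python
import numpy as np

# Hypothetical labels: each object is (mid_x, mid_y, width, height, class_id),
# with coordinates normalized to [0, 1] over the whole 100x100 image.
objects = [(0.40, 0.75, 0.30, 0.20, 0),   # a car in the lower-middle area
           (0.85, 0.70, 0.25, 0.18, 0)]   # another car on the lower right

GRID, CLASSES = 3, 3                       # 3x3 grid; classes: car/pedestrian/bike
Y = np.zeros((GRID, GRID, 5 + CLASSES))    # [p, x, y, h, w, c1, c2, c3] per cell

for (mx, my, w, h, cls) in objects:
    col = min(int(mx * GRID), GRID - 1)    # grid cell containing the midpoint
    row = min(int(my * GRID), GRID - 1)
    Y[row, col, 0] = 1.0                   # p: an object is present here
    Y[row, col, 1:5] = [mx, my, h, w]      # midpoint plus box height/width
    Y[row, col, 5 + cls] = 1.0             # one-hot class (c1 = car)

print(Y.shape)   # (3, 3, 8)
```

Every cell without a midpoint keeps p = 0, matching the "don't care" labels described above.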

Now, to train the network, the input will be 100 by 100 by 3, and you have a usual convnet with convolutional layers, max-pooling layers, and so on.

In the end, you have an input X, the input image, and target labels Y, which are 3 by 3 by 8, and you use backpropagation to train the neural network to map any input X to this type of output volume Y. The neural network outputs precise bounding boxes as shown below. At test time, you feed an input image X and run forward propagation until you get the output Y. Then, for each of the nine positions of the 3 by 3 output, you can just read off whether there is an object there (1 or 0), and if there is, what object it is and where its bounding box lies within that grid cell.
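Reading detections off the output volume can be sketched like this, assuming the [p, x, y, h, w, c1, c2, c3] layout described earlier:

```python
import numpy as np

def read_off(Y, p_threshold=0.5):
    """Scan a grid x grid x 8 output volume and list detected objects per cell."""
    names = ["car", "pedestrian", "bike"]
    detections = []
    for row in range(Y.shape[0]):
        for col in range(Y.shape[1]):
            p = Y[row, col, 0]
            if p > p_threshold:                       # is there an object here?
                x, y, h, w = Y[row, col, 1:5]         # box midpoint and size
                cls = names[int(np.argmax(Y[row, col, 5:]))]
                detections.append((row, col, cls,
                                   (float(x), float(y), float(h), float(w))))
    return detections

# toy output volume with one car detected in the bottom-middle cell
Y = np.zeros((3, 3, 8))
Y[2, 1] = [1.0, 0.4, 0.75, 0.2, 0.3, 1.0, 0.0, 0.0]
print(read_off(Y))   # [(2, 1, 'car', (0.4, 0.75, 0.2, 0.3))]
```

A real network outputs real-valued probabilities rather than clean zeros and ones, which is why the threshold on p matters.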

As a reminder, the way you assign an object to a grid cell is to look at the midpoint of the object and assign it to whichever grid cell contains that midpoint.

So even if an object spans multiple grid cells, that object is assigned to only one of the nine grid cells, one of the 3 by 3.

This is a convolutional implementation, so you will not be running this algorithm nine times over the 3 by 3 grid. The YOLO algorithm actually runs pretty fast; that's why it can run in real time, and that's why it's used in autonomous vehicles.

This algorithm works fine as long as you don't have multiple objects within the same grid cell. But what if you have two bounding boxes in one cell of the grid?

### Non-maximum suppression

How to run non-max suppression?

Well, the name explains it: if you have two competing detections, each with its own probability, non-maximum suppression keeps the most probable one and suppresses the overlapping rest, simple as that.

If you have two predicted boxes in the same grid cell, one of them will usually have a very low probability, a very low P (some of the bounding boxes can even extend beyond the height and width of the grid cell they came from). So, what you do next is get rid of the low-probability predictions; get rid of the ones where the neural network says the object probably isn't there.
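What counts as "competing" is usually measured by intersection over union (IoU), the overlap ratio between two boxes: boxes that overlap a higher-scoring box too much get suppressed. A minimal sketch, with boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)       # intersection area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# two equal-size boxes, half-overlapping horizontally:
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))   # 0.333... (overlap 2, union 6)
```

IoU equals 1 for identical boxes and 0 for disjoint ones, so a threshold around 0.5 is a common choice for deciding when two boxes describe the same object.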

Non-maximum suppression’s code can be something like this:

```python
# K in the following code is the Keras backend; tf is TensorFlow
import tensorflow as tf
from keras import backend as K

def non_max_suppression(scores, boxes, classes, max_boxes=10, threshold=0.5):
    # scores    -- predicted score for each box
    # boxes     -- predicted box coordinates
    # classes   -- predicted class for each box
    # max_boxes -- integer, maximum number of predicted boxes we'd like
    # threshold -- real value, IoU threshold used for non-maximum suppression

    max_boxes_t = K.variable(max_boxes, dtype='int32')   # cast to an int32 tensor
    K.get_session().run(tf.variables_initializer([max_boxes_t]))

    # indices of the boxes that survive non-maximum suppression
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_t,
                                               iou_threshold=threshold)

    # K.gather keeps only the entries at the surviving indices
    scores = K.gather(scores, nms_indices)
    boxes = K.gather(boxes, nms_indices)
    classes = K.gather(classes, nms_indices)

    return scores, boxes, classes
```

We have seen two techniques for object detection used in today's most advanced systems. YOLO proved to be the better of the two because it runs in real time, which is a key feature nowadays, especially for self-driving applications. You can also look at YOLOv2, as it handles more cases than the standard YOLO we talked about. Do you have experience with these methods? What is your opinion? Just comment below the article. Would you like to read more about this topic? Just enter your ideas in the comment box (you need to log in or register to comment).