There is a clear difference between face detection, face verification, and face recognition. Detection is the task of locating a human face in an image, regardless of whose face it is. Verification is the task where you are given an input image as well as a name or ID of a person, and the job of the system is to verify whether the input image is that of the claimed person. Recognition goes a step further: given an input image, the system must identify which of the people in its database, if any, the face belongs to.
Face verification is sometimes called a one-to-one problem: you just want to know if the person is who they claim to be. Recognition is a one-to-many problem and is much harder than detection, where you don't care about the identity of the face but only the presence of a face.
One-shot learning
One of the challenges of face recognition is that you need to solve the one-shot learning problem. The problem here is that you need to be able to recognize a person given just one single image of that person's face.
Historically, deep learning algorithms have not worked well with only one training example: how can training occur in the near-absence of data? But in the one-shot learning problem, you have to learn from just one example to recognize the person again. Moreover, most face recognition systems need this, because you might have only one picture of each person.
One way to do this is to input the image of the person to a ConvNet whose softmax output layer has one unit per person in the database. The softmax gives the probability that the input matches each of the classes the network was trained on, so it tells you whether this image matches one of those in the database and by how much. However, if we want to add a new person to the team, we will have to redesign our network to have one more output neuron in the softmax layer and retrain it. Now, what if this is a company's verification system and we want to add 100 people, maybe 1,000?
Do you have to retrain the ConvNet every time someone joins?
Here comes one-shot learning. In short, it is a technique that lets you enroll a person with only one image and still get good results whenever that person's face is put to test. So, what are we actually going to do here?
We are going to train the network to learn a similarity function, a function that takes two images and outputs the degree of difference between the two images. If the two images are of the same person, you want this to output a small number. If the two images are of two very different people you want it to output a large number. During recognition time, if the degree of difference between them is less than some threshold called tau, which is a hyper-parameter, then it should predict that these two pictures are the same person. If it is greater than tau, it should predict that these are different persons.
To use this for a recognition task, given a new picture, you use the function d to compare it with the first image in your database. Maybe it gives a very large number, say 10, so these are different people; then you compare it with the second image in the database. Because these two are the same person, it outputs a very small number. You do this for the other images in your database, and so on.
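Putting this together, here is a minimal sketch of the recognition loop, assuming the ConvNet's encodings have already been computed. The toy 2-dimensional encodings, the names, and the threshold value are purely illustrative:

```python
import numpy as np

def d(enc1, enc2):
    # Squared L2 distance between two encodings.
    return float(np.sum((enc1 - enc2) ** 2))

def recognize(query_enc, database, tau=0.7):
    """Return the name of the closest match in `database` whose
    distance to the query is below the threshold tau, else None."""
    best_name, best_dist = None, float("inf")
    for name, enc in database.items():
        dist = d(query_enc, enc)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < tau else None

# Toy encodings standing in for the ConvNet's output:
database = {"alice": np.array([0.1, 0.9]), "bob": np.array([0.8, 0.2])}
print(recognize(np.array([0.12, 0.88]), database))  # prints "alice"
```

Lowering tau makes the system stricter: fewer false accepts, but more false rejects of genuine users.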
We will also need to define another very important term, the triplet loss, shortly. What matters for now is that the ConvNet can learn this function that inputs a pair of images and tells you if they're the same person or different persons. So how can you actually train a neural network to learn this function d?
Let's say we have two images, I1 and I2. In a ConvNet, convolution and max-pooling layers progressively encode an image into a feature vector. The same happens here: deep in the network, each of the two images is encoded into a feature vector by two networks that share the same parameters, and the two encodings are then compared for similarity. The distance is the norm of the difference between the encodings of the two images. This twin architecture is called a Siamese network.
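To make the shared-parameter idea concrete, here is a minimal numpy sketch in which a single random linear layer stands in for the deep ConvNet. The weight matrix W and the 4096-dimensional input are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Shared weights of the (toy) encoder: both branches use this same W.
W = rng.standard_normal((128, 4096)) * 0.01

def encode(image_vec):
    # Both branches of the Siamese network apply the SAME parameters,
    # so identical inputs always get identical 128-d encodings.
    return W @ image_vec

def d(img1, img2):
    # Degree of difference: squared norm of the encoding difference.
    diff = encode(img1) - encode(img2)
    return float(np.sum(diff ** 2))

x = rng.standard_normal(4096)
assert d(x, x) == 0.0  # same image -> zero distance, by construction
```

Because the parameters are shared, the network cannot "cheat" by encoding the two inputs differently; only the images themselves determine the distance.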
So how do you train this Siamese neural network?
Remember that these two neural networks have the same parameters. What you want to do is really train the neural network, so that the encoding that it computes results in a function d that tells you when two pictures are of the same person.
More formally, the parameters of the neural network define an encoding of the image. So given any input image I, the neural network outputs this 128 dimensional encoding of I. What we want to do is learn parameters so that if two pictures I and J, are of the same person, then you want that distance between their encodings to be small. In contrast, if I and J are of different persons, then you want that distance between their encodings to be large.
As you vary the parameters in all of these layers of the neural network, you end up with different encodings. You can use backpropagation to vary all those parameters in order to make sure these conditions are satisfied.
To apply the triplet loss, you need to compare pairs of images. For example, given this pair of images, you want their encodings to be similar because these are the same person.
Whereas, given this pair of images, you want their encodings to be quite different because these are different persons.
In the terminology of the triplet loss, what you are going to do is always look at one anchor image: you want the distance between the anchor and the positive image to be small, whereas you want the distance between the anchor and the negative example to be much larger. Hence the term "triplet loss": you will always be looking at three images at a time.
Now, let's abbreviate them to A, P, and N, for anchor, positive, and negative. You want the squared norm of the difference between the encodings of the anchor and the positive example, d(A, P), to be small; in particular, you want d(A, P) + alpha to be less than or equal to d(A, N), the squared norm of the difference between the encodings of the anchor and the negative example. You can think of d as a distance function. The margin alpha, a hyperparameter, is what stops the network from learning the trivial solution of outputting the same encoding for every image, which would satisfy d(A, P) <= d(A, N) with zero on both sides.
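To make the margin concrete, here is a worked example of the loss for a single triplet, on toy 2-dimensional encodings (all numbers are purely illustrative):

```python
import numpy as np

def triplet_loss_single(f_a, f_p, f_n, alpha=0.2):
    # d(A, P) and d(A, N): squared norms of encoding differences.
    d_ap = np.sum((f_a - f_p) ** 2)
    d_an = np.sum((f_a - f_n) ** 2)
    # Loss is zero as soon as d(A, P) + alpha <= d(A, N).
    return max(d_ap - d_an + alpha, 0.0)

f_a = np.array([0.0, 0.0])
f_p = np.array([0.1, 0.0])  # close to the anchor: d(A, P) = 0.01
f_n = np.array([1.0, 0.0])  # far from the anchor:  d(A, N) = 1.00
print(triplet_loss_single(f_a, f_p, f_n))  # 0.01 - 1.0 + 0.2 < 0 -> 0.0
```

Note that once the margin is met, the loss clamps to zero, which is exactly why the network does not care how much larger d(A, N) becomes.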
The neural network doesn't care how much larger d(A, N) is, as long as the margin is satisfied. But in order to define this dataset of triplets, you do need some A, P pairs - pairs of pictures of the same person. So, for the purpose of training your system, you do need a dataset where you have multiple pictures of the same person.
If you had just one picture of each person, you couldn't actually train this system. Of course, after having trained the system, you can then apply it to your one-shot learning problem, where your face recognition system may have only a single picture of someone you are trying to recognize. But for your training set, you do need multiple images of the same person, at least for some people, so that you can form pairs of anchor and positive images.
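A minimal sketch of how such triplets could be sampled, assuming the dataset is a hypothetical dict mapping each person to a list of their images (the names and filenames are made up for illustration):

```python
import random

def sample_triplet(dataset, rng=random):
    """dataset: dict mapping person -> list of their images.
    Returns (anchor, positive, negative). Requires at least one
    person with two or more images, plus a second person."""
    people_with_pairs = [p for p, imgs in dataset.items() if len(imgs) >= 2]
    person = rng.choice(people_with_pairs)
    # Anchor and positive: two different images of the same person.
    anchor, positive = rng.sample(dataset[person], 2)
    # Negative: any image of a different person.
    other = rng.choice([p for p in dataset if p != person])
    negative = rng.choice(dataset[other])
    return anchor, positive, negative

dataset = {"alice": ["a1.jpg", "a2.jpg"], "bob": ["b1.jpg"]}
print(sample_triplet(dataset))
```

This is why the training set must contain multiple images for at least some people: without them, the anchor-positive pair cannot be formed.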
Now, how do you actually choose these triplets to form your training set?
To construct a training set, what you want to do is choose triplets A, P, and N that are hard to train on. A triplet is hard if d(A, P) is actually quite close to d(A, N). In that case, the learning algorithm has to try extra hard to push d(A, N) up or push d(A, P) down so that there is at least a margin of alpha between the two. The effect of choosing these hard triplets is that it increases the efficiency of your learning algorithm: if you pick triplets at random, most of them already satisfy the constraint by a wide margin, and gradient descent learns almost nothing from them.
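One simple way to find such hard negatives (a sketch under assumed names, not a method from the original text) is to pick, among candidate negatives with precomputed encodings, the one closest to the anchor:

```python
import numpy as np

def hardest_negative(anchor_enc, negative_encs):
    # Hard negatives are those with the smallest d(A, N): they make
    # d(A, P) - d(A, N) + alpha as large as possible, so the network
    # has to work hardest to satisfy the margin.
    d_ans = np.sum((negative_encs - anchor_enc) ** 2, axis=1)
    return int(np.argmin(d_ans))

anchor = np.array([0.0, 0.0])
candidates = np.array([[5.0, 0.0],   # easy negative, d(A, N) = 25
                       [0.4, 0.0],   # hard negative, d(A, N) = 0.16
                       [2.0, 0.0]])  # easy negative, d(A, N) = 4
idx = hardest_negative(anchor, candidates)
print(idx)  # 1: the candidate closest to the anchor
```

In practice, mining is usually done within each mini-batch, since recomputing encodings for the whole dataset after every update would be too expensive.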
Fortunately, many of the famous tech companies have trained these large networks and posted their parameters online. So rather than trying to train one of these networks from scratch, this is one domain where, because of the sheer volume of data required, it is often useful to download someone else's pre-trained model rather than do everything from scratch yourself.
In TensorFlow, the triplet loss implementation looks something like this (assuming y_pred is a list of the three encodings):

def triplet_loss(y_true, y_pred, alpha = 0.2):
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # Compute the (encoding) distance between the anchor and the positive.
    pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, positive)), axis = -1)
    # Compute the (encoding) distance between the anchor and the negative.
    neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor, negative)), axis = -1)
    # Subtract the two previous distances and add alpha.
    basic_loss = pos_dist - neg_dist + alpha
    # Take the maximum of basic_loss and 0.0. Sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
    return loss
Many of the ideas here came from the DeepFace paper by Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf; the triplet loss itself was introduced in the FaceNet system by Florian Schroff, Dmitry Kalenichenko, and James Philbin.