Tuesday, July 9, 2019

WebRTC and TensorFlow.js

Webcam image classification on a web page


I will give a quick tutorial on how to connect a webcam via HTML5/WebRTC to TensorFlow.js for image classification. We will load a pretrained MobileNet and then feed it frames from the camera. First, we define our HTML page: it contains a video element into which we later stream the camera video, a label field for the predicted class, and the script tag that starts the WebRTC + neural network code. The code can be found at this gist.


<!DOCTYPE html>
<html>
  <head>
    <title> Hello WebRTC </title>
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>
  </head>
  <body>
    <video id="cam" width="224" height="224" autoplay playsinline></video> <br/>
    <script src="camnet.js"></script><br/>
    <font color="red" size="5"> LABEL: </font><br/>
    <div id="label"> </div>
  </body>
</html>
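
The script camnet.js also relies on a few constants that are not shown in the snippets below. A minimal sketch of what they could look like (the model URL and the class list are assumptions based on the official tfjs-examples MobileNet demo; the original gist may define them differently):

const MOBILENET_MODEL_PATH =
    'https://storage.googleapis.com/tfjs-models/tfjs/mobilenet_v1_0.25_224/model.json';
const IMAGE_SIZE = 224;
// IMAGENET_CLASSES maps class indices to the 1000 ImageNet label strings,
// for example copied from imagenet_classes.js of the tfjs-examples demo:
// const IMAGENET_CLASSES = {0: 'tench', 1: 'goldfish', ...};

// Kick everything off once the script is loaded.
init();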


Then we load the pretrained MobileNet and initialize the webcam.

async function init() {
    try {
        // Load the pretrained MobileNet and request a video-only stream
        // from the user's camera.
        const net = await tf.loadLayersModel(MOBILENET_MODEL_PATH);
        const constraints = window.constraints = {audio: false, video: true};
        const stream = await navigator.mediaDevices.getUserMedia(constraints);
        onSuccess(stream, net);
    } catch (e) {
        onError(e);
    }
}


The camera stream is retrieved with the getUserMedia function. The onError method simply writes the error to the console. If we are successful, we get the video element from the DOM and bind the stream to it. We then start the detection loop with a method called onFrame.
function onSuccess(stream, net) {
    // Bind the camera stream to the video element and start the detection loop.
    const video = document.querySelector('video');
    const videoTracks = stream.getVideoTracks();
    console.log('Got stream with constraints:', window.constraints);
    console.log(`Using video device: ${videoTracks[0].label}`);
    window.stream = stream;
    video.srcObject = window.stream;
    onFrame(video, net);
}
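
The onError handler itself is not part of the snippets here; a minimal version (my sketch, not necessarily the one from the gist) just logs the error:

function onError(error) {
    console.error('Could not get camera stream or load the model:', error);
}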


onFrame's inner function processFrame schedules itself with requestAnimationFrame, so it runs again for every rendered frame. On each iteration we pass the video element, the label element and the neural network to a classify method.
function onFrame(video, net) {
    var label_element = document.getElementById('label');
    net.summary();
    async function processFrame() {
        // Classify the current frame, then schedule the next run
        // for the following animation frame.
        await classify(video, label_element, net);
        requestAnimationFrame(processFrame);
    }
    processFrame();
}

The last method transforms the camera image into a tensor, normalizes the pixel values and then constructs a batch with a single example. Based on the prediction from the MobileNet, we extract the best class and write it into the label element.
async function classify(img_element, label_element, net) {
    // Convert the current frame to a float tensor, scale the pixel values
    // to [-1, 1] and add a batch dimension of one.
    const img = tf.browser.fromPixels(img_element).toFloat();
    const offset = tf.scalar(127.5);
    const normalized = img.sub(offset).div(offset);
    const batched = normalized.reshape([1, IMAGE_SIZE, IMAGE_SIZE, 3]);
    const logits = net.predict(batched);
    const prediction = await logits.data();
    // Dispose the intermediate tensors so the per-frame loop does not leak memory.
    tf.dispose([img, offset, normalized, batched, logits]);
    // Pick the class with the highest score.
    var max_i = 0;
    var max_v = prediction[0];
    for (let i = 0; i < prediction.length; i++) {
        if (prediction[i] > max_v) {
            max_v = prediction[i];
            max_i = i;
        }
    }
    const label = IMAGENET_CLASSES[max_i];
    if (max_v > 0.5) {
        label_element.innerHTML = label + " [" + max_v + "]";
    }
}



Sunday, July 7, 2019

Implementing the Triplet Loss Function in TensorFlow 2.0

In this post I will go through an implementation of the triplet loss for Siamese neural network architectures in Keras (TensorFlow 2.0).

The goal of training a neural network with a triplet loss is to learn a metric embedding: examples that are conceptually close should also be close in Euclidean space, while conceptually different examples should be far apart. The triplet loss was introduced in the FaceNet paper from Google. When learning with the triplet loss, we choose three examples. The first example is called the anchor, which can be any example from our dataset. The positive example is one that is conceptually close; for face recognition, it might be another picture of the same person as shown in the anchor image, while the negative example might simply be a picture of any other person. The embedding network itself can have any architecture. During training, we push all three examples through the network, compute the loss function on the three embeddings and adjust the weights, which step by step pushes negative examples further from their anchors and positive examples closer.


Figure 1: The triplet loss forces negative examples further from the anchor and positive examples closer. 
An example is shown below. The loss consists of three parts. The first is the (squared) Euclidean distance between the anchor and its positive example. The second is the (squared) Euclidean distance between the anchor and its negative example. Since we want to minimize the loss, the first term should be small and the second term should be large. Furthermore, we introduce a margin between the examples. Putting the parts together, the loss for a single triplet is max(d(anchor, positive) - d(anchor, negative) + margin, 0).

Figure 2: We embed the anchor, positive and negative examples with the same network and apply the triplet loss. During backpropagation we increase the Euclidean distance to the negative example and decrease the distance to the positive example. 


For unsupervised audio embeddings, one way to sample triplets is to pick a window from an audio file as the anchor and a window close in time as the positive example (since audio does not change that rapidly). The negative example is simply a window sampled from another file.
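
A possible sampling routine could look like the following sketch. The function name, window size and maximum offset are illustrative assumptions, and the audio files are assumed to be already loaded as 1-D NumPy arrays of samples:

import numpy as np

def sample_triplet(files, window_size=16000, max_offset=4000):
    # Pick two different files: one for the anchor/positive, one for the negative.
    a_idx, n_idx = np.random.choice(len(files), size=2, replace=False)
    a_file, n_file = files[a_idx], files[n_idx]
    # Anchor: a random window from the first file.
    a_start = np.random.randint(len(a_file) - window_size)
    anchor = a_file[a_start:a_start + window_size]
    # Positive: a window close in time to the anchor.
    p_start = np.clip(a_start + np.random.randint(-max_offset, max_offset),
                      0, len(a_file) - window_size)
    positive = a_file[p_start:p_start + window_size]
    # Negative: a random window from the other file.
    n_start = np.random.randint(len(n_file) - window_size)
    negative = n_file[n_start:n_start + window_size]
    return anchor, positive, negative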

In order to implement such a loss in TensorFlow 2.0 / Keras, we can subclass the Keras Loss base class (tf.keras.losses.Loss).

import tensorflow as tf
from tensorflow.keras.losses import Loss


class TripletLoss(Loss):

    def __init__(self, margin):
        super().__init__()
        self.margin = margin

    def call(self, y_true, y_pred):
        # y_pred is expected to stack the three embeddings along the first
        # axis: [anchor, positive, negative]; y_true is ignored.
        anchor = y_pred[0]
        pos    = y_pred[1]
        neg    = y_pred[2]
        # Squared Euclidean distances from the anchor to the positive
        # and negative embeddings.
        pos_dist   = tf.reduce_sum(tf.square(tf.subtract(anchor, pos)), axis=-1)
        neg_dist   = tf.reduce_sum(tf.square(tf.subtract(anchor, neg)), axis=-1)
        # Hinge loss: the positive should be closer than the negative by the margin.
        basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), self.margin)
        loss       = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
        return loss

As you can see from the code, we do not need the ground truth (y_true), so we simply pass dummy values during training (Keras will still check their dimensions). In order to construct a model using the triplet loss, we can build an embedder model and then apply that model three times in the triplet model, once each for the anchor, the positive and the negative input. During backpropagation, the three gradients are summed and passed through the shared embedder model (Deep Learning book, chapter 6, Algorithm 6.6); a sketch of how such a model can be wired together is shown below. In order to see a full example for audio data, you can check this gist.
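
The following sketch shows one way to assemble the triplet model around a shared embedder, using the TripletLoss class from above. The embedder architecture, the input size and the training comments are illustrative assumptions, not the exact code from the gist:

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_embedder(input_dim=16000, embedding_dim=128):
    # Toy embedder; any architecture works as long as it is shared.
    inp = layers.Input(shape=(input_dim,))
    x   = layers.Dense(256, activation='relu')(inp)
    out = layers.Dense(embedding_dim)(x)
    return Model(inp, out, name='embedder')

embedder = build_embedder()

# Three inputs share the same embedder, so its weights receive the summed
# gradients from the anchor, positive and negative branches.
anchor_in   = layers.Input(shape=(16000,), name='anchor')
positive_in = layers.Input(shape=(16000,), name='positive')
negative_in = layers.Input(shape=(16000,), name='negative')

# Stack the three embeddings along the first axis so that the loss can
# address them as y_pred[0], y_pred[1] and y_pred[2].
stacked = layers.Lambda(lambda embs: tf.stack(embs, axis=0))(
    [embedder(anchor_in), embedder(positive_in), embedder(negative_in)])

triplet_model = Model([anchor_in, positive_in, negative_in], stacked)
triplet_model.compile(optimizer='adam', loss=TripletLoss(margin=1.0))

# Training then passes dummy targets that TripletLoss ignores, e.g.:
# triplet_model.fit([anchors, positives, negatives], dummy_targets, ...)
# where dummy_targets only has to satisfy Keras' shape checks.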