Data Annotation Using Active Learning With Python Code

sourajit roy chowdhury
Published in The Startup
7 min read · Oct 5, 2020


In the era of machine learning, data, as we all know, is the new oil. We are also aware that gathering data is a non-trivial task. Even when we do manage to gather data, a huge portion of it is unlabelled and therefore unusable for the machine-learning task (supervised learning). Annotating/labelling data requires huge manual effort, involving significant cost and time. In this article, I will show you how Active Learning can help us solve the data-labelling problem.

Active Learning basic flow

Introduction

Active Learning is a method by which the learning algorithm inspects all the data-points and selects the few on which the learning model is confused (uncertain points).

Let’s consider a situation where we have a large amount of unlabelled data and a small set of labelled data. There are three possible ways we could label the rest:

  1. Manual labeling.
  2. Train a classification model and use the same for labeling.
  3. Active Learning.

The drawback of manual labelling is that it demands an inordinate amount of time and cost, along with a large number of Subject Matter Experts (SMEs) to annotate the data-set. Data security is also a concern when data is labelled manually.

Training a classification model on the small labelled set won’t help either. The first reason is that the training data is very small, so there is a huge chance of over-fitting the model; the second is that randomly chosen data-points are not necessarily informative to start with.

So, we are left with Active Learning.

Active Learning Steps

  1. A small set of manually labelled data is needed, one that is informative rather than random.
  2. Train a classification model on the above data-set (to be honest, the model won’t be a good one, and that’s perfectly fine).
  3. Predict on the unlabelled data-set.
  4. Choose some query points on which the model is uncertain.
  5. Manually label the above uncertain points and include them in the training data-set.
  6. Iterate steps 2 to 5 until the annotation budget is exhausted.

All the above steps seem trivial except step 4: how will the model find the uncertain points?

The Big Question

There are a few methods to measure uncertainty, both in the data and in the model. Uncertainty inherent in the data is typically called aleatoric uncertainty, and uncertainty of the machine-learning model itself is called epistemic uncertainty.

I won’t go into these in nitty-gritty detail. Rather, I will show you a measure of uncertainty that is easy to understand, calculate, and implement.

Entropy as a measure of uncertainty

Entropy Measure

A quick example will help us understand how entropy is a measure of knowledge as well as uncertainty.
Consider the image below, where an empty bucket is filled with two colours, red and blue. Because of the chemical properties of the colours and the laws of physics, the two colours reach an equilibrium after which they are so mixed up that it is difficult, if not impossible, to distinguish them: the system has high entropy. On the other hand, if the colours are not mixed and are perfectly distinguishable, the system has very low entropy.

Entropy understanding

It should be clear now that entropy is in fact a measure of the disorder of a physical system, and it is related to the uncertainty associated with its physical state. In plain English, entropy grows as disorder grows.

A similar concept of uncertainty and lack of knowledge also exists in machine learning. In fact, entropy is also a measure of the expected amount of information, and this concept plays a key role in machine learning.

We will extend the above analogy to machine learning. Instead of two colours, say we have two classes, red and blue, which we have to separate with a machine-learning model.

Plot 1 is messy and random: no decision surface can distinguish the two classes. Plot 2, on the other hand, is perfectly separated, so the model can classify the classes with much higher probability.

Based on the two plots discussed above, if a new query point is fed to the classification model, then for plot 1 we will get similar probability values for both classes, while for plot 2 the probability of one class will be much higher than that of the other.
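We can check this intuition numerically. Below is a minimal sketch (a helper of my own, not part of the article’s original code) that computes the Shannon entropy of a model’s predicted class probabilities:

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a probability vector (natural log)."""
    probs = np.asarray(probs, dtype=float)
    # Clip to avoid log(0); zero-probability classes contribute nothing.
    probs = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(probs * np.log(probs)))

# Plot 1: the model is confused, the probabilities are near-uniform.
print(entropy([0.5, 0.5]))    # ≈ 0.693, the maximum for two classes
# Plot 2: the model is confident in one class.
print(entropy([0.99, 0.01]))  # ≈ 0.056
```

The confused, near-uniform prediction gets the highest possible entropy, while the confident one gets an entropy close to zero.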

From the entropy formula, H = −Σᵢ pᵢ log pᵢ, it is clear that the entropy for plot 1 will always be higher than the entropy for plot 2.

We will use this concept to measure uncertainty. The points having high entropy will be the uncertain points for the model, and those points will be sent back to the SMEs for manual annotation.

An uncertainty measure from a single network might be flawed (the network may be over-confident in its output), but the idea is that averaging over many networks will improve it.

If most models agree about one of the categories, the ensemble network will have high confidence for that category. If they disagree, we will get large outputs for several of the categories.

But in reality, training and predicting over many neural-network models is not at all a feasible option, as huge time and costs are involved. So what is the way out? It is, in fact, Monte Carlo Dropout.

Monte Carlo Dropout

There are probabilistic definitions of what Monte Carlo Dropout is, but in simple English the idea is to keep dropout active at inference time as well.

In a traditional neural network, dropout works at training time only, so at prediction time the output is always deterministic. That means that given the same query point (feature vector), the model will always predict the same softmax probabilities (or logits).

Incorporating Monte Carlo Dropout in a neural network simulates the ensemble-network property described above without much extra time or cost.

As the prediction is now stochastic instead of deterministic, we can run the model on the same data point T times, which gives T different output probabilities, equivalent to an ensemble of T neural networks.

So our uncertainty calculation will involve both Monte Carlo Dropout and entropy. We will call it the Max Entropy acquisition function.

It’s Python time

For the code, I will be using a simple toy data-set to show you how the whole active-learning annotation tool works.

Here I will use the Scikit-Learn handwritten digits data-set as our data.

Data for first iteration

The very first step is to gather a small, informative labelled data-set for the first iteration. For the sake of simplicity, I am just using random data for the first iteration. Taking a random data-set is bad for the classification model, as it leads to a garbage-in-garbage-out problem.
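Here is a minimal sketch of that first split on the digits data-set. The seed-set size (100 points) and the random seed are my own illustrative choices:

```python
import numpy as np
from sklearn.datasets import load_digits

# Load the Scikit-Learn handwritten digits data-set: 1797 samples, 64 pixels each.
digits = load_digits()
X, y = digits.data / 16.0, digits.target  # scale pixel values to [0, 1]

# Pretend most labels are unknown: keep a small random "labelled" seed set
# and treat the rest as the unlabelled pool whose labels stay hidden.
rng = np.random.default_rng(42)
idx = rng.permutation(len(X))
seed_size = 100

X_train, y_train = X[idx[:seed_size]], y[idx[:seed_size]]
X_pool = X[idx[seed_size:]]
y_pool_hidden = y[idx[seed_size:]]  # only revealed when a point is "annotated"
```

In a real tool the pool labels would not exist at all; they are kept here only to simulate the SME’s answers later.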

Our Classification Model

Our classification model is based on TensorFlow-Keras layers.

You could always create your own model architecture; here I have created a model with three hidden layers. The architecture is straightforward. The only thing I want to discuss is the training=True parameter of the Dropout layers.

This parameter keeps dropout enabled at inference time as well, replicating Monte Carlo Dropout.
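A sketch of such a network is below. The layer widths and dropout rate are my own choices, not necessarily the article’s exact architecture; the essential part is passing training=True to each Dropout call:

```python
import tensorflow as tf

def build_mc_dropout_model(n_features=64, n_classes=10, dropout_rate=0.25):
    """Feed-forward classifier with dropout kept active at inference time."""
    inputs = tf.keras.Input(shape=(n_features,))
    x = tf.keras.layers.Dense(128, activation="relu")(inputs)
    # training=True keeps this Dropout stochastic at prediction time too.
    x = tf.keras.layers.Dropout(dropout_rate)(x, training=True)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x, training=True)
    x = tf.keras.layers.Dense(32, activation="relu")(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x, training=True)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Because dropout stays active, calling the model twice on the same input produces two different softmax vectors, which is exactly the stochastic behaviour we need.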

Uncertainty Method

As I have previously mentioned, for our case the best uncertainty method is to calculate the Max Entropy along with MC Dropout.

As we have implemented MC Dropout, it is equivalent to having an ensemble of models. In the code snippet, T represents the number of models. Because the probability predictions are stochastic, on each of the T passes the model will predict different class probabilities for the same set of data-points.

We calculate the average probability score over the T passes and then the entropy of the averaged probabilities.

Based on the entropy values, we return the highest-entropy data-points, which need manual annotation.
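A sketch of that acquisition function (the names are mine; the model is assumed to have dropout active at prediction time, as above):

```python
import numpy as np

def max_entropy_query(model, X_pool, T=20, n_query=10):
    """Return indices of the n_query pool points with the highest
    predictive entropy, averaged over T MC-Dropout forward passes."""
    # Shape (T, n_pool, n_classes): one softmax prediction per stochastic pass.
    probs = np.stack([model.predict(X_pool, verbose=0) for _ in range(T)])
    mean_probs = probs.mean(axis=0)                # average over the T "models"
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)
    return np.argsort(entropy)[-n_query:]          # most uncertain points
```

Each pass through the loop plays the role of one ensemble member, so T is effectively the ensemble size.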

Data Manipulation

After every iteration, the model returns some data-points that need to be annotated manually. After the annotation is completed, those data-points should be removed from the pooled (unlabelled) data-set and added to the training data-set. The code below does exactly that.
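A minimal sketch of that bookkeeping step (function and variable names are my own):

```python
import numpy as np

def move_queried_points(X_train, y_train, X_pool, queried_idx, new_labels):
    """Append freshly annotated points to the training set and
    remove them from the unlabelled pool."""
    X_train = np.concatenate([X_train, X_pool[queried_idx]])
    y_train = np.concatenate([y_train, new_labels])
    X_pool = np.delete(X_pool, queried_idx, axis=0)
    return X_train, y_train, X_pool
```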

Final Loop

Putting all the above functions together completes the active-learning loop.
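Here is a self-contained end-to-end sketch of the loop on the digits data-set. All the sizes (seed set, query batch, T, epochs, iterations) are illustrative choices of mine, and the hidden pool labels stand in for the SME’s manual annotations:

```python
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_digits

# Small labelled seed set + unlabelled pool (pool labels kept hidden).
digits = load_digits()
X, y = digits.data / 16.0, digits.target
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X_train, y_train = X[idx[:100]], y[idx[:100]]
X_pool, y_hidden = X[idx[100:]], y[idx[100:]]

def build_model():
    inputs = tf.keras.Input(shape=(64,))
    x = tf.keras.layers.Dense(128, activation="relu")(inputs)
    x = tf.keras.layers.Dropout(0.25)(x, training=True)  # MC Dropout
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.25)(x, training=True)
    outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

T, n_query, n_iterations = 10, 20, 2
for it in range(n_iterations):
    model = build_model()
    model.fit(X_train, y_train, epochs=15, batch_size=32, verbose=0)

    # Max Entropy acquisition over T stochastic forward passes.
    probs = np.stack([model.predict(X_pool, verbose=0) for _ in range(T)])
    mean_probs = probs.mean(axis=0)
    ent = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=1)
    queried = np.argsort(ent)[-n_query:]

    # An SME would label these points; the hidden labels stand in here.
    X_train = np.concatenate([X_train, X_pool[queried]])
    y_train = np.concatenate([y_train, y_hidden[queried]])
    X_pool = np.delete(X_pool, queried, axis=0)
    y_hidden = np.delete(y_hidden, queried)
```

Each iteration retrains from scratch on the growing labelled set; in practice you would stop once the annotation budget is exhausted or the model’s validation accuracy plateaus.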

Conclusion

In this article, I tried to give the basic idea, with Python code, of how Active Learning can help us label data-sets, drastically reducing manual annotation time and cost.

Quite a few other uncertainty methods are also available, along with Bayesian neural networks for uncertainty measurement.

For the full code check the GitHub Link.
