Neural Network Calibration

sourajit roy chowdhury
Jun 9, 2020

Suppose you have built two classifiers, A and B, and chosen accuracy as the performance metric. You find that model A has an accuracy of 88% and an average confidence of 89% in its predictions, while model B has the same accuracy of 88% but an average confidence of 95%. In such a case, which model would you deploy in production?

Typically, most people would say model B is better. Here I want to argue that model A is better. The reason is that model A is good at self-assessment: it believes it predicts correctly 89% of the time, which is almost exactly what it achieves (88% accuracy). Model B, on the other hand, is overconfident: it believes it predicts correctly 95% of the time, but it actually manages only 88% (accuracy).

This is the intuition behind model calibration.
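To make the comparison concrete, here is a minimal sketch, with made-up values, of how accuracy and average confidence are compared (probs is a hypothetical array of predicted class probabilities and y_true the corresponding integer labels):

import numpy as np

# probs: predicted class probabilities, y_true: true integer labels (toy values)
probs = np.array([[0.90, 0.05, 0.05],
                  [0.60, 0.30, 0.10],
                  [0.20, 0.70, 0.10]])
y_true = np.array([0, 1, 1])

accuracy = np.mean(np.argmax(probs, axis=1) == y_true)  # how often the model is right
confidence = np.mean(np.max(probs, axis=1))             # how sure the model claims to be
# a well-calibrated model keeps accuracy and average confidence close to each other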

How to start?

This post is not a tutorial on TensorFlow or Keras, or on how to build a neural network. So I will use the "Hello World!" of machine learning problems, i.e. iris dataset classification. The focus here is on how to calibrate the model with a few lines of code.

A link to the notebook can be found at the end of this post.

Model Definition

# Model Architecture
input_layer = keras.layers.Input(shape=(4,))
dense_1 = keras.layers.Dense(64, activation='relu')(input_layer)
logits = keras.layers.Dense(3)(dense_1)  # Here 3 represents number of classes
model = keras.Model(inputs=input_layer, outputs=logits)

It's a simple neural network for iris classification with a single hidden Dense layer of 64 neurons.

# Compile model
custom_loss = keras.losses.CategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=custom_loss, metrics=['accuracy'])

One thing to notice here: we do not apply a softmax inside the model. Instead, the model predicts raw logits (hence from_logits=True in the loss), because the calibration will be done on the predicted logits.
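Since the model outputs logits, probabilities at prediction time require an explicit softmax. A minimal sketch of how the logits y_pred used in the later snippets would be obtained (X here is a hypothetical name for the feature matrix):

# raw logits predicted by the model (X is the feature matrix)
y_pred = model.predict(X)
# explicit softmax to turn logits into class probabilities when needed
probs = tf.nn.softmax(y_pred)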

Model Calibration

Once the model is trained, the calibration work starts from there.

There are many ways to do calibration, such as Temperature Scaling, Platt Scaling, and Isotonic Regression. The latter two are readily available in scikit-learn. Here I will focus on Temperature Scaling and how to implement it using TensorFlow.
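As a side note, Platt Scaling and Isotonic Regression can be applied through scikit-learn's CalibratedClassifierCV. A minimal sketch on the iris data (independent of the Keras model below, with illustrative variable names):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

X_iris, y_iris = load_iris(return_X_y=True)
base_clf = LinearSVC()  # an uncalibrated base classifier
# method='sigmoid' is Platt Scaling; method='isotonic' is Isotonic Regression
calibrated_clf = CalibratedClassifierCV(base_clf, method='sigmoid', cv=5)
calibrated_clf.fit(X_iris, y_iris)
calibrated_probs = calibrated_clf.predict_proba(X_iris)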

Temperature Scaling

Modern neural networks perform very well on metrics such as accuracy; however, they are often poorly calibrated. Most of the time they are overconfident.

Temperature Scaling is a method in which the predicted logits are divided by a learned temperature value, making the model less confident.

Let's jump into the code.

# Temperature Scaling
temp = tf.Variable(initial_value=1.0, trainable=True, dtype=tf.float32)

def compute_loss():
    y_pred_model_w_temp = tf.math.divide(y_pred, temp)
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            tf.convert_to_tensor(keras.utils.to_categorical(Y)),
            y_pred_model_w_temp))
    return loss

optimizer = tf.optimizers.Adam(learning_rate=0.01)

print('Temperature Initial value: {}'.format(temp.numpy()))
for i in range(300):
    opts = optimizer.minimize(compute_loss, var_list=[temp])
print('Temperature Final value: {}'.format(temp.numpy()))

Let’s understand the code snippet step by step.

  1. First, we define a temperature variable (temp) and initialize it with a value of 1.0.
  2. The goal is to learn a new value of temp by which we can scale the logits predicted by the model.
  3. We create a custom loss function in which the new logits (y_pred_model_w_temp) are the predicted logits (y_pred) divided by temp.
  4. Once the loss function is ready, we define an optimizer, which updates the temp value at each iteration of the for loop.
  5. When the loop finishes, we have the learned value of temp, and with it we can easily compute the new logits (y_pred_model_w_temp), as shown in the sketch after this list.
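For completeness, a minimal sketch of step 5, reusing the variables from the snippet above (the calibrated probabilities are simply the softmax of the scaled logits):

# scale the raw logits with the learned temperature
y_pred_model_w_temp = tf.math.divide(y_pred, temp)
# calibrated class probabilities
calibrated_probs = tf.nn.softmax(y_pred_model_w_temp)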

Note: one thing to remember is that Temperature Scaling, or any other calibration method, should be fitted on a validation/test data set. For the sake of simplicity, I have done it here using the training set only.

Expected Calibration Error (ECE)

Even though we have calibrated the model, how do we measure the effect? This is where ECE comes in: it is the weighted average, over confidence bins, of the difference between accuracy and prediction confidence within each bin (the lower, the better).
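To make that definition concrete, a rough manual sketch of ECE could look like this (probs and y_true are hypothetical arrays of predicted class probabilities and true integer labels):

import numpy as np

def ece_score(probs, y_true, num_bins=10):
    confidences = np.max(probs, axis=1)        # confidence of the predicted class
    predictions = np.argmax(probs, axis=1)     # predicted class
    accuracies = (predictions == y_true).astype(float)
    bin_edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # weight each bin by the fraction of samples falling into it
            ece += in_bin.mean() * abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
    return ece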

The TensorFlow Probability package has a ready-made implementation for calculating ECE.

# ECE result after calibration
y_pred_model_w_temp = tf.math.divide(y_pred, temp)
num_bins = 50
labels_true = tf.convert_to_tensor(Y, dtype=tf.int32, name='labels_true')
logits = tf.convert_to_tensor(y_pred_model_w_temp, dtype=tf.float32, name='logits')
tfp.stats.expected_calibration_error(num_bins=num_bins,
                                     logits=logits,
                                     labels_true=labels_true)

In the end, the ECE score computed on the temperature-scaled logits shows an improvement, i.e. a lower value than for the uncalibrated model.
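For comparison, the same TensorFlow Probability call on the raw, unscaled logits gives the pre-calibration ECE (a sketch reusing the variables defined above):

# ECE before calibration, on the raw (unscaled) logits
raw_logits = tf.convert_to_tensor(y_pred, dtype=tf.float32)
tfp.stats.expected_calibration_error(num_bins=num_bins,
                                     logits=raw_logits,
                                     labels_true=labels_true)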

Check the code here
