Deep Learning Lab Session

First Lab Session - 1.5 Hours

Artificial Neural Networks for Handwritten Digits Recognition

Student 1: # DO Thi Duyen
Student 2: # LE Ta Dang Khoa

The aim of this session is to practice with Artificial Neural Networks. Answers and experiments should be made by groups of two students. Each group should fill and run appropriate notebook cells.

Follow instructions step by step until the end and submit your complete notebook as an archive (tar -cf groupXnotebook.tar DL_lab1/). Do not forget to run all your cells before generating your final report and do not forget to include the names of all participants in the group. The lab session should be completed by March 20th 2019.


During this lab session, you will implement, train and test a Neural Network for the Handwritten Digits Recognition problem [1] with different settings of hyperparameters. You will use the MNIST dataset which was constructed from scanned documents available from the National Institute of Standards and Technology (NIST). Images of digits were taken from a variety of scanned documents, normalized in size and centered.

Figure 1: MNIST digits examples

This assignment includes a written part and programs to help you understand how to build and train your neural network, then test your code and get results.


Functions defined inside the Python files mentioned above can be imported using the Python command "from filename import function".

You will use the following libraries:

  1. numpy : for creating arrays and using methods to manipulate arrays;

  2. matplotlib : for making plots.

Before starting the lab, please launch the cell below. After that, you may not need to do any imports during the lab.

In [1]:
# All imports
from NeuralNetwork import NeuralNetwork
from transfer_functions import *
from utils import *
import numpy as np
import matplotlib

Section 1 : Your First Neural Network

Part 1: Before designing and writing your code, you will first work on a neural network by hand. Consider the following neural network with two inputs $x=(x_1,x_2)$, one hidden layer and a single output unit $y$. The initial weights are set to random values. Neurons 6 and 7 represent biases. Bias values are equal to 1. You will consider a training sample whose feature vector is $x = (0.8, 0.2)$ and whose label is $y = 0.4$.

Assume that neurons have a sigmoid activation function $f(x)=\frac{1}{(1+e^{-x})}$. The loss function $L$ is a Mean Squared Error (MSE): if $o$ denotes the output of the neural network, then the loss for a given sample $(o, y)$ is $L(o, y) = \left|\left| o - y \right|\right|^2$. In the following, you will assume that if you want to backpropagate the error on a whole batch, you will backpropagate the average error on that batch. More formally, let $((x^{(1)}, y^{(1)}), ..., (x^{(N)}, y^{(N)}))$ be a batch and $o^{(k)}$ the output associated to $x^{(k)}$. Then the total error $\bar{L}$ will be as follows:

$\bar{L} = \frac{1}{N} \sum_{k=1}^{N} L(o^{(k)}, y^{(k)})$.
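The definitions above can be sketched in NumPy as follows. This is a minimal illustration, not the code of the lab's transfer_functions module: the names sigmoid, mse_loss and batch_loss, and the sample values, are chosen here for the example only.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def mse_loss(o, y):
    """Per-sample loss L(o, y) = ||o - y||^2, summed over the output units."""
    o, y = np.atleast_2d(o), np.atleast_2d(y)
    return np.sum((o - y) ** 2, axis=1)

def batch_loss(outputs, targets):
    """Average of the per-sample losses over the batch."""
    return float(np.mean(mse_loss(outputs, targets)))

# Hypothetical batch of two samples with one output unit each
outputs = np.array([[0.5], [0.9]])
targets = np.array([[0.4], [1.0]])
print(sigmoid(0.0))
print(batch_loss(outputs, targets))
```

Each row of outputs and targets is one sample, so the average runs over rows, which matches the $\frac{1}{N}$ factor in $\bar{L}$.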

Figure 2: Neural network

Question 1.1.1: Compute the new values of weights $w_{i,j}$ after a forward pass and a backward pass, and the outputs of the neural network before and after the backward pass, when the learning rate is $\lambda$=5. $w_{i,j}$ is the weight of the connection between neuron $i$ and neuron $j$. Please detail your computations in the cell below and print your answers.

In [2]:
lr = 5.0
x1, x2 = 0.8, 0.2
w1_01, w1_11, w1_21, w1_02, w1_12, w1_22 = 0.2, 0.3, 0.8, -0.4, -0.5, 0.2
w2_01, w2_11, w2_21 = 0.5, -0.6, 0.4
y = 0.4

o1_1 = sigmoid(x1*w1_11 + x2*w1_21 + 1*w1_01) # Output of the green neuron
o1_2 = sigmoid(x1*w1_12 + x2*w1_22 + 1*w1_02) # Output of the red neuron
o2_1 = sigmoid(o1_1*w2_11 + o1_2*w2_21 + 1*w2_01) # Output of the black neuron

print("=== FORWARD PASS 1 ===")
print("o =", o2_1)

# Partial derivatives of the loss wrt weights of the second layer
dL_w2_01 = 2 * (o2_1-y) * (o2_1*(1-o2_1)) * 1
dL_w2_11 = 2 * (o2_1-y) * (o2_1*(1-o2_1)) * o1_1
dL_w2_21 = 2 * (o2_1-y) * (o2_1*(1-o2_1)) * o1_2

# Partial derivatives of the loss wrt weights of the first layer
dL_w1_01 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_11 * (o1_1*(1-o1_1)) * 1
dL_w1_11 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_11 * (o1_1*(1-o1_1)) * x1
dL_w1_21 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_11 * (o1_1*(1-o1_1)) * x2
dL_w1_02 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_21 * (o1_2*(1-o1_2)) * 1
dL_w1_12 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_21 * (o1_2*(1-o1_2)) * x1
dL_w1_22 = 2 * (o2_1-y)*(o2_1*(1-o2_1))*w2_21 * (o1_2*(1-o1_2)) * x2

# Weights updates
w1_01 -= lr*dL_w1_01
w1_11 -= lr*dL_w1_11
w1_21 -= lr*dL_w1_21
w1_02 -= lr*dL_w1_02
w1_12 -= lr*dL_w1_12
w1_22 -= lr*dL_w1_22
w2_01 -= lr*dL_w2_01
w2_11 -= lr*dL_w2_11
w2_21 -= lr*dL_w2_21

print("\n=== BACKWARD PASS ===")
print("w1_01 =", w1_01)
print("w1_11 =", w1_11)
print("w1_21 =", w1_21)
print("w1_02 =", w1_02)
print("w1_12 =", w1_12)
print("w1_22 =", w1_22)
print("w2_01 =", w2_01)
print("w2_11 =", w2_11)
print("w2_21 =", w2_21)

o1_1 = sigmoid(x1*w1_11 + x2*w1_21 + 1*w1_01)
o1_2 = sigmoid(x1*w1_12 + x2*w1_22 + 1*w1_02)
o2_1 = sigmoid(o1_1*w2_11 + o1_2*w2_21 + 1*w2_01)

print("\n=== FORWARD PASS 2 ===")
print("o =", o2_1)
=== FORWARD PASS 1 ===
o = 0.5597295991095776

=== BACKWARD PASS ===
w1_01 = 0.2540331790269339
w1_11 = 0.3432265432215471
w1_21 = 0.8108066358053868
w1_02 = -0.4341841377344243
w1_12 = -0.5273473101875394
w1_22 = 0.19316317245311515
w2_01 = 0.10637455535192786
w2_11 = -0.8541467506279605
w2_21 = 0.2745727217772572

=== FORWARD PASS 2 ===
o = 0.40648823589210104
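One way to double-check the hand-derived gradients is a finite-difference check. The sketch below assumes the same network, sample and initial weights as in Question 1.1.1; the helper names loss and numerical_gradient are introduced here for illustration and are not part of the lab files.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X1, X2, Y = 0.8, 0.2, 0.4  # training sample of Question 1.1.1

def loss(w):
    """Squared error of the 2-2-1 network for weights in the order
    (w1_01, w1_11, w1_21, w1_02, w1_12, w1_22, w2_01, w2_11, w2_21)."""
    w1_01, w1_11, w1_21, w1_02, w1_12, w1_22, w2_01, w2_11, w2_21 = w
    o1_1 = sigmoid(X1 * w1_11 + X2 * w1_21 + w1_01)
    o1_2 = sigmoid(X1 * w1_12 + X2 * w1_22 + w1_02)
    o = sigmoid(o1_1 * w2_11 + o1_2 * w2_21 + w2_01)
    return (o - Y) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

w0 = np.array([0.2, 0.3, 0.8, -0.4, -0.5, 0.2, 0.5, -0.6, 0.4])
lr = 5.0
w_new = w0 - lr * numerical_gradient(loss, w0)
print(w_new)  # should agree with the weights printed after the backward pass
```

Because the numerical gradient never uses the chain rule, agreement with the analytic update is good evidence that the partial derivatives were derived correctly.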

Part 2: Neural Network Implementation

In Part 1, you computed weight updates for one sample. This is what the stochastic gradient descent algorithm does. However, in the rest of the lab, you will implement the batch version of gradient descent.

Please read all source files carefully and understand the data structures and all functions. You are to complete the missing code. First you should define the neural network (using the NeuralNetwork class, see in the file) and reinitialise weights. Then you will need to complete the feedforward() and the backpropagate() functions.

Question 1.2.1: Implement the feedforward() function.

In [3]:
class NeuralNetwork(NeuralNetwork):
    def feedforward(self, inputs):
        transfer_f = self.transfer_f
        inputs = [x + [1.] for x in inputs]
        self.input = np.array(inputs) # Shape = [batch_size, number_of_input_values+1]
        # Compute activations for the hidden layer
        u_1 = np.dot(self.input, self.W_input_to_hidden) # Shape of u_1 should be [batch_size, number_of_neurons_in_hidden_layer]
        self.u_hidden = u_1
        self.o_hidden = np.ones((u_1.shape[0], u_1.shape[1]+1)) # Shape = [batch_size, number_of_hidden_values+1]
        # Compute output of hidden layer
        self.o_hidden[:, :-1] = transfer_f(self.u_hidden)
        # Compute activations for the output layer
        u_2 = np.dot(self.o_hidden, self.W_hidden_to_output) # Shape = [batch_size, number_of_output_values]
        self.u_output = u_2
        # Compute output of output layer
        self.o_output = transfer_f(self.u_output)

Question 1.2.2: Test your implementation: create the Neural Network defined in Part 1 and see if the feedforward() function you implemented gives the same results as the ones you found by hand.

In [4]:
# First define your neural network
model = NeuralNetwork(2, 2, 1)

# Then initialize the weights according to Figure 2
W_input_to_hidden = np.array([[0.3, -0.5], [0.8, 0.2], [0.2, -0.4]])
W_hidden_to_output = np.array([[-0.6], [0.4], [0.5]])
model.weights_init(W_input_to_hidden, W_hidden_to_output)

# Feed test values
test = [[0.8, 0.2]]
model.feedforward(test)

# Print the output
print("Output =", model.o_output[0,0])
Output = 0.5597295991095776

The feedforward() function implemented in Question 1.2.1 gives the same result as the hand computation of Question 1.1.1:

  • The output at FORWARD PASS 1 is 0.5597 in both Question 1.1.1 and Question 1.2.2.
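The forward pass checked above can also be written as a standalone function, which makes the bias handling explicit. This is a minimal sketch independent of the NeuralNetwork class: the function name forward is hypothetical, and it assumes, as in the weight matrices of Question 1.2.2, that the last row of each weight matrix holds the bias weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(inputs, W_ih, W_ho):
    """Forward pass of a one-hidden-layer sigmoid network.
    The last row of each weight matrix holds the bias weights."""
    x = np.asarray(inputs, dtype=float)
    x = np.hstack([x, np.ones((x.shape[0], 1))])            # append bias input
    hidden = sigmoid(x @ W_ih)                              # [batch, n_hidden]
    hidden = np.hstack([hidden, np.ones((hidden.shape[0], 1))])  # append bias unit
    return sigmoid(hidden @ W_ho)                           # [batch, n_out]

# Weights of Figure 2, laid out as in Question 1.2.2
W_ih = np.array([[0.3, -0.5], [0.8, 0.2], [0.2, -0.4]])
W_ho = np.array([[-0.6], [0.4], [0.5]])
out = forward([[0.8, 0.2]], W_ih, W_ho)
print(out[0, 0])
```

Appending a column of ones turns each bias into an ordinary weight, so one matrix product per layer handles a whole batch.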

Question 1.2.3: Implement the backpropagate() function.

In [5]:
class NeuralNetwork(NeuralNetwork):
    def backpropagate(self, targets, learning_rate=5.0):
        transfer_df = self.transfer_df
        l = learning_rate
        targets = np.array(targets) # Target outputs
        # Compute partial derivative of loss with respect to activations of output layer
        self.dL_du_output = 2 * np.multiply((self.o_output-targets), transfer_df(self.u_output))
        # Compute partial derivative of loss with respect to activations of hidden layer
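        self.dL_du_hidden = np.multiply(np.dot(self.dL_du_output, self.W_hidden_to_output.T),
                                        np.c_[self.transfer_df(self.u_hidden), np.zeros(self.u_hidden.shape[0])])
        # Compute partial derivative of loss with respect to weights
        # (the last column of dL_du_hidden belongs to the bias unit and is dropped)
        dW_input_to_hidden = np.dot(self.input.T, self.dL_du_hidden[:,:-1])
        dW_hidden_to_output = np.dot(self.o_hidden.T, self.dL_du_output)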
        # Make updates
        self.W_hidden_to_output -= l*dW_hidden_to_output/len(targets)
        self.W_input_to_hidden -= l*dW_input_to_hidden/len(targets)

Question 1.2.4: Test your implementation: create the Neural Network defined in Part 1 and see if the backpropagate() function you implemented gives the same weight updates as the ones you found by hand. Do another forward pass and see if the new output is the same as the one you obtained in Question 1.1.1.

In [6]:
# First define your neural network
model = NeuralNetwork(2, 2, 1)

# Then initialize the weights according to Figure 2
w1_01, w1_11, w1_21, w1_02, w1_12, w1_22 = 0.2, 0.3, 0.8, -0.4, -0.5, 0.2
w2_01, w2_11, w2_21 = 0.5, -0.6, 0.4
W_input_to_hidden = np.array([[w1_11, w1_12], [w1_21, w1_22], [w1_01, w1_02]])
W_hidden_to_output = np.array([[w2_11], [w2_21], [w2_01]])
model.weights_init(W_input_to_hidden, W_hidden_to_output)

# Feed test values
test = [[0.8, 0.2]]
model.feedforward(test)

# Backpropagate
targets = [[0.4]]
model.backpropagate(targets)  # default learning_rate=5.0, as in Question 1.1.1

# Print weights
print("\nW_input_to_hidden =", model.W_input_to_hidden)
print("\nW_hidden_to_output =", model.W_hidden_to_output)

# Feed test values again
model.feedforward(test)

# Print the output
print("\nOutput =", model.o_output)
W_input_to_hidden = [[ 0.34322654 -0.52734731]
 [ 0.81080664  0.19316317]
 [ 0.25403318 -0.43418414]]

W_hidden_to_output = [[-0.85414675]
 [ 0.27457272]
 [ 0.10637456]]

Output = [[0.40648824]]

Checked your implementations and found that everything was fine? Congratulations! You can move to the next section.


The model is correct: it gives the same results as the hand computation of Question 1.1.1.

  • The updated weights match, and after FORWARD PASS 1 and the BACKWARD PASS, the output at FORWARD PASS 2 is 0.4065 in both Question 1.1.1 and Question 1.2.4.
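For reference, one full batch update (forward pass, backward pass, weight update) can be sketched without the NeuralNetwork class. The helper names add_bias, forward and train_step are hypothetical; the sketch assumes sigmoid units, the MSE loss of Part 1, and bias weights stored in the last row of each matrix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def add_bias(a):
    """Append a column of ones (the bias unit) to a [batch, n] array."""
    return np.hstack([a, np.ones((a.shape[0], 1))])

def forward(x, W_ih, W_ho):
    x_b = add_bias(x)
    h_b = add_bias(sigmoid(x_b @ W_ih))
    return x_b, h_b, sigmoid(h_b @ W_ho)

def train_step(x, t, W_ih, W_ho, lr):
    """One gradient step on the batch-averaged squared error."""
    x_b, h_b, o = forward(x, W_ih, W_ho)
    n = x.shape[0]
    delta_out = 2 * (o - t) * o * (1 - o)               # dL/du_output
    delta_hid = (delta_out @ W_ho.T) * h_b * (1 - h_b)  # bias column becomes 0
    grad_ho = h_b.T @ delta_out / n
    grad_ih = x_b.T @ delta_hid[:, :-1] / n             # drop the bias column
    return W_ih - lr * grad_ih, W_ho - lr * grad_ho

x = np.array([[0.8, 0.2]])
t = np.array([[0.4]])
W_ih = np.array([[0.3, -0.5], [0.8, 0.2], [0.2, -0.4]])
W_ho = np.array([[-0.6], [0.4], [0.5]])
W_ih_new, W_ho_new = train_step(x, t, W_ih, W_ho, lr=5.0)
o_new = forward(x, W_ih_new, W_ho_new)[2]
print(W_ih_new)
print(W_ho_new)
print(o_new)
```

Note that the hidden-layer deltas are computed with the weights from before the update; updating W_ho first would give different first-layer gradients.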

Section 2: Handwritten Digits Recognition

The MNIST dataset consists of handwritten digit images. It is split into a training set containing 60,000 samples and a test set containing 10,000 samples. In this Lab Session, the official training set of 60,000 images is divided into an actual training set of 50,000 samples and a validation set of 10,000 samples. All digit images have been size-normalized and centered in a fixed-size image of 28 x 28 pixels. Images are stored in byte form: you will use the NumPy Python library to convert data files into NumPy arrays that you will use to train your Neural Networks.

You will first work with a small subset of MNIST (1000 samples), then on a very small subset of MNIST (10 samples), and eventually run a model on the whole one.

The MNIST dataset is available in the Data folder. To get the training, testing and validation data, run the load_data() function.

In [7]:
# Just run that cell ;-)
training_data, validation_data, test_data = load_data()
small_training_data = (training_data[0][:1000], training_data[1][:1000])
small_validation_data = (validation_data[0][:200], validation_data[1][:200])
indices = [1, 3, 5, 7, 2, 0, 13, 15, 17, 4]
vsmall_training_data = ([training_data[0][i] for i in indices], [training_data[1][i] for i in indices])
Loading MNIST data .....
In [8]:
# And you can run that cell if you want to see what the MNIST dataset looks like
import matplotlib.pyplot as plt  # needed if plt is not already provided by utils

ROW = 2
COLUMN = 5  # assumed value: the original definition is missing from this cell
for i in range(ROW * COLUMN):
    # training_data[0][i] is the i-th image, reshaped to 28x28
    image = np.array(training_data[0][i]).reshape(28, 28)
    plt.subplot(ROW, COLUMN, i+1)
    plt.imshow(image, cmap='gray')  # cmap='gray' is for black and white picture.
    plt.axis('off')  # do not show axis value
plt.tight_layout()   # automatic padding between subplots

Part 1: Build a bigger Neural Network

The input layer of the neural network that you will build contains neurons encoding the values of the input pixels. The training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits. Thus, the input layer contains 784=28×28 units. The second layer of the network is a hidden layer. We set the number of neurons in the hidden layer to 30. The output layer contains 10 neurons.
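To get a feel for the size of this network, the number of trainable weights can be counted. The sketch below assumes one bias unit per layer stored as an extra row of each weight matrix, as in Part 1; count_parameters is a hypothetical helper, and 15 and 75 are the other hidden-layer sizes tried later in this section.

```python
def count_parameters(n_in, n_hidden, n_out):
    """Number of weights of a one-hidden-layer network,
    counting one bias weight per hidden and per output neuron."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

for n_hidden in (15, 30, 75):
    print(n_hidden, count_parameters(784, n_hidden, 10))
```

Almost all weights sit between the input and the hidden layer, so the parameter count grows roughly linearly with the number of hidden neurons.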

Question 2.1.1: Create the network described above using the NeuralNetwork class.

In [9]:
# Define your neural network
mnist_model = NeuralNetwork(784, 30, 10)

Question 2.1.2: Train your Neural Network on the small subset of MNIST (300 iterations) and print the new accuracy on test data. You will use small_validation_data for validation. Try different learning rates (0.1, 1.0, 10.0). You should use the train() function of the NeuralNetwork class to train your network, and the weights_init() function to reinitialize weights between tests. Print the accuracy of each model on test data using the predict() function.

In [10]:
# Train NN and print accuracy on test data

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(small_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(small_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(small_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))
Learning rate 0.1
Training time: 24.29762601852417
Accuracy on test data: 15.70%

Learning rate 1.
Training time: 24.21302103996277
Accuracy on test data: 83.78%

Learning rate 10.
Training time: 23.74274492263794
Accuracy on test data: 10.28%

Question 2.1.3: Do the same with 15 and 75 hidden neurons.

In [11]:
# Train NN and print accuracy on test data

# 15 hidden neurons 
print("15 HIDDEN NEURONS\n")
# Define the neural network
mnist_model = NeuralNetwork(784, 15, 10)

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(small_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(small_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(small_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# 75 hidden neurons
print("75 HIDDEN NEURONS\n")
# Define the neural network
mnist_model = NeuralNetwork(784, 75, 10)

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(small_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(small_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(small_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

Learning rate 0.1
Training time: 23.43873357772827
Accuracy on test data: 12.54%

Learning rate 1.
Training time: 24.166033506393433
Accuracy on test data: 75.82%

Learning rate 10.
Training time: 23.86826515197754
Accuracy on test data: 10.28%


Learning rate 0.1
Training time: 25.864466905593872
Accuracy on test data: 25.56%

Learning rate 1.
Training time: 24.382953882217407
Accuracy on test data: 84.80%

Learning rate 10.
Training time: 24.098219633102417
Accuracy on test data: 65.77%

Question 2.1.4: Repeat Questions 2.1.2 and 2.1.3 on the very small dataset. You will use small_validation_data for validation.

In [12]:
# Train NN and print accuracy on test data

# 30 hidden neurons
print("30 HIDDEN NEURONS\n")
# Define the neural network
mnist_model = NeuralNetwork(784, 30, 10)

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# 15 hidden neurons
print("15 HIDDEN NEURONS\n")
# Define the neural network
mnist_model = NeuralNetwork(784, 15, 10)

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# 75 hidden neurons
print("75 HIDDEN NEURONS\n")
# Define the neural network
mnist_model = NeuralNetwork(784, 75, 10)

# Learning rate 0.1
print("Learning rate 0.1")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 1.
print("Learning rate 1.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

# Learning rate 10.
print("Learning rate 10.")
mnist_model.train(vsmall_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))

Learning rate 0.1
Training time: 1.7113311290740967
Accuracy on test data: 22.14%

Learning rate 1.
Training time: 1.6647279262542725
Accuracy on test data: 52.48%

Learning rate 10.
Training time: 1.658393383026123
Accuracy on test data: 13.16%


Learning rate 0.1
Training time: 1.5892040729522705
Accuracy on test data: 23.46%

Learning rate 1.
Training time: 1.5786983966827393
Accuracy on test data: 50.82%

Learning rate 10.
Training time: 1.5880048274993896
Accuracy on test data: 11.33%


Learning rate 0.1
Training time: 1.9355618953704834
Accuracy on test data: 36.79%

Learning rate 1.
Training time: 1.9575245380401611
Accuracy on test data: 51.25%

Learning rate 10.
Training time: 1.9383466243743896
Accuracy on test data: 25.09%

Question 2.1.5: Explain the results you obtained at Questions 2.1.2, 2.1.3 and 2.1.4.


Questions 2.1.2 and 2.1.3 (small training set, 1000 samples), accuracy on test data:

Hidden neurons   learning rate 0.1   learning rate 1   learning rate 10
15               12.54%              75.82%            10.28%
30               15.70%              83.78%            10.28%
75               25.56%              84.80%            65.77%

Question 2.1.4 (very small training set, 10 samples), accuracy on test data:

Hidden neurons   learning rate 0.1   learning rate 1   learning rate 10
15               23.46%              50.82%            11.33%
30               22.14%              52.48%            13.16%
75               36.79%              51.25%            25.09%


  1. For a given number of hidden neurons, a learning rate of 1.0 outperforms learning rates of 0.1 and 10.0.

  2. For a given learning rate, 75 hidden neurons outperform 15 and 30 hidden neurons, except on the very small training set at learning rate 1.0, where 30 hidden neurons are marginally ahead (52.48% against 51.25%). At learning rate 1.0, the three network sizes differ less than at the other learning rates.

  3. At learning rate 1.0, training on 10 samples gives lower accuracy than training on 1000 samples.

    At learning rates 0.1 and 10, training on 10 samples gives higher accuracy than training on 1000 samples, with one exception: 75 hidden neurons at learning rate 10 (25.09% against 65.77%).


  1. For the first observation, the training graphs show that:

    • a learning rate of 1.0 steadily improves both training accuracy and validation accuracy, so the underlying pattern is captured progressively, which leads to high performance;
    • a learning rate of 10 makes the accuracy fluctuate so much that training either does not converge or converges very slowly, which leads to poor results;
    • a learning rate of 0.1 moves so slowly that training appears to be stuck after 300 iterations, which also leads to poor results.
  2. For the second observation, we first note that 0.1 and 10 are not good learning rates. The training graphs show that:

    • with 75 hidden neurons, the larger capacity of the network appears to compensate for updates that are too small or too large: the model does improve during training, which leads to better results;
    • with the good learning rate of 1.0, all three network sizes improve steadily during training, which narrows the differences. With the 10-sample training set, the set is so small that all three sizes overfit and reach similarly poor accuracy.
  3. For the final observation, we first note that 1.0 is a fairly good learning rate. The training graphs show that:

    • this learning rate allows the patterns underlying the 1000-sample training set to be captured, and since 1000 samples are more representative than 10, the small training set gives better results than the very small one;
    • with the slow learning rate of 0.1, the patterns in the 10-sample training set are captured more quickly than those in the 1000-sample set, which gives better results for the very small training set, especially if its samples are chosen carefully (so that they carry "signal" rather than "noise");
    • with the fast learning rate of 10, training accuracy fluctuates so much that the comparison is not very meaningful. The exception is the network with 75 hidden neurons, whose larger capacity appears to compensate for the large learning rate: the patterns in the 1000-sample training set are captured, which gives better results than with 10 samples.
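The three learning-rate regimes described above (too slow, well chosen, too large) can be reproduced on a toy problem. The sketch below minimizes $f(w) = w^2$ by gradient descent; it is only an illustration, since the thresholds between the regimes depend on the problem and the values 0.01, 0.4 and 1.1 are picked for this toy function, not for MNIST.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return |w| after `steps` updates."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return abs(w)

for lr in (0.01, 0.4, 1.1):
    # 0.01: barely moves, 0.4: converges quickly, 1.1: oscillates and diverges
    print(lr, gradient_descent(lr))
```

Each update multiplies w by (1 - 2 * lr), so the distance to the minimum shrinks slowly when lr is tiny and grows when |1 - 2 * lr| exceeds 1.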

Question 2.1.6: Among all the numbers of hidden neurons and learning rates you tried in previous questions, which ones would you expect to achieve best performances on the whole dataset? Justify your answer.


  1. As explained above, 75 hidden neurons cope best with every learning rate we tried, and a learning rate of 1.0 gives the smoothest training for every network size, so we choose 75 hidden neurons and a learning rate of 1.0.

  2. The test-accuracy table in Question 2.1.5 also shows that this combination performs best on the small training set (84.80%).

Question 2.1.7: Train a model with the number of hidden neurons and the learning rate you chose in Question 2.1.6 and print its accuracy on the test set. You will use validation_data for validation. Training can be long on the whole dataset (~40 minutes): we suggest that you work on the optional part while waiting for the training to finish.

In [18]:
print("Training whole dataset using the model with hidden-layer=75, learning-rate=1 and iterations=300:")

# Define the neural network
mnist_model = NeuralNetwork(784, 75, 10)

mnist_model.train(training_data, validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model.predict(test_data)/len(test_data[0])))
Training whole dataset using the model with hidden-layer=75, learning-rate=1 and iterations=300:
Training time: 1662.6136071681976
Accuracy on test data: 88.32%

Part 2 (optional): Another loss function

In classification problems, we usually replace the sigmoids in the output layer by a "softmax" function and the MSE loss by a "cross-entropy" loss. More formally, let $u = (u_1, ..., u_n)$ be the vector representing the activation of the output layer of a Neural Network. The output of that neural network is $o = (o_1, ..., o_n) = \textrm{softmax}(u)$, and

$\textrm{softmax}(u) = (\frac{e^{u_1}}{\sum_{k=1}^n e^{u_k}}, ..., \frac{e^{u_n}}{\sum_{k=1}^n e^{u_k}})$.

If $t = (t_1, ..., t_n)$ is a vector of non-negative targets such that $\sum_{k=1}^n t_k = 1$ (which is the case in classification problems, where one target is equal to 1 and all others are equal to 0), then the cross-entropy loss is defined as follows:

$L_{xe}(o, t) = - \sum_{k=1}^n t_k\log(o_k)$.
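The two formulas above can be sketched in NumPy as follows. This is an illustrative version, not the softmax of the lab files: subtracting the row maximum before exponentiating and adding a small eps inside the logarithm are numerical-stability choices made here, and they do not change the mathematical definition in any meaningful way.

```python
import numpy as np

def softmax(u):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    u = np.asarray(u, dtype=float)
    e = np.exp(u - np.max(u, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(o, t, eps=1e-12):
    """L_xe(o, t) = -sum_k t_k * log(o_k), computed for each row."""
    return -np.sum(t * np.log(o + eps), axis=-1)

u = np.array([[2.0, 1.0, 0.1]])   # hypothetical output activations
t = np.array([[1.0, 0.0, 0.0]])   # one-hot target: class 0
o = softmax(u)
print(o, cross_entropy(o, t))
```

With a one-hot target, the loss reduces to minus the logarithm of the probability assigned to the correct class.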

Question 2.2.1: Let $L_{xe}$ be the cross-entropy loss function and $u_i$, $i \in \lbrace 1, ..., n \rbrace$, be the activations of the output neurons. Let us assume that the transfer function of the output neurons is the softmax function. Targets are $t_1, ..., t_n$. Derive a formula for $\frac{\partial L_{xe}}{\partial u_i}$ (details of your calculations are not required).

Answer: $\frac{\partial L_{xe}}{\partial u_i} = o_i - t_i$
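Although details are not required, the result can be recovered in a few lines. The sketch below uses the derivative of the softmax output and the assumption $\sum_{k=1}^n t_k = 1$ stated above; $\delta_{ki}$ is the Kronecker delta, introduced here for the derivation.

```latex
\frac{\partial o_k}{\partial u_i} = o_k(\delta_{ki} - o_i),
\qquad \delta_{ki} = 1 \text{ if } k = i, \; 0 \text{ otherwise}

\frac{\partial L_{xe}}{\partial u_i}
  = -\sum_{k=1}^n \frac{t_k}{o_k}\,\frac{\partial o_k}{\partial u_i}
  = -\sum_{k=1}^n t_k(\delta_{ki} - o_i)
  = -t_i + o_i\sum_{k=1}^n t_k
  = o_i - t_i
```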

Question 2.2.2: Implement a new feedforward() function and a new backpropagate() function adapted to the cross-entropy loss instead of the MSE loss.

In [14]:
class NeuralNetwork(NeuralNetwork):
    def feedforward_xe(self, inputs):
        self.o_input = np.array(inputs)
        if len(inputs[0]) < self.input_layer_size:
            self.o_input = np.append(self.o_input, np.ones((len(inputs), 1)), axis=1)
        # Compute the 1st hidden-layer
        self.u_hidden = np.dot(self.o_input, self.W_input_to_hidden)
        self.o_hidden = self.transfer_f(self.u_hidden)
        if len(self.o_hidden[0]) < self.hidden_layer_size:
            self.o_hidden = np.append(self.o_hidden, np.ones((len(self.o_hidden), 1)), axis=1)
        # Compute output
        self.u_output = np.dot(self.o_hidden, self.W_hidden_to_output)
        self.o_output = softmax(self.u_output)

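    def backpropagate_xe(self, targets, learning_rate=5.0):
        # NB: the variable names below are swapped with respect to the layers they describe.
        # Derivative of the loss w.r.t. the activations of the output layer (softmax + cross-entropy: o - t)
        dE_du_hidden = self.o_output - targets

        # Derivative of the loss w.r.t. the activations of the hidden layer (sigmoid derivative: o * (1 - o))
        dE_du_output = np.multiply(np.dot(dE_du_hidden, self.W_hidden_to_output.T),
                                    self.o_hidden * (1 - self.o_hidden) )
        dE_du_output = np.delete(dE_du_output, -1, axis=1)  # drop the bias column

        # Compute error-derivatives w.r.t. the weights
        dE_dw_hidden = (1/len(targets)) * np.dot(dE_du_hidden.T, self.o_hidden).T
        dE_dw_output = (1/len(targets)) * np.dot(dE_du_output.T, self.o_input).T
        # Update the weights
        self.W_hidden_to_output -= learning_rate * dE_dw_hidden
        self.W_input_to_hidden -= learning_rate * dE_dw_output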

Question 2.2.3: Create a new Neural Network with the same architecture as in Question 2.1.1 and train it using the softmax cross-entropy loss.

In [16]:
# Define your neural network
mnist_model_xe = NeuralNetwork(784, 30, 10)

# Train NN and print accuracy on validation data
print("\nLearning rate = 0.1")
mnist_model_xe.train_xe(small_training_data, small_validation_data, 300, 0.1)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model_xe.predict(test_data)/len(test_data[0])))

print("\nLearning rate = 1")
mnist_model_xe.train_xe(small_training_data, small_validation_data, 300, 1.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model_xe.predict(test_data)/len(test_data[0])))

print("\nLearning rate = 10")
mnist_model_xe.train_xe(small_training_data, small_validation_data, 300, 10.)
print("Accuracy on test data: %2.2f%%\n\n" % float(100*mnist_model_xe.predict(test_data)/len(test_data[0])))
Learning rate = 0.1
Training time: 17.88617992401123
Accuracy on test data: 67.85%

Learning rate = 1
Training time: 18.187050104141235
Accuracy on test data: 86.73%

Learning rate = 10
Training time: 17.920576572418213
Accuracy on test data: 83.96%

Why do we pick a learning rate of 1?

The training graph shows that, with this learning rate, training is very smooth and improves faster than with the other two learning rates.

In [17]:
# Print accuracy on test data
mnist_model_xe.train_xe(training_data, validation_data, 300, 1.)

accuracy = mnist_model_xe.predict(test_data)/100
print("Accuracy", accuracy)
Training time: 1250.7128057479858
Accuracy 91.92

Question 2.2.4: Compare your results with the MSE loss and with the cross-entropy loss.


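  • First, the accuracy with cross-entropy is higher than with MSE (91.92% compared to 88.32%).
  • Second, the cross-entropy version reaches acceptable performance (80% accuracy) much sooner than the MSE version (around epoch 50 compared to around epoch 200), so fewer epochs are needed to reach a given accuracy.

Note that the two models do not have the same size (75 hidden neurons for MSE, 30 for cross-entropy), so the comparison is indicative only.

In conclusion, the cross-entropy version performs better in our experiments.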
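One commonly cited reason for the faster start of cross-entropy training, which the experiments above do not establish by themselves, is the size of the gradient when an output is confidently wrong. The sketch below compares the gradient with respect to the output activation for a sigmoid unit with MSE, $2(o - t)\,o(1 - o)$, and for softmax with cross-entropy, $o - t$; the activation values are chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(u):
    e = np.exp(u - np.max(u))
    return e / e.sum()

# Confidently wrong case: the target is 1 but the activation is strongly negative
u = -6.0
o = sigmoid(u)
grad_mse = 2 * (o - 1.0) * o * (1 - o)    # MSE + sigmoid: damped by o * (1 - o)

logits = np.array([-6.0, 0.0])
t = np.array([1.0, 0.0])
grad_xe = softmax(logits) - t             # cross-entropy + softmax: o - t

print(grad_mse)    # small magnitude: the saturated sigmoid slows learning
print(grad_xe[0])  # magnitude close to 1: a strong correction
```

In this saturated case the MSE gradient is several hundred times smaller than the cross-entropy gradient, which is consistent with the slower early progress observed with MSE.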