Deep Learning

Lab Session 3 - 1.5 Hours

Sentiment Analysis with Recurrent Neural Networks

Submitted by: DO Thi Duyen - LE Ta Dang Khoa

The aim of this session is to practice with VanillaRNN and Gated Recurrent Units (GRU). Each group should fill and run appropriate notebook cells.

Follow instructions step by step until the end and submit your complete notebook as an archive (tar -cf groupXnotebook.tar DL_lab3/).

Do not forget to run all your cells before generating your final report and do not forget to include the names of all participants in the group. The lab session should be completed by June 12th 2019 (23:59:59 CET).

Section 1: Sentiment Analysis with a Vanilla RNN¶

In this part, you will have no code to write. However you should spend some minutes on it, to understand well how the Vanilla RNN is implemented: you will implement a GRU in a similar way in Section 2.

You will work on a corpus of 3,000 user comments taken from IMDb (1,000), Amazon (1,000) and Yelp (1,000). These comments are split into two categories: positive comments (denoted by "1") and negative comments (denoted by "0"). For each website, 500 comments are positive and 500 comments are negative. This corpus has been created for the paper From Group to Individual Labels using Deep Features by Kotzias et al.

In this lab, we split this dataset into a training set of 2,520 comments (420 positive comments and 420 negative comments from each website), a validation set of 240 comments (40 positive comments and 40 negative comments from each website) and a test set of 240 comments (40 positive comments and 40 negative comments from each website).

Your goal will be to classify automatically these sentences by training a Vanilla RNN and then a GRU. Please note that we use the word2vec method to convert words into vectors: these vectors are designed so that they reflect the semantic and the syntactic functions of words. You can read more about word2vec in the paper Distributed representations of words and phrases and their compositionality by Mikolov et al.

First of all, please run the following cell.

# Imports
import tensorflow as tf
import numpy as np
import utils

import warnings
warnings.filterwarnings("ignore")

# Parameters
epsilon = 1e-10
max_l = 32 # Max length of sentences

train, val, test, word2vec = utils.load_data()
data = utils.Dataset(train, val, test, word2vec)

In the following cell, we define a VanillaRNN class. Please read its code carefully before running the cell because you will need to implement a similar class for the GRU.

If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined as

$h_0 = 0$

$h_{t+1} = f(W_h h_t + W_x x_{t+1} + b)$

where $W_h$, $W_x$ and $b$ are trainable parameters and $f$ is an activation fucntion.

class VanillaRNN:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "vanilla_rnn") + "/"
        self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])

    def state_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):

        candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        new_h = self._activation(candidate)
        return new_h

Parameters

Learning rate: 0.001
Training epochs: 30
Batch size: 128
Hidden states: 50

# Parameters
learning_rate = 0.001
training_epochs = 30
batch_size = 128
hidden_states = 50

Then we define our model. Please read the code of the process_sequence() function to understand the utility of the MaskData placeholder. If $h_L$ is the last hidden state of the Vanilla RNN, then we define our final prediction $p$ as

$p = \sigma (W_{pred} h_L + b_{pred})$

where $W_{pred}$ and $b_{pred}$ are trainable parameters and $\sigma$ denotes the sigmoid function.

tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/vanilla.ckpt"
# tf Graph Input:  sentiment analysis data
# Sentences are padded with zero vectors
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks: necessary as we have different sentence lengths
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# positive (1) or negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

# we define our VanillaRNN cell
vanilla = VanillaRNN(300, hidden_states)

# we retrieve its last output
vanilla_output = utils.process_sequence(vanilla, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
# we make the final prediction
pred = tf.nn.sigmoid(tf.matmul(vanilla_output, W) + b)

Question 0 - Why do we need a MaskData placeholder?

ANSWER:

When we want to batchize multiple sentences as one input to tensorflow, we need to make them as a tensor/matrix with fixed length inputs. In this lab session, the given data contain variable length sentences. Therefore, we used the MaskData placeholder to handle these variable length inputs in order to create fixed ones, then feed it to the network.

Eventually, we train our model using a cross-entropy loss and the Adam optimizer. At each epoch we check the validation accuracy, and save the model if that accuracy increased. At the end, we load the best model on validation, and print its accuracy on the test set.

We test our model using a $\tanh$ activation function.

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.5
        Model saved in file: models/vanilla.ckpt
Epoch:  01   =====> Loss= 0.694689271
Accuracy on validation: 0.6666667
        Model saved in file: models/vanilla.ckpt
Epoch:  02   =====> Loss= 0.690510932
Accuracy on validation: 0.7416667
        Model saved in file: models/vanilla.ckpt
Epoch:  03   =====> Loss= 0.662441818
Accuracy on validation: 0.69166666
Epoch:  04   =====> Loss= 0.582793427
Accuracy on validation: 0.775
        Model saved in file: models/vanilla.ckpt
Epoch:  05   =====> Loss= 0.528468937
Accuracy on validation: 0.79583335
        Model saved in file: models/vanilla.ckpt
Epoch:  06   =====> Loss= 0.471816479
Accuracy on validation: 0.8041667
        Model saved in file: models/vanilla.ckpt
Epoch:  07   =====> Loss= 0.430543068
Accuracy on validation: 0.8208333
        Model saved in file: models/vanilla.ckpt
Epoch:  08   =====> Loss= 0.413095255
Accuracy on validation: 0.825
        Model saved in file: models/vanilla.ckpt
Epoch:  09   =====> Loss= 0.389044346
Accuracy on validation: 0.825
Epoch:  10   =====> Loss= 0.372057695
Accuracy on validation: 0.82916665
        Model saved in file: models/vanilla.ckpt
Epoch:  11   =====> Loss= 0.355003073
Accuracy on validation: 0.8333333
        Model saved in file: models/vanilla.ckpt
Epoch:  12   =====> Loss= 0.349159446
Accuracy on validation: 0.8333333
Epoch:  13   =====> Loss= 0.342088008
Accuracy on validation: 0.82916665
Epoch:  14   =====> Loss= 0.335627354
Accuracy on validation: 0.79583335
Epoch:  15   =====> Loss= 0.336374894
Accuracy on validation: 0.84583336
        Model saved in file: models/vanilla.ckpt
Epoch:  16   =====> Loss= 0.330676725
Accuracy on validation: 0.825
Epoch:  17   =====> Loss= 0.312103723
Accuracy on validation: 0.8375
Epoch:  18   =====> Loss= 0.296180920
Accuracy on validation: 0.825
Epoch:  19   =====> Loss= 0.315674418
Accuracy on validation: 0.82916665
Epoch:  20   =====> Loss= 0.298783961
Accuracy on validation: 0.82916665
Epoch:  21   =====> Loss= 0.286065846
Accuracy on validation: 0.825
Epoch:  22   =====> Loss= 0.278212914
Accuracy on validation: 0.8375
Epoch:  23   =====> Loss= 0.271039650
Accuracy on validation: 0.825
Epoch:  24   =====> Loss= 0.267676794
Accuracy on validation: 0.84166664
Epoch:  25   =====> Loss= 0.262044297
Accuracy on validation: 0.825
Epoch:  26   =====> Loss= 0.254342996
Accuracy on validation: 0.825
Epoch:  27   =====> Loss= 0.250271723
Accuracy on validation: 0.8375
Epoch:  28   =====> Loss= 0.233312539
Accuracy on validation: 0.82916665
Epoch:  29   =====> Loss= 0.223956222
Accuracy on validation: 0.84166664
Epoch:  30   =====> Loss= 0.222063489
WARNING:tensorflow:From C:\Users\duyen\Anaconda2\envs\py36\lib\site-packages\tensorflow\python\training\saver.py:1266: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
INFO:tensorflow:Restoring parameters from models/vanilla.ckpt
Accuracy: 0.85

Did you understand everything? If so, you can move towards Section 2.

Section 2: Your turn!¶

Question 1 - Recall the formulas defining the hidden states of a GRU.

ANSWER:

Remember

$r^{t} = sigmoid(W_{r}.[h^{t-1}, x^{t}] + b_{r})$

Input

$h^{'t} = tanh(W_{i}.[r^{t} \otimes h^{t-1}, x^{t}] + b_{i})$

Update Gate

$z^{t} = sigmoid(W_{z}.[h^{t-1}, x^{t}] + b_{z})$

$h^{t} = z^{t} \otimes h^{'t} + (1-z^{t}) \otimes h^{t-1} $

Question 2 - Define a GRU similar to the Vanilla RNN that we defined in Section 1.

class GRU:

    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "gru") + "/"
        ############ CODE NEEDED ############
        self._candidate_kernel_r = tf.get_variable(self._name + "candidate/weights_r",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_r = tf.get_variable(self._name + "candidate/bias_r", shape=[self._hidden_states])

        self._candidate_kernel_i = tf.get_variable(self._name + "candidate/weights_i",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_i = tf.get_variable(self._name + "candidate/bias_i", shape=[self._hidden_states])
        
        self._candidate_kernel_z = tf.get_variable(self._name + "candidate/weights_z",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_z = tf.get_variable(self._name + "candidate/bias_z", shape=[self._hidden_states])
        #####################################

    def state_size(self):
        return self._hidden_states

    def output_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):
        ############ CODE NEEDED ############
        
        # Compute Remember r
        candidate_r = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_r)
        candidate_r = tf.nn.bias_add(candidate_r, self._candidate_bias_r)
        r = tf.nn.sigmoid(candidate_r)
        
        # Compute Input h'
        candidate_i = tf.matmul(tf.concat([inputs, tf.multiply(r, state)], 1), self._candidate_kernel_i)
        candidate_i = tf.nn.bias_add(candidate_i, self._candidate_bias_i)
        h_prime = tf.nn.tanh(candidate_i)
        
        # Compute Update Gate z
        candidate_z = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_z)
        candidate_z = tf.nn.bias_add(candidate_z, self._candidate_bias_z)
        z = tf.nn.sigmoid(candidate_z)
        
        # Compute new state
        new_h = tf.multiply(z, h_prime) + tf.multiply(1-z, h_prime)
        #####################################
        return new_h

Question 3 - Train that GRU with a $tanh$ activation function and print its accuracy on the test set.

ANSWER:

The training is in cell below. Its accuracy on the test set is 0.8375

tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/gru.ckpt"
# tf Graph Input:  sentiment analysis data
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# Positive (1) or Negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')

gru = GRU(300, hidden_states)

gru_output = utils.process_sequence(gru, x, m)

W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
pred = tf.nn.sigmoid(tf.matmul(gru_output, W) + b)

with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                                     feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print("        Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), "  =====> Loss=", "{:.9f}".format(avg_cost))

    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))

Training started
Accuracy on validation: 0.5
        Model saved in file: models/gru.ckpt
Epoch:  01   =====> Loss= 0.694832127
Accuracy on validation: 0.6625
        Model saved in file: models/gru.ckpt
Epoch:  02   =====> Loss= 0.690147723
Accuracy on validation: 0.73333335
        Model saved in file: models/gru.ckpt
Epoch:  03   =====> Loss= 0.677693894
Accuracy on validation: 0.775
        Model saved in file: models/gru.ckpt
Epoch:  04   =====> Loss= 0.591422774
Accuracy on validation: 0.7708333
Epoch:  05   =====> Loss= 0.509685430
Accuracy on validation: 0.77916664
        Model saved in file: models/gru.ckpt
Epoch:  06   =====> Loss= 0.471581693
Accuracy on validation: 0.7916667
        Model saved in file: models/gru.ckpt
Epoch:  07   =====> Loss= 0.439693710
Accuracy on validation: 0.77916664
Epoch:  08   =====> Loss= 0.431503745
Accuracy on validation: 0.8208333
        Model saved in file: models/gru.ckpt
Epoch:  09   =====> Loss= 0.413989216
Accuracy on validation: 0.8125
Epoch:  10   =====> Loss= 0.395639500
Accuracy on validation: 0.80833334
Epoch:  11   =====> Loss= 0.371596493
Accuracy on validation: 0.8041667
Epoch:  12   =====> Loss= 0.370231081
Accuracy on validation: 0.8125
Epoch:  13   =====> Loss= 0.357129651
Accuracy on validation: 0.825
        Model saved in file: models/gru.ckpt
Epoch:  14   =====> Loss= 0.345925518
Accuracy on validation: 0.8333333
        Model saved in file: models/gru.ckpt
Epoch:  15   =====> Loss= 0.339102052
Accuracy on validation: 0.8333333
Epoch:  16   =====> Loss= 0.323459774
Accuracy on validation: 0.82916665
Epoch:  17   =====> Loss= 0.317358541
Accuracy on validation: 0.8375
        Model saved in file: models/gru.ckpt
Epoch:  18   =====> Loss= 0.300963225
Accuracy on validation: 0.82916665
Epoch:  19   =====> Loss= 0.295587132
Accuracy on validation: 0.82916665
Epoch:  20   =====> Loss= 0.294250146
Accuracy on validation: 0.82916665
Epoch:  21   =====> Loss= 0.288610291
Accuracy on validation: 0.81666666
Epoch:  22   =====> Loss= 0.280991768
Accuracy on validation: 0.8208333
Epoch:  23   =====> Loss= 0.264295967
Accuracy on validation: 0.81666666
Epoch:  24   =====> Loss= 0.266704788
Accuracy on validation: 0.825
Epoch:  25   =====> Loss= 0.260948915
Accuracy on validation: 0.825
Epoch:  26   =====> Loss= 0.243570645
Accuracy on validation: 0.825
Epoch:  27   =====> Loss= 0.246957934
Accuracy on validation: 0.825
Epoch:  28   =====> Loss= 0.234278925
Accuracy on validation: 0.8208333
Epoch:  29   =====> Loss= 0.222084848
Accuracy on validation: 0.8208333
Epoch:  30   =====> Loss= 0.221500899
INFO:tensorflow:Restoring parameters from models/gru.ckpt
Accuracy: 0.8375

Question 4 - What are the advantages of Gated Recurrent Units over Vanilla RNNs?

ANSWER

The advantages of Gated Recurrent Units over Vanilla RNNs:

GRU allows RNNs to remember their inputs over a long period of time.
GRU's shortcut paths allow the error to be back-propagated easily without too quickly vanishing.