The aim of this session is to practice with Vanilla RNNs and Gated Recurrent Units (GRUs). Each group should fill in and run the appropriate notebook cells.
Follow instructions step by step until the end and submit your complete notebook as an archive (tar -cf groupXnotebook.tar DL_lab3/).
Do not forget to run all your cells before generating your final report, and to include the names of all participants in the group. The lab session should be completed by June 12th, 2019 (23:59:59 CET).
In this part, you have no code to write. However, you should spend a few minutes on it to understand how the Vanilla RNN is implemented: you will implement a GRU in a similar way in Section 2.
You will work on a corpus of 3,000 user comments taken from IMDb (1,000), Amazon (1,000) and Yelp (1,000). These comments are split into two categories: positive comments (denoted by "1") and negative comments (denoted by "0"). For each website, 500 comments are positive and 500 comments are negative. This corpus has been created for the paper From Group to Individual Labels using Deep Features by Kotzias et al.
In this lab, we split this dataset into a training set of 2,520 comments (420 positive comments and 420 negative comments from each website), a validation set of 240 comments (40 positive comments and 40 negative comments from each website) and a test set of 240 comments (40 positive comments and 40 negative comments from each website).
Your goal will be to automatically classify these sentences by training a Vanilla RNN and then a GRU. Please note that we use the word2vec method to convert words into vectors: these vectors are designed to reflect the semantic and syntactic functions of words. You can read more about word2vec in the paper Distributed Representations of Words and Phrases and their Compositionality by Mikolov et al.
First of all, please run the following cell.
# Imports
import tensorflow as tf
import numpy as np
import utils
import warnings
warnings.filterwarnings("ignore")
# Parameters
epsilon = 1e-10
max_l = 32 # Max length of sentences
train, val, test, word2vec = utils.load_data()
data = utils.Dataset(train, val, test, word2vec)
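For intuition, here is a minimal sketch of how a tokenized sentence could be turned into the padded word-vector matrix and the mask used below. This is illustrative only: the real preprocessing is handled inside utils.load_data() and utils.Dataset, and the assumption here is that word2vec behaves like a dict mapping a word to a 300-dimensional NumPy vector.
# Illustrative sketch only -- the real preprocessing is done by utils.load_data() / utils.Dataset.
# Assumption: word2vec acts like a dict mapping a word to a 300-dimensional numpy vector.
def sentence_to_input(tokens, word2vec, max_len=32, dim=300):
    xs = np.zeros((max_len, dim), dtype=np.float32)    # padded word vectors
    mask = np.zeros((max_len, 1), dtype=np.float32)    # 1 for real words, 0 for padding
    for i, token in enumerate(tokens[:max_len]):
        xs[i] = word2vec.get(token, np.zeros(dim))     # unknown words mapped to the zero vector (assumption)
        mask[i] = 1.0
    return xs, mask
A batch would then stack such (xs, mask) pairs into tensors of shape [batch, max_l, 300] and [batch, max_l, 1], which is exactly the shape of the placeholders defined later.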
In the following cell, we define a VanillaRNN class. Please read its code carefully before running the cell because you will need to implement a similar class for the GRU.
If our sentence is represented by the sequence $(x_1, ..., x_L)$, the hidden states $h_t$ of the Vanilla RNN are defined by $h_0 = 0$ and
$h_t = f(W_h h_{t-1} + W_x x_t + b)$
where $W_h$, $W_x$ and $b$ are trainable parameters and $f$ is an activation function. Note that in the code below, $W_x$ and $W_h$ are stored as a single kernel applied to the concatenation $[x_t, h_{t-1}]$, which is equivalent.
class VanillaRNN:
    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "vanilla_rnn") + "/"
        self._candidate_kernel = tf.get_variable(self._name + "candidate/weights",
                                                 shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias = tf.get_variable(self._name + "candidate/bias", shape=[self._hidden_states])

    def state_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):
        candidate = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel)
        candidate = tf.nn.bias_add(candidate, self._candidate_bias)
        new_h = self._activation(candidate)
        return new_h
Parameters
# Parameters
learning_rate = 0.001
training_epochs = 30
batch_size = 128
hidden_states = 50
Then we define our model. Please read the code of the process_sequence() function to understand the purpose of the MaskData placeholder. If $h_L$ is the last hidden state of the Vanilla RNN, then we define our final prediction $p$ as
$p = \sigma(W_{pred} \cdot h_L + b_{pred})$
where $W_{pred}$ and $b_{pred}$ are trainable parameters and $\sigma$ denotes the sigmoid function.
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/vanilla.ckpt"
# tf Graph Input: sentiment analysis data
# Sentences are padded with zero vectors
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks: necessary as we have different sentence lengths
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# positive (1) or negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')
# we define our VanillaRNN cell
vanilla = VanillaRNN(300, hidden_states)
# we retrieve its last output
vanilla_output = utils.process_sequence(vanilla, x, m)
W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
# we make the final prediction
pred = tf.nn.sigmoid(tf.matmul(vanilla_output, W) + b)
Question 0 - Why do we need a MaskData placeholder?
ANSWER:
To batch several sentences into a single tensor, all of them must have the same length, so every sentence is padded with zero vectors up to max_l. Since the sentences in the dataset have variable lengths, the MaskData placeholder indicates which positions correspond to real words (1) and which correspond to padding (0). This lets the recurrent computation ignore the padded positions, so that the final hidden state used for the prediction is the one obtained after the last real word of each sentence.
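As an illustration of how such a mask could be used (a minimal sketch only, under the assumption described above; the actual implementation is in utils.process_sequence, whose code is not shown here), the cell can be unrolled over time while carrying the previous state through padded positions:
# Hypothetical sketch of a masked unrolling -- NOT the actual utils.process_sequence code.
def masked_unroll(cell, x, m):
    # x: [batch, max_l, 300] padded word vectors; m: [batch, max_l, 1] mask (1 = word, 0 = padding)
    state = cell.zero_state(x[:, 0, :])
    for t in range(max_l):
        new_state = cell(x[:, t, :], state)
        # keep the new state where the mask is 1, carry the previous state where it is 0
        state = m[:, t, :] * new_state + (1. - m[:, t, :]) * state
    return state  # hidden state after the last real word of each sentence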
Finally, we train our model using a cross-entropy loss and the Adam optimizer. At each epoch we check the validation accuracy and save the model if that accuracy has improved. At the end, we load the model that performed best on the validation set and print its accuracy on the test set.
We test our model using a $\tanh$ activation function.
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                            feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print(" Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), " =====> Loss=", "{:.9f}".format(avg_cost))
    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))
Did you understand everything? If so, you can move on to Section 2.
Question 1 - Recall the formulas defining the hidden states of a GRU.
ANSWER:
Reset gate:
$r^{t} = \sigma(W_{r} \cdot [h^{t-1}, x^{t}] + b_{r})$
Candidate state:
$\tilde{h}^{t} = \tanh(W_{i} \cdot [r^{t} \odot h^{t-1}, x^{t}] + b_{i})$
Update gate:
$z^{t} = \sigma(W_{z} \cdot [h^{t-1}, x^{t}] + b_{z})$
New hidden state:
$h^{t} = z^{t} \odot \tilde{h}^{t} + (1 - z^{t}) \odot h^{t-1}$
where $\odot$ denotes element-wise multiplication and $[\cdot, \cdot]$ denotes concatenation.
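As a sanity check of these formulas (purely illustrative, and independent of the TensorFlow class asked for below), a single GRU step can be written directly in NumPy, using the same concatenated-kernel convention as the VanillaRNN above:
# Illustrative NumPy version of one GRU step, following the formulas above.
import numpy as np

def np_sigmoid(a):
    return 1. / (1. + np.exp(-a))

def gru_step(x_t, h_prev, W_r, b_r, W_i, b_i, W_z, b_z):
    # weight matrices have shape [input_size + hidden_states, hidden_states]
    xh = np.concatenate([x_t, h_prev])
    r = np_sigmoid(xh @ W_r + b_r)                                          # reset gate
    h_candidate = np.tanh(np.concatenate([x_t, r * h_prev]) @ W_i + b_i)    # candidate state
    z = np_sigmoid(xh @ W_z + b_z)                                          # update gate
    return z * h_candidate + (1. - z) * h_prev                              # new hidden state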
Question 2 - Define a GRU similar to the Vanilla RNN that we defined in Section 1.
class GRU:
    def __init__(self, input_size, hidden_states, activation=None, name=None):
        self._hidden_states = hidden_states
        self._input_size = input_size
        self._activation = activation or tf.tanh
        self._name = (name or "gru") + "/"
        ############ CODE NEEDED ############
        self._candidate_kernel_r = tf.get_variable(self._name + "candidate/weights_r",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_r = tf.get_variable(self._name + "candidate/bias_r", shape=[self._hidden_states])
        self._candidate_kernel_i = tf.get_variable(self._name + "candidate/weights_i",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_i = tf.get_variable(self._name + "candidate/bias_i", shape=[self._hidden_states])
        self._candidate_kernel_z = tf.get_variable(self._name + "candidate/weights_z",
                                                   shape=[input_size + self._hidden_states, self._hidden_states])
        self._candidate_bias_z = tf.get_variable(self._name + "candidate/bias_z", shape=[self._hidden_states])
        #####################################

    def state_size(self):
        return self._hidden_states

    def output_size(self):
        return self._hidden_states

    def zero_state(self, inputs):
        batch_size = tf.shape(inputs)[0]
        return tf.zeros([batch_size, self.state_size()], dtype=tf.float32)

    def __call__(self, inputs, state):
        ############ CODE NEEDED ############
        # Compute the reset gate r
        candidate_r = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_r)
        candidate_r = tf.nn.bias_add(candidate_r, self._candidate_bias_r)
        r = tf.nn.sigmoid(candidate_r)
        # Compute the candidate state h'
        candidate_i = tf.matmul(tf.concat([inputs, tf.multiply(r, state)], 1), self._candidate_kernel_i)
        candidate_i = tf.nn.bias_add(candidate_i, self._candidate_bias_i)
        h_prime = tf.nn.tanh(candidate_i)
        # Compute the update gate z
        candidate_z = tf.matmul(tf.concat([inputs, state], 1), self._candidate_kernel_z)
        candidate_z = tf.nn.bias_add(candidate_z, self._candidate_bias_z)
        z = tf.nn.sigmoid(candidate_z)
        # Compute the new hidden state
        new_h = tf.multiply(z, h_prime) + tf.multiply(1 - z, state)
        #####################################
        return new_h
Question 3 - Train that GRU with a $\tanh$ activation function and print its accuracy on the test set.
ANSWER:
The training is in the cell below. The accuracy on the test set is 0.8375.
tf.reset_default_graph()
tf.set_random_seed(123)
model_path = "models/gru.ckpt"
# tf Graph Input: sentiment analysis data
x = tf.placeholder(tf.float32, [None, max_l, 300], name='InputData')
# masks
m = tf.placeholder(tf.float32, [None, max_l, 1], name='MaskData')
# Positive (1) or Negative (0) labels
y = tf.placeholder(tf.float32, [None, 1], name='LabelData')
gru = GRU(300, hidden_states)
gru_output = utils.process_sequence(gru, x, m)
W = tf.Variable(tf.zeros([hidden_states, 1]), name='Weights')
b = tf.Variable(tf.zeros([1]), name='Bias')
pred = tf.nn.sigmoid(tf.matmul(gru_output, W) + b)
with tf.name_scope('Loss'):
    # Minimize error using cross entropy
    # We use tf.clip_by_value to avoid having too low numbers in the log function
    cost = tf.reduce_mean(-y*tf.log(tf.clip_by_value(pred, epsilon, 1.0)) - (1.-y)*tf.log(tf.clip_by_value((1.-pred), epsilon, 1.0)))

with tf.name_scope('Adam'):
    # Gradient Descent
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)

with tf.name_scope('Accuracy'):
    # Accuracy
    pred_tmp = tf.stack([pred, 1.-pred])
    y_tmp = tf.stack([y, 1.-y])
    acc = tf.equal(tf.argmax(pred_tmp, 0), tf.argmax(y_tmp, 0))
    acc = tf.reduce_mean(tf.cast(acc, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(init)
    # Training cycle
    print("Training started")
    best_val_acc = 0.
    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(len(train)/batch_size)
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ms, batch_ys = data.next_batch(batch_size)
            # Run optimization op (backprop), cost op (to get loss value)
            # and summary nodes
            _, c = sess.run([optimizer, cost],
                            feed_dict={x: batch_xs, y: batch_ys, m: batch_ms})
            # Compute average loss
            avg_cost += c / total_batch
        # Display logs per epoch step
        val_xs, val_ms, val_ys = data.val_batch()
        val_acc = acc.eval({x: val_xs, m: val_ms, y: val_ys})
        print("Accuracy on validation:", val_acc)
        if val_acc > best_val_acc:
            best_val_acc = val_acc
            save_path = saver.save(sess, model_path)
            print(" Model saved in file: %s" % save_path)
        print("Epoch: ", '%02d' % (epoch+1), " =====> Loss=", "{:.9f}".format(avg_cost))
    # Test model
    # Calculate accuracy
    saver.restore(sess, model_path)
    test_xs, test_ms, test_ys = data.test_batch()
    print("Accuracy:", acc.eval({x: test_xs, m: test_ms, y: test_ys}))
Question 4 - What are the advantages of Gated Recurrent Units over Vanilla RNNs?
ANSWER:
The advantages of Gated Recurrent Units over Vanilla RNNs: