LSTM
introduction
RNN (Recurrent Neural Network) is a kind of neural network that feeds its current state back into itself at every step. As a result, it can "remember" something about previous samples. However, a plain RNN cannot remember things for very long because of the vanishing gradient problem: when the error is back-propagated through earlier steps, it is repeatedly multiplied by the gradient of the activation function, which is less than 1, so after several steps it decays to nearly 0. LSTM (Long Short-Term Memory) [1] is one of the most promising variants of RNN. Gates are introduced to help each unit choose when to forget and when to remember, which tackles the vanishing gradient problem at the cost of some extra parameters.
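Concretely, the gate equations that the code in the next section implements (following [3]) are:
\begin{aligned}
f_t &= \sigma(W_f\,[x_t, h_{t-1}] + b_f) \\
i_t &= \sigma(W_i\,[x_t, h_{t-1}] + b_i) \\
\tilde{c}_t &= \tanh(W_c\,[x_t, h_{t-1}] + b_c) \\
o_t &= \sigma(W_o\,[x_t, h_{t-1}] + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t), \qquad y_t = W_y h_t + b_y
\end{aligned}
Here [x_t, h_{t-1}] is the concatenation of the current input and the previous hidden state, \sigma is the logistic sigmoid, and \odot is the element-wise product.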
Here we use a sine wave as input and train an LSTM to learn it. Once the network is well trained, we let the LSTM generate the same wave on its own.
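As a sketch of the data setup (the sequence length and sampling step are our own choices, and each time step is treated as a single scalar, i.e. vectorSize = 1):
import numpy as np

vectorSize = 1                                   # each step is one sample of the wave (assumption)
nSteps = 100                                     # number of training steps (assumption)
t = np.linspace(0.0, 4.0 * np.pi, nSteps + 1)    # a few periods of the sine wave
wave = np.sin(t).reshape(-1, vectorSize)
x = wave[:-1]                                    # network input
y = wave[1:]                                     # target: the same wave shifted by one step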
construct the LSTM in Theano
There are a lot of deep learning frameworks to choose from, such as Theano, TensorFlow, Keras, Caffe, Torch, etc. We prefer Theano over the others because it gives us the most freedom in constructing our programs. A library called Computation Graph Toolkit is also very promising, but it still needs some time to become user friendly. A Theano tutorial is offered in [2].
Firstly we construct the LSTM kernel function according to [3]. The LSTM function is a bit more complicated than a traditional RNN because of the three extra gates. The function is defined as:
def lstm(x, cm1, hm1, ym1, W):
    # Concatenate the current input and the previous hidden state.
    hx = T.concatenate([x, hm1])
    hxSize = hx.shape[0]

    # All parameters live in the flat vector W; slice them out one by one.
    bs = 0
    Wf = W[bs: bs + hiddenSize * hxSize].reshape([hiddenSize, hxSize])
    bs += hiddenSize * hxSize
    bf = W[bs: bs + hiddenSize]
    bs += hiddenSize
    Wi = W[bs: bs + hiddenSize * hxSize].reshape([hiddenSize, hxSize])
    bs += hiddenSize * hxSize
    bi = W[bs: bs + hiddenSize]
    bs += hiddenSize
    Wc = W[bs: bs + hiddenSize * hxSize].reshape([hiddenSize, hxSize])
    bs += hiddenSize * hxSize
    bc = W[bs: bs + hiddenSize]
    bs += hiddenSize
    Wo = W[bs: bs + hiddenSize * hxSize].reshape([hiddenSize, hxSize])
    bs += hiddenSize * hxSize
    bo = W[bs: bs + hiddenSize]
    bs += hiddenSize
    Wy = W[bs: bs + vectorSize * hiddenSize].reshape([vectorSize, hiddenSize])
    bs += vectorSize * hiddenSize
    by = W[bs: bs + vectorSize]
    bs += vectorSize

    ft = T.nnet.sigmoid(Wf.dot(hx) + bf)   # forget gate
    it = T.nnet.sigmoid(Wi.dot(hx) + bi)   # input gate
    ct = T.tanh(Wc.dot(hx) + bc)           # candidate cell state
    ot = T.nnet.sigmoid(Wo.dot(hx) + bo)   # output gate
    c = ft * cm1 + it * ct                 # new cell state
    h = ot * T.tanh(c)                     # new hidden state
    y = Wy.dot(h) + by                     # linear readout
    # The first y is collected as the step output; c, h and y are fed back
    # to the next step as cm1, hm1 and ym1.
    return [y, c, h, y]
We pack all the parameters into a single flat vector W and slice them back out inside the function. x is the input vector at each step, cm1 is the previous memory cell state, hm1 is the previous hidden state, ym1 is the previous output, and y is the current output.
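Before calling scan we also need the symbolic variables and an initial parameter vector. The following is a minimal sketch (hiddenSize and the initialization scale are our own assumptions; the names tW, tx and ty match the code below):
import numpy as np
import theano
import theano.tensor as T

vectorSize = 1                     # as in the data sketch above
hiddenSize = 10                    # number of LSTM units (assumption)
hxSize = vectorSize + hiddenSize   # size of the concatenated [x, h] vector

# Total number of parameters packed into W: four gate matrices with biases,
# plus the readout matrix Wy and bias by.
numParams = 4 * (hiddenSize * hxSize + hiddenSize) + vectorSize * hiddenSize + vectorSize
W = 0.1 * np.random.randn(numParams)   # small random initialization

tW = T.vector('W')                 # flat parameter vector
tx = T.matrix('x')                 # input sequence, one row per step
ty = T.matrix('y')                 # target sequence, one row per step
With the kernel and these variables in place, we use Theano's scan function to loop it over the input sequence: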
tResult, tUpdates = theano.scan(lstm,
                                outputs_info = [None,                  # per-step outputs, not fed back
                                                T.zeros(hiddenSize),   # initial cell state cm1
                                                T.zeros(hiddenSize),   # initial hidden state hm1
                                                T.zeros(vectorSize)],  # initial output ym1
                                sequences = [dict(input = tx)],
                                non_sequences = [tW])
We define the loss as the sum of squared differences between the output at each step and the next input sample, so the input sequence is x[:-1] and the expected output is x[1:]:
predictSequence = tResult[3]            # the chain of y outputs, one row per step
tef = T.sum((predictSequence - ty)**2)  # squared error against the shifted sequence
tgrad = T.grad(tef, tW)                 # gradient of the loss w.r.t. the flat parameter vector
GetPredict = theano.function(inputs = [tW, tx], outputs = predictSequence)
GetError = theano.function(inputs = [tW, tx, ty], outputs = tef)
GetGrad = theano.function(inputs = [tW, tx, ty], outputs = tgrad)
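As a quick sanity check (assuming W, x and y from the sketches above), the compiled functions can be called directly on the numpy arrays:
pred = GetPredict(W, x)   # one-step-ahead predictions, shape (nSteps, vectorSize)
err = GetError(W, x, y)   # scalar loss before training
grad = GetGrad(W, x, y)   # gradient vector with the same length as W
print(err)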
After that, we use Adagrad for the optimization. Adagrad is a simple and efficient optimization method; in our experiments it reaches a better result than L-BFGS (not faster, but a better final result) and SGD. The Adagrad loop is as follows:
for i in xrange(1000):
    dx = GetGrad(W, x, y)                                 # gradient for the whole sequence
    cache += dx**2                                        # accumulate squared gradients
    W += - learning_rate * dx / (np.sqrt(cache) + eps)    # per-parameter scaled update
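The loop assumes the Adagrad state has been set up beforehand; a minimal sketch (the learning rate and eps values here are only illustrative):
cache = np.zeros_like(W)   # running sum of squared gradients, one entry per parameter
learning_rate = 0.05       # step size (illustrative value)
eps = 1e-8                 # small constant to avoid division by zero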
However, LSTM is hard to optimize without some tricks. We introduce two tricks here: one is weighted training and the other is denoising.
weighted training method
The weighted training method assigns a weight to each step of the input sequence. We believe the early stage of a sequence matters more: a small error made in the early steps is carried forward through all the following steps and can finally cause the prediction to fail. So we add a weight to the per-step loss that decays along the sequence, which turned out to help the result a lot; a sketch of the weighted loss is given below.
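One way to realize this, building on the loss defined earlier (the decay factor and the rebuilt functions are a sketch under our own assumptions, not the exact code of the experiment), is to multiply each step's squared error by a geometrically decaying weight before summing:
decay = 0.97                                       # per-step weight decay (illustrative value)
stepWeights = np.power(decay, np.arange(nSteps))   # weight 1.0 for the first step, smaller afterwards
tef = T.sum(((predictSequence - ty)**2) * stepWeights[:, None])
tgrad = T.grad(tef, tW)
GetError = theano.function(inputs = [tW, tx, ty], outputs = tef)
GetGrad = theano.function(inputs = [tW, tx, ty], outputs = tgrad)
The Adagrad loop stays the same; only the loss (and therefore the gradient) changes. The result is shown as follows: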
denoising LSTM
The other method, borrowed from the denoising autoencoder, is to add some noise to the input sequence during training. It also helps train the network, and it needs less hand tuning than the weighted method; a sketch is given below.
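A minimal sketch of this idea (the noise level is our own assumption): at every iteration the gradient is computed on a freshly corrupted copy of the input, while the target stays clean:
noiseStd = 0.05                                             # standard deviation of the injected noise (assumption)
for i in xrange(1000):
    xNoisy = x + np.random.normal(0.0, noiseStd, x.shape)   # corrupt only the input
    dx = GetGrad(W, xNoisy, y)                              # the target y stays clean
    cache += dx**2
    W += - learning_rate * dx / (np.sqrt(cache) + eps)
The result is shown as follows: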
Conclusion
In this article, we ran experiments with an LSTM that predicts its own input sequence one step ahead. We tried the weighted training method and denoising LSTM, and the latter turned out to be more efficient.