
Recurrent Neural Networks Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano


The code for this post is on GitHub. This is part 4, the last part of the Recurrent Neural Network Tutorial. The previous parts are:

- Recurrent Neural Networks Tutorial, Part 1 – Introduction to RNNs
- Recurrent Neural Networks Tutorial, Part 2 – Implementing a RNN with Python, Numpy and Theano
- Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients

In this post we'll learn about LSTM (Long Short-Term Memory) networks and GRUs (Gated Recurrent Units). LSTMs were first proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber, and are among the most widely used models in Deep Learning for NLP today. GRUs, first used in 2014, are a simpler variant of LSTMs that share many of the same properties. Let's start by looking at LSTMs, and then we'll see how GRUs are different.

LSTM networks

In part 3 we looked at how the vanishing gradient problem prevents standard RNNs from learning long-term dependencies. LSTMs were designed to combat vanishing gradients through a gating mechanism. To understand what this means, let's look at how an LSTM calculates a hidden state $s_t$ (I'm using $\circ$ to mean elementwise multiplication):

$$
\begin{aligned}
i &= \sigma(x_t U^i + s_{t-1} W^i) \\
f &= \sigma(x_t U^f + s_{t-1} W^f) \\
o &= \sigma(x_t U^o + s_{t-1} W^o) \\
g &= \tanh(x_t U^g + s_{t-1} W^g) \\
c_t &= c_{t-1} \circ f + g \circ i \\
s_t &= \tanh(c_t) \circ o
\end{aligned}
$$

These equations look quite complicated, but it's actually not that hard. First, notice that an LSTM layer is just another way to compute a hidden state. Previously, we computed the hidden state as $s_t = \tanh(Ux_t + Ws_{t-1})$. The inputs to this unit were $x_t$, the current input at step $t$, and $s_{t-1}$, the previous hidden state. The output was a new hidden state $s_t$. An LSTM unit does the exact same thing, just in a different way! This is key to understanding the big picture. You can essentially treat LSTM (and GRU) units as black boxes: given the current input and previous hidden state, they compute the next hidden state in some way.

With that in mind, let's try to get an intuition for how an LSTM unit computes the hidden state. Chris Olah has an excellent post that goes into detail on this, and to avoid duplicating his effort I will only give a brief explanation here. I urge you to read his post for deeper insight and nice visualizations. But, to summarize:

- $i$, $f$ and $o$ are called the input, forget and output gates, respectively. Note that they have the exact same equations, just with different parameter matrices. They are called gates because the sigmoid function squashes the values of these vectors between 0 and 1, and by multiplying them elementwise with another vector you define how much of that other vector you want to "let through". The input gate defines how much of the newly computed state for the current input you want to let through. The forget gate defines how much of the previous state you want to let through. Finally, the output gate defines how much of the internal state you want to expose to the external network (higher layers and the next time step). All the gates have the same dimension, the size of your hidden state.
- $g$ is a "candidate" hidden state that is computed based on the current input and the previous hidden state. It is exactly the same equation we had in our vanilla RNN; we just renamed the parameters $U$ and $W$ to $U^g$ and $W^g$. However, instead of taking $g$ as the new hidden state as we did in the RNN, we will use the input gate from above to pick some of it.
- $c_t$ is the internal memory of the unit. It is a combination of the previous memory $c_{t-1}$, multiplied by the forget gate, and the newly computed hidden state $g$, multiplied by the input gate. Thus, intuitively, it defines how we want to combine previous memory and the new input. We could choose to ignore the old memory completely (forget gate all 0's) or ignore the newly computed state completely (input gate all 0's), but most likely we want something in between these two extremes.
- Given the memory $c_t$, we finally compute the output hidden state $s_t$ by multiplying the memory with the output gate. Not all of the internal memory may be relevant to the hidden state used by other units in the network.

LSTM Gating. Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." (2014)
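To make the equations above concrete, here is a minimal numpy sketch of a single LSTM step. This is illustrative only, not the post's Theano code; the parameter dictionary and the matrix-times-column-vector convention are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, s_prev, c_prev, p):
    """One LSTM step, mirroring the equations above.

    x_t: current input vector; s_prev, c_prev: previous hidden state
    and internal memory; p: dict of parameter matrices (hypothetical names).
    """
    i = sigmoid(p["U_i"] @ x_t + p["W_i"] @ s_prev)   # input gate
    f = sigmoid(p["U_f"] @ x_t + p["W_f"] @ s_prev)   # forget gate
    o = sigmoid(p["U_o"] @ x_t + p["W_o"] @ s_prev)   # output gate
    g = np.tanh(p["U_g"] @ x_t + p["W_g"] @ s_prev)   # candidate state
    c_t = c_prev * f + g * i       # new internal memory
    s_t = np.tanh(c_t) * o         # new hidden state
    return s_t, c_t
```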

Intuitively, plain RNNs could be considered a special case of LSTMs. If you fix the input gate to all 1's, the forget gate to all 0's (you always forget the previous memory) and the output gate to all 1's (you expose the whole memory), you almost get a standard RNN. There's just an additional $\tanh$ that squashes the output a bit. The gating mechanism is what allows LSTMs to explicitly model long-term dependencies. By learning the parameters for its gates, the network learns how its memory should behave.

Notably, there exist several variations on the basic LSTM architecture. A common one is creating peephole connections that allow the gates to depend not only on the previous hidden state $s_{t-1}$, but also on the previous internal state $c_{t-1}$, adding an additional term to the gate equations. There are many more variations. LSTM: A Search Space Odyssey empirically evaluates different LSTM architectures.

GRUs

The idea behind a GRU layer is quite similar to that of an LSTM layer, as are the equations:

$$
\begin{aligned}
z &= \sigma(x_t U^z + s_{t-1} W^z) \\
r &= \sigma(x_t U^r + s_{t-1} W^r) \\
h &= \tanh(x_t U^h + (s_{t-1} \circ r) W^h) \\
s_t &= (1 - z) \circ h + z \circ s_{t-1}
\end{aligned}
$$

A GRU has two gates, a reset gate $r$, and an update gate $z$. Intuitively, the reset gate determines how to combine the new input with the previous memory, and the update gate defines how much of the previous memory to keep around. If we set the reset gate to all 1's and the update gate to all 0's we again arrive at our plain RNN model. The basic idea of using a gating mechanism to learn long-term dependencies is the same as in an LSTM, but there are a few key differences:

- A GRU has two gates, an LSTM has three gates.
- GRUs don't possess an internal memory ($c_t$) that is different from the exposed hidden state. They don't have the output gate that is present in LSTMs.
- The input and forget gates are coupled by an update gate $z$, and the reset gate $r$ is applied directly to the previous hidden state. Thus, the responsibility of the reset gate in an LSTM is really split up into both $r$ and $z$.
- We don't apply a second nonlinearity when computing the output.

GRU Gating. Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." (2014)
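As with the LSTM above, a minimal numpy sketch of a single GRU step, again illustrative only, with parameter naming my own assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, s_prev, p):
    """One GRU step, mirroring the equations above."""
    z = sigmoid(p["U_z"] @ x_t + p["W_z"] @ s_prev)        # update gate
    r = sigmoid(p["U_r"] @ x_t + p["W_r"] @ s_prev)        # reset gate
    h = np.tanh(p["U_h"] @ x_t + p["W_h"] @ (s_prev * r))  # candidate state
    s_t = (1.0 - z) * h + z * s_prev   # note: no extra nonlinearity here
    return s_t
```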

GRU vs LSTM

Now that you've seen two models to combat the vanishing gradient problem you may be wondering: which one to use? GRUs are quite new (2014), and their tradeoffs haven't been fully explored yet. According to empirical evaluations in Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling and An Empirical Exploration of Recurrent Network Architectures, there isn't a clear winner. In many tasks both architectures yield comparable performance, and tuning hyperparameters like layer size is probably more important than picking the ideal architecture. GRUs have fewer parameters ($U$ and $W$ are smaller) and thus may train a bit faster or need less data to generalize. On the other hand, if you have enough data, the greater expressive power of LSTMs may lead to better results.

Implementation

Let's return to the implementation of the Language Model from part 2 and use GRU units in our RNN. There is no principled reason why I've chosen GRUs instead of LSTMs in this part (other than that I also wanted to become more familiar with GRUs). Their implementations are almost identical, so you should be able to modify the code to go from GRU to LSTM quite easily by changing the equations.

We base the code on our previous Theano implementation. Remember that a GRU (LSTM) layer is just another way of computing the hidden state. So all we really need to do is change the hidden state computation in our forward propagation function.
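As a rough sketch, the per-step computation looks something like the following, assuming Theano shared variables `E`, `U`, `W`, `b`, `V` and `c` set up as in the repository (the exact code is on GitHub and may differ in details):

```python
import theano.tensor as T

# E, U, W, b, V, c are Theano shared variables defined during initialization
def forward_prop_step(x_t, s_t_prev):
    # Word embedding lookup (E is the embedding matrix, more on that below)
    x_e = E[:, x_t]
    # GRU gates; hard_sigmoid is a cheap piecewise-linear sigmoid
    z_t = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s_t_prev) + b[0])
    r_t = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s_t_prev) + b[1])
    h_t = T.tanh(U[2].dot(x_e) + W[2].dot(s_t_prev * r_t) + b[2])
    s_t = (T.ones_like(z_t) - z_t) * h_t + z_t * s_t_prev
    # Output distribution over the vocabulary
    o_t = T.nnet.softmax(V.dot(s_t) + c)[0]
    return [o_t, s_t]
```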

In our implementation we also added bias units. It's quite typical that these are not shown in the equations. Of course we also need to change the initialization of our parameters $U$ and $W$ because they now have different sizes. I don't show the initialization code here, but it is on GitHub. I also added a word embedding layer, but more on that below.

That was pretty simple. But what about the gradients? We could derive the gradients for $U$ and $W$ by hand using the chain rule, just like we did before. But in practice most people use libraries like Theano that support auto-differentiation of expressions. If you are somehow forced to calculate the gradients yourself, you probably want to modularize different units and have your own version of auto-differentiation using the chain rule. We let Theano calculate the gradients for us:
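Concretely, this amounts to one `T.grad` call per parameter, roughly along these lines (assuming `cost` is the cross-entropy error of the network output, as in part 2):

```python
# cost is a Theano scalar expression; T.grad differentiates it symbolically
dE = T.grad(cost, E)
dU = T.grad(cost, U)
dW = T.grad(cost, W)
db = T.grad(cost, b)
dV = T.grad(cost, V)
dc = T.grad(cost, c)
```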

That's pretty much it. To get better results we also use a few additional tricks in our implementation.

Using rmsprop for parameter updates

In part 2 we used the most basic version of Stochastic Gradient Descent (SGD) to update our parameters. It turns out this isn't such a great idea. If you set your learning rate low enough, SGD is guaranteed to make progress towards a good solution, but in practice that would take a very long time. There exist a number of commonly used variations on SGD, including the (Nesterov) Momentum Method, AdaGrad, AdaDelta and rmsprop. This post contains a good overview of many of these methods. I'm also planning to explore the implementation of each of these methods in detail in a future post. For this part of the tutorial I chose to go with rmsprop. The basic idea behind rmsprop is to adjust the learning rate per-parameter according to a (smoothed) sum of the previous gradients. Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their gradients is larger), and rare features get a larger learning rate.

The implementation of rmsprop is quite simple. For each parameter we keep a cache variable, and during gradient descent we update the parameter and the cache as follows (example for $W$):
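A numpy-style sketch of the update, with `dW` the gradient of the loss with respect to `W`:

```python
# Smoothed sum of squared gradients, then a per-parameter scaled step
cacheW = decay * cacheW + (1 - decay) * dW ** 2
W = W - learning_rate * dW / np.sqrt(cacheW + 1e-6)
```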

The decay is typically set to 0.9 or 0.95, and the 1e-6 term is added to avoid division by 0.

Adding an embedding layer

Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training. Intuitively, you are telling the network which words are similar so that it needs to learn less about the language. Using pre-trained vectors is particularly useful if you don't have a lot of data because it allows the network to generalize to unseen words. I didn't use pre-trained word vectors in my experiments, but adding an embedding layer (the matrix $E$ in our code) makes it easy to plug them in. The embedding matrix is really just a lookup table – the ith column vector corresponds to the ith word in our vocabulary. By updating the matrix $E$ we are learning word vectors ourselves, but they are very specific to our task (and data set) and not as general as those that you can download, which are trained on millions or billions of documents.
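For illustration, a tiny numpy sketch of the lookup-table view (the shapes match the model in the Results section below; the random initialization is just a placeholder):

```python
import numpy as np

vocab_size, embedding_dim = 8000, 48
# E's ith column is the vector for the ith word in the vocabulary;
# pre-trained word2vec/GloVe vectors could simply be copied into E.
E = np.random.uniform(-0.1, 0.1, (embedding_dim, vocab_size))
x_e = E[:, 42]  # 48-dimensional embedding for the word with index 42
```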

Adding a second GRU layer

Adding a second layer to our network allows our model to capture higher-level interactions. You could add additional layers, but I didn't try that for this experiment. You'll likely see diminishing returns after 2-3 layers, and unless you have a huge amount of data (which we don't) more layers are unlikely to make a big difference and may lead to overfitting.

Adding a second layer to our network is straightforward: we (again) only need to modify the forward propagation calculation and the initialization function.
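A sketch of how the step function might be extended (the indexing of `U`, `W` and `b` here is my own assumption; the real code is in the repository):

```python
def forward_prop_step(x_t, s1_prev, s2_prev):
    x_e = E[:, x_t]
    # Layer 1: same GRU computation as before
    z1 = T.nnet.hard_sigmoid(U[0].dot(x_e) + W[0].dot(s1_prev) + b[0])
    r1 = T.nnet.hard_sigmoid(U[1].dot(x_e) + W[1].dot(s1_prev) + b[1])
    h1 = T.tanh(U[2].dot(x_e) + W[2].dot(s1_prev * r1) + b[2])
    s1 = (T.ones_like(z1) - z1) * h1 + z1 * s1_prev
    # Layer 2: identical form, but its input is layer 1's hidden state
    z2 = T.nnet.hard_sigmoid(U[3].dot(s1) + W[3].dot(s2_prev) + b[3])
    r2 = T.nnet.hard_sigmoid(U[4].dot(s1) + W[4].dot(s2_prev) + b[4])
    h2 = T.tanh(U[5].dot(s1) + W[5].dot(s2_prev * r2) + b[5])
    s2 = (T.ones_like(z2) - z2) * h2 + z2 * s2_prev
    # The output layer now reads from the top GRU layer
    o_t = T.nnet.softmax(V.dot(s2) + c)[0]
    return [o_t, s1, s2]
```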

The full code for the GRU network is available here.

A note on performance

I've gotten questions about this in the past, so I want to clarify that the code I showed here isn't very efficient. It's optimized for clarity and was primarily written for educational purposes. It's probably good enough to play around with the model, but you should not use it in production or expect to train on a large dataset with it. There are many tricks to optimize RNN performance, but perhaps the most important one would be to batch together your updates. Instead of learning from one sentence at a time, you want to group sentences of the same length (or even pad all sentences to have the same length) and then perform large matrix multiplications and sum up gradients for the whole batch. That's because such large matrix multiplications are handled efficiently by a GPU. By not doing this we get little speed-up from using a GPU and training can be extremely slow.
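As a minimal illustration of the length-grouping idea (a hypothetical helper, not part of the post's code):

```python
from collections import defaultdict

def batches_by_length(sentences, batch_size):
    """Group token-index sentences by length so that each batch can be
    stacked into a single matrix and processed in one pass."""
    buckets = defaultdict(list)
    for sent in sentences:
        buckets[len(sent)].append(sent)
    for same_length in buckets.values():
        for i in range(0, len(same_length), batch_size):
            yield same_length[i:i + batch_size]
```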

So, if you want to train a large model I highly recommend using one of the existing Deep Learning libraries that are optimized for performance. A model that would take days or weeks to train with the above code will only take a few hours with these libraries. I personally like Keras, which is quite simple to use and comes with good examples for RNNs.

Results

To spare you the pain of training a model over many days, I trained a model very similar to that in part 2. I used a vocabulary size of 8000, mapped words into 48-dimensional vectors, and used two 128-dimensional GRU layers. The iPython notebook contains code to load the model so you can play with it, modify it, and use it to generate text.

Here are a few good examples of the network output (capitalization added by me).

- I am a bot , and this action was performed automatically .
- I enforce myself ridiculously well enough to just youtube.
- I've got a good rhythm going !
- There is no problem here, but at least still wave !
- It depends on how plausible my judgement is . ( with the constitution which makes it impossible )

It is interesting to look at the semantic dependencies of these sentences over multiple time steps. For example, bot and automatically are clearly related, as are the opening and closing brackets. Our network was able to learn that, pretty cool!

That’s it for now. I hope you had fun and please leave questions/feedback in the comments!

