What is the most efficient way to implement multi-layer RNNs in TensorFlow?


I’m trying to figure out whether it’s more efficient to run an RNN over the whole input sequence and then run another RNN on its outputs, repeatedly (one horizontal layer at a time), or to run every layer for a single time step before moving to the next time step (one vertical slice at a time).

I know TensorFlow's MultiRNNCell class does the latter. Why is this method chosen over the former? Is the former equally efficient? Are there cases where going one time step at a time through all layers is preferable?

See http://karpathy.github.io/2015/05/21/rnn-effectiveness/ for reference on multi-layer RNNs.
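For concreteness, the two orderings can be sketched with a toy plain-Python RNN. This is a hypothetical illustration, not a TensorFlow API: `step` is an assumed single-cell update `h' = tanh(x @ W + h @ U)`, and both loop orders compute exactly the same outputs.

```python
import numpy as np

def step(h, x, W, U):
    # Assumed vanilla RNN cell update (illustration only).
    return np.tanh(x @ W + h @ U)

def layer_major(inputs, Ws, Us):
    # "Horizontal": run layer 0 over the whole sequence,
    # then feed its outputs to layer 1, and so on.
    seq = inputs
    for W, U in zip(Ws, Us):
        h = np.zeros(W.shape[1])
        out = []
        for x in seq:
            h = step(h, x, W, U)
            out.append(h)
        seq = out
    return seq

def time_major(inputs, Ws, Us):
    # "Vertical": at each time step, propagate through all layers
    # before advancing to the next time step (what MultiRNNCell does).
    hs = [np.zeros(W.shape[1]) for W in Ws]
    out = []
    for x in inputs:
        for i, (W, U) in enumerate(zip(Ws, Us)):
            inp = x if i == 0 else hs[i - 1]  # layer i reads layer i-1
            hs[i] = step(hs[i], inp, W, U)
        out.append(hs[-1])
    return out
```

Mathematically the two schedules are interchangeable; they differ only in which intermediate results are live at once, which is where the performance question comes in.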


There are 2 answers

Ben Murdoch

1: How to easily implement an RNN

Use an LSTM cell. They're generally better (they mitigate the vanishing-gradient problem), and TensorFlow makes them easy to set up:

```python
import tensorflow as tf
from tensorflow.python.ops.rnn_cell import BasicLSTMCell

cell = BasicLSTMCell(state_dim)
# Note: newer TensorFlow versions require a distinct cell object per layer,
# e.g. [BasicLSTMCell(state_dim) for _ in range(num_layers)].
stacked_lstm = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers,
                                           state_is_tuple=True)
```

Find out more on the TensorFlow website: https://www.tensorflow.org/tutorials/recurrent/

2: Horizontal or deep?

Just like you can have a multi-layer neural network, you can have a multi-layer RNN. Think of the RNN cell as a layer within your network: a special layer that lets you remember sequential inputs. In my experience you will still have linear transforms (depth) within your network, but whether to stack multiple layers of LSTM cells depends on your network topology, your preference, and your computational budget (the more the merrier). The number of inputs and outputs depends on your problem, and as far as I can remember there is no such thing as multiple "horizontal" RNN cells, just depth: all computation is done depth-wise, one input at a time. The multi-layer function you referenced handles all of that under the hood; just tell it how many cells you want and it does the rest.

Good Luck

Mr Tsjolder from codidact

If you run everything sequentially, there should not be much of a performance difference between the two approaches (unless I am overlooking something with cache locality here). The main advantage of the latter approach is that you can parallelise the computation across layers.

For example, instead of waiting for the inputs to propagate through two layers, you can already start computing the next time step in the first layer while the result of the current time step is still propagating through the second layer.
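To make that parallelism concrete, here is a small sketch (a hypothetical helper, not part of TensorFlow) of the "wavefront" schedule this enables. Cell (layer l, time t) depends only on (l, t - 1) and (l - 1, t), so every cell on the same anti-diagonal l + t can run simultaneously:

```python
def wavefronts(num_layers, num_steps):
    """Group (layer, time) cells into batches that can run in parallel.

    Cell (l, t) depends on (l, t - 1) and (l - 1, t), so all cells
    sharing the same l + t are mutually independent.
    """
    fronts = []
    for s in range(num_layers + num_steps - 1):
        front = [(l, s - l) for l in range(num_layers)
                 if 0 <= s - l < num_steps]
        fronts.append(front)
    return fronts
```

With 2 layers and 3 time steps this yields 4 parallel waves instead of 6 strictly sequential cell evaluations, and the gap grows with longer sequences and deeper stacks.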

Disclaimer: I would not consider myself a performance expert.