
Higher Level API for RNN #3930

Closed
sxjscience opened this issue Nov 22, 2016 · 33 comments

Comments

@sxjscience
Member

We've created a higher level API for recurrent neural networks and have completed gradient tests, forward tests, and a speed comparison against cuDNN. The class definition and key methods look like this:

class RNN(object):
    """High level API for constructing stacked RNN layers.

    To use a recurrent neural network, we can first create an RNN object and use the step function
    during the symbol construction.

    Currently four types of RNN are supported, and all parameters per layer are grouped into four matrices.
    The data layout and transition rules are similar to those of the RNN API in cuDNN (https://developer.nvidia.com/cudnn):
    1) ReLU RNN:
        h_t = ReLU(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})

        Parameters:
            W_{i2h} = W_i
            b_{i2h} = b_{W_i}
            W_{h2h} = R_i
            b_{h2h} = b_{R_i}
    2) Tanh RNN:
        h_t = tanh(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})

        Parameters:
            W_{i2h} = W_i
            b_{i2h} = b_{W_i}
            W_{h2h} = R_i
            b_{h2h} = b_{R_i}
    3) LSTM:
        i_t = \sigma(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})
        f_t = \sigma(W_f x_t + R_f h_{t-1} + b_{W_f} + b_{R_f})
        o_t = \sigma(W_o x_t + R_o h_{t-1} + b_{W_o} + b_{R_o})
        c^\prime_t = tanh(W_c x_t + R_c h_{t-1} + b_{W_c} + b_{R_c})
        c_t = f_t \circ c_{t-1} + i_t \circ c^\prime_t
        h_t = o_t \circ tanh(c_t)

        Parameters: (input_gate, forget_gate, new_mem, output_gate)
            W_{i2h} = [W_i, W_f, W_c, W_o]
            b_{i2h} = [b_{W_i}, b_{W_f}, b_{W_c}, b_{W_o}]
            W_{h2h} = [R_i, R_f, R_c, R_o]
            b_{h2h} = [b_{R_i}, b_{R_f}, b_{R_c}, b_{R_o}]
    4) GRU:
        i_t = \sigma(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i})
        r_t = \sigma(W_r x_t + R_r h_{t-1} + b_{W_r} + b_{R_r})
        h^\prime_t = tanh(W_h x_t + r_t \circ (R_h h_{t-1} + b_{R_h}) + b_{W_h})
        h_t = (1 - i_t) \circ h^\prime_t + i_t \circ h_{t-1}

        Parameters: (reset_gate, update_gate, new_mem)
            W_{i2h} = [W_r, W_i, W_h]
            b_{i2h} = [b_{W_r}, b_{W_i}, b_{W_h}]
            W_{h2h} = [R_r, R_i, R_h]
            b_{h2h} = [b_{R_r}, b_{R_i}, b_{R_h}]
    """
    def __init__(self, num_hidden, data_dim, typ='lstm',
                 dropout=0., zoneout=0.,
                 i2h_weight=None, i2h_bias=None,
                 h2h_weight=None, h2h_bias=None,
                 init_h=None, init_c=None,
                 cudnn_opt=False,
                 name='LSTM'):
        """Initialization of the RNN object

        Parameters
        ----------
        num_hidden : list or tuple
            Size of the hidden state for all the layers
        data_dim : int
            Dimension of the input data to the symbol
        typ : str
            Type of the recurrent neural network; can be 'gru', 'lstm', 'rnn_relu' or 'rnn_tanh'.
        dropout : list or tuple, optional
            Dropout ratios for all the hidden layers. Use 0 to indicate no dropout.
        zoneout : list or tuple, optional
            Zoneout ratios for all the hidden layers. Use 0 to indicate no zoneout.
        i2h_weight : list or tuple, optional
            Weight of the connections between the input and the hidden state.
        i2h_bias : list or tuple, optional
            Bias of the connections between the input and the hidden state.
        h2h_weight : list or tuple, optional
            Weights of the connections (including gates) between the hidden states of consecutive timesteps.
        h2h_bias : list or tuple, optional
            Biases of the connections (including gates) between the hidden states of consecutive timesteps.
        init_h : list or tuple, optional
            Initial hidden states of all the layers
        init_c : list or tuple, optional
            Initial cell states of all the layers. Only applicable when `typ` is "LSTM"
        cudnn_opt : bool, optional
            If True, the cuDNN version of the RNN will be used. In that case the generated symbol can only be
            used on GPU and `zoneout` cannot be used.
        name : str
            Name of the object
        """
    def step(self, data, prev_h=None, prev_c=None, seq_len=1, ret_typ="all"):
        """Feed the data sequence into the RNN and get the state symbols.

        Parameters
        ----------
        data : list or tuple or Symbol
            The input data. Shape: (seq_len, batch_size, data_dim)
        prev_h : list or tuple or Symbol or None, optional
            The initial hidden states. If None, the symbols constructed during initialization
            will be used.
            The number of initial states must equal the number of layers,
            e.g., [h0, h1, h2] for a 3-layer RNN.
        prev_c : list or tuple or Symbol or None, optional
            The initial cell states. Only applicable when `typ` is 'lstm'. If None,
            the symbols constructed during initialization will be used.
            The number of initial states must equal the number of layers,
            e.g., [c0, c1, c2] for a 3-layer LSTM.
        seq_len : int, optional
            Length of the data sequence
        ret_typ : str, optional
            Determines which parts of the states to return; can be 'all', 'out' or 'state'.
            IMPORTANT!! When `cudnn_opt` is on, only the 'out' flag is supported.
            If 'all', symbols representing the states of all the timesteps as well as
             the states of the last timestep will be returned,
                e.g., for a 3-layer GRU and a length-10 data sequence, the return value will be
                     ([h0, h1, h2], [h0_9, h1_9, h2_9]).
                      Here every hi has shape (seq_len, batch_size, num_hidden[i]) and
                      every hi_j has shape (batch_size, num_hidden[i]).
                     For a 3-layer LSTM and a length-10 data sequence, the return value contains both states and cells:
                     ([h0, h1, h2], [c0, c1, c2], [h0_9, h1_9, h2_9], [c0_9, c1_9, c2_9])
            If 'out', the state outputs of the layers will be returned,
                e.g., for a 3-layer GRU/LSTM and a length-10 data sequence, the return value will be
                     [h0, h1, h2]
            If 'state', the last states/cells will be returned,
                e.g., for a 3-layer GRU and a length-10 data sequence, the return value will be
                     [h0_9, h1_9, h2_9]
                     For a 3-layer LSTM and a length-10 data sequence, the return value will be
                     ([h0_9, h1_9, h2_9], [c0_9, c1_9, c2_9])

        Returns
        -------
        tuple
            States generated by feeding the data sequence to the network.

            Depending on `ret_typ`, the tuple contains the per-timestep state outputs,
            the states of the last timestep, or both (see `ret_typ` above).

        """

We've decided to submit a pull request for this feature @leezu.

Should we create a new directory under "python/mxnet", like "operators", to store these kinds of composed symbols? What do you think? @pluskid @piiswrong @tqchen @xlvector @sbodenstein

@pluskid
Contributor

pluskid commented Nov 22, 2016

Great! Thanks a lot! We should have had this built into the standard package long ago instead of the bare-bones unroll function in the examples. Since there are not going to be many different Python-composed symbols in the standard library for now, I guess simply putting it in python/mxnet/rnn.py would be fine?

Also, if I understand correctly, the RNN class is a constructor whose step function needs to be called to compose a symbol, right? This is a different convention from the cuDNN RNN cell, whose constructor is itself a composition function. Maybe we need to think twice about the naming here to avoid confusion between the two cases.

@sxjscience
Member Author

@pluskid Yes, the RNN class here is a symbol constructor. I'm also thinking about the naming issue. Maybe call it "RNNFactory" to distinguish it from the cuDNN version?

@sxjscience
Member Author

sxjscience commented Nov 22, 2016

@leezu @jennyzhang0215 Let's open the PR by this weekend (Dec 4th) and add some examples.

@xlvector
Contributor

Great!

I remember the cuDNN RNN cell needs the data to be transposed before input, so what is the input shape of this operator? Also, I find the current non-cuDNN version only reaches 30%~50% GPU utilization, and it's not easy to reach 100%. You mentioned you have done speed tests against the cuDNN version; are there any reports?

@sxjscience
Member Author

@xlvector I find that cuDNN is 3 to 6 times faster than the original implementation. The input shape is chosen to be the same as cuDNN's, i.e., (seq_len, batch_size, data_dim).
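
(For illustration, a generic sketch that is not part of the proposed API: if a data iterator yields batch-major arrays of shape (batch_size, seq_len, data_dim), something like the following converts them to the time-major layout expected here.)

import mxnet as mx

# Batch-major input: (batch_size, seq_len, data_dim)
batch_major = mx.nd.ones((32, 10, 128))

# Swap the first two axes to obtain the time-major layout:
# (seq_len, batch_size, data_dim)
time_major = mx.nd.transpose(batch_major, axes=(1, 0, 2))
print(time_major.shape)  # (10, 32, 128)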

@leezu
Contributor

leezu commented Nov 22, 2016

The iterators defined in https://github.com/dmlc/mxnet/blob/master/example/rnn-time-major/bucket_io.py are helpful for getting the correct (time-major) input shape more easily. One could also add a more general version to https://github.com/dmlc/mxnet/blob/master/python/mxnet/io.py. What do you think?

@zhenlinluo
Contributor

My understanding is that the RNN input and output shapes are already defined in the InferShape API in rnn-inl.h, so I am using them in my RNN implementation:
// data: [sequence len, batch, input dimension]
// Hidden shape is dim [total_layers, batch, state_size]
// Cell shape is dim [total_layers, batch, state_size]
// output: [sequence len, batch, num_direction * state_size]
// outStateShape: [layer_num, batch, state size]

@piiswrong
Contributor

@mli

@sbodenstein
Contributor

@sxjscience: you should be explicit about which dropout you are referring to. Is it the version from "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" (https://arxiv.org/pdf/1512.05287.pdf), or the older and less effective one (arXiv:1409.2329v5)? (Or support both as options.)

@sxjscience
Member Author

@sbodenstein Currently we've only implemented the old version. The dropout method from the NIPS paper should be added later. Also, the performance boost in that paper is partly due to the dropout applied to the embedding layer, which we can also support.

@sxjscience
Member Author

sxjscience commented Nov 23, 2016

@ZhenlinGuo The problem may be that, by doing this, we cannot support stacked RNNs with different hidden state sizes. The API will be easier to use if we separate the weights, biases and states.

@zhenlinluo
Contributor

Hi all, I have one question: if I run the RNN layer and the return is a combination of kOut, kStateOut and kStateCellOut, then in Python how can I get the top-right element of kStateOut and kStateCellOut? What structure or API can I use? In rnn_cell_demo, this line is used after calling RNN:
hidden = mx.sym.Reshape(data=rnn, shape=(-1, num_hidden))

But I don't quite understand it. For seq2seq, h, y and c will be output, but how can I get just the top-right of h and c from the stream?
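
(As a side note, a symbol with several outputs can usually be indexed directly in Python. A rough sketch, assuming the built-in mx.sym.RNN operator with state_outputs=True; the variable names are illustrative:)

import mxnet as mx

data = mx.sym.Variable('data')          # (seq_len, batch, input_dim)
params = mx.sym.Variable('rnn_params')
init_h = mx.sym.Variable('init_h')      # (num_layers, batch, state_size)
init_c = mx.sym.Variable('init_c')

rnn = mx.sym.RNN(data=data, parameters=params, state=init_h, state_cell=init_c,
                 state_size=512, num_layers=1, mode='lstm', state_outputs=True)

# With state_outputs=True the symbol exposes three outputs:
outs = rnn[0]     # kOut: per-step outputs, (seq_len, batch, state_size)
last_h = rnn[1]   # kStateOut: final hidden state, (num_layers, batch, state_size)
last_c = rnn[2]   # kStateCellOut: final cell state, (num_layers, batch, state_size)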

@magic282

@sxjscience Hi, what do you mean by a "3/6 times faster" speedup with cuDNN? A 3-6 times speedup?

@sxjscience
Member Author

@magic282 Yes, cuDNN is faster.

@magic282

magic282 commented Nov 23, 2016

@sxjscience I strongly suggest that we have a benchmark for RNN-related models, such as an RNN LM, s2s and attention, comparing against the popular Theano and TF. Actually, I have implemented an s2s + attention NMT model with mxnet and got state-of-the-art results on the IWSLT data. However, the training speed is not very fast. According to this benchmark, https://github.com/guolinke/deep-learning-benchmarks/blob/master/result.md , mxnet is faster than the other tools for the FCN models but slower for LSTM. I am really curious about the reason. But this is just a reference, since I don't really believe that Theano is much faster than Torch.

And about the cuDNN speedup: our group has built a DL framework from scratch and we are planning to leverage cuDNN to speed up training. We can only get a 2x speedup at most, since our tool is already very fast for RNN models (because of some RNN-specific optimizations). So I am wondering whether mxnet could add some optimizations for RNN models without relying on cuDNN?

@sxjscience
Member Author

@magic282 Yes, we need to include such a benchmark. Would you mind sharing your implementation? In fact, we don't have an s2s + attention example yet. It will be easier for us to investigate why the speed is not satisfactory if we have the example code.

@magic282

@sxjscience Sure, I will refactor the code and share it and hope that we can speed it up.

@zhenlinluo
Contributor

@magic282, do you have early code that calls the RNN layer rather than the lstm_unroll Python code? I have already implemented an MKL-based RNN for CPU and am going to release it later, but I lack an s2s model to test the performance.

@magic282

@zhenlinluo Nope.

@zhenlinluo
Contributor

@magic282 Since you have an s2s model based on the cuDNN RNN layer, could you please help answer my question about what needs to be returned after calling mx.sym.RNN(xxxx, state_output=True) in the encoder and decoder functions? The output TBlob will include kOut, kStateOut and kCell, but I just need to return the 2D arrays kStateOut and kCell of the top-right cell as input to the decoder. How do I do that?

@magic282

@zhenlinluo I don't have an s2s model based on the cuDNN RNN symbol, since I implemented the LSTM/GRU using basic ops.

@zhenlinluo
Contributor

@mli @piiswrong Do you know how to get a specific output in Python when a symbol has multiple outputs?

@magic282

@sxjscience Hi, I have uploaded the s2s + attention code. It might have some bugs, since this time I can only get a BLEU score of 44. Link: https://github.com/magic282/MXNMT

@sxjscience
Member Author

@magic282 Is it possible to revise the code to run on the WMT14 or WMT15 dataset? We'd better do a paired test against TensorFlow.

@magic282

Actually, I am not very familiar with the MT task. Could you please provide the data format? Is it the reference-number issue?

@sxjscience
Member Author

@magic282 We may need to refer to their seq2seq example. https://www.tensorflow.org/versions/r0.12/tutorials/seq2seq/index.html

@goodmansasha

goodmansasha commented Dec 17, 2016

Regarding this discussion (re: Yarin Gal's RNN implementation):

@sxjscience: you should be explicit about which dropout you are referring to. Is it the version from "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" (https://arxiv.org/pdf/1512.05287.pdf), or the older and less effective one (arXiv:1409.2329v5)? (Or support both as options.)

@sbodenstein Currently we've only implemented the old version. The dropout method from the NIPS paper should be added later. Also, the performance boost in that paper is partly due to the dropout applied to the embedding layer, which we can also support.

According to Yarin, his implementation is already in Keras, TensorFlow, and Torch. I think it basically uses the same dropout mask across timesteps of an RNN. The implementations might include examples of handling RNN layers of different sizes, and I also speculate it could allow for uncertainty estimates in RNN predictions as that research continues.

See:
yaringal/BayesianRNN#3
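
(To make the distinction concrete, a minimal NumPy sketch of the idea in that paper: one dropout mask is sampled per sequence and reused at every timestep, instead of being resampled each step. The function name is illustrative, not from the MXNet code.)

import numpy as np

def variational_dropout_mask(batch_size, num_hidden, p, rng=np.random):
    """Sample one inverted-dropout mask per sequence (Gal & Ghahramani style)."""
    keep = 1.0 - p
    return rng.binomial(1, keep, size=(batch_size, num_hidden)) / keep

# During unrolling, the *same* mask multiplies the recurrent input at every step:
#   h_t = cell(x_t, h_{t-1} * mask)
mask = variational_dropout_mask(batch_size=32, num_hidden=256, p=0.3)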

@sxjscience
Member Author

sxjscience commented Dec 18, 2016

@predict-r Thanks very much! I'm busy with my PQE exam and will need to work on this after I finish it. I find that the "dropout" used in the paper is actually a type of "DropConnect" (http://www.jmlr.org/proceedings/papers/v28/wan13.pdf) that directly masks the weight matrix. I'm thinking of keeping its original name.
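
(For comparison, a rough NumPy sketch of the DropConnect idea mentioned above, i.e. masking the weight matrix itself rather than the activations; this uses inverted scaling and is illustrative only.)

import numpy as np

def dropconnect(weight, p, rng=np.random):
    """Randomly zero individual weights (DropConnect) with inverted scaling."""
    keep = 1.0 - p
    mask = rng.binomial(1, keep, size=weight.shape)
    return weight * mask / keep

W_h2h = np.random.randn(256, 256)
W_masked = dropconnect(W_h2h, p=0.3)  # would be applied to the recurrent weights during training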

@magic282

@sxjscience Sorry for the late reply. I think the code will run if we have a parallel corpus. Did it fail when you tried it?

@goodmansasha

@sxjscience It might be called DropConnect, but apparently it's even older. Here is Yann LeCun's write-up on the history: https://www.facebook.com/yann.lecun/posts/10154058859142143 .

@karishmamalkan

@sxjscience Hey, what is the progress on the API for RNN? Is it complete or still in progress?

@sxjscience
Member Author

@karishmamalkan Still in progress; we've decided to also add a batchnorm example. I will finish it after I finish the PQE. You can view some of the code here: https://github.com/ML-HK/mxnet/blob/master/python/mxnet/recurrent.py

@karishmamalkan

karishmamalkan commented Jan 5, 2017

Thanks @sxjscience. I wanted to know whether this is a working version of the code. When I try to import recurrent.py, I get an error about importing "utils". Is something missing?
