[MXNET-96] Language model with Google's billion words dataset #10025
Conversation
Language model with Google's billion words dataset (#197)
@@ -214,9 +238,7 @@ If ``no_bias`` is set to be true, then the ``bias`` term is ignored.
.set_attr<nnvm::FInferShape>("FInferShape", FullyConnectedShape)
.set_attr<nnvm::FInferType>("FInferType", FullyConnectedType)
.set_attr<FCompute>("FCompute<cpu>", FullyConnectedCompute<cpu>)
#if MXNET_USE_MKLDNN == 1
Why remove this?
This is for adding sparse support for FC. See line 213.
I see. Both sparse and MKLDNN support need to go through this dispatch.
src/operator/nn/fully_connected.cc
Outdated
std::vector<TBlob> out_blobs(outputs.size());
for (size_t i = 0; i < out_blobs.size(); i++) out_blobs[i] = outputs[i].data();
FullyConnectedCompute<cpu>(attrs, ctx, in_blobs, req, out_blobs);
#endif
I think `FallBackCompute` should be used to fall back the computation to the original CPU implementation.
This block will only be executed when MKL is absent
Why do you not use FallBackCompute for the fallback?
If an input is a sparse matrix, does data() return a dense ndarray? It doesn't seem SetTBlob is doing that.
Does MKL support the kFComputeFallback dispatch mode?
Are you both referring to line 93 - line 99? FallBackCompute is only defined when USE_MKL=1. Can I still use it?
What I need to address is the following case for inference:
- data = dense
- weight = rowsparse
- bias = rowsparse
- output = dense
But I don't know how to deal with this efficiently with USE_MKL=1.
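To make the case above concrete, here is a minimal numpy sketch (a hypothetical helper, not the MXNet kernel): with a row-sparse weight, only the stored rows can produce non-zero output units, so the forward pass scatters into those columns.

```python
import numpy as np

def fc_rowsparse_forward(x, w_rows, w_idx, b_vals, b_idx, num_hidden):
    """Sketch of y = x @ W.T + b where W and b are row-sparse.
    w_rows holds only the non-zero rows of W; w_idx gives their row ids."""
    y = np.zeros((x.shape[0], num_hidden))
    # only hidden units whose weight row is stored can be non-zero
    y[:, w_idx] = x @ w_rows.T
    y[:, b_idx] += b_vals
    return y

# dense data, row-sparse weight/bias (rows 0 and 2 of W are stored)
x = np.array([[1., 2.]])
w_rows = np.array([[1., 0.], [0., 1.]])
y = fc_rowsparse_forward(x, w_rows, [0, 2], np.array([0.5]), [0], 3)
# → [[1.5, 0.0, 2.0]]
```

This is only the dense-output inference path; it sidesteps the MKL layout question entirely.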
When USE_MKL=1, I wanted to simply use the non-MKL FCForward with data.data(), weight.data(), bias.data(), assuming data.data() returns a TBlob with normal CPU layout even if data is in MKL layout. weight.data() and bias.data() should always return normal CPU layout if weight and bias are row_sparse. Please advise.
src/operator/nn/fully_connected.cc
Outdated
#endif
if (valid_data && valid_weight && valid_bias && valid_out) {
  std::vector<TBlob> in_blobs(inputs.size());
  for (size_t i = 0; i < in_blobs.size(); i++) in_blobs[i] = inputs[i].data();
Actually, if an NDArray uses the MKLDNN format, data() will convert its layout inside the array. This caused a race condition, which I just fixed. You shouldn't call data() on an NDArray with MKLDNN format. Please check FallBackCompute to see how it is handled correctly.
I also don't understand what happens if the input array is row sparse. If the row sparse array doesn't have zero-entry rows, it works fine. But if the row sparse array has zero-entry rows, isn't the memory returned from data() smaller than we expect?
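The concern about data() returning a smaller buffer can be illustrated with a toy row-sparse container in numpy (hypothetical class, not MXNet's NDArray): the stored buffer covers only the non-zero rows, so its shape disagrees with the logical shape.

```python
import numpy as np

class RowSparse:
    """Toy row-sparse array: only non-zero rows are materialized."""
    def __init__(self, dense):
        self.shape = dense.shape
        self.indices = np.flatnonzero(np.abs(dense).sum(axis=1))
        self.values = dense[self.indices]  # stored rows only
    def data(self):
        # analogous to calling data() on a row-sparse NDArray: the buffer
        # is SMALLER than the logical shape when zero rows were dropped
        return self.values
    def to_dense(self):
        out = np.zeros(self.shape)
        out[self.indices] = self.values
        return out

w = np.array([[1., 2.], [0., 0.], [3., 4.]])
rs = RowSparse(w)
# rs.shape == (3, 2) but rs.data().shape == (2, 2)
```

A kernel that blindly reads data() as a (3, 2) buffer would run past the end; a dense conversion (to_dense) is needed first.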
So the data() functionality should be improved. The developer doesn't know there's a potential issue when working with other formats of NDArray.
I guess the only option is to disable data() for NDArrays with MKLDNN format.
Please create and link a JIRA ticket
@eric-haibin-lin The RNN.unroll was just updated to support variable length sequences. Need to resolve the conflict.
@@ -181,3 +181,126 @@ def unroll(self, length, inputs, begin_state=None, layout='NTC', merge_outputs=N
outputs, _, _, _ = _format_sequence(length, outputs, layout, merge_outputs)
return outputs, states

class LSTMPCell(HybridRecurrentCell):
Can this be implemented as a ModifierCell? The implementation should be similar to ZoneOut.
Good idea
Is there an easy way to modify state in a ModifierCell (especially during unroll)?
I don't think it's a good abstraction as a modifier cell. The projection only happens on the hidden state of the LSTM, not on the cell state. I am not sure what the expected behavior would be for GRU and RNN cells, where the number of states is just one. Also, the paper calls this LSTMP; it's specific to LSTM. I think it's fine to just call it contrib.LSTMP, which inherits from HybridRecurrentCell.
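The point about projecting only the hidden state can be sketched as a single numpy step (hypothetical shapes and weight names, not the contrib cell's code): the projection Wp is applied to h after the standard LSTM update, while c keeps the full hidden size.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstmp_step(x, h, c, W, U, b, Wp):
    """One LSTMP step: a standard LSTM cell followed by a projection
    of the hidden state only; the cell state c is left untouched."""
    gates = x @ W + h @ U + b              # shape (4 * num_hidden,)
    i, f, g, o = np.split(gates, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)    # shape (num_hidden,)
    return h_new @ Wp, c_new               # h is projected, c is not
```

Note the asymmetry: the returned h has the projection size, while c keeps num_hidden, which is exactly why a generic ModifierCell (one transform over all states) is an awkward fit.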
@@ -56,7 +56,10 @@ static bool FullyConnectedShape(const nnvm::NodeAttrs& attrs,
}
SHAPE_ASSIGN_CHECK(*in_shape, fullc::kWeight, Shape2(param.num_hidden, num_input));
if (!param.no_bias) {
  SHAPE_ASSIGN_CHECK(*in_shape, fullc::kBias, Shape1(param.num_hidden));
  if (!shape_assign(&(*in_shape)[fullc::kBias], Shape1(param.num_hidden)) &&
      !shape_assign(&(*in_shape)[fullc::kBias], Shape2(param.num_hidden, 1))) {
Why should we check it against (num_hidden, 1)?
The bias is trained with sparse embedding, which requires a 2-D shape.
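A tiny sketch of the relaxed check (hypothetical helper mirroring the shape_assign pair above): the bias may arrive as (num_hidden,) from a dense parameter or as (num_hidden, 1) from a sparse embedding, and both flatten to the same 1-D vector.

```python
def bias_shape_ok(shape, num_hidden):
    """Accept either the 1-D or the 2-D column form of the bias."""
    return shape == (num_hidden,) or shape == (num_hidden, 1)

# (num_hidden, 1) reshapes losslessly to (num_hidden,), so FC can
# treat both forms identically after a flatten.
```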
example/rnn/large_word_lm/data.py
Outdated
""" A dataset for truncated bptt with multiple sentences.
Adapted from @rafaljozefowicz's implementation.
"""
def __init__(self, vocab, file_pattern, deterministic=False):
shuffle instead of deterministic, since a fixed random seed may still produce deterministic but shuffled results.
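The distinction can be shown in a few lines (hypothetical helper, not the example's loader): with `shuffle=True` and a fixed seed, the order is shuffled yet identical across runs, i.e. deterministic.

```python
import random

def order_sentences(sentences, shuffle=True, seed=0):
    """Return sentences in (optionally) shuffled order.
    A fixed seed makes the shuffled order reproducible."""
    order = list(range(len(sentences)))
    if shuffle:
        random.Random(seed).shuffle(order)
    return [sentences[i] for i in order]
```

So "deterministic" and "shuffled" are orthogonal, which is why `shuffle` is the clearer flag name.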
ok
example/rnn/large_word_lm/model.py
Outdated
def cross_entropy_loss(inputs, labels, rescale_loss=1):
    """ cross entropy loss """
    criterion = mx.gluon.loss.SoftmaxCrossEntropyLoss()
    loss = criterion.hybrid_forward(S, inputs, labels)
loss = criterion(inputs, labels) should do. Also, rescale_loss can be put in the constructor call of the SoftmaxCELoss using the argument weight.
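For reference, the effect of a constant loss weight is just a global rescale of the per-sample negative log-likelihood, which this numpy sketch (not the Gluon implementation) computes directly:

```python
import numpy as np

def softmax_ce(logits, labels, weight=1.0):
    """Per-sample softmax cross-entropy, rescaled by a constant `weight`
    (the role rescale_loss would play via the loss constructor)."""
    z = logits - logits.max(axis=1, keepdims=True)          # stabilize
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    return weight * nll
```

Because the weight is a constant factor, setting it in the constructor and multiplying the returned loss are equivalent.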
ok
Tensor<xpu, 1, DType> bias = in_data[fullc::kBias].get_with_shape<xpu, 1, DType>(
    Shape1(wmat.shape_[0]), s);
CHECK_EQ(bias.shape_[0], wmat.shape_[0])
    << "bias.data().shape[0] != weight.data().shape[0]. Not supported by FCForward";
Use words to describe the error instead. Also, if flatten is True, the error message for data.data() might not make sense.
example/rnn/large_word_lm/data.py
Outdated
"""
def __init__(self):
    self._token_to_id = {}
    self._token_to_count = {}
collections.Counter?
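The suggestion amounts to something like this sketch (class and method names hypothetical): Counter removes the need to check for missing keys before incrementing.

```python
from collections import Counter

class Vocabulary:
    """Sketch: Counter replaces the hand-rolled count dict."""
    def __init__(self):
        self._token_to_id = {}
        self._token_to_count = Counter()   # missing tokens default to 0
    def add(self, token):
        self._token_to_count[token] += 1   # no explicit key check needed
        if token not in self._token_to_id:
            self._token_to_id[token] = len(self._token_to_id)
```

Counter also gives most_common() for free, which is handy when sorting the vocabulary by frequency.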
ok
Please add a JIRA
22edc00
to
148e345
src/operator/nn/fully_connected.cc
Outdated
// inputs
std::vector<TBlob> in_blobs(inputs.size());
auto get_data = [](const NDArray& nd) -> TBlob {
  if (nd.storage_type() == kDefaultStorage) return nd.Reorder2Default().data();
nd.Reorder2Default() returns a temporary object here, which owns a piece of memory. That memory will be freed when the function returns, and the TBlob will reference freed memory.
Anything else to address?
looks good to me
…#10025) * Language model with Google's billion words dataset (apache#197) Language model with Google's billion words dataset (apache#197) * fix lint * ffix license * patch * fix lint * cr comment * update fc fallback * fix build * fix temp memory in fc * fix compilation
Description
This example reproduces the result (~42 perplexity) of "Exploring the Limits of Language Modeling" on the GBW dataset.
See readme.mk for details. @mli @szha @zheng-da @piiswrong
Checklist
Essentials
- Code style checking (make lint)
Changes
Comments