Embedding Layer is expanded every timestep in LSTM #722
Comments
@antinucleon probably this is one of the reasons why our LSTM isn't as fast as expected?
Will check it tomorrow. Today & tomorrow morning I am benchmarking ELU.
This could be the reason: the weight is bound to a shared variable, and the current graph optimizer is not yet smart enough to change the gradient summation into an addto on the original variable.
Has this problem been fixed?
Probably not. It seems that solving this will require diving into the mshadow code, which is harder.
@JianboTang @Kublai-Jing As tqchen pointed out, I think you can replace simple_bind with bind and directly provide the NDArrays as a temporary solution
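A minimal sketch of that workaround, assuming the pre-Gluon Symbol API; the shapes, variable names, and the `grad_req` choice below are illustrative, not taken from the thread:

```python
import mxnet as mx

vocab_size, embed_dim, batch_size, seq_len = 10000, 256, 32, 25

data = mx.sym.Variable('data')                  # integer token indices
embed_weight = mx.sym.Variable('embed_weight')  # the shared embedding table
embed = mx.sym.Embedding(data=data, weight=embed_weight,
                         input_dim=vocab_size, output_dim=embed_dim)

ctx = mx.cpu()
# Allocate the embedding table and its gradient once, then pass them to bind().
# simple_bind() would instead allocate fresh storage for every bound argument,
# which is what blows up memory when the symbol is unrolled over timesteps.
args = {
    'data': mx.nd.zeros((batch_size, seq_len), ctx=ctx),
    'embed_weight': mx.nd.zeros((vocab_size, embed_dim), ctx=ctx),
}
args_grad = {'embed_weight': mx.nd.zeros((vocab_size, embed_dim), ctx=ctx)}

# grad_req='add' accumulates gradients in place across backward passes,
# i.e. the "addto" behaviour mentioned earlier in the thread.
exe = embed.bind(ctx, args=args, args_grad=args_grad, grad_req='add')
exe.forward(is_train=True)
exe.backward(out_grads=mx.nd.ones((batch_size, seq_len, embed_dim), ctx=ctx))
```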
Yeah that's what I am doing now : ) |
Hopefully fixed by #1697
Hi mxnet,
Not sure if this is true, but the Embedding layer, say with shape (10000, 256), i.e. a 10000-word vocabulary and 256-dimensional embeddings, seems to be expanded at every timestep of the LSTM, even though the weights are shared. I don't know how to verify this, but from what I have observed I believe it is true, and it causes huge memory consumption. I thought that if we share the weights across timesteps, we only need to allocate memory for the actual embedding outputs, (256 x t) for a t-step LSTM in this case, rather than (10000 x 256 x t), which would be the expected cost for a FullyConnected layer. It seems that the current Embedding layer takes the same amount of memory as a FullyConnected layer, but I am not sure whether this is the issue that causes the large memory overhead in the LSTM. Where should I be looking to try to fix this?
Thanks!
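For concreteness, a back-of-the-envelope version of the memory argument above; float32 storage, batch size 1, and t = 50 timesteps are my own illustrative assumptions, not numbers from the issue:

```python
# Rough comparison of the two footprints described in the issue.
vocab, dim, t = 10000, 256, 50          # t (number of timesteps) is hypothetical

outputs_only = dim * 4 * t              # one 256-d float32 vector per step: ~51 KB
table_per_step = vocab * dim * 4 * t    # full 10000 x 256 table per step: ~512 MB

print('%.0f KB vs %.0f MB' % (outputs_only / 1e3, table_per_step / 1e6))
```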