Output/Hidden to Hidden gatings in LSTM-RNN #30
@Sleepwalking what do you mean exactly when you say that memory cells should also be gated by themselves? Currently memory cells project a connection to themselves:
And this connection is gated by the forget gate:
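Roughly, that wiring looks like this with synaptic's Layer API (a sketch rather than the exact architect.js code; blockSize is just an illustrative size):
var Layer = require('synaptic').Layer;
var blockSize = 10;                    // illustrative size only
var memoryCell = new Layer(blockSize);
var forgetGate = new Layer(blockSize);
// each memory cell projects a connection to itself...
var self = memoryCell.project(memoryCell, Layer.connectionType.ONE_TO_ONE);
// ...and that self-connection is gated by the forget gate
forgetGate.gate(self, Layer.gateType.ONE_TO_ONE);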
It's amazing that you got an LSTM to reproduce whole sentences. I'm very interested in this, but I still don't get what the topology of the network you are using looks like. Could you explain a little more or share some of the code you are using to connect the memory cells?
Hello @cazala, precisely speaking, the gates of the memory cells have 1st-order connections from the memory cells themselves:
memoryCell.project(forgetGate);
memoryCell.project(inputGate);
memoryCell.project(outputGate);
Sorry, I just discovered that the implementation I used in the "informal test" didn't include the hidden-to-gates connections (I made lots of local forks, they got mixed up, and I ran the code on the wrong fork...).
I see, so all the memory cells project connections to all the input/forget/output gates? Because currently we have peephole connections projected from each memory cell to its corresponding gates:
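For reference, the peephole wiring amounts to one-to-one projections from each memory cell to its own gates, roughly (a sketch assuming synaptic's Layer API):
// peepholes: each memory cell informs only its own gates, one-to-one
memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);
whereas the hidden-to-gates variant uses plain project() calls, which (if I read the Layer API correctly) default to all-to-all connections.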
It would be great if you could share the results of your tests when you finish them (:
Large networks train notoriously slowly...
The LSTM is trained using stochastic gradient descent with an initial learning rate of 0.01. The learning rate decays by a factor of 0.99 until it falls below 0.005.
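One per-epoch reading of that schedule, as a sketch (trainEpoch and totalEpochs are hypothetical placeholders for a single SGD pass and the epoch budget):
var rate = 0.01;
for (var epoch = 0; epoch < totalEpochs; epoch++) {
  trainEpoch(myLSTM, trainingSet, rate); // hypothetical helper: one SGD pass at the current rate
  if (rate > 0.005) {
    rate *= 0.99; // decay by a factor of 0.99 until the rate drops below 0.005
  }
}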
Result Without Hidden-to-Gates Connections
Output sample:
Result With Hidden-to-Gates Connections
Output sample:
It's kind of weird that the one with hidden-to-gates connections performs even worse than the original one. This is probably because the training hasn't converged yet. It seems the learning rate is so high that the MSE bounces back after 25 epochs.
That's way better than the results I got on the Wikipedia language modeling task. You said you translated the generated code into C... you mean the hardcoded, optimized JavaScript code, right? Could you share the network you used to generate that optimized code (the network itself)? Or did you use the LSTM that's already in the Architect for the first result (without hidden-to-gates)?
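For context, the "hardcoded, optimized JavaScript code" is the standalone export; if I remember the API correctly it is used roughly like this (the sizes are illustrative only):
var myLSTM = new Architect.LSTM(26, 70, 26); // illustrative sizes only
var run = myLSTM.standalone();               // returns a dependency-free, hardcoded function
var output = run(inputVector);               // inputVector: a plain array of 26 numbers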
Yeah, I used the one in Architect.
// hidden-to-gates
memoryCell.project(inputGate);
memoryCell.project(outputGate);
memoryCell.project(forgetGate);
Now I'm running more epochs to see if it works better. Each epoch takes half a minute... better to test this on some simplified tasks. The work by Gers is way more complicated than
Did you export the network from synaptic directly to C?
@menduz Yes. I'm actually working on a fork that directly outputs C code. Once it's fully functional I'll push it to GitHub.
Just a thought: you say that the network with hidden-to-gates connections produces a better output, while the MSE is higher... I read one or two articles stating that for classification tasks like this one, it is sometimes more representative to use cross entropy as the cost function instead of MSE. There's an implementation included in the Trainer (
Maybe you could try that one and see if the hidden-to-gates network produces a lower error. Oh, and +1 for the C code generator (:
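For reference, passing the cost function to the Trainer looks roughly like this (a sketch; I believe the option is named cost and the constant is Trainer.cost.CROSS_ENTROPY, but double-check against the current source; rate and iterations are illustrative):
var trainer = new Trainer(myLSTM);
trainer.train(trainingSet, {
  rate: 0.01,
  iterations: 100,
  cost: Trainer.cost.CROSS_ENTROPY // instead of the default MSE
});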
If you have a draft of the C generator, we have some ideas for SIMD/SSE.
@cazala Thanks for pointing out the cross entropy criterion :) But the Trainer seems to just measure and display the error, instead of actually changing the error function used in back propagation (and https://github.com/cazala/synaptic/blob/master/src/neuron.js#L119 has already adopted the MCE criterion, implicitly). @menduz Nice. I'm also aiming for some SIMD, since I saw things like this in the generated code:
F[132948] = F[132020] * F[132920] * influences[27];
F[132949] = F[132020] * F[132920] * influences[28];
F[132950] = F[132020] * F[132920] * influences[29];
F[132951] = F[65];
F[132952] = F[132020] * F[132951] * influences[0];
F[132953] = F[132020] * F[132951] * influences[1];
F[132954] = F[132020] * F[132951] * influences[2];
F[132955] = F[132020] * F[132951] * influences[3];
F[132956] = F[132020] * F[132951] * influences[4];
F[132957] = F[132020] * F[132951] * influences[5];
F[132958] = F[132020] * F[132951] * influences[6];
F[210] += F[49755] * F[239782];
F[210] += F[49758] * F[239783];
F[210] += F[49761] * F[239784];
F[56820] += F[0] * F[210];
F[210] = F[49674] * F[239786];
F[210] += F[49677] * F[239787];
F[210] += F[49680] * F[239788];
F[210] += F[49683] * F[239789];
F[210] += F[49686] * F[239790];
F[210] += F[49689] * F[239791];
F[210] += F[49692] * F[239792];
which can be easily optimized with packed dot products and element-wise products. We also need to be cache-friendly. Performance may double for small networks because their binaries fit below 64 KB (the size of the L1 instruction cache).
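To make the vectorization point concrete: each of those runs of fused multiply-adds is a (strided) dot product, so if the generator packed the relevant weights and activations contiguously it could emit a loop like the sketch below instead of fully unrolled statements and let the compiler auto-vectorize it (the idea shown in JavaScript; packedDot and the offsets are hypothetical):
// sum of n packed weight/activation products, starting at offsets wi and ai in F
function packedDot(F, wi, ai, n) {
  var sum = 0;
  for (var k = 0; k < n; k++) {
    sum += F[wi + k] * F[ai + k];
  }
  return sum;
}
// e.g. after repacking, one of the F[210] accumulation runs above would collapse
// into a single call such as: F[210] = packedDot(F, w0, a0, 7);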
Fresh Test Result on the New Task
I found this task in Herbert Jaeger's A Tutorial on Training Recurrent Neural Networks. This task is about timing and was originally introduced for testing Echo State Networks. I tested two LSTM configurations (with and without hidden-to-gates connections) on this task.
Task Description
The task has 2 inputs and 1 output. These two figures from the above tutorial describe the task well:
LSTM Topology
var myLSTM = new Architect.LSTM(2, 15, 1);
myLSTM.layers.input.set({squash: Neuron.squash.IDENTITY});
myLSTM.layers.output.set({squash: Neuron.squash.IDENTITY});
The network with h2g connections has the following lines in place of the peephole connections at https://github.com/cazala/synaptic/blob/master/src/architect.js#L110:
// hidden-to-gates
memoryCell.project(inputGate);
memoryCell.project(outputGate);
memoryCell.project(forgetGate);
Training Configuration
Both networks are exported to C and compiled with clang-3.5 (double precision, default optimization).
Data set generation
Once the data set is randomly generated, the same data is used to train both networks.
Nt = 8000;
Ntest = 1000;
t = 1;
train_input = zeros(Nt, 2);
train_target = zeros(Nt, 1);
while(t < Nt - 20)
n = round(rand * 20);
train_input(t, 1) = 1;
train_input(t:t + n, 2) = n / 20;
train_target(t:t + n) = 0.5;
t += n;
n = round(rand * 20);
train_input(t + 1:t + n, 2) = train_input(t, 2);
t += n;
end
Nt -= Ntest;
test_input = train_input(Nt + 1:end, :);
test_target = train_target(Nt + 1:end);
train_input = train_input(1:Nt, :);
train_target = train_target(1:Nt);
Results
LSTM without h2g
To make sure it converges, I ran an extra 100 epochs.
LSTM with h2g
Comments
This is because 15 hidden units are too many for our simple timing task. I tried again with a 2-5-1 topology, and here are the new results. A bug in forward propagation was fixed before this test.
LSTM without H2G
LSTM with H2G
Comments
That's impressive. Do you think we should replace the current LSTM topology with the H2G one? We should also take into account that the latter has more connections. Do you think that if we compared H2G with an LSTM with a similar number of connections (by adding more hidden units), it would still outperform it in the same way? Note: an easy way to know the number of neurons/connections (that is not documented) is to call The number of neurons and connections is global (for all the
@cazala The above results already show that the 2-5-1 network with H2G (28 KB) outperformed the 2-15-1 network without H2G (70 KB).
👍 Maybe H2G should even be the default topology, with one-to-one peepholes as an option.
According to Felix Gers' dissertation [1], the gates of memory cells have connections not only from the input layer, but also from the memory cells themselves. However, Architect currently only projects input-to-output/forget/input gate connections for LSTMs (apart from the peepholes),
which means that the neural network remembers/forgets information based only on its current inputs, which could be disastrous for certain tasks that require long-term memory.
In some other applications memory cells are even gated by the outputs. Besides gating, first-order connections from the output to the hidden layers also appear in some of the literature.
This observation provides insight into why the Wikipedia language modeling task doesn't give promising results even after hours of training. In an informal test with hidden-layer-to-gates connections enabled, the network was able to reproduce text such as "the of the of the of the of the of ..." on its own. I also trained an LSTM with 70 memory cells on some short paragraphs, and the network can exactly reproduce two or three sentences on its own.
I'm going to run further tests to compare the hidden-to-gates connected LSTMs with input-to-gates connected ones.
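Sketched with synaptic's Layer API, the difference amounts to roughly the following (the names inputLayer, inputGate, forgetGate, outputGate, and memoryCell are placeholders for the layers Architect builds; this is an outline, not a patch):
// what Architect wires today (simplified): gates driven by the input layer,
// plus one-to-one peepholes from each cell to its own gates
inputLayer.project(inputGate);
inputLayer.project(forgetGate);
inputLayer.project(outputGate);
memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);
// the proposal: the gates also receive full (all-to-all) connections from the memory cells,
// so remember/forget decisions can depend on the hidden state, not just the current input
memoryCell.project(inputGate);
memoryCell.project(forgetGate);
memoryCell.project(outputGate);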
[1] Gers, Felix. Long short-term memory in recurrent neural networks. PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2001.