Parallelize neuron activation sequence in self-connected layers #31
Conversation
@cazala I think we also need to make the same changes to the back-propagation code:
Please don't merge the commit below (7009132); it fails the timing task.
@Sleepwalking what do you mean exactly when you say that former neurons always overwrite the activations fed to later neurons? Currently we use one-to-one connections for self-connected layers, so if layer L has neurons A, B and C, each of them projects a connection to itself: A would project a connection to A, but not to B and C... so I don't see how this would change the output in the end. Did you get any different results?
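For reference, a minimal sketch of the behaviour being discussed, using synaptic's Layer API (the require path is an assumption; the default behaviour is the one described in this thread):

```js
var Layer = require('synaptic').Layer;

// One-to-one self-connection: with neurons A, B, C this creates
// A->A, B->B, C->C, but no cross connections like A->B or A->C.
var L = new Layer(3);
L.project(L); // a self-projection defaults to ONE_TO_ONE

// An all-to-all self-connection has to be requested explicitly;
// this is the case the PR is concerned with.
var M = new Layer(3);
M.project(M, Layer.connectionType.ALL_TO_ALL);
```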
Oh... I didn't notice they're one-to-one by default, thanks for pointing that out!
Yes, that would be great. What you say makes sense, so we should test it and see if it improves the outcome of RNNs. The main reason the layers work that way today is that I coded them following this algorithm, which states that "if a layer receives recurrent connections from a layer which has not yet been activated this time-step, the sending layer's activation from the previous time-step is used".
Actually, half a month earlier I also followed Monner's paper and implemented lstm-g in Octave. It seemed to work, but it was 100x slower than synaptic, and that's why I decided to move to synaptic haha :)
@cazala The RNN fails to train after a few epochs, regardless of the learning rate and whether the parallelized activation is used or not. The weights all become NaN. My topology simply adds a hidden-to-hidden all-to-all connection on top of Architect.Perceptron:

RNN = new Network();
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL);
hidden.project(output);
RNN.set({input: input, hidden: [hidden], output: output});
RNN.trainer = new Trainer(RNN);

If that self connection is removed, the network becomes an ordinary MLP and everything is fine...
Yeah, synaptic would be that slow if it wasn't for the hardcoder/optimizer, but the logic is pretty much the same. What you are suggesting for self-connected layers actually makes sense, and it sticks to the principle in the algorithm that states: if layer A has not been activated in this time-step, the activation from the previous time-step is used. So let's say layer A is self-connected (all-to-all), and we activate it. The input has to be layer A's output, and since layer A hasn't been activated in this time-step yet, we need to use layer A's output from the previous time-step. Right now the first neuron in the layer is using the correct input, but the following ones will not be using the right inputs, since we are overwriting them as we activate each individual neuron of the layer. So yeah, this PR makes sense. What I don't know is which task we can use to test two RNNs with fully self-connected layers and make sure that the one with the parallelized activations works better.
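To make the overwriting issue concrete, here is a hedged sketch (not the library's actual implementation) contrasting in-place activation with the buffered update this PR proposes for an all-to-all self-connected layer:

```js
// In-place activation: neuron i reads the activations array after earlier
// neurons in the same layer have already written their new values into it,
// so only the first neuron sees the previous time-step's snapshot.
function activateInPlace(activations, activateNeuron) {
  for (var i = 0; i < activations.length; i++) {
    activations[i] = activateNeuron(i, activations); // overwrites as it goes
  }
  return activations;
}

// Buffered ("parallelized") activation: compute every new value from the
// previous time-step's activations, then commit them all at once.
function activateBuffered(activations, activateNeuron) {
  var next = activations.map(function (_, i) {
    return activateNeuron(i, activations); // every read sees t-1 values
  });
  for (var i = 0; i < next.length; i++) {
    activations[i] = next[i]; // commit only after the whole layer has run
  }
  return activations;
}
```

With the buffered version every neuron in the self-connected layer sees the same previous-time-step snapshot, which is what the "use the previous time-step's activation" rule quoted above requires.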
Mmm, I just read your last comment. I should take a look to see why the weights are becoming NaNs... can you make a new branch and push these changes there, so I can pull them? EDIT: I can just copy the RNN topology actually, no need for a new branch
I created a … P.S. even when the learning rate is 0, the weights still go to NaN at around the 300th iteration in the 1st epoch.
Yep, keep committing here to master. I just tested this code using the version of synaptic that we have on the website:
And tried to train an XOR: it fails to learn the XOR, but the weights are not NaN; it's just that the topology doesn't converge. If I delete the self-connection it converges nicely. Are you sure it wasn't something in these last changes that's making the weights become NaN? EDIT: it can be tested just by pasting the code into the console on the website
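The exact snippet tested on the website is not preserved in this thread; below is a hedged sketch of how such an XOR run on the self-connected topology above might look, using synaptic's documented Trainer.train API (the option values are assumptions, not the ones actually used):

```js
var synaptic = require('synaptic');
var Layer = synaptic.Layer,
    Network = synaptic.Network,
    Trainer = synaptic.Trainer;

// Same topology as above: a hidden layer with an all-to-all self-connection.
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL);
hidden.project(output);

var RNN = new Network({ input: input, hidden: [hidden], output: output });
var trainer = new Trainer(RNN);

// XOR truth table as a training set.
var xorSet = [
  { input: [0, 0], output: [0] },
  { input: [0, 1], output: [1] },
  { input: [1, 0], output: [1] },
  { input: [1, 1], output: [0] }
];

// As discussed above, with the ALL_TO_ALL self-connection this topology
// tends not to converge; without it, it converges nicely.
var results = trainer.train(xorSet, {
  rate: 0.1,          // assumed learning rate
  iterations: 20000,  // assumed iteration cap
  error: 0.005        // assumed target error
});
console.log(results); // { error: ..., iterations: ..., time: ... }
```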
Hmm... seems like there's something wrong with my C code generator
It's caused by numerical error. After 100 iterations I found some of the weights go up to 100 or more. Since I used … P.S. went to sleep... see you later
While training the XOR, some of the weights went up to ~40k after 10,000 iterations, so yeah, it seems they tend to explode with that topology... I'll do some more tests later once I get off work.
@cazala I know why. It's because the lstm-g algorithm is inherently not able to project a self connection the way an ordinary RNN does! The self connection connects the state of j to itself, but not the activity of j to the state of j, which means the state increases linearly with time!

RNN = new Network();
var input = new Layer(2);
var hidden1 = new Layer(20);
var hidden2 = new Layer(20);
var output = new Layer(1);
input.project(hidden1);
hidden1.project(hidden2);
hidden2.project(hidden1, Layer.connectionType.ONE_TO_ONE);
hidden2.project(output);
RNN.set({input: input, hidden: [hidden1, hidden2], output: output});
RNN.trainer = new Trainer(RNN);

Now it is trainable, though the RNN is way worse than LSTM on the timing task (which is consistent with the theory that LSTM is better).
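A small numeric illustration of the point about the state self-connection (a sketch of an lstm-g-style update, not synaptic's code): with a state-to-state weight of 1 and a constant net input, the unsquashed state grows without bound, which is what pushes the weights toward NaN:

```js
// lstm-g-style state update with a self-connection on the *state*:
//   state(t) = selfWeight * state(t-1) + netInput(t)
// (the squashed activity of j is never what feeds back into its own state).
var selfWeight = 1.0; // assumed: an untrained self-weight near 1
var netInput = 0.5;   // assumed: a constant net input, for illustration
var state = 0;
for (var t = 1; t <= 10; t++) {
  state = selfWeight * state + netInput;
  console.log('t=' + t, 'state=' + state); // 0.5, 1.0, 1.5, ... grows linearly
}
```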
That makes sense. How much time / how many iterations does it take for the LSTM to converge on the timing task? At least to an acceptable error... because it would be great to add a spec with that task to our current tests, if it could be solved within a few seconds. Do you think that would be possible, or would it take too much time to solve that task every time the CI runs the tests (i.e. every time we merge a PR)?
Full convergence takes 400 epochs * 7000 iterations (samples) per epoch; that's 7 seconds for the C implementation. I'm afraid that's too much for JavaScript. But you start to see the trend towards convergence after 10 epochs, which may take less than 10 seconds in JavaScript and is acceptable.
I could try to code a JavaScript test for the timing task and see how much time it takes to complete it up to an acceptable error. Regarding the fully connected h2h for LSTM, does it perform better? Did you try it with or without the parallelized activation?
Thanks for the test.
Results for LSTM + h2g + h2h, 500 epochs, MSE on test set
It's really hard to tell which is better.
Added a new connection type: ALL_TO_ELSE, which is equivalent to ALL_TO_ALL but excludes self connections.
Anyway, NaN is bad. Now I'd prefer sequential activation.
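A quick usage sketch of that new connection type, assuming it is exposed alongside the existing types on Layer.connectionType:

```js
var Layer = require('synaptic').Layer;

var hidden = new Layer(20);
// Every neuron projects to every *other* neuron in the layer,
// i.e. ALL_TO_ALL minus the self connections.
hidden.project(hidden, Layer.connectionType.ALL_TO_ELSE);
```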
Maybe we should leave the parallelization out for now if it doesn't provide a significant improvement in performance, for the sake of keeping the code simple. I think the new connection type and the working RNN architecture are good additions to the library though, and I'm especially curious about the DOT language compiler feature, which I haven't played with yet but looks promising.
awesome, thank you (:
Parallelize neuron activation sequence in self-connected layers
Hi, did the activation order (or something else) turn out to be wrong? As the writer of https://github.com/MrMormon/lstm-g, I'm quite curious.
This PR is not about parallelizing the library to run on multiple cores/gpu. It's about a bug in the sequence of activating neurons in self-connected layers.
We need to take great care when activating a self-connected layer (e.g. memoryCell.project(memoryCell);). Before this PR, all neurons in such a layer were activated in sequence, which means the earlier neurons always overwrote the activations fed to the later neurons, so the activations could get mixed up. The right way to do this is to update the activations only after all neurons in the layer have been activated.
This PR fixes the bug in forward propagation, including the hardcoder (the standalone hardcoder is fixed but not tested yet). I'm not sure yet whether a similar bug exists in back propagation. We have to fix these bugs if we want to parallelize (multi-core) it, because we don't want resource conflicts.
The fixed code passes the DSR task as well as the timing task.
Notable changes:
- Neuron got two new properties: .newactivation and .inselfconnectedlayer
- update_sentences was added to the hardcoder, which contains a few lines like this: Neuron.activation = Neuron.newactivation; (sketched below)
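For illustration only, a hypothetical, simplified sketch of the two-step commit these update sentences implement (the names and structure below are made up, not the actual hardcoder output):

```js
function squash(x) { return 1 / (1 + Math.exp(-x)); } // logistic squash, assumed

var neurons = [
  { state: 0.2, activation: 0, newactivation: 0, inselfconnectedlayer: true },
  { state: -0.7, activation: 0, newactivation: 0, inselfconnectedlayer: true }
];

// Forward pass: neurons in a self-connected layer write to .newactivation first...
neurons.forEach(function (n) {
  n.newactivation = squash(n.state);
});

// ...and the appended update sentences commit them afterwards, one line per
// neuron, along the lines of "Neuron.activation = Neuron.newactivation;".
neurons.forEach(function (n) {
  n.activation = n.newactivation;
});
```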