
Parallelize neuron activation sequence in self-connected layers #31

Merged
merged 3 commits on May 1, 2015

Conversation

Sleepwalking
Contributor

This PR is not about parallelizing the library to run on multiple cores/GPUs. It's about a bug in the sequence in which neurons are activated in self-connected layers.

We need to take great care when activating a self-connected layer (e.g. memoryCell.project(memoryCell);). Before this PR, all neurons in such a layer were activated in sequence, which means earlier neurons overwrite the activations that later neurons still need as input, so activations from different time-steps get mixed up. The right way to do this is to update the activations only after all neurons in the layer have been activated.
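A minimal sketch of the two-phase update described above (illustrative only, not the actual patch; computeActivation() here is a hypothetical stand-in for whatever computes a neuron's activation from its inputs):

```js
// Phase 1: compute every neuron's new activation while all neurons still
// expose their previous time-step activations.
// Phase 2: commit the new activations in a second pass, so no neuron reads
// a value that a peer has already overwritten during this time-step.
function activateSelfConnectedLayer(neurons) {
  var pending = neurons.map(function (neuron) {
    return neuron.computeActivation(); // reads peers' old (previous-step) activations
  });
  neurons.forEach(function (neuron, i) {
    neuron.activation = pending[i];
  });
  return pending;
}
```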

This PR fixes the bug in forward propagation, including the hardcoder (the standalone hardcoder is fixed but not tested yet). I'm not sure yet whether a similar bug exists in back propagation. We have to fix these bugs if we want to parallelize the library across multiple cores, because we don't want resource conflicts.

The fixed code passes the DSR task as well as the timing task.

Notable changes:

  • Neuron got two new properties: .newactivation and .inselfconnectedlayer
  • update_sentences was added to the hardcoder; it contains a few lines like Neuron.activation = Neuron.newactivation;

@Sleepwalking
Contributor Author

@cazala I think we also need to make the same changes to the back propagation code:

Please don't merge the commit below (7009132); it fails the timing task.
I've reverted that commit after trying many different propagation sequences; none of them works except the original sequential way. Probably back propagation just doesn't work in the parallel way.
At least the parallelized forward prop (4a9e831 to 867f544) is completely fine.

@cazala
Owner

cazala commented Apr 30, 2015

@Sleepwalking what do you mean exactly when you say that former neurons always overwrite activations fed to later neurons? Currently we use one-to-one connections for self-connected layers, so if layer L has neurons A, B and C, each of them projects a connection to itself: A would project a connection to A, but not to B or C... so I don't see how this would change the output in the end. Did you get different results?
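To make the distinction concrete, a quick sketch using the same Layer API as the other snippets in this thread (illustrative only):

```js
// One-to-one self-connection: each neuron feeds only itself (A->A, B->B, C->C),
// so the order in which the layer's neurons are activated can't overwrite an
// activation that another neuron still needs to read.
var cells = new Layer(3);
cells.project(cells); // self-projections are one-to-one, as described above

// All-to-all self-connection: every neuron reads every neuron's previous
// activation, so sequential in-place updates mix values from two time-steps.
var recurrent = new Layer(3);
recurrent.project(recurrent, Layer.connectionType.ALL_TO_ALL);
```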

@Sleepwalking
Contributor Author

Oh... I didn't notice they're one-to-one by default, thanks for pointing that out!
Nevertheless, ordinary RNNs have all-to-all self-connections. How about running some tests on ordinary RNNs to see if this PR helps?

@cazala
Owner

cazala commented Apr 30, 2015

Yes, that would be great. What you say makes sense, so we should test it and see if it improves the outcome for RNNs. The main reason the layers work that way today is that I coded them following this algorithm, which states that "if a layer receives recurrent connections from a layer which has not yet been activated this time-step, the sending layer's activation from the previous time-step is used".

@Sleepwalking
Contributor Author

Actually, about half a month ago I also followed Monner's paper and implemented LSTM-g in Octave. It seemed to work, but it was 100x slower than synaptic, which is why I decided to move to synaptic haha :)
I found the same statement regarding time-steps, but I (probably) didn't fully understand what it means.

@Sleepwalking
Contributor Author

@cazala The RNN fails to train after a few epochs, regardless of the learning rate and of whether parallelized activation is used. The weights all become NaN. My topology simply adds a hidden-to-hidden all-to-all connection on top of Architect.Perceptron:

RNN = new Network();
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);                                   // input -> hidden
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL); // recurrent all-to-all hidden -> hidden
hidden.project(output);                                  // hidden -> output
RNN.set({input: input, hidden: [hidden], output: output});
RNN.trainer = new Trainer(RNN);

If that self-connection is removed, the network becomes an ordinary MLP and everything is fine...

@cazala
Owner

cazala commented Apr 30, 2015

Yeah, synaptic would be that slow if it weren't for the hardcoder/optimizer, but the logic is pretty much the same. What you are suggesting for self-connected layers actually makes sense, and it sticks to the principle in the algorithm that states: if layer A has not been activated in this time-step, the activation from the previous time-step is used.

So let's say layer A is self-connected (all-to-all) and we activate it. The input has to be layer A's own output, and since layer A hasn't been activated in this time-step yet, we need to use its output from the previous time-step. Right now the first neuron in the layer uses the correct input, but the following ones will not, since we overwrite the activations as we activate each individual neuron of the layer.

So yeah, this PR makes sense. What I don't know is which task we can use to test two RNNs with fully self-connected layers and make sure that the one with parallelized activations works better...

@cazala
Owner

cazala commented Apr 30, 2015

Mmm, I just read your last comment. I should take a look to see why the weights are becoming NaNs... Can you make a new branch and push these changes there, so I can pull them?

EDIT: I can just copy the RNN topology actually, no need for a new branch.

@Sleepwalking
Contributor Author

I created an rnn branch that you can pull from: https://github.com/Sleepwalking/synaptic/tree/rnn
I will not commit on that branch; further changes stay on master. Right?

P.S. Even when the learning rate = 0, the weights still go to NaN at around the 300th iteration of the 1st epoch.

@cazala
Owner

cazala commented Apr 30, 2015

Yep, keep committing here to master. I just tested this code using the version of synaptic that we have on the website:

RNN = new Network();
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL);
hidden.project(output);
RNN.set({input: input, hidden: [hidden], output: output});
RNN.trainer = new Trainer(RNN);

And tried to train an XOR: RNN.trainer.XOR()

It fails to learn the XOR, but the weights are not NaN; it's just that the topology doesn't converge. If I delete the self-connection it converges nicely. Are you sure it wasn't something from these last changes that's making the weights become NaN?

EDIT: it can be tested just by pasting the code in the console on the website

@Sleepwalking
Contributor Author

Hmm... seems like something is wrong with my C code generator.

@Sleepwalking
Contributor Author

It's caused by numerical error. After 100 iterations I found some of the weights go up to 100 or more. Since I used libfastapprox for calculating the sigmoid & its derivative, when the input gets too large, some step in calculating the sigmoid derivative produces a NaN.
I replaced libfastapprox with the standard library. No more NaNs, but it still fails to learn the timing task. The weights get even larger (1e+05 on average). There must be something wrong.
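For illustration, the kind of overflow that can turn a sigmoid derivative into NaN (a generic JavaScript sketch; libfastapprox is C and its internals may differ):

```js
// Naive form: Math.exp(1000) overflows to Infinity, and
// Infinity / (1 + Infinity) evaluates to Infinity / Infinity, i.e. NaN.
function sigmoidNaive(x) {
  return Math.exp(x) / (1 + Math.exp(x));
}

// Stable form: only ever exponentiate a non-positive number.
function sigmoid(x) {
  return x >= 0 ? 1 / (1 + Math.exp(-x)) : Math.exp(x) / (1 + Math.exp(x));
}

// The derivative then stays finite (it underflows to 0 for huge |x|).
function sigmoidDerivative(x) {
  var s = sigmoid(x);
  return s * (1 - s);
}
```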
Do weights explode on your branch?
EDIT: even when the learning rate is 0, the network still explodes. Analysis shows that the responsibilities and trace eligibilities explode.

p.s. went to sleep... see you later

@cazala
Owner

cazala commented Apr 30, 2015

While training the XOR, some of the weights went up to ~40k after 10,000 iterations, so yeah, it seems they tend to explode with that topology... I'll do some more tests later once I get off work.

@Sleepwalking
Contributor Author

@cazala I know why. It's because the LSTM-g algorithm is inherently unable to project a self-connection the way an ordinary RNN does!
Let's look at Eq. 15: the self-connection connects the state of j to itself, but not the activity of j to the state of j, which means the state of j increases linearly with time!
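To spell that out (only a paraphrase of the idea, not the exact Eq. 15; gating terms are omitted), the self-connection feeds the previous state back into the state,

$$s_j(t) = w_{jj}\, s_j(t-1) + \sum_{i \neq j} w_{ji}\, y_i(t-1),$$

whereas an ordinary RNN feeds back the squashed activity,

$$s_j(t) = \sum_i w_{ji}\, y_i(t-1), \qquad y_j(t) = \sigma\big(s_j(t)\big).$$

With $w_{jj}$ near 1 and no squashing applied between $s_j(t-1)$ and $s_j(t)$, the state just keeps accumulating its inputs, which is why it grows without bound.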
To solve this I slightly modified the topology:

RNN = new Network();
var input = new Layer(2);
var hidden1 = new Layer(20);
var hidden2 = new Layer(20);
var output = new Layer(1);
input.project(hidden1);
hidden1.project(hidden2);                                  // hidden1 -> hidden2 (all-to-all)
hidden2.project(hidden1, Layer.connectionType.ONE_TO_ONE); // recurrence carries hidden2's activity back
hidden2.project(output);
RNN.set({input: input, hidden: [hidden1, hidden2], output: output});
RNN.trainer = new Trainer(RNN);

Now it is trainable, though the RNN is far worse than LSTM on the timing task (which is consistent with the theory that LSTM is better).
(screenshot: rnn-sample)
But if we discard the hidden1-to-hidden1 connection, this PR would be useless, unless there exists some other topology with a full hidden-to-hidden connection...

@cazala
Owner

cazala commented May 1, 2015

That makes sense. How much time / how many iterations does it take for the LSTM to converge on the timing task, at least to an acceptable error? It would be great to add a spec with that task to our current tests, if it could be solved within a few seconds. Do you think that would be possible, or would it take too much time to solve that task every time the CI runs the tests (i.e. every time we merge a PR)?

@Sleepwalking
Contributor Author

Full convergence takes 400 epochs * 7000 iterations (samples) per epoch; that's 7 seconds for the C implementation. I'm afraid that's too much for JavaScript. But you start to see the trend towards convergence after 10 epochs, which may take less than 10 seconds in JavaScript and is acceptable.
I think it's better to add a new test for the timing task, or, if you think it's too slow, just don't include it in the CI.
Shall we close this PR since we don't need the full h2h self-connection?
EDIT: wait, full h2h for LSTM works!

@cazala
Owner

cazala commented May 1, 2015

I could try to code a JavaScript test for the timing task and see how much time it takes to reach an acceptable error. Regarding the fully connected h2h for LSTM, does it perform better? Did you try it with or without the parallelized activation?

@Sleepwalking
Contributor Author

Thanks for the test.
I'm running more tests on fully connected vs. self-only connected h2h. It seems the result is easily affected by different random initializations.

@Sleepwalking
Contributor Author

I've run Timing Task 2 a dozen times, and in general:

  • LSTM + h2g + h2h generally outperforms the variants without h2h;
  • parallelized activation does not have a significant influence on this task;
  • convergence largely depends on random initialization.

Once I got the smallest MSE ever achieved on this task, 0.0008, with LSTM + h2g + h2h and parallelized activation, but that happened only once and I haven't been able to reproduce the result since.
(screenshot: 1st-sample)

@Sleepwalking
Contributor Author

Results for LSTM + h2g + h2h, 500 epochs, MSE on the test set:

| Trial | MSE (parallelized activ.) | MSE (sequential activ.) |
|-------|---------------------------|-------------------------|
| 1     | 0.001278                  | 0.002079                |
| 2     | 0.004472                  | 0.005276                |
| 3     | 0.002202                  | 0.000560 (wtf?!)        |
| 4     | 0.003039                  | 0.004859                |
| 5     | 0.004421                  | 0.001620                |
| Avg   | 0.003082                  | 0.002879                |

It's really hard to tell which is better.

@Sleepwalking
Contributor Author

Added a new connection type, ALL_TO_ELSE, which is equivalent to ALL_TO_ALL but excludes self-connections (see the usage sketch after the results below).

| Trial | MSE (parallelized activ.) | MSE (sequential activ.) |
|-------|---------------------------|-------------------------|
| 1     | 0.003381                  | 0.002827                |
| 2     | NaN                       | 0.002786                |
| 3     | 0.005046                  | 0.001771                |
| 4     | 0.003540                  | 0.002759                |
| 5     | 0.003155                  | 0.003218                |

Anyway, NaN is bad. At this point I'd rather stick with sequential activation.
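For reference, a minimal sketch of how the new type would be wired into a topology like the ones above (same Layer API as the earlier snippets; illustrative only):

```js
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);

input.project(hidden);
// ALL_TO_ELSE: every neuron in `hidden` projects to every *other* neuron in
// the layer, but not to itself (ALL_TO_ALL minus the self-connections).
hidden.project(hidden, Layer.connectionType.ALL_TO_ELSE);
hidden.project(output);
```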

@cazala
Owner

cazala commented May 1, 2015

Maybe we should leave the parallelization out for now, for the sake of keeping the code simple, since it doesn't provide a significant improvement in performance. I think the new connection type and the working RNN architecture are good additions to the library though, and I'm especially curious about the DOT language compiler feature, which I haven't played with yet but looks promising.

@Sleepwalking
Contributor Author

I reverted the previous changes to parallel activation.

An example of a topology graph generated by dot:
(image: out)

Have fun with it!

@cazala
Owner

cazala commented May 1, 2015

awesome, thank you (:

cazala pushed a commit that referenced this pull request on May 1, 2015:
Parallelize neuron activation sequence in self-connected layers
cazala merged commit ac4c268 into cazala:master on May 1, 2015
@0joshuaolson1

0joshuaolson1 commented Mar 20, 2017

Hi, did activation order (or something else) turn out to be wrong? As the writer of https://github.com/MrMormon/lstm-g, I'm quite curious.
