
Parallelize neuron activation sequence in self-connected layers #31

Merged
merged 3 commits on May 1, 2015

Conversation

Sleepwalking
Contributor

This PR is not about parallelizing the library to run on multiple cores/GPUs. It's about a bug in the sequence in which neurons are activated in self-connected layers.

We need to take great care when activating a self-connected layer (e.g. memoryCell.project(memoryCell);). Before this PR, all neurons in such a layer were activated in sequence, which means earlier neurons overwrite the activations that later neurons still need as input, so activations from different time-steps get mixed up. The right way to do this is to update the activations only after all neurons in the layer have been activated.
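A minimal sketch of the two-phase update described above (illustrative only, not the actual patch; computeActivation() here is a hypothetical stand-in for whatever computes a neuron's activation from its inputs):

```js
// Phase 1: compute every neuron's new activation while all neurons still
// expose their previous time-step activations.
// Phase 2: commit the new activations in a second pass, so no neuron reads
// a value that a peer has already overwritten during this time-step.
function activateSelfConnectedLayer(neurons) {
  var pending = neurons.map(function (neuron) {
    return neuron.computeActivation(); // reads peers' old (previous-step) activations
  });
  neurons.forEach(function (neuron, i) {
    neuron.activation = pending[i];
  });
  return pending;
}
```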

This PR fixes the bug in forward propagation, including the hardcoder (the standalone hardcoder is fixed but not tested yet). I'm not sure yet whether a similar bug exists in back propagation. We have to fix these bugs if we want to parallelize the library across multiple cores, because we don't want resource conflicts.

The fixed code passes the DSR task as well as the timing task.

Notable changes:

  • Neuron got two new properties: .newactivation and .inselfconnectedlayer
  • update_sentences was added to the hardcoder; it contains a few lines like Neuron.activation = Neuron.newactivation;

@Sleepwalking
Contributor Author

@cazala I think we also need to make the same changes to the back propagation code:

Please don't merge the commit below (7009132); it fails the timing task.
I've reverted that commit after trying many different propagation sequences; none of them works except the original sequential way. Probably back propagation just doesn't work in the parallel way.
At least the parallelized forward prop (4a9e831 to 867f544) is completely fine.

@cazala
Owner

cazala commented Apr 30, 2015

@Sleepwalking what do you mean exactly when you say that former neurons always overwrite activations fed to later neurons? Currently we use one-to-one connections for self-connected layers, so if layer L has neurons A, B and C, each of them projects a connection to itself: A would project a connection to A, but not to B or C... so I don't see how this would change the output in the end. Did you get different results?
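To make the distinction concrete, a quick sketch using the same Layer API as the other snippets in this thread (illustrative only):

```js
// One-to-one self-connection: each neuron feeds only itself (A->A, B->B, C->C),
// so the order in which the layer's neurons are activated can't overwrite an
// activation that another neuron still needs to read.
var cells = new Layer(3);
cells.project(cells); // self-projections are one-to-one, as described above

// All-to-all self-connection: every neuron reads every neuron's previous
// activation, so sequential in-place updates mix values from two time-steps.
var recurrent = new Layer(3);
recurrent.project(recurrent, Layer.connectionType.ALL_TO_ALL);
```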

@Sleepwalking
Contributor Author

Oh... I didn't notice they're one-to-one by default, thanks for pointing that out!
Nevertheless, ordinary RNNs have all-to-all self-connections. How about running some tests on ordinary RNNs to see if this PR helps?

@cazala
Owner

cazala commented Apr 30, 2015

Yes, that would be great. What you say makes sense, so we should test it and see if it improves the outcome for RNNs. The main reason the layers work that way today is that I coded them following this algorithm, which states that "if a layer receives recurrent connections from a layer which has not yet been activated this time-step, the sending layer's activation from the previous time-step is used".

@Sleepwalking
Contributor Author

Actually, about half a month ago I also followed Monner's paper and implemented LSTM-g in Octave. It seemed to work, but it was 100x slower than synaptic, which is why I decided to move to synaptic haha :)
I found the same statement regarding time-steps, but I (probably) didn't fully understand what it means.

@Sleepwalking
Contributor Author

@cazala The RNN fails to train after a few epochs, regardless of the learning rate and of whether parallelized activation is used. The weights all become NaN. My topology simply adds a hidden-to-hidden all-to-all connection on top of Architect.Perceptron:

RNN = new Network();
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);                                   // input -> hidden
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL); // recurrent all-to-all hidden -> hidden
hidden.project(output);                                  // hidden -> output
RNN.set({input: input, hidden: [hidden], output: output});
RNN.trainer = new Trainer(RNN);

If that self-connection is removed, the network becomes an ordinary MLP and everything is fine...

@cazala
Owner

cazala commented Apr 30, 2015

Yeah, synaptic would be that slow if it weren't for the hardcoder/optimizer, but the logic is pretty much the same. What you are suggesting for self-connected layers actually makes sense, and it sticks to the principle in the algorithm that states: if layer A has not been activated in this time-step, the activation from the previous time-step is used.

So let's say layer A is self-connected (all-to-all) and we activate it. The input has to be layer A's own output, and since layer A hasn't been activated in this time-step yet, we need to use its output from the previous time-step. Right now the first neuron in the layer uses the correct input, but the following ones will not, since we overwrite the activations as we activate each individual neuron of the layer.

So yeah, this PR makes sense. What I don't know is which task we can use to test two RNNs with fully self-connected layers and make sure that the one with parallelized activations works better...

@cazala
Owner

cazala commented Apr 30, 2015

Mmm, I just read your last comment. I should take a look to see why the weights are becoming NaNs... Can you make a new branch and push these changes there, so I can pull them?

EDIT: I can just copy the RNN topology actually, no need for a new branch.

@Sleepwalking
Contributor Author

I created an rnn branch that you can pull from: https://github.com/Sleepwalking/synaptic/tree/rnn
I will not commit on that branch; further changes stay on master. Right?

P.S. Even when the learning rate = 0, the weights still go to NaN at around the 300th iteration of the 1st epoch.

@cazala
Owner

cazala commented Apr 30, 2015

Yep, keep committing here to master. I just tested this code using the version of synaptic that we have on the website:

RNN = new Network();
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);
input.project(hidden);
hidden.project(hidden, Layer.connectionType.ALL_TO_ALL);
hidden.project(output);
RNN.set({input: input, hidden: [hidden], output: output});
RNN.trainer = new Trainer(RNN);

And tried to train an XOR: RNN.trainer.XOR()

It fails to learn the XOR, but the weights are not NaN; it's just that the topology doesn't converge. If I delete the self-connection it converges nicely. Are you sure it wasn't something from these last changes that's making the weights become NaN?

EDIT: it can be tested just by pasting the code in the console on the website

@Sleepwalking
Contributor Author

Hmm... seems like something is wrong with my C code generator.

@Sleepwalking
Contributor Author

It's caused by numerical error. After 100 iterations I found some of the weights go up to 100 or more. Since I used libfastapprox for calculating the sigmoid & its derivative, when the input gets too large, some step in calculating the sigmoid derivative produces a NaN.
I replaced libfastapprox with the standard library. No more NaNs, but it still fails to learn the timing task. The weights get even larger (1e+05 on average). There must be something wrong.
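For illustration, the kind of overflow that can turn a sigmoid derivative into NaN (a generic JavaScript sketch; libfastapprox is C and its internals may differ):

```js
// Naive form: Math.exp(1000) overflows to Infinity, and
// Infinity / (1 + Infinity) evaluates to Infinity / Infinity, i.e. NaN.
function sigmoidNaive(x) {
  return Math.exp(x) / (1 + Math.exp(x));
}

// Stable form: only ever exponentiate a non-positive number.
function sigmoid(x) {
  return x >= 0 ? 1 / (1 + Math.exp(-x)) : Math.exp(x) / (1 + Math.exp(x));
}

// The derivative then stays finite (it underflows to 0 for huge |x|).
function sigmoidDerivative(x) {
  var s = sigmoid(x);
  return s * (1 - s);
}
```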
Do weights explode on your branch?
EDIT: even when the learning rate is 0, the network still explodes. Analysis shows that the responsibilities and trace eligibilities explode.

p.s. went to sleep... see you later

@cazala
Owner

cazala commented Apr 30, 2015

While training the XOR, some of the weights went up to ~40k after 10,000 iterations, so yeah, it seems they tend to explode with that topology... I'll do some more tests later once I get off work.

@Sleepwalking
Contributor Author

@cazala I know why. It's because the LSTM-g algorithm is inherently unable to project a self-connection the way an ordinary RNN does!
Let's look at Eq. 15: the self-connection connects the state of j to itself, but not the activity of j to the state of j, which means the state of j increases linearly with time!
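To spell that out (only a paraphrase of the idea, not the exact Eq. 15; gating terms are omitted), the self-connection feeds the previous state back into the state,

$$s_j(t) = w_{jj}\, s_j(t-1) + \sum_{i \neq j} w_{ji}\, y_i(t-1),$$

whereas an ordinary RNN feeds back the squashed activity,

$$s_j(t) = \sum_i w_{ji}\, y_i(t-1), \qquad y_j(t) = \sigma\big(s_j(t)\big).$$

With $w_{jj}$ near 1 and no squashing applied between $s_j(t-1)$ and $s_j(t)$, the state just keeps accumulating its inputs, which is why it grows without bound.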
To solve this I slightly modified the topology:

RNN = new Network();
var input = new Layer(2);
var hidden1 = new Layer(20);
var hidden2 = new Layer(20);
var output = new Layer(1);
input.project(hidden1);
hidden1.project(hidden2);                                  // hidden1 -> hidden2 (all-to-all)
hidden2.project(hidden1, Layer.connectionType.ONE_TO_ONE); // recurrence carries hidden2's activity back
hidden2.project(output);
RNN.set({input: input, hidden: [hidden1, hidden2], output: output});
RNN.trainer = new Trainer(RNN);

Now it is trainable, though the RNN is far worse than LSTM on the timing task (which is consistent with the theory that LSTM is better).
(screenshot: rnn-sample)
But if we discard the hidden1-to-hidden1 connection, this PR would be useless, unless there exists some other topology with a full hidden-to-hidden connection...

@cazala
Owner

cazala commented May 1, 2015

That makes sense. How much time / how many iterations does it take for the LSTM to converge on the timing task, at least to an acceptable error? It would be great to add a spec with that task to our current tests, if it could be solved within a few seconds. Do you think that would be possible, or would it take too much time to solve that task every time the CI runs the tests (i.e. every time we merge a PR)?

@Sleepwalking
Contributor Author

Full convergence takes 400 epochs * 7000 iterations (samples) per epoch; that's 7 seconds for the C implementation. I'm afraid that's too much for JavaScript. But you start to see the trend towards convergence after 10 epochs, which may take less than 10 seconds in JavaScript and is acceptable.
I think it's better to add a new test for the timing task, or, if you think it's too slow, just don't include it in the CI.
Shall we close this PR since we don't need the full h2h self-connection?
EDIT: wait, full h2h for LSTM works!

@cazala
Owner

cazala commented May 1, 2015

I could try to code a JavaScript test for the timing task and see how much time it takes to reach an acceptable error. Regarding the fully connected h2h for LSTM, does it perform better? Did you try it with or without the parallelized activation?

@Sleepwalking
Contributor Author

Thanks for the test.
I'm running more tests on fully connected vs. self-only connected h2h. It seems the result is easily affected by different random initializations.

@Sleepwalking
Contributor Author

I've run Timing Task 2 a dozen times, and in general:

  • LSTM + h2g + h2h generally outperforms the variants without h2h;
  • parallelized activation does not have a significant influence on this task;
  • convergence largely depends on random initialization.

Once I got the smallest MSE ever achieved on this task, 0.0008, with LSTM + h2g + h2h and parallelized activation, but that happened only once and I haven't been able to reproduce the result since.
(screenshot: 1st-sample)

@Sleepwalking
Contributor Author

Results for LSTM + h2g + h2h, 500 epochs, MSE on the test set:

| Trial | MSE (parallelized activ.) | MSE (sequential activ.) |
|-------|---------------------------|-------------------------|
| 1     | 0.001278                  | 0.002079                |
| 2     | 0.004472                  | 0.005276                |
| 3     | 0.002202                  | 0.000560 (wtf?!)        |
| 4     | 0.003039                  | 0.004859                |
| 5     | 0.004421                  | 0.001620                |
| Avg   | 0.003082                  | 0.002879                |

It's really hard to tell which is better.

@Sleepwalking
Contributor Author

Added a new connection type, ALL_TO_ELSE, which is equivalent to ALL_TO_ALL but excludes self-connections (see the usage sketch after the results below).

| Trial | MSE (parallelized activ.) | MSE (sequential activ.) |
|-------|---------------------------|-------------------------|
| 1     | 0.003381                  | 0.002827                |
| 2     | NaN                       | 0.002786                |
| 3     | 0.005046                  | 0.001771                |
| 4     | 0.003540                  | 0.002759                |
| 5     | 0.003155                  | 0.003218                |

Anyway, NaN is bad. At this point I'd rather stick with sequential activation.
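For reference, a minimal sketch of how the new type would be wired into a topology like the ones above (same Layer API as the earlier snippets; illustrative only):

```js
var input = new Layer(2);
var hidden = new Layer(20);
var output = new Layer(1);

input.project(hidden);
// ALL_TO_ELSE: every neuron in `hidden` projects to every *other* neuron in
// the layer, but not to itself (ALL_TO_ALL minus the self-connections).
hidden.project(hidden, Layer.connectionType.ALL_TO_ELSE);
hidden.project(output);
```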

@cazala
Owner

cazala commented May 1, 2015

Maybe we should leave the parallelization out for now, for the sake of keeping the code simple, since it doesn't provide a significant improvement in performance. I think the new connection type and the working RNN architecture are good additions to the library though, and I'm especially curious about the DOT language compiler feature, which I haven't played with yet but looks promising.

@Sleepwalking
Contributor Author

I reverted the previous changes to parallel activation.

An example of a topology graph generated by dot:
(image: out)

Have fun with it!

@cazala
Owner

cazala commented May 1, 2015

awesome, thank you (:

cazala pushed a commit that referenced this pull request on May 1, 2015:
Parallelize neuron activation sequence in self-connected layers
cazala merged commit ac4c268 into cazala:master on May 1, 2015
@0joshuaolson1

0joshuaolson1 commented Mar 20, 2017

Hi, did activation order (or something else) turn out to be wrong? As the writer of https://github.com/MrMormon/lstm-g, I'm quite curious.
