Output/Hidden to Hidden gatings in LSTM-RNN #30

Closed
Sleepwalking opened this issue Apr 29, 2015 · 19 comments

@Sleepwalking
Contributor

According to Felix Gers' dissertation [1], the gates on memory cells have connections not only from the input layer, but also from the memory cells themselves. However, Architect currently only projects input-to-gate connections for LSTMs (apart from the peepholes),

      inputLayer.project(inputGate);
      inputLayer.project(forgetGate);
      inputLayer.project(outputGate);

which means that the neural network decides what to remember or forget based only on its current input, which could be disastrous for tasks that require long-term memory.
In some other applications the memory cells are even gated by the outputs. Besides gating, first-order connections from the output layer to the hidden layers also appear in some of the literature.

This observation suggests why the Wikipedia language modeling task doesn't give promising results even after hours of training. In an informal test with hidden-layer-to-gates connections enabled, the network was able to reproduce text such as "the of the of the of the of the of ..." on its own. I also trained an LSTM with 70 memory cells on some short paragraphs, and the network could exactly reproduce two or three sentences on its own.

I'm going to run further tests to compare the hidden-to-gates connected LSTMs with input-to-gates connected ones.

[1] Gers, Felix. Long Short-Term Memory in Recurrent Neural Networks. PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2001.

@cazala
Owner

cazala commented Apr 29, 2015

@Sleepwalking what do you mean exactly when you say that the memory cells should also be gated by themselves? Currently the memory cells project a connection to themselves:

var self = memoryCell.project(memoryCell);

And this connection is gated by the forget gate:

forgetGate.gate(self, Layer.gateType.ONE_TO_ONE);

It's amazing that you got an LSTM to reproduce whole sentences. I'm very interested in this, but I still don't quite get the topology of the network you are using. Could you explain a little more, or share some of the code you are using to connect the memory cells?

@Sleepwalking
Contributor Author

Hello @cazala,

To be precise, the gates of the memory cells have 1st-order connections from the memory cells themselves:

memoryCell.project(forgetGate);
memoryCell.project(inputGate);
memoryCell.project(outputGate);

Sorry, I just discovered that the implementation used in the "informal test" didn't include hidden-to-gates connections (I made lots of local forks, they got mixed up, and I ran the code on the wrong one...).
This seems to contradict my hypothesis, because my 30-72-30 LSTM works unbelievably well on small language modeling tasks whose training sets contain only a few lines of words. The network is able to remember up to one hundred steps of characters.
Anyway, I'm running the test with hidden-to-gates connections now and I'll post the results as soon as possible.

Kanru Hua

@cazala
Owner

cazala commented Apr 29, 2015

I see, so all the memory cells project connections to all the input/forget/output gates? Because right now we have peephole connections projected from each memory cell to its corresponding gates:

// peepholes
memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);

It would be great if you could share the result of your tests when you finish them (:

@Sleepwalking
Contributor Author

Large networks are notoriously slow to train...
I translated the generated code into C, compiled it with clang, wrapped it in MEX files, and trained & tested the network from GNU Octave.
An LSTM with 30 binary inputs, 72 memory cells, and 30 binary outputs is trained to predict the next character based on the current character (and its memory).
Input/Output format:
| 26: a-z | 1:period or comma | 1:space | 1:quote | 1:parentheses |
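For concreteness, here is a sketch of that 30-way one-hot encoding in JavaScript (the exact index layout is my assumption; only the field widths are given above):

function charToVec(c) {
    // 30-dimensional one-hot vector matching the format above
    var v = [];
    for (var i = 0; i < 30; i++) v.push(0);
    if (c >= 'a' && c <= 'z')        v[c.charCodeAt(0) - 97] = 1; // 0-25: a-z
    else if (c === '.' || c === ',') v[26] = 1;                   // period or comma
    else if (c === ' ')              v[27] = 1;                   // space
    else if (c === '"' || c === "'") v[28] = 1;                   // quote
    else if (c === '(' || c === ')') v[29] = 1;                   // parentheses
    return v;
}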
The training set is some super stupid nonsense,
pata thinks that kata wants the pala of the kaba, because kata hates the pala of the kaba so much. kaba wants to kill pata so that kaba can kill kata. this is so nice.
specially designed to test whether the network can capture context, since pata, kata, kaba, pala, etc. occur multiple times in the text.
To work around initialization problems, the training set is repeated lots of times and joined together to give a 7000-character-long text,

pata thinks that kata wants the pala of the kaba, because kata hates the pala of the kaba so 
much. kaba wants to kill pata so that kaba can kill kata. this is so nice. pata thinks that kata 
wants the pala of the kaba, because kata hates the pala of the kaba so much. kaba wants 
to kill pata so that kaba can kill kata. this is so nice. pata thinks that kata wants the pala of 
the kaba, because kata hates the pala of the kaba so much. kaba wants to kill pata so that 
kaba can kill kata. this is so nice.
...

The LSTM is trained using stochastic gradient descent with an initial learning rate of 0.01. The learning rate decays by a factor of 0.99 each epoch until it drops below 0.005.
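In synaptic's JavaScript API, that schedule would look roughly like this (a sketch only, not the exact code I ran, which was translated to C and driven from Octave; charToVec is the encoder sketched earlier, and text stands in for the repeated training text):

// Build the next-character prediction set and train with decaying-rate SGD.
var synaptic = require('synaptic');
var network = new synaptic.Architect.LSTM(30, 72, 30);

var text = 'pata thinks that kata wants the pala of the kaba. ';  // placeholder
var trainingSet = [];
for (var k = 0; k + 1 < text.length; k++)
    trainingSet.push({ input: charToVec(text[k]), output: charToVec(text[k + 1]) });

var rate = 0.01;
for (var epoch = 1; epoch <= 30; epoch++) {
    var mse = 0;
    for (var i = 0; i < trainingSet.length; i++) {
        var output = network.activate(trainingSet[i].input);  // forward pass
        network.propagate(rate, trainingSet[i].output);       // online SGD update
        for (var j = 0; j < output.length; j++)
            mse += Math.pow(output[j] - trainingSet[i].output[j], 2);
    }
    mse /= trainingSet.length * 30;
    console.log('Epoch ' + epoch + ', MSE = ' + mse.toFixed(6));
    if (rate > 0.005) rate *= 0.99;  // decay by 0.99 each epoch until below 0.005
}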

Result Without Hidden-to-Gates Connections

Epoch 1, MSE = 0.830720
Epoch 2, MSE = 0.724659
Epoch 3, MSE = 0.692267
Epoch 4, MSE = 0.670466
Epoch 5, MSE = 0.647484
Epoch 6, MSE = 0.619622
Epoch 7, MSE = 0.589051
Epoch 8, MSE = 0.558995
Epoch 9, MSE = 0.530920
Epoch 10, MSE = 0.505305
Epoch 11, MSE = 0.482118
Epoch 12, MSE = 0.460689
Epoch 13, MSE = 0.440006
Epoch 14, MSE = 0.419213
Epoch 15, MSE = 0.397737
Epoch 16, MSE = 0.375261
Epoch 17, MSE = 0.351678
Epoch 18, MSE = 0.327267
Epoch 19, MSE = 0.302879
Epoch 20, MSE = 0.279280
Epoch 21, MSE = 0.256809
Epoch 22, MSE = 0.235538
Epoch 23, MSE = 0.215458
Epoch 24, MSE = 0.196564
Epoch 25, MSE = 0.178876
Epoch 26, MSE = 0.162826
Epoch 27, MSE = 0.149508
Epoch 28, MSE = 0.136092
Epoch 29, MSE = 0.123352
Epoch 30, MSE = 0.111187

Output sample:

t kill pata thant kill kata so this kata the kaba wans thes o much. kill pata so that kaba wans
to kill pata so that kaba can kill kata this so nice. kata. this inis s nice. pata this s o nice. kill
pata that kaba so nthe kaba kaba so chice. kaba so nthice. kata kaba wants thes the pala 
of the kaba so much. kaba can kill kata this so inis s nice. pata this so nice. kata thinks that 
kata so the kaba wants the pala of the kaba wants to kill kata so that kaba can kill kata so 
this so nice. pata thinks that kata wants the pala of the kaba wants the kaba so much. kaba
can kill kata so this so nice. pata thinks that kata wants the pala of the kaba wants to kill pata
so tha kaba can this kil so katat kata wants the pala of the kaba wants to kill kata so that kaba
can kill kata so this so nice. kill pata tha kata wants the kaba wanso this s o nice. pata thinks
that kata wants the pala of the kaba wants to kill kata so that kaba can kill kata so this so nice.
pata thinks that kata wants the pala of the kaba wants the kaba so much.
...

Result With Hidden-to-Gates Connections

Epoch 1, MSE = 0.841429
Epoch 2, MSE = 0.731850
Epoch 3, MSE = 0.691192
Epoch 4, MSE = 0.663000
Epoch 5, MSE = 0.657220
Epoch 6, MSE = 0.652227
Epoch 7, MSE = 0.628353
Epoch 8, MSE = 0.613436
Epoch 9, MSE = 0.583812
Epoch 10, MSE = 0.567494
Epoch 11, MSE = 0.529668
Epoch 12, MSE = 0.505728
Epoch 13, MSE = 0.470627
Epoch 14, MSE = 0.443538
Epoch 15, MSE = 0.419392
Epoch 16, MSE = 0.398090
Epoch 17, MSE = 0.371459
Epoch 18, MSE = 0.347105
Epoch 19, MSE = 0.325191
Epoch 20, MSE = 0.315286
Epoch 21, MSE = 0.299850
Epoch 22, MSE = 0.278598
Epoch 23, MSE = 0.266516
Epoch 24, MSE = 0.243895
Epoch 25, MSE = 0.251854
Epoch 26, MSE = 0.231868
Epoch 27, MSE = 0.225016
Epoch 28, MSE = 0.253627
Epoch 29, MSE = 0.217485
Epoch 30, MSE = 0.203307

Output sample:

this this so thinininise. ts the pata the pala of the kaba ba can kill kata so kila so this so thiso this
so thiso the pata the pala of the kaba ba can kill kata so kila so this so thiso this so thiso the pata
the pala of the kaba ban cats the kaba wants the kaba so the kaba can kill kata so kata this s ice.
pata this this so the paba of the kaba. can kill kata this s ice. pata this the kaba ban kila so ice.
killl pata that kaba so kanis this so thiso thise so that kata so that kaba can kill kata this s ice.
pata this the kaba ban cats the kaba wants the kaba so the kaba canks this the kaba ba can kill
kata so kill so ice. pata that kata wants the kaba wants the kaba of kill kata so this so thiso this
so the paba o cathis the kaba wants the kaba of the kaba wants thes o kaba cants the kaba
wants the kaba so the kaba can kill kata so kata this s ice. pata this this so the paba of the kaba.
can kill kata this s ice. pata this this so the paba of the kaba ba cants the killats s tha o kaba can
kill so ice. pata this this so thininise ts t ...

It's kind of weird that the network with hidden-to-gates connections performs even worse than the original one. This is probably because training hasn't converged yet; it seems the learning rate is high enough that the MSE bounces back up after 25 epochs.
Note that the latter network is 3 times larger than the one without hidden-to-gates connections; it produces a 23MB C file, and training is extremely slow even when translated to C!

@cazala
Owner

cazala commented Apr 29, 2015

That's way better than the results I got on the Wikipedia language modeling task. You said you translated the generated code into C... you mean the hardcoded, optimized JavaScript code, right? Could you share the network you used to generate that optimized code? (The network itself.)

Or did you use the LSTM that's already in the Architect for the first result (without hidden-to-gates)?

@Sleepwalking
Contributor Author

Yeah, I used the one in Architect.
For the one with hidden-to-gates I inserted three lines at https://github.com/cazala/synaptic/blob/master/src/architect.js#L108,

      // hidden-to-gates
      memoryCell.project(inputGate);
      memoryCell.project(outputGate);
      memoryCell.project(forgetGate);

Now I'm running more epochs to see if it works better. Each epoch takes half a minute... it would be better to test this on some simplified tasks.

The architecture in Gers' work is way more complicated than Architect.LSTM - hidden-to-gates, output-to-gates, etc.; he also puts an intermediate layer between the memory cells and the output layer, if I remember correctly. There are tons of variations on LSTM-RNNs, and it would take us a lot of time to try them all out.

@menduz
Collaborator

menduz commented Apr 29, 2015

Did you export the synaptic network directly to C?


@Sleepwalking
Contributor Author

@menduz Yes. I'm actually working on a fork that directly outputs C code. Once it's fully functional I'll push it to GitHub.

@Sleepwalking
Contributor Author

MSE for the first 60 epochs. The learning rate starts at 0.01 and decays by a factor of 0.99 each epoch until it drops below 0.001. When the MSE 'explodes' (exceeds 1), the trainer automatically restores the network to the last epoch at which the MSE did not explode.
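That restore-on-explosion safeguard, sketched against synaptic's API (my runs used the C translation; trainEpoch stands for the per-epoch SGD pass shown earlier, and trainingSet is assumed to exist):

var synaptic = require('synaptic');
var Network = synaptic.Network;

var network = new synaptic.Architect.LSTM(30, 72, 30);
var checkpoint = network.toJSON();  // last known-good weights
var rate = 0.01;
for (var epoch = 1; epoch <= 60; epoch++) {
    var mse = trainEpoch(network, trainingSet, rate);  // placeholder: one SGD pass
    if (mse > 1) {
        network = Network.fromJSON(checkpoint);        // roll back the exploded epoch
    } else {
        checkpoint = network.toJSON();
        console.log('Epoch ' + epoch + ', MSE = ' + mse.toFixed(6));
    }
    if (rate > 0.001) rate *= 0.99;  // decay by 0.99 until the rate drops below 0.001
}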
[plot: MSE over the first 60 epochs]

Output sample:

thinks that kaba wants to kill pata so that kaba can kanka tha kanka tha kanka tha kanka tha kanka
 tha kanka tha kanka kil kilata so tha ka bchabs kisa to kanil that ka ba can kill kata. this s is nice. 
pat kilata so tha ksant so kaninks thininil pats ps ps nkininill pat pantst t kat kisats panis tha to 
kabcs kil pata tha of kilata ts nkil pa soant tkil kata wanininis tinise. t psn pinis the pa kat kila. 
tkaninil tha ts ka this ka thika so isn pis nkice. panininininininisesnksinil patt pat kilan tst t t ka 
thankss kinill nka. that kil kata. this kis pant that pa of kis ps nkilinks pspa ths n panint tka. this 
ksa tha ois nin pantso tha tha kants kis kinice. pat kilata ts kil pants kininil pats ta that pa o kis 
ps nkininil pats sn ps nkice. panininill pat kilan tininininise. paninise. the t ka. tha the ka.a kil 
katants tha kants th kaba wants the pa o kila. thaso t kas nkice. pa il pants kil pants kilu pant 
kathis so n pants kininilt pat ksath kso tha ka bchat kaba wanto kill kata so that ka bcan kill kata. 
this s kanis inices patt kant ...

Although the MSE gets lower than that of the network without hidden-to-gates, the output is worse. It seems the network overfits after 47 epochs (it becomes unstable and explodes regularly beyond that point).
We really need some smaller-scale tasks. It took me half an hour to train this stuff.

@cazala
Owner

cazala commented Apr 29, 2015

Just a thought: you say that the network with hidden-to-gates connections produces a better output while its MSE is higher... I've read one or two articles stating that for classification tasks like this one it is sometimes more representative to use cross-entropy as the cost function instead of MSE. There's an implementation included in the Trainer (Trainer.cost.CROSS_ENTROPY). You just pass it in the train() options, like this:

trainer.train(trainingSet,{
    rate: .1,
    iterations: 20000,
    error: .005,
    shuffle: true,
    log: 1000,
    cost: Trainer.cost.CROSS_ENTROPY
});

Maybe you could try to use that one and see if the hidden-to-gates network produces a lower error.

Oh, and +1 for the C code generator (:

@menduz
Collaborator

menduz commented Apr 29, 2015

If you have a draft of the C generator, we have some ideas for SIMD SSE optimizations. I think we can achieve an 8x speedup in some scenarios.


@Sleepwalking
Contributor Author

@cazala Thanks for pointing out the cross-entropy criterion :) But the Trainer seems to just measure and display the error, rather than actually changing the error function used in back-propagation (and https://github.com/cazala/synaptic/blob/master/src/neuron.js#L119 has already implicitly adopted the MCE criterion).
We've also seen that the output sample of the h2g network is bad, so I don't think the MCE of the h2g network would outperform the one without h2g.
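For reference, a per-sample binary cross-entropy measure would look something like this (a generic formulation; I haven't verified that Trainer.cost.CROSS_ENTROPY is computed exactly this way):

// Mean binary cross-entropy over the output units for one sample.
// `target` is the one-hot target vector, `output` the network's sigmoid activations.
function crossEntropy(target, output) {
    var eps = 1e-12, sum = 0;
    for (var i = 0; i < target.length; i++) {
        sum -= target[i] * Math.log(output[i] + eps)
             + (1 - target[i]) * Math.log(1 - output[i] + eps);
    }
    return sum / target.length;
}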

@menduz Nice. I'm also aiming for some SIMD, since I saw things like this in the generated code,

 F[132948] = F[132020] * F[132920] * influences[27];
 F[132949] = F[132020] * F[132920] * influences[28];
 F[132950] = F[132020] * F[132920] * influences[29];
 F[132951] = F[65];
 F[132952] = F[132020] * F[132951] * influences[0];
 F[132953] = F[132020] * F[132951] * influences[1];
 F[132954] = F[132020] * F[132951] * influences[2];
 F[132955] = F[132020] * F[132951] * influences[3];
 F[132956] = F[132020] * F[132951] * influences[4];
 F[132957] = F[132020] * F[132951] * influences[5];
 F[132958] = F[132020] * F[132951] * influences[6];
 F[210] += F[49755] * F[239782];
 F[210] += F[49758] * F[239783];
 F[210] += F[49761] * F[239784];
 F[56820] += F[0] * F[210];
 F[210] = F[49674] * F[239786];
 F[210] += F[49677] * F[239787];
 F[210] += F[49680] * F[239788];
 F[210] += F[49683] * F[239789];
 F[210] += F[49686] * F[239790];
 F[210] += F[49689] * F[239791];
 F[210] += F[49692] * F[239792];

which can easily be optimized with packed dot products and element-wise products.

We also need to be cache-friendly. Performance may double for small networks because their binaries fit below 64KB (the size of the L1 instruction cache).
Basically I'm going to do it in the same manner as IB-EE-FFT. I suggest we open a new issue focused on optimization.

@Sleepwalking
Contributor Author

Fresh Test Result on the New Task

I found this task in Herbert Jaeger's A Tutorial on Training Recurrent Neural Networks. The task is about timing and was originally introduced for testing Echo State Networks. I tested two LSTM configurations (with and without hidden-to-gates connections) on it.

Task Description

The task has 2 inputs and 1 output. These two figures from the above tutorial describe the task well:
[figures: timing task description, from Jaeger's tutorial]

LSTM Topology

var myLSTM = new Architect.LSTM(2, 15, 1);
myLSTM.layers.input.set({squash: Neuron.squash.IDENTITY});
myLSTM.layers.output.set({squash: Neuron.squash.IDENTITY});

The network with h2g connections has the following lines in place of peephole connections at https://github.com/cazala/synaptic/blob/master/src/architect.js#L110,

      // hidden-to-gates
      memoryCell.project(inputGate);
      memoryCell.project(outputGate);
      memoryCell.project(forgetGate);

Training Configuration

  • Both networks are exported to C and compiled with clang-3.5 (double precision, default optimization);
  • The generation of the train & test sets, as well as training & testing, is implemented in Octave;
  • Learning rate = 0.03 throughout training;
  • An epoch is defined as a full pass through the whole training set;
  • Since there is only one output, I still measured the MSE instead of MCE. Measurements are taken on both the train set and the test set;
  • Training takes 500 epochs;
  • The train set has 7000 samples and the test set has 1000 samples.

Data set generation

Once the data set is randomly generated, the same data is used to train both networks.

Nt = 8000;                          % total number of time steps (train + test)
Ntest = 1000;
t = 1;
train_input  = zeros(Nt, 2);
train_target = zeros(Nt, 1);
while(t < Nt - 20)
    n = round(rand * 20);           % random duration, 0..20 steps
    train_input(t, 1) = 1;          % trigger pulse on input 1
    train_input(t:t + n, 2) = n / 20;   % input 2 encodes the normalized duration
    train_target(t:t + n) = 0.5;    % target is 0.5 during the rectangle
    t += n;
    n = round(rand * 20);           % random pause before the next pulse
    train_input(t + 1:t + n, 2) = train_input(t, 2);  % input 2 holds its last value
    t += n;
end
Nt -= Ntest;                        % split the tail off as the test set
test_input = train_input(Nt + 1:end, :);
test_target = train_target(Nt + 1:end);
train_input = train_input(1:Nt, :);
train_target = train_target(1:Nt);

Results

LSTM without h2g

To make sure it converges I ran an extra 100 epochs.
[plot: training/test MSE, LSTM without h2g]

Output Sample
[plot: output sample, LSTM without h2g]

LSTM with h2g

[plot: training/test MSE, LSTM with h2g]

Output Sample
[plot: output sample, LSTM with h2g]

Comments

  • The LSTM with h2g achieves a smaller MSE (0.003883) compared to the LSTM without h2g (0.006104).
  • The sample outputs are consistent with the MSE. The output of the LSTM with h2g looks a little better than that of the LSTM without h2g.
  • This test may be unfair because the LSTM with h2g has almost 2x more connections than the LSTM without h2g.
  • Sometimes the MSE converges within 100 epochs, while other times it doesn't converge even after 500 epochs.
  • Gers' implementation includes an intermediate hidden layer between the memory cell layer and the output layer. This hidden layer has one-to-one 1st-order connections from the memory cells and all-to-all 1st-order connections to the output layer. The output gates act as the input gates of the intermediate layer. The h2g connections come from this intermediate layer, not from the memory cells (see the sketch below).
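For clarity, here is a rough sketch of my reading of that topology using synaptic's Layer API (layer sizes and variable names are placeholders; this is not what Architect.LSTM currently builds):

var synaptic = require('synaptic');
var Layer = synaptic.Layer;

var memoryCell   = new Layer(5);  // memory cell layer
var intermediate = new Layer(5);  // Gers' intermediate hidden layer
var outputGate   = new Layer(5);
var outputLayer  = new Layer(1);

// one-to-one 1st-order connections from the memory cells to the intermediate
// layer, gated by the output gates (so they act as the intermediate layer's
// input gates)
var cellToHidden = memoryCell.project(intermediate, Layer.connectionType.ONE_TO_ONE);
outputGate.gate(cellToHidden, Layer.gateType.ONE_TO_ONE);

// all-to-all 1st-order connections from the intermediate layer to the output layer
intermediate.project(outputLayer, Layer.connectionType.ALL_TO_ALL);

// h2g connections would then come from this intermediate layer, not the cells
intermediate.project(outputGate);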

@Sleepwalking
Contributor Author

Sometimes the MSE converges within 100 epochs, while other times it doesn't converge even after 500 epochs.

This is because 15 hidden units are too many for our simple timing task. I tried again with a 2-5-1 topology; here are the new results.

A bug in forward propagation was fixed before this test.
The learning rate is set to 0.1 and decays by a factor of 0.99 each epoch. The other settings remain the same.

LSTM without H2G

[plot: training/test MSE, 2-5-1 LSTM without H2G]

Output Sample
[plot: output sample, 2-5-1 LSTM without H2G]

LSTM with H2G

[plot: training/test MSE, 2-5-1 LSTM with H2G]

Output Sample
[plot: output sample, 2-5-1 LSTM with H2G]

Comments

  • The LSTM with H2G (MSE = 0.001467) performs significantly better than the LSTM without H2G (MSE = 0.006556);
  • Large networks easily end up in a local minimum during training, which is why 2-5-1 can outperform 2-15-1;
  • The C code for the LSTM with H2G is 28.8KB; the C code for the LSTM without H2G is 17.7KB. The difference in size becomes smaller for small networks, so the comparison is fairer - in fact, the 2-5-1 LSTM with H2G achieves the best result so far on our timing task.

@Sleepwalking
Contributor Author

The following experiment shows how an LSTM with H2G can accomplish a task that an LSTM without H2G cannot.

Timing Task 2

  • The first input is a series of pulses whose heights indicate the duration of each rectangular signal;
  • The second input is set to all zeros, to keep the size of the networks comparable to the previous experiment;
  • The output remains the same.

LSTM without H2G

The "and" in the title means the network still has parallel activation sequence. Sorry for this. I'm not native English speaker.
[plot: training/test MSE, LSTM without H2G]

Output Sample
[plot: output sample, LSTM without H2G]

LSTM with H2G

[plot: training/test MSE, LSTM with H2G]

Output Sample
[plot: output sample, LSTM with H2G]

@cazala
Owner

cazala commented Apr 30, 2015

That's impressive. Do you think we should replace the current LSTM topology with the H2G one? We should also take into account that the latter has more connections. Do you think that if we compared H2G against an LSTM with a similar number of connections (obtained by adding more hidden units), it would still outperform it in the same way?

Note: an easy way to get the number of neurons/connections (this is not documented) is to call Neuron.quantity(). It returns something like {neurons: 680, connections: 11430}.

The count is global (across all the networks).
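For example (the counts below are illustrative):

var synaptic = require('synaptic');
var Neuron = synaptic.Neuron;
var Architect = synaptic.Architect;

var lstm = new Architect.LSTM(2, 15, 1);
console.log(Neuron.quantity());  // e.g. { neurons: 680, connections: 11430 }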

@Sleepwalking
Contributor Author

@cazala The results above already show that the 2-5-1 H2G (28KB) outperformed the 2-15-1 without H2G (70KB).
I suggest we add an option to Architect.LSTM and let users choose whether to include H2G connections :)
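Something along these lines, perhaps (the option name hiddenToGates and the extra constructor argument are only a suggestion, not existing API):

// Inside the Architect.LSTM constructor, in place of the current peephole block:
if (options && options.hiddenToGates) {
    // hidden-to-gates: every memory cell projects to every gate
    memoryCell.project(inputGate);
    memoryCell.project(forgetGate);
    memoryCell.project(outputGate);
} else {
    // current default: one-to-one peephole connections
    memoryCell.project(inputGate, Layer.connectionType.ONE_TO_ONE);
    memoryCell.project(forgetGate, Layer.connectionType.ONE_TO_ONE);
    memoryCell.project(outputGate, Layer.connectionType.ONE_TO_ONE);
}

Usage would then be something like new Architect.LSTM(2, 5, 1, { hiddenToGates: true }).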

@cazala
Owner

cazala commented Apr 30, 2015

👍 maybe h2g should even be the default topology, with one-to-one peepholes as the option

@Sleepwalking
Contributor Author

[plot: output sample on the test set]
The best-looking result I've gotten on the test set for Timing Task 2 so far: a 2-7-1 topology with hidden-to-hidden and hidden-to-gate connections.
