
Normalize rule50 plane #602

Closed
mooskagh opened this issue May 14, 2018 · 23 comments

@mooskagh
Contributor

This has been discussed for a long time; I haven't been able to find the original issue.

While all other input planes contain only values of 0 and 1, the rule50 plane can hold values from 0 to 99.
That causes the weights for that plane to be hugely inflated (it's not immediately clear to me why they end up roughly 50x larger rather than smaller, but it does seem that's how it should be), and it may be a reason for the current training problems.

We need to normalize that plane.

@Tilps
Contributor

Tilps commented May 14, 2018

I find it extremely unlikely this has anything to do with the current training problems, since we've had this problem forever.
I do think it's worth addressing for the points raised (it doesn't play well with SGD), but given that it's a weight-versioning event, and the last weight-versioning event didn't exactly go perfectly, I think we should wait to try this until we've finished setting up the details of the new training pipeline.

@trebe

trebe commented May 14, 2018

Yes, I really see this as a bug.

My idea about what is happening:

Compare the current situation to the correct method of using plycount/100 as input.

During learning, the inputs are 100x higher than they should be, so the gradients on the input (lowest) weights are 100x what they should be.
On the other hand, the optimal weight matrix is now expected to be 0.01x the weight for the normalized input.
So we have two factors going in opposite directions: learning happens 100x faster, but the resulting weights are expected to be 100x smaller. That totals a factor of 10,000 of wrongness. Effectively, the learning rate for this input layer is 10,000 times bigger than it should be.

Also, L2 regularization is less effective with larger-range inputs.
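
A minimal numpy sketch of that argument, assuming a single linear unit trained with SGD on squared error (the target weight and data here are made up, not anything from the real pipeline): the gradient with respect to an input weight is proportional to the input value, while the weight the optimizer is aiming for scales inversely with it.

import numpy as np

np.random.seed(0)
target_w = 0.5                             # hypothetical "true" weight for the normalized input

x_norm = np.random.uniform(0, 1, 1000)     # normalized rule50 input, 0..1
x_raw = x_norm * 100                       # raw rule50 input, 0..99
y = target_w * x_norm                      # targets expressed against the normalized input

def gradient(w, x):
    # d/dw of the mean squared error (w*x - y)^2, evaluated at weight w
    return np.mean(2 * (w * x - y) * x)

print(gradient(0.0, x_raw) / gradient(0.0, x_norm))   # ~100: the gradient is 100x larger
# ...while the weight the optimizer is aiming for is target_w/100, i.e. 100x smaller,
# so relative to its optimum this weight is effectively trained with a ~10,000x larger step.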

@Error323
Collaborator

I'm gonna fix this.

@Error323 Error323 self-assigned this May 14, 2018
@mooskagh
Contributor Author

It should be fixed both during training and in the engines. I think everywhere it can just be val /= 100; before feeding it into the network (so the training data format won't change).
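
For illustration only, a hedged numpy sketch of that normalization on a batch of input planes; the [batch, 112, 8, 8] layout and the 0-based plane index 109 (plane 110 of 112, as used by the weight-file scripts later in this thread) are assumptions, not the actual pipeline code.

import numpy as np

RULE50_PLANE = 109   # assumed 0-based index of the rule50 plane (plane 110 of 112)

def normalize_rule50(planes):
    # planes: float array [batch, 112, 8, 8] holding raw 0..99 halfmove counters
    planes = planes.copy()
    planes[:, RULE50_PLANE, :, :] /= 100.0
    return planes

batch = np.zeros((1, 112, 8, 8), dtype=np.float32)
batch[0, RULE50_PLANE] = 37                # e.g. 37 plies since the last capture or pawn move
print(normalize_rule50(batch)[0, RULE50_PLANE, 0, 0])   # 0.37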

@Graham853

Just a thought: at some point, there will be a version of Leela that uses int8 ops on a GPU. At that point you might wish you'd divided by 64 or 128 not 100.

@mooskagh
Contributor Author

mooskagh commented May 14, 2018 via email

@takacsg84

Was the insertion of a batch normalization layer ever considered?

@Tilps
Contributor

Tilps commented May 14, 2018

To keep my thoughts closer to this issue: we could potentially skip the pain of a client release and weights versioning by just multiplying the rule50-associated weights by 100 on import to initialize training, and then dividing them by 100 on export after training.
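
A rough sketch of what that rescale could look like, assuming the text weight-file layout used by the scripts later in this thread (line 2 holds the first convolution's weights, 9 kernel values per input plane, 112 planes, rule50 being plane 110); the function name is hypothetical.

import numpy as np

NUM_PLANES = 112
RULE50_PLANE = 110   # 1-based plane index, assumed

def scale_rule50_weights(first_conv_weights, factor):
    # first_conv_weights: flat list of floats from line 2 of the weights file,
    # laid out as [out_channel][input_plane][3x3 kernel]
    w = np.asarray(first_conv_weights, dtype=np.float64).reshape(-1, NUM_PLANES, 9)
    w[:, RULE50_PLANE - 1, :] *= factor
    return w.reshape(-1).tolist()

# On import, so training can feed rule50/100 without changing the net's output:
#   weights = scale_rule50_weights(weights, 100.0)
# On export, so unmodified clients feeding raw 0..99 counters still work:
#   weights = scale_rule50_weights(weights, 1.0 / 100.0)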

@so-much-meta

so-much-meta commented May 15, 2018

Although it's clearly less of an issue, the mean value of the pawn planes is going to be quite a bit larger than the mean value of the other piece planes. If trebe's logic is sound, then maybe this is more problematic than it would seem at first glance. Or maybe the batch normalization applied after these planes helps temper the effect?

@so-much-meta

so-much-meta commented May 16, 2018

FYI - I posted an animation of all of the input planes' weights (mean and standard deviation) here: https://groups.google.com/forum/#!topic/lczero/clkvp2y3APk

I included many of the weights files from ID 199 through ID 297.

tl;dr summary of the post: red = mean, blue = std deviation. The mean is plotted scaled as (mean*10 - 0.025). The horizontal axis is the input channel.
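
For reference, a sketch (not so-much-meta's actual code) of how per-input-channel mean and std of the first convolution's weights can be pulled out of a text weights file, using the same layout assumptions as the scripts below; the (mean*10 - 0.025) scaling mirrors the plot description above.

import sys
import numpy as np

with open(sys.argv[1]) as f:
    lines = [[float(x) for x in line.split()] for line in f]

channels = len(lines[2])                          # first conv biases, one per output channel
w = np.array(lines[1]).reshape(channels, -1, 9)   # [out_channel, input_plane, 3x3 kernel]
w = w.transpose(1, 0, 2)                          # [input_plane, out_channel, 3x3 kernel]

for plane, pw in enumerate(w, start=1):
    m, s = pw.mean(), pw.std()
    print("%3d  mean=%9.6f (plotted as %9.6f)  std=%9.6f" % (plane, m, m * 10 - 0.025, s))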

@borg323

borg323 commented May 24, 2018

Similar to Tilps' idea, I tested an awk script (based on one given by Gyathaar on Discord) that divides the rule50 weights by 99 directly in the weights file. This gives output identical to the patched lczero in all of my tests, but allows the normalization to be done using only the server-side code in #605.
awk 'BEGIN{ plane=110; numplanes=112; } {i++; if(i != 2)print $0; else { split($0,vals," "); line = ""; for (i in vals){ if(int(((i-1)%(9*numplanes))/9) == (plane-1)) {line = line " " vals[i]/99.0; } else line=line " " vals[i];} print substr(line,2);}}' < weights_in.txt > weights_out.txt
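
For readability, a hedged Python equivalent of that awk one-liner (same assumed layout: line 2 of the weights file holds the first conv weights, 9 values per input plane, 112 planes, rule50 = plane 110); the output number formatting will differ slightly from awk's default.

import sys

NUM_PLANES, RULE50_PLANE = 112, 110

# Takes the input and output weight file paths as arguments.
with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    for lineno, line in enumerate(fin, start=1):
        if lineno != 2:                        # only line 2 holds the first conv weights
            fout.write(line)
            continue
        vals = [float(v) for v in line.split()]
        for i, v in enumerate(vals):
            # 9 kernel values per input plane; rescale only the rule50 plane's weights
            if (i % (9 * NUM_PLANES)) // 9 == RULE50_PLANE - 1:
                vals[i] = v / 99.0
        fout.write(" ".join(repr(v) for v in vals) + "\n")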

@killerducky
Collaborator

Some data on the main pipeline trying to recover from rule50. I took a script someone else made and printed np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y) for the last net before r50 normalization was turned on in training, and for a few nets after. I did the same for scs-ben's bootstrap net. The size of the weights is not dropping very fast, and the bootstrap net's weights are much smaller: the mean is 142x smaller. Maybe the r50 weights in the main pipeline will start to accelerate downward as the window fills again? Or maybe it will take a very long time for them to recover?

Source for bootstrap net:
http://webphactory.net/lczero/bootstrap-192x15-270000.txt.gz

Results:

                          np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)
id378.txt                   0.788429    0.0142106   0.00364672  0.127805    -0.0798043
id380.txt                   0.789642    0.0142167   0.00363392  0.127904    -0.079843
id381.txt                   0.788659    0.0142085   0.00363179  0.127807    -0.0797948
bootstrap-192x15-270000.txt 3.13604e-05 0.000106134 -8.50894e-06    0.000510617 -0.000546426

Script:

# Takes the path to a text weights file as its only argument.
import sys
import numpy as np

lines = []
with open(sys.argv[1]) as f:
    # Each line of the weights file becomes a list of floats.
    for line in f.readlines():
        lines.append([float(x) for x in line.split()])

    # Only needed for the commented-out full dumps below
    # (newer numpy wants sys.maxsize here instead of np.nan).
    np.set_printoptions(threshold=np.nan, linewidth=270)

    # Line 3 holds the first conv biases, one per output channel.
    channels = len(lines[2])

    # Line 2 holds the first conv weights: [out_channel][input_plane][3x3 kernel],
    # transposed here to [input_plane][out_channel][3x3 kernel].
    x = np.array(lines[1]).reshape([channels, -1, 9]).transpose((1, 0, 2))
    #print(x)

    #for y in x:
    #    print("%g\t%g\t%g\t%g" % (np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)))

    # Rule50 is input plane 110 (1-based).
    y = x[110-1]
    print("%g\t%g\t%g\t%g\t%g" % (np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)))

@killerducky
Collaborator

Also, a Google sheet of two positions that seem to be r50 related, showing wild swings from net to net in the N% of the correct move:

https://docs.google.com/spreadsheets/d/1zIyjOP0665S4sRqwVvh-VTRDkUEHxSAhDfGxjFvgYK8/edit#gid=449522518

@Tilps
Contributor

Tilps commented Jun 6, 2018

So I did some calculations; it appears the rule50 weights are probably moving downwards at about 50% of their maximum possible rate given the current LR.
But increasing the LR doesn't look like a real option right now, especially not by the ~100x that would really be needed to make quick progress.
I don't see any good options. Changing the regularization weight would likely damage all the other weights, the ones that aren't ridiculously huge. Adding a separate regularization term for just the rule50 weights seems complicated, and with the current low LR it's probably not going to learn fast enough to compensate for the weight loss. Similarly, adjusting the weights in chunks between nets: the LR is probably not high enough to compensate.
I think we're probably best served by a bootstrap transition once there is an option to transition to.
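
A rough, hedged back-of-envelope of that "maximum possible rate": with plain SGD, L2 regularization alone shrinks a weight geometrically each step, w <- w*(1 - 2*lr*c), so shrinking by ~100x takes about ln(100)/(2*lr*c) steps. The lr and c values below are made-up placeholders, not the pipeline's actual settings.

import math

lr = 0.0005          # hypothetical learning rate
c = 0.0001           # hypothetical L2 regularization coefficient

shrink_per_step = 1 - 2 * lr * c           # pure-regularization decay factor per step
steps_for_100x = math.log(100) / -math.log(shrink_per_step)
print("steps for a ~100x shrink at the maximum rate: %.2e" % steps_for_100x)
# If the weights are actually moving at ~50% of that rate because optimization
# pushes back, roughly double the estimate.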

@killerducky
Collaborator

For reference, this issue is addressed by LeelaChessZero/lczero-training#3. We are using this issue to track recovery from the problem. As the posts above show, recovery is going to be very slow on the current main pipeline.

@dubslow
Contributor

dubslow commented Jun 6, 2018

On the face of it, "Changing regularization weight would likely damage all the other weights" doesn't seem true a priori to me. It certainly worked just fine for the value head. What's the rationale for that sentence?

(Even so, I agree a re-bootstrap is a cleaner and more effective method regardless.)

@Tilps
Contributor

Tilps commented Jun 6, 2018

What we tried, and had work, was decreasing only the value head weight. If we decrease both, everything across the whole net has less optimization power. Decreasing only the value head weight mostly confines the relatively stronger regularization to the value-head-related weights, which was fine because those weights were in extremely bad shape.
I'm not saying it's definitely wrong to try, but I wouldn't want to do it on primary training without testing it on a secondary training first.

@dubslow
Contributor

dubslow commented Jun 6, 2018

It's not that everywhere has less optimization power, it's that everywhere has different optimization power.

And the current network continues to be in bad shape (if not extremely bad shape), though the most recent LR reduction will help quite a bit I think.

@Tilps
Contributor

Tilps commented Jun 6, 2018

A few points to consider.

  1. Since the last time we experimented with the lower value weight, we've decreased our LR. To get the r50 weights reduced in a reasonable time frame, we don't need a 4x difference between optimization and regularization; we need more like 50x.
  2. Regularization and output optimization are opposing forces: regularization only wants to do one thing, reduce weights. Without output optimization to oppose it, weights will free-fall. Suddenly changing to a 50x different ratio between regularization and optimization seems pretty likely to cause a global weight free-fall until they find their new balance. It's possible the residual stack will be resilient to this, since it should affect all weights somewhat equally and BN is applied repeatedly down the stack, but I'm less confident in the ability of the policy and value head outputs to remain even vaguely stable under such conditions.
  3. We wouldn't want this to be a long-term change, as such a large ratio seems likely to directly inhibit learning speed, so we would then have to change it back, and I'm very uncertain what that would do...

For these reasons I stand by my previous thought: we don't want to just try this out on main training without first doing some decent test runs on an outside training.
(This may also all be moot soon since talk of restart is growing...)

@dubslow
Contributor

dubslow commented Jun 6, 2018

I never said a 50x shift in the ratio :) just a 2-4x shift, like both NN head weights to 0.5 or something.

And yeah, I agree it's pretty much moot in light of the recent conversation. I guess this will be my last post on this particular topic.

@Tilps
Contributor

Tilps commented Jun 6, 2018

Just to round out this discussion then: a 4x ratio, with the LR also increased by 4x, would give a regularization rate of change implying in excess of 500 nets' worth of training to get the regularization term down to pre-normalization levels. (I'm pretty confident of that, based on observing the slope change when I reduced the LR by 2.5x.) Hence why I was talking about needing closer to 50x: that is what would be needed to get the recovery time down to a few weeks rather than closer to six months.

@dubslow
Contributor

dubslow commented Jun 6, 2018

I guess we were addressing different objectives then. I was only talking about what would produce good short-term strength benefits, not "how long until the weights totally normalize". Oh well.

@killerducky
Collaborator

The main pipeline bootstrap net has fixed the rule50 issues. The test pipeline is going well. Closing this issue.
