
Normalize rule50 plane #602

Closed
mooskagh opened this issue May 14, 2018 · 23 comments

@mooskagh
Contributor

This has been discussed for a long time; I haven't been able to find the original issue.

While all other input planes contain only values of 0 and 1, the rule50 plane can hold values from 0 to 99.
That causes the weights for that plane to be hugely inflated (it's not immediately clear to me why they end up roughly 50x larger rather than smaller, but it does seem that's how it should be), and it may be a reason for the current training problems.

We need to normalize that plane.

@Tilps
Contributor

Tilps commented May 14, 2018

I find it extremely unlikely this has anything to do with the current training problems, since we've had this problem forever.
I do think it's worth addressing for the points raised (it doesn't play well with SGD), but given that it's a weight-versioning event, and the last weight-versioning event didn't exactly go perfectly, I think we should wait to try this until we've finished setting up the details of the new training pipeline.

@trebe

trebe commented May 14, 2018

Yes, I really see this as a bug.

My idea about what is happening:

Compare the current situation to the correct method of using plycount/100 as input.

During learning, the inputs are 100x higher than they should be, so the gradients on the input (lowest) weights are 100x what they should be.
On the other hand, the optimal weight matrix is now expected to be 0.01x the weight for the normalized input.
So we have two factors going in opposite directions: learning happens 100x faster, but the resulting weights are expected to be 100x smaller. That totals a factor of 10,000 of wrongness. Effectively, the learning rate for this input layer is 10,000 times bigger than it should be.

Also, L2 regularization is less effective with larger-range inputs.
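
A minimal numpy sketch of that argument, assuming a single linear unit trained with SGD on squared error (the target weight and data here are made up, not anything from the real pipeline): the gradient with respect to an input weight is proportional to the input value, while the weight the optimizer is aiming for scales inversely with it.

import numpy as np

np.random.seed(0)
target_w = 0.5                             # hypothetical "true" weight for the normalized input

x_norm = np.random.uniform(0, 1, 1000)     # normalized rule50 input, 0..1
x_raw = x_norm * 100                       # raw rule50 input, 0..99
y = target_w * x_norm                      # targets expressed against the normalized input

def gradient(w, x):
    # d/dw of the mean squared error (w*x - y)^2, evaluated at weight w
    return np.mean(2 * (w * x - y) * x)

print(gradient(0.0, x_raw) / gradient(0.0, x_norm))   # ~100: the gradient is 100x larger
# ...while the weight the optimizer is aiming for is target_w/100, i.e. 100x smaller,
# so relative to its optimum this weight is effectively trained with a ~10,000x larger step.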

@Error323
Collaborator

I'm gonna fix this.

@Error323 Error323 self-assigned this May 14, 2018
@mooskagh
Contributor Author

It should be fixed both during training and in the engines. I think everywhere it can just be val /= 100; before feeding it into the network (so the training data format won't change).
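
For illustration only, a hedged numpy sketch of that normalization on a batch of input planes; the [batch, 112, 8, 8] layout and the 0-based plane index 109 (plane 110 of 112, as used by the weight-file scripts later in this thread) are assumptions, not the actual pipeline code.

import numpy as np

RULE50_PLANE = 109   # assumed 0-based index of the rule50 plane (plane 110 of 112)

def normalize_rule50(planes):
    # planes: float array [batch, 112, 8, 8] holding raw 0..99 halfmove counters
    planes = planes.copy()
    planes[:, RULE50_PLANE, :, :] /= 100.0
    return planes

batch = np.zeros((1, 112, 8, 8), dtype=np.float32)
batch[0, RULE50_PLANE] = 37                # e.g. 37 plies since the last capture or pawn move
print(normalize_rule50(batch)[0, RULE50_PLANE, 0, 0])   # 0.37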

@Graham853

Just a thought: at some point, there will be a version of Leela that uses int8 ops on a GPU. At that point you might wish you'd divided by 64 or 128 not 100.

@mooskagh
Contributor Author

mooskagh commented May 14, 2018 via email

@takacsg84

Was the insertion of a batch normalization layer ever considered?

@Tilps
Contributor

Tilps commented May 14, 2018

To keep my thoughts closer to this issue: we could potentially skip the pain of a client release and weights versioning by just multiplying the rule50-associated weights by 100 on import to initialize training, and then dividing them by 100 on export after training.
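
A rough sketch of what that rescale could look like, assuming the text weight-file layout used by the scripts later in this thread (line 2 holds the first convolution's weights, 9 kernel values per input plane, 112 planes, rule50 being plane 110); the function name is hypothetical.

import numpy as np

NUM_PLANES = 112
RULE50_PLANE = 110   # 1-based plane index, assumed

def scale_rule50_weights(first_conv_weights, factor):
    # first_conv_weights: flat list of floats from line 2 of the weights file,
    # laid out as [out_channel][input_plane][3x3 kernel]
    w = np.asarray(first_conv_weights, dtype=np.float64).reshape(-1, NUM_PLANES, 9)
    w[:, RULE50_PLANE - 1, :] *= factor
    return w.reshape(-1).tolist()

# On import, so training can feed rule50/100 without changing the net's output:
#   weights = scale_rule50_weights(weights, 100.0)
# On export, so unmodified clients feeding raw 0..99 counters still work:
#   weights = scale_rule50_weights(weights, 1.0 / 100.0)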

@so-much-meta

so-much-meta commented May 15, 2018

Although it's clearly less of an issue, the mean value of the pawn planes is going to be quite a bit larger than the mean value of the other piece planes. If trebe's logic is sound, then maybe this is more problematic than it would seem at first glance. Or maybe the batch normalization applied after these planes helps temper the effect?

@so-much-meta

so-much-meta commented May 16, 2018

FYI - I posted an animation of all of the input planes' weights (mean and standard deviation) here: https://groups.google.com/forum/#!topic/lczero/clkvp2y3APk

I included many of the weights files from ID 199 through ID 297.

tl;dr summary of the post: red = mean, blue = std deviation. The mean is plotted scaled as (mean*10 - 0.025). The horizontal axis is the input channel.
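
For reference, a sketch (not so-much-meta's actual code) of how per-input-channel mean and std of the first convolution's weights can be pulled out of a text weights file, using the same layout assumptions as the scripts below; the (mean*10 - 0.025) scaling mirrors the plot description above.

import sys
import numpy as np

with open(sys.argv[1]) as f:
    lines = [[float(x) for x in line.split()] for line in f]

channels = len(lines[2])                          # first conv biases, one per output channel
w = np.array(lines[1]).reshape(channels, -1, 9)   # [out_channel, input_plane, 3x3 kernel]
w = w.transpose(1, 0, 2)                          # [input_plane, out_channel, 3x3 kernel]

for plane, pw in enumerate(w, start=1):
    m, s = pw.mean(), pw.std()
    print("%3d  mean=%9.6f (plotted as %9.6f)  std=%9.6f" % (plane, m, m * 10 - 0.025, s))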

@borg323

borg323 commented May 24, 2018

Similar to Tilps' idea, I tested an awk script (based on one given by Gyathaar on Discord) that divides the rule50 weights by 99 directly in the weights file. This gives output identical to the patched lczero in all of my tests, but allows the normalization to be done using only the server-side code in #605.
awk 'BEGIN{ plane=110; numplanes=112; } {i++; if(i != 2)print $0; else { split($0,vals," "); line = ""; for (i in vals){ if(int(((i-1)%(9*numplanes))/9) == (plane-1)) {line = line " " vals[i]/99.0; } else line=line " " vals[i];} print substr(line,2);}}' < weights_in.txt > weights_out.txt
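
For readability, a hedged Python equivalent of that awk one-liner (same assumed layout: line 2 of the weights file holds the first conv weights, 9 values per input plane, 112 planes, rule50 = plane 110); the output number formatting will differ slightly from awk's default.

import sys

NUM_PLANES, RULE50_PLANE = 112, 110

# Takes the input and output weight file paths as arguments.
with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    for lineno, line in enumerate(fin, start=1):
        if lineno != 2:                        # only line 2 holds the first conv weights
            fout.write(line)
            continue
        vals = [float(v) for v in line.split()]
        for i, v in enumerate(vals):
            # 9 kernel values per input plane; rescale only the rule50 plane's weights
            if (i % (9 * NUM_PLANES)) // 9 == RULE50_PLANE - 1:
                vals[i] = v / 99.0
        fout.write(" ".join(repr(v) for v in vals) + "\n")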

@killerducky
Collaborator

Some data on the main pipeline trying to recover from rule50. I took a script someone else made and printed np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y) for the last net before r50 normalization was turned on in training, and for a few nets after. I did the same for scs-ben's bootstrap net. The size of the weights is not dropping very fast, and the bootstrap net's weights are much smaller: the mean is 142x smaller. Maybe the r50 weights in the main pipeline will start to accelerate downward as the window fills again? Or maybe it will take a very long time for them to recover?

Source for bootstrap net:
http://webphactory.net/lczero/bootstrap-192x15-270000.txt.gz

Results:

                          np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)
id378.txt                   0.788429    0.0142106   0.00364672  0.127805    -0.0798043
id380.txt                   0.789642    0.0142167   0.00363392  0.127904    -0.079843
id381.txt                   0.788659    0.0142085   0.00363179  0.127807    -0.0797948
bootstrap-192x15-270000.txt 3.13604e-05 0.000106134 -8.50894e-06    0.000510617 -0.000546426

Script:

# Takes the path to a text weights file as its only argument.
import sys
import numpy as np

lines = []
with open(sys.argv[1]) as f:
    # Each line of the weights file becomes a list of floats.
    for line in f.readlines():
        lines.append([float(x) for x in line.split()])

    # Only needed for the commented-out full dumps below
    # (newer numpy wants sys.maxsize here instead of np.nan).
    np.set_printoptions(threshold=np.nan, linewidth=270)

    # Line 3 holds the first conv biases, one per output channel.
    channels = len(lines[2])

    # Line 2 holds the first conv weights: [out_channel][input_plane][3x3 kernel],
    # transposed here to [input_plane][out_channel][3x3 kernel].
    x = np.array(lines[1]).reshape([channels, -1, 9]).transpose((1, 0, 2))
    #print(x)

    #for y in x:
    #    print("%g\t%g\t%g\t%g" % (np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)))

    # Rule50 is input plane 110 (1-based).
    y = x[110-1]
    print("%g\t%g\t%g\t%g\t%g" % (np.sum(y*y), np.mean(np.abs(y)), np.mean(y), np.max(y), np.min(y)))

@killerducky
Collaborator

Also, a Google sheet of two positions that seem to be r50 related, showing wild swings from net to net in the N% of the correct move:

https://docs.google.com/spreadsheets/d/1zIyjOP0665S4sRqwVvh-VTRDkUEHxSAhDfGxjFvgYK8/edit#gid=449522518

@Tilps
Contributor

Tilps commented Jun 6, 2018

So I did some calculations; it appears the rule50 weights are probably moving downwards at about 50% of their maximum possible rate given the current LR.
But increasing the LR doesn't look like a real option right now, especially not by the ~100x that would really be needed to make quick progress.
I don't see any good options. Changing the regularization weight would likely damage all the other weights, the ones that aren't ridiculously huge. Adding a separate regularization term for just the rule50 weights seems complicated, and with the current low LR it's probably not going to learn fast enough to compensate for the weight loss. Similarly, adjusting the weights in chunks between nets: the LR is probably not high enough to compensate.
I think we're probably best served by a bootstrap transition once there is an option to transition to.
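
A rough, hedged back-of-envelope of that "maximum possible rate": with plain SGD, L2 regularization alone shrinks a weight geometrically each step, w <- w*(1 - 2*lr*c), so shrinking by ~100x takes about ln(100)/(2*lr*c) steps. The lr and c values below are made-up placeholders, not the pipeline's actual settings.

import math

lr = 0.0005          # hypothetical learning rate
c = 0.0001           # hypothetical L2 regularization coefficient

shrink_per_step = 1 - 2 * lr * c           # pure-regularization decay factor per step
steps_for_100x = math.log(100) / -math.log(shrink_per_step)
print("steps for a ~100x shrink at the maximum rate: %.2e" % steps_for_100x)
# If the weights are actually moving at ~50% of that rate because optimization
# pushes back, roughly double the estimate.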

@killerducky
Collaborator

For reference, this issue is addressed by LeelaChessZero/lczero-training#3. We are using this issue to track recovery from the problem. As the posts above show, recovery is going to be very slow on the current main pipeline.

@dubslow
Contributor

dubslow commented Jun 6, 2018

On the face of it, "Changing regularization weight would likely damage all the other weights" doesn't seem true a priori to me. It certainly worked just fine for the value head. What's the rationale for that sentence?

(Even so, I agree a re-bootstrap is a cleaner and more effective method regardless.)

@Tilps
Contributor

Tilps commented Jun 6, 2018

What we tried, and had work, was decreasing only the value head weight. If we decrease both, everything across the whole net has less optimization power. Decreasing only the value head weight mostly confines the relatively stronger regularization to the value-head-related weights, which was fine because those weights were in extremely bad shape.
I'm not saying it's definitely wrong to try, but I wouldn't want to do it on primary training without testing it on a secondary training first.

@dubslow
Contributor

dubslow commented Jun 6, 2018

It's not that everywhere has less optimization power, it's that everywhere has different optimization power.

And the current network continues to be in bad shape (if not extremely bad shape), though the most recent LR reduction will help quite a bit I think.

@Tilps
Contributor

Tilps commented Jun 6, 2018

A few points to consider.

  1. Since the last time we experimented with the lower value weight, we've decreased our LR. To get the r50 weights reduced in a reasonable time frame, we don't need a 4x difference between optimization and regularization; we need more like 50x.
  2. Regularization and output optimization are opposing forces: regularization only wants to do one thing, reduce weights. Without output optimization to oppose it, weights will free-fall. Suddenly changing to a 50x different ratio between regularization and optimization seems pretty likely to cause a global weight free-fall until they find their new balance. It's possible the residual stack will be resilient to this, since it should affect all weights somewhat equally and BN is applied repeatedly down the stack, but I'm less confident in the ability of the policy and value head outputs to remain even vaguely stable under such conditions.
  3. We wouldn't want this to be a long-term change, as such a large ratio seems likely to directly inhibit learning speed, so we would then have to change it back, and I'm very uncertain what that would do...

For these reasons I stand by my previous thought: we don't want to just try this out on main training without first doing some decent test runs on an outside training.
(This may also all be moot soon since talk of restart is growing...)

@dubslow
Contributor

dubslow commented Jun 6, 2018

I never said a 50x shift in the ratio :) just a 2-4x shift, like both NN head weights to 0.5 or something.

And yeah, I agree it's pretty much moot in light of the recent conversation. I guess this will be my last post on this particular topic.

@Tilps
Contributor

Tilps commented Jun 6, 2018

Just to round out this discussion then: a 4x ratio, with the LR also increased by 4x, would give a regularization rate of change implying in excess of 500 nets' worth of training to get the regularization term down to pre-normalization levels. (I'm pretty confident of that, based on observing the slope change when I reduced the LR by 2.5x.) Hence why I was talking about needing closer to 50x: that is what would be needed to get the recovery time down to a few weeks rather than closer to six months.

@dubslow
Contributor

dubslow commented Jun 6, 2018

I guess we were addressing different objectives then. I was only talking about what would produce good short-term strength benefits, not "how long until the weights totally normalize". Oh well.

@killerducky
Collaborator

The main pipeline bootstrap net has fixed the rule50 issues. The test pipeline is going well. Closing this issue.
