Normalize rule50 plane #602
I find it extremely unlikely this has anything to do with the current training problems, as we've had this problem forever.
Yes, I really see this as a bug. My idea about what is happening: compare the current situation to the correct method of using plycount/100 as input. During learning, the inputs are 100x higher than they should be, so the gradients on the input (lowest) weights are 100x what they should be. Also, the L2 regularization is less effective with bigger-range inputs.
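To make the scaling argument concrete, here is a toy single-weight example (not lczero code): if the raw 0-99 counter is fed instead of plycount/100, a first-layer weight that compensates by being 100x smaller produces the identical output and loss, yet its gradient is still 100x larger, because the gradient with respect to that weight carries the input value as a factor.

```python
import numpy as np

# Toy illustration: one input feature feeding one linear unit.
# If the raw feature is 100x bigger and the weight is 100x smaller,
# the unit's output (and hence the loss) is unchanged, but the gradient
# d(loss)/dw = d(loss)/d(output) * x still scales with the raw input x.

def loss_and_grad(w, x, target):
    out = w * x
    loss = (out - target) ** 2
    dloss_dout = 2.0 * (out - target)
    dloss_dw = dloss_dout * x          # the gradient carries the input scale
    return loss, dloss_dw

# Same effective function, two input scalings:
loss_a, grad_a = loss_and_grad(w=0.5,   x=0.99, target=0.2)  # plycount/100
loss_b, grad_b = loss_and_grad(w=0.005, x=99.0, target=0.2)  # raw plycount

print(loss_a, loss_b)      # essentially identical losses
print(grad_b / grad_a)     # 100.0: gradient on the weight is 100x larger
```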
I'm gonna fix this.
It should be fixed during training and in engines. I think everywhere it can be just
Just a thought: at some point, there will be a version of Leela that uses int8 ops on a GPU. At that point you might wish you'd divided by 64 or 128, not 100.
I think that's not how uint8 works. The GPU maps from uint8 to float with a k*X + b formula and keeps k internally (b is often not kept and is just 0). That way it's not really necessary that X is a power of 2; as long as the mapping is linear it should work.
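A minimal sketch of the affine mapping described above (illustrative only, not tied to any particular GPU or inference API): the stored uint8 value X is dequantized as k*X + b, and k can be any float, so normalizing by 100 rather than a power of 2 poses no problem for int8 inference.

```python
import numpy as np

def quantize(values, k, b=0.0):
    """Map floats to uint8 with the affine scheme X = round((v - b) / k)."""
    return np.clip(np.round((values - b) / k), 0, 255).astype(np.uint8)

def dequantize(x_uint8, k, b=0.0):
    """Recover approximate floats as v ~= k * X + b."""
    return k * x_uint8.astype(np.float32) + b

# Rule50 inputs normalized to [0, 0.99]; k = 0.99/255 is nowhere near a power of 2.
plane = np.array([0, 17, 49, 99], dtype=np.float32) / 100.0
k = plane.max() / 255.0
q = quantize(plane, k)
print(dequantize(q, k))   # close to the original normalized values
```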
Was the insertion of a batch normalization layer ever considered?
To keep my thoughts closer to this issue:
Although clearly less of an issue, the mean value of the pawn planes is going to be quite a bit larger than the mean value of the other piece planes. If trebe's logic is sound, then maybe this is more problematic than it would seem at first glance. Or maybe batch normalization on the proceeding planes helps temper these effects?
FYI - I posted an animation of all of the input planes' weights (mean and standard deviation) here: https://groups.google.com/forum/#!topic/lczero/clkvp2y3APk I included many of the weights files from ID 199 through ID 297. tl;dr summary of the email: red = mean, blue = std deviation. The mean is scaled as (mean*10 - 0.025). The horizontal axis is the input channel.
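For reference, per-input-channel statistics like these could be computed roughly as sketched below. The format details are assumptions, not verified here: the old lczero text weights file with the version on the first line and the input convolution flattened as [output][input][3][3] on the second, and a NUM_INPUT_PLANES value that matches the net being inspected.

```python
import sys
import numpy as np

# Assumed format: line 1 = version, line 2 = input conv weights flattened as
# [output_channels][input_channels][3][3]. NUM_INPUT_PLANES is an assumption
# and must match the net being inspected.
NUM_INPUT_PLANES = 112

def input_channel_stats(path):
    with open(path) as f:
        lines = f.read().splitlines()
    conv = np.array(lines[1].split(), dtype=np.float32)
    out_channels = conv.size // (NUM_INPUT_PLANES * 3 * 3)
    conv = conv.reshape(out_channels, NUM_INPUT_PLANES, 3, 3)
    # Collapse everything except the input-channel axis.
    per_channel = conv.transpose(1, 0, 2, 3).reshape(NUM_INPUT_PLANES, -1)
    return per_channel.mean(axis=1), per_channel.std(axis=1)

if __name__ == "__main__":
    mean, std = input_channel_stats(sys.argv[1])
    for i, (m, s) in enumerate(zip(mean, std)):
        print(f"plane {i:3d}  mean {m:+.5f}  std {s:.5f}")
```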
Similar to Tilps' idea, I tested an awk script (based on one given by Gyathaar on Discord) to divide the rule50 weights by 99 directly in the weights file. This gives identical output with the patched lczero on all of my tests, but allows the normalization to be done using only the server-side code in #605.
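The awk one-liner itself wasn't preserved in this thread; the following Python sketch performs the same transformation under the same weights-format assumptions as the snippet above, with RULE50_PLANE as an assumed plane index that must be checked against the engine's actual input encoding.

```python
import sys
import numpy as np

# Sketch of the weights-file fix: scale the input-conv weights that multiply
# the rule50 plane by 1/99, so an engine feeding the raw 0-99 counter behaves
# as if the plane had been fed already normalized. Same format assumptions as
# the stats snippet above; RULE50_PLANE is an assumed index.
NUM_INPUT_PLANES = 112
RULE50_PLANE = 109

def normalize_rule50(in_path, out_path):
    with open(in_path) as f:
        lines = f.read().splitlines()
    conv = np.array(lines[1].split(), dtype=np.float32)
    out_channels = conv.size // (NUM_INPUT_PLANES * 3 * 3)
    conv = conv.reshape(out_channels, NUM_INPUT_PLANES, 3, 3)
    conv[:, RULE50_PLANE, :, :] /= 99.0
    lines[1] = " ".join(f"{w:g}" for w in conv.reshape(-1))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    normalize_rule50(sys.argv[1], sys.argv[2])
```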
Some data about the main pipeline trying to recover from R50. I took a script someone else made and output the following.
Source for bootstrap net:
Results:
Script:
Also, here is a Google sheet of two positions that seem to be r50-related, showing wild swings net to net in the N% of the correct move:
So I did some calculations; it would appear the rule50 weights are probably moving downwards at about 50% of their maximum possible rate given the current LR.
For reference, this issue is addressed by LeelaChessZero/lczero-training#3. We are using this issue to track recovery from the problem. As the posts above show, recovery is going to be very slow on the current main pipeline.
On the face of it, "Changing regularization weight would likely damage all the other weights" doesn't seem true a priori to me. It certainly worked for the value head just fine. What's the rationale for that sentence? (Even so, I agree a re-bootstrap is a cleaner and more effective method regardless.)
What we tried, and what worked, was decreasing only the value head weight. If we decrease both, everything across the whole net has less optimization power. Decreasing only the value head weight limits the regularization to being stronger primarily in the value-head-related weights. That was fine, because those weights were in extremely bad shape.
It's not that everywhere has less optimization power, it's that everywhere has different optimization power. And the current network continues to be in bad shape (if not extremely bad shape), though the most recent LR reduction will help quite a bit, I think.
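Schematically, the trade-off being debated looks like the following (illustrative pseudostructure only; the coefficient names and values are made up, and this is not the actual lczero training loss):

```python
# Schematic composite loss; coefficient names/values are invented for illustration.
def total_loss(policy_loss, value_loss, l2_term,
               policy_weight=1.0, value_weight=1.0, reg_weight=1e-4):
    return (policy_weight * policy_loss
            + value_weight * value_loss
            + reg_weight * l2_term)

# Lowering only value_weight (e.g. 1.0 -> 0.25) shrinks the gradient signal the
# value head's parameters receive from their loss while leaving the L2 term on
# those same parameters unchanged, so regularization becomes relatively stronger
# mainly for value-head-related weights. Lowering both head weights (or raising
# reg_weight) shifts that balance for the entire network instead.
```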
A few points to consider.
For these reasons I stand by my previous thought: we don't want to just try this out on main training without first doing some decent test runs on an outside training run.
Never said a 50x shift in ratio :) just a 2-4x shift. Like setting both NN head weights to 0.5 or something. And yeah, I agree it's pretty much moot in light of the recent conversation. I guess this will be my last post on this particular topic.
Just to round out this discussion then: a 4x ratio, and increasing the LR by 4x as well, would result in a regularization rate of change implying in excess of 500 nets' worth of training to get the regularization term down to pre-normalization levels. (I'm pretty confident of that, based on observing the slope change when I reduced the LR by 2.5x.) Hence why I was talking about needing closer to 50x; that is what would be needed to get the recovery time down to a few weeks rather than closer to 6 months.
I guess we were addressing different objectives then. I was only talking about what would produce good short-term strength benefits, not "how long to totally normalize the weights". Oh well.
Main pipeline bootstrap net has fixed the rule50 issues. Test pipeline is going well. Closing this issue.
This has long been discussed; I've not been able to find the issue.
So, while all other input planes have only values of 0 and 1, the rule50 plane can have values from 0 to 99.
That causes the weights for that plane to be hugely inflated (although it's not immediately clear to me why they are 50x larger rather than smaller, it seems they indeed should be like that), and it may potentially be the reason for the current training problems.
We need to normalize that plane.
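A sketch of what normalization at input-encoding time could look like (illustrative; the plane count and rule50 index below are assumptions, not the engine's actual encoding): fill the plane with counter/99 instead of the raw counter, so its range matches the other 0-1 planes.

```python
import numpy as np

# Illustrative input-plane builder; NUM_PLANES and RULE50_PLANE are assumed
# values, not the actual lczero encoding.
NUM_PLANES = 112
RULE50_PLANE = 109

def fill_rule50(planes: np.ndarray, halfmove_clock: int) -> None:
    """Fill the rule50 plane with the normalized counter instead of the raw 0-99 value."""
    planes[RULE50_PLANE, :, :] = halfmove_clock / 99.0  # was: = halfmove_clock

planes = np.zeros((NUM_PLANES, 8, 8), dtype=np.float32)
fill_rule50(planes, halfmove_clock=37)
print(planes[RULE50_PLANE, 0, 0])  # ~0.374, now in the same 0-1 range as the other planes
```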