Overhaul of PUCT formula (and ramifications) #818
Comments
Good issue. I hadn't seen that paper before. One downside is that it's not clear that any of the distributions are normal, or how correlated they are with each other. I've been working for a long time to try to find a more grounded formula. If you want help, I have a lot of free time this week. Feel free to PM me on Discord (username oscardssmith).
True. I definitely think it's going to take a lot of heavy lifting just to get something reasonably effective in place. And I sincerely doubt it will be competitive with the existing PUCT formula right off the bat (after all, that formula has been massively tuned and CLOP'ed ad nauseam; even though it's built on an ugly model, I don't question that it's been hammered into a decent approximation of any superior model). However, once something functional is in place, the real beauty of the change should become apparent. We'll be able to tweak/tune the PUCT formula with an eye toward "accurately modeling a real phenomenon" (e.g. finding a version of the paper's statistical methods that applies to a better-suited non-Gaussian distribution), rather than relying on the "smack the television in the back and check if the picture got clearer" approach that the existing PUCT formula necessitates.
Does #317 relate?
I've tried something like this and I couldn't get it to help. The issues are that
It's #2 that killed my idea.
@Veedrac the problem with #2 is that by relying on the node distribution for weight, you remove the possibility of many search optimizations, such as graph (rather than tree) search.
This is relevant to the discussion collected in #1231 about changing the PUCT formula.
Both PR #816 and PR #796 have become stepping stones toward a vision I have for a broader PUCT overhaul. I want to use this issue to lay out that idea for feedback (and to gather ideas for how we might approach implementation).
My biggest pet peeve with the A0/DeepMind PUCT formula is its "fuzziness". It seems to be the result of countless empirical optimizations of educated guesses at good ways to describe the "search-worthiness" of a given action a in a given state s. In the claim "whichever action a gives the greatest value of Q(s,a)+U(s,a) will make 'something' better than actions with smaller values of Q+U", there is no coherent definition of that "something".
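For reference, this is the AlphaZero-style selection rule under discussion (lc0's actual variant layers further tweaks, such as FPU, on top):

$$a^{*} = \arg\max_a \big( Q(s,a) + U(s,a) \big), \qquad U(s,a) = c_{\mathrm{puct}} \, P(s,a) \, \frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)}$$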
As a result of this fuzziness, and the lack of a coherent "something" we're seeking to improve, it is impossible to determine whether any modification to Q+U (or any of its underlying hyperparameters) is an objective improvement in measuring that "something". So the only means of checking whether a change is useful is trial and error (and the only means of guessing at what good changes might be is, literally, guessing).
My aim is to put in place a PUCT formula that's tied to a real "something" that can be measured for each action. And, moreover, a real "something" for which it actually makes sense to seek the action that maximizes it.
My proposed "something" is based on the calculations outlined in this paper:
Exact Distribution of the Max/Min of Two Gaussian Random Variables
If we can calculate a reasonable expected value EV for each child node, along with a meaningful estimate of the uncertainty (stderr) of that EV, then this paper shows that it is possible to calculate the expected value, and the uncertainty, of the maximum of those child EVs.
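Concretely, for two independent Gaussian estimates X ~ N(μ₁, σ₁²) and Y ~ N(μ₂, σ₂²), the first two moments of the max have closed forms (Clark's classic formulas, which the paper above builds on; I'm taking correlation ρ = 0 for simplicity):

$$\theta = \sqrt{\sigma_1^2 + \sigma_2^2}, \qquad \alpha = \frac{\mu_1 - \mu_2}{\theta}$$

$$\mathbb{E}[\max(X,Y)] = \mu_1 \Phi(\alpha) + \mu_2 \Phi(-\alpha) + \theta\,\varphi(\alpha)$$

$$\mathbb{E}[\max(X,Y)^2] = (\mu_1^2 + \sigma_1^2)\Phi(\alpha) + (\mu_2^2 + \sigma_2^2)\Phi(-\alpha) + (\mu_1 + \mu_2)\,\theta\,\varphi(\alpha)$$

where Φ and φ are the standard normal CDF and PDF, and Var[max] = E[max²] − E[max]².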
The "Golden Ticket" PUCT formula - G(s,a) for lack of better alternative - I envision would assess for each action (s,a), "If this action a is taken in state s, and we assume the next new measure of the associated child node would match its current EV, then we can predict how much that measurement=EV would reduce our uncertainty in that child's EV. From that, we can predict what the new uncertainty of the max of all EVs of all children would be, given the reduction of uncertainty of that one child. The PUCT formula would return the reduction from the original pre-action-a uncertainty in max(EV) to the estimated post-action-a uncertainty in max(EV)."
I think it's easier to follow this formula, and appreciate its elegance, with a hypothetical example. Imagine a state s with two child nodes A and B, with current estimated Q values (and 95% confidence intervals based on the stderr given the current stdev and number of visits to each):
A=0.4 +/- 0.1
B=-0.3 +/- 0.5
With that information alone, the math of the paper allows us to compute the expected value EV(max(A,B)), which should be roughly 0.41. This should be a little higher than EV(A)=0.4 because of the slight chance that B is actually better than A. The same math might also show the uncertainty of this max as 0.11, slightly higher than uncertainty(A)=0.1 due to the uncertainty B brings. Overall, we should get something like:
EV(max(A,B)) = 0.41 +/- 0.11
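As a sanity check, here is a minimal Python sketch of this two-child example using the Clark moment formulas above. One assumption to flag: it treats the "+/-" figures as one standard error each (and the children as independent), so its output (~0.420 +/- 0.127) lands in the same ballpark as, but not identical to, the illustrative 0.41 +/- 0.11 above.

```python
# Minimal numeric sketch of the two-child example, using Clark's moment
# formulas for max(X, Y) of two Gaussians -- the machinery the cited
# Nadarajah/Kotz paper builds on.
from math import sqrt
from scipy.stats import norm

def max_of_two_gaussians(mu1, s1, mu2, s2, rho=0.0):
    """Mean and stddev of max(X, Y) for X~N(mu1,s1^2), Y~N(mu2,s2^2)."""
    theta = sqrt(s1**2 + s2**2 - 2*rho*s1*s2)   # stddev of X - Y
    a = (mu1 - mu2) / theta
    mean = mu1*norm.cdf(a) + mu2*norm.cdf(-a) + theta*norm.pdf(a)
    second = ((mu1**2 + s1**2)*norm.cdf(a)
              + (mu2**2 + s2**2)*norm.cdf(-a)
              + (mu1 + mu2)*theta*norm.pdf(a))
    return mean, sqrt(second - mean**2)

m, s = max_of_two_gaussians(0.4, 0.1, -0.3, 0.5)
print(f"EV(max(A,B)) = {m:.3f} +/- {s:.3f}")   # ~0.420 +/- 0.127
```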
Without actually visiting A, we can look at the current number of visits to A and compute the new uncertainty we'd get if we "simulated" one more visit to A that returned a measurement of exactly 0.4. Assuming the small uncertainty in A is due to an already-large number of visits, one more visit should only slightly reduce the uncertainty in A'; perhaps EV(A') = 0.4 +/- 0.099. Using the new EV(A'), along with the unmodified EV(B), we might find the new expected max becomes:
EV(max(A',B)) = 0.409 +/- 0.108
Conversely, the initial large uncertainty in B, +/- 0.5, is likely due to B having relatively few visits. In the case of B having just two visits, we might find that a third measurement of B (with a simulated value of -0.3 = EV(B)) would drastically cut its uncertainty, perhaps yielding EV(B') = -0.3 +/- 0.25.
While this is a big reduction in the uncertainty of B, it mostly just makes B slightly less likely to be bigger than A. Thus the effect on the expected max of A and B would be a slightly smaller EV(max(A,B')) and a slightly smaller uncertainty(max(A,B')). Running the numbers might give:
EV(max(A,B')) = 0.408 +/- 0.107
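A note on the "simulated visit" step: one simple model (my assumption; the post doesn't commit to one) is that a measurement equal to the current mean leaves the sample stdev s roughly unchanged, so the standard error s/√n shrinks as:

$$\sigma_{A'} \approx \frac{s_A}{\sqrt{n_A + 1}} = \sigma_A \sqrt{\frac{n_A}{n_A + 1}}$$

For a heavily visited A this shrinkage is negligible (0.1 → ~0.099 at roughly 50 visits), matching the example; for a two-visit B it gives 0.5 → ~0.41, so the illustrative 0.25 above assumes a stronger update than this naive model.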
The "golden ticket" PUCT formula G(s,a) would thus return, for
G(s,a[A]) = uncertainty(max(A,B))) - uncertainty(max(A',B))) = 0.11 - 0.108 = 0.002
G(s,a[B]) = uncertainty(max(A,B))) - uncertainty(max(A,B'))) = 0.11 - 0.107 = 0.003
In this scenario, then, we could say that a visit to B would yield more information than a visit to A toward obtaining the most accurate assessment of max(A,B) for state s. This is vitally important because max(A,B) and Q(parent(A,B)) ultimately mean the same thing (and will converge to the same value as the number of visits to s increases).
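Putting it together, here is a sketch of the full G(s,a) computation for this example (again my Python, reusing the helper above and feeding in the post's illustrative post-visit uncertainties; the absolute numbers differ from the 0.002/0.003 shown, but the ordering, i.e. visit B first, comes out the same):

```python
# Continues the earlier sketch: assumes max_of_two_gaussians() from the
# snippet above is already defined in scope. The post-visit uncertainties
# (0.099 for A', 0.25 for B') are the post's illustrative figures.
_, sigma_max = max_of_two_gaussians(0.4, 0.1,   -0.3, 0.5)   # before any visit
_, sigma_vA  = max_of_two_gaussians(0.4, 0.099, -0.3, 0.5)   # simulated visit to A
_, sigma_vB  = max_of_two_gaussians(0.4, 0.1,   -0.3, 0.25)  # simulated visit to B

G_A = sigma_max - sigma_vA
G_B = sigma_max - sigma_vB
print(f"G(s,a[A]) = {G_A:.4f}, G(s,a[B]) = {G_B:.4f}")  # G_B > G_A: search B first
```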
To me, the most compelling aspect of this approach is its parsimony. Though it's a bit of a mouthful (threadful?) to enunciate, and it will definitely be challenging to implement, it really does build the PUCT formula on a single atomic idea. Nevertheless, it yields all the desired intuition-based behaviors of Q+U that have required countless tweaks of the cPuct "constant", complex FPU logic, and arbitrary-feeling terms like sqrt(Σ_b N(s,b)) and 1/(1+N(s,a)) to get decent Elo out of Q+U.
I'm not naive enough to present this proposal thinking we can whip up a quick prototype to test it out. However, I do plan to work on putting a functional implementation together (possibly using several incremental PRs as additional stepping stones toward the final G(s,a) idea) over the coming days/weeks, and I welcome input from, and the assistance of, any other developers interested in contributing.