Transform Q into logit space when determining Q+U best child #925

Merged Sep 14, 2019 (144 commits)

AlexisOlson (Contributor) commented Aug 17, 2019

Edit:
The idea here is that when Q is near +1 or -1, the U term dominates the search, even though a small change in Q there corresponds to a large change in the odds of winning/losing.

This PR transforms Q into logit space (logit is the function that converts a probability to log-odds) before adding the U term.

In summary, instead of Q + U, I use logit(Q) + U.
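
A minimal sketch of the difference between Q + U and logit(Q) + U, scoring two children both ways. It assumes logit(Q) = 0.5 * ln((1 + Q) / (1 - Q)) (the inverse of tanh, consistent with the scaling factor quoted in the "Original" section), a generic PUCT-style U term, and made-up priors, visit counts and cpuct. The helper names are hypothetical and this is not the lc0 code.

```python
import math


def logit_q(q, eps=1e-6):
    # Assumed form: atanh(q) = 0.5 * ln((1 + q) / (1 - q)).
    # q is clamped away from +/-1 so the log stays finite.
    q = max(-1.0 + eps, min(1.0 - eps, q))
    return 0.5 * math.log((1.0 + q) / (1.0 - q))


def u_term(cpuct, prior, parent_visits, child_visits):
    # Generic PUCT exploration bonus (an assumption, not taken from the PR).
    return cpuct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def best_child(children, cpuct, use_logit):
    # Pick the child with the highest Q + U or logit(Q) + U.
    parent_visits = sum(c["n"] for c in children)

    def score(c):
        q = logit_q(c["q"]) if use_logit else c["q"]
        return q + u_term(cpuct, c["p"], parent_visits, c["n"])

    return max(children, key=score)["name"]


# Illustrative numbers only: a well-visited sharp line at Q=0.99 and a
# high-policy alternative at Q=0.90.
children = [
    {"name": "sharp", "q": 0.99, "p": 0.2, "n": 80},
    {"name": "wide", "q": 0.90, "p": 0.6, "n": 20},
]

plain_choice = best_child(children, cpuct=3.0, use_logit=False)
logit_choice = best_child(children, cpuct=3.0, use_logit=True)
print(plain_choice, logit_choice)
```

With these particular numbers, plain Q + U selects the high-policy child, while logit(Q) + U keeps visiting the Q=0.99 child, because the gap between 0.90 and 0.99 is much larger in logit space than the difference in U.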

See also @Naphthalin's explanation on Discord

The problem it tries to help with: near Q=1 or Q=-1 (so winning or losing), the search goes very wide because the Q differences are very small, for example between 0.97 and 0.98. This wide search is a problem for accurate evaluation of sharper lines. Assume there is a move that at depth 4 increases Q from 0.9 to 0.99, while the highest-policy move there loses momentum, reducing Q from 0.9 to 0.6.

To get an accurate eval of >0.95 for this line, the better move has to get at least 5x the visits of the other one. As we approach 1.0, the problem gets worse: if a move would increase Q from 0.98 to 0.99 while an alternative drops it to 0.9, it would need 10x the nodes. If cpuct decreased as Q approaches 1.0, it would help with this problem, because the PUCT search would spend more visits on the top choices and therefore give a more accurate eval.
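
The visit-ratio argument can be made concrete with a rough sketch. It assumes a simplified model that is not spelled out in the explanation: the parent eval is the visit-weighted average of exactly two children, and `required_visit_ratio` is a hypothetical helper.

```python
def required_visit_ratio(q_good, q_bad, q_target):
    """Visits the good child needs per visit of the bad child so that the
    visit-weighted average (r * q_good + q_bad) / (r + 1) reaches q_target."""
    if not (q_bad < q_target < q_good):
        raise ValueError("need q_bad < q_target < q_good")
    # Solve (r * q_good + q_bad) / (r + 1) = q_target for r.
    return (q_target - q_bad) / (q_good - q_target)


# The two cases from the text: keep the eval above 0.95 with children at
# 0.99 and 0.6, and keep it at 0.98 with children at 0.99 and 0.9.
ratio_first = required_visit_ratio(0.99, 0.6, 0.95)
ratio_second = required_visit_ratio(0.99, 0.9, 0.98)
print(ratio_first, ratio_second)
```

Under this two-child model the break-even ratios for both cases come out between 8 and 9, the same order of magnitude as the figures quoted above.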

So the originally proposed idea was to effectively reduce cpuct at higher Q. However, especially once FPU is included, and for symmetry reasons, it is better to replace Q+U by something that behaves as an effective Q, meaning it always lies in [-1,1], where -1 and 1 mean definite results. This is done by adding Q and U in logit space, which in our case is the proper way of thinking about odds of winning/losing. To do this, Q is transformed back into a logit (which is what the NN calculates somewhere), the exploration term and FPU are added, and the result is transformed back by a tanh to get a winrate. The calculation can be simplified a bit so that tanh needs to be called only once.
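
One way the "tanh only once" simplification might go, assuming the combined value is tanh(atanh(Q) + U): by the tanh addition formula this equals (Q + tanh(U)) / (1 + Q * tanh(U)). The sketch below checks that numerically. The function names are hypothetical, FPU is left out, and this is not claimed to be the form used in the PR.

```python
import math


def effective_q_reference(q, u):
    # Direct form: go to logit space, add U, come back with tanh.
    return math.tanh(math.atanh(q) + u)


def effective_q(q, u):
    # Same value via tanh(a + b) = (tanh a + tanh b) / (1 + tanh a * tanh b),
    # with tanh(atanh(q)) = q, so only a single tanh call is needed.
    t = math.tanh(u)
    return (q + t) / (1.0 + q * t)


for q in (-0.99, -0.5, 0.0, 0.5, 0.9, 0.99):
    for u in (0.0, 0.1, 1.0, 3.0):
        assert abs(effective_q(q, u) - effective_q_reference(q, u)) < 1e-9
        assert -1.0 <= effective_q(q, u) <= 1.0
```

At Q=0 the result reduces to tanh(U), and at Q=1 it stays exactly 1 for any moderate U, which matches the requirement that -1 and 1 mean definite results.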

The expected outcome is no different behavior near Q=0 while being more selective near Q=1 and Q=-1 which (hopefully) favors tactical lines where lc0 can already see the progress over staying in a comfortable position because the good line has a higher weight than it would have now. (Also one important detail: one might think about doing the averaging of Q values also in the logit space. This, however, is most likely to be avoided since the Q values are about statistics, and otherwise having a single 1.0 eval somewhere in the tree would destroy the Q.)
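
The remark about averaging can be illustrated with a small sketch. It assumes the same logit as above (atanh) and uses made-up evals; the function names are hypothetical.

```python
import math


def logit_q(q):
    # atanh diverges at +/-1, so definite results map to +/-infinity.
    if q >= 1.0:
        return math.inf
    if q <= -1.0:
        return -math.inf
    return math.atanh(q)


def mean_q(evals):
    # Ordinary averaging of Q values.
    return sum(evals) / len(evals)


def mean_q_logit(evals):
    # Hypothetical alternative: average in logit space, then map back.
    return math.tanh(sum(logit_q(q) for q in evals) / len(evals))


evals = [0.2, 0.1, 0.3, 1.0]  # one definite win among modest evals
plain_average = mean_q(evals)
logit_average = mean_q_logit(evals)
print(plain_average, logit_average)
```

The ordinary average stays around 0.4, while the logit-space average is pinned to 1.0 by the single definite result, which is the failure mode the parenthetical warns about.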


An example at extreme Q (PR925 on bottom):
4Rbk1/5p2/p2Q2p1/7p/6N1/2p4P/PP3PPK/8 w - - 0 1
Mate in 2


Original:

Scaling factor for U term: 2 * Q / ln( (1 + Q) / (1 - Q) )

[Figure: LogitScaling, graph from Desmos]
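
As a rough numeric stand-in for the graph, the sketch below evaluates the scaling factor 2 * Q / ln((1 + Q) / (1 - Q)) at a few points. The function name and the handling of Q near 0 (where the expression tends to 1) are assumptions for illustration.

```python
import math


def u_scale(q):
    # Scaling factor for the U term: 2 * Q / ln((1 + Q) / (1 - Q)),
    # defined for -1 < Q < 1. The limit as Q -> 0 is 1.
    if abs(q) < 1e-6:
        return 1.0
    return 2.0 * q / math.log((1.0 + q) / (1.0 - q))


for q in (0.0, 0.5, 0.9, 0.99):
    print(f"Q={q:.2f}  scale={u_scale(q):.3f}")
```

The factor is symmetric in Q, equals 1 at Q=0 and shrinks toward 0 as |Q| approaches 1, so U is left unchanged in balanced positions and damped in clearly won or lost ones.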

Cpuct probably needs to be re-tuned along with this change.

It may or may not make more sense to use the root/parent Q value instead.

efficient propagation of certainty, two-fold draw scoring, mate display and more.
=1 suitable for training
=2 for play
Currently negabound search depth is one.
Improves play in positions with many certain positions (near endgame TBs, mates).
Sees repetitions faster and scores positions more accurately.
…ersion. Increasing threads (e.g. 4 or 6) will get to masters speed now. Further speed fixes (move generator) possible....
…with lto, this yields a speed up by 30-50% in backend=random. In order to fully use CP please use 4 threads+. Changed default temporarily to 4 threads with this commit, to collect more scaling data.
…ds instant play of certain winning moves and avoidance of losing moves regardless of visits. CP=3 now adds advanced pruning.
- exposed depth parameter (0 is no-look-ahead)
- only two modes CP=1 for training and CP=2 for play
Todo:
- change option from int to choiceoption
- use info.mate to communicate mate scores
- Certainty Propagation is a bool option now, just on or off (default = off).
- Cleanup code and comments
- Threads default = 2, but if certainty propagation is turned on please use 4 threads.
mooskagh (Member) commented Sep 13, 2019

Could you check nps with the random backend:

  1. before this change
  2. after this change with LogitQEnabled=false, and
  3. after the change with LogitQEnabled=true

e.g. with $ ./lc0 benchmark --backend=random

UPD: did that myself:

  1. 235306 nps
  2. 245927 nps
  3. 235812 nps

It is a bit suspicious that it became faster. :)

UPD2:
First run had a bad day, reruns show ~246knps
