[WIP] Implement the RENTS algorithm for tree exploration #1418

Draft: 17 commits to be merged into base: master

kiudee (Member) commented Sep 10, 2020

Description

This is a WIP branch where I will implement the RENTS algorithm as proposed by Dam et al. (2020) [1]. It is not working yet. I will remove this note as soon as it can be run.

Policies are continually recomputed using the following weighted softmax:
[equation image: max_relent]

To ensure enough exploration, the uniform distribution is mixed in:
[equation image: e3w]
with the following exploration term:
[equation image: lambda_s]
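
Since the equation images are not reproduced here, the following is a minimal Python sketch of how the policy computation described above could look. It is an assumption based on my reading of [1], not the code in this branch: the softmax is weighted by a prior (previous) policy, and the exploration term is taken to be `epsilon * |A| / log(N + 1)`, clipped to 1. The names `weighted_softmax`, `exploration_weight`, `mixed_policy`, `tau`, and `epsilon` are hypothetical.

```python
import math


def weighted_softmax(q_values, prior, tau):
    """Prior-weighted softmax: p(a) proportional to prior(a) * exp(q(a) / tau).

    Assumes tau > 0 and at least one action with a positive prior.
    """
    scaled = [q / tau for q in q_values]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [p * math.exp(s - m) for p, s in zip(prior, scaled)]
    total = sum(weights)
    return [w / total for w in weights]


def exploration_weight(visit_counts, epsilon):
    """Assumed exploration term: epsilon * |A| / log(N + 1), clipped to 1."""
    n = sum(visit_counts)
    if n == 0:
        return 1.0  # unvisited node: fall back to the uniform distribution
    return min(1.0, epsilon * len(visit_counts) / math.log(n + 1))


def mixed_policy(q_values, prior, visit_counts, tau, epsilon):
    """Mix the weighted softmax with the uniform distribution."""
    lam = exploration_weight(visit_counts, epsilon)
    base = weighted_softmax(q_values, prior, tau)
    uniform = 1.0 / len(q_values)
    return [(1.0 - lam) * b + lam * uniform for b in base]
```

With equal Q values the weighted softmax returns the prior unchanged, and as the total visit count grows the uniform component shrinks, so the policy approaches the pure weighted softmax.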

The initialization of the Q values is crucial. The existing policy head can provide a semi-decent initialization, but preliminary experiments in a0lite have shown that initializing from the value head works much better. That would require a value head which outputs a Q value for every move.

Pros and Cons

Pros:

  • Faster convergence than PUCT
  • Potentially better scaling behaviour
  • Potentially easier collection of batches

Cons:

  • More computational overhead during backpropagation (may need to come up with a clever updating scheme, or update only rarely)
  • Not yet clear whether it is better for chess (the TENTS variant could work better)

Ablations

Here I will record SPRT tests of different design decisions.

To do

Implementation:

  • Implement weighted softmax function
  • Initialize Q values
  • Update Q values and policies during backpropagation
  • Select moves based on policies
  • Additional parameters to control exploration, softmax temperature, etc.
  • Ensure correct handling of Q values
  • Check how it interacts with cache
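
For the backpropagation item above, a minimal sketch of a soft value backup is given below. It assumes the relative-entropy value operator as I understand it from [1], `V(s) = tau * log(sum_a prior(a) * exp(Q(s, a) / tau))`, and is not taken from this branch. The name `soft_value` and its parameters are hypothetical.

```python
import math


def soft_value(q_values, prior, tau):
    """Assumed RENTS-style backup: tau * log(sum_a prior(a) * exp(q(a) / tau)).

    Assumes tau > 0 and a prior that sums to 1 with at least one positive entry.
    """
    scaled = [q / tau for q in q_values]
    m = max(scaled)  # log-sum-exp trick to avoid overflow for small tau
    total = sum(p * math.exp(s - m) for p, s in zip(prior, scaled))
    return tau * (m + math.log(total))
```

For small `tau` the result approaches the maximum Q value among moves with positive prior, and for equal Q values it returns that common value. In a two-player search the sign convention between parent and child would still need to be handled, which relates to the "Ensure correct handling of Q values" item.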

Test:

  • Performance using fixed nodes
  • ...

References

[1] Dam, T., D'Eramo, C., Peters, J., and Pajarinen, J., “Convex Regularization in Monte-Carlo Tree Search”, arXiv e-prints, 2020.

@kiudee added the enhancement, wip, and not for merge labels on Sep 10, 2020
@Naphthalin added the demo label and removed the wip and not for merge labels on Nov 2, 2022