Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move move filter population to a constructor. #1281

Merged
merged 4 commits into from
May 9, 2020

Conversation

mooskagh
Copy link
Member

@mooskagh mooskagh commented May 9, 2020

Fixes #1272

@mooskagh mooskagh requested a review from Tilps May 9, 2020 10:58
@mooskagh mooskagh marked this pull request as ready for review May 9, 2020 10:58
if (!root_move_filter_.empty() &&
std::find(root_move_filter_.begin(), root_move_filter_.end(),
child.GetMove()) == root_move_filter_.end()) {
if (!search_->root_move_filter_.empty() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we trust the optimizer to pull search_->root_move_filter_ outside of the loop to avoid the extra deref or should we do that manually?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't make much sense from disassembly, but it seemed that there was an indirection inside the loop, so I moved it out.
But nowadays indirection itself is no longer considered a performance problem, usually it takes a fraction of a CPU cycle (it's indirection to cold memory that is a problem).

@Tilps
Copy link
Contributor

Tilps commented May 9, 2020

Can we get a random backend check with TB disabled to see what effect this has if any in that case?

Copy link
Contributor

@Tilps Tilps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved assuming nps change in non-TB scenario is in the noise.

@mooskagh
Copy link
Member Author

mooskagh commented May 9, 2020

So far this PR gives -10% nps drop.
But master also gives -5% to what it was when computer was fresh in the morning.

Looking further.

@Tilps
Copy link
Contributor

Tilps commented May 9, 2020

I struggle to see how this could be slower, but maybe the GetBestChild path also similarly needs to avoid the derefs in the loops?

@Naphthalin
Copy link
Contributor

Naphthalin commented May 9, 2020

(ref) Lc0_c4c92f6 with --backend=random:

Total time (ms) : 340111
Nodes searched  : 42331240
Nodes/second    : 124463

(#1281) Lc0_a4bbcf6 with --backend=random:

Total time (ms) : 340126
Nodes searched  : 42238393
Nodes/second    : 124184

(Win10, i7-4770)

@mooskagh mooskagh merged commit 0622049 into LeelaChessZero:master May 9, 2020
@mooskagh mooskagh deleted the tb_performance branch May 9, 2020 17:33
@mooskagh
Copy link
Member Author

mooskagh commented May 9, 2020

This is a fix for a bug causing major slowdown when playing in tablebase position.

Due to this chain of calls:
DoBackupUpdate() ->
DoBackupUpdateSingleNode() ->
GetBestChild() (to update best_edge to know whether to output UCI) ->
PopulateRootMoveLimit() ->
syzygy_tb->root_probe()

We probbed the tablebase (root node, both WDL and DTZ) per every visit.
While in fact only needed to do that once per search (so moved that to a Search constructor).

@Tilps
Copy link
Contributor

Tilps commented May 10, 2020

For posterity, we didn't probe tablebase every visit, only on visits that potentially change the 'best child' of root. This is why the problem didn't show up in every TB position.

@Tilps
Copy link
Contributor

Tilps commented May 10, 2020

Except the check was wrong (visits to the current best edge shouldn't need to check if another edge is now best - but it checked even so), so now I am not clear why the problem didn't show up on every TB position :P

AlexisOlson pushed a commit to AlexisOlson/lc0 that referenced this pull request May 10, 2020
AlexisOlson added a commit to AlexisOlson/lc0 that referenced this pull request May 10, 2020
* Time management refactoring (LeelaChessZero#1195)

* Appended files.

* Compiles.

* Compiles again.

* Make smart pruning use smoothed nps.

* Seems to be fully implemented.

* Mistype.

* One more bug.

* Found discrepancy with documentaiton.

* Bugfixes.

* Don't smooth nps during the first move.

* Too large default for timeuse decay.

* Bugfix.

* Fix build.

* Relax defaults a bit. Add fixed to logging.

* Remove "smooth" to "smooth-experimental" for now.

* MLH verbose stats - Issue 1200  (LeelaChessZero#1230)

* Add M effect logic to output section

* Fix missing prefixes and semicolons

* Some fixes.

* Slight format improvement?

Co-authored-by: Tilps <[email protected]>

* Start TempDecay only after a given number of moves (LeelaChessZero#1212)

* Added TempDecayStartMove for starting temp decay only after a given number of moves. This allows keeping initial game up for a few moves and still use decay.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Hide temp options

* renamed TempDecayStartMove to TempDecayDelayMoves

Co-authored-by: Alexis Olson <[email protected]>

* Changelog for 0.25.0-rc2. (LeelaChessZero#1233)

* Changelog for 0.25.0-rc2.

* Add one more PR to the changelog.

* Cuda winograd (LeelaChessZero#1228)

* custom winograd convolution for cuda backends

* custom winograd fixes

- fix a bug to make it work for non-SE networks
- enable by default only with fp32.

* address review comments

* remove random line in comment

* remove unused constants

- W,H are hardcoded to 8 - because there are assumptions in the code based on that. No point in defining constants.

* cuda winograd fixes (LeelaChessZero#1238)

* cuda winograd fixes
- don't typecast directly to half datatype in CPU side code as older CUDA runtime doesn't support that.
 - don't use gemmEx version on GPUs older than Maxwell generation (not supported).
 - modify the check to enable custom_winograd setting. It should be faster in most cases - except  presently on RTX GPUs when using fp16.

* Allow most parts of fen to be optional. (LeelaChessZero#1234)

Default to white to move, no castling, no en passant, 0 rule50ply, 1 total move. Also convert other string to std::string and removing using.

* Fix UpdateNps to actually smooth the nps and correctly handle time_since_movestart_ms == 0 (LeelaChessZero#1243)

* Update changelog for 0.25.0 final release. (LeelaChessZero#1244)

* Always report at least 1 depth. (LeelaChessZero#1247)

* Fix un-intended regression for GTX GPUs (LeelaChessZero#1246)

* memory optimization for cudnn custom_winograd (LeelaChessZero#1250)

* memory optimization for cudnn custom_winograd

- don't save untransformed weights
- print warning message when low memory is detected.

* address review comments

* fix warning message

* fix total weight size calculation

2 layers per residual block!

* keep pdb files only for release builds  (LeelaChessZero#1256)

* doc update (LeelaChessZero#1267)

* Include verbose stats for the node. (LeelaChessZero#1268)

Use printing lambdas for parts of the verbose output to share between the newly outputted node and its children.

* add alphazero time manager (LeelaChessZero#1201)

* Updated FLAGS.md with logfile flag (LeelaChessZero#1275)

* Fixed a typo in CONTRIBUTING.md (LeelaChessZero#1274)

* Update Readme about using git (LeelaChessZero#1265)

* Make `wl_` double. (LeelaChessZero#1280)

* Move move filter population to a constructor. (LeelaChessZero#1281)

* Filter out illegal searchmoves to avoid crashing. (LeelaChessZero#1282)

* Clear policy for terminal loss. (LeelaChessZero#1285)

* Allow smart pruning to terminate search if win is known. (LeelaChessZero#1284)

* Allow smart pruning to terminate search if win is known.

* Minor tweak, better safe than sorry.

* Fix bug where pv might not update for best move change. (LeelaChessZero#1286)

* Fix bug where pv might not update.

* Fix...

Co-authored-by: Alexander Lyashuk <[email protected]>
Co-authored-by: Tilps <[email protected]>
Co-authored-by: Naphthalin <[email protected]>
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: Ed Lee <[email protected]>
Co-authored-by: borg323 <[email protected]>
Co-authored-by: Hace <[email protected]>
Co-authored-by: Kip Hamiltons <[email protected]>
Co-authored-by: nguyenpham <[email protected]>
Tilps added a commit that referenced this pull request Jun 28, 2020
* Time management refactoring (#1195)

* Appended files.

* Compiles.

* Compiles again.

* Make smart pruning use smoothed nps.

* Seems to be fully implemented.

* Mistype.

* One more bug.

* Found discrepancy with documentaiton.

* Bugfixes.

* Don't smooth nps during the first move.

* Too large default for timeuse decay.

* Bugfix.

* Fix build.

* Relax defaults a bit. Add fixed to logging.

* Remove "smooth" to "smooth-experimental" for now.

* MLH verbose stats - Issue 1200  (#1230)

* Add M effect logic to output section

* Fix missing prefixes and semicolons

* Some fixes.

* Slight format improvement?

Co-authored-by: Tilps <[email protected]>

* Start TempDecay only after a given number of moves (#1212)

* Added TempDecayStartMove for starting temp decay only after a given number of moves. This allows keeping initial game up for a few moves and still use decay.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Hide temp options

* renamed TempDecayStartMove to TempDecayDelayMoves

Co-authored-by: Alexis Olson <[email protected]>

* Changelog for 0.25.0-rc2. (#1233)

* Changelog for 0.25.0-rc2.

* Add one more PR to the changelog.

* Cuda winograd (#1228)

* custom winograd convolution for cuda backends

* custom winograd fixes

- fix a bug to make it work for non-SE networks
- enable by default only with fp32.

* address review comments

* remove random line in comment

* remove unused constants

- W,H are hardcoded to 8 - because there are assumptions in the code based on that. No point in defining constants.

* cuda winograd fixes (#1238)

* cuda winograd fixes
- don't typecast directly to half datatype in CPU side code as older CUDA runtime doesn't support that.
 - don't use gemmEx version on GPUs older than Maxwell generation (not supported).
 - modify the check to enable custom_winograd setting. It should be faster in most cases - except  presently on RTX GPUs when using fp16.

* Allow most parts of fen to be optional. (#1234)

Default to white to move, no castling, no en passant, 0 rule50ply, 1 total move. Also convert other string to std::string and removing using.

* Fix UpdateNps to actually smooth the nps and correctly handle time_since_movestart_ms == 0 (#1243)

* Update changelog for 0.25.0 final release. (#1244)

* Always report at least 1 depth. (#1247)

* Fix un-intended regression for GTX GPUs (#1246)

* memory optimization for cudnn custom_winograd (#1250)

* memory optimization for cudnn custom_winograd

- don't save untransformed weights
- print warning message when low memory is detected.

* address review comments

* fix warning message

* fix total weight size calculation

2 layers per residual block!

* keep pdb files only for release builds  (#1256)

* doc update (#1267)

* Include verbose stats for the node. (#1268)

Use printing lambdas for parts of the verbose output to share between the newly outputted node and its children.

* add alphazero time manager (#1201)

* Updated FLAGS.md with logfile flag (#1275)

* Fixed a typo in CONTRIBUTING.md (#1274)

* Update Readme about using git (#1265)

* Make `wl_` double. (#1280)

* Move move filter population to a constructor. (#1281)

* Filter out illegal searchmoves to avoid crashing. (#1282)

* Clear policy for terminal loss. (#1285)

* Allow smart pruning to terminate search if win is known. (#1284)

* Allow smart pruning to terminate search if win is known.

* Minor tweak, better safe than sorry.

* Fix bug where pv might not update for best move change. (#1286)

* Fix bug where pv might not update.

* Fix...

* Catch up to master (#6)

* Time management refactoring (#1195)

* Appended files.

* Compiles.

* Compiles again.

* Make smart pruning use smoothed nps.

* Seems to be fully implemented.

* Mistype.

* One more bug.

* Found discrepancy with documentaiton.

* Bugfixes.

* Don't smooth nps during the first move.

* Too large default for timeuse decay.

* Bugfix.

* Fix build.

* Relax defaults a bit. Add fixed to logging.

* Remove "smooth" to "smooth-experimental" for now.

* MLH verbose stats - Issue 1200  (#1230)

* Add M effect logic to output section

* Fix missing prefixes and semicolons

* Some fixes.

* Slight format improvement?

Co-authored-by: Tilps <[email protected]>

* Start TempDecay only after a given number of moves (#1212)

* Added TempDecayStartMove for starting temp decay only after a given number of moves. This allows keeping initial game up for a few moves and still use decay.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.

* Hide temp options

* renamed TempDecayStartMove to TempDecayDelayMoves

Co-authored-by: Alexis Olson <[email protected]>

* Changelog for 0.25.0-rc2. (#1233)

* Changelog for 0.25.0-rc2.

* Add one more PR to the changelog.

* Cuda winograd (#1228)

* custom winograd convolution for cuda backends

* custom winograd fixes

- fix a bug to make it work for non-SE networks
- enable by default only with fp32.

* address review comments

* remove random line in comment

* remove unused constants

- W,H are hardcoded to 8 - because there are assumptions in the code based on that. No point in defining constants.

* cuda winograd fixes (#1238)

* cuda winograd fixes
- don't typecast directly to half datatype in CPU side code as older CUDA runtime doesn't support that.
 - don't use gemmEx version on GPUs older than Maxwell generation (not supported).
 - modify the check to enable custom_winograd setting. It should be faster in most cases - except  presently on RTX GPUs when using fp16.

* Allow most parts of fen to be optional. (#1234)

Default to white to move, no castling, no en passant, 0 rule50ply, 1 total move. Also convert other string to std::string and removing using.

* Fix UpdateNps to actually smooth the nps and correctly handle time_since_movestart_ms == 0 (#1243)

* Update changelog for 0.25.0 final release. (#1244)

* Always report at least 1 depth. (#1247)

* Fix un-intended regression for GTX GPUs (#1246)

* memory optimization for cudnn custom_winograd (#1250)

* memory optimization for cudnn custom_winograd

- don't save untransformed weights
- print warning message when low memory is detected.

* address review comments

* fix warning message

* fix total weight size calculation

2 layers per residual block!

* keep pdb files only for release builds  (#1256)

* doc update (#1267)

* Include verbose stats for the node. (#1268)

Use printing lambdas for parts of the verbose output to share between the newly outputted node and its children.

* add alphazero time manager (#1201)

* Updated FLAGS.md with logfile flag (#1275)

* Fixed a typo in CONTRIBUTING.md (#1274)

* Update Readme about using git (#1265)

* Make `wl_` double. (#1280)

* Move move filter population to a constructor. (#1281)

* Filter out illegal searchmoves to avoid crashing. (#1282)

* Clear policy for terminal loss. (#1285)

* Allow smart pruning to terminate search if win is known. (#1284)

* Allow smart pruning to terminate search if win is known.

* Minor tweak, better safe than sorry.

* Fix bug where pv might not update for best move change. (#1286)

* Fix bug where pv might not update.

* Fix...

Co-authored-by: Alexander Lyashuk <[email protected]>
Co-authored-by: Tilps <[email protected]>
Co-authored-by: Naphthalin <[email protected]>
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: Ed Lee <[email protected]>
Co-authored-by: borg323 <[email protected]>
Co-authored-by: Hace <[email protected]>
Co-authored-by: Kip Hamiltons <[email protected]>
Co-authored-by: nguyenpham <[email protected]>

* Change defaults and unhide MLH options

* Update values per @Naphthalin's comments

Co-authored-by: Alexander Lyashuk <[email protected]>
Co-authored-by: Tilps <[email protected]>
Co-authored-by: Naphthalin <[email protected]>
Co-authored-by: Ankan Banerjee <[email protected]>
Co-authored-by: Ed Lee <[email protected]>
Co-authored-by: borg323 <[email protected]>
Co-authored-by: Hace <[email protected]>
Co-authored-by: Kip Hamiltons <[email protected]>
Co-authored-by: nguyenpham <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Syzygy TB extreme performance dropoff.
3 participants