memory optimization for cudnn custom_winograd #1250

ankan-ban · 2020-04-29T11:43:18Z

don't save untransformed weights
print warning message when low memory is detected.
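
To illustrate the first point, here is a minimal CPU-side sketch of keeping only the transformed copy of the weights. It is not the code from this PR: the helper names `TransformFilter` and `TransformAndDiscard` are hypothetical, and the matrix `kG` is the textbook filter-transform matrix for Winograd F(4x4, 3x3), assumed here rather than taken from lc0 (the real kernels may use a different scaling or memory layout).

```cpp
#include <cstddef>
#include <vector>

// Textbook G matrix for Winograd F(4x4, 3x3). Assumption for illustration,
// not copied from the lc0 sources.
constexpr float kG[6][3] = {
    {1.0f / 4, 0.0f, 0.0f},
    {-1.0f / 6, -1.0f / 6, -1.0f / 6},
    {-1.0f / 6, 1.0f / 6, -1.0f / 6},
    {1.0f / 24, 1.0f / 12, 1.0f / 6},
    {1.0f / 24, -1.0f / 12, 1.0f / 6},
    {0.0f, 0.0f, 1.0f}};

// Transforms one row-major 3x3 filter g into a row-major 6x6 tile
// u = G * g * G^T.
void TransformFilter(const float* g, float* u) {
  float tmp[6][3];  // tmp = G * g
  for (int i = 0; i < 6; ++i) {
    for (int j = 0; j < 3; ++j) {
      tmp[i][j] = 0.0f;
      for (int k = 0; k < 3; ++k) tmp[i][j] += kG[i][k] * g[k * 3 + j];
    }
  }
  for (int i = 0; i < 6; ++i) {
    for (int j = 0; j < 6; ++j) {
      float acc = 0.0f;  // (tmp * G^T)[i][j]
      for (int k = 0; k < 3; ++k) acc += tmp[i][k] * kG[j][k];
      u[i * 6 + j] = acc;
    }
  }
}

// Takes ownership of the untransformed weights (9 floats per filter) and
// returns only the transformed copy (36 floats per filter). The untransformed
// buffer is released when this function returns.
std::vector<float> TransformAndDiscard(std::vector<float> raw) {
  const std::size_t filters = raw.size() / 9;
  std::vector<float> transformed(filters * 36);
  for (std::size_t f = 0; f < filters; ++f) {
    TransformFilter(&raw[f * 9], &transformed[f * 36]);
  }
  return transformed;
}
```

A caller would pass the loaded weights with `std::move`, so that after the call only the 36-element tiles remain in memory. The transformed copy is four times larger than the original (36 versus 9 values per filter), which is why holding both at once is costly.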


src/neural/cuda/layers.cc

src/neural/cuda/network_cudnn.cc

2 layers per residual block!

borg323 · 2020-04-29T17:05:50Z

src/neural/cuda/network_cudnn.cc

+      // No hope of using custom winograd - even the fallback path might not run.
+      use_custom_winograd_ = false;
+    } else if (use_custom_winograd_) {
+      if (transformed_residual_weight_size > 0.8 * deviceProp.totalGlobalMem) {


I was thinking about an additional check here to disable the custom code only if it wasn't explicitly requested by the user, allowing the user to override this (with the warning in the else branch), but this is probably overkill.

I'm going to assume that we can do such things in a follow-up if they end up looking useful, and just merge this for now.

As I said, overkill. Will only consider it if we get complaints about it.
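
For reference, the override discussed in this thread could look roughly like the sketch below. It was not implemented in this PR. The function `KeepCustomWinograd`, the `explicitly_requested` flag and the warning texts are hypothetical; only the 0.8 threshold against total device memory comes from the diff above.

```cpp
#include <cstddef>
#include <iostream>

// Decides whether the custom winograd path stays enabled.
// `explicitly_requested` is a hypothetical flag meaning the user turned the
// option on themselves rather than getting it by default.
bool KeepCustomWinograd(bool enabled, bool explicitly_requested,
                        std::size_t transformed_bytes,
                        std::size_t total_device_bytes) {
  if (!enabled) return false;
  // Same 0.8 factor as in the reviewed hunk.
  const bool low_memory = static_cast<double>(transformed_bytes) >
                          0.8 * static_cast<double>(total_device_bytes);
  if (!low_memory) return true;
  if (explicitly_requested) {
    // The user asked for it: warn, but respect the request.
    std::cerr << "WARNING: low GPU memory, keeping custom_winograd as "
                 "explicitly requested.\n";
    return true;
  }
  std::cerr << "WARNING: low GPU memory, turning off custom_winograd.\n";
  return false;
}
```

With this shape, the default behaviour matches the merged change, and only an explicit user request changes the outcome.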

* memory optimization for cudnn custom_winograd
  - don't save untransformed weights
  - print warning message when low memory is detected.
* address review comments
* fix warning message
* fix total weight size calculation
  2 layers per residual block!

* Time management refactoring (LeelaChessZero#1195)
  * Appended files.
  * Compiles.
  * Compiles again.
  * Make smart pruning use smoothed nps.
  * Seems to be fully implemented.
  * Mistype.
  * One more bug.
  * Found discrepancy with documentation.
  * Bugfixes.
  * Don't smooth nps during the first move.
  * Too large default for timeuse decay.
  * Bugfix.
  * Fix build.
  * Relax defaults a bit. Add fixed to logging.
  * Remove "smooth" to "smooth-experimental" for now.
* MLH verbose stats - Issue 1200 (LeelaChessZero#1230)
  * Add M effect logic to output section
  * Fix missing prefixes and semicolons
  * Some fixes.
  * Slight format improvement?
  Co-authored-by: Tilps
* Start TempDecay only after a given number of moves (LeelaChessZero#1212)
  * Added TempDecayStartMove for starting temp decay only after a given number of moves. This allows keeping initial game up for a few moves and still use decay.
  * Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.
  * Doesn't allow temperature to fall below endgame temp during temp decay. Still allows initial temp to be below endgame temp.
  * Hide temp options
  * renamed TempDecayStartMove to TempDecayDelayMoves
  Co-authored-by: Alexis Olson
* Changelog for 0.25.0-rc2. (LeelaChessZero#1233)
  * Changelog for 0.25.0-rc2.
  * Add one more PR to the changelog.
* Cuda winograd (LeelaChessZero#1228)
  * custom winograd convolution for cuda backends
  * custom winograd fixes
    - fix a bug to make it work for non-SE networks
    - enable by default only with fp32.
  * address review comments
  * remove random line in comment
  * remove unused constants - W,H are hardcoded to 8 - because there are assumptions in the code based on that. No point in defining constants.
* cuda winograd fixes (LeelaChessZero#1238)
  * cuda winograd fixes
    - don't typecast directly to half datatype in CPU side code as older CUDA runtime doesn't support that.
    - don't use gemmEx version on GPUs older than Maxwell generation (not supported).
    - modify the check to enable custom_winograd setting. It should be faster in most cases - except presently on RTX GPUs when using fp16.
* Allow most parts of fen to be optional. (LeelaChessZero#1234)
  Default to white to move, no castling, no en passant, 0 rule50ply, 1 total move. Also convert other string to std::string and removing using.
* Fix UpdateNps to actually smooth the nps and correctly handle time_since_movestart_ms == 0 (LeelaChessZero#1243)
* Update changelog for 0.25.0 final release. (LeelaChessZero#1244)
* Always report at least 1 depth. (LeelaChessZero#1247)
* Fix un-intended regression for GTX GPUs (LeelaChessZero#1246)
* memory optimization for cudnn custom_winograd (LeelaChessZero#1250)
  * memory optimization for cudnn custom_winograd
    - don't save untransformed weights
    - print warning message when low memory is detected.
  * address review comments
  * fix warning message
  * fix total weight size calculation
    2 layers per residual block!
* keep pdb files only for release builds (LeelaChessZero#1256)
* doc update (LeelaChessZero#1267)
* Include verbose stats for the node. (LeelaChessZero#1268)
  Use printing lambdas for parts of the verbose output to share between the newly outputted node and its children.
* add alphazero time manager (LeelaChessZero#1201)
* Updated FLAGS.md with logfile flag (LeelaChessZero#1275)
* Fixed a typo in CONTRIBUTING.md (LeelaChessZero#1274)
* Update Readme about using git (LeelaChessZero#1265)
* Make `wl_` double. (LeelaChessZero#1280)
* Move move filter population to a constructor. (LeelaChessZero#1281)
* Filter out illegal searchmoves to avoid crashing. (LeelaChessZero#1282)
* Clear policy for terminal loss. (LeelaChessZero#1285)
* Allow smart pruning to terminate search if win is known. (LeelaChessZero#1284)
  * Allow smart pruning to terminate search if win is known.
  * Minor tweak, better safe than sorry.
* Fix bug where pv might not update for best move change. (LeelaChessZero#1286)
  * Fix bug where pv might not update.
  * Fix...

Co-authored-by: Alexander Lyashuk
Co-authored-by: Tilps
Co-authored-by: Naphthalin
Co-authored-by: Ankan Banerjee
Co-authored-by: Ed Lee
Co-authored-by: borg323
Co-authored-by: Hace
Co-authored-by: Kip Hamiltons
Co-authored-by: nguyenpham

@Naphthalin

* Catch up to master (#6)
* Change defaults and unhide MLH options
* Update values per @Naphthalin's comments

memory optimization for cudnn custom_winograd

476fd5f

- don't save untransformed weights
- print warning message when low memory is detected.

Tilps reviewed Apr 29, 2020

src/neural/cuda/layers.cc

src/neural/cuda/network_cudnn.cc

ankan-ban added 3 commits April 29, 2020 17:32

address review comments

99eba75

fix warning message

45c3cb4

fix total weight size calculation

5e4c659

2 layers per residual block!
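
The size estimate this commit corrects can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `ResidualWeightSizes` and `WeightSizes` are hypothetical names, every residual convolution is assumed to be 3x3 with the same number of input and output filters, and the transformed filters are assumed to be 6x6 tiles (36 values instead of 9).

```cpp
#include <cstddef>

struct WeightSizes {
  std::size_t untransformed;  // bytes for the plain 3x3 filters
  std::size_t transformed;    // bytes for the 6x6 winograd-domain filters
};

// Estimates residual tower weight sizes in bytes.
// blocks:    number of residual blocks
// filters:   input and output channels of every residual convolution
// elem_size: bytes per weight, e.g. 4 for fp32 or 2 for fp16
inline WeightSizes ResidualWeightSizes(std::size_t blocks, std::size_t filters,
                                       std::size_t elem_size) {
  const std::size_t single_layer = 3 * 3 * filters * filters * elem_size;
  // Each residual block holds two convolution layers, which is the factor
  // the commit message points out.
  const std::size_t untransformed = single_layer * blocks * 2;
  // A 3x3 filter becomes a 6x6 tile after the transform: 36 / 9 = 4x.
  const std::size_t transformed = untransformed * 36 / 9;
  return {untransformed, transformed};
}
```

For example, under these assumptions a 20-block, 256-filter fp32 network needs about 94 MB for the untransformed residual weights and about 377 MB for the transformed ones. The transformed figure is what the `0.8 * deviceProp.totalGlobalMem` check in the diff is compared against.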

borg323 approved these changes Apr 29, 2020


Tilps approved these changes Apr 30, 2020


Tilps merged commit ad4b5f2 into LeelaChessZero:master Apr 30, 2020
