lc0-cudnn : add support for fp16 network eval #5

ankan-ban · 2018-06-01T16:21:02Z

Slightly more than 2x speedup (for large batch sizes) on supported hardware, without much loss of precision.

Currently fp16 is only supported on Volta-based GPUs (Titan V, Tesla V100, etc.).

jjoshua2 · 2018-06-01T18:24:29Z

How can there be more than a 2x improvement? Shouldn't that be the max? Is the loss of precision causing a higher average batch size?

mooskagh · 2018-06-01T19:02:52Z

Would it be possible, instead of converting all float* to void*, to have two template variants, for float and __fp16, similar to what I did for TensorFlow here: https://github.com/LeelaChessZero/lc0/blob/release/src/neural/network_tf.cc#L299
and then use ValueType* instead? (Maybe with some better name for ValueType, though. Float?)

So, it would be

template <typename ValueType>
class CudnnNetwork ...

and all helper classes/functions would also become templates, registered like:

REGISTER_NETWORK("cudnn", CudnnNetwork<float>, 110)
REGISTER_NETWORK("cudnn-fp16", CudnnNetwork<__fp16>, 105)

The drawback, though, is that the best configuration won't be chosen automatically.
On the other hand, as cudnn-fp32 has better precision, everyone should be able to select it.

So, with this approach, if there is a function which needs a completely different implementation, then template specialization is useful. If, however, the difference is only in small details, this is possible:

if (std::is_same<float, ValueType>::value) {  // C++17 adds the shorthand std::is_same_v<float, ValueType>.
   // Do something float-specific.
}

Also, this file is becoming big. Maybe at that point it deserves a separate src/neural/cudnn/ directory and is worth splitting into several files?

ankan-ban · 2018-06-01T19:05:00Z

Volta Tensor Cores (for FP16 math) have much higher throughput than fp32.
E.g., see the table in https://www.anandtech.com/show/12170/nvidia-titan-v-preview-titanomachy

It is only slightly more than 2x (instead of 8x) because:

We aren't getting full utilization yet (an even bigger batch size is needed).
cudnn can't use the Winograd convolution algorithm with Tensor Cores. (Edit: actually I think it can, but it needs really big channel counts/batch sizes. Maybe I will try it out with 256-filter networks.)

jnewlin12345 · 2018-06-01T19:11:37Z

I'm looking at the anandtech link and see:

Single Precision: 13.8 TFLOPS
Half Precision: 27.6 TFLOPS

Where is the 8x coming from?

ankan-ban · 2018-06-01T19:26:45Z

Tensor Performance(Deep Learning) | 110 TFLOPS
Tensor Cores are basically special units for small matrix multiplies.
See this blog for more details: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
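
The 2x and 8x figures follow directly from the peak numbers quoted in this thread (13.8, 27.6 and 110 TFLOPS). The sketch below is plain arithmetic on those three values; the constant and function names are made up for illustration, and peak figures say nothing about what a particular network actually achieves.

```cpp
// Peak throughput figures quoted in this thread, in TFLOPS.
constexpr double kFp32Tflops = 13.8;     // Single precision.
constexpr double kFp16Tflops = 27.6;     // Half precision without Tensor Cores.
constexpr double kTensorTflops = 110.0;  // Tensor Core ("Deep Learning") figure.

// Ratio of a given peak figure to the fp32 peak.
constexpr double SpeedupOverFp32(double tflops) {
  return tflops / kFp32Tflops;
}
```

SpeedupOverFp32(kFp16Tflops) is about 2, while SpeedupOverFp32(kTensorTflops) is just under 8, which is where the "instead of 8x" ceiling comes from.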


ankan-ban · 2018-06-24T17:24:26Z

@mooskagh, sorry for the long delay. I was playing with int8 but I don't have it working yet. I have made the changes to have two template variants for half and float. Will add int8 in a separate PR when it's ready.

Error323 · 2018-06-24T17:29:15Z

Hi @ankan-ban are you using TensorRT? Could you push what you have? I'd like to dive in :) (in a separate branch that is)

ankan-ban · 2018-06-24T17:33:02Z

Haven't tried TensorRT yet. It's Linux only and I am on Windows right now :(. I think it's time for me to set up a Linux partition and compare the performance of TensorRT with cudnn.

I was trying out int8 with cudnn, where the adjustment/calibration of weights has to be done manually. (I believe TensorRT does it for you?)
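
To illustrate what manual calibration involves, here is a minimal sketch of symmetric per-tensor int8 quantization, where the scale is chosen so that the largest magnitude maps to 127. This is one common scheme, assumed here purely for illustration; it is not necessarily what the cudnn int8 experiment or TensorRT does. ComputeScale, Quantize and Dequantize are hypothetical helpers, not cuDNN calls.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Picks a scale so that the largest absolute value maps to 127.
// Falls back to 1.0 for an all-zero (or empty) tensor.
inline float ComputeScale(const std::vector<float>& values) {
  float max_abs = 0.0f;
  for (float v : values) max_abs = std::max(max_abs, std::fabs(v));
  return max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
}

// Rounds to the nearest int8 step and clamps to the symmetric range.
inline std::int8_t Quantize(float value, float scale) {
  float scaled = std::round(value / scale);
  scaled = std::min(127.0f, std::max(-127.0f, scaled));
  return static_cast<std::int8_t>(scaled);
}

// Maps an int8 value back to an approximate float.
inline float Dequantize(std::int8_t q, float scale) {
  return static_cast<float>(q) * scale;
}
```

Values outside the calibrated range saturate at ±127, which is why the choice of calibration data (or of weights used to compute the scale) matters for accuracy.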

Error323 · 2018-06-24T17:35:44Z

That's correct, TensorRT allows you to calibrate automatically given a dataset. I did some preliminary testing here: glinscott/leela-chess#52 (comment)

mooskagh · 2018-06-27T21:28:55Z

src/neural/network_cudnn.cu

  reportCUDAErrors(cudaMalloc(&biases, biasSize));

+  bool fp16 = std::is_same<half, DataType>::value;


mooskagh · 2018-06-27T21:29:50Z

src/neural/network_cudnn.cu

            cublasHandle_t cublas) override;

 private:
  const bool use_relu_;
+
+  // always in float irrespective of DataType


Sorry for nitpicking, but consider changing all comments to proper English sentences (starting with a capital letter, ending with a period).

Also consider running clang-format -i -style=google network_cudnn.cu, but I can do that myself if you don't have clang installed.

Done. I don't have clang but I got just the clang-format binary for windows.

mooskagh · 2018-06-27T21:32:56Z

src/neural/network_cudnn.cu

+void ConvLayer<DataType>::Eval(int N, DataType *output, const DataType *input,
+                               const DataType *input2, void *scratch, cudnnHandle_t cudnn,
+                               cublasHandle_t cublas) {
+  bool fp16 = std::is_same<half, DataType>::value;
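
A flag like this is typically used to pick per-precision settings at the point where descriptors are built. A minimal sketch, assuming a stand-in type and enum: Half16StandIn, ElementKind, DescriptorSketch and MakeDescriptor are hypothetical, and in a CUDA build the two enum values would correspond to cuDNN's CUDNN_DATA_FLOAT and CUDNN_DATA_HALF.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for CUDA's half so the sketch builds without CUDA.
struct Half16StandIn {
  std::uint16_t bits;
};

// Hypothetical stand-in for a library data-type enum.
enum class ElementKind { kFloat32, kFloat16 };

// Hypothetical descriptor holding the settings that depend on precision.
struct DescriptorSketch {
  ElementKind kind;
  std::size_t bytes_per_element;
};

// Derives the precision flag from the template parameter, then uses it
// to select the matching settings.
template <typename DataType>
DescriptorSketch MakeDescriptor() {
  const bool fp16 = std::is_same<Half16StandIn, DataType>::value;
  return {fp16 ? ElementKind::kFloat16 : ElementKind::kFloat32,
          sizeof(DataType)};
}
```

Because the flag is a compile-time constant for each instantiation, the compiler can fold the branch away, so the check costs nothing at run time.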


mooskagh · 2018-06-27T21:34:44Z

src/neural/network_cudnn.cu

 class CudnnNetwork : public Network {
 public:
  CudnnNetwork(Weights weights, const OptionsDict &options) {
    gpuId_ = options.GetOrDefault<int>("gpu", 0);

+    //int tryFp16 = options.GetOrDefault<int>("fp16", 0);


not needed?


Looks good, but remove clang-format.exe from the PR. :)


fp16 support

145aeab

ankan-ban added 4 commits June 3, 2018 13:48

Merge pull request #1 from LeelaChessZero/master

e28919f

get latest

Merge pull request #2 from LeelaChessZero/master

9e8bdf9

get latest

Merge pull request #3 from LeelaChessZero/master

313ad08

get latest

use templates for fp16 vs fp32 cuda backends

18261b7

remove commented out line

ccb4361

mooskagh previously approved these changes Jun 27, 2018


ankan-ban added 3 commits June 28, 2018 17:57

Merge pull request #4 from LeelaChessZero/master

f9f3372

get latest

minor formatting changes

fd05712

clang-format

0f55e24

Delete clang-format-r333363.exe

4680347

mooskagh approved these changes Jul 1, 2018


mooskagh merged commit 9399e6d into LeelaChessZero:master Jul 1, 2018

blin00 mentioned this pull request Jul 1, 2018

fix template related build issues with gcc #128

Merged

mooskagh added a commit to mooskagh/lc0 that referenced this pull request Aug 19, 2018

Typo LeelaChessZero#5.

e43317c

mooskagh mentioned this pull request May 26, 2019

use Eigen as a BLAS alternative #858

Merged

borg323 pushed a commit to borg323/lc0 that referenced this pull request Mar 22, 2023

Merge pull request LeelaChessZero#5 from borg323/blas_ab

b356308

batched blas smolgen

uwuplant pushed a commit to uwuplant/lc0 that referenced this pull request Oct 10, 2023

Merge pull request LeelaChessZero#5 from whelanh/uncertainty-weighting

5132f4c

Update for Linux compile
