[RFC] Provide WDL statistics #2778
Conversation
Hmm, I didn't know that we have to provide 3 values. There is quite some computational effort involved now. I would move this into the UCI namespace, maybe UCI::wdl() similar to UCI::value().
yes, could be moved to the UCI namespace. The computational effort is there, but this is only computed once per depth, so negligible at higher depth. All large scores (including TB and mate) get clamped to +- 1000cp, which should yield 1000 0 0.
Can we try to run alpha-beta on the win rate instead of the usual eval? Or transform the win rate back to the usual eval value (but implicitly taking the game ply into account) and then use our alpha-beta?
Very nice graphics, by the way! I wish we all had the skills to produce more graphs like that to ease the discussions!
Yes, I think this data should be exploitable, since it suggests that for two positions with similar eval one should use something closer to the root, but the current full form (exp, etc.) is too expensive to call in eval. (Edit: that's basically the curve that fits 'a' in the above graph. That's the score needed to have a 50% winrate, as a function of move.) Edit2: The transformation can be done with exponentials; it is the two curves as / bs above that are needed. I had a few tries a couple of days ago with linearized forms, see e.g. https://tests.stockfishchess.org/tests/view/5ee92722563bc7aa755ffc48 but that was just 'adjustment by hand', I didn't have the formulae yet. Note that the data is for the evaluation after search; the positions leading to this eval will be quite a bit deeper. :-) I'm learning to use matplotlib...
Very nice graphics, but why nerf the engine for information that can be calculated by the GUI, as TCEC is doing? I don't think this kind of information is worth any nerf to the engine. The engine already gives you the centipawn score. We need elo patches, not nerf patches. Congrats for the graphs and code anyway.
@AlexandreMasta it takes a couple of weeks of data collection and analysis to do that for one engine. It's unlikely GUIs (or TCEC) can put that effort in; this would lead to pretty different results. So far, I have not seen any approaches that take the move number into account, either. But yes, in principle it could be upstreamed to all GUIs. So, that's one point why I put this up for discussion.
I know you had lots of work. I'm sure everybody here is grateful for your work. It is very nice indeed. I'm just saying I don't think this should be implemented as a default, weakening the engine for this info. Maybe a "turn on / turn off" feature in the UCI options would be much better. So... it is nice, but make it so that the user can turn it off to get the best performance of the engine. Congrats!
have a look at the patch... the option is there. And it doesn't make the engine weaker, as the testing shows (and it is quite obvious as well from the code).
Very nice then! Maybe just turn it off by default. My sincere congratulations! I really was thinking this would be implemented as always ON. Nice job. With your lead SF is achieving great results! Cheers
That's very nice data, but one big disclaimer with all WDL approaches is that the draw ratio and the "weak side win" ratio are extremely TC/hardware dependent. Even with the work to get a nice fit from fishtest data, things can change drastically with different test conditions. Some other limitations apply as well.
@Alayan-stk-2 yes, I wanted to point that out. There is likely a TC dependence that is not captured in the model. I wonder, however, whether TC means the scores are different, or the effect of a given score is different. That's not entirely clear. Yes, adjudication plays a role. The model only retains data with eval < 400cp. There is some effect of adjudication on the win rate, but mostly outside the interval [0.05, 0.95], which I mentioned. The counting 'bias' is basically related to the definition. The model is correct for the definition (i.e. picking one position and one eval). Other definitions lead to other graphs.
I think it's both. When a position is clearly winning, eval is higher at longer TC as the engine can search deeper. In non-clearly-winning positions, the eval gets more reliable with more time, especially 0.00 draws. But blunders and mistakes are much more likely at shorter TCs, so even if short and long search do happen to give the same eval, WDL will differ. The simplest example is 0.00, as this eval can happen at any TC. At TCEC conditions, a 0.00 is a big statement, the draw probability is really high. At bullet, the expected 3-fold is likely to be avoidable, and a blunder later in the game is frequent. By the way, another issue with WDL is how to rate 0.00. 0.00 is a symmetrical eval, but in most positions evaluated as such, unless 100% D, one side has a much bigger win probability than the other... We can't easily extract this information from SF.
Let's take a simplified dataset to illustrate why the counting bias leads to results going against expectations. 1 game with a static fortress holding a +3 eval from move 50 to move 98; this game is drawn. 49 games with a +3 eval at move 50, 51... to 98, each for one move only; these are won. 98% of games having reached a +3 eval in the dataset between moves 50 and 98 were won. So if a +3 happens in a game, the best prediction is a 98% win. But 50% of positions with a +3 eval between move 50 and move 98 happened in that one fortress game, so the position-based model gives 50% win / 50% draw for +3.
Good idea. I think chess GUIs could calculate WDL, but surely one data fit can't suit all engines, so it is best if the chess engine does it itself. Just a small suggestion: when working with float/double values, we may see negative zero (-0), which is treated as zero and causes no problem while we keep the number as a float/double. However, after rounding and multiplying by a large number to convert it into an integer in the range 0 - 1000, the number may become a small negative integer (such as -1). It looks weird and raises questions from users. This problem is sometimes observed with Lc0. We can fix it in a simple way by using std::max:
@nguyenpham I think that rounding error should not happen (the clamping of the eval, plus the round-to-nearest, takes care of that already).
w, d, l could be zero, which may break formulas/variants built on the function. BTW, that won't happen frequently (zero is a rare case anyway) and it is not a serious issue. You may go ahead and come back to fix it later if we see the problem happen ;)
This and your visualizations are fantastic.
How about limiting the output to depths >7 (for example) to alleviate the cost?
The computational cost is really overestimated here. The full wdl calculation takes about 60 cycles and runs at 42614851 calls per second, so it adds about a microsecond of overhead per move at TCEC conditions. Meanwhile, I also verified that on the full valid input range of plies and evals there is no case where the wdl is outside the [0, 1000] interval.
@vondele not a comment about the PR per se, but do you have a similar model and similar graphs using the material left on the board (using piece count, or something very simple like Q=9 R=5 B=N=3 P=1, or even pos.non_pawn_material())? If the violet -> yellow gradient were purely vertical, or even parabolic, then we could probably very easily correct our evaluation function to get comparable winning probabilities for openings and endgames, and that could turn into an Elo gainer.
yes, that's also available already... but not analysed into a full model; it is more complex in the interesting part:
@vondele
@snicolet patches welcome ;-) However, one has to be careful with the interpretation of the graphs, i.e. the 'position bias' argument by Alayan. The graphs tell you what the probability is for a random position in fishtest games to be won, for a given material count and score. Won endgames with few pieces are quickly over, while drawn endgames will drag on; this aspect is part of the graph. Things might look very different for random positions encountered in a typical endgame search. Nevertheless, getting some ideas for patches was the main purpose of these graphs :-), the more people are intrigued by them, the better.
I have pushed some tests, for instance:
Tests will show! Awesome work and findings to come.
Just a note of thanks to everyone who helped pull this together, and especially to @vondele for the lead he has taken to make this happen. This is a very desirable enhancement. +1 seems clearly inadequate, so let's make it a +1000 😊 Edit: FWIW, both cutechess and the Fritz GUI show the WDL data, which is excellent, as they were enhanced to show that data from LC0 and other NN engines.
@MichaelB7 thanks for confirming it works in some GUIs. I'll merge it in the following round.
Using doubles instead of floats is very slightly faster for me (for two parallel bench at bench 20). I have implemented that and corrected some typos in the comments and readme in this commit:
@snicolet thanks for the careful review.... I clearly need to pay more attention when writing comments. The speed difference won't be measurable with a bench, but using doubles is somewhat more consistent with the rest of the code.
@vondele There were also some typos in the readme :-)
|
Thanks for the discussion and comments.
@vondele The Aquarium GUI does not support this feature, it seems. And I see it is set to ON by default in the new UCI parameter. Should Aquarium users turn it off? Or, even with the option turned on, will it not affect the output PV analysis? Thanks!
@Coolchessguykevin in principle, a UCI compliant GUI should have no problems with it, i.e. it will just ignore the additional output. Nothing should be affected by it. However, if a GUI fails to deal with this output (e.g. crashes), a user can switch it off till the GUI is fixed.
This updates the WDL model based on the LTC statistics in June this year (10M games), so from pre-NNUE to NNUE based results. (for old results see, official-stockfish#2778) As before the fit by the model to the data is quite good. No functional change
This updates the WDL model based on the LTC statistics in June this year (10M games), so from pre-NNUE to NNUE based results. (for old results see, official-stockfish#2778) As before the fit by the model to the data is quite good. closes official-stockfish#3582 No functional change
This updates the WDL model based on the LTC statistics for the last month (8M games). for old results see: official-stockfish#3582 official-stockfish#2778 the model changed a bit from the past, some images to follow in the PR No functional change.
This updates the WDL model based on the LTC statistics for the last month (8M games). for old results see: official-stockfish#3582 official-stockfish#2778 the model changed a bit from the past, some images to follow in the PR closes official-stockfish#3981 No functional change.
This updates the WDL model based on the LTC statistics for the two weeks (3M games). for old results see: official-stockfish#3981 official-stockfish#3582 official-stockfish#2778 closes official-stockfish#4115 No functional change.
This updates the WDL model based on the LTC statistics for the two weeks (3M games). for old results see: official-stockfish/Stockfish#3981 official-stockfish/Stockfish#3582 official-stockfish/Stockfish#2778 closes official-stockfish/Stockfish#4115 No functional change. (cherry picked from commit e639c45)
A number of engines, GUIs and tournaments have started to report WDL estimates
alongside or instead of scores. This patch enables reporting of those stats
in a more or less standard way (http://www.talkchess.com/forum3/viewtopic.php?t=72140)
The model this reporting uses is based on data derived from a few million fishtest LTC games:
given a score and a game ply, it provides a win rate that matches that data rather closely,
especially in the intermediate range [0.05, 0.95]. Some data is shown at
https://github.com/glinscott/fishtest/wiki/UsefulData#win-loss-draw-statistics-of-ltc-games-on-fishtest
Making the conversion game ply dependent is important for a good fit, and is in line
with experience that a +1 score in the early midgame is more likely a win than in the late endgame.
Even when enabled, the printing of the info causes no significant overhead.
Passed STC:
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 197112 W: 37226 L: 37347 D: 122539
Ptnml(0-2): 2591, 21025, 51464, 20866, 2610
https://tests.stockfishchess.org/tests/view/5ef79ef4f993893290cc146b
The PR is for discussion, the comments will be updated with a comparison of model and data.
Bench: 4789930