[RFC] Provide WDL statistics #2778
Conversation
Hmm, I didn't know that we have to provide 3 values. There is quite some computational effort involved now. I would move this into the UCI namespace, maybe UCI::wdl() similar to UCI::value().
yes, could be moved to the UCI namespace. The computational effort is there, but this is only computed once per depth, so negligible at higher depth. All large scores (including TB and mate) get clamped to +- 1000cp, which should yield 1000 0 0.
Can we try to run alpha-beta on the win rate instead of the usual eval? Or transform the win rate back to the usual eval value (but implicitly taking the game ply into account) and then use our alpha-beta?
Very nice graphics, by the way! I wish we all had the skills to produce more graphs like that to ease the discussions!
Yes, I think this data should be exploitable, since it suggests that for two positions with similar eval one should use something closer to the root, but the current full form (exp, etc.) is too expensive to call in eval. (Edit: that's basically the curve that fits 'a' in the above graph. That's the score needed to have a 50% winrate, as a function of move.) Edit2: The transformation can be done with exponentials; it is the two curves as / bs above that are needed. I had a few tries a couple of days ago with linearized forms, see e.g. https://tests.stockfishchess.org/tests/view/5ee92722563bc7aa755ffc48 but that was just 'adjustment by hand', I didn't have the formulae yet. Note that the data is for the evaluation after search; the positions leading to this eval will be quite a bit deeper. :-) I'm learning to use matplotlib...
Very nice graphics, but why nerf the engine for information that can be calculated by the GUI, as TCEC is doing? I don't think this kind of information is worth any nerf to the engine. The engine already gives you the centipawn score. We need elo patches, not nerf patches. Congrats for the graphs and code anyway.
@AlexandreMasta it takes a couple of weeks of data collection and analysis to do that for one engine. It's unlikely GUIs (or TCEC) can put that effort in; this would lead to pretty different results. So far, I have not seen any approaches that take the move number into account, either. But yes, in principle it could be upstreamed to all GUIs. So, that's one point why I put this up for discussion.
I know you had lots of work. I'm sure everybody here is grateful for your work. It is very nice indeed. I'm just saying I don't think this should be implemented as a default, weakening the engine for this info. Maybe a "turn on / turn off" feature in the UCI options would be much better. So... it is nice, but make it so that the user can turn it off to get the best performance of the engine. Congrats!
have a look at the patch... the option is there. And it doesn't make the engine weaker, as the testing shows (and it is quite obvious as well from the code).
Very nice then! Maybe just turn it off by default. My sincere congratulations! I really was thinking this would be implemented as always ON. Nice job. With your lead SF is achieving great results! Cheers
That's very nice data, but one big disclaimer with all WDL approaches is that the draw ratio and the "weak side win" ratio are extremely TC/hardware dependent. Even with the work to get a nice fit from fishtest data, things can change drastically with different test conditions. Some other limitations apply as well.
@Alayan-stk-2 yes, I wanted to point that out. There is likely a TC dependence that is not captured in the model. I wonder, however, whether TC means the scores are different, or the effect of a given score is different. That's not entirely clear. Yes, adjudication plays a role. The model only retains data with eval < 400cp. There is some effect of adjudication on the win rate, but mostly outside the interval [0.05, 0.95], which I mentioned. The counting 'bias' is basically related to the definition. The model is correct for the definition (i.e. picking one position and one eval). Other definitions lead to other graphs.
I think it's both. When a position is clearly winning, eval is higher at longer TC as the engine can search deeper. In non-clearly-winning positions, the eval gets more reliable with more time, especially 0.00 draws. But blunders and mistakes are much more likely at shorter TCs, so even if short and long search do happen to give the same eval, WDL will differ. The simplest example is 0.00, as this eval can happen at any TC. At TCEC conditions, a 0.00 is a big statement, the draw probability is really high. At bullet, the expected 3-fold is likely to be avoidable, and a blunder later in the game is frequent. By the way, another issue with WDL is how to rate 0.00. 0.00 is a symmetrical eval, but in most positions evaluated as such, unless 100% D, one side has a much bigger win probability than the other... We can't easily extract this information from SF.
Let's take a simplified dataset to illustrate why the counting bias leads to results going against expectations. 1 game with a static fortress holding a +3 eval from move 50 to move 98; this game is drawn. 49 games with a +3 eval at move 50, 51... to 98, each for one move only; these are won. 98% of games having reached a +3 eval in the dataset between moves 50 and 98 were won. So if a +3 happens in a game, the best prediction is a 98% win. But 50% of positions with a +3 eval between move 50 and move 98 happened in that one fortress game, so the position-based model gives 50% win / 50% draw for +3.
Good idea. I think chess GUIs could calculate WDL, but surely one data fit can't suit all engines, so it is best if the chess engine does it itself. Just a small suggestion: when working with float/double values, we may see negative zero (-0), which is treated as zero and causes no problem while we keep the number as a float/double. However, after rounding and multiplying by a large number to convert it into an integer in the range 0 - 1000, the number may become a small negative integer (such as -1). It looks weird and raises questions from users. This problem is sometimes observed with Lc0. We can fix it in a simple way by using std::max:
@nguyenpham I think that rounding error should not happen (the clamping of the eval, plus the round-to-nearest, takes care of that already).
w, d, l could be zero, which may break formulas/variants built on the function. BTW, that won't happen frequently (zero is a rare case anyway) and it is not a serious issue. You may go ahead and come back to fix it later if we see the problem happen ;)
This and your visualizations are fantastic.
How about limiting the output to depths >7 (for example) to alleviate the cost?
The computational cost is really overestimated here. The full wdl calculation takes about 60 cycles and runs at 42614851 calls per second, so it adds about a microsecond of overhead per move at TCEC conditions. Meanwhile, I also verified that on the full valid input range of plies and evals there is no case where the wdl is outside the [0, 1000] interval.
@vondele not a comment about the PR per se, but do you have a similar model and similar graphs using the material left on the board (using piece count, or something very simple like Q=9 R=5 B=N=3 P=1, or even pos.non_pawn_material())? If the violet -> yellow gradient were purely vertical, or even parabolic, then we could probably very easily correct our evaluation function to get comparable winning probabilities for openings and endgames, and that could turn into an Elo gainer.
yes, that's also available already... but not analysed into a full model; it is more complex in the interesting part:
@vondele
@snicolet patches welcome ;-) However, one has to be careful with the interpretation of the graphs, i.e. the 'position bias' argument by Alayan. The graphs tell you what the probability is for a random position in fishtest games to be won, for a given material count and score. Won endgames with few pieces are quickly over, while drawn endgames will drag on; this aspect is part of the graph. Things might look very different for random positions encountered in a typical endgame search. Nevertheless, getting some ideas for patches was the main purpose of these graphs :-), the more people are intrigued by them, the better.
I have pushed some tests, for instance:
Tests will show! Awesome work and findings to come.
Just a note of thanks to everyone who helped pull this together, and especially to @vondele for the lead he has taken to make this happen. This is a very desirable enhancement. +1 seems clearly inadequate, so let's make it a +1000 😊 Edit: FWIW, both cutechess and the Fritz GUI show the WDL data, which is excellent, as they were enhanced to show that data from LC0 and other NN engines.
@MichaelB7 thanks for confirming it works in some GUIs. I'll merge it in the following round.
Using doubles instead of floats is very slightly faster for me (for two parallel bench at bench 20). I have implemented that and corrected some typos in the comments and readme in this commit:
@snicolet thanks for the careful review.... I clearly need to pay more attention when writing comments. The speed difference won't be measurable with a bench, but using doubles is somewhat more consistent with the rest of the code.
@vondele There were also some typos in the readme :-)
|
Thanks for the discussion and comments.
@vondele The Aquarium GUI does not support this feature, it seems. And I see it is set to ON by default in the new UCI parameter. Should Aquarium users turn it off? Or, even with the option turned on, will it not affect the output PV analysis? Thanks!
@Coolchessguykevin in principle, a UCI compliant GUI should have no problems with it, i.e. it will just ignore the additional output. Nothing should be affected by it. However, if a GUI fails to deal with this output (e.g. crashes), a user can switch it off till the GUI is fixed.
This updates the WDL model based on the LTC statistics in June this year (10M games), so from pre-NNUE to NNUE based results. (for old results see, official-stockfish#2778) As before the fit by the model to the data is quite good. No functional change
This updates the WDL model based on the LTC statistics in June this year (10M games), so from pre-NNUE to NNUE based results. (for old results see, official-stockfish#2778) As before the fit by the model to the data is quite good. closes official-stockfish#3582 No functional change
This updates the WDL model based on the LTC statistics for the last month (8M games). for old results see: official-stockfish#3582 official-stockfish#2778 the model changed a bit from the past, some images to follow in the PR No functional change.
This updates the WDL model based on the LTC statistics for the last month (8M games). for old results see: official-stockfish#3582 official-stockfish#2778 the model changed a bit from the past, some images to follow in the PR closes official-stockfish#3981 No functional change.
This updates the WDL model based on the LTC statistics for the two weeks (3M games). for old results see: official-stockfish#3981 official-stockfish#3582 official-stockfish#2778 closes official-stockfish#4115 No functional change.
This updates the WDL model based on the LTC statistics for the two weeks (3M games). for old results see: official-stockfish/Stockfish#3981 official-stockfish/Stockfish#3582 official-stockfish/Stockfish#2778 closes official-stockfish/Stockfish#4115 No functional change. (cherry picked from commit e639c45)
A number of engines, GUIs and tournaments have started to report WDL estimates
alongside or instead of scores. This patch enables reporting of those stats
in a more or less standard way (http://www.talkchess.com/forum3/viewtopic.php?t=72140)
The model this reporting uses is based on data derived from a few million fishtest LTC games:
given a score and a game ply, it provides a win rate that matches that data rather closely,
especially in the intermediate range [0.05, 0.95]. Some data is shown at
https://github.com/glinscott/fishtest/wiki/UsefulData#win-loss-draw-statistics-of-ltc-games-on-fishtest
Making the conversion game ply dependent is important for a good fit, and is in line
with experience that a +1 score in the early midgame is more likely a win than in the late endgame.
Even when enabled, the printing of the info causes no significant overhead.
Passed STC:
LLR: 2.94 (-2.94,2.94) {-1.50,0.50}
Total: 197112 W: 37226 L: 37347 D: 122539
Ptnml(0-2): 2591, 21025, 51464, 20866, 2610
https://tests.stockfishchess.org/tests/view/5ef79ef4f993893290cc146b
The PR is for discussion, the comments will be updated with a comparison of model and data.
Bench: 4789930