Monday, November 6, 2017

Chess Analytics: Predicting rating from average centipawn loss

I have had the idea of trying to derive a player's rating "empirically", through their play rather than through their results (as is currently done). As a starting point, I thought it might be possible to approximate a player's rating by looking at their average centipawn loss. The average centipawn loss (aCPL) is the average amount by which a chess engine's evaluation of the position drops, from the player's perspective, after each of the player's moves. For example, if the score is about +1.20 before White's move, and White makes a small mistake so that the evaluation is +0.30 after the move, then White's centipawn loss for that move is +1.20 - (+0.30) = +0.90 (evaluations here are expressed in pawn units, so a loss of 0.90 corresponds to 90 centipawns); the average centipawn loss is the mean CPL over the course of the game. I thought that stronger players would have a lower aCPL, such that it would be possible to approximate a player's rating from their aCPL.
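To make the arithmetic concrete, here is a minimal sketch (the function names, and the convention that evaluations are taken from White's point of view in pawn units, are my assumptions; some aCPL definitions also clamp negative losses to zero, which is omitted here):

```python
def centipawn_losses(evals, color="white"):
    """Per-move loss for one player, from evaluations in pawn units.

    `evals` holds the engine score (from White's point of view) before
    White's first move and then after every half-move.
    """
    start = 1 if color == "white" else 2  # indices of evals after this player's moves
    sign = 1 if color == "white" else -1  # a drop hurts White, a rise hurts Black
    return [sign * (evals[i - 1] - evals[i]) for i in range(start, len(evals), 2)]

def acpl(evals, color="white"):
    losses = centipawn_losses(evals, color)
    return sum(losses) / len(losses)

# The example from the text: +1.20 before White's move, +0.30 after it.
print(centipawn_losses([1.20, 0.30]))  # [0.9], i.e. 90 centipawns
```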

The Data

To investigate this question, I downloaded one of the databases available on lichess.org. These databases include all rated games played on lichess during a given period. For the purposes of this exercise, I used a subset of the August 2014 database. From these games, I kept only those for which computer analysis was available, so that the engine evaluation--here, from Stockfish--is available after each half-move.

I used Python to transform the file of PGNs into a dataset amenable to statistical analysis; a rough sketch of this step follows. In this dataset, each row is a different game (i.e., a different PGN from the database downloaded from lichess), and the figure below the sketch shows the first few columns and rows of the dataset.
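As a rough sketch of that transformation (the file name, and the use of the python-chess library, are assumptions on my part rather than the exact code used):

```python
import chess.pgn  # the python-chess library
import pandas as pd

rows = []
with open("lichess_2014-08_analyzed.pgn") as pgn:  # hypothetical file name
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:                 # end of file
            break
        rows.append(dict(game.headers))  # PGN tags -> columns

df = pd.DataFrame(rows)
```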

[Figure: the first few rows and columns of the dataset]

As you can see, all the metadata for each PGN is kept, and the tags (like "WhiteElo", "Event", etc.) are used as column names. I had to perform operations on the "Moves" column in order to extract the evaluation at each ply; a sketch of that extraction follows the next figure. The resulting variables (up to 200 half-moves in this particular dataset) look like this:

[Figure: the per-ply evaluation variables (up to 200 half-moves)]

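A minimal sketch of that extraction, assuming the evaluations are embedded as lichess-style "[%eval ...]" comments on each move (the regex, function name, and 200-ply cap are mine, not the author's exact code):

```python
import re
import chess.pgn

EVAL_RE = re.compile(r"\[%eval (#?-?\d+(?:\.\d+)?)\]")

def evals_per_ply(game, max_plies=200):
    """Stockfish evaluation (in pawns) after each half-move of the mainline."""
    evals = []
    node = game
    while node.variations and len(evals) < max_plies:
        node = node.variation(0)
        match = EVAL_RE.search(node.comment)  # e.g. "[%eval 0.17]" or "[%eval #-3]"
        if match and not match.group(1).startswith("#"):
            evals.append(float(match.group(1)))
        else:
            evals.append(None)                # no eval, or a forced-mate score
    return evals
```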
Here, I've limited my attention to evaluations ranging from -3 to +3, to avoid statistics that are overly influenced by extreme scores; any evaluations greater than +3 or less than -3 have been removed. (The results presented below are very similar whether or not the full range is included, so this is a minor detail.) From there, I've calculated the difference between each evaluation and the preceding one; half of those differences represent the centipawn loss for the white player, and the other half represent the centipawn loss for the black player. The resulting variables look like this:

[Figure: the per-move evaluation-difference variables]

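A sketch of this clipping-and-differencing step with pandas, assuming the per-ply evaluations landed in hypothetical columns eval_1 ... eval_200 of the data frame df:

```python
eval_cols = [f"eval_{i}" for i in range(1, 201)]  # eval after ply 1..200

# Evaluations outside [-3, +3] become missing values.
clipped = df[eval_cols].where(df[eval_cols].abs() <= 3)

# Change in evaluation produced by each half-move.
diffs = clipped.diff(axis=1)
```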
Finally, for each game, it is possible to calculate the average centipawn loss for each player by averaging those difference variables within each row (i.e., across the columns of a given game), yielding the two aCPL variables:
[Figure: the two aCPL variables]

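The averaging itself is short. Continuing from the diffs frame above: odd plies (eval_1, eval_3, ...) follow White's moves and even plies follow Black's, and a drop in the White-point-of-view score is a loss for White while a rise is a loss for Black, hence the sign flip:

```python
white_cols = diffs.columns[0::2]  # differences caused by White's moves
black_cols = diffs.columns[1::2]  # differences caused by Black's moves
df["white_acpl"] = -diffs[white_cols].mean(axis=1)  # drop in eval hurts White
df["black_acpl"] = diffs[black_cols].mean(axis=1)   # rise in eval hurts Black
```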
Descriptive Statistics

Now that the data are ready for analysis, I looked at some descriptive statistics for the variables of interest, the white and black ratings and aCPLs. For the final dataset, the average rating for the white players was 1637 (SD = 238), and the average rating for the black players was 1630 (SD = 239). The average aCPLs for the white and black players were also very similar: 0.26 (SD = 0.16) for white and 0.24 (SD = 0.14) for black (again, looking only at positions in which the evaluation was between -3.00 and +3.00).
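These summaries are a single pandas call (reusing the hypothetical column names from above):

```python
cols = ["WhiteElo", "BlackElo", "white_acpl", "black_acpl"]
print(df[cols].astype(float).describe().loc[["mean", "std"]])
```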

Here's a histogram showing the frequency at which each aCPL is observed in the sample (separately for white and black), along with the corresponding kernel density estimation plot:

[Figure: histogram and kernel density estimate of aCPL, by color]

As can be seen in the second figure, the vast majority of players (with these transformed data) had a relatively low aCPL, below 0.50. This is to be expected, since players with much higher aCPLs would quickly lose almost every game.
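For reference, the two panels above can be produced with matplotlib and seaborn; a sketch, again using the hypothetical column names:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for col in ["white_acpl", "black_acpl"]:
    ax1.hist(df[col].dropna(), bins=40, alpha=0.5, label=col)  # frequencies
    sns.kdeplot(df[col].dropna(), ax=ax2, label=col)           # densities
ax1.set_xlabel("aCPL"); ax1.set_ylabel("count"); ax1.legend()
ax2.set_xlabel("aCPL")
plt.show()
```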

Visually, it's also possible to inspect the relationship between aCPL and rating through a scatterplot:

[Figure: scatterplot of player rating against aCPL]
The horror! Even before jumping into the modeling, our guess is that there is not a strong relationship between aCPL and rating. Observations are relatively tightly distributed on the x axis (aCPL) across the full range of the y axis (player rating). In other words, much the same spread of aCPLs is observed at every rating level in this dataset. Knowing a player's aCPL would not help us much in guessing their rating.
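The scatterplot itself is a one-liner per color; a sketch for the white players:

```python
import matplotlib.pyplot as plt

plt.scatter(df["white_acpl"], df["WhiteElo"].astype(float), s=5, alpha=0.3)
plt.xlabel("average centipawn loss (pawns)")
plt.ylabel("player rating")
plt.show()
```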

Predicting Rating from aCPL

I tested the relationship between aCPL and player rating more formally through linear regression, predicting player rating from the player's aCPL, separately for white and black players. (It would be possible to include both white and black players in the same analysis, but ordinary linear regression would not be appropriate there: the two players' scores within a game are closely related, so the two sets of observations are not independent. A mixed-effects model would be a good way to accommodate that dependency.)
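A sketch of the white-player regression with statsmodels' formula API (the black-player model is identical with the other columns; the column names are again the hypothetical ones used above):

```python
import statsmodels.formula.api as smf

white = df[["WhiteElo", "white_acpl"]].astype(float).dropna()
fit = smf.ols("WhiteElo ~ white_acpl", data=white).fit()
print(fit.summary())  # coefficient, SE, t-statistic, p-value, R-squared
```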

As expected given the graphs above, it is rather difficult to predict a player's rating accurately from their aCPL. Results are slightly different for white and black players: it is slightly easier to predict a black player's rating from their aCPL than to do the same for a white player (in these data, anyway). For white players, the regression coefficient for aCPL was -346 [SE = 34.6, t(1843) = -10.00, p < .001, R² = .05, adj. R² = .05], meaning that we would expect two white players who differ by 1 point in aCPL to differ by 346 rating points, with the higher-aCPL player rated lower (recall that aCPL is measured here in pawn units, so a 1-point difference in aCPL is very large). The corresponding coefficient for black players was -442 [SE = 37.3, t(1844) = -11.83, p < .001, R² = .07, adj. R² = .07], meaning that we would expect two black players who differ by 1 point in aCPL to differ by 442 rating points.

Unfortunately, these predictions are not very useful, as indicated by the low R²: only 5-7% of the variation in ratings can be accounted for by variation in aCPL, leaving the remaining 93-95% unaccounted for. I tried including higher-order polynomials of aCPL as predictors (2nd and 3rd degree), but as you might have guessed from the scatterplot above, this was not helpful (there simply does not seem to be much of a relationship between aCPL and rating, of any form). So my overall conclusion, stated somewhat crudely: I would not use this model to predict a player's rating after observing one of their games and analyzing it with Stockfish. The search for a better method continues.
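For completeness, the polynomial variants are a one-line change in the formula; a sketch continuing from the white frame above:

```python
poly = smf.ols("WhiteElo ~ white_acpl + I(white_acpl**2) + I(white_acpl**3)",
               data=white).fit()
print(poly.rsquared)  # in these data, barely better than the linear fit
```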

What's Next?

There have to be ways to predict a player's rating from their gameplay, since their play is, after all, what dictates the outcome of the game. The newer lichess databases provide the players' clocks during the game, so it would be possible to examine whether the time taken on each move has an effect on the player's rating (or on the outcome of the game). In this analysis I've also included all games regardless of their time control, but perhaps it would be best to exclude short time-control (e.g., bullet) games, as well as games involving players with provisional ratings. As a first step I've also combined information from all moves into one summary statistic, the aCPL, but incorporating move-level information might yield better results. We shall see in a future post.

12 comments:

Richard said...

Hi Patrick,

I'm surprised your R^2 values are as low as they are. Looking at R^2 on Chesstempo.com for aCPL to rating, we average 0.10 across all games. Time control has a big impact on the correlation though:

Bullet: 0.05
Blitz: 0.11
Rapid: 0.18
Long: 0.18
Correspondence: 0.24

One of the reasons these numbers are not higher is probably that there is quite a large amount of variability from game to game. Predicting playing strength from one game based on ACPL doesn't look that useful. However, if we look at the average ACPL across all of a user's games, restrict the analysis to players who played >50 games, and use the average rating over their games, then we start to see higher correlations: the R^2 for average ACPL against rating is 0.54.

If we then use the same average-ACPL and >50-game criteria, and factor out the variability that comes from mixing different time controls, the slower time controls start to show quite reasonable correlation levels, with R^2:
Bullet: 0.22
Blitz: 0.58
Rapid: 0.72
Long: 0.49
Correspondence: 0.64

I'm not sure why Long and Correspondence are lower than Rapid, but the sample sizes are much smaller, with Rapid being nearly 30 times as popular as Long and Correspondence (partly because it takes a long time to reach the 50-game minimum requirement; "Long" on CT is >=60 mins).

So at least for Rapid (estimated time from 15 minutes up to, but not including, 60 minutes, where estimated time is start time + 40*increment), it seems a reasonable amount of the rating differences is explainable by ACPL differences, as long as you look at an average across all of a player's games rather than the average within a single game.

Unknown said...

You wrote "There has to be ways to predict a player's rating from their gameplay, since after all their play is what dictates the outcome of the game"

Indeed that was exactly the goal of the Kaggle "Finding Elo" contest: https://www.kaggle.com/c/finding-elo

From a single game, we were able to predict a player's rating with a median error of 155 rating points. https://www.kaggle.com/c/finding-elo/leaderboard

While that may sound like a big range, bear in mind that a player's quality of play can vary quite a bit from game to game. That tendency-to-vary means that from any single game, you are going to have a hard time guessing the player's rating, because the particular game you're looking at might be one of their best games today/this week, or one of their worst, and you don't know which.

Patrick said...

Hi Richard,

Thanks a lot for showing the breakdown by time control; I haven't done anything with the time controls yet (I thought that at the very least I should remove bullet games). I hope to be able to look at this soon.

I'm impressed with the R²s you got with your >50-games criterion, nice to see (esp. with rapid games). My hope was to be able to derive a rating based on a limited number of games (like 1-3). And if that's not possible using CPL, is there an advantage to using CPL as a metric rather than game outcomes to determine a player's rating? I guess you could get an estimated rating without playing human opponents... (By the way, now that you mention it, I might look at the lichess databases and see if there's an easy way to get all the games from the same players rather than the games in a given month.)

Patrick said...

Hi David,

Wow, I didn't know there had been a Kaggle competition for this; thank you so much for posting the link--I wouldn't have thought that a median error of 155 points was a big range (I should try to calculate mine, and I bet it wouldn't be 155!). I did think about individual games being something of a one-shot snapshot with a lot of variance.

Is there a way on the Kaggle website to look at how you've achieved that median error (i.e., how you've done your predictions)?

Richard said...

I think it would be difficult to get great 1-3 game snapshots from the pure aCPL value alone. Correlation does seem to increase sharply as you increase the number of games for each player, so averaging rating/aCPL over 2 games is already twice as good as 1, and by 10 games you are at an R^2 of 0.47, which, while not terrible, might still not be enough to get the kind of rating-prediction accuracy you are after.

As David's work on the Kaggle competition shows, it is possible to get reasonably tight ranges from just one game, but, not having seen what they used as inputs, I'm guessing it is more than just simple raw aCPL values. If you want to stick to aCPL, differentiating phases of the game seems to be important for scaling the significance of aCPL, from what I can see of my own data, and I'd guess that being much more clever about context--where obvious recaptures versus subtle positional choices between a group of similar moves are explicitly handled--might start to approach a useful single-game metric.

Getting your hands on some data that allows you to rule out unstable ratings from your calculations is probably important. You might need to analyse a random sample of games yourself too, due to the skew in the lichess data that occurs because game analysis is triggered by player requests (or by cheat-detection sampling decisions, which might also exacerbate the provisional-rating issue, as it would make sense to check new players more often than long-term players). The Kaggle competitors might have some test data that could be useful. There are some older datasets around, but they will not have the accuracy of analysis that the lichess Stockfish analysis has, given the progress engines have made in recent years; then again, that might not be a critical factor for you.

Tapani said...

Have you guys considered looking at not only ACL but also the ACL standard deviation?

My guess is that when a strong player has a high ACL, it stems from a few big mistakes rather than many medium-sized ones (unlike lower-rated players?).

//T

Unknown said...

You should probably factor in the aCPL of opponents during each game. A lot of players like to play tricky openings that aren't necessarily strong on their own merits but induce others to make mistakes. A lot of players like to get into sharp positions, and although they may not find the best moves, they are better than their opponents at avoiding mistakes. Perhaps also analyze move-to-move trends within a game. I suspect there is a lot more correlation of rating with the opponent's aCPL than with one's own.

Sahar Asghar said...

I guess that when one is winning, there is no need to exhaust oneself further trying to play the best moves possible. It is better to enjoy the slow death of the opponent, pursue the win in an uncomplicated way, and save your stamina for the next match. I think what identifies a strong player is that when his opponent makes a good move (= no centipawn loss), he will notice and find a good response (= no centipawn loss either, so he maintains his advantage). If his opponent is making bad moves, he won't really care. So maybe you can deduce the strength of a player from centipawn loss, but it needs to take into account both players.

Sahar Asghar said...

I would add: a jump from +10 to +5 is in practice insignificant (White is winning no matter what), but a change from +2 to +1 is much worse (your advantage is vanishing). So try not to calculate centipawn loss per se, but only centipawn losses when the position is not clearly winning/losing.

MarquisSmith said...

Average centipawn loss is meaningless without knowing the strength of the opponent, so of course it only accounts for a few percent of the variance. When your opponent makes bad moves, good replies are easy to find. When your opponent makes good moves, it is much easier to stumble. Without controlling for opponent rating (and/or rating differential), you will find very little of value.

Case in point: my son, a talented junior, has risen about 200 points in the last 7 months, both in USCF rating and on lichess blitz. On lichess, where almost all his games are against similar-strength players, his centipawn loss has remained tightly centered around 65, but the strength of his opponents has risen along with his own, to about 1775.

John Larkin said...

It may be worth knowing that other sites are quoting this information in a way that suggests you found no relationship, as opposed to a clear relationship that is just not as predictive as you had hoped.

I also appear to be misunderstanding something.
I read the r-value for the correlation as being 0.346. To me, that would give an r-squared approaching 0.12, yet you quote 0.05. Are the terms different from what I assume?

Also, omitting the large drops means that blunders are excluded from the analysis, and lower-rated players are more likely to make them--e.g., playing an opening normally, then blundering once they have to think.

Chris Irwin said...

I wonder if median centipawn loss could have a tighter correlation?