I have had the idea of trying to derive a player's rating "empirically", through their play rather than through their results (as is currently done). As a starting point, I thought that it might be possible to approximate a player's rating by looking at their average centipawn loss. The average centipawn loss (aCPL) is the amount by which a chess engine's evaluation of the position drops, from the mover's perspective, after each of the player's moves. For example, if the score is about +1.20 before White's move, and White makes a small mistake so that the evaluation is +0.30 after the move, then White's centipawn loss for that move is +1.20 - +0.30 = +0.90 (expressed here, as throughout this post, in pawn units, so 90 centipawns); the average centipawn loss is the mean CPL over the course of the game. I thought that perhaps stronger players would have a lower aCPL, such that it would be possible to approximate a player's rating from their aCPL.
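To make the definition concrete, here is a minimal sketch of the computation; it assumes the evaluations are given in pawn units from White's perspective, starting with the initial position:

```python
def acpl(evals):
    """Average centipawn loss for each side, given engine evaluations
    (in pawn units, from White's perspective) after every half-move,
    starting with the evaluation of the initial position.

    Note: lichess-style aCPL also floors each loss at zero; here I keep
    the raw differences, as described in this post.
    """
    white, black = [], []
    for ply, (before, after) in enumerate(zip(evals, evals[1:])):
        if ply % 2 == 0:
            white.append(before - after)  # White moved: a drop hurts White
        else:
            black.append(after - before)  # Black moved: a rise hurts Black

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(white), mean(black)

# The example above: +1.20 before White's move, +0.30 after -> a 0.90 loss.
print(acpl([1.20, 0.30]))  # about (0.9, nan): Black has not moved yet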
The Data
To investigate this question, I downloaded one of the game databases available on lichess.org. These databases include all rated games played on lichess in a given period. For the purposes of this exercise, I used a subset of the August 2014 database. From these games, I kept only the ones for which computer analysis was available, so that the engine's evaluation (here, Stockfish's) is available after each half-move.
I used Python to transform the file of PGNs into a dataset amenable to statistical analysis. In this dataset, each row is a different game (i.e., a different PGN from the database downloaded on lichess).
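The transformation might look something like the sketch below, assuming the python-chess and pandas libraries; the file name is illustrative, and this is a sketch of the step rather than the exact code I ran:

```python
import chess.pgn
import pandas as pd

rows = []
with open("lichess_db_standard_rated_2014-08.pgn") as pgn:  # illustrative name
    while True:
        game = chess.pgn.read_game(pgn)
        if game is None:          # end of file
            break
        row = dict(game.headers)  # PGN tags (WhiteElo, Event, ...) -> columns
        # Keep the raw movetext (with its eval comments) for later processing.
        row["Moves"] = str(game.mainline())
        rows.append(row)

df = pd.DataFrame(rows)
# Keep only games with computer analysis, i.e., "[%eval ...]" annotations.
df = df[df["Moves"].str.contains(r"\[%eval", regex=True)].reset_index(drop=True)
```

The figure below shows the first few columns and rows of the dataset: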
As you can see, all the metadata for each PGN are kept, and the tags (like "WhiteElo", "Event", etc.) are used as column names. I then had to perform operations on the "Moves" column in order to extract the evaluation at each ply.
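One way to do that extraction, assuming lichess-style [%eval 0.17] annotations in the movetext (again a sketch, not necessarily the exact code I used):

```python
import re
import pandas as pd

# Matches numeric evals; mate annotations like "[%eval #3]" are skipped here,
# which is consistent with dropping extreme scores below.
EVAL_RE = re.compile(r"\[%eval ([-+]?\d+(?:\.\d+)?)\]")

def extract_evals(moves):
    """Return the engine evaluation after each half-move."""
    return [float(v) for v in EVAL_RE.findall(moves)]

# Spread the evals into eval_1, eval_2, ... columns, NaN-padded per game.
eval_lists = df["Moves"].apply(extract_evals)
eval_cols = pd.DataFrame(eval_lists.tolist(), index=df.index)
eval_cols.columns = [f"eval_{i + 1}" for i in eval_cols.columns]
df = df.join(eval_cols)
```

The resulting variables (up to 200 half-moves in this particular dataset) look like this: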
Here, I've limited my attention to evaluations ranging from -3 to +3 to avoid statistics that are too influenced by extreme scores; any evaluations greater than +3 or less than -3 have been removed (the results presented below are very similar whether or not the full range is included, so this is a minor detail). From there, I've calculated the difference between each evaluation and the preceding one; half of those differences represent the centipawn loss for the white player, and the other half represent the centipawn loss for the black player. The resulting variables look like this:
Finally, for each game, it is possible to calculate the average centipawn loss for each player by averaging those difference variables across rows.
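Continuing the sketch above, the clipping, differencing, and averaging steps could be written as follows (assuming, as in these data, that eval_1 is the evaluation after White's first move):

```python
import numpy as np

evals = df.filter(regex=r"^eval_\d+$").to_numpy(dtype=float)
evals[np.abs(evals) > 3.0] = np.nan  # drop evaluations outside [-3, +3]

diffs = np.diff(evals, axis=1)       # change in score at each half-move
# eval_1 comes after White's first move, so the first difference reflects
# Black's reply (the starting position itself is not scored, meaning White's
# first move contributes no loss here).
black_loss = diffs[:, 0::2]          # a rising score hurts Black
white_loss = -diffs[:, 1::2]         # a dropping score hurts White

df["white_acpl"] = np.nanmean(white_loss, axis=1)
df["black_acpl"] = np.nanmean(black_loss, axis=1)
```

This yields the two aCPL variables: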
Descriptive Statistics
Now that the data are ready for analysis, I looked at some descriptive statistics for the variables of interest: the white and black ratings and aCPLs. In the final dataset, the average rating for the white players was 1637 (SD = 238), and the average rating for the black players was 1630 (SD = 239). The average aCPLs for the white and black players were also very similar: 0.26 (SD = 0.16) for white and 0.24 (SD = 0.14) for black (again, looking only at positions in which the evaluation was between -3.00 and +3.00).
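These summaries take one line with pandas (the rating tags are stored as strings in the PGN headers, hence the conversion; column names follow the earlier sketches):

```python
import pandas as pd

df[["WhiteElo", "BlackElo"]] = df[["WhiteElo", "BlackElo"]].apply(
    pd.to_numeric, errors="coerce")
print(df[["WhiteElo", "BlackElo", "white_acpl", "black_acpl"]]
      .agg(["mean", "std"]).round(2))
```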
Here's a histogram showing the frequency at which each aCPL is observed in the sample (separated by colors), along with the corresponding kernel density estimation plot:
As can be seen in the second figure, the vast majority of players (with these transformed data) had a relatively low aCPL, below 0.50. This is to be expected: players with much higher aCPLs would quickly lose almost every game.
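For reference, this kind of figure takes only a few lines with seaborn (a sketch using the column names from earlier):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Reshape to one aCPL observation per row, labeled by color.
long = df.melt(value_vars=["white_acpl", "black_acpl"],
               var_name="side", value_name="acpl")
sns.histplot(data=long, x="acpl", hue="side", kde=True)
plt.xlabel("average centipawn loss (pawn units)")
plt.show()
```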
Visually, it's also possible to inspect the relationship between aCPL and rating through a scatterplot:
The horror! Even before jumping into the modeling, my guess is that there is not a strong relationship between aCPL and rating! Observations are relatively tightly distributed on the x axis (aCPL) across the full range of the y axis (player rating). In other words, the same spread of aCPLs is observed at all rating levels in this dataset. Knowing a player's aCPL would not help us much in guessing their rating.
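The scatterplot itself is straightforward with matplotlib (again using the column names from the earlier sketches):

```python
import matplotlib.pyplot as plt

plt.scatter(df["white_acpl"], df["WhiteElo"], s=5, alpha=0.3, label="White")
plt.scatter(df["black_acpl"], df["BlackElo"], s=5, alpha=0.3, label="Black")
plt.xlabel("average centipawn loss (pawn units)")
plt.ylabel("player rating")
plt.legend()
plt.show()
```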
Predicting Rating from aCPL
I tested the relationship between aCPL and player rating more formally through linear regression, predicting the player's rating from their aCPL, separately for white and black players. (It would be possible to include both white and black players simultaneously in the same analysis, but ordinary linear regression would not be appropriate there: the scores of the two players in a game are closely related, so the observations are not independent. A mixed-effects model would be a good choice to accommodate that dependency.)
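A sketch of these regressions with statsmodels' formula interface (column names as before; not necessarily the exact code I ran):

```python
import statsmodels.formula.api as smf

white_fit = smf.ols("WhiteElo ~ white_acpl", data=df).fit()
black_fit = smf.ols("BlackElo ~ black_acpl", data=df).fit()
print(white_fit.summary())  # coefficient, SE, t, p, and R-squared
print(black_fit.summary())

# For the pooled, dependency-aware analysis mentioned above, statsmodels
# also offers mixed-effects models via smf.mixedlm, with the game as the
# grouping factor.
```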
As expected given the graphs above, it is rather difficult to predict a player's rating accurately from their aCPL. Results are slightly different for white and black players, such that it's slightly easier to predict a black player's rating from their aCPL than a white player's (in these data, anyway). For white players, the regression coefficient for aCPL was -346 [SE = 34.6, t(1843) = -10.00, p < .001, R² = .05, adj. R² = .05], meaning that we would expect two white players who differ by 1 point in aCPL to differ by 346 points in rating. The corresponding coefficient for black players was -442 [SE = 37.3, t(1844) = -11.83, p < .001, R² = .07, adj. R² = .07], meaning that we would expect two black players who differ by 1 point in aCPL to differ by 442 points in rating.
Unfortunately, these predictions are not very useful, as indicated by the low R²: only 5-7% of the variation in ratings can be accounted for by variation in aCPL, leaving the remaining 93-95% unaccounted for. I also tried including higher-order polynomials of aCPL as predictors (2nd and 3rd degree), but as you might have guessed from the scatterplot above, this was not helpful; there simply does not seem to be much of a relationship between aCPL and rating, of any form (see the sketch below).
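The polynomial check, in the same formula interface (I() tells patsy to treat ** as arithmetic rather than formula syntax; shown for white players only):

```python
poly_fit = smf.ols(
    "WhiteElo ~ white_acpl + I(white_acpl ** 2) + I(white_acpl ** 3)",
    data=df).fit()
print(poly_fit.rsquared)  # barely improves on the linear fit in these data
```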
So my overall conclusion, stated somewhat crudely: I would not use this model to predict a player's rating after having observed one of their games and analyzed it with Stockfish. The search for a better method continues.
What's Next?
There have to be ways to predict a player's rating from their gameplay; after all, their play is what dictates the outcome of the game. The newer lichess databases provide the players' clock times during the game, so it would be possible to examine whether the time taken on each move has an effect on the player's rating (or on the outcome of the game). As a first step I also combined information from all moves into one summary statistic, the aCPL, but incorporating move-level information might yield better results; we shall see in a future post. Finally, this analysis included all games regardless of their time control, but perhaps it would be best to exclude shorter time controls (e.g., bullet games), as well as games involving players with provisional ratings; a possible filter is sketched below.
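For concreteness, excluding bullet games might look like this, assuming lichess's "base+increment" TimeControl tag (e.g., "60+0"); the 180-second cutoff is my own arbitrary choice:

```python
import pandas as pd

def base_seconds(tc):
    """Base time in seconds from a 'base+increment' tag; None if unparsable."""
    try:
        return int(str(tc).split("+")[0])
    except ValueError:
        return None  # e.g., correspondence games use "-"

base = pd.to_numeric(df["TimeControl"].map(base_seconds))
df_no_bullet = df[base >= 180]  # NaN comparisons are False, so those drop out
```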