
Deep Dives with Lichess I: Rating Transitions

Chess has, in one way or another, accompanied humanity for almost 1500 years. The game, with its simple set of rules that give rise to an unfathomably large set of possible games, has attracted the minds of researchers since at least the early 20th century.

IAMECON researchers are passionate chess players who regularly host and participate in internal and external tournaments. Our curiosity about strategic decision-making in the sequential battle of chess, combined with our expertise in analyzing big data and game-theoretic interactions, led us to start a line of research and a series of white papers analyzing various aspects of the rich chess data.

One publicly available source of chess games is Lichess. The full Lichess database is a rich panel of repeated games recorded at the individual move level from January 2013 to June 2021. It contains 2,366,271,308 (about 2.37 billion) games and 8,910,525 unique players, stored in a total of 4.4 terabytes of PGN (Portable Game Notation) files.

The enormous number of games and the PGN format of the data make analyzing chess data a nontrivial exercise. Although PGN files are easy for humans to read and suitable for small collections, managing a large number of games is challenging because the format is difficult to parse.[1] We decompressed the data files and converted them into a friendlier format. More specifically, we worked through the colossal set of games stored in the PGN files by implementing a fast PGN parser that ignores certain information in the data. In this pilot analysis, in particular, we discarded a substantial part of the game data: the players’ moves. The remaining game-level data was then converted into the Apache Parquet format, a columnar data format commonly used in big data tools such as Apache Spark or Dask.
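To make the idea of a parser that skips the moves concrete, here is a minimal sketch in Python. It is not the parser used for this analysis; it only illustrates reading the bracketed header tags of each game and ignoring the move text. The sample games and player names are made up, and the function name `iter_game_headers` is our own.

```python
import io
import re

# A PGN header tag looks like: [White "alice"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def iter_game_headers(stream):
    """Yield one dict of header tags per game, ignoring the move text."""
    headers = {}
    in_headers = False
    for line in stream:
        match = TAG_RE.match(line)
        if match:
            # A tag line that follows move text starts a new game.
            if not in_headers and headers:
                yield headers
                headers = {}
            in_headers = True
            headers[match.group(1)] = match.group(2)
        else:
            # Blank lines and move text are skipped entirely.
            in_headers = False
    if headers:
        yield headers


# Hypothetical two-game sample in the style of a Lichess export.
SAMPLE = """[Event "Rated Blitz game"]
[White "alice"]
[Black "bob"]
[Result "1-0"]
[WhiteElo "1510"]
[BlackElo "1490"]
[TimeControl "300+3"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0

[Event "Rated Bullet game"]
[White "carol"]
[Black "alice"]
[Result "0-1"]
[WhiteElo "1820"]
[BlackElo "1515"]
[TimeControl "60+0"]

1. d4 d5 2. c4 e6 0-1
"""

rows = list(iter_game_headers(io.StringIO(SAMPLE)))
for row in rows:
    print(row["White"], row["Black"], row["Result"], row["TimeControl"])
```

A list of dictionaries like `rows` can then be handed to a tabular library for storage; with pandas and a Parquet engine installed, for example, `pandas.DataFrame(rows).to_parquet("games.parquet")` would write it out (the file name is a placeholder).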

In our first analysis of the chess data, we investigated how players’ ratings change from one month to the next across the four time controls available on Lichess: Bullet, Blitz, Rapid, and Classical. Time controls limit the amount of time each player has for the entire game. Although different chess platforms may define these time controls slightly differently, Lichess uses the following criteria:

Time Control    Estimated Duration (seconds)
Bullet          < 179
Blitz           < 479
Rapid           < 1499
Classical       ≥ 1500
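
As an illustration of how a game could be assigned to one of these categories, the sketch below classifies a game from its clock setting. It makes two assumptions that are not stated above: the clock is given as an "initial+increment" string in seconds (for example "300+3"), and the estimated duration is the initial time plus 40 times the increment, which is how Lichess describes its estimate. The limits in the table are read as inclusive upper bounds. The function names are our own.

```python
def estimated_duration(time_control, moves=40):
    """Estimated game duration in seconds from an 'initial+increment' clock.

    Assumption: duration = initial time + 40 * increment.
    Returns None when the clock is not in that form (e.g. "-").
    """
    if "+" not in time_control:
        return None
    initial, increment = time_control.split("+")
    return int(initial) + moves * int(increment)


def classify(seconds):
    """Map an estimated duration to a category from the table above.

    Assumption: the table's limits are inclusive upper bounds.
    """
    if seconds is None:
        return None
    if seconds <= 179:
        return "Bullet"
    if seconds <= 479:
        return "Blitz"
    if seconds <= 1499:
        return "Rapid"
    return "Classical"


for clock in ["60+0", "300+3", "600+0", "1800+0"]:
    print(clock, classify(estimated_duration(clock)))
```

For example, a "300+3" game has an estimated duration of 300 + 40 × 3 = 420 seconds and therefore counts as Blitz.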

Lichess did not differentiate between Rapid and Classical in its early days, and the database does not make the distinction either. Hence, in our analysis, Rapid and Classical games are combined.

To analyze how quickly players improve, as represented by how quickly their Elo rating moves up or down over time, we created three Markov transition matrices, one per time control (Bullet, Blitz, and the combined Rapid and Classical), that show the transitions between rating brackets from any given month to the next. The Markov transition matrices summarize the movements of players’ skill levels over time and across discrete brackets (e.g., the bracket labeled 1400 covers ratings from 1400 to 1599). The rows and columns of the matrix represent the rating brackets, and each entry denotes the probability of going from the row bracket in month t to the column bracket in month t + 1. The “N/A” entry means that a given player was not present in month t (row) or in month t + 1 (column).
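
The construction of such a matrix can be sketched as follows. This is an illustrative toy version, not the code behind our matrices. It assumes that each player has at most one representative rating per month, that brackets are 200 points wide (as in the 1400 to 1599 example), and that player-months in which the player is absent in both t and t + 1 are not counted. The ratings and player names are invented.

```python
from collections import Counter, defaultdict

NA = "N/A"


def bracket(rating, width=200):
    """Lower edge of the rating bracket, e.g. 1400..1599 -> 1400."""
    if rating is None:
        return NA
    return (rating // width) * width


def transition_matrix(ratings_by_player, months):
    """Row-normalized transition frequencies between consecutive months."""
    counts = defaultdict(Counter)
    for ratings in ratings_by_player.values():
        for t, t_next in zip(months, months[1:]):
            row = bracket(ratings.get(t))
            col = bracket(ratings.get(t_next))
            if row == NA and col == NA:
                continue  # assumption: absent in both months is not counted
            counts[row][col] += 1
    # Divide each row by its total so the entries are probabilities.
    return {
        row: {col: n / sum(cols.values()) for col, n in cols.items()}
        for row, cols in counts.items()
    }


# Hypothetical monthly ratings for three players.
ratings = {
    "alice": {"2021-01": 1450, "2021-02": 1580, "2021-03": 1610},
    "bob": {"2021-01": 1420, "2021-02": 1615},
    "carol": {"2021-02": 1990, "2021-03": 1985},
}
months = ["2021-01", "2021-02", "2021-03"]

P = transition_matrix(ratings, months)
for row, cols in P.items():
    print(row, cols)
```

Each row of `P` sums to one, so `P[1400][1600]` reads as the estimated probability that a player in the 1400 bracket in one month is in the 1600 bracket in the next.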

We see that the behavior across the time controls is quite similar. Ignoring players who were not present in month t or month t + 1 (the “N/A” row and column, respectively), it is clear that the vast majority of players stay within their rating bracket from one month to the next. The matrices show that moving to a higher bracket is more likely for lower-rated players than for higher-rated ones. Moreover, the matrices also show a curious, almost “reverse” phenomenon: higher-rated players are more likely than lower-rated players to move down a bracket from one month to the next.

This pilot analysis was our first go at analyzing Lichess’ data. In a future post, we will discuss ranking systems and explore some nuances regarding the pairwise ranking of players. Stay tuned for more!

  1. “PGN is the de-facto standard for chess games, especially when it comes to interoperability. Unfortunately, PGN is somewhat misdesigned. Apparently, it’s not just me who thinks so. PGN is designed to make it easy for humans to read PGN files, and edit or write them manually with a text editor. At a cost, namely that it’s difficult to parse it with computers.” –https://buildingjerry.wordpress.com/2020/02/29/towards-fast-pgn-parsing/