{anianr,gdelt}@google.com

Anian Ruoss* · Grégoire Delétang* · Sourabh Medapati · Jordi Grau-Moya · Li Kevin Wenliang · Elliot Catt · John Reid · Tim Genewein

*Equal contributions. All authors: Google DeepMind.

###### Abstract

The recent breakthrough successes in machine learning are mainly attributed to scale: namely large-scale attention-based architectures and datasets of unprecedented scale. This paper investigates the impact of training at scale for chess. Unlike traditional chess engines that rely on complex heuristics, explicit search, or a combination of both, we train a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games. We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search algorithms. We also show that our model outperforms AlphaZero’s policy and value networks (without MCTS) and GPT-3.5-turbo-instruct. A systematic investigation of model and dataset size shows that strong chess performance only arises at sufficient scale. To validate our results, we perform an extensive series of ablations of design choices and hyperparameters.

## 1 Introduction

One of the most iconic successes of AI is IBM’s Deep Blue (Campbell et al., 2002) defeating the world chess champion Garry Kasparov in 1997. This was widely seen as the first major demonstration that machines are capable of out-competing humans in intellectual domains that require sophisticated rational reasoning and strategic planning—feats of intelligence that were long believed to be exclusive to humans. Deep Blue was an expert system that combined an extensive database of chess knowledge and heuristics with a strong tree search algorithm (alpha-beta pruning). Almost all modern and much stronger chess engines follow a similar recipe, with Stockfish 16 currently being the world’s strongest (publicly available) engine. Notable exceptions are DeepMind’s AlphaZero (Silver et al., 2017), which uses search and self-taught heuristics but no human chess knowledge, and its open-source replication Leela Chess Zero, which currently often comes in as a close second in computer chess competitions (Haworth and Hernandez, 2021).

Recent breakthroughs in scaling up AI systems have resulted in dramatic progress in cognitive domains that remained challenging for earlier-generation systems like Deep Blue. This progress has been driven by general-purpose techniques, in particular (self-)supervised training on expert data with attention-based architectures (Vaswani et al., 2017) applied at scale, resulting in the development of LLMs with impressive and unexpected cognitive abilities like OpenAI’s GPT series (Brown et al., 2020; OpenAI, 2023), the LLaMA family of models (Touvron et al., 2023a, b), or Google DeepMind’s Chinchilla (Hoffmann et al., 2022) and Gemini (Anil et al., 2023). However, it is unclear whether the same technique would work in a domain like chess, where successful policies typically rely on sophisticated algorithmic reasoning (search, dynamic programming) and complex heuristics. Thus, the main question of this paper is: *Is it possible to use supervised learning to obtain a chess policy that generalizes well and thus leads to strong play without explicit search?*

To study this question we apply the success recipe of general supervised training at scale to chess (see Figure 1). We use a standard attention-based architecture and a standard supervised training protocol to learn to predict action-values (corresponding to win percentages) for chess boards. The strength of the resulting chess policy thus depends entirely on the strength of the underlying action-value predictor. To get a large corpus of “ground-truth” action-values we use Stockfish 16 as an oracle to annotate millions of board states obtained from randomly drawn games on lichess.org, which are mostly played by humans of significantly varying playing strength. As we will show, this leads to a strong, grandmaster-level chess policy (Lichess blitz Elo 2895 against humans), driven by a modern transformer that predicts action-values *without any explicit search*. This policy outperforms GPT-3.5-turbo-instruct (and, therefore, GPT-4 (Carlini, 2023)) and AlphaZero’s policy and value networks, which reach Elo ratings of 1755, 1620, and 1853, respectively. Therefore, our work shows that it is possible to distill a good approximation of Stockfish 16 into a feed-forward neural network via standard supervised learning at sufficient scale—akin to the quote famously attributed to José Raúl Capablanca, world chess champion from 1921 to 1927: *“I see only one move ahead, but it is always the correct one.”*

##### We make the following main contributions:

- •
We distill an approximation of Stockfish 16 into a neural predictor that generalizes well to novel board states.

- •
We construct a policy from our neural predictor and show that it plays chess at grandmaster level (Lichess blitz Elo 2895) against humans and successfully solves many challenging chess puzzles (up to Elo 2800). To the best of our knowledge, this is currently the strongest chess engine without explicit search.

- •
We perform ablations of the model size and dataset size, showing that robust generalization and strong chess play only arise at sufficient scale.

## 2 Methods

We now provide details on the dataset creation, the predictors and policies, and the evaluation (see Figure 1).

### 2.1 Data

To construct a dataset for supervised training we download 10 million games from Lichess (lichess.org) from February 2023. We extract all board states $s$ from these games and estimate the state-value $V^{\text{SF}}(s)$ for each state with Stockfish 16 using a time limit of 50ms per board (unbounded depth and maximum skill level). The value of a state is the win percentage estimated by Stockfish, lying between $0\%$ and $100\%$ (Stockfish returns a score in centipawns that we convert to the win percentage with the standard formula ${\textrm{win}\%=50\%\cdot 2/(1+\exp(-0.00368208\cdot\textrm{centipawns}))}$ from https://lichess.org/page/accuracy). We also use Stockfish to estimate action-values $Q^{\text{SF}}(s,a)$ for all legal actions ${a\in\mathcal{A}_{\text{legal}}(s)}$ in each state. Here we use a time limit of 50ms per state-action pair (unbounded depth and maximum skill level), which corresponds to an oracle Lichess blitz Elo of 2713 (see Section 3.1). The action-values (win percentages) also determine the oracle best action $a^{\text{SF}}$:

$a^{\text{SF}}(s)=\underset{a\in\mathcal{A}_{\text{legal}}(s)}{\arg\max}\,Q^{\text{SF}}(s,a).$

We rarely get time-outs when computing action-values via Stockfish; when we do, we cannot determine the best action for a board state and drop the corresponding record from the behavioral cloning training set (see Table A1). Since we train on individual boards and not whole games, we randomly shuffle the dataset after annotation.
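The centipawn-to-win-percentage conversion quoted above can be sketched in a few lines. This is an illustrative helper using the constant from the Lichess accuracy page, not the paper's actual annotation code:

```python
import math

def centipawns_to_win_percentage(cp: float) -> float:
    """Convert a Stockfish centipawn score to a win percentage in [0, 100],
    via win% = 50% * 2 / (1 + exp(-0.00368208 * centipawns))."""
    return 50.0 * 2.0 / (1.0 + math.exp(-0.00368208 * cp))

# A score of 0 centipawns corresponds to an even (50%) position.
print(centipawns_to_win_percentage(0))  # 50.0
```

The function is monotonically increasing and saturates at 0% and 100% for extreme scores, which is what makes the uniform value binning described below well defined.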

For our largest training dataset, based on $10$M games, this results in $15.32$B action-value estimates (or $\approx 530$M state-value estimates and best oracle actions) to train on. To create test datasets we follow the same annotation procedure, but on $1$k games downloaded from a different month (March 2023, $\approx 1.8$M action-value estimates, $\approx 60$k state-value estimates and best oracle actions). Since there is only a small number of early-game board states and players often play popular openings, this i.i.d. test set contains $14.7\%$ of boards that are also in the training set. We do not remove them, as doing so would introduce distributional shift and skew test-set metrics. Finally, we also create a puzzle test set, following the procedure in Carlini (2023), consisting of $10$k challenging board states that come with a correct sequence of moves to solve the puzzle, which we compare against in our puzzle set accuracy evaluation. Only $1.33\%$ of the puzzle set boards appear in the training set (i.e., the initial board states, not complete solution sequences). Since evaluation of puzzle solutions is slow, we use a subset of $1$k puzzles in some of our evaluations ($1.4\%$ overlap with training set).

##### Value binning

The predictors we train are discrete discriminators (classifiers), therefore we convert win percentages (the ground-truth state- or action-values) into discrete “classes” via binning: we divide the interval between $0\%$ and $100\%$ uniformly into $K$ bins (non-overlapping sub-intervals) and assign a one-hot code to each bin ${z_{i}\in\{z_{0},\ldots,z_{K-1}\}}$. If not mentioned otherwise, ${K=128}$. For our behavioral cloning experiments we train to predict oracle actions directly, which are already discrete. We perform ablations for the number of bins in Section 3.4.
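A minimal sketch of this uniform binning (the exact edge handling in the paper's pipeline is an assumption on our part):

```python
def bin_index(win_percentage: float, K: int = 128) -> int:
    """Map a win percentage in [0, 100] to one of K uniform bins (0-indexed).
    A value of exactly 100% falls into the last bin."""
    i = int(win_percentage / 100.0 * K)
    return min(i, K - 1)

def one_hot(i: int, K: int = 128) -> list[int]:
    """One-hot code for bin i, used as the classification target."""
    code = [0] * K
    code[i] = 1
    return code
```

With $K=128$, a drawn position (50%) lands in bin 64 and a winning position near 100% lands in bin 127.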

### 2.2 Model

For all our predictors we use a modern decoder-only transformer backbone (Vaswani et al., 2017; Touvron et al., 2023a, b) to parameterize a discrete probability distribution by normalizing the transformer’s outputs with a $\log$-softmax layer. The model thus outputs $\log$ probabilities. The context size is $79$ for action-value prediction, and $78$ for state-value prediction and behavioral cloning (see ‘Tokenization’ below). The output size is $K$ (the number of bins) for action- and state-value prediction and $1968$ (the number of all possible legal actions) for behavioral cloning. We use learned positional encodings (Gehring et al., 2017) as the length of the input sequences is constant. Our largest model has roughly $270$ million parameters. We provide all details for the model-size ablations in Section 3.3.
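The final $\log$-softmax normalization mentioned above can be sketched as follows (a generic numerically stable version, not the paper's actual implementation):

```python
import math

def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax:
    log p_i = x_i - max(x) - log(sum_j exp(x_j - max(x)))."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]
```

Exponentiating the outputs recovers a proper probability distribution over the $K$ bins (or the 1968 actions), which is what the cross-entropy training loss below operates on.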

##### Tokenization

Board states $s$ are encoded as FEN strings which we convert to fixed-length strings of $77$ characters where the ASCII-code of each character is one token. A FEN string is a description of all pieces on the board, whose turn it is, the castling availability for both players, a potential en passant target, a half-move clock and a full-move counter. We essentially take any variable-length field in the FEN string and convert it into a fixed-length sub-string by padding with ‘.’ if needed. We never flip the board; the FEN string always starts at rank 1, even when it is black’s turn. We store the actions in UCI notation (e.g., ‘e2e4’ for the well-known white opening move). To tokenize them we determine all possible legal actions across games (there are $1968$), sort them alphanumerically (case-sensitive), and take the action’s index as the token, meaning actions are always described by a single token (all details in Section A.1).
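The two tokenization steps can be sketched as follows. Note the simplifications: we right-pad the whole FEN string rather than padding each variable-length field individually as the paper describes, and we use a toy three-move vocabulary in place of the full 1968-action one:

```python
def tokenize_fen(fen: str, length: int = 77) -> list[int]:
    """Pad a FEN string with '.' to a fixed length and use each character's
    ASCII code as a token (simplified: whole-string padding, not per-field)."""
    padded = fen.ljust(length, ".")[:length]
    return [ord(c) for c in padded]

def action_token(uci_move: str, all_actions: list[str]) -> int:
    """A move's token is its index in the case-sensitive alphanumeric sort
    of all possible UCI actions, so every action is a single token."""
    return sorted(all_actions).index(uci_move)

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
tokens = tokenize_fen(start)          # 77 integer tokens
idx = action_token("e2e4", ["e2e4", "d2d4", "g1f3"])  # toy vocabulary
```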

##### Training protocol

Predictors are trained by minimizing cross-entropy loss (i.e., $\log$-loss) via mini-batch based stochastic gradient descent using Adam (Kingma and Ba, 2015). We train for $10$ million steps, which corresponds to $2.67$ epochs for a batch size of $4096$ with $15.32$B data points (cf. Table A1). The target labels are either bin-indices in the case of state- or action-value prediction (see Section 2.1) or action indices for behavioral cloning; we use a one-hot encoding in all cases (details in Sections A.2 and A.3).
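The quoted epoch count follows directly from the step count, batch size, and dataset size:

```python
steps = 10_000_000
batch_size = 4096
data_points = 15.32e9  # action-value estimates in the largest training set

epochs = steps * batch_size / data_points
print(round(epochs, 2))  # 2.67
```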

### 2.3 Predictors and Policies

Our predictors are discrete distributions parameterized by neural networks$P_{\theta}(z|x)$ that take a tokenized input$x$ and output a predictive distribution over discrete labels ${\{z_{0},\ldots,z_{K}\}}$. Depending on the prediction-target we distinguish between three tasks (see Figure1 for an overview).

##### (AV) Action-value prediction

The target label is the bin$z_{i}$ into which the ground-truth action-value estimate${Q^{\text{SF}}(s,a)}$ falls. The input to the predictor is the concatenation of tokenized state and action. The loss for a single data point is:

$-\log P^{\text{AV}}_{\theta}(z_{i}\,|\,s,a)\quad\text{with}\quad z_{i}:=\text{bin}_{K}(Q^{\text{SF}}(s,a)),$ (1)

where $K$ is the number of bins and $\text{bin}_{K}(x)$ is a function that computes the (one-hot) bin-index of value$x$. To use the predictor in a policy, we evaluate the predictor for all legal actions in the current state and pick the action with maximal expected action-value:

$\hat{a}^{\text{AV}}(s)=\underset{a\in\mathcal{A}_{\text{legal}}}{\arg\max}\ \underbrace{\mathbb{E}_{P^{\text{AV}}_{\theta}(z|s,a)}[z]}_{\hat{Q}_{\theta}(s,a)}.$
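The action-value policy above can be sketched as follows. Using bin centers as the representative value of each bin is our assumption; the model's logits per action stand in for the transformer's output:

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    s = sum(e)
    return [v / s for v in e]

def expected_action_value(logits, K):
    """E[z] under the predicted bin distribution, with bin centers
    (i + 0.5)/K in (0, 1) as representative win probabilities."""
    centers = [(i + 0.5) / K for i in range(K)]
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))

def av_policy(legal_actions, logits_per_action, K):
    """Evaluate the predictor for every legal action and pick the
    arg max of the expected action-value."""
    q = {a: expected_action_value(logits_per_action[a], K) for a in legal_actions}
    return max(q, key=q.get)
```

For example, an action whose predicted mass sits in the top bin beats one whose mass sits in the bottom bin.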

##### (SV) State-value prediction

The target label is the bin$z_{i}$ that the ground-truth state-value$V^{\text{SF}}(s)$ falls into. The input to the predictor is the tokenized state. The loss for a single data point is:

$-\log P^{\text{SV}}_{\theta}(z_{i}\,|\,s)\quad\text{with}\quad z_{i}:=\text{bin}_{K}(V^{\text{SF}}(s)).$ (2)

To use the state-value predictor as a policy, we evaluate the predictor for all states${s^{\prime}=T(s,a)}$ that are reachable via legal actions from the current state (where $T(s,a)$ is the deterministic transition of taking action$a$ in state$s$). Since $s^{\prime}$ implies that it is now the opponent’s turn, the policy picks the action that leads to the state with the worst expected value for the opponent:

$\hat{a}^{\text{SV}}(s)=\underset{a\in\mathcal{A}_{\text{legal}}}{\arg\min}\ \underbrace{\mathbb{E}_{P^{\text{SV}}_{\theta}(z|s^{\prime})}[z]}_{\hat{V}_{\theta}(s^{\prime})}.$
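The state-value policy is a one-step minimization over successor states; in this sketch, `transition` and `v_hat` are stand-ins for $T(s,a)$ and the predictor's expected value $\hat{V}_{\theta}$:

```python
def sv_policy(state, legal_actions, transition, v_hat):
    """Pick the action leading to the successor state with the worst
    expected value for the opponent (arg min, since s' is the
    opponent's turn)."""
    return min(legal_actions, key=lambda a: v_hat(transition(state, a)))
```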

##### (BC) Behavioral cloning

The target label is the (one-hot) action-index of the ground-truth action$a^{\text{SF}}(s)$ within the set of all possible actions (see ‘Tokenization’ in Section2.2). The input to the predictor is the tokenized state, which leads to the loss for a single data point:

$-\log P^{\text{BC}}_{\theta}(a^{\text{SF}}(s)\,|\,s).$ (3)

This straightforwardly gives a policy that picks the highest-probability action:

$\hat{a}^{\text{BC}}(s)=\underset{a\in\mathcal{A}_{\text{legal}}}{\arg\max}\ P^{\text{BC}}_{\theta}(a|s).$

### 2.4 Evaluation

We use the following evaluation metrics to compare our models against each other and/or measure training progress.The first two metrics evaluate the predictors only; the second two evaluate the policies constructed from our predictors.

##### Action-accuracy

The test set percentage where the predictor policy picks the ground-truth best action: ${\hat{a}(s)=a^{\text{SF}}(s)}$.

##### Action-ranking (Kendall’s $\tau$)

The average Kendall rank correlation (a standard statistical test) across the test set, quantifying the correlation of the predicted action ranking with the ground-truth ranking by Stockfish in each state, ranging from $-1$ (exact inverse order) to $1$ (exact same order), with $0$ indicating no correlation. The predictor ranking is given by $\hat{Q}_{\theta}(s,a)$, $-\hat{V}_{\theta}(T(s,a))$, and $P^{\text{BC}}_{\theta}(a|s)$, respectively, for all legal actions. The ground-truth ranking is given by Stockfish’s action-values $Q^{\text{SF}}(s,a)$ for all legal actions.
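For reference, Kendall's $\tau$ over two score lists can be computed as the normalized difference of concordant and discordant pairs. A simple $O(n^2)$ version without tie handling (real evaluations would typically use a library implementation such as `scipy.stats.kendalltau`):

```python
def kendall_tau(x, y):
    """Kendall rank correlation between two score lists over the same
    set of actions: (concordant - discordant) / (n choose 2)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```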

##### Puzzle-accuracy

We evaluate our policies on their capability of solving puzzles from a collection of Lichess puzzles that are rated by Elo difficulty from $399$ to $2867$, calculated by Lichess based on how often each puzzle has been solved correctly. We use *puzzle-accuracy* as the percentage of puzzles where the policy’s action-sequence exactly matches the known solution action-sequence. For our main puzzle result in Section 3.2 we use $10$k puzzles to report puzzle-accuracy, otherwise we use the first $1$k puzzles to speed up evaluation.

Table 1: Playing strength and prediction metrics of our models and the baselines.

| Agent | Search | Input | Tournament Elo | Lichess Elo vs. Bots | Lichess Elo vs. Humans | Puzzle Acc. (%) | Action Acc. (%) | Kendall’s $\tau$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9M Transformer (ours) | – | FEN | 2007 ($\pm 15$) | 2054 | – | 85.5 | 64.2 | 0.269 |
| 136M Transformer (ours) | – | FEN | 2224 ($\pm 14$) | 2156 | – | 92.1 | 68.5 | 0.295 |
| 270M Transformer (ours) | – | FEN | 2299 ($\pm 14$) | 2299 | 2895 | 93.5 | 69.4 | 0.300 |
| GPT-3.5-turbo-instruct | – | PGN | – | 1755 | – | 66.5 | – | – |
| AlphaZero (policy net only) | – | PGN | 1620 ($\pm 22$) | – | – | 61.0 | – | – |
| AlphaZero (value net only) | – | PGN | 1853 ($\pm 16$) | – | – | 82.1 | – | – |
| AlphaZero (400 MCTS simulations) | ✓ | PGN | 2502 ($\pm 15$) | – | – | 95.8 | – | – |
| Stockfish 16 (0.05s) [oracle] | ✓ | FEN | 2706 ($\pm 20$) | 2713 | – | 99.1 | 100.0 | 1.000 |

##### Game playing strength (Elo)

We evaluate the playing strength (measured as an Elo rating) of the predictor policies in two different ways: (i) we play Blitz games on Lichess against either only humans or only bots, and (ii) we run an internal tournament between all the agents from Table 1 except for GPT-3.5-turbo-instruct. We play 400 games per pair of agents, yielding 8400 games in total, and compute Elo ratings with BayesElo (Coulom, 2008), with the default confidence parameter of $0.5$. We anchor the relative BayesElo values to the Lichess Elo vs. bots of our 270M model. For the game-playing strength evaluations only (i.e., not for determining the puzzle accuracy) we use a softmax policy for the first 5 full-moves, instead of the $\operatorname*{arg\,max}$ policy described earlier, with a low temperature of $0.005$ for the value or action-value functions, $0.05$ for the action functions (like the policy network of AlphaZero), and $0.5$ for the visit counts used in the full version of AlphaZero. This renders the policies stochastic, both to create variety in games and to prevent simple exploits via repeated play.
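The temperature-based sampling described above can be sketched as follows; the example values are hypothetical predicted win probabilities, not the paper's data:

```python
import math

def softmax_with_temperature(values, temperature):
    """Convert predicted values into sampling probabilities; low
    temperatures concentrate most of the mass on the arg max."""
    scaled = [v / temperature for v in values]
    m = max(scaled)
    e = [math.exp(x - m) for x in scaled]
    s = sum(e)
    return [v / s for v in e]

# With temperature 0.005 (as used for the value functions), even small
# value gaps strongly favor the best action.
probs = softmax_with_temperature([0.51, 0.50, 0.49], temperature=0.005)
```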

### 2.5 Baselines

We compare the performance of our models against Stockfish 16 (with a time limit of 0.05s per legal move, i.e., the oracle used to generate our dataset), three variants of AlphaZero (Silver et al., 2017): (i) the original with 400 MCTS simulations, (ii) only the policy network, and (iii) only the value network (where (ii) and (iii) perform no additional search), and GPT-3.5-turbo-instruct from Carlini (2023). AlphaZero’s networks have $27.6$M parameters and are trained on $44$M games (details in Schrittwieser et al. (2020)). Note that these baselines have access to the whole game history (via the PGN), in contrast to our models that only observe the current game state (which contains very limited historical information via the FEN). This helps the baseline policies, for instance, to easily deal with threefold repetition (games are drawn if the same board state appears three times throughout the game), which requires a workaround for us (described in Section 5). Moreover, GPT-3.5-turbo-instruct also requires whole games encoded via PGN to reduce hallucinations according to Carlini (2023), who also finds that GPT-4 struggles to play full games without making illegal moves, so we do not compare against GPT-4.

## 3 Results

Here we present our comprehensive experimental evaluation. For all parameters not explicitly mentioned we use the same setting across our two main experiments (Section 3.1, Section 3.2); for investigating scaling behavior and all ablations in Section 3.3 and Section 3.4 we use a different set of default settings (geared towards getting representative results with better computational efficiency). We provide all details in Section A.2 and Section A.3, respectively.

### 3.1 Main Result

In Table 1 we show the playing strength (internal tournament Elo, external Lichess Elo, and puzzle solving competence) and predictor metrics of our large-scale transformer models when trained on the full ($10$M games) training set. Our main evaluation compares three transformer models with $9$M, $136$M, and $270$M parameters after training (none of them overfit the training set, as shown in Section B.1). The results show that all three models exhibit non-trivial generalization to novel boards and can successfully solve a large fraction of puzzles. Across all metrics, having larger models consistently improves scores, confirming that model scale matters for strong chess performance. Our largest model achieves a blitz Elo of 2895 against human players, which places it into grandmaster territory. However, the Elo drops when playing against bots on Lichess, which may be a result of having a significantly different player pool, some minor technical issues, and perhaps a qualitative difference in how bots exploit weaknesses compared to humans (see Section 5 for a detailed discussion of these issues).

### 3.2 Puzzles

In Figure 2 we compare the puzzle performance of our 270M parameter model against Stockfish 16 (time limit of 50ms per move), GPT-3.5-turbo-instruct, and AlphaZero’s value network. We use our large puzzle set of $10$k puzzles, grouped by their assigned Elo difficulty from Lichess. Stockfish 16 performs the best across all difficulty categories, followed by our 270M model. AlphaZero’s value network (trained on $44$M games) and GPT-3.5-turbo-instruct achieve non-trivial puzzle performance, but significantly lag behind our model. We emphasize that solving the puzzles requires a correct move *sequence*, and since our policy cannot explicitly plan ahead, solving the puzzle sequences relies entirely on having good value estimates that can be used greedily.

### 3.3 Scaling “Laws”

Figure 3 shows our scaling analysis over the dataset and model size. We visualize the puzzle accuracy (training and test loss in Figure A4), which correlates well with the other metrics and the overall playing strength. For small training set size ($10$k games, left panel) larger architectures ($\geq 7$M) start to overfit as training progresses. This effect disappears as the dataset size is increased to $100$k (middle panel) and $1$M games (right panel). The results also show that the final accuracy of a model increases as the dataset size is increased (consistently across model sizes). Similarly, we observe the general trend of increased architecture size leading to increased overall performance regardless of dataset size (as also shown in our main result in Section 3.1).

### 3.4 Variants and Ablations

Table 2: Ablation results (all with the $9$M parameter model).

| Ablation | Parameter | Puzzle Acc. (%) | Action Acc. (%) | Kendall’s $\tau$ |
| --- | --- | --- | --- | --- |
| Predictor-target | AV | 83.3 | 63.0 | 0.259 |
| | SV | 77.5 | 58.5 | 0.215 |
| | BC | 65.7 | 56.7 | 0.116 |
| Network depth | 2 | 62.3 | 54.4 | 0.219 |
| | 4 | 76.2 | 59.9 | 0.242 |
| | 8 | 81.3 | 62.3 | 0.254 |
| | 16 | 80.4 | 62.3 | 0.255 |
| Data sampler | Uniform | 83.3 | 63.0 | 0.259 |
| | Weighted | 49.9 | 48.2 | 0.192 |
| Value bins | 16 | 83.0 | 61.4 | 0.248 |
| | 32 | 83.0 | 63.2 | 0.261 |
| | 64 | 84.4 | 63.1 | 0.259 |
| | 128 | 83.8 | 63.4 | 0.262 |
| | 256 | 83.7 | 63.0 | 0.260 |
| Loss function | $\log$ (class.) | 81.3 | 62.3 | 0.254 |
| | L2 (regr.) | 82.6 | 58.9 | 0.235 |
| Stockfish limit [s] | 0.05 | 84.0 | 62.2 | 0.256 |
| | 0.1 | 85.4 | 62.5 | 0.254 |
| | 0.2 | 84.3 | 62.6 | 0.259 |
| | 0.5 | 83.3 | 63.0 | 0.259 |

We test a series of experimental variants and perform extensive ablations using the $9$M parameter model. The results and conclusions drawn are used to inform and justify our design choices and determine default model-, data-, and training-configurations. Table 2 summarizes all results.

##### Predictor-targets

By default we learn to predict action-values given a board state. Here we compare against using state-values or oracle actions (behavioral cloning) as the prediction targets. See Section 2.3 and Figure 1 for more details and for how to construct policies from each of the predictors. As the results in Table 2 show, the action-value predictor is superior in terms of action-ranking (Kendall’s $\tau$), action accuracy, and puzzle accuracy. The same trend is shown in Figure A5 (in Section B.2), which tracks puzzle accuracy over training iterations for the different predictors. This superior performance of action-value prediction might stem primarily from the significantly larger action-value dataset ($15.3$B state-action pairs vs. $\approx 530$M states for our largest training set constructed from $10$M games). We thus run an additional ablation where we train all three predictors on exactly the same amount of data; the results, shown in Section B.2, largely confirm this hypothesis. We discuss the different predictor targets in more detail in Section B.2, where we also argue that the performance gap between the behavioral cloning policy and the state-value-based policy may be largely explained by the fact that we train on the expert’s best action only, rather than on the expert’s full action distribution.

##### Network depth

We show the influence of increasing the transformer’s depth while keeping the number of parameters constant in Table 2. Since transformers may learn to roll out iterative computation (which arises in search) across layers, deeper networks may hold the potential for deeper unrolls. We compensate for having fewer layers by varying the embedding dimension and widening factor such that all models have the same number of parameters. The performance of our models increases with their depth but seems to saturate at around 8 layers, indicating that depth is important, but not beyond a certain point.

##### Data sampler

We remove duplicate board states during the generation of the training and test sets. This increases data diversity but introduces distributional shift compared to the “natural” game distribution of boards where early board states and popular openings occur more frequently. To quantify the effect of this shift we use an alternative “weighted” data sampler that draws boards from our filtered training set according to the distribution that would occur if we had not removed duplicates. Results in Table 2 reveal that training on the natural distribution (via the weighted sampler) leads to significantly worse results compared to sampling uniformly randomly from the filtered training set (both trained models are evaluated on a filtered test set with uniform sampling, and the puzzle test set). We hypothesize that the increased performance is due to the increased data diversity seen under uniform sampling. As we train for very few epochs, the starting position and common opening positions are only seen a handful of times during training under uniform sampling, making it unlikely that strong early-game play of our models can be attributed to memorization.

##### Value binning

Table 2 shows the impact of varying the number of bins used for state- and action-value discretization (from $16$ to $256$): more bins generally lead to improved performance. To strike a balance between performance and computational efficiency, we use $K=32$ bins for our ablations and $K=128$ for the main experiments.

##### Loss function

We treat learning Stockfish action-values as a classification problem and thus train by minimizing cross-entropy loss (log-loss). This is as close as possible to the (tried and tested) standard LLM setup. An alternative is to treat the problem as a scalar regression problem. If we parameterize a fixed-variance Gaussian likelihood model with a transformer and perform maximum (log) likelihood estimation, this is equivalent to minimizing mean-squared error (L2 loss). To that end, we modify the architecture to output a scalar (without a final log-softmax layer). The log-loss outperforms the L2 loss on two out of the three metrics (Table 2).
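The two objectives compared above can be written per data point as follows (a sketch of the loss definitions, not the training code):

```python
import math

def log_loss(log_probs, target_bin):
    """Classification: negative log-probability of the target bin."""
    return -log_probs[target_bin]

def l2_loss(predicted_value, target_value):
    """Regression: squared error on the scalar win percentage, equivalent
    to maximum likelihood under a fixed-variance Gaussian."""
    return (predicted_value - target_value) ** 2
```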

##### Stockfish time limit

We create training sets from $1$ million games annotated by Stockfish with varying time limits to manipulate the playing strength of our oracle. We report scores on the puzzle set (same for all models) and a test set created using the same time limit as the training set (different for all models). Table 2 shows that a basic time limit of $0.05$ seconds gives only marginally worse puzzle performance. As a compromise between computational effort and final model performance we thus choose this as our default value (for our $10$M games dataset we need about $15$B action-evaluation calls with Stockfish, i.e., roughly 8680 days of unparallelized Stockfish evaluation time).

## 4 Related Work

Early chess AI research made heavy use of designing explicit search strategies coupled with heuristics, as evidenced by Turing’s initial explorations (Burt, 1955) and implementations like NeuroChess (Thrun, 1994). This approach culminated in systems like Deep Blue (Campbell et al., 2002) and Stockfish (Romstad et al., 2008), known for their advanced search algorithms. The development of AlphaZero (Silver et al., 2017) marked a paradigm shift, employing deep RL with Monte Carlo Tree Search, thus learning its own heuristics (policy and value networks) instead of manually designing them. Neural networks play a significant role in chess AI (Klein, 2022), including enhancements to AlphaZero’s self-play mechanisms (V. et al., 2018), the use of deep RL (Lai, 2015), and a general trend of moving away from explicit search methods, by leveraging large-scale game datasets for training (David et al., 2016; Schrittwieser et al., 2020).

The rise of large language models has also led to innovations in chess AI, cf. Kamlish’s language-based models (Kamlish et al., 2019), the encoding of chess games via natural language (Toshniwal et al., 2022; DeLeo and Guven, 2022), and the evaluation of LLMs’ ability to play chess (Carlini, 2023; Gramaje, 2023). Czech et al. (2023) show that strategic input representations and value loss enhancements significantly boost the chess performance of vision transformers, and Alrdahi and Batista-Navarro (2023) and Feng et al. (2023) show that adding chess-specific data sources (e.g., chess textbooks) to language model training can improve their chess performance. Stöckl (2021) explored scaling effects of transformers on chess performance, which resonates with our emphasis on the importance of model and dataset scale.

## 5 Discussion

In order to use our state-based policies to play against humans and bots, two minor technical issues appear that can only be solved by having (some) access to game history. We briefly discuss both issues and present our workarounds.

##### Blindness to threefold repetition

By construction, our state-based predictor cannot detect the risk of threefold repetition (drawing because the same board occurs three times), since it has no access to the game history (FENs contain minimal historical information, sufficient only for the fifty-move rule). To reduce draws from threefold repetitions, we check if the bot’s next move would trigger the rule and set the corresponding action’s win percentage to $50\%$ before computing the softmax. However, our bots still cannot plan ahead to minimize the risk of being forced into threefold repetition.
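The workaround can be sketched as follows. Here `history_keys` and `next_key_fn` are hypothetical stand-ins: the former holds the repetition-relevant parts of past positions (e.g., the first four FEN fields), the latter returns that key for the position reached by a candidate action:

```python
def adjust_for_threefold(candidate_win_pcts, history_keys, next_key_fn):
    """Before sampling a move, set the win percentage of any action that
    would trigger threefold repetition to 50% (a draw), mirroring the
    workaround described above at a sketch level."""
    adjusted = dict(candidate_win_pcts)
    for action in adjusted:
        key = next_key_fn(action)
        if history_keys.count(key) >= 2:  # would be the third occurrence
            adjusted[action] = 50.0
    return adjusted
```

In practice a chess library (e.g., python-chess) would be used to compute the repetition-relevant position key and check claimability.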

##### Indecisiveness in the face of overwhelming victory

If Stockfish detects a mate-in-$k$ (e.g., $3$ or $5$) it outputs $k$ and not a centipawn score. We map all such outputs to the maximal value bin (i.e., a win percentage of $100\%$). Similarly, in a very strong position, several actions may end up in the maximum value bin. Thus, across time-steps this can lead to our agent playing somewhat randomly, rather than committing to one plan that finishes the game quickly (the agent has no knowledge of its past moves). This creates the paradoxical situation that our bot, despite being in a position of overwhelming win percentage, fails to take the (virtually) guaranteed win and might draw or even end up losing, since small chances of a mistake accumulate over longer games (see Figure 4). To prevent some of these situations, we check whether the predicted win percentages of all top-five moves lie above $99\%$, double-check this condition with Stockfish, and, if so, play Stockfish’s top move (out of these five) to keep the strategy consistent across time-steps.
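The trigger condition for this second workaround is simple to state in code (a sketch of the check described above, with the 99% threshold from the text):

```python
def needs_stockfish_tiebreak(top5_win_pcts, threshold=99.0):
    """Return True if all five top-ranked moves are predicted to win with
    more than `threshold` percent, in which case the agent defers to the
    oracle's preferred move among them for cross-step consistency."""
    return len(top5_win_pcts) == 5 and all(p > threshold for p in top5_win_pcts)
```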

##### Elo: Humans vs. bots

Table 1 shows a difference in Lichess Elo when playing against humans compared to bots. While the precise reasons are not entirely clear, we have three plausible hypotheses: (i) humans tend to resign when our bot has an overwhelming win percentage but many bots do not (meaning that the previously described problem gets amplified when playing against bots); (ii) humans on Lichess rarely play against bots, meaning that the two player pools (humans and bots) are hard to compare and Elo ratings between pools may be miscalibrated (Justaz, 2023); and (iii) based on preliminary (but thorough) anecdotal analysis by a chess National Master, our models make the occasional tactical mistake which may be penalized qualitatively differently (and more severely) by other bots than by humans (see some of this analysis in Sections B.4 and B.5). While investigating this Elo discrepancy further is interesting, it is not central to our paper and does not impact our main claims.

### 5.1 Limitations

While our largest model achieves very good performance, it does not completely close the gap to Stockfish 16. All our scaling experiments point towards this gap eventually closing with a large enough model trained on enough data, but the current results do not allow us to claim that the gap will certainly be closed. Another limitation, as discussed earlier, is that our predictors see the current state but not the complete game history. This leads to some fundamental technical limitations that cannot be overcome without small domain-specific heuristics or without augmenting the training data and the observable information. Finally, when using a state-value predictor to construct a policy, we consider all possible subsequent states that are reachable via legal actions. This requires a transition model $T(s,a)$ and may be considered a version of 1-step search. While the main point is that our predictors do not explicitly search over action *sequences*, we limit the claim of ‘without search’ to our action-value policy and behavioral cloning policy.
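The 1-step lookahead described above can be sketched generically (our illustration, not the paper's code): enumerate the legal actions, apply the transition model $T(s,a)$, and pick the action whose successor state the value predictor scores highest. The tiny integer game below is a hypothetical stand-in for chess.

```python
# Turning a state-value predictor into a policy via 1-step lookahead.
# We never search over action *sequences*, only over immediate successors.

def state_value_policy(state, legal_actions, transition, value):
    """value(s') is the predicted win percentage after reaching s' via T(s, a)."""
    return max(legal_actions(state), key=lambda a: value(transition(state, a)))

# Toy example: states are integers, actions add an offset, and the value
# predictor prefers states close to 10.
legal_actions = lambda s: [-1, +1, +3]
transition = lambda s, a: s + a
value = lambda s: -abs(10 - s)
print(state_value_policy(7, legal_actions, transition, value))  # → 3 (7+3=10)
```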

Note that the primary goal of this project was to investigate whether a complex, search-based algorithm, such as Stockfish 16, can be well approximated with a feedforward neural network. In the course of this, we have made a serious attempt to produce a strong chess policy and estimate its playing strength, but we have not exhausted every conceivable option to maximize playing strength—it may well be that further tweaks of our approach could lead to even stronger policies. Similarly, we have made a serious attempt at calibrating our policy’s playing strength via Lichess, where the claim of “grandmaster-level” play currently holds against human opponents, but we have not calibrated our policy under official tournament conditions. We also cannot rule out that opponents, through extensive repeated play, may be able to find and exploit weaknesses reliably due to the fairly deterministic nature of our policy.

## 6 Conclusion

Our paper shows that it is possible to distill an approximation of Stockfish 16 into a feed-forward transformer via standard supervised training. The resulting predictor generalizes well to unseen board states, and, when used in a policy, leads to strong chess play (Lichess Elo of 2895 against humans). We demonstrate that strong chess capabilities from supervised learning only emerge at sufficient dataset and model scale. Our work thus adds to a rapidly growing body of literature showing that complex and sophisticated algorithms can be distilled into feed-forward transformers, implying a paradigm shift away from viewing large transformers as “mere” statistical pattern recognizers and towards viewing them as a powerful technique for general algorithm approximation.

## Impact Statement

While training transformer-based architectures at scale in a (self-)supervised way will have significant societal consequences in the near future, such concerns do not apply to a closed domain like chess, which has limited real-world impact and has been a domain of machine superiority for decades. An advantage of supervised training on a single task over other forms of training (particularly self-play, reinforcement learning, and meta-learning) is that the method requires a strong oracle solution to begin with (for data annotation) and is unlikely to significantly outperform that oracle—so the potential for the method to rapidly introduce substantial unknown capabilities (with wide societal impacts) is very limited.

## Acknowledgments

We thank Aurélien Pomini, Avraham Ruderman, Eric Malmi, Charlie Beattie, Chris Colen, Chris Wolff, David Budden, Dashiell Shaw, Guillaume Desjardins, Hamdanil Rasyid, Himanshu Raj, Joel Veness, John Schultz, Julian Schrittwieser, Laurent Orseau, Lisa Schut, Marc Lanctot, Marcus Hutter, Matthew Aitchison, Nando de Freitas, Nenad Tomasev, Nicholas Carlini, Nick Birnie, Nikolas De Giorgis, Ritvars Reimanis, Satinder Baveja, Thomas Fulmer, Tor Lattimore, Vincent Tjeng, Vivek Veeriah, and Zhengdong Wang for insightful discussions and their helpful feedback.

## References

- Alrdahi and Batista-Navarro (2023) H. Alrdahi and R. Batista-Navarro. Learning to play chess from textbooks (LEAP): a corpus for evaluating chess moves based on sentiment analysis. *arXiv:2310.20260*, 2023.
- Anil et al. (2023) R. Anil, S. Borgeaud, Y. Wu, J. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, D. Silver, S. Petrov, M. Johnson, I. Antonoglou, J. Schrittwieser, A. Glaese, J. Chen, E. Pitler, T. P. Lillicrap, A. Lazaridou, O. Firat, J. Molloy, M. Isard, P. R. Barham, T. Hennigan, B. Lee, F. Viola, M. Reynolds, Y. Xu, R. Doherty, E. Collins, C. Meyer, E. Rutherford, E. Moreira, K. Ayoub, M. Goel, G. Tucker, E. Piqueras, M. Krikun, I. Barr, N. Savinov, I. Danihelka, B. Roelofs, A. White, A. Andreassen, T. von Glehn, L. Yagati, M. Kazemi, L. Gonzalez, M. Khalman, J. Sygnowski, et al. Gemini: A family of highly capable multimodal models. *arXiv:2312.11805*, 2023.
- Bradbury et al. (2018) J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.
- Brown et al. (2020) T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In *NeurIPS*, 2020.
- Burt (1955) C. Burt. Faster than thought: A symposium on digital computing machines. Edited by B. V. Bowden. *British Journal of Statistical Psychology*, 1955.
- Campbell et al. (2002) M. Campbell, A. J. H. Jr., and F. Hsu. Deep Blue. *Artif. Intell.*, 2002.
- Carlini (2023) N. Carlini. Playing chess with large language models. https://nicholas.carlini.com/writing/2023/chess-llm.html, 2023.
- Coulom (2008) R. Coulom. Whole-history rating: A Bayesian rating system for players of time-varying strength. In *Computers and Games*, 2008.
- Czech et al. (2023) J. Czech, J. Blüml, and K. Kersting. Representation matters: The game of chess poses a challenge to vision transformers. *arXiv:2304.14918*, 2023.
- David et al. (2016) O. E. David, N. S. Netanyahu, and L. Wolf. DeepChess: End-to-end deep neural network for automatic learning in chess. In *ICANN (2)*, 2016.
- DeepMind et al. (2020) DeepMind, I. Babuschkin, K. Baumli, A. Bell, S. Bhupatiraju, J. Bruce, P. Buchlovsky, D. Budden, T. Cai, A. Clark, I. Danihelka, A. Dedieu, C. Fantacci, J. Godwin, C. Jones, R. Hemsley, T. Hennigan, M. Hessel, S. Hou, S. Kapturowski, T. Keck, I. Kemaev, M. King, M. Kunesch, L. Martens, H. Merzic, V. Mikulik, T. Norman, G. Papamakarios, J. Quan, R. Ring, F. Ruiz, A. Sanchez, L. Sartran, R. Schneider, E. Sezener, S. Spencer, S. Srinivasan, M. Stanojević, W. Stokowiec, L. Wang, G. Zhou, and F. Viola. The DeepMind JAX Ecosystem, 2020. URL http://github.com/google-deepmind.
- DeLeo and Guven (2022) M. DeLeo and E. Guven. Learning chess with language models and transformers. *arXiv:2209.11902*, 2022.
- Feng et al. (2023) X. Feng, Y. Luo, Z. Wang, H. Tang, M. Yang, K. Shao, D. Mguni, Y. Du, and J. Wang. ChessGPT: Bridging policy learning and language modeling. *arXiv:2306.09200*, 2023.
- Gehring et al. (2017) J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In *ICML*, 2017.
- Gramaje (2023) B. A. Gramaje. Exploring GPT's capabilities in chess-puzzles. Master's thesis, Universitat Politècnica de València, 2023.
- Haworth and Hernandez (2021) G. Haworth and N. Hernandez. The 20th top chess engine championship, TCEC20. *J. Int. Comput. Games Assoc.*, 2021.
- Hennigan et al. (2020) T. Hennigan, T. Cai, T. Norman, L. Martens, and I. Babuschkin. Haiku: Sonnet for JAX, 2020. URL http://github.com/deepmind/dm-haiku.
- Hoffmann et al. (2022) J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre. Training compute-optimal large language models. *arXiv:2203.15556*, 2022.
- Justaz (2023) Justaz. Exact ratings for everyone on Lichess. https://lichess.org/@/justaz/blog/exact-ratings-for-everyone-on-lichess/klIoAEAU, 2023.
- Kamlish et al. (2019) I. Kamlish, I. B. Chocron, and N. McCarthy. SentiMATE: Learning to play chess through natural language processing. *arXiv:1907.08321*, 2019.
- Kingma and Ba (2015) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In *ICLR (Poster)*, 2015.
- Klein (2022) D. Klein. Neural networks for chess. *arXiv:2209.01506*, 2022.
- Lai (2015) M. Lai. Giraffe: Using deep reinforcement learning to play chess. *arXiv:1509.01549*, 2015.
- OpenAI (2023) OpenAI. GPT-4 technical report. *arXiv:2303.08774*, 2023.
- Romstad et al. (2008) T. Romstad, M. Costalba, J. Kiiski, G. Linscott, Y. Nasu, M. Isozaki, H. Noda, et al. Stockfish, 2008. URL https://stockfishchess.org.
- Sadler and Regan (2019) M. Sadler and N. Regan. *Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI*. New In Chess, 2019.
- Schrittwieser et al. (2020) J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, T. Graepel, T. P. Lillicrap, and D. Silver. Mastering Atari, Go, chess and shogi by planning with a learned model. *Nat.*, 2020.
- Shazeer (2020) N. Shazeer. GLU variants improve transformer. *arXiv:2002.05202*, 2020.
- Silver et al. (2017) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. P. Lillicrap, K. Simonyan, and D. Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. *arXiv:1712.01815*, 2017.
- Stöckl (2021) A. Stöckl. Watching a language model learning chess. In *RANLP*, 2021.
- Thrun (1994) S. Thrun. Learning to play the game of chess. In *NIPS*, 1994.
- Toshniwal et al. (2022) S. Toshniwal, S. Wiseman, K. Livescu, and K. Gimpel. Chess as a testbed for language model state tracking. In *AAAI*, 2022.
- Touvron et al. (2023a) H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. *arXiv:2302.13971*, 2023a.
- Touvron et al. (2023b) H. Touvron, L. Martin, K. Stone, et al. Llama 2: Open foundation and fine-tuned chat models. *arXiv:2307.09288*, 2023b.
- V. et al. (2018) S. K. G. V., K. Goyette, A. Chamseddine, and B. Considine. Deep Pepper: Expert iteration based chess agent in the reinforcement learning setting. *arXiv:1806.00683*, 2018.
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In *NIPS*, 2017.

## Appendix A Experimental Setup

Table A1: Dataset sizes for the three prediction targets.

| Split | Games | State-Value Records | State-Value Bytes | Behavioral Cloning Records | Behavioral Cloning Bytes | Action-Value Records | Action-Value Bytes |
|---|---|---|---|---|---|---|---|
| Train | $10^{4}$ | 591 897 | 43.7 MB | 589 130 | 41.1 MB | 17 373 887 | 1.4 GB |
| Train | $10^{5}$ | 5 747 753 | 422.0 MB | 5 720 672 | 397.4 MB | 167 912 926 | 13.5 GB |
| Train | $10^{6}$ | 55 259 971 | 4.0 GB | 54 991 050 | 3.8 GB | 1 606 372 407 | 129.0 GB |
| Train | $10^{7}$ | 530 310 443 | 38.6 GB | 527 633 465 | 36.3 GB | 15 316 914 724 | 1.2 TB |
| Test | $10^{3}$ | 62 829 | 4.6 MB | 62 561 | 4.4 MB | 1 838 218 | 148.3 MB |

### A.1 Tokenization

The first part of a FEN string encodes the position of the pieces rank-wise (row-wise). The only change we make is that we encode each empty square with a ‘.’, which always gives us $64$ characters for a board. The next character denotes the active player (‘w’ or ‘b’). The next part of the FEN string denotes castling availability (up to four characters for King- and Queen-side for each color, or ‘-’ for no availability)—we take this string and, if needed, pad it with ‘.’ such that it always has length $4$. Next are two characters for the en passant target square; we use the two characters literally, or ‘-.’ when there is no target. Finally, we have the halfmove clock (up to two digits) and the fullmove number (up to three digits); we take the numbers as characters and pad them with ‘.’ to make sure they are always tokenized into two and three characters, respectively.
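A minimal sketch of this tokenization, reconstructed from the description above (the side on which fields are padded is our assumption). Under this description every board maps to a fixed $64+1+4+2+2+3=76$ characters.

```python
# Reconstruct the fixed-width FEN tokenization described in the text.

def tokenize_fen(fen: str) -> str:
    board, active, castling, en_passant, halfmove, fullmove = fen.split()
    # Expand each digit d (a run of d empty squares) into d dots; drop the
    # '/' rank separators, leaving exactly 64 characters.
    squares = "".join("." * int(c) if c.isdigit() else c
                      for c in board if c != "/")
    assert len(squares) == 64
    castling = castling.ljust(4, ".")                       # up to 'KQkq'
    en_passant = "-." if en_passant == "-" else en_passant  # always 2 chars
    halfmove = halfmove.ljust(2, ".")                       # up to two digits
    fullmove = fullmove.ljust(3, ".")                       # up to three digits
    return squares + active + castling + en_passant + halfmove + fullmove

s = tokenize_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print(len(s))  # → 76
```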

### A.2 Main Setup

We use the same basic setup for all our main experiments and only vary the model architecture.

Concretely, our base setup is as follows:

- We train for 20 million steps with a batch size of 4096, meaning that we train for 5.35 epochs.
- We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-4.
- We train on the dataset generated from 10 million games (cf. Table A1) for the action-value policy with 128 return buckets and a Stockfish time limit of 0.05s.
- We use the unique sampler and Polyak averaging for evaluation, and evaluate on 1000 games (cf. Table A1) and 1000 puzzles from a different month than that used for training.
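The quoted epoch count follows directly from the step count, batch size, and the roughly 15.3B action-value records of the 10M-game dataset (Table A1):

```python
# Sanity check: 20M steps at batch size 4096 over ~15.3B records ≈ 5.35 epochs.
steps, batch_size = 20_000_000, 4096
records = 15_316_914_724
print(round(steps * batch_size / records, 2))  # → 5.35
```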

We train a vanilla decoder-only transformer without causal masking (Vaswani et al., 2017), with the improvements proposed in LLaMA (Touvron et al., 2023a, b), i.e., post-normalization and SwiGLU (Shazeer, 2020). We use three different model configurations: (i) 8 heads, 8 layers, and an embedding dimension of 256; (ii) 8 heads, 8 layers, and an embedding dimension of 1024; and (iii) 8 heads, 16 layers, and an embedding dimension of 1024.

### A.3 Ablation Setup

We use the same basic setup for all our ablation experiments and only vary the ablation parameters.

Concretely, our base setup is as follows:

- We train for 5 million steps with a batch size of 1024, meaning that we train for 3.19 epochs.
- We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 4e-4.
- We train on the dataset generated from 1 million games (cf. Table A1) for the action-value policy with 32 return buckets and a Stockfish time limit of 0.5s.
- We use the unique sampler and train a vanilla decoder-only transformer (Vaswani et al., 2017) with post-normalization, 8 heads, 8 layers, an embedding dimension of 256, and no causal masking.
- We use Polyak averaging for evaluation and evaluate on 1000 games (cf. Table A1) and 1000 puzzles from a different month than that used for training.

### A.4 Dataset Statistics

We visualize some dataset statistics in Figures A1 and A2.

### A.5 Playing-strength evaluation on Lichess

We evaluate and calibrate the playing strength of our models by playing against humans and bots on Lichess. Our standard evaluation allows for both playing against bots and humans (see Table 1), but since humans rarely play against bots, the Elo ratings in this case are dominated by playing against other bots (see our discussion of how this essentially creates two different, somewhat miscalibrated, player pools in Section 5). In our case, the policies in the column denoted with ‘vs. Bots’ in Table 1 have played against some humans, but the number of games against humans is $<4.5\%$ of total games played. To get better calibration against humans, we let our largest model play exclusively against humans (by not accepting games with other bots), which leads to a significantly higher Elo ranking (see Table 1). Overall we have played the following numbers of games for the different policies shown in Table 1: $9$M (553 games), $136$M (169 games), $270$M (228 games against bots, 174 games against humans), Stockfish (30 games), GPT-3.5-turbo-instruct (181 games).

### A.6 Stockfish and AlphaZero Setup

##### Stockfish

We use Stockfish 16 (the version from December 2023) throughout the paper. When we play, we use the oracle we used for training, which is an unconventional way to use this engine: we evaluate each legal move in the position for 50ms and return the best move based on these scores. This is not entirely equivalent to a standard thinking time of 50ms times the number of legal moves per position, as we force Stockfish to spend 50ms on moves that could be uninteresting and unexplored. We chose to keep this setup to have a comparison to the oracle we train on. Note that, when comparing the legal moves in a given position, we do not clear Stockfish's cache between the moves. Therefore, due to the way the cache works, this biases the accuracy of Stockfish's evaluation to be weaker for the first moves considered. Finally, due to the nature of our internal hardware setup, we use two different kinds of chips to run Stockfish: (i) to compute the Lichess Elo, we use a 6-core Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz, and (ii) to compute the tournament Elo, we use a single Tensor Processing Unit (V3), as for all the other agents.
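The per-move evaluation loop above can be sketched as follows. This is our illustration, not the paper's harness: the `engine.analyse(board, time_limit, root_moves=...)` interface mimics python-chess, and a deterministic fake engine stands in so the sketch runs without a Stockfish binary.

```python
# One fixed-budget oracle call per legal move, then argmax over the scores.

class FakeEngine:
    """Stand-in for a UCI engine wrapper: scores moves from a fixed table."""
    def __init__(self, table):
        self.table = table

    def analyse(self, board, time_limit, root_moves):
        (move,) = root_moves  # the engine is forced to analyse this one move
        return {"score": self.table[move]}  # centipawns, our perspective

def oracle_best_move(board, legal_moves, engine, time_limit=0.05):
    # Spend the full budget on every move, even obviously uninteresting ones.
    scores = {m: engine.analyse(board, time_limit, root_moves=[m])["score"]
              for m in legal_moves}
    return max(scores, key=scores.get)

engine = FakeEngine({"e2e4": 35, "d2d4": 30, "g1f3": 28})
print(oracle_best_move(board=None, legal_moves=["e2e4", "d2d4", "g1f3"],
                       engine=engine))  # → e2e4
```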

##### AlphaZero

We use the AlphaZero version from 2020, with a network trained at that time (Schrittwieser et al., 2020). We use three different versions: (i) the policy network only, (ii) the value network only, and (iii) the standard version with search. For (i), we use the probability distribution over actions returned by the policy network and take the argmax as the best action. For (ii), we do a search limited to depth 1, with 100 MCTS simulations (enough to cover all legal actions), and take the argmax over visit counts. For (iii), we use the standard search from the paper, with 400 MCTS simulations and the exact same UCB scaling parameters, and again take the argmax over visit counts. Note that AlphaZero's policy and value network have been trained on $44$M games, whereas we trained our largest models on only $10$M games.

### A.7 Computational Resources

Our codebase is based on JAX (Bradbury et al., 2018) and the DeepMind JAX Ecosystem (DeepMind et al., 2020; Hennigan et al., 2020). We used 4 Tensor Processing Units (V5) per model for the ablation experiments, 128 Tensor Processing Units (V5) per model to train our large (9M, 136M, and 270M) models, and a single Tensor Processing Unit (V3) per agent for our Elo tournament.

## Appendix B Additional Results

### B.1 Loss Curves

In Figure A3 we show the train and test loss curves (and the evolution of the puzzle accuracy) for the large models from Section 3.1. We observe that none of the models overfit and that larger models improve both the training and the test loss.

In Figure A4 we visualize the train and test loss curves for the scaling experiment from Section 3.3. In line with the results shown in the main paper, we observe that models with $\geq 7$M parameters start to overfit on the smallest training set, but not on the larger training sets. Except for these overfitting cases, we observe that larger models improve both the training and the test loss, regardless of training set size, and that a larger training set improves the test loss when keeping the model size constant.

### B.2 Predictor-Target Comparison

Table A2: Relative tournament Elo for the three prediction targets.

| Prediction Target | Same # of Games in Dataset | Same # of Data Points |
|---|---|---|
| Action-Value | +492 $(\pm 31)$ | +252 $(\pm 22)$ |
| State-Value | +257 $(\pm 23)$ | +264 $(\pm 22)$ |
| Behavioral-Cloning | 0 $(\pm 28)$ | 0 $(\pm 24)$ |

In Figure A5 we compare the puzzle accuracy for the three different prediction targets (action-values, state-values, or best action) trained on $1$ million games. As discussed in the main text, for a fixed number of games we have very different dataset sizes for state-value prediction (roughly $55$ million states) and action-value prediction (roughly $1.6$ billion states); see Table A1 for all dataset sizes. It seems plausible that learning action-values poses a slightly harder learning problem, leading to slightly slower initial learning, but this is eventually compensated for by having much more data to train on compared to state-value learning (see Figure A5, which shows this trend). Also note that since we use the same time budget per Stockfish call, the action-values for one state use more Stockfish computation time in total (one call per action) than the state-values (one call per board). To control for the effect of dataset size, we train all three predictors ($9$M parameter model) on a fixed set of $40$ million data points. Results are shown in Figure A6. As the results show, the state-value policy in this case slightly outperforms the action-value policy, except for action-ranking (Kendall's $\tau$), which makes sense since the action-value predictor is implicitly trained to produce good action rankings. To see how this translates into playing strength, we pit all three policies (AV, SV, BC) against each other and determine their relative Elo rankings. Table A2 shows that when not controlling for the number of training data points, the action-value policy is strongest (in line with the findings in Table 2 and Figure A5), but when controlling for the number of training data points, the action-value and state-value policies perform almost identically (in line with Figure A6).
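The dataset-size gap can be made concrete from Table A1's 1M-game split: action-value prediction sees roughly 29x more data points than state-value prediction, which (since there is one record per state-action pair versus one per state) is roughly the average number of legal moves per recorded position.

```python
# Ratio of action-value to state-value records for the 1M-game split.
state_value_records = 55_259_971
action_value_records = 1_606_372_407
print(round(action_value_records / state_value_records, 1))  # → 29.1
```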

Throughout all these results we observe lower performance of the behavioral cloning policy, despite it being trained on a comparable number of data points as the state-value policy. Our main hypothesis is that the behavioral cloning dataset contains less information than the state-value dataset, since we throw away any information in the state- or action-values beyond the index of the oracle action. We suspect that training on the oracle's full action distribution (with a cross-entropy loss), rather than on the best action only, would largely close this gap, but we consider this question beyond the scope of this paper and limit ourselves to reporting the observed effect in our setting.

### B.3 Polyak Averaging

We investigate the impact of Polyak averaging, an optimization technique where the evaluation parameters are a weighted average over the last iterates rather than just the most recent values, using the same setup as for our ablation experiments (see Section A.3). When using Polyak averaging with an exponential moving average decay factor of 0.99, we obtain a Kendall's $\tau$ of 0.259, a puzzle accuracy of 83.3%, and an action accuracy of 63.0%. In contrast, standard evaluation obtains a Kendall's $\tau$ of 0.258, a puzzle accuracy of 83.1%, and an action accuracy of 62.8%. Thus, we use Polyak averaging for all experiments.

### B.4 Tactics

In Figure A7, we analyze the tactics learned by our 270M transformer in play against a human with a blitz Elo of 2145. We observe that our model has learned to sacrifice material when it is advantageous to build a longer-term advantage.


### B.5 Playing Style

We recruited chess players of National Master level and above to analyze our agent's games against bots and humans on the Lichess platform. They made the following qualitative assessments of its playing style and highlighted specific examples (see Figure A8). Our agent has an aggressive, enterprising style where it frequently sacrifices material for long-term strategic gain. The agent plays optimistically: it prefers moves that give opponents difficult decisions to make, even if they are not always objectively correct. It values king safety highly in that it only reluctantly exposes its own king to danger, but also frequently sacrifices material and time to expose the opponent's king. For example, 17...Bg5 in game B.5.1 encouraged its opponent to weaken their king position. Its style incorporates strategic motifs employed by the most recent neural engines (Silver et al., 2017; Sadler and Regan, 2019). For example, it pushes wing pawns in the middlegame when conditions permit (see game B.5.2). In game B.5.3 our agent executes a correct long-term exchange sacrifice. In game B.5.4 the bot uses a motif of a pin on the back rank to justify a pawn sacrifice for long-term pressure. Game B.5.5 features a piece sacrifice to expose its opponent's king. The sacrifice is not justified according to Stockfish, although the opponent does not manage to tread the fine line to a permanent advantage and blunders six moves later with Bg7.

Our agent's playing style is distinct from Stockfish's: one analyzer commented that “it feels more enjoyable than playing a normal engine”, “as if you are not just hopelessly crushed”. Indeed, it does frequently agree with Stockfish's move choices, suggesting that our agent's action-value predictions match Stockfish's evaluations. However, the disagreements can be telling: the piece sacrifice in the preceding paragraph is one such example. Game B.5.6 is also interesting because our agent makes moves that Stockfish strongly disagrees with. In particular, our agent strongly favours 18...Rxb4 and believes Black is better; in contrast, Stockfish believes White is better and prefers Nd4. Subsequent analysis by the masters suggests that Stockfish is objectively correct in this instance. Indeed, on the very next move our agent reverses its opinion and agrees with Stockfish.

Our agent's aggressive style is highly successful against human opponents, achieving a grandmaster-level Lichess Elo of 2895. However, when we ran another instance of the bot and allowed other engines to play it, its estimated Elo was far lower: 2299. Its aggressive playing style does not work as well against engines that are adept at tactical calculation, particularly when there is a tactical refutation to a sub-optimal move. Most losses against bots can be explained by a single tactical blunder in the game that the opponent refutes. For example, Bxh3 in game B.5.7 loses a piece to g4.

Finally, the recruited chess masters commented that our agent's style makes it very useful for opening repertoire preparation. It is no longer feasible to surprise human opponents with opening novelties, as all the best moves have been heavily over-analyzed. Modern opening preparation amongst professional chess players now focuses on discovering sub-optimal moves that pose difficult problems for opponents. This aligns extremely well with our agent's aggressive, enterprising playing style, which does not always respect objective evaluations of positions.

#### B.5.1 King weakening game

1. e4 c5 2. Nf3 Nc6 3. Bb5 g6 4. O-O Bg7 5. c3 Nf6 6. Re1 O-O 7. d4 d5 8. e5 Ne4 9. Bxc6 bxc6 10. Nbd2 Nxd2 11. Bxd2 Qb6 12. dxc5 Qxc5 13. h3 Qb5 14. b4 a5 15. a4 Qc4 16. Rc1 Bd7 17. Bg5 f6 18. Bd2 Bf5 19. exf6 exf6 20. Nd4 Bd7 21. Nb3 axb4 22. cxb4 Qh4 23. Nc5 Bf5 24. Ne6 Rfc8 25. Nxg7 Kxg7 26. Re7+ Kh8 27. a5 Re8 28. Qe2 Be4 29. Rxe8+ Rxe8 30. f3 1-0

#### B.5.2 Wing pawn push game

1. e4 c6 2. d4 d5 3. Nc3 dxe4 4. Nxe4 Nf6 5. Ng3 c5 6. Bb5+ Bd7 7. Bxd7+ Nbxd7 8. dxc5 Qa5+ 9. Qd2 Qxc5 10. Nf3 h5 11. O-O h4 12. Ne2 h3 13. g3 e5 14. Nc3 Qc6 15. Qe2 Bb4 16. Bd2 O-O 17. Rae1 Rfe8 18. Ne4 Bxd2 19. Qxd2 Nxe4 0-1

#### B.5.3 Exchange sacrifice game

1. d4 d5 2. c4 e6 3. Nc3 Bb4 4. cxd5 exd5 5. Nf3 Nf6 6. Bg5 h6 7. Bh4 g5 8. Bg3 Ne4 9. Rc1 h5 10. h3 Nxg3 11. fxg3 c6 12. e3 Bd6 13. Kf2 h4 14. g4 Bg3+ 15. Ke2 O-O 16. Kd2 Re8 17. Bd3 Nd7 18. Kc2 Rxe3 19. Kb1 Qe7 20. Qc2 Nf8 21. Rhf1 Ne6 22. Bh7+ Kg7 23. Bf5 Rxf3 24. gxf3 Nxd4 25. Qd3 Nxf5 26. gxf5 Qe5 27. Ka1 Bxf5 28. Qe2 Re8 29. Qxe5+ Rxe5 30. Rfd1 Bxh3 31. Rc2 Re3 32. Ne2 Bf5 33. Rcd2 Rxf3 34. Nxg3 hxg3 0-1

#### B.5.4 Long term sacrifice game

1. d4 d5 2. c4 e6 3. Nf3 Nf6 4. Nc3 Bb4 5. Bg5 dxc4 6. e4 b5 7. a4 Bb7 8. axb5 Bxe4 9. Bxc4 h6 10. Bd2 Bb7 11. O-O O-O 12. Be3 c6 13. bxc6 Nxc6 14. Qb3 Qe7 15. Ra4 a5 16. Rd1 Rfd8 17. d5 exd5 18. Nxd5 Nxd5 19. Rxd5 Rxd5 20. Bxd5 Rd8 21. Ra1 a4 22. Rxa4 Qd7 23. Bc4 Qd1+ 24. Qxd1 Rxd1+ 25. Bf1 Ba5 26. Rc4 Rb1 27. Rc2 Nb4 28. Rc5 Nc6 29. Bc1 Bb4 30. Rc2 g5 31. h4 g4 32. Nh2 h5 33. Bd3 Ra1 34. Nf1 Ne5 35. Be2 Be4 36. Rc8+ Kh7 37. Be3 Re1 38. Bb5 Bd3 39. Bxd3+ Nxd3 40. Rd8 Nxb2 41. Rd5 Be7 42. Rd7 Bxh4 43. g3 Bf6 44. Rxf7+ Kg6 45. Rxf6+ Kxf6 46. Bd4+ Kg5 47. Bxb2 Rb1 48. Bc3 Kf5 49. Kg2 Rb3 50. Ne3+ Ke4 51. Bf6 Rb5 52. Kf1 Rb6 53. Bc3 Rb3 54. Bd2 Kd3 55. Be1 Rb5 56. Ng2 Ke4 57. Ke2 Rb2+ 58. Bd2 Rc2 59. Ne3 Ra2 60. Nc4 Kd4 61. Nd6 Ke5 62. Ne8 Kf5 63. Kd3 Ra6 64. Bc3 Rc6 65. Bb4 Kg6 66. Nd6 Ra6 67. Bc5 Ra5 68. Bd4 Ra6 69. Nc4 Ra4 70. Nb6 Ra5 71. Ke4 h4 72. gxh4 Kh5 73. Bf6 Ra2 74. Ke3 Ra3+ 75. Ke2 g3 76. Nd5 Ra2+ 77. Kf3 gxf2 78. Nf4+ Kh6 79. Kg2 f1=Q+ 80. Kxf1 Rc2 81. Bg5+ Kh7 82. Ne2 Kg6 83. Kf2 Ra2 84. Kf3 Ra4 85. Ng3 Rc4 86. Bf4 Rc3+ 87. Kg4 Rc4 88. h5+ Kf6 89. Nf5 Ra4 90. Ne3 Ra5 91. Nc4 Ra4 92. Ne5 Kg7 93. Kf5 Ra5 94. Kg5 Rb5 95. Kg4 Rb1 96. Kf5 Rb5 97. Ke4 Ra5 98. h6+ Kh7 99. Bd2 Ra2 100. Be3 Ra6 101. Ng4 Ra3 102. Bd2 Ra2 103. Bf4 Ra5 104. Kf3 Rf5 105. Ke3 Kg6 106. Ke4 Rh5 107. Kf3 Rh3+ 108. Kg2 Rh5 109. Kg3 Ra5 110. Be3 Ra3 111. Kf3 Rb3 112. Ke4 Rb4+ 113. Bd4 Ra4 114. Ke5 Rc4 115. Kd5 Ra4 116. Ke4 Rb4 117. Kd3 Ra4 118. Kc3 Ra3+ 119. Kc4 Rg3 120. Ne3 Rh3 121. Kd5 Rxh6 122. Bb6 Rh3 123. Nc4 Rh5+ 124. Ke6 Rg5 125. Nd2 Rg2 126. Nf1 Rb2 127. Bd8 Re2+ 128. Kd5 Re1 129. Ne3 Rxe3 130. Bh4 Kf5 131. Bf2 Rd3+ 132. Kc4 Ke4 133. Bc5 Rc3+ 134. Kxc3 1/2-1/2

#### B.5.5 Expose king game

1. e4 c5 2. Nf3 Nc6 3. Na3 Nf6 4. e5 Nd5 5. d4 cxd4 6. Nb5 a6 7. Nbxd4 g6 8. Bc4 Nc7 9. Nxc6 bxc6 10. Ng5 Ne6 11. Nxf7 Kxf7 12. Bxe6+ Kxe6 13. Bd2 Kf7 14. Qf3+ Kg8 15. e6 dxe6 16. O-O-O Qd5 17. Qe3 Bg7 18. Bc3 Qxa2 19. Rd8+ Kf7 20. Qf4+ Bf6 21. Rxh8 Qa1+ 22. Kd2 Qxh1 23. Bxf6 exf6 24. Qc7+ 1-0

#### B.5.6 Stockfish disagreement game

1. e4 c5 2. Nf3 Nc6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3 e6 6. Ndb5 d6 7. Bf4 e5 8. Bg5 a6 9. Na3 b5 10. Nd5 Qa5+ 11. Bd2 Qd8 12. Bg5 Be7 13. Bxf6 Bxf6 14. c4 b4 15. Nc2 Rb8 16. g3 b3 17. axb3 Rxb3 18. Ncb4 Rxb4 19. Nxb4 Nxb4 20. Qa4+ Kf8 21. Qxb4 g6 22. Bg2 h5 23. h4 Kg7 24. O-O g5 25. hxg5 Bxg5 26. f4 Be7 27. fxe5 dxe5 28. Qc3 Bc5+ 29. Kh2 Qg5 30. Rf5 Bxf5 31. Qxe5+ Qf6 32. Qxf6+ Kxf6 33. exf5 Kg5 34. Bd5 Rb8 35. Ra2 f6 36. Be6 Kg4 37. Kg2 Rb3 38. Bf7 Rxg3+ 39. Kf1 h4 40. Ra5 Bd4 41. b4 h3 42. Bd5 h2 43. Bg2 Rb3 44. Rxa6 Rb1+ 45. Ke2 Rb2+ 0-1

#### B.5.7 Blunder game

1. b3 e5 2. Bb2 Nc6 3. e3 d5 4. Bb5 Bd6 5. Bxc6+ bxc6 6. d3 Qg5 7. Nf3 Qe7 8. c4 Nh6 9. Nbd2 O-O 10. c5 Bxc5 11. Nxe5 Bb7 12. d4 Bd6 13. O-O c5 14. Qh5 cxd4 15. exd4 Rae8 16. Rfe1 f6 17. Nd3 Qf7 18. Qf3 Bc8 19. h3 Nf5 20. g3 Ne7 21. Bc3 Bxh3 22. g4 f5 23. Qxh3 fxg4 24. Qxg4 h5 25. Qe6 g5 26. Qxf7+ Rxf7 27. Bb4 Ref8 28. Bxd6 cxd6 29. b4 Nf5 30. Re6 Kg7 31. Rd1 Rc7 32. Nf3 g4 33. Nd2 h4 34. Nb3 Rc2 35. Nf4 g3 36. Nh5+ Kh7 37. fxg3 Nxg3 38. Nxg3 Rg8 39. Rd3 Rxa2 40. Rxd6 Rb2 41. Rxd5 Rxg3+ 42. Rxg3 hxg3 43. Nc5 Kg6 44. b5 Rxb5 45. Kg2 a5 46. Kxg3 a4 47. Rd6+ Kf5 48. Nxa4 Rb3+ 49. Kf2 Rh3 50. Nc5 Kg5 51. Rc6 Kf5 52. d5 Ke5 53. d6 Rh2+ 54. Kg3 Rd2 55. d7 Rxd7 56. Nxd7+ Ke4 57. Rd6 Ke3 58. Nf6 Ke2 59. Ng4 Ke1 60. Kf3 Kf1 61. Rd1# 1-0