Summary of AlphaGo Approach, Pulling Directly from the Paper

If you can handle it, read the paper on AlphaGo here: https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf. I tried to pull the key techniques and results out of the paper into less than three pages so I can reread the highlights over time and see if it sinks in…

What is Hard About Go

An enormous search space, and board positions that are difficult to evaluate.

Approach

Introduction of Value Networks and Policy Networks – deep neural networks

Novel combination of supervised learning and reinforcement learning

This thing plays itself to improve

New search algorithm that combines Monte-Carlo simulation with the value & policy networks

Results

A 99.8% winning rate against other Go programs, and a 5–0 defeat of the European Go champion.

Note – this was thought to be a decade away

Perfect Information and Outcomes

In games of perfect information, there is an optimal value function that determines the outcome of the game from every board position, under perfect play by all players.

Recursive Computation of Optimal Value Function

You can solve the game by recursively computing the optimal value function in a search tree containing roughly b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length).

In Go, exhaustive search is not feasible.
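To make the blow-up concrete, here is a toy sketch (my own illustration, not from the paper) of computing the optimal value by exhaustive recursion, written against a throwaway subtraction game. With breadth b and depth d it visits on the order of b^d positions, which is why this is hopeless at Go scale (b ≈ 250, d ≈ 150).

```python
# Toy illustration: exact game value by exhaustive recursion (negamax form).
# For a game of breadth b and depth d this visits ~b**d positions.

def optimal_value(state, legal_moves, play, terminal_value):
    """Exact value of `state` for the player to move, under perfect play."""
    moves = legal_moves(state)
    if not moves:
        return terminal_value(state)
    # The opponent moves next, so negate their best value.
    return max(-optimal_value(play(state, m), legal_moves, play, terminal_value)
               for m in moves)

# Example with a trivial subtraction game: take 1 or 2 stones from a pile;
# whoever takes the last stone wins.
value = optimal_value(
    5,
    legal_moves=lambda n: [m for m in (1, 2) if m <= n],
    play=lambda n, m: n - m,
    terminal_value=lambda n: -1,   # no moves left: the player to move has lost
)
print(value)  # +1: the first player wins a pile of 5 under perfect play
```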

Reducing the Search Space

1> Reduce the depth with position evaluation: truncate the search tree at state s and replace the subtree below s with an approximate value function that predicts the outcome from state s.

2> Reduce the breadth by sampling actions from a policy: a probability distribution over possible moves in position s.
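Continuing the toy sketch above (again my own illustration, not the paper’s code), the two reductions amount to cutting the recursion off at a depth limit, where an approximate value function stands in for the subtree, and only searching a few moves sampled from a policy. `value_fn` and `policy` here are hypothetical stand-ins for the networks described later.

```python
import random

def truncated_value(state, depth, legal_moves, play, value_fn, policy, k=3):
    """Approximate negamax: cut off at `depth` with value_fn (reduction 1>)
    and only search k moves sampled from `policy` (reduction 2>)."""
    moves = legal_moves(state)
    if depth == 0 or not moves:
        return value_fn(state)        # approximate (or terminal) value of s
    probs = policy(state, moves)      # probability for each candidate move
    sampled = set(random.choices(moves, weights=probs, k=min(k, len(moves))))
    return max(-truncated_value(play(state, m), depth - 1,
                                legal_moves, play, value_fn, policy, k)
               for m in sampled)
```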

Monte-Carlo Tree Search (MCTS)

Uses Monte-Carlo rollouts to estimate the value of each state in a search tree.

As more simulations are executed, the search tree grows larger and the relevant values become more accurate.

The policy used to select actions during search is also improved over time, by selecting children with higher values.

Policies trained to predict expert moves are used to narrow the search to a beam of high-probability actions, and to sample actions during rollouts.
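Below is a bare-bones MCTS sketch over the same hypothetical game callbacks, just to make the rollout/backup loop concrete; it is a generic UCT-style search, not AlphaGo’s exact algorithm (which is sketched further down).

```python
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                 # move -> Node
        self.visits, self.value_sum = 0, 0.0

    def ucb_child(self, c=1.4):
        # Balance mean value against an exploration bonus for rarely tried moves.
        return max(self.children.values(),
                   key=lambda n: n.value_sum / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(root_state, legal_moves, play, rollout_policy, terminal_value, n_sims=1000):
    root = Node(root_state)
    for _ in range(n_sims):
        node = root
        # 1. Selection: walk down while the node is fully expanded.
        while node.children and len(node.children) == len(legal_moves(node.state)):
            node = node.ucb_child()
        # 2. Expansion: add one untried move, if any remain.
        untried = [m for m in legal_moves(node.state) if m not in node.children]
        if untried:
            m = random.choice(untried)
            node.children[m] = Node(play(node.state, m), parent=node)
            node = node.children[m]
        # 3. Rollout: play to the end of the game with a fast policy.
        state, sign = node.state, 1
        while legal_moves(state):
            state = play(state, rollout_policy(state, legal_moves(state)))
            sign = -sign
        leaf_value = sign * terminal_value(state)  # value for the player to move at the leaf
        # 4. Backup: each node stores outcomes from the viewpoint of the player
        #    who moved into it, so flip the sign once and then at every level.
        value = -leaf_value
        while node is not None:
            node.visits += 1
            node.value_sum += value
            value = -value
            node = node.parent
    # As simulations accumulate, the estimates sharpen; pick the most visited move.
    return max(root.children, key=lambda m: root.children[m].visits)
```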

Deep Convolutional Neural Networks

Deep convolutional neural networks use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image.

Pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position, then use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
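A minimal PyTorch-style sketch of what such networks could look like; the layer counts, channel sizes, and the number of input feature planes are made up for illustration and are much smaller than the real AlphaGo networks.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Toy policy network: board feature planes in, one logit per intersection out."""
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),       # 1 logit per point
        )

    def forward(self, board):                    # board: (batch, in_planes, 19, 19)
        # softmax over dim 1 of the result gives p(a|s) over the 361 intersections
        return self.trunk(board).flatten(1)      # (batch, 361)

class ValueNet(nn.Module):
    """Toy value network: same kind of trunk, but a single scalar in [-1, 1]."""
    def __init__(self, in_planes=17, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * 19 * 19, 1), nn.Tanh())

    def forward(self, board):
        return self.head(self.conv(board)).squeeze(1)    # (batch,)
```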

Training with Pipeline

Train the neural networks using a pipeline consisting of several stages of machine learning.

First, train a supervised learning (SL) policy network directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high-quality gradients.

Also train a fast policy that can rapidly sample actions during rollouts.

Next, train a reinforcement learning (RL) policy network that improves the SL policy network by optimising the final outcome of games of self-play.

This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, train a value network that predicts the winner of games played by the RL policy network against itself.

AlphaGo efficiently combines the policy and value networks with MCTS.

Supervised Learning of Policy Networks

The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier non-linearities. A final softmax layer outputs a probability distribution over all legal moves a. The input s to the policy network is a simple representation of the board state.

The policy network is trained on randomly sampled state-action pairs (s, a), using stochastic gradient ascent to maximize the likelihood of the human move a selected in state s.
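In code, maximizing the likelihood of the human move is just cross-entropy training on (state, move) pairs. The sketch below assumes the toy PolicyNet from earlier (which returns logits over the 361 points) and a hypothetical human_games iterable of (board_tensor, move_index) batches; the learning rate is arbitrary.

```python
import torch
import torch.nn.functional as F

policy_net = PolicyNet()                    # toy network from the earlier sketch
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.003)  # arbitrary lr

for boards, human_moves in human_games:     # boards: (B, planes, 19, 19); moves: (B,)
    logits = policy_net(boards)             # (B, 361) move logits
    # Gradient ascent on log p_sigma(a|s) == gradient descent on cross-entropy
    # between the predicted distribution and the human expert's move.
    loss = F.cross_entropy(logits, human_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```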

Reinforcement Learning of Policy Networks

This stage aims at improving the policy network by policy gradient reinforcement learning. The RL policy network pρ is identical in structure to the SL policy network, and its weights ρ are initialised to the same values, ρ = σ.

They play games between the current policy network pρ and a randomly selected previous iteration of the policy network. Randomising from a pool of opponents stabilises training by preventing overfitting to the current policy.

They use a reward function r(s) that is zero for all non-terminal time-steps t < T. The outcome z_t = ±r(s_T) is the terminal reward at the end of the game from the perspective of the current player at time-step t: +1 for winning and −1 for losing. Weights are then updated at each time-step t by stochastic gradient ascent in the direction that maximizes expected outcome.
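A hedged sketch of that self-play update: a REINFORCE-style step that does gradient ascent on z_t · log pρ(a_t|s_t), with opponents drawn from a pool of earlier snapshots. rl_policy, play_game, and the constants are hypothetical placeholders, not the paper’s actual training code.

```python
import copy
import random
import torch
import torch.nn.functional as F

rl_policy = PolicyNet()                          # in the paper, initialised from the SL weights
optimizer = torch.optim.SGD(rl_policy.parameters(), lr=0.001)   # arbitrary lr
opponent_pool = [copy.deepcopy(rl_policy)]       # snapshots of previous iterations

for game in range(10_000):                       # arbitrary number of self-play games
    opponent = random.choice(opponent_pool)      # randomise opponents to stabilise training
    # Hypothetical helper: plays one game and returns, per time-step of the current
    # player, the board tensor, the chosen move index, and the outcome z_t (+1 / -1).
    states, moves, outcomes = play_game(rl_policy, opponent)
    logits = rl_policy(torch.stack(states))                      # (T, 361)
    idx = torch.arange(len(moves))
    log_probs = F.log_softmax(logits, dim=1)[idx, torch.tensor(moves)]
    z = torch.tensor(outcomes, dtype=torch.float32)
    # Gradient ascent on z_t * log p_rho(a_t | s_t): minimise the negative.
    loss = -(z * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if game % 500 == 499:
        opponent_pool.append(copy.deepcopy(rl_policy))           # grow the opponent pool
```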

Reinforcement Learning of Value Networks

The last stage of the training pipeline focuses on position evaluation: estimating a value function v^p(s) that predicts the outcome from position s of games played using policy p for both players.

Ideally, we would like to know the optimal value function under perfect play; in practice, we instead estimate the value function for our strongest policy, using the RL policy network pρ.

Approximate the value function using a value network vθ(s) with weights θ, vθ(s) ≈ v^pρ(s) ≈ v*(s). This neural network has a similar architecture to the policy network, but outputs a single prediction instead of a probability distribution.

Train the weights of the value network by regression on state-outcome pairs (s, z), using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value vθ(s), and the corresponding outcome z.
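The value-network stage is plain regression; a sketch, assuming the toy ValueNet from earlier and a hypothetical positions iterable of (board_tensor, outcome) batches.

```python
import torch
import torch.nn.functional as F

value_net = ValueNet()                     # toy network from the earlier sketch
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.003, momentum=0.9)  # arbitrary

for boards, outcomes in positions:         # outcomes: +1 / -1 winner labels, shape (B,)
    predicted = value_net(boards)          # (B,) values in [-1, 1]
    loss = F.mse_loss(predicted, outcomes.float())   # MSE between v_theta(s) and z
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```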

Overfitting by Predicting Outcomes from Complete Games

The naive approach of predicting game outcomes from data consisting of complete games leads to overfitting.

The problem is that successive positions are strongly correlated, differing by just one stone, but the regression target is shared for the entire game.

When trained on the KGS dataset in this way, the value network memorised the game outcomes rather than generalising to new positions.

To mitigate this problem, they generated a new self-play data set consisting of 30 million distinct positions, each sampled from a separate game. Each game was played between the RL policy network and itself until the game terminated.
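The decorrelation trick is simple to express: keep exactly one randomly chosen position per self-play game, paired with the final outcome from that position’s player’s perspective. self_play_game and player_to_move are hypothetical helpers in this sketch.

```python
import random

dataset = []
for _ in range(30_000_000):                       # one kept position per game
    positions, winner = self_play_game()          # hypothetical: RL network vs itself
    t = random.randrange(len(positions))          # sample a single position per game
    s = positions[t]
    z = 1 if winner == player_to_move(s) else -1  # outcome from s's player's perspective
    dataset.append((s, z))                        # decorrelated (state, outcome) pair
```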

Searching with Policy and Value Networks

AlphaGo combines the policy and value networks in an MCTS algorithm that selects actions by lookahead search.

Each edge (s, a) of the search tree stores an action value Q(s, a), visit count N(s, a), and prior probability P(s, a).

The tree is traversed by simulation starting from the root state. At each time-step t of each simulation, an action a_t is selected from state s_t so as to maximize action value plus a bonus u(s, a) ∝ P(s, a) / (1 + N(s, a)), which is proportional to the prior probability but decays with repeated visits to encourage exploration.

When the traversal reaches a leaf node at step L, the leaf node may be expanded. The leaf position is processed just once by the SL policy network. The output probabilities are stored as prior probabilities P for each legal action.

The leaf node is evaluated in two very different ways: first, by the value network and second, by the outcome of a random rollout played out until terminal step T using the fast rollout policy pπ; these evaluations are combined, using a mixing parameter λ, into a leaf evaluation V(s_L) = (1 − λ)·vθ(s_L) + λ·z_L.

At the end of simulation n, the action values and visit counts of all traversed edges are updated. Each edge accumulates the visit count and mean evaluation of all simulations passing through that edge.

Once the search is complete, the algorithm chooses the most visited move from the root position.
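Pulling the search details together, here is a sketch of the per-edge bookkeeping: the selection rule Q + u, the mixed leaf evaluation with λ, the running-mean backup, and the final most-visited-move choice. The dictionary layout, c_puct, and lam are my own simplifications of what the paper describes.

```python
from collections import defaultdict

# Per-edge search statistics, keyed by (state, action).
Q = defaultdict(float)   # Q(s, a): mean evaluation of simulations through the edge
N = defaultdict(int)     # N(s, a): visit count
P = {}                   # P(s, a): prior, filled in when s is expanded by the SL policy network

def select_action(s, legal_moves, c_puct=5.0):
    # a_t = argmax_a [ Q(s, a) + u(s, a) ],  with u(s, a) ∝ P(s, a) / (1 + N(s, a)).
    return max(legal_moves,
               key=lambda a: Q[(s, a)] + c_puct * P[(s, a)] / (1 + N[(s, a)]))

def evaluate_leaf(s_L, value_net, rollout_outcome, lam=0.5):
    # V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L.
    return (1 - lam) * value_net(s_L) + lam * rollout_outcome

def backup(traversed_edges, V):
    # Each traversed edge accumulates its visit count and the running mean
    # of the leaf evaluations of all simulations passing through it.
    for (s, a) in traversed_edges:
        N[(s, a)] += 1
        Q[(s, a)] += (V - Q[(s, a)]) / N[(s, a)]

def best_move(root_state, legal_moves):
    # Once the search is complete, play the most visited move from the root.
    return max(legal_moves, key=lambda a: N[(root_state, a)])
```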

Computational Resources

To efficiently combine MCTS with deep neural networks, AlphaGo uses an asynchronous multi-threaded search that executes simulations on CPUs, and computes policy and value networks in parallel on GPUs. The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs.