While teaching reinforcement learning I kept asking myself what AlphaZero teaches us about RL. That question has lead to this post. This post generalizes AlphaZero to a larger class of RL algorithms by reinterpreting AlphaZero’s policy network as a belief function — as the probability that is the optimal action at state . This gives rise to a “belief gradient” algorithm.

A fundamental question is whether belief gradient is a better conceptual framework for RL than policy gradient. I have always felt that there is something wrong with policy gradient methods. The optimal policy is typically deterministic while policy gradient methods rely on significant exploration (policy stochasticity) to compute a gradient. It just seems paradoxical. The belief gradient approach seems to resolve this paradox.

**The action belief function and a belief gradient algorithm. **For a given state in a given Markov decision process (MDP) with a finite action space there exists a deterministic optimal policy mapping each state to an optimal action . For each decision (in life?) there is some best choice that is generally impossible to know. I propose reinterpreting AlphaZero’s policy network as giving an estimator for the probability that is the best choice — that . To make this reinterpretation explicit I will change notation and write for the belief that is the optimal choice. The notation is generally interpreted as specifying stochastic (erratic?) behavior.

To make the mathematics clearer I will assume that the belief is computed by a softmax

where is a vector representation of the state . The softmax is not as important here as the idea that provides only incomplete knowledge of . Further observations on are possible. For example, we can grow a search tree rooted as . For a given distribution on states we then get a distribution on the pairs . We assume that we also have some way of computing a more informed belief . AlphaZero computes a replay buffer containing pairs where the stored distribution is . Ideally the belief would match the marginal beliefs over the more informed search tree results. This motivates the following were denotes the replay buffer.

(1)

This is just the gradient of a cross-entropy loss from the replay buffer to the belief .

**Some AlphaZero details.** For the sake of completeness I will describe a few more details of the AlphaZero algorithm. The algorithm computes rollouts to the end of an episode (game) where each action in the rollout is selected based on tree search. Each rollout adds a set of pairs to the replay buffer where is a state in that rollout and is a recording of . Each rollout also has a reward (was the game won or lost). AlphaZero stores a second set of replay values for each rollout state state where is the reward of that rollout. The replay pairs are used to train a value function estimating the reward that will be achieved from state . The value function acts as a kind of static board evaluator in growing the tree .

**Improved generality.** The abstract mathematical formulation given in equation (1) is more general than tree search. In the belief we can have that is any information about that usefully augments the information provided in .

**A belief gradient theorem.** While I have reinterpreted policies as beliefs, and have recast AlphaZero’s algorithm in that light, I have not provided a belief gradient theorem. The simplest such theorem is for imitation learning. Assume an expert that labels states with actions. Optimizing the cross-entropy loss for this labeled data yields a probability . Learning a probability does not imply that the agent should make a random choice. Belief is not the same as choice.