Reinterpreting AlphaZero

While teaching reinforcement learning I kept asking myself what AlphaZero teaches us about RL. That question led to this post, which generalizes AlphaZero to a larger class of RL algorithms by reinterpreting AlphaZero’s policy network as a belief function — the probability that a is the optimal action at state s.  This reinterpretation gives rise to a “belief gradient” algorithm.

A fundamental question is whether belief gradient is a better conceptual framework for RL than policy gradient.  I have always felt that there is something wrong with policy gradient methods: the optimal policy is typically deterministic, while policy gradient methods rely on significant exploration (policy stochasticity) to compute a gradient.  It just seems paradoxical.  The belief gradient approach seems to resolve this paradox.

The action belief function and a belief gradient algorithm. For any Markov decision process (MDP) with a finite action space {\cal A} there exists a deterministic optimal policy \pi^* mapping each state s to an optimal action \pi^*(s) \in {\cal A}. For each decision (in life?) there is some best choice that is generally impossible to know. I propose reinterpreting AlphaZero’s policy network \pi_\Phi(a|s) as an estimator of the probability that a is the best choice — that a = \pi^*(s).  To make this reinterpretation explicit I will change notation and write B_\Phi(a|s) for the belief that a is the optimal choice.  The notation \pi_\Phi(a|s) is generally interpreted as specifying stochastic (erratic?) behavior.

To make the mathematics clearer I will assume that the belief B_\Phi(a|s) is computed by a softmax

B_\Phi(a|s) = \mathrm{softmax}_a \;s_\Phi(a|O_\Phi(s))

where O_\Phi(s) is a vector representation of the state s.  The softmax is less important here than the idea that O_\Phi(s) provides only incomplete knowledge of s.  Further observations of s are possible.  For example, we can grow a search tree T_\Phi(s) rooted at s.  For a given distribution on states s we then get a distribution on the pairs (O_\Phi(s),T_\Phi(s)).  We assume that we also have some way of computing a more informed belief B_\Phi(a\;|\;O_\Phi(s),T_\Phi(s)). AlphaZero maintains a replay buffer containing pairs (s,B(a)) where the stored distribution B(a) is B_\Phi(a\;|\;O_\Phi(s),\;T_\Phi(s)).  Ideally the less informed belief B_\Phi(a\;|\;O_\Phi(s)) would match the marginal, over the distribution on search trees, of these more informed beliefs. This motivates the following update, where R denotes the replay buffer.

(1) \Delta \Phi \propto E_{s,B(a) \sim R,\;a\sim B(a)}\;\nabla_\Phi\;\ln B_\Phi(a\;|\;O_\Phi(s))

This is just the gradient of the cross-entropy loss from the replay-buffer distributions B(a) to the belief B_\Phi(a\;|\;O_\Phi(s)).
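As a concrete illustration, here is a minimal sketch of this belief network and of the update in equation (1), written in PyTorch.  The architecture, the dimensions, and the minibatch format (states paired with their tree-informed target beliefs drawn from R) are my own illustrative assumptions, not AlphaZero’s actual implementation, and the expectation over a \sim B(a) is computed exactly rather than by sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeliefNet(nn.Module):
    def __init__(self, state_dim, obs_dim, num_actions):
        super().__init__()
        # O_Phi(s): an (incomplete) vector observation of the state s
        self.observe = nn.Sequential(nn.Linear(state_dim, obs_dim), nn.ReLU())
        # s_Phi(a | O_Phi(s)): one score per action
        self.score = nn.Linear(obs_dim, num_actions)

    def log_belief(self, states):
        # log B_Phi(a | O_Phi(s)) is the log-softmax of the action scores
        return F.log_softmax(self.score(self.observe(states)), dim=-1)

def belief_gradient_step(net, optimizer, states, target_beliefs):
    # Equation (1): the gradient of the cross-entropy from the replay
    # targets B(a) (the tree-informed beliefs) to B_Phi(a | O_Phi(s)).
    log_b = net.log_belief(states)                       # [batch, num_actions]
    loss = -(target_beliefs * log_b).sum(dim=-1).mean()  # E_{a ~ B(a)} [-ln B_Phi(a | O_Phi(s))]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A training loop would repeatedly sample minibatches of (state, B(a)) pairs from the replay buffer R and call belief_gradient_step.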

Some AlphaZero details. For the sake of completeness I will describe a few more details of the AlphaZero algorithm.  The algorithm computes rollouts to the end of an episode (game) where each action in the rollout is selected based on tree search. Each rollout adds a set of pairs (s,\;B(a)) to the replay buffer, where s is a state in that rollout and B(a) is a recording of B_\Phi(a\;|\;O_\Phi(s),T_\Phi(s)). Each rollout also has a reward z \in \{-1,1\} (was the game won or lost).  AlphaZero stores a second set of replay pairs (s,z), one for each rollout state s, where z is the reward of that rollout. The replay pairs (s,z) are used to train a value function V_\Phi(s) estimating the reward that will be achieved from state s.  The value function V_\Phi(s) acts as a kind of static board evaluator when growing the tree T_\Phi(s).
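The rollout bookkeeping just described can be sketched as follows.  This is only a schematic: search_belief (which grows the tree T_\Phi(s) and returns B_\Phi(a\;|\;O_\Phi(s),T_\Phi(s))), sample_action, and the game interface are hypothetical placeholders, and only the replay-buffer bookkeeping and the value regression are meant to be taken literally.

```python
import torch.nn.functional as F

def play_episode(game, net, replay_beliefs, replay_values):
    # One rollout to the end of an episode; actions are chosen from the
    # tree-informed belief returned by the (placeholder) search_belief.
    visited = []
    s, done = game.reset(), False
    while not done:
        B = search_belief(net, s)            # B_Phi(a | O_Phi(s), T_Phi(s))
        visited.append((s, B))
        s, done = game.step(sample_action(B))
    z = game.outcome()                       # z in {-1, +1}: was the game won or lost
    for s_i, B_i in visited:
        replay_beliefs.append((s_i, B_i))    # targets for B_Phi(a | O_Phi(s))
        replay_values.append((s_i, z))       # targets for the value function V_Phi(s)

def value_step(value_net, optimizer, states, outcomes):
    # Regress V_Phi(s) toward the rollout reward z; the trained V_Phi(s)
    # then serves as the static evaluator used when growing T_Phi(s).
    loss = F.mse_loss(value_net(states).squeeze(-1), outcomes)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```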

Improved generality. The abstract mathematical formulation given in equation (1) is more general than tree search.  In the belief B_\Phi(a\;|\;O_\Phi(s),\;T_\Phi(s)), T_\Phi(s) can be any additional information about s that usefully augments the information provided by O_\Phi(s).

A belief gradient theorem. While I have reinterpreted policies as beliefs, and have recast AlphaZero’s algorithm in that light, I have not provided a belief gradient theorem.  The simplest such theorem is for imitation learning.  Assume an expert that labels states with actions.  Optimizing the cross-entropy loss on this labeled data yields a probability B_\Phi(a|s): in the limit of sufficient data and model capacity, the minimizer of the cross-entropy loss is the true conditional probability that the expert chooses a in state s.  Learning a probability does not imply that the agent should make a random choice. Belief is not the same as choice.
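For concreteness, here is a minimal sketch of this imitation-learning case, reusing the BeliefNet and imports from the earlier sketch (again my own illustrative assumption rather than a description of any particular system).  The expert labels are fit with a cross-entropy loss, and at decision time the agent acts on the resulting belief rather than sampling from it.

```python
def imitation_step(net, optimizer, states, expert_actions):
    # Cross-entropy against the expert's action labels; the minimizer is a
    # probability B_Phi(a|s), i.e. a belief about which action is optimal.
    loss = F.nll_loss(net.log_belief(states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(net, state):
    # Belief is not the same as choice: pick the action believed most likely
    # to be optimal rather than sampling from B_Phi(a|s).
    return net.log_belief(state.unsqueeze(0)).argmax(dim=-1).item()
```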
