While teaching reinforcement learning I kept asking myself what AlphaZero teaches us about RL. That question has led to this post. This post generalizes AlphaZero to a larger class of RL algorithms by reinterpreting AlphaZero’s policy network as a belief function — as the probability that $a$ is the optimal action at state $s$. This gives rise to a “belief gradient” algorithm.
A fundamental question is whether belief gradient is a better conceptual framework for RL than policy gradient. I have always felt that there is something wrong with policy gradient methods. The optimal policy is typically deterministic while policy gradient methods rely on significant exploration (policy stochasticity) to compute a gradient. It just seems paradoxical. The belief gradient approach seems to resolve this paradox.
The action belief function and a belief gradient algorithm. For a given Markov decision process (MDP) with a finite action space there exists a deterministic optimal policy $\pi^*$ mapping each state $s$ to an optimal action $a^*(s)$. For each decision (in life?) there is some best choice that is generally impossible to know. I propose reinterpreting AlphaZero’s policy network $\pi_\Phi(a \mid s)$ as giving an estimator for the probability that $a$ is the best choice — that $a = a^*(s)$. To make this reinterpretation explicit I will change notation and write $B_\Phi(a \mid s)$ for the belief that $a$ is the optimal choice. The notation $\pi_\Phi(a \mid s)$ is generally interpreted as specifying stochastic (erratic?) behavior.
To make the mathematics clearer I will assume that the belief is computed by a softmax

$B_\Phi(a \mid s) \;=\; \underset{a}{\mathrm{softmax}}\; f_\Phi(a,\, s_\Phi)$

where $s_\Phi$ is a vector representation of the state $s$. The softmax is not as important here as the idea that $B_\Phi(a \mid s)$ provides only incomplete knowledge of $a^*(s)$. Further observations on $a^*(s)$ are possible. For example, we can grow a search tree $T_s$ rooted at $s$. For a given distribution on states $s$ we then get a distribution on the pairs $(s, T_s)$. We assume that we also have some way of computing a more informed belief $B(a \mid s, T_s)$. AlphaZero computes a replay buffer containing pairs $(s, B_s)$ where the stored distribution $B_s$ is $B(a \mid s, T_s)$. Ideally the belief $B_\Phi(a \mid s)$ would match the marginal beliefs over the more informed search tree results. This motivates the following update, where $R$ denotes the replay buffer.

$\Phi \;\leftarrow\; \Phi \;-\; \eta\, \nabla_\Phi\; E_{(s,\,B_s) \sim R}\left[\, -\sum_a B_s(a)\, \log B_\Phi(a \mid s) \,\right] \;\;\;\;\;(1)$

This is just the gradient of a cross-entropy loss from the replay buffer to the belief $B_\Phi$.
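To make equation (1) concrete, here is a minimal PyTorch sketch of one belief-gradient step. The network architecture, the dimensions, and the mini-batch format (BeliefNet, STATE_DIM, NUM_ACTIONS, and random tensors standing in for samples from $R$) are illustrative assumptions, not details of AlphaZero itself.

```python
# Minimal sketch of the belief-gradient update in equation (1), using PyTorch.
# BeliefNet, STATE_DIM, NUM_ACTIONS and the batch format are illustrative assumptions.
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS = 64, 8

class BeliefNet(nn.Module):
    """Maps a state to a vector of action scores; softmax over them gives B_Phi(a|s)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                                  nn.Linear(128, NUM_ACTIONS))

    def forward(self, states):                    # states: [batch, STATE_DIM]
        return self.body(states)                  # scores: [batch, NUM_ACTIONS]

def belief_gradient_step(net, optimizer, states, target_beliefs):
    """One step on the cross-entropy from replay targets B_s to the belief B_Phi(.|s)."""
    log_b_phi = torch.log_softmax(net(states), dim=-1)        # log B_Phi(a|s)
    loss = -(target_beliefs * log_b_phi).sum(dim=-1).mean()   # cross-entropy H(B_s, B_Phi)
    optimizer.zero_grad()
    loss.backward()       # the gradient appearing in equation (1)
    optimizer.step()
    return loss.item()

# Usage with random tensors standing in for a mini-batch sampled from the replay buffer R.
net = BeliefNet()
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
states = torch.randn(32, STATE_DIM)
targets = torch.softmax(torch.randn(32, NUM_ACTIONS), dim=-1)
belief_gradient_step(net, opt, states, targets)
```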
Some AlphaZero details. For the sake of completeness I will describe a few more details of the AlphaZero algorithm. The algorithm computes rollouts to the end of an episode (game) where each action in the rollout is selected based on tree search. Each rollout adds a set of pairs $(s, B_s)$ to the replay buffer where $s$ is a state in that rollout and $B_s$ is a recording of $B(a \mid s, T_s)$. Each rollout also has a reward $z$ (was the game won or lost). AlphaZero stores a second set of replay values $(s, z)$ for each rollout state $s$ where $z$ is the reward of that rollout. The replay pairs $(s, z)$ are used to train a value function $V_\Phi(s)$ estimating the reward that will be achieved from state $s$. The value function $V_\Phi(s)$ acts as a kind of static board evaluator in growing the tree $T_s$.
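As a rough illustration of this bookkeeping, the following Python sketch records the two kinds of replay pairs from a single rollout and fits the value function to the recorded rewards. The environment interface, run_tree_search, and v_net are hypothetical placeholders; only the shape of the stored data follows the description above.

```python
# Schematic sketch of the rollout/replay bookkeeping described above.
# env, run_tree_search, and v_net are hypothetical placeholders.
import torch

def collect_rollout(env, run_tree_search, belief_replay, value_replay):
    """Play one episode; record (s, B_s) pairs and, once the outcome is known, (s, z) pairs."""
    visited = []                           # states visited in this rollout
    s = env.reset()
    done = False
    while not done:
        B_s = run_tree_search(s)           # more informed belief B(a | s, T_s), as a dict action -> prob
        belief_replay.append((s, B_s))     # replay pair used in the belief-gradient update (1)
        visited.append(s)
        a = max(B_s, key=B_s.get)          # act on the search result (greedy here, for simplicity)
        s, z, done = env.step(a)           # z is the final reward once done is True
    for s_t in visited:
        value_replay.append((s_t, z))      # replay pairs used to train the value function V_Phi

def value_regression_step(v_net, optimizer, states, rewards):
    """One step of fitting V_Phi(s) to the recorded rollout rewards by squared error."""
    loss = ((v_net(states).squeeze(-1) - rewards) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```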
Improved generality. The abstract mathematical formulation given in equation (1) is more general than tree search. In the belief $B(a \mid s, T_s)$ we can take $T_s$ to be any information about $a^*(s)$ that usefully augments the information provided in $s$.
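As one hypothetical instance of this generality, the sketch below computes a target belief from a few extra Monte Carlo return estimates per action rather than from a search tree. The function informed_belief, the simulate_return callback, and the softmax-over-estimated-returns rule are all assumptions made for illustration; the resulting distribution simply plays the role of $B_s$ in equation (1).

```python
# Hypothetical example of a "more informed belief" that does not use a search tree:
# the auxiliary observation is a batch of Monte Carlo return estimates per action.
import numpy as np

def informed_belief(state, simulate_return, actions, num_sims=16, temperature=1.0):
    """Return a target distribution over actions computed from extra simulation results.

    simulate_return(state, action) is an assumed callback returning a sampled return."""
    mean_returns = np.array([
        np.mean([simulate_return(state, a) for _ in range(num_sims)])
        for a in actions
    ])
    scores = mean_returns / temperature
    exp_scores = np.exp(scores - scores.max())    # numerically stable softmax
    return exp_scores / exp_scores.sum()          # plays the role of B_s in equation (1)
```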
A belief gradient theorem. While I have reinterpreted policies as beliefs, and have recast AlphaZero’s algorithm in that light, I have not provided a belief gradient theorem. The simplest such theorem is for imitation learning. Assume an expert that labels states with actions. Optimizing the cross-entropy loss for this labeled data yields a probability $B_\Phi(a \mid s)$, an estimate of the probability that the expert labels $s$ with $a$. Learning a probability does not imply that the agent should make a random choice. Belief is not the same as choice.
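As a small sketch of that imitation-learning setting, reusing the illustrative BeliefNet from the earlier sketch: training minimizes cross-entropy against the expert's action labels, while the deployed agent still picks the argmax action, which is one way to read “belief is not the same as choice”. The function names here are assumptions, not part of any established algorithm.

```python
# Sketch of imitation learning as belief learning: train with cross-entropy on expert
# labels, act deterministically at decision time. BeliefNet is the illustrative
# network defined in the earlier sketch.
import torch
import torch.nn.functional as F

def imitation_step(net, optimizer, states, expert_actions):
    """Cross-entropy on expert-labeled states; expert_actions holds action indices."""
    loss = F.cross_entropy(net(states), expert_actions)   # -log B_Phi(a_expert | s), averaged
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(net, state):
    """Deployment: the belief is probabilistic, the choice is deterministic."""
    with torch.no_grad():
        return net(state.unsqueeze(0)).argmax(dim=-1).item()
```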