## Thoughts from TTIC31230: Hyperparameter Conjugacy

I have just finished teaching TTIC31230: Fundamentals of Deep Learning.  This is the first of a series of posts on thoughts arising from teaching this class.

While this post is about hyperparameter search, I want to mention in passing some issues that do not seem to rise to the level of a full post.

Teaching backprop. When we teach backprop we should stop talking about “computational graphs” and talk instead about  programs defined by a sequence of in-line assignments.  I present backprop as an algorithm that runs on in-line code and I give a loop invariant for the backward loop over the assignment statements.
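A minimal sketch of this view, assuming a toy three-statement program; the tuple encoding and the names `PROGRAM`, `forward` and `backward` are hypothetical and are not the EDF code. The invariant stated in the comment is one common phrasing and may differ from the one used in class.

```python
# A straight-line program: each entry is
# (output name, input names, forward function, local partial derivatives).
PROGRAM = [
    ("u", ("w", "x"), lambda w, x: w * x, lambda w, x: (x, w)),
    ("z", ("u", "b"), lambda u, b: u + b, lambda u, b: (1.0, 1.0)),
    ("loss", ("z",), lambda z: z * z, lambda z: (2.0 * z,)),
]

def forward(program, inputs):
    """Run the assignments in order and return all variable values."""
    env = dict(inputs)
    for out, ins, f, _ in program:
        env[out] = f(*(env[name] for name in ins))
    return env

def backward(program, env, loss_name="loss"):
    """Loop over the assignments in reverse, accumulating gradients.

    Assumed invariant: just before statement k is processed, grad[v] for
    every variable v defined up to and including statement k is the
    derivative of the loss with respect to v, where the loss is viewed as
    a function of the variables defined through statement k.
    """
    grad = {name: 0.0 for name in env}
    grad[loss_name] = 1.0
    for out, ins, _, partials in reversed(program):
        local = partials(*(env[name] for name in ins))
        for name, d in zip(ins, local):
            grad[name] += grad[out] * d
    return grad

env = forward(PROGRAM, {"w": 3.0, "x": 2.0, "b": -1.0})
grad = backward(PROGRAM, env)
```

Here the loss is $(wx+b)^2$, so the gradients can be checked by hand against the chain rule.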

Teaching Frameworks. My class provides a 150 line implementation of a framework (the educational framework EDF) written in Python/NumPy.  The idea is to teach what a framework is at a conceptual level rather than teaching the details of any actual industrial strength framework.  There are no machine problems in my class — the class is an “algorithms course” in contrast to a “programming course”.

I also have a complaint about the way that PyTorch handles parameters. In EDF, modules take “parameter packages” (Python objects) as arguments.  This simplifies parameter sharing and maintains object structure over the parameters rather than reducing them to lists of tensors.

Einstein Notation. When we teach deep learning we should be using Einstein notation.  This means that we write out all the indices.  This goes very well with “loop notation”.  For example we can apply a convolution filter $W[\Delta x,\Delta y,i,j]$ to a layer $L_{\mathrm{in}}[x,y,j]$ using the following program.

for $x$, $y$, $i$: $L'_{\mathrm{out}}[x,y,i]=0$

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$: $L'_{\mathrm{out}}[x,y,i] \;+\!\!= \; W[\Delta x,\Delta y,i,j]L_{\mathrm{in}}[x+\Delta x,y+\Delta y,j]$

for $x$, $y$, $i$: $L_{\mathrm{out}}[x,y,i] \;= \sigma(L'_{\mathrm{out}}[x,y,i])$

Of course we can also draw pictures. The initialization to zero can be left implicit: all data tensors other than the input are implicitly initialized to zero.  The body of the “tensor contraction loop” is just a product of two scalars. The back propagation on a product of two scalars is trivial.  To back-propagate to the filter we have:

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$: $W.\mathrm{grad}[\Delta x,\Delta y,i,j] \;+\!\!=\;L'_{\mathrm{out}}.\mathrm{grad}[x,y,i]L_{\mathrm{in}}[x+\Delta x,y+\Delta y,j]$
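The contraction loop and its filter gradient translate almost literally into Python. The following is a sketch only: the function names are hypothetical, and the index ranges assume no padding (so $x+\Delta x$ and $y+\Delta y$ always stay inside the input), which the loop notation above leaves unspecified. The nonlinearity $\sigma$ is omitted; `G` stands in for $L'_{\mathrm{out}}.\mathrm{grad}$.

```python
from itertools import product

import numpy as np

def conv_forward(W, L_in):
    """Compute the pre-activation L'_out[x, y, i] with explicit index loops."""
    DX, DY, I, J = W.shape
    X = L_in.shape[0] - DX + 1  # assumed "no padding" output size
    Y = L_in.shape[1] - DY + 1
    L_pre = np.zeros((X, Y, I))
    for x, y, dx, dy, i, j in product(range(X), range(Y), range(DX),
                                      range(DY), range(I), range(J)):
        L_pre[x, y, i] += W[dx, dy, i, j] * L_in[x + dx, y + dy, j]
    return L_pre

def conv_backward_filter(L_pre_grad, L_in, W_shape):
    """Accumulate W.grad from L'_out.grad using the same index loop."""
    DX, DY, I, J = W_shape
    X, Y, _ = L_pre_grad.shape
    W_grad = np.zeros(W_shape)
    for x, y, dx, dy, i, j in product(range(X), range(Y), range(DX),
                                      range(DY), range(I), range(J)):
        W_grad[dx, dy, i, j] += L_pre_grad[x, y, i] * L_in[x + dx, y + dy, j]
    return W_grad

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2, 3, 2))     # W[dx, dy, i, j]
L_in = rng.normal(size=(5, 5, 2))     # L_in[x, y, j]
G = rng.normal(size=(4, 4, 3))        # stand-in for L'_out.grad[x, y, i]
W_grad = conv_backward_filter(G, L_in, W.shape)
```

Because the contraction is linear in $W$, the result can be checked against a finite difference of $\sum_{x,y,i} G[x,y,i]\,L'_{\mathrm{out}}[x,y,i]$ with respect to any single filter entry.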

Since essentially all deep learning models consist of tensor contractions and scalar nonlinearities, we do not have to discuss Jacobian matrices.  Also, in Einstein notation we have mnemonic variable names such as $x$, $y$ and $i$ for tensor indices which, for me, greatly clarifies the notation.  Yet another point is that we can easily insert a batch index into the data tensors when explaining minibatching.

Of course NumPy is most effective when using the index-free tensor operations.  The students see that style in the EDF code.  However, when explaining models conceptually I greatly prefer “Einstein loop notation”.

Hyperparameter Conjugacy.  We now come to the real topic of this post. A lot has been written about hyperparameter search.  I believe that hyperparameter search can be greatly facilitated by simple reparameterizations that improve hyperparameter conjugacy.  Conjugacy means that changing the value of a parameter $x$ does not change (or only minimally influences) the optimal value of another parameter $y$.  More formally, for a loss ${\cal L}$ viewed as a function of the hyperparameters, we would like to have

### $\frac{\partial^2 {\cal L}}{\partial x \partial y} = 0.$
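As a small illustration, assume a toy quadratic loss with a single interaction term (chosen here only to make the condition concrete):

${\cal L}(x,y) = a x^2 + b y^2 + c\,xy, \quad a, b > 0$

Setting $\partial {\cal L}/\partial y = 0$ gives the optimal $y$ for a fixed $x$:

$y^*(x) = -\frac{c}{2b}\,x$

The mixed partial is $\frac{\partial^2 {\cal L}}{\partial x \partial y} = c$, so in this toy case the optimal $y$ is independent of $x$ exactly when the mixed partial vanishes, and $y$ can then be tuned once and kept fixed while $x$ is searched.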

Perhaps the simplest example is the relationship between batch size and learning rate.  It seems well established now that the learning rate should be scaled up linearly in batch size as one moves to larger batches.  See Don’t Decay the Learning Rate, Increase the Batch Size by Smith et al. But note that if we simply redefine the learning rate parameter to be the learning rate appropriate for a batch size of 1, and simply change the minibatching convention to update the model by the sum of gradients rather than the average gradient, then the learning rate and the batch size become conjugate — we can optimize the learning rate on a small machine and leave it the same when we move to a bigger machine allowing a bigger batch. We can also think of this as giving the learning rate a semantics independent of the batch size.  A very simple argument for this particular conjugacy is given in my SGD slides.
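The reparameterization itself is just an algebraic identity, which the following sketch checks on a toy least-squares model. The model and the names `per_example_grads`, `step_mean` and `step_sum` are assumptions for illustration; the sketch shows only that summing gradients with a batch-size-1 learning rate is the same update as averaging with a learning rate scaled by the batch size, not the empirical claim that this scaling is optimal.

```python
import numpy as np

def per_example_grads(phi, xs, ys):
    """Gradient of 0.5 * (phi . x - y)^2 with respect to phi, per example."""
    residual = xs @ phi - ys
    return residual[:, None] * xs

def step_mean(phi, xs, ys, lr):
    """Conventional update: learning rate times the average gradient."""
    return phi - lr * per_example_grads(phi, xs, ys).mean(axis=0)

def step_sum(phi, xs, ys, eta1):
    """Reparameterized update: batch-size-1 learning rate times the gradient sum."""
    return phi - eta1 * per_example_grads(phi, xs, ys).sum(axis=0)

rng = np.random.default_rng(0)
phi = rng.normal(size=3)
eta1 = 0.01  # learning rate with batch-size-1 semantics

matches = []
for B in (1, 8, 64):
    xs, ys = rng.normal(size=(B, 3)), rng.normal(size=B)
    summed = step_sum(phi, xs, ys, eta1)
    averaged = step_mean(phi, xs, ys, lr=B * eta1)  # linear scaling rule
    matches.append(np.allclose(summed, averaged))
```

With the sum convention, `eta1` is the only number that needs tuning, and it keeps the same meaning at every batch size.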

The most serious loss of conjugacy is the standard parameterization of momentum. The standard parameterization strongly couples the momentum parameter with the learning rate. For most frameworks we have $\begin{array}{rcl} v & = & \left(1-\frac{1}{N}\right) v + \eta \hat{g} \\ \Phi & -\!\!= & v\end{array}$

where $(1- 1/N)$ is the momentum parameter, $\eta$ is the learning rate, $\hat{g}$ is the gradient of a single minibatch, and $\Phi$ is the system of model parameters.  This can be rewritten as $\begin{array}{rcl} v & = & \left(1-\frac{1}{N}\right) v + \frac{1}{N}(N\eta \hat{g}) \\ \Phi & -\!\!= & v\end{array}$

A recurrence of the form $y_{t+1}= (1-1/N)y_t + (1/N)x_t$ yields that $y$ is a running average of $x$.  The running average of $x$ is linear in $x$ so the above formulation of momentum can be rewritten as $\begin{array}{rcl} \mu & = & \left(1-\frac{1}{N}\right)\mu + \frac{1}{N}\;\hat{g} \\ \Phi & -\!\!= & N \eta \mu\end{array}$

Now we begin to see the problem.  Each individual gradient makes a total contribution to $\Phi$ of size $N \eta$: its weight in $v$ decays geometrically, and $\eta\sum_{t \geq 0}(1-1/N)^t = N\eta$.  If the parameter vector $\Phi$ remains relatively constant on the time scale of $N$ updates (where $N$ is typically 10, corresponding to a momentum parameter of 0.9) then all we have done by adding momentum is to change the learning rate from $\eta$ to $N\eta$.  PyTorch starts from a superficially different but actually equivalent definition of the momentum and suffers from the same coupling of the momentum term with the learning rate. But a trivial patch is to use the following more conjugate formulation of momentum: $\begin{array}{rcl} \mu & = & \left(1-\frac{1}{N}\right)\mu + \frac{1}{N}\;\hat{g} \\ \Phi & -\!\!= & \eta \mu\end{array}$
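The equivalence of the two forms, and the $N\eta$ total contribution of a single gradient, can be checked numerically. This is a sketch under stated assumptions: the function names, the toy quadratic objective and the impulse gradient are all hypothetical, and both recurrences start from a zero state.

```python
import numpy as np

def standard_momentum(grad_fn, phi0, eta, N, steps):
    """v = (1 - 1/N) v + eta * g;  Phi -= v."""
    phi, v = np.array(phi0, dtype=float), 0.0
    for t in range(steps):
        v = (1 - 1 / N) * v + eta * grad_fn(phi, t)
        phi = phi - v
    return phi

def averaged_momentum(grad_fn, phi0, eta, N, steps, scale):
    """mu = (1 - 1/N) mu + g / N;  Phi -= scale * eta * mu.

    scale=N reproduces the standard form; scale=1 is the conjugate form.
    """
    phi, mu = np.array(phi0, dtype=float), 0.0
    for t in range(steps):
        mu = (1 - 1 / N) * mu + grad_fn(phi, t) / N
        phi = phi - scale * eta * mu
    return phi

def quadratic(phi, t):
    """Gradient of the toy objective 0.5 * ||phi - 1||^2."""
    return phi - 1.0

def impulse(phi, t):
    """A unit gradient at step 0 and zero afterwards."""
    return np.ones_like(phi) if t == 0 else np.zeros_like(phi)

eta, N = 0.01, 10
a = standard_momentum(quadratic, [5.0, -3.0], eta, N, steps=50)
b = averaged_momentum(quadratic, [5.0, -3.0], eta, N, steps=50, scale=N)

# Total displacement caused by one unit gradient under each parameterization.
moved_standard = -standard_momentum(impulse, [0.0], eta, N, steps=500)[0]
moved_conjugate = -averaged_momentum(impulse, [0.0], eta, N, steps=500, scale=1)[0]
```

Here `a` and `b` agree up to floating-point error, `moved_standard` approaches $N\eta$, and `moved_conjugate` approaches $\eta$, so in the conjugate form changing $N$ no longer rescales the effective learning rate.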

The slides summarize a fully conjugate (or approximately so) parameterization of the learning rate, batch size and momentum parameters.

There appears to be no simple reparameterization of the adaptive algorithms RMSProp and Adam that is conjugate to batch size.  The problem is that the gradient variances are computed from batch gradients, and variance estimates computed from large batches, even if “corrected”, contain less information than variances estimated at batch size 1.  A patch would be to keep track of $\sum \;\hat{g}^2$ for each individual $\hat{g}$ within a batch.  But this involves a modification to the backpropagation calculation. This is all discussed in more detail in the slides.
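To make the difference concrete, here is a sketch on a toy linear least-squares model, where per-example gradients can be written down directly. The model and all names are assumptions for illustration; in a real framework, standard backpropagation returns only the batch gradient, which is why the patch requires modifying it.

```python
import numpy as np

def per_example_grads(phi, xs, ys):
    """Gradient of 0.5 * (phi . x - y)^2 with respect to phi, per example."""
    residual = xs @ phi - ys
    return residual[:, None] * xs

rng = np.random.default_rng(0)
xs, ys = rng.normal(size=(32, 3)), rng.normal(size=32)
phi = np.zeros(3)

g = per_example_grads(phi, xs, ys)           # shape (32, 3)

# What a batch-level optimizer sees: the square of the averaged gradient.
batch_grad = g.mean(axis=0)
second_moment_from_batch = batch_grad ** 2

# What the patch would track: the average of the individual squared gradients.
second_moment_per_example = (g ** 2).mean(axis=0)
```

Averaging before squaring cancels much of the per-example variation, so `second_moment_from_batch` is never larger than `second_moment_per_example` and shrinks as the batch grows, while the per-example statistic does not depend on the batch size in expectation.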

The next post will be on Langevin dynamics.


### 3 Responses to Thoughts from TTIC31230: Hyperparameter Conjugacy

1. Michael R Douglas says:

Good points. But wouldn’t the appropriate term be “independent” or “orthogonal” rather than “conjugate,” a word used for related pairs of variables such as (x,p) in mechanics or (p,V) in thermodynamics.

2. McAllester says:

I am using the term “conjugate” in the sense of the conjugate gradient method.

• Michael R Douglas says: