Guidance and Art.

Self-guidance, or simply guidance, is fundamental to systems for image generation, including both diffusion-based image generators, such as the well-known DALLE generator, and autoregressive image generators such as CM3leon. A technical definition of guidance is given below. While guidance is fundamental to image generation, its semantics is unclear. This post argues that guidance causes image generation to capture “essence”. Capturing essence is also a fundamental motivation of human drawing and painting (human art). An empirical test of this claim is to apply guidance to a system trained only on photographs. The guidance-as-essence hypothesis predicts that, even when trained only on photographs, a model that generates images through guidance can generate drawing-like images. Drawings capture essence.

To define guidance I will assume we have trained a model that computes a probability of images given textual descriptions, P(I|T), where I is an image and T is some textual description of the image such as “an astronaut riding a horse”. We also assume that we have a model for the marginal distribution P(I), the a-priori probability of image I independent of any text label. Both diffusion image models and autoregressive image models define these probabilities. In guided image generation we modify the generation probability as follows, for s \geq 0.

P_s(I|T) \propto \frac{P(I|T)^{s+1}}{P(I)^s}

The idea here is that we want the distribution to favor images corresponding to the text T but which are a-priori (unconditionally) unlikely. Note that for s = 0 the distribution is the unmodified P(I|T). Also note that for s > 0 the strength s acts as an inverse temperature: larger s corresponds to a more focused (lower temperature) distribution. To sample from this distribution we must incorporate guidance into the incremental generation process. Some form of this is done in both diffusion models and autoregressive models; I will ignore the details.
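
To make the mechanism concrete, here is a minimal sketch of guided sampling in the autoregressive case. It assumes the model exposes per-step next-token logits for both the text-conditioned context and an unconditional (empty-prompt) context; the names and interface are illustrative, not any particular system's API. Diffusion models apply the same linear combination to noise (score) estimates rather than to logits.

import torch

def guided_logits(cond_logits: torch.Tensor,
                  uncond_logits: torch.Tensor,
                  s: float) -> torch.Tensor:
    # In log space P_s is (s+1) log P(I|T) - s log P(I); the per-step
    # analogue combines conditional and unconditional next-token logits.
    return (s + 1.0) * cond_logits - s * uncond_logits

def sample_next_token(cond_logits, uncond_logits, s):
    # Renormalize the guided logits and draw one token.
    probs = torch.softmax(guided_logits(cond_logits, uncond_logits, s), dim=-1)
    return torch.multinomial(probs, num_samples=1)

For s = 0 this reduces to ordinary sampling from P(I|T); as s grows the unconditional model is increasingly subtracted off, sharpening the distribution toward tokens that are specifically predicted by the text.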

Now the question is whether the modified distribution P_s(I|T) is related to the idea of “capturing the essence” of the text T. The concrete prediction is that an image sampled from P_s(I|T) will be similar in style to an artist’s drawing of T. This is related to the art in field guides for identifying birds. Drawings are generally considered better than photographs for capturing the “essence” of being, for example, a robin. In this example “capturing the essence” means that the drawing facilitates field identification.

I will also suggest that the use of guidance interacts with the quantitative metrics for image generation. The most natural metric for any generative model (in my opinion) is cross entropy. Base models, both image models and language models, are trained by optimizing a cross entropy loss. For language models it is traditional to measure cross entropy as perplexity or bits per byte. For images it is traditional to measure cross entropy in bits per channel, as defined by the following where C is the number of numbers (channel values) in the representation of image I and P_\Phi(I|T) is the probability that a model with parameter setting \Phi assigns to image I given text T.

\mathrm{Loss}(\Phi) = \frac{1}{C}\; E_{I,T \sim \mathrm{Pop}}[\;-\log_2\;P_\Phi(I|T)]
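
As a concrete illustration (not the evaluation code of any particular system), bits per channel can be computed from per-example log-likelihoods roughly as follows; log_probs and num_channels are assumed inputs.

import math
import torch

def bits_per_channel(log_probs: torch.Tensor, num_channels: int) -> float:
    # log_probs holds natural-log likelihoods log P_Phi(I|T), one entry per
    # (image, text) pair sampled from the population; num_channels is C,
    # e.g. 3 * height * width for an RGB image.
    bits_per_image = -log_probs / math.log(2.0)
    return (bits_per_image.mean() / num_channels).item()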

Bits per channel measures the ability of the model to represent the true conditional distribution P(I|T) as defined by a joint population distribution (\mathrm{Pop}) on I and T. However, bits per channel does not correspond to the model’s ability to draw the essence of a concept T. For some reason the FID score seems to correspond more closely to the ability to draw essence. This may be because the features used in computing FID are trained to capture the information relevant to the textual description (label) of the image.

That said, I still feel that the FID metric is not well motivated. I would suggest instead measuring both the conditional bits per channel as defined above and the unconditional bits per channel, and summing those two scores as a measure of the “correctness” of guided image generation.
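
Reusing the bits_per_channel helper sketched above, the proposed score might look like the following, where model.log_prob(image, text) and model.log_prob(image, None) are assumed (hypothetical) interfaces for the conditional and unconditional likelihoods on a batch of (image, text) pairs.

import torch

def guided_correctness(model, images, texts, num_channels):
    # Sum of conditional and unconditional bits per channel over a batch.
    cond_lp = torch.stack([model.log_prob(img, txt) for img, txt in zip(images, texts)])
    uncond_lp = torch.stack([model.log_prob(img, None) for img in images])
    return (bits_per_channel(cond_lp, num_channels)
            + bits_per_channel(uncond_lp, num_channels))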

The paper on CM3leon has a nice discussion relating guidance in image generation to a form of top-k sampling in language generation. Top-k sampling also modifies the generation process in a way that is not faithful to simply sampling from the conditional distribution. Top-k sampling in language generation exhibits the same coherence-diversity trade-off (as a function of k) as does guidance (as a function of the guidance strength s) in image generation.
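
For comparison, here is a minimal top-k sampling sketch; smaller k plays the role that larger guidance strength s plays above, trading diversity for coherence. The function name is illustrative.

import torch

def top_k_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
    # Keep only the k most probable tokens, renormalize, and sample one.
    top_vals, top_idx = torch.topk(logits, k)
    choice = torch.multinomial(torch.softmax(top_vals, dim=-1), num_samples=1)
    return top_idx.gather(-1, choice)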

The bottom line of this post is the hypothesized relationship between guidance and art, and the prediction that guidance can yield drawing-like images even when the generating model is trained only on photographs.


1 Response to Guidance and Art.

  1. Mark Johnson says:

    I haven’t kept up with text-to-image generation; is it the case that current models produce a line drawing or a sketch when $s$ is turned way up? Or do they produce a more prototypical example of the object than you’d expect to see in real life (like your example of the drawings in bird-watchers’ field guides)?

    I am very sympathetic to the Cognitive Semantics approaches which claim that simple spatial metaphors underlie lots of language, as proposed by people like George Lakoff and (a different-from-me) Mark Johnson. They draw abstract sketches to show (metaphorical) spatial relationships; see https://en.wikipedia.org/wiki/Image_schema for a few examples. It would be very cool if a machine could produce such sketches, and truly amazing if they “emerged” from a text-to-image generator with the appropriate parameter settings.

    On a vaguely related issue: Melanie Mitchell pointed out to me that all our models learn concepts by abstracting them from complex language, while children almost certainly have all or most of the concepts before they learn language.
