Advobots and AI Safety

In a 2014 post I argued for the “servant mission” as a solution to AI safety. The basic idea is that there are many AI agents, each of which has the mission of serving a particular person. I gave a concise statement of the mission of my servant as:

Mission: Within the law, fulfill the requests of David McAllester

These days I prefer the term “advobot” — a bot whose mission is to advocate for a particular person. A society in which AGI is restricted to personal advobots has fundamental safety advantages discussed in the 2014 post. However, while the concept of a personal advobot seemed well formed in 2014, today’s large language models (LLMs) raise serious issues that were not apparent at that time. The very concept of a “mission statement” now seems problematic. Much of the discussion of AI safety (or the lack thereof) has implicitly assumed that an AI agent can be given a fundamental goal expressed in natural language. For example, the servant mission above, or perhaps “make as many paper clips as possible”, or “get me to the airport as quickly as possible”. In this post I want to consider the relationship between the details of language model architecture, mission statements, and AI safety.

I will start by speculating that LLM architectures will converge on what I will call belief-memory architectures. These will be descendants of current retrieval architectures, such as RETRO, in which text generation and next-word prediction take retrieved information into account. RETRO is only one example; there is now a large literature on attempts to construct retrieval architectures. In a belief-memory architecture the retrieved information will be interpretable as beliefs, where these beliefs are either explicit text or vector representations that can be rendered as text, and where the model responds to retrieved beliefs in a manner roughly corresponding to the effect beliefs have on humans. While there is a significant literature on retrieval models, none of the current models retrieve beliefs in this sense.
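
As a concrete illustration, here is a minimal sketch of what generation conditioned on a belief memory might look like. This is not RETRO or any published architecture; the toy bag-of-words embedding, the BeliefMemory class, and the generate() stub are illustrative placeholders, not an actual retrieval model.

```python
# Minimal sketch: a belief memory queried by dense retrieval, whose hits are
# prepended to the prompt before generation.  The embedding is a toy
# bag-of-words stand-in for a learned encoder.
from collections import Counter
import math

def embed(text):
    """Toy embedding: a length-normalized bag of words."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    return sum(v * b.get(w, 0.0) for w, v in a.items())

class BeliefMemory:
    def __init__(self):
        self.beliefs = []                      # (text, vector) pairs

    def add(self, text):
        self.beliefs.append((text, embed(text)))

    def retrieve(self, query, k=3):
        q = embed(query)
        ranked = sorted(self.beliefs, key=lambda b: cosine(q, b[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

def respond(request, memory, generate):
    """Condition generation on retrieved beliefs; generate() is whatever LLM call is available."""
    context = "\n".join(memory.retrieve(request))
    return generate(f"Relevant beliefs:\n{context}\n\nRequest: {request}\nResponse:")

memory = BeliefMemory()
memory.add("The airport shuttle leaves every 30 minutes.")
memory.add("David asked to be reminded of the 9am meeting.")
print(respond("When is the next shuttle?", memory, generate=lambda prompt: prompt))  # identity stand-in for an LLM
```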

Belief-memory models, if they can be constructed, have serious advantages. Most obviously, one can inspect what the model believes. One can also inspect which beliefs were used in generating a response or action. Furthermore, an explicit belief memory can be manually edited — wrong or harmful beliefs can be removed and the events of the day can be added. Note that a belief might simply be that someone claimed something; the claim itself need not be believed.
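
In the same spirit, here is a small sketch of the inspection and editing side, assuming beliefs are stored as plain-text records with provenance. The field names and helper functions are hypothetical, not part of any existing system.

```python
# Sketch of an inspectable, editable belief store.  A claim can be recorded as
# a belief *about* the claim without being endorsed as true.
beliefs = [
    {"text": "The 9am meeting moved to 10am.", "source": "David"},
    {"text": "A news article claimed that the shuttle schedule changed.", "source": "retrieval"},
]

def inspect(store):
    """List everything the model currently believes."""
    return [b["text"] for b in store]

def remove_if(store, predicate):
    """Manual edit: drop beliefs judged wrong or harmful."""
    return [b for b in store if not predicate(b)]

def add_claim(store, source, claim):
    """Record that a claim was made, without asserting the claim itself."""
    store.append({"text": f"{source} claimed: {claim}", "source": source})

add_claim(beliefs, "a colleague", "the meeting was cancelled")
beliefs = remove_if(beliefs, lambda b: "shuttle" in b["text"])   # edit out a stale belief
print(inspect(beliefs))
```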

I will also assume that belief-memory models will be “belief-generating”. Belief generation is similar to chain-of-thought in that the model generates thoughts before generating an answer. In belief generation the thoughts are beliefs that are added to memory: the LLM will remember “I thought x”. There is an analogy here with a von Neumann architecture in which the CPU is replaced with a transformer and the registers are replaced with the transformer context. The context can be rewritten or extended, as in current chain-of-thought processing, but the generated thoughts can also be stored in the belief memory. This models human memory of thoughts. Subjectively we are not aware of how our thoughts are generated — they just appear. But we do remember them. By definition, computation which is not remembered is not accessible later.
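
To make the loop concrete, here is a hedged sketch that reuses the toy BeliefMemory above: each intermediate thought is written back into memory so that it can be inspected later. The generate_thought and generate_answer stubs stand in for whatever chain-of-thought interface the underlying LLM actually provides.

```python
# Sketch of a belief-generating loop: the transformer plays the role of the
# CPU, the context plays the role of the registers, and generated thoughts are
# stored in the belief memory so the model can later remember "I thought x".
def answer_with_remembered_thoughts(question, memory, generate_thought,
                                    generate_answer, n_steps=3):
    context = memory.retrieve(question)          # load the "registers" from memory
    for _ in range(n_steps):
        thought = generate_thought(question, context)
        memory.add(f"I thought: {thought}")      # the thought becomes an inspectable, searchable belief
        context = context + [thought]            # the context is extended, as in chain-of-thought
    return generate_answer(question, context)
```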

It is hard to imagine an AGI plotting to turn the world into paperclips without forming and remembering thoughts. A belief-generating architecture has the additional advantage that one can inspect (or search through) the thoughts of the model. An advobot for person x should accept the situation in which x can watch its thoughts. After all, the advobot for x has the mission of fulfilling the requests of x.

We must also face the problem of how to give an AI agent a mission stated in text. A good first attempt is the recently introduced notion of constitutional AI. I believe that, as language models get better at understanding language, the constitutional AI approach will become more effective and will eventually outperform reinforcement learning from human feedback (RLHF). Constitutional AI has the obvious advantage of allowing the mission to be specified explicitly.
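
As a rough illustration of what an explicit textual mission could look like in a constitutional style, here is a sketch in which the advobot critiques and revises its own draft against the mission text. The prompting follows the general shape of constitutional AI rather than any published recipe, and generate() is again a stand-in for an LLM call.

```python
# Sketch: the mission is plain text, and the model checks its own draft
# response against that text before answering.
MISSION = "Within the law, fulfill the requests of David McAllester."

def constitutional_respond(request, generate):
    draft = generate(f"Mission: {MISSION}\nRequest: {request}\nResponse:")
    critique = generate(
        f"Mission: {MISSION}\nDraft response: {draft}\n"
        "Does this response violate the mission or the law? Explain briefly.")
    revised = generate(
        f"Mission: {MISSION}\nDraft response: {draft}\nCritique: {critique}\n"
        "Rewrite the response so that it satisfies the mission and the law:")
    return revised
```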

As of now I am not too concerned about the existential risk (X-risk) from AGI. It seems to me that we will have the common sense to construct a society of personal advobots with servant missions, in which the beliefs and thoughts of advobots are inspectable, searchable, and editable. I am very much looking forward to an advobot era, which I expect to arrive soon.


13 Responses to Advobots and AI Safety

  1. Mark Johnson says:

    This sounds very reasonable.

    What do you think a belief is? I think you’re assuming a belief is a natural language statement or expression of some kind.

    Could a belief be a vector of some kind, not necessarily embedding a natural language expression?

  2. McAllester says:

    I think the architecture could be workable with beliefs represented as text. But it would be better, I think, to use vector representations of some kind for beliefs. I do think that one needs discrete “concepts” and “entities” with embeddings analogous to word embeddings. But it seems possible that we have (embedded) concepts for which we have no words. Rich Sutton once pointed out to me that we recognize that little piece of plastic at the end of a shoelace even if there is no word for it. But my feeling is that this phenomenon is not significant — we create language as needed in such a way that essentially all beliefs can be fairly accurately rendered in language.

  3. Lisa Travis says:

    What? That phenomenon is hugely significant. Of course we can have concepts without words. That is a fundamental issue, and the evidence sides with concepts existing independently of words. Also, I suspect that “an explicit belief memory can be manually edited” is not so easily done. If only we could edit the beliefs of humans so easily! We can edit their memories, for sure. But editing their beliefs is not so easy, because they are tied to a coherent system of beliefs based on a lot of data.

    • McAllester says:

      Editing the beliefs of a computer will presumably be easier than editing the beliefs of a person. But the point about coherence is well taken. There should be a process whereby an AI agent seeks truth through coherence of belief. I think that the ability to edit the beliefs of an LLM is significant for safety but agree that strong AI systems will strive for coherence of belief (by design) and that this will complicate belief editing.

    • McAllester says:

      I think we agree that concepts can exist without lexical names, especially in young children. But current language models already command an enormous vocabulary. The question is the significance of concepts that cannot be accurately described using this enormous vocabulary with an adult-level command of the meanings of the words. Concepts might still exist that cannot be verbalized, but “my feeling” is that this is not a significant phenomenon in this setting.

  4. Mark Johnson says:

    I think this idea is basically on the right track.

    But I think as well as Beliefs, we should allow our Agents to have something like Desires and Intentions, as in https://en.wikipedia.org/wiki/Belief%E2%80%93desire%E2%80%93intention_software_model

  5. Mark Johnson says:

    Another related thought: it puzzles me that retrieval mechanisms, Transformer Attention, etc. are always defined with separate Keys and Values, yet in practice they are almost always deployed with the Keys and Values being the same.

    I would want the Keys for your blog posts to be things like “clever ideas about NLP and AI”, “Logic and Language”, etc.

    • McAllester says:

      The key/value distinction is interesting. My thinking is that the value should roughly decompose (under some rotation) into a tensor product of “components”. This should allow the query to probe a particular component of the value. Intuitively, the values can still be indexed by their components. A thousand dimensions allows the value to be more like a database record, or an object with values for its instance variables.

  6. Mark Johnson says:

    A final comment: I think a really interesting research question is: how do we train a model to write useful memories?

    I am assuming that a memory needs to abstract from or summarise the current model state somehow. I suppose you could avoid this problem by memorising all the raw inputs to the model (I’ve seen some language modelling papers suggesting this).

    The challenge here is that the usefulness of a memory may only be apparent much later than when it was created.

    I think we could solve this with a version of RL or DPO over a very long temporal window (the high-level idea is to reward the model for writing useful memories). It seems like a solvable problem, and I would be very interested in collaborating to work out the details.

    It’s also interesting to wonder about how this problem might be solved by biological creatures who can’t “wind back” time to an earlier state.

    Anyway, thanks for a very interesting post!

  7. McAllester says:

    Let’s take this offline.

  8. Bob Givan says:

    Thinking through your somewhat-implied safety claims…

    1. Do you think that our ability to examine and alter *current* computer memory guarantees us safety from the effects of bugs, today? CPUs process so much faster than we can examine their memory that the safety guarantees are quite imperfect. Does this problem get worse or better with AI systems and faster CPUs over time? Can our tools for speeding ourselves up keep up?

    For safety guarantees, I am tempted to think in limit conditions where the AI system is vastly faster than we are, thinking through what would be thousands of years of our thought stream, but doing so in seconds. We won’t build such a system overnight, so there is hope of controlling it through how it is built, but it is hard to imagine controlling something that fast by watching and editing its thoughts and beliefs. It is even harder if it knows that we have such control and may not share its goals. I guess obviously we would need to rely on some automated tools to monitor its beliefs? But then we rely on the guarantees of those automated tools to solve a hard-to-specify problem.

    2. Even assuming we have somehow aligned the fast-thinking AI, it isn’t clear to me what role we still play. All the advobots can communicate and compute their way to a game-theoretic resolution of their conflicting advocacy goals in a split second, whereas that resolution might take a lifetime to explain to the human. So I guess it is acted on without explanation to the human?

    (Imagine that the advobot has negotiated a resolution you very much dislike; don’t you want an explanation? It must be the best it could achieve for you, but that’s hard to just trust. The true reasons could be internationally complex and hard to summarize.)

    Again, this is a downstream vision, not something that would arrive in the near future, but it isn’t easy to see how partial achievement of this advobot paradise is guaranteed to be safer.

    3. I wonder how much the desire to see this come our way is biased by our technical curiosity about what it would be like (curiosity, wonder) and by the alternative (certain personal death for some of us, and for some of us not so far off).

    • McAllester says:

      Reading thoughts is a safety mechanism redundant with the alignment of the bot to its master. The mission is to “fulfill the requests of …”. That, in and of itself, should leave us in control independent of anything the machine believes. Regarding the speed of machine thought, we are not going to get a thousand years of thought per second for a while. There will be time to co-evolve AI with safety measures, for example bots watching the thoughts of other bots, themselves aligned to the mission of maintaining alignment.

      No one knows for sure how this will develop. We all have to use our own judgement in deciding what to believe or predict. My judgement is that, between the redundant safety features of dispersal of power (one agent per person), constitutional alignment, and thought reading, we are safe.
