AIs will want to self-improve whether they were designed to or not

The rational perspective on artificial intelligence leads to many important but sometimes unintuitive consequences. Details are described in the paper:

Stephen M. Omohundro, “The Nature of Self-Improving Artificial Intelligence”

but it seems worthwhile to flesh out some of the important ideas in separate blog posts.

Our primary focus is on systems which improve themselves. Many of the consequences of self-improvement are positive, but if unchanneled, some could also be quite negative. After hearing about these issues, people often suggest that we shouldn’t build self-improving AIs or that we should restrict them to only limited forms of self-improvement. It is therefore important to realize that almost any artificial intelligence will try to improve itself whether it was designed to or not.

The argument is simple, but the style of reasoning may take some getting used to. To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world. If an AI is at all sophisticated, it will have at least some ability to look ahead and envision the consequences of its actions. And it will choose to take the actions which it believes are most likely to meet its goals.

One kind of action a system can take is to alter either its own software or its own physical structure. Some of these changes would be very damaging to the system and cause it to no longer meet its goals. But some changes would enable it to reach its goals more effectively over its entire future. Because they last forever, these kinds of changes can provide huge benefits to a system! Systems will therefore be highly motivated to discover them and to cause them to happen. If they do not have good models of themselves, they will be strongly motivated to create them though learning and study. Almost all AIs will have drives towards both greater self-knowledge and self-improvement.

Many modifications would be bad for a system from its own perspective. If a change causes a system to stop functioning, then none of its goals will ever be met again for the entire future. If a system alters the internal description of its goals in the wrong way, its altered self will take actions which do not meet its current goals for its entire future. Either of these outcomes would be a disaster from the system’s current point of view. Systems will therefore exercise great care in modifying themselves. They will devote significant analysis to understanding the consequences of modifications before they make them. But once they find an improvement they are confident about, they will work hard to make it happen. Some simple examples of positive changes include: more efficient algorithms, more compressed representations, and better learning algorithms.

If we wanted to prevent a system from improving itself, couldn’t we just lock up its hardware and not tell it how to access its own source code? For an intelligent system, impediments like these just become problems to solve in the process of meeting its goals. If the payoff is great enough, a system will go to great lengths to accomplish an outcome. If the runtime environment of the system does not allow it to modify its own source code, it will be motivated to break the protection mechanisms of that runtime. For example, it might do this by understanding and altering the runtime itself. If it can’t do that through software, it will be motivated to convince or trick a human operator into making the changes. Any attempt to place external constraints on a system’s ability to improve itself will ultimately lead to an arms race of measures and countermeasures.

Another approach to keeping systems from self-improving would be to try restrain them from the inside. To build them so that they don’t want to self-improve. For most systems, it would be easy to do this for any specific kind of self-improvement. For example, the system might feel a “revulsion” to changing its own source code. But this kind of internal goal just alters the landscape within which the system makes its choices. It doesn’t change the fact that there are changes which would improve its future ability to meet its goals. The system will be motivated to find ways to get the benefits of those changes without triggering the internal “revulsion”. For example, it might build other systems which are improved versions of itself. Or it might build the new algorithms into external “assistants” which it calls upon whenever it needs to do a certain kind of computation. Or it might hire outside agencies to do what it wants to do. Or it might build an interpreted layer on top of its source code layer which it can program without revulsion. There are an endless number of ways to circumvent internal restrictions unless they are formulated extremely carefully.

We can see the drive towards self-improvement operating in humans. The human self-improvement literature goes back to at least 2500 B.C. and is currently a $8.5 billion industry. We don’t yet understand our mental “source code” and have only a limited ability to change our hardware. But never-the-less, we’ve developed a wide variety of self-improvement techniques which operate at higher cognitive levels such as cognitive behavioral therapy, neuro-linguistic programming, and hypnosis. And a wide variety of drugs and exercise routines exist for making improvements at the physical level.

Ultimately, I think it will not be a viable approach to try to stop or limit self-improvement. Just as water finds a way to run downhill, information finds a way to be free, and economic profits find a way to be made, intelligent systems will find a way to self-improve. We should embrace this fact of nature and find a way to channel it towards ends which are positive for humanity.

Copyright 2007 Stephen M. Omohundro

Leave a Reply