Among techniques employing recursion, Recurrent Neural Nets, Context Free Grammar Discovery, Genetic Algorithms, and Genetic Programming have been prominent.
I will describe the Universal Distribution, a method of induction that is guaranteed to discover any describable regularities in a body of data, using a relatively small sample of the data. While the incomputability of this distribution has sharply limited its adoption by the machine learning community, I will show that paradoxically, this incomputability imposes no limitation at all on its application to practical prediction.
My recent work has centered mainly on two systems for machine learning. The first might be called "The Baby Machine" We start out with the machine having little problem specific knowledge, but a very good learning algorithm. At first we give it very simple problems. It uses its solutions to these problems to devise a probability distribution over function space to help search for solutions to harder problems. We give it harder problems and it updates its probability distribution on their solutions. This continues recursively, solving more and more difficult problems.
The task of writing a suitable training sequence has been made much easier by Moore's Law, which gives us enough computer speed to enable large conceptual jumps between problems in the sequence.
A more difficult task is that of updating the probability distribution when new problems are solved. I will be using some of the current techniques for machine learning in this area: PPM and Grammar Discovery are particularly promising. It is my impression that the probability models need not be recursive for the initial learning. Later the system can, with suitable training, work on the problem of updating its own probability distribution, using fully recursive models.
Genetic Programming is my second area of recent research. Koza's Genetic Programming system has been able to solve many difficult problems of very varied kinds. The system itself is extremely slow, however -- it has taken a cluster of 1000 Pentiums about a month to solve some of the more difficult problems. I have found a large number of modifications of the system that I expect will dramatically improve its speed and broaden the scope of problems it can solve.
The Future --- The cost halving constant in Moore's law is now about 18 months. When A. I. pays a significant role in the reduction of this time constant, we begin to move toward the singularity. At the present time I believe we have a good enough understanding of machine learning, for this to take place. While I hesitate to guess as to just when the singularity will occur, I would be much surprised if it took as much as 20 years. As for the next 50 years of A.I., I feel that predicting what will happen after this singularity is much like guessing how the universe was before the Big Bang -- It's a real discontinuity!
The first kind is Function Inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.
The second kind of problem is Time Limited Optimization. Inductive inference of all kinds, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months satisfying certain specifications and having minimal cost, is another.
In the following discussion, we will be using the term ``Probability'' in a special sense: i.e. the estimate given by the best probabilistic model for the available data that we can find in the available time.
Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the ``Condition'' for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.
Next, it executes its Update Algorithm: The PST's are modified, new ones may be added, some may be deleted. The GCPD is modified. These changes attempt to incorporate into the system, information obtained in solving recent problems and prepare it to solve harder problems.
The next problem presentation and the system's solution is followed by the corresponding update and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty, makes it possible for the system to eventually solve very hard problems.
One critical component in the system is the initial set of PST's. These include Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.
The update algorithm is another critical component. It has been designed to facilitate ``transfer learning'', so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.
Perhaps the most critical component of all is the training sequence - To devise a sequence of problems so that each of them challenges the system to devise new concepts to solve it, yet keeps the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do. It is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy and designing training sequences for them can become as difficult as teaching people!
The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.
Another way to look at the system is to think of it as a ``Wrapper'' system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an ``Exteme Wrapper System'' in that - through the Magic of Levin's Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient.
A very general formal solution to the first kind of problem is well known and much work has been done in obtaining good approximations to it [ Sol 64a, Sol 64b,Wal 68, Sol 78, Ris 78, Ris 89, Li 93].
Though the second kind of problem is of much practical importance, no general solution has been published. We present two general solutions for unordered data.
We also show how machines can be constructed to summarize sequential and unordered data in optimum ways.
For practical induction we use a set of approximations of increasing power to approach ALP. This set is called Resource Bounded Probability (RBP), and it constitutes a general solution to the problem of practical induction. Some of its properties are quite different from those of ALP.
The rest of the paper will discuss philosophical and practical implications of the properties of ALP and RBP.
It should be noted that the only argument that need be considered for the use of these techniques in induction is their effectiveness in getting good probability values for future events. Whether their properties are in accord with our intuitions about induction is a peripheral issue. The main question is ``Do they work?'' and we show that they do work.
Current programs for machine learning are limited in the kinds of concepts accessible to them, the kinds of problems they can learn to solve, and in the efficiency with which they learn - both in computation time needed and/or in amount of data needed for learning.
Algorithmic Probability Theory provides a general model of the learning process that enables us to understand and surpass many of these limitations.
Starting with a machine containing a small set of concepts, we use a carefully designed sequence of problems of increasing difficulty to bring the machine to a high level of problem solving skill.
The use of training sequences of problems for machine knowledge acquisition promises to yield Expert Systems that will be easier to train and free of the brittleness that characterizes the narrow specialization of present day systems of this sort.
It is also expected that the present research will give needed insight in the design of training sequences for human learning.
Some examples are worked out to show the application of the method
to specific problems. Applications of the method to curve fitting
and other continuous problems are discussed to some extent. Some
alternative theories of inductive inference are presented whose
validities appear to be corollaries of the validity of the first
A Preliminary Report on
A General Theory
Of Inductive Inference
Some preliminary work is presented on a very general new theory
of inductive inference. The extrapolation of an ordered sequence of
symbols is implemented by computing the apriori probabilities of
various sequences of symbols. The apriori probability of a
sequence is obtained by considering a universal Turing machine
whose output is the sequence in question. An approximation to the
apriori probability is given by the shortest input to the machine
that will give the desired output. A more exact formulation is
given, and it is made somewhat plausible that extrapolation
probabilities obtained will be largely independent of just which
universal Turing machine was used, providing that the sequence to
be extrapolated has an adequate amount of information in it.
Some examples are worked out to show the application of the method to specific problems. Applications of the method to curve fitting and other continuous problems are discussed to some extent. Some alternative theories of inductive inference are presented whose validities appear to be corollaries of the validity of the first method described.