The first kind extrapolates a sequence of strings and/or numbers.
The second extrapolates an unordered set of strings and/or numbers.
The third extrapolates an unordered set of ordered pairs of elements that may be strings and/or numbers: given the first element of a new pair, the task is to obtain a probability distribution over possible second elements of that pair.
Each of the three kinds of problems is solved using an associated universal distribution. In each case a corresponding convergence theorem is given, showing that as sample size grows, the expected error in probability estimate decreases rapidly.
The solutions given are very general and cover a great variety of induction problems. Time series prediction, grammar discovery (for both formal and natural languages), curve fitting, the identification problem, and the categorization problem are a few of the kinds of problems amenable to the methods described.
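As a rough illustration of this style of prediction (my toy example, not part of the abstract, and only a stand-in for the incomputable universal distribution), the following Python sketch predicts the next symbol of a binary sequence with a Bayesian mixture over a small family of Markov models, each weighted by 2**(-order) as a crude proxy for description length. The names (sequence_probability, predict_next, ALPHABET) are mine.

    # Toy illustration: a Bayesian mixture over a small, enumerable family of models,
    # each weighted by 2**(-order) as a stand-in for description length, used to
    # predict the next symbol of a binary sequence. NOT the universal distribution,
    # which mixes over all computable models and is incomputable.
    from collections import defaultdict

    ALPHABET = "01"

    def sequence_probability(seq, order):
        """Probability assigned to seq by a Laplace-smoothed Markov model of given order."""
        counts = defaultdict(lambda: defaultdict(int))
        prob = 1.0
        for i, sym in enumerate(seq):
            ctx = seq[max(0, i - order):i]
            total = sum(counts[ctx].values())
            prob *= (counts[ctx][sym] + 1) / (total + len(ALPHABET))   # rule of succession
            counts[ctx][sym] += 1
        return prob

    def predict_next(seq, max_order=3):
        """Mixture prediction: prior weight 2**(-order) times the model's likelihood on seq."""
        scores = defaultdict(float)
        for order in range(max_order + 1):
            weight = 2.0 ** (-order) * sequence_probability(seq, order)
            counts = defaultdict(lambda: defaultdict(int))
            for i, sym in enumerate(seq):
                counts[seq[max(0, i - order):i]][sym] += 1
            ctx = seq[max(0, len(seq) - order):]
            total = sum(counts[ctx].values())
            for sym in ALPHABET:
                scores[sym] += weight * (counts[ctx][sym] + 1) / (total + len(ALPHABET))
        z = sum(scores.values())
        return {sym: s / z for sym, s in scores.items()}

    print(predict_next("010101010101"))   # the mixture strongly favors "0" next

The convergence theorems referred to above concern the exact mixture over all computable models; the sketch keeps only a tiny fixed family, but the same mechanism (likelihood weighted by a complexity prior) drives the prediction.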
Among techniques employing recursion, Recurrent Neural Nets, Context Free Grammar Discovery, Genetic Algorithms, and Genetic Programming have been prominent.
I will describe the Universal Distribution, a method of induction that is guaranteed to discover any describable regularities in a body of data, using a relatively small sample of the data. While the incomputability of this distribution has sharply limited its adoption by the machine learning community, I will show that, paradoxically, this incomputability imposes no limitation at all on its application to practical prediction.
My recent work has centered mainly on two systems for machine learning. The first might be called "The Baby Machine." We start out with a machine that has little problem-specific knowledge but a very good learning algorithm. At first we give it very simple problems. It uses its solutions to these problems to devise a probability distribution over function space to help search for solutions to harder problems. We then give it harder problems and it updates its probability distribution on their solutions. This continues recursively, solving more and more difficult problems.
The task of writing a suitable training sequence has been made much easier by Moore's Law, which gives us enough computer speed to enable large conceptual jumps between problems in the sequence.
A more difficult task is that of updating the probability distribution when new problems are solved. I will be using some of the current techniques for machine learning in this area: PPM and Grammar Discovery are particularly promising. It is my impression that the probability models need not be recursive for the initial learning. Later the system can, with suitable training, work on the problem of updating its own probability distribution, using fully recursive models.
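A minimal, runnable sketch of this loop is given below. It is my own simplification, not the system described: the primitive set, the guided random search, and the count-based reweighting rule are placeholders for the much richer PPM- or grammar-based probability models mentioned above.

    # A minimal simplification (mine) of the "Baby Machine" loop: solve easy problems
    # by guided random search over compositions of primitives, then shift the guiding
    # distribution toward primitives that appeared in successful solutions.
    import random

    PRIMITIVES = {                      # placeholder building blocks
        "inc":    lambda x: x + 1,
        "dec":    lambda x: x - 1,
        "double": lambda x: 2 * x,
        "square": lambda x: x * x,
    }
    weights = {name: 1.0 for name in PRIMITIVES}   # the evolving probability distribution

    def sample_program(length):
        names = list(weights)
        probs = [weights[n] for n in names]
        return [random.choices(names, probs)[0] for _ in range(length)]

    def run(program, x):
        for name in program:
            x = PRIMITIVES[name](x)
        return x

    def solve(examples, max_len=4, tries=20000):
        """Search for a composition consistent with all (input, output) examples."""
        for _ in range(tries):
            prog = sample_program(random.randint(1, max_len))
            if all(run(prog, a) == b for a, b in examples):
                return prog
        return None

    def update(program):
        """Crude update: add probability mass to primitives used in the solution."""
        for name in program:
            weights[name] += 1.0

    # A training sequence of increasing difficulty:
    # f(x) = x + 1,  f(x) = 2(x + 1),  f(x) = (x + 1)**2
    training = [[(1, 2), (5, 6)], [(1, 4), (3, 8)], [(2, 9), (3, 16)]]
    for examples in training:
        prog = solve(examples)
        print(examples, "->", prog)
        if prog:
            update(prog)

The point of the sketch is only the shape of the loop: each solved problem shifts the guiding distribution, which in turn shortens the search for the next, harder problem.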
Genetic Programming is my second area of recent research. Koza's Genetic Programming system has been able to solve many difficult problems of very varied kinds. The system itself is extremely slow, however -- it has taken a cluster of 1000 Pentiums about a month to solve some of the more difficult problems. I have found a large number of modifications of the system that I expect will dramatically improve its speed and broaden the scope of problems it can solve.
The Future --- The cost halving constant in Moore's law is now about 18 months. When A.I. plays a significant role in the reduction of this time constant, we begin to move toward the singularity. At the present time I believe we have a good enough understanding of machine learning for this to take place. While I hesitate to guess as to just when the singularity will occur, I would be much surprised if it took as much as 20 years. As for the next 50 years of A.I., I feel that predicting what will happen after this singularity is much like guessing how the universe was before the Big Bang -- It's a real discontinuity!
First, the Universal Probability Distribution and some of its properties: its accuracy, its incomputability, its subjectivity.
Second, how to use this distribution to create very intelligent machines.
The report shows how the Universal Probability Distribution --- the probability distribution on all possible output strings of a universal computer with random input --- solves many kinds of prediction problems and resolves serious difficulties in the foundations of Bayesian Statistics.
It discusses the Convergence Theorem and how the universal distribution gives very good probability estimates with varied types of data.
It explains what incomputability is and why it is valuable.
Then discusses the important difference between the Universal Distribution and a common interpretation of "Universality"--- where universal would imply that we can usefully employ the same Universal Distribution for all problems. However, to get good predictions it is usually necessary to use a different Universal Distribution for each Problem Domain. The meaning and value of this is clarified.
Then the report discusses how all these aspects may be used in developing intelligent machines. Notes the delayed, gradual acceptance of subjective Bayesian probability in the AI community. Outlines the general system, how it solves problems with both probabilistic and deterministic answers and how learning is an integral part of it.
The first kind is Function Inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.
The second kind of problem is Time Limited Optimization. Inductive inference of all kinds, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months that satisfies certain specifications and has minimal cost is another.
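To make the two problem classes concrete, here is a small Python sketch of my own; the brute-force loops and names (invert, optimize) are placeholders, not the probabilistically guided search the report actually proposes.

    # Sketch of the two problem classes as simple interfaces (illustration only).
    import time
    from typing import Callable, Iterable, Optional

    def invert(M: Callable[[int], int], s: int, search_space: Iterable[int]) -> Optional[int]:
        """Function inversion: find some x with M(x) == s."""
        for x in search_space:
            if M(x) == s:
                return x
        return None

    def optimize(M: Callable[[int], float], candidates: Iterable[int], T: float) -> Optional[int]:
        """Time limited optimization: the best x found within T seconds."""
        deadline = time.monotonic() + T
        best_x, best_val = None, float("-inf")
        for x in candidates:
            if time.monotonic() > deadline:
                break
            val = M(x)
            if val > best_val:
                best_x, best_val = x, val
        return best_x

    print(invert(lambda x: x * x, 49, range(100)))             # -> 7, since 7*7 == 49
    print(optimize(lambda x: -(x - 3) ** 2, range(100), 0.1))  # -> 3, the maximizer found in time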
In the following discussion, we will be using the term ``Probability'' in a special sense: i.e. the estimate given by the best probabilistic model for the available data that we can find in the available time.
Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the ``Condition'' for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.
Next, it executes its Update Algorithm: the PST's are modified, new ones may be added, and some may be deleted. The GCPD is modified as well. These changes attempt to incorporate into the system the information obtained in solving recent problems, and to prepare it to solve harder problems.
The next problem presentation and the system's solution are followed by the corresponding update, and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty makes it possible for the system to eventually solve very hard problems.
One critical component in the system is the initial set of PST's. These include Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.
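The following toy Python sketch (my own, greatly simplified) shows the shape of probability-guided Lsearch: each candidate is run only if its cost fits within its probability-proportional share of the phase budget, and the total budget doubles each phase. Raising a candidate's probability in the update phase lowers the phase at which it becomes affordable, which is the sense in which the sizes of difficult problems are effectively reduced. The candidate encoding and cost model here are illustrative assumptions.

    # A toy Lsearch sketch: run a candidate only when its simulated cost fits within
    # its share (probability * budget) of the current phase's doubling time budget.
    import itertools

    def lsearch(candidates_with_probs, is_solution, step_cost, max_phases=25):
        for phase in range(max_phases):
            budget = 2 ** phase                        # total time allotted in this phase
            for cand, p in candidates_with_probs:
                if step_cost(cand) <= p * budget:      # candidate's share of the budget
                    if is_solution(cand):
                        return cand, phase
        return None, max_phases

    # Example: find a bit string whose value is 205, guided by the prior 2**(-2*len(s)),
    # which sums to 1 over all nonempty bit strings.
    def all_bitstrings():
        for n in itertools.count(1):
            for bits in itertools.product("01", repeat=n):
                yield "".join(bits), 2.0 ** (-2 * n)

    cands = list(itertools.islice(all_bitstrings(), 5000))
    sol, phase = lsearch(cands, lambda s: int(s, 2) == 205, lambda s: len(s))
    print(sol, "found in phase", phase)                # "11001101", found in phase 19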
The update algorithm is another critical component. It has been designed to facilitate ``transfer learning'', so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.
Perhaps the most critical component of all is the training sequence - devising a sequence of problems such that each of them challenges the system to devise new concepts to solve it, yet keeps the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do: it is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy, and designing training sequences for them can become as difficult as teaching people!
The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.
Another way to look at the system is to think of it as a ``Wrapper'' system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an ``Extreme Wrapper System'' in that - through the Magic of Levin's Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient.
A very general formal solution to the first kind of problem is well known and much work has been done in obtaining good approximations to it [Sol 64a, Sol 64b, Wal 68, Sol 78, Ris 78, Ris 89, Li 93].
Though the second kind of problem is of much practical importance, no general solution has been published. We present two general solutions for unordered data.
We also show how machines can be constructed to summarize sequential and unordered data in optimum ways.
Algorithmic Probability is a general technique for solving mathematical problems and for developing and measuring scientific theories. The problems are related to induction, and in particular to induction and prediction when the model may be probabilistic.
We describe motivations, major influences and the steps taken to develop this system of induction and prediction. Among others, major influences include Carnap, Chomsky, Shannon, McCarthy and Minsky. We discuss the importance of Turing Machines and new computer coding ideas. Bayes rule is especially important for the development of quantitative values for prediction, and prediction in complex situations. The Search method developed by Leonid Levin is discussed.
For practical induction we use a set of approximations of increasing power to approach ALP. This set is called Resource Bounded Probability (RBP), and it constitutes a general solution to the problem of practical induction. Some of its properties are quite different from those of ALP.
The rest of the paper will discuss philosophical and practical implications of the properties of ALP and RBP.
It should be noted that the only argument that need be considered for the use of these techniques in induction is their effectiveness in getting good probability values for future events. Whether their properties are in accord with our intuitions about induction is a peripheral issue. The main question is ``Do they work?'' and we show that they do work.
Current programs for machine learning are limited in the kinds of concepts accessible to them, the kinds of problems they can learn to solve, and the efficiency with which they learn - in the computation time and/or the amount of data needed for learning.
Algorithmic Probability Theory provides a general model of the learning process that enables us to understand and surpass many of these limitations.
Starting with a machine containing a small set of concepts, we use a carefully designed sequence of problems of increasing difficulty to bring the machine to a high level of problem solving skill.
The use of training sequences of problems for machine knowledge acquisition promises to yield Expert Systems that will be easier to train and free of the brittleness that characterizes the narrow specialization of present day systems of this sort.
It is also expected that the present research will give needed insight into the design of training sequences for human learning.
The first topic: Algorithmic Probability --- the motivation for defining it, how it overcomes difficulties in other formulations of probability, some of its characteristic properties and successful applications. Algorithmic Probability of a string is the probability of that particular string being produced as the output of a reference universal Turing machine with random input. This revision of our definition of probability resolves many problems of statistics that have plagued classical concepts of probability. Discusses aspects of algorithmic probability, such as completeness and incomputability.
The second topic: Discusses how to apply it to problems in A.I. --- where it promises to give near optimum search procedures for two very broad classes of problems. First, machine inversion problems: given a string, s, and a machine, M, that maps strings to strings, find in minimum time a string, x, such that M(x) = s. Solving algebraic equations is an example. Second, time limited optimization problems: given a time limit T and a machine M that maps strings to real numbers, find within time T the string x such that M(x) is as large as possible. Many engineering problems are of this sort --- for example, designing an automobile in 6 months.
Discusses how Algorithmic Complexity theory can find general laws in masses of data. Different problem applications and a special search procedure, Levin Search, are discussed, as well as the current state of the general system using algorithmic probability.
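In later notation (mine, not quoted from the abstract), the Algorithmic Probability defined in the first topic is usually written as a sum over all programs p whose output on the reference universal Turing machine M begins with the string x:

    \[
        P_M(x) \;=\; \sum_{p \,:\, M(p) = x\ast} 2^{-|p|}
    \]

Here |p| is the length of p in bits and x* denotes any output having x as a prefix; the "random input" picture above corresponds to feeding M an input tape of independent fair coin flips.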
Estimates are made for when these milestones will occur, followed by some suggestions for the more effective utilization of the extremely rapid technological growth that is expected.
Levin has shown that if P'_M(x) is an unnormalized form of this measure, and P(x) is any computable probability measure on strings, x, then P'_M(x) ≥ CP(x), where C is a constant independent of x. The corresponding result for the normalized form of this measure, P_M(x), is directly derivable from Willis' probability measures on non-universal machines. If the conditional probabilities of P_M(x) are used to approximate those of P, then the expected value of the total squared error in these conditional probabilities is bounded by -(1/2) ln C. With this error criterion, and when used as the basis of a universal gambling scheme, P_M(x) is superior to Cover's measure b*. When H* ≡ -log_2 P_M is used to define the entropy of a finite sequence, the equation H*(x,y) = H*(x) + H*_x(y) holds exactly, in contrast to Chaitin's entropy definition, which has a nonvanishing error term in this equation.
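Written out in later notation (my transcription, for binary sequences drawn from P), the squared-error bound stated above reads:

    \[
        \sum_{n} E_P\!\left[\Bigl(P_M(x_{n+1}{=}1 \mid x_1 \cdots x_n) - P(x_{n+1}{=}1 \mid x_1 \cdots x_n)\Bigr)^{2}\right] \;\le\; -\tfrac{1}{2}\ln C
    \]

Since C can be taken to be about 2^{-k}, where k is the number of bits needed to describe P on M, the total expected squared error is roughly (k ln 2)/2, which is the sense in which the conditional probability estimates converge rapidly.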
Several problems in pattern recognition and A.I. are investigated through these methods. The theory is used to obtain the a priori probabilities that are necessary in the application of stochastic languages to pattern recognition. A simple, quantitative solution is presented for part of Winston's problem of learning structural descriptions from examples. In contrast to work in non-probabilistic prediction, the present methods give probability values that can be used with decision theory to make critical decisions.
An approach to defining randomness of a finite sequence is that all future continuations of the sequence are equally likely.
The systems described make it possible to put such a definition into an exact form and analyze its properties.
The most important open problem is the practical realization of systems like those described.
Attempts are made to implement induction theories through a general approach to networks of tasks. Applications of induction theory to man-to-animal communication are considered, and some new approaches to interspecies communication are suggested.
It is now shown that the induction theories developed for discrete symbols can be applied to prediction of continuous data.
Computer implementation of induction through use of definitions of subsequences of symbols may prove to be a valuable line of effort. Supervisory problems in dealing with very intelligent machines are discussed, parallels are shown to the problem of dealing with ambitious subordinates, and a remedy for the case of the machines is proposed.
Some strong heuristic arguments have been obtained for the equivalence of the last three models. One of these models is equivalent to a Bayes formulation, in which a priori probabilities are assigned to sequences of symbols on the basis of the lengths of inputs to a universal Turing machine that are required to produce the sequence of interest as output.
Though it seems likely, it is not certain whether the first of the four models is equivalent to the other three.
Few rigorous results are presented. Informal investigations are made of the properties of these models. There are discussions of their consistency and meaningfulness, of their degree of independence of the exact nature of the Turing machine used, and of the accuracy of their predictions in comparison to those of other induction methods.
In Part II these models are applied to the solution of three problems.
Though some approximations are used, the first of these problems is treated most rigorously. The result is Laplace's rule of succession.
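For reference, Laplace's rule of succession estimates the probability of success on the next trial, after observing s successes in n independent trials, as

    \[
        P(\text{success on trial } n+1 \mid s \text{ successes in } n \text{ trials}) \;=\; \frac{s+1}{n+2}
    \]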
The solution to the second problem uses less certain approximations, but the properties of the solution that are discussed are fairly independent of these approximations.
The third application, using phrase structure grammars, is the least exact of the three. First a formal solution is presented. Though it appears to have certain deficiencies, it is hoped that presentation of this admittedly inadequate model will suggest acceptable improvements to it. This formal solution is then applied in an approximate way to the determination of the "optimum" phrase structure grammar for a given set of strings. The results that are obtained are plausible, but subject to the uncertainties of the approximation used.
Some examples are worked out to show the application of the method to specific problems. Applications of the method to curve fitting and other continuous problems are discussed to some extent. Some alternative theories of inductive inference are presented whose validities appear to be corollaries of the validity of the first method described.
It discusses Chomsky's language types and their complexity, how inductive inference can best be applied, and how to decide on the "goodness of fit" of possible predicted sentences to be included in the language.
In the very extensive Second Appendix, the a priori likelihood of a language is used as a quantification of complexity, the difference between simple and complex languages is discussed, and a formula is given for the probability that a particular symbol sequence is generated by a particular stochastic language that generated other members of the symbol sequence set. Bayes rule is used in discovering the best conditional probability for the full set of sequences generated by the language.
A Preliminary Report on
A General Theory
Of Inductive Inference (1960) Abstract
Some preliminary work is presented on a very general new theory of inductive inference. The extrapolation of an ordered sequence of symbols is implemented by computing the a priori probabilities of various sequences of symbols. The a priori probability of a sequence is obtained by considering a universal Turing machine whose output is the sequence in question. An approximation to the a priori probability is given by the shortest input to the machine that will give the desired output. A more exact formulation is given, and it is made somewhat plausible that the extrapolation probabilities obtained will be largely independent of just which universal Turing machine was used, provided that the sequence to be extrapolated has an adequate amount of information in it.
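In later notation (my transcription), the shortest-input approximation mentioned here is

    \[
        P(x) \;\approx\; 2^{-|p_{\min}(x)|}
    \]

where p_min(x) is the shortest input that causes the universal Turing machine to output x and |·| is length in bits; the more exact formulation adds the contributions of all inputs that produce x.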
--------------------------------------------
A Progress Report on Machines to Learn to Translate Languages
and Retrieve Information
(1959) Abstract
This describes recent work on inductive inference techniques applied to a formal language, defined to be a finite or infinite set of finite sequences of symbols. It defines a grammar for such a language to be a set of rules by which all sentences in the language can be generated, or a test to tell whether given sentences are part of the language.
--------------------------------------------