Abstracts of Selected Publications
of Ray Solomonoff

Contact Information:
Grace Solomonoff; Box 400404;
Cambridge, Mass. 02140; U.S.A.
prettyvivo@gmail.com

---------------------------------------------

Algorithmic Probability:
Theory and Applications

     Abstract

We first define Algorithmic Probability, an extremely powerful method of inductive inference. We discuss its completeness, incomputability, diversity and subjectivity and show that its incomputability in no way inhibits its use for practical prediction. Applications to Bernoulli sequence prediction and grammar discovery are described. We conclude with a note on its employment in a very strong AI system for very general problem solving.

---------------------------------------------

Three Kinds of Probabilistic Induction:
Universal Distributions and Convergence Theorems

     Abstract

We will describe three kinds of probabilistic induction problems, and give general solutions for each , with associated convergence theorems that show they tend to give good probability estimates.

The first kind extrapolates a sequence of strings and/or numbers.

The second extrapolates an unordered set of strings and/or numbers.

The third extrapolates an unordered set of ordered pairs of elements that may be strings and/or numbers. Given the first element of a new pair, to get a probability distribution on possible second elements of the pair.

Each of the three kinds of problems is solved using an associated universal distribution. In each case a corresponding convergence theorem is given, showing that as sample size grows, the expected error in probability estimate decreases rapidly.

The solutions given are very general and cover a great variety of induction problems. Time series prediction, grammar discovery (for both formal and natural languages), curve fitting, the identification problem and the categorization problem, are a few of the kinds of problems amenable to the methods described.

The Probability of "Undefined" (Non-converging) Output
in Generating the Universal Probability Distribution

     Abstract

In order to generate a universal probability distribution to extrapolate a binary string x of length i , we feed random bits into a universal device, M . When we find an input string that gives an output matching x , we continue the successful input with random bits until M produces a zero or one as output. The relative probabilities of these two continuations can give a normalized prediction for the probability of the symbol following x. There is, however, a probability, P_{i+1}(u) that the continued random input string will not generate any output for the (i+1)th symbol. We will show that the expected value of the sum of these P_i(u)'s is ≤ to kln2, k being the Kolomogorov complexity with respect to M, of the distribution that generated x.

---------------------------------------------

Machine Learning --- Past and Future

     Abstract

I will first discuss current work in machine learning -- in particular, feedforeword Artificial Neural Nets (ANN), Boolean Belief Nets (BBN), Support Vector Machines (SVM), Radial Basis Functions (RBF) and Prediction by Partial Matching (PPM). While they work quite well for the types of problems for which they have been designed, they do not use recursion at all and this severely limits their power.

Among techniques employing recursion, Recurrent Neural Nets, Context Free Grammar Discovery, Genetic Algorithms, and Genetic Programming have been prominent.

I will describe the Universal Distribution, a method of induction that is guaranteed to discover any describable regularities in a body of data, using a relatively small sample of the data. While the incomputability of this distribution has sharply limited its adoption by the machine learning community, I will show that paradoxically, this incomputability imposes no limitation at all on its application to practical prediction.

My recent work has centered mainly on two systems for machine learning. The first might be called "The Baby Machine" We start out with the machine having little problem specific knowledge, but a very good learning algorithm. At first we give it very simple problems. It uses its solutions to these problems to devise a probability distribution over function space to help search for solutions to harder problems. We give it harder problems and it updates its probability distribution on their solutions. This continues recursively, solving more and more difficult problems.

The task of writing a suitable training sequence has been made much easier by Moore's Law, which gives us enough computer speed to enable large conceptual jumps between problems in the sequence.

A more difficult task is that of updating the probability distribution when new problems are solved. I will be using some of the current techniques for machine learning in this area: PPM and Grammar Discovery are particularly promising. It is my impression that the probability models need not be recursive for the initial learning. Later the system can, with suitable training, work on the problem of updating its own probability distribution, using fully recursive models.

Genetic Programming is my second area of recent research. Koza's Genetic Programming system has been able to solve many difficult problems of very varied kinds. The system itself is extremely slow, however -- it has taken a cluster of 1000 Pentiums about a month to solve some of the more difficult problems. I have found a large number of modifications of the system that I expect will dramatically improve its speed and broaden the scope of problems it can solve.

The Future --- The cost halving constant in Moore's law is now about 18 months. When A. I. pays a significant role in the reduction of this time constant, we begin to move toward the singularity. At the present time I believe we have a good enough understanding of machine learning, for this to take place. While I hesitate to guess as to just when the singularity will occur, I would be much surprised if it took as much as 20 years. As for the next 50 years of A.I., I feel that predicting what will happen after this singularity is much like guessing how the universe was before the Big Bang -- It's a real discontinuity!

---------------------------------------------

The Universal Distribution and Machine Learning

     Abstract

This report covers two main topics:

First, the Universal Probability Distribution and some of its properties: its accuracy, its incomputability, its subjectivity.

Second, how to use this distribution to create very intelligent machines.

The report shows how the Universal Probability Distribution --- the probability distribution on all possible output strings of a universal computer with random input --- solves many kinds of prediction problems and resolves serious difficulties in the foundations of Bayesian Statistics.

It discusses the Convergence Theorem and how the universal distribution gives very good probability estimates with varied types of data.

It explains what incomputability is and it's value.

Then discusses the important difference between the Universal Distribution and a common interpretation of "Universality"--- where universal would imply that we can usefully employ the same Universal Distribution for all problems. However, to get good predictions it is usually necessary to use a different Universal Distribution for each Problem Domain. The meaning and value of this is clarified.

Then the report discusses how all these aspects may be used in developing intelligent machines. Notes the delayed, gradual acceptance of subjective Bayesian probability in the AI community. Outlines the general system, how it solves problems with both probabilistic and deterministic answers and how learning is an integral part of it.

---------------------------------------------

Progress in Incremental Machine Learning
Preliminary Report for NIPS 2002 Workshop

     Abstract

We will describe recent developments in a system for machine learning that we've been working on for some time (Sol 86, Sol 89). It is meant to be a ``Scientist's Assistant'' of great power and versatility in many areas of science and mathematics. It differs from other ambitious work in this area in that we are not so much interested in knowledge itself, as we are in how it is acquired - how machines may learn. To start off, the system will learn to solve two very general kinds of problems. Most, but perhaps not all problems in science and engineering are of these two kinds.

The first kind is Function Inversion. These are the P and NP problems of computational complexity theory. They include theorem proving, solution of equations, symbolic integration, etc.

The second kind of problem is Time Limited Optimization. Inductive inference of all kinds, surface reconstruction, and image restoration are a few examples of this kind of problem. Designing an automobile in 6 months satisfying certain specifications and having minimal cost, is another.

In the following discussion, we will be using the term ``Probability'' in a special sense: i.e. the estimate given by the best probabilistic model for the available data that we can find in the available time.

Our system starts out with a small set of Problem Solving Techniques (PST's) and a simple General Conditional Probability Distribution (GCPD). When the system is given a problem, the description of this problem is the ``Condition'' for the GCPD. Its output is a probability distribution on PST's - the likelihood that each of them will be the best way to solve the problem. It uses these PST's and their associated probabilities to solve the first problem.

Next, it executes its Update Algorithm: The PST's are modified, new ones may be added, some may be deleted. The GCPD is modified. These changes attempt to incorporate into the system, information obtained in solving recent problems and prepare it to solve harder problems.

The next problem presentation and the system's solution is followed by the corresponding update and so on. After the nth update, the system is usually able to work problems more difficult than the nth. Giving the system a suitable training sequence of problems of progressive difficulty, makes it possible for the system to eventually solve very hard problems.

One critical component in the system is the initial set of PST's. These include Levin's Universal Search Algorithm (Lsearch). Simple Lsearch is only practical for very small problems, since its solution time is exponential in problem size. In the update phase, we modify the probability distribution that is used to guide Lsearch. This effectively reduces the sizes of more difficult problems, so that Lsearch becomes feasible for them.

The update algorithm is another critical component. It has been designed to facilitate ``transfer learning'', so that the system can utilize any similarities between problems in the same or disparate domains. It has been possible to express this algorithm as an inductive inference problem of a type the system normally solves - so we can simply ask the system to update itself as one of its regular problems.

Perhaps the most critical component of all is the training sequence - To devise a sequence of problems so that each of them challenges the system to devise new concepts to solve it, yet keeps the challenges within the capacity of the machine. For PST's that use simple Lsearch, this is not hard to do. It is always possible to get a good estimate of the time needed for the system to find a known solution to a problem. For certain modifications of Lsearch this is not so easy and designing training sequences for them can become as difficult as teaching people!

The system's generality invites comparison with Genetic Programming. We will discuss differences in the two approaches to optimization problems, and suggest a way in which Genetic Programming might be made a part of the present system.

Another way to look at the system is to think of it as a ``Wrapper'' system - a system that modifies its parameters in view of its experience with earlier problem solving. The present system is an ``Exteme Wrapper System'' in that - through the Magic of Levin's Universal Search - it is possible for the entire system to effectively replace itself by one that is more efficient.

---------------------------------------------

Two Kinds of Probabilistic Induction

     Abstract

Problems in probabilistic induction are of two general kinds. In the first, we have a linearly ordered sequence of symbols that must be extrapolated. In the second we want to extrapolate an unordered set of finite strings.

A very general formal solution to the first kind of problem is well known and much work has been done in obtaining good approximations to it [ Sol 64a, Sol 64b,Wal 68, Sol 78, Ris 78, Ris 89, Li 93].

Though the second kind of problem is of much practical importance, no general solution has been published. We present two general solutions for unordered data.

We also show how machines can be constructed to summarize sequential and unordered data in optimum ways.

---------------------------------------------

The Discovery of Algorithmic Probability

     Abstract

A detailed autobiographical and technical description of the path to the discovery of Algorithmic Probability.

Algorithmic Probability is a general technique for solving mathematical problems and for developing and measuring scientific theories. The problems are related to induction and in particular, induction and prediction when the model may be probabilistic.

We describe motivations, major influences and the steps taken to develop this system of induction and prediction. Among others, major influences include Carnap, Chomsky, Shannon, McCarthy and Minsky. We discuss the importance of Turing Machines and new computer coding ideas. Bayes rule is especially important for the development of quantitative values for prediction, and prediction in complex situations. The Search method developed by Leonid Levin is discussed.

---------------------------------------------

Does Algorithmic Probability Solve the Problem of Induction?

     Abstract

We will begin with a definition of Algorithmic Probability (ALP), and discuss some of its properties. From these remarks it will become clear that it is extremely effective for computing probabilities of future events - the best technique we have. As such, it gives us an ideal theoretical solution to the problem of inductive inference. I say ``theoretical'' because any device as accurate as ALP must necessarily be incomputable.

For practical induction we use a set of approximations of increasing power to approach ALP. This set is called Resource Bounded Probability (RBP), and it constitutes a general solution to the problem of practical induction. Some of its properties are quite different from those of ALP.

The rest of the paper will discuss philosophical and practical implications of the properties of ALP and RBP.

It should be noted that the only argument that need be considered for the use of these techniques in induction is their effectiveness in getting good probability values for future events. Whether their properties are in accord with our intuitions about induction is a peripheral issue. The main question is ``Do they work?'' and we show that they do work.

---------------------------------------------

A System for Incremental Learning Based on
Algorithmic Probability

     Abstract

We have employed Algorithmic Probability Theory to construct a system for machine learning of great power and generality. The principal thrust of present research is the design of sequences of problems to train this system.

Current programs for machine learning are limited in the kinds of concepts accessible to them, the kinds of problems they can learn to solve, and in the efficiency with which they learn - both in computation time needed and/or in amount of data needed for learning.

Algorithmic Probability Theory provides a general model of the learning process that enables us to understand and surpass many of these limitations.

Starting with a machine containing a small set of concepts, we use a carefully designed sequence of problems of increasing difficulty to bring the machine to a high level of problem solving skill.

The use of training sequences of problems for machine knowledge acquisition promises to yield Expert Systems that will be easier to train and free of the brittleness that characterizes the narrow specialization of present day systems of this sort.

It is also expected that the present research will give needed insight in the design of training sequences for human learning.

---------------------------------------------

The Application of Algorithmic Probability
to Problems in Artificial Intelligence

     Abstract

This covers two topics:

The first topic: Algorithmic Probability --- the motivation for defining it, how it overcomes difficulties in other formulations of probability, some of its characteristic properties and successful applications. Algorithmic Probability of a string is the probability of that particular string being produced as the output of a reference universal Turing machine with random input. This revision of our definition of probability resolves many problems of statistics that have plagued classical concepts of probability. Discusses aspects of algorithmic probability, such as completeness and incomputability.

The second topic: Discusses how to apply it to problems in A.I. --- where it promises to give near optimum search procedures for two very broad classes of problems. First, machine inversion problems: given a string, s and a machine, M, that maps strings to strings, find in minimum time, a string, s, such that M(x) = s. Solving algebraic equations is an example. Second, time limited optimization problems. Given a time limit T, and a machine M that maps strings to real numbers, find within time T the string x such that M(x) is as large as possible. Many engineering problems are of this sort --- for example, designing an automobile in 6 months.

Discusses how Algorithmic Complexity theory can find general laws in masses of data. Different problem applications and a special search procedure, Levin Search, are discussed, as well as the current state of the general system using algorithmic probability.

---------------------------------------------

The Time Scale of Artificial Intelligence:
Reflections on Social Effects

     Abstract

Six future milestones in AI are discussed. These range from the development of a very general theory of problem solving to the creation of machines with capacities well beyond those of a single human.

Estimates are made for when these milestones will occur, followed by some suggestions for the more effective utilization of the extremely rapid technological growth that is expected.

---------------------------------------------

Complexity-Based induction Systems:
Comparisons and Convergence Theorems

     Abstract

In 1964 the author proposed as an explication of a prior probability, the probability measure induced on output strings by a universal Turing machine with unidirectional output tape and a randomly coded unidirectional input tape.

Levin has shown that if the expression
  ~
(P)'M(x) is an unnormalized form of this measure, and P(x) is any computable probability measure on strings, x, then

  ~
(P)'M(x) is greater than or equal to CP(x)

where C is a constant independent of x. The corresponding result for the normalized form of this measure (P)'M(x) is directly derivable from Willis' probability measures on non-universal machines. If the conditional probabilities of (P)'M(x) are used to approximate those of P, then the expected value of the total squared error in these conditional probabilities is bounded by
-(l/2) ln C.

With this error criterion, and when used as the basis of a universal gambling scheme, (P)'M(x) is superior to Cover’s measure b*. When H* ≡ -log2, (P)'M(x) is used to define the entropy of a finite sequence, the equation
H*(x,y)= H*(x)+ H,*(y) holds exactly, in contrast to Chaitin’s entropy definition, which has a nonvanishing error term in this equation.

---------------------------------------------

Inductive Inference Theory --- A Unified Approach
to Problems in Pattern Recognition and Artificial Intelligence

     Abstract

Recent results in induction theory are reviewed that demonstrate the general adequacy of the induction system of Solomonoff and Willis.

Several problems in pattern recognition and A.I. are investigated through these methods. The theory is used to obtain the a priori probabilities that are necessary in the application of stochastic languages to pattern recognition. A simple, quantitative solution is presented for part of Winston's problem of learning structural descriptions from examples. In contrast to work in non-probabilistic prediction, the present methods give probability values that can be used with decision theory to make critical decisions.

---------------------------------------------

The Adequacy of Complexity Models of Induction

     Abstract

This report compares different models that use complexity measures for prediction and defining randomness: discusses the earliest induction method by Solomonoff, Willis' development of this system, Kolmogorov's and Chaitin's defining of randomness, and Cover's system.

An approach to defining randomness of a finite sequence is that all future continuations of the sequence are equally likely.

The systems described make it possible to put such a definition into an exact form and analyze its properties.

The most important open problem is the practical realization of systems like those described.

---------------------------------------------

Inductive Inference Research Status, Spring 1967

     Abstract

Previously, four theories of inductive inference were proposed, and the present work is concerned with putting these theories on a firmer theoretical footing. The four theories are now found to be effectively equivalent, at varying levels of rigor. It is shown that for sufficiently long sequences of symbols, the theories actually do give the correct probabilities for future symbols.

Attempts are made to implement induction theories through a general approach to networks of tasks. Applications of induction theory to man--to--animal communication are considered, and some new approaches to interspecies communication are suggested.

It is now shown that the induction theories developed for discrete symbols can be applied to prediction of continuous data.

Computer implementation of induction through use of definitions of subsequences of symbols may prove to be a valuable line of effort. Supervisory problems in dealing with very intelligent machines are discussed, parallels are shown to the problem of dealing with ambitious subordinates, and a remedy for the case of the machines is proposed.

---------------------------------------------

A Formal Theory of Inductive Inference, Part I

     Abstract

In Part I, four ostensibly different theoretical models of induction are presented, in which the problem dealt with is the extrapolation of a very long sequence of symbols---presumably containing all of the information to be used in the induction. Almost all, if not all problems in induction can be put in this form.

Some strong heuristic arguments have been obtained for the equivalence of the last three models. One of these models is equivalent to a Bayes formulation, in which a priori probabilities are assigned to sequences of symbols on the basis of the lengths of inputs to a universal Turing machine that are required to produce the sequence of interest as output.

Though it seems likely, it is not certain whether the first of the four models is equivalent to the other three.

Few rigorous results are presented. Informal investigations are made of the properties of these models. There are discussions of their consistency and meaningfulness, of their degree of independence of the exact nature of the Turing machine used, and of the accuracy of their predictions in comparison to those of other induction methods.

In Part II these models are applied to the solution of three problems.

---------------------------------------------

A Formal Theory of Inductive Inference, Part II

     Abstract

In Part II, the four theoretical models of induction that were presented in Part I are applied to the solution of three problems---prediction of the Bernoulli sequence, extrapolation of a certain kind of Markov chain, and the use of phrase structure grammars for induction.

Though some approximations are used, the first of these problems is treated must rigorously. The result is Laplace's rule of succession.

The solution to the second problem uses less certain approximations, but the properties of the solution that are discussed, are fairly independent of these approximations.

The third application, using phrase structure grammars, is least exact of the three. First a formal solution is presented. Though it appears to have certain deficiencies, it is hoped that presentation of this admittedly inadequate model will suggest acceptable improvements in it. This formal solution is then applied in an approximate way to the determination of the "optimum" phrase structure grammar for a given set of strings. The results that are obtained are plausible, but subject to the uncertainties of the approximation used.

--------------------------------------------

A Preliminary Report on
A General Theory Of Inductive Inference

     Abstract

Some preliminary work is presented on a very general new theory of inductive inference. The extrapolation of an ordered sequence of symbols is implemented by computing the apriori probabilities of various sequences of symbols. The apriori probability of a sequence is obtained by considering a universal Turing machine whose output is the sequence in question. An approximation to the apriori probability is given by the shortest input to the machine that will give the desired output. A more exact formulation is given, and it is made somewhat plausible that extrapolation probabilities obtained will be largely independent of just which universal Turing machine was used, providing that the sequence to be extrapolated has an adequate amount of information in it.

Some examples are worked out to show the application of the method to specific problems. Applications of the method to curve fitting and other continuous problems are discussed to some extent. Some alternative theories of inductive inference are presented whose validities appear to be corollaries of the validity of the first method described.

--------------------------------------------

A Progress Report on Machines to Learn to Translate Languages and Retrieve Information

     Abstract

This describes recent work on inductive inference techniques applied to a formal language defined to be a finite or infinite set of finite sequences of symbols. It defines grammar for such a language to be a set of rules by which all sentences in the language can be generated, or a test to tell if the sentences are part of the language.

It discusses Cholmsky's language types and their complexity; how inductive inference can be best applied, and how to decide on the "goodness of fit" of possible predicted sentences to be included in the language.

In the very extensive Second Appendix, the a priori likelihood of a language is used as a quantification of complexity, the difference between simple and complex languages discussed, and the formula given for the probability that a particular symbol sequence is generated by a particular stochastic language that generated other members of the symbol sequence set. Bayes rule is used in discovering the the best conditional probability for the full set of sequences generated by the language.

--------------------------------------------

An Inductive Inference Machine

     Abstract

A machine is described which is designed to operate as human beings seem to. Inductive inferences are made by classifying events and their outcomes within categories that have been useful in the past, and by means of a small set of transformations, the system derives new categories that are likely to be useful in the future.

These are tested empirically for usefulness in prediction, and useful ones combined with older useful categories to create new categories. These in turn are tested and the process is repeated again and again.

A simplified system has been developed; it's attributes are described, and some future aspects, such as a system to improve itself are considered.

A preliminary analysis of the relation of such systems to the work of Chomsky on English grammar is discussed.

--------------------------------------------

An Inductive Inference Machine

     Abstract

A machine is described which is designed to operate as human beings seem to. Inductive inferences are made by classifying events and their outcomes within categories that have been useful in the past, and by means of a small set of transformations, the system derives new categories that are likely to be useful in the future.

These are tested empirically for usefulness in prediction, and useful ones combined with older useful categories to create new categories. These in turn are tested and the process is repeated again and again.

A simplified system has been developed; it's attributes are described, and some future aspects, such as a system to improve itself are considered.

A preliminary analysis of the relation of such systems to the work of Chomsky on English grammar is discussed.

--------------------------------------------

---------------------------------------------