\documentclass[twocolumn]{article}
\begin{document}
\title{{\small 29 June 2003}\\
Three Kinds of Probabilistic Induction: Universal Distributions and
Convergence Theorems }
\author{Ray J. Solomonoff}
\date{{\small Visiting Professor, Computer Learning Research Centre \\
Royal Holloway, University of London \\[2mm] IDSIA, Galleria 2,
CH--6928 Manno--Lugano, Switzerland
\\ rjsolo@ieee.org\hspace{3em}http://world.std.com/\~{ }rjs/pubs.html}}
\maketitle
\begin{abstract}
We will describe three kinds of probabilistic induction problems, and
give general solutions for each, with associated convergence
theorems that show they tend to give good probability estimates.
The first kind extrapolates a sequence of strings and/or numbers.
The second extrapolates an unordered set of strings and/or numbers.
The third extrapolates an unordered set of ordered pairs of elements
that may be strings and/or numbers: given the first element of a new
pair, it gives a probability distribution on possible second elements
of the pair.
Each of the three kinds of problems is solved using an associated
universal distribution. In each case a corresponding convergence
theorem is given, showing that as sample size grows, the expected
error in probability estimate decreases rapidly.
The solutions given are very general and cover a great variety of
induction problems. Time series prediction, grammar discovery (for
both formal and natural languages), curve fitting, the identification
problem and the categorization problem, are a few of the kinds of
problems amenable to the methods described.
\end{abstract}
\section*{Introduction}
Problems in probabilistic induction are of three general kinds:
In the first, we are given a linearly ordered sequence of symbols to
extrapolate. There is a very general solution to this problem using
the universal probability distribution, and much has been written on
finding good approximations to it (Sol 60, Sol 64a, Sol 64b, Wal 68,
Wal 87, Wil 70, Ris 78, Ris 87). It has been shown that for long
sequences, the expected error in probability estimates converges
rapidly toward zero (Sol 78).
In the second kind of problem, we want to extrapolate an
{\it unordered} sequence of finite strings and/or numbers. A
universal distribution has been defined that solves this problem (Sol
99). We will give a convergence theorem that shows it to give small
errors as the number of examples increases --- just as with
sequential predictions.
In the third kind, operator induction, we have an unordered sequence of ordered
pairs of elements $(Q_i, A_i)$ (that may be strings and/or numbers).
Given a new $Q_i$, we want the probability distribution over
possible $A_i$'s. The $Q$'s can be questions in some formal or
natural language, the $A$'s can be associated answers. The $Q$'s can
be inputs to some unknown stochastic device and the $A$'s can be
outputs (The Identification Problem). The $Q$'s can be descriptions of
mushrooms, the $A$'s can tell if they are edible or poisonous (The
Categorization Problem). The $Q$'s can be numbers and the $A$'s can
be exact or noisy values of some unknown function of those numbers
(The Curve Fitting Problem).
We will give two solutions to this problem based on universal
distributions, and give associated convergence theorems that affirm
their precision in prediction.
Section 1 deals with sequential prediction and its
universal distribution. This is followed by a convergence theorem
for the normalized distribution and some more recent generalizations
of it.
Section 2 deals with the extrapolation of a set of unordered
strings and/or numbers, and gives an associated convergence
theorem.
Section 3 deals with Operator Induction, and gives the
associated convergence theorem.
\section{Sequential prediction}
The universal distribution for sequential prediction is a
probability distribution on strings obtained by assuming the strings
are the output of a universal machine with random input. We will at
first consider only universal Turing machines with binary
unidirectional input and output tapes and an infinite bidirectional
work tape. It is possible to get equivalent distributions using more
general kinds of universal devices with less constrained input and
output.
How can we use this definition to get an expression of the
probability of a particular finite string, $x$?
Let $[S_k]$ be the set of all binary programs for our
reference machine, $M$, such that $M(S_k)$ gives an output with
$x$ as prefix. To prevent double counting we have an additional
constraint on the set $[S_k]$: dropping the last bit of the
string $S_k$ gives a program with output that does not have
$x$ as prefix. With this condition the probability of $x$ becomes
the sum of the probabilities of all of its programs:
\begin{equation} P_M(x) = \sum_k 2^{-|S_k|}
\end{equation}
$|S_k|$ is the number of bits in $S_k$ and $2^{-|S_k|}$ is the
probability of an input that has $S_k$ as prefix.
Because certain of the codes $S_k$ do not result in useful
output (i.e., the machine prints out part of $x$ but continues to
calculate without printing anything else), the resultant probability
distribution is not a measure, but a semimeasure --- i.e.,
\[ P_M(x0) + P_M(x1) < P_M(x)
\]
For our first prediction method, we will normalize $P_M$ to create
$P'_M$
\[ P'_M(x0)=\frac{P_M(x0)}{P_M(x0)+P_M(x1)}P'_M(x)
\]
\begin{equation} P'_M(x1)=\frac{P_M(x1)}{P_M(x0)+P_M(x1)}P'_M(x)
\end{equation}
\[ P'_M( \wedge)= 1
\]
Though there are other possible
methods of normalization, it is not difficult to show that the
equations of (2) give the maximum $P'_M(x)/P_M(x)$ for all $x$. Later
we will show that this condition minimizes the expected prediction
error of $P'_M$.
It is easy to use $P'_M$ for prediction:
\[ P(x1|x) = P'_M(x1)/ P'_M(x) \quad\mbox{and}
\]
\begin{equation} P(x0|x) = P'_M(x0)/ P'_M(x)
\end{equation}
Just how accurate are the predictions of $P'_M$?
Suppose you have a device $\mu$, generating binary sequences according to
some finitely describable stochastic rules. It gives a probability
for each of the bits it generates. If you use the universal
distribution to get probabilities for each of the bits, there will be
a difference between the two probabilities. If you square these
probability differences and add them up, the expected value of the
sum is bounded by $-1/2 \ln P'_{M,\, \mu}$. $P'_{M,\, \mu}$ is the
probability that the universal distribution assigns to $\mu$, the
generator of the data (Sol 78, p. 426).
More exactly:
$\mu(x_{n+1} = 1|x_1 \: x_2 \: x_3 ....x_n)$ is the conditional
probability distribution according to $\mu$ that the $(n+1)^{th}$ bit of
a binary string is 1, given the previous $n$ bits, $x_1 \: x_2 \: x_3
... x_n$.
$P^\prime_M(x_{n+1} = 1|x_1 \: x_2 \: x_3 ....x_n)$ is the corresponding probability
for $P^\prime_M$.
$x = x_1 \: x_2 \: x_3....x_n$ is a string constructed using $\mu$ as
a stochastic source.
Both $\mu$ and $P^\prime_M$ are able to assign
probabilities to the occurrence of the symbol 1 at any point in the
sequence $x$ based on the previous symbols in $x$.
The Convergence Theorem says that the total expected
squared error between $\mu$ and $P^\prime_M$ is given by
\[ E \raisebox{-1.5ex}{\makebox[0pt][r]{$\scriptstyle \mu\:$}} \:
\sum^n_{m=1} (P^\prime_M(x_{m+1} = 1|x_1 \: x_2 \: x_3....x_m)\]
\begin{equation} -\mu(x_{m+1} = 1|x_1 \: x_2 \: x_3....x_m))^2
< -1/2 \ln P'_{M,\,\mu}
\end{equation}
The expected value is with respect to the probability distribution
$\mu$.
$-\ln P'_{M,\,\mu}$ depends on just what universal device
generated the universal distribution. It is approximately $K \ln 2$,
where $K$ is the Kolmogorov complexity of the generator --- the length
of the shortest program needed to describe it.
Since this total error is independent of the size of the data string
being predicted, it is clear that the errors in the individual bits
must decrease more rapidly than $1/n$, $n$ being the length of the data
sequence.
This is a very powerful result. It is clear that the universal
distribution gives \emph{very good} probability estimates.
The truth of (4) hinges on the fact that if $\mu$ is a
computable probability measure then there exists a positive constant
$P'_{M,\,\mu}$ such that
\[ P'_M(x)/\mu(x) > P'_{M,\,\mu}
\]
and that while $P'_{M,\,\mu}$ will depend on $\mu(\cdot)$ and
$P'_M(\cdot)$, it will be independent of $x$.
Eq. (4) can be
usefully generalized:
IF
$P_1$ and $P_2$ are any normalized measures on binary strings,
$x(n)$ is a string of length $n$, and
\begin{displaymath}P_2(x(n))/P_1(x(n)) > \alpha(n) > 0
\end{displaymath}
--- where $\alpha(n)$ is a function of $P_1(\cdot)$, $P_2(\cdot)$ and
$n$, but not of $x$ ---
THEN
\[ E \raisebox{-1.5ex}{\makebox[0pt][r]{$\scriptstyle P_2\:$}} \:
\sum^n_{m=1} (P_1(x_{m+1} = 1|x_1 \: x_2 \: x_3....x_m)\]
\begin{equation} -P_2(x_{m+1} = 1|x_1 \: x_2 \: x_3....x_m))^2
< -1/2 \ln \alpha(n) \end{equation}
The Convergence Theorem of (4) is true if $P'_M$ is a {\em normalized}
universal measure. Peter G\'acs (Gac 97) has shown it to be true for
the {\em unnormalized} semimeasure, $P_M$, but the associated convergence
constant $-1/2 \ln P_{M,\,\mu}$ is much larger than the corresponding
constant, $-1/2 \ln P'_{M,\,\mu}$, for $P'_M$.
The difference between them is
\begin{displaymath}1/2 \ln (P'_{M,\,\mu}/P_{M,\,\mu})
\end{displaymath}
$P'_{M,\,\mu}/P_{M,\,\mu}$ is the ratio of the values of the normalization factors for $n=\infty$. We have
selected a normalization technique to make it as large as possible.
The result is that the probability errors for the normalized measure,
$P'_M(\cdot)$, can converge much more rapidly than those for the
semimeasure, $P_M(\cdot)$.
G\'acs (ibid) also shows that the generalization corresponding to
eq. 5 holds if $P_2(\cdot)$ is an unnormalized semimeasure.
Marcus Hutter (Hut 02) shows that these results hold if we use
alphabets with any finite number of symbols.
In the foregoing convergence theorems, the total squared probability
difference is used as the loss function. The proofs of the theorems also
show the same convergence for the Kullback--Leibler loss function
(which is greater than or equal to the squared loss function ---
resulting in stronger theorems).
Hutter (ibid) considers more general loss functions for the universal
distribution and obtains associated convergence theorems.
\section{Induction on Unordered Sets}
\subsection{The Problem and a Solution}
We have an unordered set of $n$ finite strings of symbols, $D_1,
D_2 \ldots D_n$. Given a new string, $D_{n+1}$, what is the
probability that it belongs to the set? Or --- given a string, $a$,
how must it be completed so it is most likely to be a member of the
set? Or, given a string $a$ and a set of possible completions,
$[ab_j]$, what is the relative probability of each of these
completions?
A common example of unordered set prediction occurs in natural
and formal languages. We are given a set of examples of strings that
are acceptable sentences. Given a new string, what is the
probability that it is acceptable? A common solution technique is to
devise a well-fitting stochastic grammar for the known set of
strings. The universal distribution gives a criterion for goodness
of fit of such grammars (Hor 71, Sol 64b pp. 240--251).
The Universal Distribution $P_M$ is a weighted sum of all
finitely describable semimeasures on finite strings:
\begin{equation} P_M([D_i])=\sum_j \alpha_j \prod^n_{i=1} P_j(D_i)
\end{equation}
$n$ is the number of strings in the set $[D_i]$.
$\alpha_j$ is the weight of the $j^{th}$ semimeasure on finite
strings.
$\alpha_j = 2^{-|a_j|}$, where $a_j$ is the shortest description of
$P_j(\cdot)$ and $|a_j|$ is the number of bits in $a_j$.
The $M$ subscript of $P_M$ indicates that the functions $P_j$ are
to be described with reference to machine, $M$. Since $M$ is
universal, it can be used to describe any describable function.
Suppose that $[D_i],\ i=1 \ldots n$, is a set of $n$ strings generated
by some unknown stochastic device, $\mu(\cdot)$. What is the
probability that our universal distribution assigns to a new
string, $D_{n+1}$?
It is just
\begin{equation} P(D_{n+1})=P_M([D_i] \cup D_{n+1})/P_M([D_i])
\end{equation}
The probability assigned to $[D_i]$ by its creator, $\mu(\cdot)$, is
\begin{equation} \mu([D_i]) = \prod^n_{i=1}\mu(D_i)
\end{equation}
For a suitable set of strings, $[D_i]$, the probabilities assigned by
$P_M$ in (6) can be very close to those assigned by $\mu(\cdot)$, the
generator of $[D_i]$, in (8). In section 3, we will discuss Operator
Induction and prove an associated convergence theorem. Section 3.3
shows that this convergence theorem implies a convergence theorem for
(6), insuring small expected errors between the probability estimates
of $P_M(\cdot)$ and those of $\mu(\cdot)$.
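Eqs. (6) and (7) can be sketched with a tiny invented family of string distributions in place of all finitely describable semimeasures. In the Python sketch below, each hypothetical component with bias $b$ emits a 1 with probability $b$ and a 0 with probability $1-b$ at each position, and ends the string with a fixed stop probability; the predictive probability of a new string is the ratio of two mixture values, as in (7).

```python
import math

# Sketch of eqs. (6)-(7): extrapolating an unordered set of strings
# with a tiny invented mixture in place of the universal distribution.

STOP = 0.5

def p_string(s, b):
    """Probability of the finite binary string s under the component
    with bias b (a normalized distribution over finite strings)."""
    p = 1.0
    for c in s:
        p *= (1 - STOP) * (b if c == "1" else 1 - b)
    return p * STOP                          # stop exactly at the end

biases = [0.2, 0.5, 0.8]                     # invented component family
alpha = [1.0 / len(biases)] * len(biases)    # weights alpha_j

def mixture(strings):
    """Eq. (6): sum_j alpha_j * prod_i P_j(D_i) over the unordered set."""
    return sum(a * math.prod(p_string(s, b) for s in strings)
               for a, b in zip(alpha, biases))

data = ["11", "1", "111", "11"]              # observed unordered set
# Eq. (7): P(D_{n+1}) = P_M([D_i] + [D_{n+1}]) / P_M([D_i])
print(mixture(data + ["11"]) / mixture(data))
```

Because the observed set is dominated by 1s, the mixture assigns the candidate continuation "11" a higher predictive probability than "00".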
\section{Operator Induction}
In the Operator Induction problem, we are given an unordered set of
$n$ strings and/or number pairs, $[Q_i,A_i]$. Given a new $Q_{n+1}$,
what is the probability distribution over all possible $A_{n+1}$? We
will give two solutions.
\subsection{ }
In the first, we consider the problem to be an extrapolation of an
unordered set of finite strings, $D_i$, in which $D_i = (Q_i, A_i)$.
Eq. (6) is used to obtain a probability distribution on all unordered
sets of $Q_i,A_i$ pairs, and (7) gives us a probability
distribution over $(Q_{n+1},A_{n+1})$ --- i.e. $P(Q_{n+1},A_{n+1})$
for all possible $A_{n+1}$.
Then
\begin{equation} P(A_{n+1}) = P(Q_{n+1},A_{n+1})/\sum_i P(Q_{n+1},A_i)
\end{equation}
\subsection{}
For the second solution to the Operator Problem, we express the
probability of an arbitrary $A_{n+1}$ directly as a function of the
data set, $[Q_i,A_i]$. For this data set, the probability
distribution of $A_{n+1}$ is
\begin{equation} \sum^\infty_{j=1} \; a^j_0 \; \prod^{n+1}_{i=1}
\; O^j (A_i|Q_i)
\end{equation}
Here $O^j(\cdot|\cdot)$ is the $j^{th}$ possible conditional
probability distribution relating its two arguments. $O^j (A_i|Q_i)$
is the probability of $A_i$, given $Q_i$, in view of the function
$O^j$.
We would like to sum over all {\em total} recursive functions, but
since this set of functions is not effectively enumerable, we will
instead sum over all {\em partial} recursive functions, which {\em
are} effectively enumerable.
$a^j_0$ is the a priori probability of the function
$O^j(\cdot|\cdot)$. It is approximately $2^{-l(O^j)}$, where
$l(O^j)$ is the length in bits of the shortest description of $O^j$.
We can rewrite (10) in the equivalent form
\begin{equation} \sum^\infty_{j=1} a^j_n O^j(A_{n+1}|Q_{n+1})
\end{equation}
Here,
\begin{displaymath} a^j_n = a^j_0 \prod^n_{i=1} O^j (A_i|Q_i)
\end{displaymath}
In (11), the distribution of $A_{n+1}$ is a weighted sum of all of
the $O^j$ distributions --- the weight of each $O^j$ being the
product of its a priori probability and the probability of the
observed data in view of $O^j$.
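Eqs. (10) and (11) can be sketched with a tiny invented family of conditional models $O^j$ in place of all partial recursive functions. In the Python sketch below, $Q$ and $A$ are single bits and each model is the pair $(P(A=1|Q=0),\, P(A=1|Q=1))$; the particular models and numbers are illustrative only.

```python
# Sketch of eqs. (10)-(11): operator induction over a tiny invented
# family of conditional models O^j. Q and A are single bits.

models = [(0.5, 0.5), (0.1, 0.9), (0.9, 0.1), (0.2, 0.2)]
a0 = [1.0 / len(models)] * len(models)       # a priori weights a^j_0

def cond(model, a, q):
    """O^j(a|q): probability of answer a given question q."""
    p1 = model[q]
    return p1 if a == 1 else 1.0 - p1

def predict(pairs, q_new):
    """Eq. (11): P(A=1|q_new) proportional to sum_j a^j_n O^j(1|q_new),
    where a^j_n = a^j_0 * prod_i O^j(A_i|Q_i)."""
    an = []
    for m, w in zip(models, a0):
        for q, a in pairs:
            w *= cond(m, a, q)               # weight by the observed data
        an.append(w)
    num = sum(w * cond(m, 1, q_new) for m, w in zip(models, an))
    return num / sum(an)

data = [(0, 0), (1, 1), (1, 1), (0, 0), (1, 1)]   # A echoes Q
print(predict(data, 1))   # about 0.87: the (0.1, 0.9) model dominates
```

After five consistent pairs, the posterior weight $a^j_n$ of the ``echo'' model dwarfs the others, and the prediction approaches that model's own conditional probability.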
Section 3.3 shows that even with a relatively short sequence of $Q,A$
pairs, these distributions tend to be very accurate. If we use the
$a^j_0$ to express all of our a priori information about the data,
they are perhaps the most accurate possible.
Since we cannot compute this infinite sum using finite resources, we
approximate it using a finite number of large terms --- terms that in
(11) have large $a^j_n$ values. While it would seem ideal to include
the terms of maximum weight, it has been shown to be impossible to
{\em know} if a particular term is of maximum weight. The best we
can do is to find a set of terms of largest total weight in whatever
time we have available.
{\em We can completely characterize the problem of operator
induction to be finding, in whatever time is available, a set of
functions, $O^j(\cdot|\cdot)$ such that $\sum_j a^j_n $ is as
large as possible.}
\subsection { }
We will show that for an adequate sequence of $(Q_i, A_i)$ pairs, the
predictions obtained by the probability distribution of (10) can be
expected to be extremely good.
To do this, we hypothesize that the sequence of $A_i$ answers that
has been observed was created by a probabilistic algorithm,
$\mu(A_i|Q_i)$, and that $\mu$ can be described with $k$ bits.
Any probability distribution that assigns probabilities to every
possible $A_i$ must also assign probabilities to each bit of $A_i$:
Suppose that $a_r$ is a string of the first $r$ bits of $A_i$. Then
the probability given by $\mu$ that the $(r+1)^{th}$ bit of $A_i$ is
$1$ is
\begin{displaymath}
\sum\limits_j\mu(a_r 1 x^j|Q_i) {\Big/} \sum\limits_j\mu(a_r x^j|Q_i)
\end{displaymath}
$x^j$ ranges over all finite strings.
Similarly, $P(\cdot)$, the algorithm of (10), can be used to assign
a probability to every bit of every $A_i$. We will represent the
sequence of $A_i$'s by a string, $Z$, that is formed by
concatenating these $A_i$'s, separating them by the symbol
$s$ --- denoting ``space''. $Z$, then, is a sequence of symbols from
the ternary alphabet $0, 1, s$. Using an argument similar to the
foregoing, it is clear that both $\mu$ and $P$ are able to assign
probabilities to the space symbol, $s$, as well as to $0$ and $1$,
since each of them must be able to specify when each $A_i$ string
terminates.
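These per-symbol probabilities can be sketched concretely: any distribution over finite answer strings induces, for each prefix, probabilities for the next symbol of the ternary alphabet $0, 1, s$. The toy distribution in the Python sketch below is invented for the example, standing in for $\mu(\cdot|Q_i)$ restricted to four possible answers.

```python
# Sketch of the per-symbol probabilities of section 3.3: a distribution
# over finite answer strings induces next-symbol probabilities over
# {0, 1, s}, where s ("space") marks the end of an answer.

dist = {"0": 0.1, "01": 0.3, "011": 0.4, "10": 0.2}   # invented toy mu

def next_symbol_probs(dist, prefix):
    """P(next symbol = t | answer starts with prefix), t in {'0','1','s'}."""
    denom = sum(p for a, p in dist.items() if a.startswith(prefix))
    out = {t: sum(p for a, p in dist.items()
                  if a.startswith(prefix + t)) / denom
           for t in "01"}
    # the space symbol is emitted when the answer ends exactly at the prefix
    out["s"] = dist.get(prefix, 0.0) / denom
    return out

print(next_symbol_probs(dist, "01"))  # '1' gets 4/7, 's' gets 3/7
```

Note that the three probabilities always sum to one, so the induced process over ternary strings is itself a measure.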
We have, then, two probability distributions on the ternary
strings, $Z$. In the first distribution, $\mu$ is the creator of
the observed sequence, and in the second distribution, $P$,
through (10), tries to predict the symbols of $Z$.
For two such probability distributions on {\em ternary} strings,
we can apply Hutter's (ibid) generalization to an arbitrary alphabet
of the generalized convergence theorem, (5): the expected value,
with respect to $\mu$ (the ``generator''), of the sum of the
squares of the differences in probabilities assigned by $\mu$ and
$P$ to the symbols of the string is less than $-\ln c$, $c$ being
the largest positive number such that $P/\mu > c$ for all
arguments of $P$ and $\mu$.
More exactly:
\begin{equation}
\sum_l \mu(Z_l) \sum^n_{i=1}
\sum^{h^l_i+1}_{j=0} \sum_{t=0,1,s} (P^l_{i,j}(t) - \mu^l_{i,j}
(t))^2 < k \ln 2
\end{equation}
$l$ sums over all strings $Z_l$ that consist of $n$ finite binary
strings separated by $s$ symbols (spaces).
$A^l_i$ is the $i^{th}$ $A$ of $Z_l$.
$P^l_{i,j}(t)$ is the probability as given by $P$ that the $j^{th}$
symbol of $A^l_i$ will be $t$, conditional on previous symbols of
$A^l_i$'s in the sequence, $Z_l$, and the corresponding $Q$'s.
$t$ takes the values $0$, $1$ and $s$.
$\mu^l_{i,j}(t)$ is defined similarly to $P^l_{i,j}(t)$, but it is
independent of previous $A^l_i$'s in the sequence.
$h^l_i$ is the number of bits in $A^l_i$. The $(h^l_i + 1)^{th}$
symbol of $A^l_i$ is always $s$.
The total number of symbols in $Z_l$ is $\sum^n_{i=1}(h^l_i+1)$.
$\mu(Z_l)$ is the probability that $\mu$ assigns to $Z_l$ in view of
the sequence of $Q$'s.
$k$ is the length in bits of the shortest description of $\mu$.
This implies that the expected
value with respect to $\mu$ of the squared ``error'' between $P$ and
$\mu$, summed over the individual symbols of all of the $A_i$, will
be less than $k \ln 2$.
Since the total number of symbols in all of the answers can be very
large for even a small number of questions, the error per symbol
decreases rapidly as $n$, the number of $Q,A$ pairs, increases.
Equation (12) gives a very simple measure of the accuracy of equation
(10). There are no ``order of one'' constant factors or additive
terms. A necessary uncertainty is in the value of $k$. We normally
will not know its value, but if the generator of the data has a long
and complex description, we are not surprised that we should need
more data to make good predictions
--- which is just what (12) specifies.
The value of the constant, $k$, depends critically on just what
universal reference machine is being used to assign a priori
probability to the $O^j$ and to $\mu$. Any a priori information that
a researcher may have can be expressed as a modification of the
reference machine --- by inserting low cost definitions of concepts
that are believed to be useful in the needed induction ---
resulting in shorter codes for the $O^j(\cdot)$ and for $\mu$ (a
smaller $k$), and less error.
{\em We believe that if all of the needed a priori information is
put into the reference machine, then (10) is likely to be the
best probability estimate possible.}
At first glance, this result may seem unreasonable: Suppose we ask
the system many questions about Algebra, until its mean errors are
quite small --- then we suddenly begin asking questions about
Linguistics --- certainly we would not expect the small errors to
continue! However, what happens when we switch domains suddenly, is
that $k$ suddenly increases. A $\mu$ that can answer questions on
both Algebra and Linguistics has a much longer description than one
familiar with Algebra only. This sudden increase in $k$ accommodates
large expected errors in a new domain in which only a few questions
have been asked.
\subsection{}
If we set $Q_i=\wedge$ ($i=1...n$) in (10), it becomes clear
that the equation (6) for induction on unordered sets is a special
case of Operator Induction, and that the Convergence Theorem (12)
holds for (6) as well. This also assures convergence of the
Operator Induction technique of Section 3.1.
Is there any advantage in using (9) rather than (10) for
Operator Induction?
(9) exploits regularities in the set $[Q_i,A_i]$. It includes
regularities in the set $[Q_i]$ --- which we do not use --- so it
would seem that we are doing more work than is necessary. In (10),
we only find regularities in functions relating $Q_i$ to $A_i$. Such
regularities may be easier to find than regularities in the more
complex object $[Q_i,A_i]$. In general, however, the finding of
regularities for either of the techniques will depend critically on
just what problem is being solved.
\begin{thebibliography}{99}
\bibitem{Gac}(Gac 97) G\'acs, Peter, ``Theorem 5.2.1,'' in {\sl An Introduction to
Kolmogorov Complexity and Its Applications}, Li, M. and Vit\'anyi,
P., Springer-Verlag, N.Y., 1997, pp. 328--331.
\bibitem{hor} (Hor 71) Horning, J. ``A Procedure for Grammatical
Inference,'' {\sl Proceedings of the IFIP Congress 71}, Amsterdam,
North Holland, pp. 519--523, 1971.
\bibitem{hut} (Hut 02) Hutter, M.,``Optimality of Universal
Bayesian Sequence\\ Prediction for General Loss and Alphabet,''\\
http://www.idsia.ch/\~{ }marcus/ai/
\bibitem{li} (Li 97) Li, M. and Vit\'anyi, P. {\sl An
Introduction to Kolmogorov Complexity and Its Applications},
Springer-Verlag, N.Y., 1997.
\bibitem{ris78} (Ris 78) Rissanen, J. ``Modeling by the Shortest
Data Description,'' {\sl Automatica}, 14:465--471, 1978.
\bibitem{ris87} (Ris 87) Rissanen, J. ``Stochastic
Complexity,'' {\sl The Journal of the Royal Statistical Society,
Series B, Methodological}, 49, 3, 1987, pp. 223--239.
\bibitem{sol60} (Sol 60) Solomonoff, R.J. ``A Preliminary Report on
a General Theory of Inductive Inference,'' Report ZBT-138, Zator Co.,
Cambridge, Mass, Nov 1960. http://world.std.com/\~{ }rjs/pubs.html
\bibitem{sol64a} (Sol 64a) Solomonoff, R.J. ``A Formal Theory of
Inductive Inference,'' {\sl Information and Control}, Part I: Vol 7,
No. 1, pp. 1--22, March 1964. http://world.std.com/\~{ }rjs/pubs.html
\bibitem{sol64b} (Sol 64b) Solomonoff, R.J. ``A Formal Theory
of Inductive Inference,'' {\sl Information and Control}, Part II: Vol
7, No. 2, pp. 224--254, June 1964. http://world.std.com/\~{
}rjs/pubs.html
\bibitem{sol78} (Sol 78) Solomonoff, R.J. ``Complexity-Based
Induction Systems: Comparisons and Convergence Theorems,'' {\sl
IEEE Trans. on Information Theory}, Vol IT--24, No. 4, pp.
422--432, July 1978. http://world.std.com/\~{ }rjs/pubs.html
\bibitem{sol99} (Sol 99) Solomonoff, R.J. ``Two Kinds of
Probabilistic Induction,'' {\sl Computer Journal}, 64:256--259,
1999. http://world.std.com/\~{ }rjs/pubs.html
\bibitem{wal68} (Wal 68) Wallace, C.S and Boulton, D.M. ``An
Information Measure for Classification,'' {\sl Computer Journal},
11:185--194, 1968.
\bibitem{wal87} (Wal 87) Wallace, C.S and Freeman, P.R., ``Estimation
and Inference by Compact Coding,'' {\sl The Journal of the Royal
Statistical Society, Series B, Methodological}, 49, 3, 1987, pp. 240--252.
\bibitem{wil} (Wil 70) Willis, D.G. ``Computational Complexity
and Probability Constructions,'' {\sl Journal of the Assoc. of Comp.
Mach.}, pp. 241--259, April 1970.
\end{thebibliography}
\end{document}