Given an infinite sequence of 0's and 1's, how can one best predict the next element of the sequence? Much depends, of course, on what a "best" prediction means, but within the context of this article, "best" means the prediction most consistent with the historical context of the 0's and 1's that preceded it. This is the same as finding the most likely next element, as inferred from past behaviour.

To see how the prediction algorithm works, a digression is necessary into how sequences of 0's and 1's can be built. At heart, a model is a probabilistic finite state automaton, with the arcs between states labelled 0 and 1 and carrying associated probabilities of observation. Such a model can be followed for any finite number of steps to generate a sequence of that many digits. At that point, a new model could be followed, with its digits appended to the first sequence. This change of models, or regime change, can happen many times, but only the most recent model is relevant for prediction. The prediction algorithm attempts to identify the models that sequentially gave rise to a sequence, and uses the most recent model to predict the next digit.

This approach is quite different from many others in that it does not assume a particular form for the underlying dynamics to which parameters are to be fitted. Rather, it attempts to infer a computational algorithm from the observed sequence as the best explanation of that sequence. This means that the model identification process must begin infinite dimensional and then be reduced to some finite but unknown number of explanatory variables. The particular finite state automata are also unusual in their degree of expressiveness: they can model both perfect randomness and perfect determinism. Uniquely, this algorithm is not self-incriminating. A criticism of most algorithms is that if everyone knew how they worked, any profits would be squeezed out by front running, that is, by anticipating their predictions. This algorithm instead reflects its own success against itself: in the ultimate case, if it were globally implemented, it would return a model of perfect randomness.

A model is a probabilistic finite state automaton: it has a set of states, and from each state two arcs, labelled 0 and 1 with probabilities p and q, emanate to other states (possibly including the original state itself). Sequences of digits are generated by following arcs around the model, choosing each arc according to its assigned probability. Models can be represented as tables with the current states down the left-hand side and the arcs along the top; each cell shows the next state the model moves to, together with the arc's probability.

This generates the sequence ...0000...:

State \ Arc | 0 | 1 |
----------- | -------- | ------ |
A | A / 100% | A / 0% |

This generates sequences like ...0100011011...:

State \ Arc | 0 | 1 |
----------- | ------- | ------- |
A | A / 50% | A / 50% |

This generates sequences built from blocks of three identical digits, like ...000111000000111111...:

State \ Arc | 0 | 1 |
----------- | -------- | -------- |
A | B / 50% | D / 50% |
B | C / 100% | C / 0% |
C | A / 100% | A / 0% |
D | E / 0% | E / 100% |
E | A / 0% | A / 100% |
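The table form translates directly into code. The sketch below (the dictionary representation and function name are my own, not from any particular library) encodes the five-state example above and follows its arcs to generate digits:

```python
import random

# A probabilistic finite state automaton as a transition table:
# state -> {digit: (next state, probability of that arc)}.
# This is the five-state example that emits blocks of three identical digits.
MODEL = {
    "A": {"0": ("B", 0.5), "1": ("D", 0.5)},
    "B": {"0": ("C", 1.0), "1": ("C", 0.0)},
    "C": {"0": ("A", 1.0), "1": ("A", 0.0)},
    "D": {"0": ("E", 0.0), "1": ("E", 1.0)},
    "E": {"0": ("A", 0.0), "1": ("A", 1.0)},
}

def generate(model, start, n, rng=random.random):
    """Follow arcs around the model for n steps, choosing each arc with
    its assigned probability, and return the emitted digit string."""
    state, out = start, []
    for _ in range(n):
        p0 = model[state]["0"][1]         # probability of taking the 0-arc
        digit = "0" if rng() < p0 else "1"
        state = model[state][digit][0]    # move along the chosen arc
        out.append(digit)
    return "".join(out)
```

Starting from state A, every visit to A freely chooses 0 or 1, after which states B/C (or D/E) force two repeats of the same digit, which is why the output comes in blocks of three.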

The model identification process is straightforward. A canonical automaton is formed to represent the original sequence. This canonical automaton may have infinitely many states; it is then pruned to a finite state automaton. (In reality, every canonical automaton is finite, with a large number of states, since only finitely many observations are humanly possible.)

For example, consider how a random sequence of 0's and 1's would have a model identified for it. First a histogram table is generated showing the probability of each binary subsequence occurring in the sequence. Since the sequence is totally random, we know the probability of any binary subsequence is 0.5 raised to the power of its length:

Sequence | Probability |
-------- | ----------- |
0 | 0.50 |
1 | 0.50 |
00 | 0.25 |
01 | 0.25 |
10 | 0.25 |
11 | 0.25 |
000 | 0.125 |
001 | 0.125 |
... | ... |
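An empirical version of this histogram can be computed by sliding a window over the observed sequence and counting each pattern; a minimal sketch (the function name is illustrative):

```python
from collections import Counter

def subsequence_histogram(seq, max_len):
    """Relative frequency of every binary subsequence of seq up to
    max_len digits, keyed by the subsequence itself."""
    hist = {}
    for k in range(1, max_len + 1):
        windows = len(seq) - k + 1
        counts = Counter(seq[i:i + k] for i in range(windows))
        for pattern, c in counts.items():
            hist[pattern] = c / windows
    return hist
```

For a long, truly random sequence these frequencies would converge to the 0.5^length values in the table above.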

The next step is to create an infinite state automaton whose nodes are the binary sequences 0, 1, 00, 01, 10, 11, 000, 001, ... in the left column of the histogram. There is a special state called the root state: the beginning state from which all the sequences start. The probability of each arc is the conditional probability that a 0 or a 1 follows the sequence at that node, which in this case is always 50%. The canonical automaton has the following form:

State \ Arc | 0 | 1 |
----------- | ---------- | ---------- |
Root | 0 / 50% | 1 / 50% |
0 | 00 / 50% | 01 / 50% |
1 | 10 / 50% | 11 / 50% |
00 | 000 / 50% | 001 / 50% |
01 | 010 / 50% | 011 / 50% |
10 | 100 / 50% | 101 / 50% |
11 | 110 / 50% | 111 / 50% |
000 | 0000 / 50% | 0001 / 50% |
001 | 0010 / 50% | 0011 / 50% |
... | ... | ... |
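Building the canonical automaton from such a histogram is mechanical: each state is a binary string, and the arc for digit d leads from state s to s + d with conditional probability P(s + d) / P(s). A sketch, representing the table as a dictionary mapping each state to its two arcs (the truncation depth and names are my own):

```python
def canonical_automaton(hist, max_len):
    """Build the (truncated) canonical automaton from a histogram of
    subsequence probabilities.  Root arcs use the raw single-digit
    frequencies; deeper arcs use conditional probabilities."""
    states = [""] + [s for s in hist if len(s) < max_len]
    table = {}
    for s in states:
        row = {}
        for d in "01":
            child = s + d
            if s == "":
                p = hist.get(d, 0.0)
            else:
                p = hist.get(child, 0.0) / hist[s] if hist.get(s) else 0.0
            row[d] = (child, p)
        table[s if s else "Root"] = row
    return table
```

Applied to the ideal random histogram, every arc comes out at 50%, reproducing the table above.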

Notice that every row in the canonical automaton is the same as the first row, for the root state. In other words, each subtree can be identified with the root node. This is the reduction phase of the model identification process: identifying subtrees that can be merged with earlier nodes, reducing the tree from infinite dimensional to finite dimensional. When this reduction is applied, the resulting automaton is the same as the one given earlier for a random sequence of 0's and 1's.
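Identifying subtrees with earlier nodes can be approximated by merging states whose outgoing arc probabilities agree. A full reduction would compare entire subtrees (for instance by iterating the merge to a fixed point), but this one-pass sketch, with a table given as state -> {digit: (next state, probability)}, shows the idea:

```python
def reduce_automaton(table, tol=1e-9):
    """One-pass reduction: merge states whose outgoing arc probabilities
    agree within tol, then rewrite arcs to point at the surviving
    representative of each merged group."""
    reps = {}    # probability signature -> representative state
    merged = {}  # state -> its representative
    for state, row in table.items():
        sig = (round(row["0"][1] / tol), round(row["1"][1] / tol))
        merged[state] = reps.setdefault(sig, state)
    reduced = {}
    for state, row in table.items():
        if merged[state] != state:
            continue  # this state was folded into an earlier one
        reduced[state] = {d: (merged.get(nxt, nxt), p)
                          for d, (nxt, p) in row.items()}
    return reduced
```

On the all-50% canonical automaton, every state collapses into the root, leaving the single-state random model from the earlier examples.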

Note that each model also gives the probability of a particular sequence occurring under that model: multiply the probabilities of the arcs as the sequence is traced through the model. Models that explain a sequence well assign subsequence probabilities that match the original histogram.
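Tracing a sequence through a model and multiplying arc probabilities takes only a few lines; a sketch, again using a dictionary mapping each state to {digit: (next state, probability)}:

```python
def sequence_probability(model, start, seq):
    """Multiply arc probabilities while tracing seq through the model,
    giving the probability the model assigns to observing seq."""
    state, prob = start, 1.0
    for digit in seq:
        next_state, arc_prob = model[state][digit]
        prob *= arc_prob
        state = next_state
    return prob
```

Under the single-state random model every n-digit sequence gets probability 0.5^n, while a deterministic model assigns probability 1 to the sequence it generates and 0 to everything else.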

Regime identification asks whether a sequence is better explained by one model or by more than one. At this point we ask only the simple question of whether one or two models better explain the observed sequence. The methodology begins by splitting the sequence at different points. A model is fitted to each subsequence, and from each model a probability can be computed for the observed subsequence. The two probabilities, one for each subsequence, are multiplied to give a joint probability of the original sequence under the assumption that it was indeed composed of two models changing at the split point. The split point that gives the highest probability is the best candidate for a regime change, if one exists. It may turn out that the best explanation is a single model covering the whole sequence. For prediction it is sufficient to ask whether one or two models fit best, since only the most recent model is used.
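A minimal sketch of the split search follows, using the simplest possible fitted model on each side: a single state whose arc probability is the empirical digit frequency. In practice a two-model fit has more parameters than a one-model fit and would need a complexity penalty (such as BIC), which this sketch omits:

```python
from math import log

def fit_one_state(seq):
    """Fit the simplest model: a single state whose 0-arc probability
    is the empirical frequency of 0 in seq."""
    return seq.count("0") / len(seq)

def log_likelihood(seq, p0):
    """Log probability of seq under a single-state model with 0-arc p0."""
    eps = 1e-12  # guard against log(0)
    return sum(log(max(p0 if d == "0" else 1 - p0, eps)) for d in seq)

def best_split(seq):
    """Try every split point; return (split, joint log likelihood).
    split == len(seq) means one model explains the whole sequence best."""
    best = (len(seq), log_likelihood(seq, fit_one_state(seq)))
    for i in range(1, len(seq)):
        ll = (log_likelihood(seq[:i], fit_one_state(seq[:i]))
              + log_likelihood(seq[i:], fit_one_state(seq[i:])))
        if ll > best[1]:
            best = (i, ll)
    return best
```

On a sequence such as "000000111111" the search finds the regime change at the midpoint, where each half is explained perfectly by its own model.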

Once a sequence has had its regimes identified, the most recent model is used for binary prediction. This involves taking the subsequence that corresponds to the most recent model and finding the probability of a 0 or a 1 occurring next. The digit with the larger probability is the prediction for the next element of the sequence.
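As a sketch, assuming the most recent regime's subsequence has already been isolated and its model is the simple single-state model, the prediction step reduces to a frequency comparison (the tie-breaking rule here is an arbitrary choice of mine):

```python
def predict_next(recent_segment):
    """Predict the next digit of the most recent regime: pick whichever
    digit has the larger empirical probability (ties go to '1')."""
    zeros = recent_segment.count("0")
    return "0" if 2 * zeros > len(recent_segment) else "1"
```

With a richer most-recent model, the same step would instead read off the higher-probability arc leaving the model's current state.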

This has been a very quick overview of a method to predict binary sequences. Predicting sequences of elements other than binary digits can be accomplished by mapping some attribute of them onto binary digits, for example whether prices will go up or not. The most salient part of the algorithm is finding efficient and accurate ways to reduce the canonical automaton to a finite automaton without losing too much information.