The motivation for this project lies in the deep connection between induction and compression.
TLDR
There is a precise mathematical sense in which generative models may be viewed as compressors. Any model that compresses that well must also be capable of a significant degree of logical induction. The goal of this project is to investigate the compression capabilities of generative models further.
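The prediction–compression correspondence can be made concrete: a model that assigns probability $p$ to the data yields a lossless code of about $-\log_2 p$ bits, which an arithmetic coder can realize up to a small constant overhead. Here is a minimal sketch; the Laplace-smoothed predictor is an illustrative assumption, not something from the text above.

```python
import math

def code_length_bits(model, sequence):
    """Total ideal code length: sum of -log2 P(next symbol | prefix).

    A model that predicts well compresses well, since an arithmetic
    coder can realize these per-symbol lengths almost exactly.
    """
    total = 0.0
    for i, symbol in enumerate(sequence):
        p = model(sequence[:i], symbol)  # model's P(symbol | prefix)
        total += -math.log2(p)
    return total

# Hypothetical learner: Laplace-smoothed estimate of P(1) from the prefix.
def laplace_model(prefix, symbol):
    p1 = (sum(prefix) + 1) / (len(prefix) + 2)
    return p1 if symbol == 1 else 1 - p1

# Baseline that never learns: 1 bit per symbol, no compression.
uniform = lambda prefix, symbol: 0.5

data = [1] * 20  # highly regular data
print(code_length_bits(uniform, data))        # 20.0 bits
print(code_length_bits(laplace_model, data))  # ≈ 4.39 bits: it learns the regularity
```

The telescoping product of the Laplace probabilities on all-ones data is exactly $1/21$, so the learner's code length is $\log_2 21 \approx 4.39$ bits versus 20 bits for the non-learning baseline.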
Notes:
I’ve made a short primer gently introducing compression in the slides below. It may be useful to anyone who only wants the TLDR of this page.
There is also a wonderful introduction to Solomonoff Induction by Eliezer Yudkowsky, written as an A–B dialogue. I highly recommend it.
Compression Primer Slides
Compression_primer.pdf
Mathematical Background
We place ourselves in the situation faced by any machine learning model undergoing training: given some observed data, produce a prediction method or theory that is both (1) consistent with the observations and (2) as accurate a predictor as possible on future instances.
Before stating any theorems, we provide a quick glossary:
- Induction/Inductive Inference: the process of reassigning likelihoods to theories after observing particular instances. “Determining the truth.”
- All inference problems can be cast in the form of extrapolation (i.e., continuation) from an ordered sequence of binary symbols; think of this as a standard form.
- Epicurus’ Principle of Multiple Explanations: (paraphrased as) Any and all theories consistent with all observations ought to be considered (but not necessarily with equal consideration).
- Any such theory is also called a hypothesis.
- Occam’s Razor Principle: (paraphrased as) Among the theories that are consistent with all observations, one should prefer the simplest.
- Computable: any operation or object which can be simulated by a Turing machine (or computer program in some Turing-complete programming language).
- Lossless Compression: the process of representing information in a more compact form in such a way that the process is perfectly invertible.
- Kolmogorov Complexity: the simplicity of a theory or object can be measured by the length of the shortest computer program describing* that theory or object; i.e., by the amount of information in the theory or object that is impossible to computably, losslessly compress further.
- (*) To describe a thing is to produce a description of it; i.e., to produce a code such that a fixed universal decoder could perfectly recuperate the original thing from that code.
- A special type of Kolmogorov complexity — known as monotone complexity and denoted by $\text{Km}(\cdot)$ — is defined similarly except using only monotone Turing machines, or those which cannot change their mind about previous predictions when given new observations.
- Minimum Description Length (MDL) Principle: the Occam-type principle declaring the most preferable hypothesis to be that which has (1) the simplest theory and (2) the simplest description of the observations from that theory (as measured by Kolmogorov complexity).
- Bayes’ Rule: a mathematical law to update likelihoods of various hypotheses given some observations based on conditional probability.
- Solomonoff Induction (SI): a method of “perfect” induction that performs sequence prediction by mixing all consistent theories, weighting each by its complexity, and applying Bayes’ Rule to update the hypotheses’ likelihoods upon new observations. It presumes that the stochastic process generating the data is governed by a (somewhat) computable prior probability distribution. SI fulfills the metaphysical principles of Epicurus and MDL.
- Universal prior distribution: a somewhat computable (i.e., lower semi-computable, or approximable by a Turing machine from below), continuous semi-measure (i.e., distribution of likelihoods on all possible theories given some data that doesn’t necessarily sum fully to $100\%$) which can essentially play the role of any prior distribution on theories in Solomonoff Induction without much sacrifice.
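The glossary above can be illustrated with a toy, computable stand-in for Solomonoff Induction: a Bayesian mixture over a small hypothesis class, with prior weights decaying in hypothesis “complexity” and posteriors updated by Bayes’ Rule. The Bernoulli hypothesis class and the $2^{-k}$ weights here are illustrative assumptions, a crude stand-in for the $2^{-\text{K}(H)}$ weighting of SI proper.

```python
# Toy hypothesis class: Bernoulli(theta) sources. Earlier (simpler)
# hypotheses get larger prior weight, mimicking a 2^{-K(H)} prior.
hypotheses = [0.5, 0.25, 0.75, 0.1, 0.9]
prior = [2.0 ** -(k + 1) for k in range(len(hypotheses))]  # sums to < 1, like a semi-measure

def posterior(data):
    """Bayes' Rule: weight each hypothesis by prior * likelihood, renormalize."""
    weights = []
    for theta, w in zip(hypotheses, prior):
        likelihood = 1.0
        for bit in data:
            likelihood *= theta if bit == 1 else 1 - theta
        weights.append(w * likelihood)
    z = sum(weights)
    return [w / z for w in weights]

def predict_next(data):
    """Mixture prediction of P(next bit = 1) given the observed prefix."""
    return sum(w * theta for w, theta in zip(posterior(data), hypotheses))

print(predict_next([1] * 8))  # pulled toward the high-theta hypotheses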
Fact 1: There exists a universal prior distribution.
This fact is one of the key reasons Solomonoff Induction is even possible. For any other (computable) priori distribution $P$ one might place on the set of consistent hypotheses to the observed data, this universal prior distribution $\mathbf{M}$ has the property that there is some multiplicative constant $\alpha > 0$ such that for any hypothesis $H$ and observed data $D$,
$$
\mathbf{M}(H \mid D) > \alpha \cdot P(H \mid D).
$$