I propose two novel experiments to investigate the compression capabilities of generative models. The main ideas behind each experiment are as follows:
A shallow interpretation of our general research goal might be to ask the following question:
Question: Can ChatGPT perform nearly lossless compression in its I/O?
While the formal connections between compression and induction capabilities strictly refer to how a model internally represents data, this naïve question nevertheless deserves some thought: to what extent is training a model for natural language tasks also training a model to perform compression?
In particular, do large language models obey the typical laws of a complexity function or compressor? Let $A: \texttt{Strings} \to \texttt{Strings}$ be a map on finite strings, $\texttt{s} \in \texttt{Strings}$ a finite string, and $C(\texttt{s}) = \text{length}(A(\texttt{s}))$ the length of $A$'s representation of $\texttt{s}$; for a pair of strings, $C(\texttt{s}, \texttt{t})$ denotes the compressed length of their concatenation. Recall that for $A$ to qualify as a lossless (injective) compression algorithm, there must be a corresponding program $A^{-1}$ computing the inverse of $A$, i.e. such that $A^{-1} \circ A \equiv \text{id}$. The following are properties one might ask of a compression algorithm $A$, where each inequality holds only up to an additive $O(\log n)$ term, with $n$ the maximal length of a string appearing in that inequality. Formally, these are the desiderata characterizing normal compressors.
Idempotency: Compression is no harder for repetitive inputs:
$$ C(\texttt{s}, \texttt{s}) = C(\texttt{s}). $$
Monotonicity: Compression gets no easier with more input:
$$ C(\texttt{s}) \leq C(\texttt{s}, \texttt{t}). $$
Symmetry: Compression doesn’t much mind the order of inputs:
$$ C(\texttt{s}, \texttt{t}) = C(\texttt{t}, \texttt{s}). $$
Distributivity: Compressing using an intermediate string cannot improve compression:
$$ C(\texttt{s}, \texttt{t}) + C(\texttt{w}) \leq C(\texttt{s}, \texttt{w}) + C(\texttt{t}, \texttt{w}). $$
See Clustering by Compression by Cilibrasi and Vitányi for more background on normal compressors.
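Before involving a language model, these desiderata can be sanity-checked against an ordinary compressor. The sketch below uses zlib as a stand-in for $A$, adopts the concatenation convention for $C(\texttt{s}, \texttt{t})$, and picks an arbitrary constant for the $O(\log n)$ slack; the helper names are purely illustrative.

```python
import math
import zlib

def C(*strings: str) -> int:
    """Compressed length in bytes of the concatenation of the given strings,
    with zlib standing in for the compressor A."""
    data = "".join(strings).encode("utf-8")
    return len(zlib.compress(data, 9))

def slack(*strings: str) -> float:
    """Additive O(log n) allowance, n being the maximal string length involved.
    The constant factor of 10 is an arbitrary illustrative choice."""
    n = max(len(s) for s in strings)
    return 10 * math.log2(n + 2)

def check_normal_compressor(s: str, t: str, w: str) -> dict:
    """Check the four normal-compressor desiderata up to the slack term."""
    return {
        "idempotency":    abs(C(s, s) - C(s)) <= slack(s),
        "monotonicity":   C(s) <= C(s, t) + slack(s, t),
        "symmetry":       abs(C(s, t) - C(t, s)) <= slack(s, t),
        "distributivity": C(s, t) + C(w) <= C(s, w) + C(t, w) + slack(s, t, w),
    }

s = "the quick brown fox jumps over the lazy dog " * 20
t = "pack my box with five dozen liquor jugs " * 20
w = "sphinx of black quartz judge my vow " * 20
print(check_normal_compressor(s, t, w))
```

The same four checks carry over once $C$ is redefined in terms of a language model.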
Due to hallucination, the nature of their training data, and their transformer architecture, large language models are unlikely to possess a generic ability to losslessly compress input data upon output, especially data not drawn from natural language tasks. I further hypothesize that some desiderata of normal compressors are more likely to be satisfied than others: idempotency, monotonicity, and symmetry may find more convincing evidence in our experiment than distributivity.
I design an experiment to evaluate just how well a GPT model behaves as a normal compressor. For computational feasibility, I use a GPT-2 model available through Hugging Face's transformers library; of course, the same could be done for more advanced models.
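Running such checks against GPT-2 requires a concrete notion of $C$ for a language model. One natural choice (a sketch of one possible instantiation, not necessarily the protocol used here) is the Shannon code length given by the model's negative log-likelihood: an arithmetic coder driven by GPT-2's next-token distribution would losslessly encode a string in roughly this many bits. The function name gpt2_code_length below is hypothetical.

```python
import math

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Pretrained GPT-2 and its tokenizer from Hugging Face's transformers library.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

@torch.no_grad()
def gpt2_code_length(*strings: str) -> float:
    """Code length (in bits) of the concatenation of `strings`: the total
    negative log2-likelihood of its tokens under GPT-2, roughly the length an
    arithmetic coder driven by the model would achieve. The first token is
    left uncoded; a constant-size header could cover it."""
    text = "".join(strings)
    ids = tokenizer(text, return_tensors="pt", truncation=True,
                    max_length=model.config.n_positions).input_ids
    logits = model(ids).logits[:, :-1, :]   # predictions for tokens 2..T
    targets = ids[:, 1:]                    # the tokens actually observed
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return -token_log_probs.sum().item() / math.log(2)

# Example: idempotency would predict C(s, s) close to C(s).
s = "Colorless green ideas sleep furiously. " * 10
print(gpt2_code_length(s), gpt2_code_length(s, s))
```

Note that GPT-2's context window of $1024$ tokens caps how much of a long concatenation the model actually conditions on, which already limits how faithfully it can behave as a compressor on long inputs.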
I use the first $50$ documents from the 20 Newsgroups dataset. The following figure illustrates the distribution of character counts for all $50$ of the test inputs.
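A distribution of this kind can be produced in a few lines, assuming the corpus is loaded through scikit-learn's built-in fetcher (one convenient route; the exact loader and any header stripping are choices left open here).

```python
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

# Fetch the raw training split of 20 Newsgroups and keep the first 50 documents.
newsgroups = fetch_20newsgroups(subset="train")
docs = newsgroups.data[:50]

# Distribution of character counts over the 50 test inputs.
char_counts = [len(doc) for doc in docs]
plt.hist(char_counts, bins=20)
plt.xlabel("Character count")
plt.ylabel("Number of documents")
plt.title("Character counts of the 50 test inputs")
plt.show()
```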