Kolmogorov complexity: the true amount of information in an object

Mar 28, 2026 · #information-theory #computation #complexity #philosophy

The information in a single object is the length of its shortest description — the smallest program that regenerates it. A definition that makes compression equal to understanding, declares randomness the rule rather than the exception, and turns out to be uncomputable.

A long random string set against a tiny program that regenerates a clean repeating pattern, illustrating Kolmogorov complexity as the length of the shortest generating program. — The information in an object is the size of the shortest recipe that recreates it — not its apparent length.

Shannon’s entropy measures the average surprise of a source, computed over a distribution of probabilities. But there is a question it cannot quite reach: a single object — this file, that sequence, with no source and no probabilities behind it — how much information does it really contain? The answer, given by a twenty-five-year-old Russian mathematician in 1965, is at once the most beautiful definition of information ever written and one of the most unsettling, because it leads straight to the impossible.

Two strings, one trick question

Consider these two sequences of thirty-two characters:

A : ABABABABABABABABABABABABABABABAB
B : A8cR2pLx9QmZ4kE1nWj7uYt0vBd3sFh6

Which contains more information? The naïve intuition says they are both thirty-two characters long, so they carry the same amount. That is wrong. String A can be summarised in a phrase: “write AB sixteen times.” String B, if it is genuinely random, has no summary shorter than itself: to transmit it, you are forced to dictate it character by character. The first is compressible, the second is not. And that is exactly what the information content of an object is — the length of its shortest description.

The definition

Andrei Kolmogorov — and, independently, Ray Solomonoff and Gregory Chaitin — formalised the idea as follows. The Kolmogorov complexity of an object x, written K(x), is the length of the shortest program that, run on a universal computer, produces x and then halts.

The cleanest way to picture it: imagine you must send x to a friend over the phone, but your friend owns a computer. You are not obliged to read out the object; you can dictate a recipe for building it. K(x) is the length of the shortest possible recipe. A regular chequerboard: short recipe. A photograph of television static: no recipe beats “here are the pixels, one by one.” The complexity of an object is the size of its minimal recipe — nothing more to memorise.

Why a universal computer? Because a central result, the invariance theorem, guarantees that the choice of language — Python, C, a Turing machine — changes the answer only by a fixed constant. Switching from one language to another costs, at worst, a fixed-size translator prepended to the program. For large objects this constant becomes negligible. K(x) is therefore a near-absolute property of the object, not of the programmer. That is what makes it a universal measure of information, where Shannon needed a probability distribution handed to him first.

Shannon versus Kolmogorov

This is not a new subject; it is the same one from another angle. Shannon measures the average information of a source that emits symbols according to probabilities. Kolmogorov measures the absolute information of an individual object, never mentioning probability at all. And yet they meet: if you draw objects at random from a source of entropy H, their average Kolmogorov complexity is approximately H. Shannon entropy is the statistical average of Kolmogorov complexity.

The distinction that matters is this. Shannon needs an ensemble — a source, a set of frequencies — before it can speak at all. Kolmogorov can point at a single object and say “here is its true information content.” It is the difference between “on average, the adults in this country are 1.75 m tall” and “you, specifically, are 1.82 m tall.”

To compress is to understand

Here is the bridge to a deeper idea. Predicting and compressing are the same operation. If you can predict the next character, you do not need to transmit it; you transmit only your errors. A good compression of an object is a short program that regenerates it — and a short program that regenerates an object is a theory of that object. String A above? You understood it — “it’s AB repeated” — at the exact moment you knew how to compress it.

To understand a phenomenon is to find the short program that generates it. The laws of physics are the ultimate compression of the universe: F = ma summarises an infinite number of observations in three symbols. Understanding and compression are not analogous; they are the same act named twice.

The normality of chaos: almost everything is incompressible

Now a counter-intuitive truth, and the argument fits on one line of counting. How many binary strings of length n are there? Exactly 2ⁿ. How many programs shorter than n are there — of length 0, 1, …, n−1? At most 2ⁿ − 1. So it is mathematically impossible for every string of length n to have a shorter program: there are simply not enough short programs to go around.

It is a game of musical chairs — the pigeonhole principle. You have a million objects to file and only 999,999 short drawers. At least one object will have no short drawer; it stays at its full size. Push the count further and you show that the vast majority of strings are incompressible or nearly so. Randomness is not the exception; it is the rule. Structure, pattern, the compressible — everything the mind delights in — is a tiny, rare island in an ocean of noise.

This gives, in passing, the cleanest definition of randomness ever written: an object is random if its shortest description is itself. No pattern, no shortcut, no theory: K(x) ≈ length(x). The random is the incompressible.

The twist: this quantity is uncomputable

And here is the abyss. K(x) is perfectly well defined — and yet no algorithm can compute it. There does not exist, and never will, a program that, given x, returns K(x). Kolmogorov complexity is undecidable, a direct cousin of Turing’s halting problem.

The proof runs through a delicious paradox, the Berry paradox. Consider the phrase “the smallest integer not describable in fewer than fifteen words.” That phrase has just described it in fourteen words. Contradiction. Transposed to programs: if a machine could compute K, we could have it search for “the smallest object whose complexity exceeds one million bits” — and that short search program, far shorter than a million bits, would nonetheless produce an object supposedly requiring a million bits to describe. Absurd. Therefore K is uncomputable.

There is no formula to take away here, only a vertigo worth keeping. We have a quantity that exists, that is unique, that defines the true information of every object in the universe — and that no machine, however powerful, will ever compute exactly. We can only approach it from above: every time you compress a file you find a recipe, hence an upper bound on K. But you will never know whether an even shorter recipe exists. Compression is a bottomless quest: we draw nearer, we never arrive.

Chaitin drew from this an informational version of Gödel’s theorem: there is a complexity threshold beyond which no mathematical theory can prove that a specific object is random. The truth “this object is incompressible” is, for almost every object, unprovable. The limits of Turing and of Gödel are, at bottom, the same limit seen through the lens of information.

Intelligence as compression

This is where the thread turns sharp. Solomonoff inverted Kolmogorov into a theory of prediction: the best forecast of the future is the one issued by the shortest explanation of the past. This is Occam’s razor, but proven mathematically — among the theories that fit the data, bet on the most compressible. Marcus Hutter pushed it to a formal definition of intelligence, the AIXI agent: an optimal agent is a universal compressor. Uncomputable again, but it serves as the lodestar.

And a large language model, in this frame? Precisely a compressor. Training it to predict the next token is making it find the shortest program that “explains” the corpus — the principle of minimum description length. The better the model compresses, the better it has understood. This has become literally measurable: recent work reports a roughly linear relationship between a model’s ability to compress text and its scores on reasoning benchmarks. The same idea has produced lossless compressors that lean on large models to predict the data, turning Solomonoff’s principle into running code — the better the model understands the data, the better it compresses it.

This closes the loop. Entropy defines surprise; a predictive mind minimises it by anticipating the world; and prediction equals compression equals understanding equals, on this view, the very definition of intelligence — all of it resting on a foundation that no machine will ever compute exactly. There is something fitting in that. The measure of true information is real, universal, and forever out of reach. We spend our effort, biological and artificial alike, looking for the short recipes — the few rules that predict the most — knowing we can always find a shorter one and can never prove we have found the shortest.