2026-06-11

Transformers Are Inherently Succinct: What an Expressivity Result Can and Cannot Tell You

A new paper proves transformers represent certain languages exponentially more succinctly than temporal logic and RNNs, and doubly exponentially more so than automata. It explains scale, it is not an engineering guide.

theory transformers research

Transformers Are Inherently Succinct: What an Expressivity Result Can and Cannot Tell You — Photo / Unsplash

Summary

A paper accepted at ICLR 2026 and upvoted 148 times on Hacker News, “Transformers are inherently succinct” (Bergsträßer, Cotterell, Lin), proposes a different lens on transformers. Instead of only asking which languages a transformer can recognize, ask how large it has to be to represent them. The authors answer using a classical and precise notion from formal language theory, succinctness, and the result is clean: for certain families of languages, transformers can be exponentially more succinct than linear temporal logic (LTL) and recurrent neural networks (RNNs), and doubly exponentially more succinct than finite automata.

This is a tidy theoretical result and also one that is easy to misread. What it genuinely shows is that there exist complex structures a transformer can represent at polynomial size, while any other formalism representing the same thing must blow up to exponential or even doubly exponential size. What it does not say, and does not try to say, is that real training will find these structures, or that transformers are therefore stronger in practice. The paper has clear directional value for researchers and mostly long-term implications for builders. Reading “proven to be succinct” as “more powerful in practice” is the most common error people make with results like this.

What happened

The paper does not study the engineering object called a transformer. It studies a widely used theoretical abstraction: the unique-hard attention transformer (UHAT). In this model, attention at each position selects the single highest-scoring position (with a fixed tie-breaking rule), layered with affine maps and ReLU layers. The authors stress that they work under fixed-precision arithmetic, where every value is represented with a constant number of bits. This is the setting that most faithfully mirrors real hardware, rather than the theoretically convenient assumption of arbitrary-precision rationals. That choice makes the conclusions closer to reality, and it deserves credit.

The central measuring tool is succinctness. Given a language L and a class of recognizers C (transformers, finite automata, LTL formulas), the succinctness of L relative to C is the size of the smallest recognizer in C that recognizes L. This is an old idea in computer science: it refines the blunt question of expressivity (who can recognize what) into the sharper question of cost (how compactly each formalism can describe the same thing). A classical precedent: LTL and finite automata are equally expressive, yet LTL can be exponentially more succinct than automata, at the price of correspondingly harder decision problems. This paper places transformers on the same comparison table.

The result tightens both directions rather than proving only half. On the lower-bound side (Theorems 15, 17, Corollary 18): there exist families of languages recognized by polynomial-size UHATs whose smallest equivalent LTL formula or RNN must be exponentially large, and whose smallest finite automaton must be doubly exponentially large. On the upper-bound side (Proposition 16 and related results): conversely, any LTL formula can be represented by a UHAT that is at most polynomially larger, and any fixed-precision UHAT can be translated into an LTL formula that is at most exponentially larger, which incidentally improves a prior doubly exponential translation. Only with both directions together is the size gap pinned down. It is not that transformers happen to be smaller sometimes; on certain languages they are systematically, provably smaller, and they do not lose out in the reverse direction.

The key technical ingredient that makes this work is the authors’ demonstration of how a UHAT can use attention to encode a counter that counts from zero up to a doubly exponential value. An intuitive example: use a binary counter to enumerate all strings satisfying certain adjacency constraints, so the string length itself grows exponentially with the number of counter bits; then stack many such strings with vertical constraints, and the shortest accepted string can reach doubly exponential length. Because a transformer can unfold such enormous structure from a tiny description, it gains the succinctness advantage. To be honest about this: these are artificial languages constructed for the proof, not patterns common in natural language or code, and the paper makes no claim that the latter share the same property.

One direct consequence is spelled out by the authors (Theorems 4 and 19): precisely because a transformer can compress such large structure into such a small description, analyzing it becomes correspondingly hard. Specifically, deciding whether the language recognized by a given UHAT is empty (the non-emptiness problem), and deciding whether two UHATs recognize the same language (the equivalence problem), are both EXPSPACE-complete. Under standard complexity assumptions, that means no algorithm can solve them in less than doubly exponential time. By contrast, the same problems are solvable in polynomial time for deterministic finite automata. Succinctness is not free: the harder you compress, the more expensive it becomes to unpack and verify.

One final point must be clear: this is a purely theoretical paper with no experiments. It reports no training, no benchmarks, no measurements of real models. Its conclusions are mathematical theorems, strictly scoped to the UHAT abstraction and the fixed-precision setting. The authors themselves acknowledge in the conclusion that the empirical evidence on whether such succinct transformers are actually learnable, meaning whether training can find them, remains mixed and unsettled.

Why it matters

The paper’s value is that it gives a formal anchor to a phenomenon long explained only by intuition: why can a transformer represent seemingly complex structure at relatively modest size? Most prior expressivity work asked whether, which languages a transformer can or cannot recognize. But the whether lens has a blind spot: many architectures turn out equally expressive (all corresponding to some class of regular languages) yet behave very differently in practice. Succinctness offers a finer lens, the how-much-cost lens, and that may be the dimension closer to the practical differences. Placing transformers on this cost-measured comparison table for the first time is itself a meaningful contribution.

For understanding architectural trade-offs, this gives a concrete, provable point of difference. RNNs are actually more expressive than transformers (under fixed precision they recognize all regular languages), yet this paper shows that on certain languages, an RNN must be exponentially larger than a transformer to achieve the same recognition. That turns “what is a transformer good at” from a vague intuition into a directional, quantified statement. It does not explain all of the transformer’s practical advantage, which also comes from parallel training, optimization friendliness, data scale, and many other factors unrelated to expressivity, but it does precisely characterize one piece of it.

At the same time, the result delineates a boundary of capability, and boundaries are often more useful than the capability itself. The direct cost of succinctness is exactly that EXPSPACE-complete verification difficulty. If you want to formally verify a transformer, to prove it will never do a certain thing on any input, this result is bad news: in the worst case that is provably extremely hard. But it also points to a way out: the difficulty comes from the transformer’s ability to encode huge counters, so conversely, if one can identify the subclasses of transformers that cannot encode such counters, there is hope for subclasses verifiable at lower complexity. That is a clear direction the paper leaves to future work, and it is the path on which it might actually touch engineering, but that belongs to future work, not to what this paper has already delivered.

Builder impact

Practically, this paper will not change how you train models, choose architectures, or write prompts today. It offers no method, technique, or tool that lands directly on engineering. If you build products, reading it as background knowledge rather than an action item is the correct posture. This is not a put-down; theory was never meant to prove its value through direct engineering guidance. It is to avoid a common misuse: seeing “transformers are proven more succinct” and inferring “so I should use transformers more aggressively.” That inference does not hold, because the languages that benefit in the paper are artificial constructions with no established correspondence to real tasks.

There are two ways it may become relevant to builders over the long term. The first is formal verification. Verification tools for feed-forward neural networks have made substantial progress, with an annual verification competition even, but verification for transformers is largely out of reach. This paper explains, in complexity terms, why it is hard, and points to which transformer subclasses might become verifiable. If your domain demands strong guarantees on model safety or reliability (autonomous driving, healthcare, financial risk), the question of which transformer structures are verifiable matters to you in the long run, but the usable tools do not yet exist. This is a research agenda, not a product option.

The second is calibrating intuition about scale. The result supports a reasonable but carefully-handled intuition: a transformer can represent fairly complex structure at modest parameter scale. That may help explain why mid-sized models sometimes display capability beyond what their size would suggest. But treat it as an explanatory frame, not an extrapolable law. The paper proves “such languages exist,” not “your task is such a language.” Until you have mapped your specific task onto the paper’s construction, any inference that “my model should therefore manage with fewer parameters” has no grounding.

What to ignore

Ignore any reading that treats this result as proof of the transformer’s engineering superiority. The paper proves succinctness of representation size, which is a different thing from “performs better on real tasks,” “trains more easily,” and “uses less compute.” The paper claims none of these. Be especially wary of the chain misreading: succinct, therefore stronger, therefore use transformers more. Not one link in that chain is supported by the paper.

Ignore the implication that succinct always means worthwhile. The paper itself states the cost plainly: the flip side of succinctness is an explosion in verification difficulty (EXPSPACE-complete). In settings that need to be analyzable, verifiable, and debuggable, being more succinct can be a liability instead. Succinctness is a neutral structural fact, not a flattering engineering metric.

On one class of objection from Hacker News, separate what holds from what does not. A commenter noted that the comparison with LTL used unreduced representations, and that if LTL formulas were sufficiently simplified (via certain reductions), the exponential advantage might shrink. That intuition has some merit for the LTL comparison specifically and is worth keeping in mind. But note that the paper’s doubly exponential lower bound against automata rests on a different argument: it relies on the length of the shortest accepted string, which is a hard lower bound that does not change with representation (any automaton recognizing a non-empty language has size at least linear in the length of some string it accepts). So the rebuttal that “a different representation erases the gap” does not apply to the automata direction. When reading such debates, pin them to the specific theorem and the specific object of comparison, rather than accepting or rejecting them wholesale.

Finally, ignore the opposite extreme, that “this is pure theory so it is irrelevant to reality.” It is indeed pure theory with no experiments, strictly scoped to UHAT and fixed precision; but fixed precision is exactly the setting closest to real hardware, and UHAT is a mainstream, repeatedly studied abstraction. It does not guide engineering directly, but it precisely explains part of a real phenomenon and draws a concrete research path toward verifiable AI. For researchers, it is a direction; for builders, it is background. Neither should be inflated into the other.

Sources

No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).