The trace outruns its capture

On human creativity as a time-bounded, non-fast-forwardable resource; the parallel-scan bottleneck; the currency of serial cost; and why the embodied long tail may resist amortization.

This essay follows a single thought through four stages: that the architecture we have chosen for our largest neural networks and the pressure to extract human, place-bound experience are one phenomenon, seen from two sides. The link between them is structural, not rhetorical. It runs through a particular mathematical property, whether a network's state recurrence composes associatively on a bounded representation, and settles, in the end, into a conservation principle borrowed from the thermodynamics of simulation. Nothing in the argument turns on anything mysterious about minds or matter: no quantum soul, no faculty that biology owns and silicon could never in principle purchase. It rests on the trade between parallel and serial computation, a few well-established facts about dynamical systems, and one honest empirical unknown, which I will try to state as sharply as I can rather than pretend to have resolved.

The whole essay circles a single question, the one the argument keeps returning to:

Is deep data enough, or must the temporal, serialization tax always be paid somewhere?

The four sections answer it in stages. Section 1 reorients the resource, moving from the picture of labour-as-extraction to creativity as something irreducibly generated, and shows why captured data is bounded in time. Section 2 isolates the computational bottleneck, the parallel scan and the associativity it depends on, in arithmetic small enough to check by hand. Section 3 turns to what the bottleneck costs and in what currency it can be paid: molecular dynamics as the measurable witness, the silicon exchange rate, the place of spiking networks, and the bifurcating, chaotic phenomena that live where cortex meets world. Section 4 gathers the conclusion: the conservation of serial cost, and the conjecture that human creativity may live precisely in the regime from which that cost cannot be fast-forwarded away.

Section 1: Deep Data, Human Creativity, and the Time-Bound

The resource, re-territorialized

The reading of large-scale data harvesting as a kind of colonialism is by now established scholarship: Couldry and Mejias's data colonialism and Kwet's digital colonialism, both reaching back past the empire of flags and garrisons to Nkrumah's neo-colonialism, meaning control exercised not through occupation but through structural dependency, where formal sovereignty survives while the raw material flows outward all the same. Much of that literature was built for text, and for text it meets an obvious objection. Text de-territorializes. It can be scraped from anywhere, copied without loss, and what flows back to the "colonized" is often genuinely useful. Data, the objection runs, is not rivalrous or finite in the way that land and ore are.

The objection loses its footing once the data in question is embodied world data: the multimodal record of a person or a place, the streets of a particular city, the full range of human skin under real monsoon light, a low-resource language as it is actually spoken, the felt texture of moving over real terrain, the long tail of events no one thought to model. This data stays put. To get it you must physically go there, or pay people who are already there. It restores the literal geography that the word "colonial" asks for.

Once the resource is grounded again in place, its finiteness separates into three senses usually run together, and only the crudest is the one a careful critic would avoid claiming. At the first level the bits are non-rivalrous: once captured, a labelled dataset copies endlessly, and here the resource genuinely is not scarce. At the second level the competitive value is rivalrous: sell the same dataset to everyone and the edge it confers evaporates. Pooling everyone's driving data may raise the social value of autonomy even as exclusivity is what carries private advantage, and the neo-colonial harm lives almost entirely in that gap. The third sense is the genuinely new one: the underlying reality is finite and depleting. Low-resource languages are dying; particular environments are homogenizing; and as synthetic media floods the world, the reserve of genuinely human, genuinely real-world data is being contaminated. This is the model-collapse problem, where training a model on its own output degrades it, and it means that pre-synthetic authentic data has begun to behave like a non-renewable deposit with a closing extraction window. Real-world capture from the Global South is an extractable reserve in the most literal, mining sense of the term.

The labour tier, and the rhyme it carries

Beneath the tidy "data is the new oil" story sits a tier of labour the story keeps out of view: the annotators and labellers (iMerit, Sama, Appen, Scale's crowd) across India, the Philippines, Kenya, and Venezuela, performing the low-paid and often distressing work that makes the system run. The historical rhyme is exact, and for the Indian case it is specific. The colonial textile economy ran on raw cotton flowing out and finished Manchester cloth flowing back, with the handloom weavers ruined in between: value-added manufacturing captured by the metropole, local industry hollowed out. The mapping is direct. Raw world data and annotation labour flow out; a world-model is built and owned abroad; it is sold back as a service the local ecosystem cannot match, because it was never allowed to climb the value chain. You end up financing the very capability that captures your own future market.

The one experiment aimed squarely at this, worth watching even should it remain marginal, treats workers as data owners who earn recurring value rather than a one-time piece wage, the model Karya has pursued in India around low-resource Indian-language data. It is so far the only counter-design that addresses the finiteness-and-ownership problem rather than the wage alone.

The exploited supplier and the creative source

So far this is the extraction story in its familiar register, with the human positioned as an exploited supplier of raw material. Alongside it I want to hold a second register, in which the same human is an irreducible creative source. The two can look as though they pull against one another. To reduce a person to an extractable deposit seems to deny their creativity; to celebrate their creativity as something machines cannot reach seems to soften the material exploitation into sentiment. Held together, though, the two descriptions turn out to be one fact seen from either side, and each sharpens the other. The property that gives embodied human experience its value, that it is genuine, long-horizon, world-coupled novelty rather than a recombination of what has already been seen, is the same property that makes its extraction durable. Were the resource synthesizable, the relationship would lapse on its own: the mine could be reshored to a server farm, and what looked like exploitation would resolve into a passing wage arbitrage. Because the resource is genuinely creative, in the precise sense the later sections develop, it cannot be manufactured inside a data centre, and the metropole's capability stays rooted, indefinitely, in someone else's ground and someone else's lived time. So the dignity of the creative source and the gravity of the extraction do not compete for the same ground. The harm is structural because the creativity is real.

The synthetic objection, and why it fails for embodied data

The strongest objection to all of this is that the resource is about to become unnecessary. The promise behind the world-model platforms is synthetic and simulated world data: generate the scenes rather than capture them. If that works, the finite real-world deposit stops binding, and the labour tier is automated away rather than exploited, its own, colder harm.

The objection fails for embodied data, and it fails structurally, not because today's simulators are merely underpowered. A simulator can emit only what its generative model already contains. A game engine's lineage is rigid-body physics and rendering, a closed world of specified equations. Embodied experience in nature is an open world of effects for which we have no equations: light through a particular haze, the granular mechanics of a specific soil, the coupling of balance and effort over real ground, the events nobody modelled. One cannot sample from a distribution one never captured. For the modalities that have no clean ground-truth generator at all, fine touch, interoception, proprioception, synthesis is not so much difficult as undefined, because there is nothing to render from.

At this point the objection turns over into its own refutation. The impressive generative "world models" are themselves neural networks trained on enormous quantities of real captured video. Synthetic world data, in other words, is laundered real data. The generator interpolates beautifully inside the manifold of what it has already seen and comes apart as soon as it extrapolates beyond it; the captured data fixes the outer edge of everything it can ever produce. The synthetic route does not escape extraction. It depends on extraction and tucks it one layer down, a refinery that only runs on its own ore.

Synthesis does win in one place, and honesty asks that it be named: the narrow mechanical slice, warehouse picking, an arm learning to grasp, a drone threading obstacles, where simulation, domain randomization, and a little real fine-tuning carry the day. But this concession works in favour of the larger claim. The shortcut succeeds exactly where the data is cheap and mechanical, and fails exactly where it is rich, phenomenological, and place-bound, which is precisely the data the frontier ambition is chasing, and precisely what is harvested from Southern bodies and environments. Synthesis stands in for the cheap material; the valuable material it cannot reach.

The time-bound

Here the resource-finiteness of this section meets the architectural argument of the rest of the essay, and this is the hinge of the whole reorientation. Captured data is finite not only in quantity but in time-horizon. A generator trained on trajectories that span an interval of length \(T\) can reproduce the dynamics out to roughly that horizon and no further; its competence is the shadow of the time-extent it actually observed. The notebook states the conclusion plainly:

Deep data can only be used to learn a parallelizable function, a feed-forward or scan-based network, of state trajectories up to a horizon \(T\) if the dataset itself already contains trajectories of order \(\Theta(T)\). You can interpolate inside the captured horizon; you cannot integrate your way past it.

This is the precise sense in which deep data is bounded in time, and it is what fuses the two readings of this section into one. The creative human source matters because its experience samples long-horizon, world-coupled dynamics that no bounded capture can fast-forward into. And the extraction is permanent because no finite dataset ever finishes the work: to cover a longer horizon you have to capture a longer horizon, which means returning to the body and the place that live it in real time. There is no terminal dataset that completes the mine. The next two sections make this bound first mechanical and then thermodynamic; for now it is enough to see that finite in quantity and bounded in horizon are two faces of a single limit, and that both point back to the same irreplaceable source.

Section 2: The Parallel-Scan Bottleneck

For definitions of the model architectures discussed here, including transformer, SSM, diagonal SSM, S4, and Mamba, see the appendix at the end of the article.

The previous section claimed that captured data buys competence only out to the horizon it spans. This section lays out the machinery beneath that bound: what a parallel scan is, why the architectures we can train at scale depend on it, and exactly which expressive power they give up in exchange. The worked examples are small enough to check by hand, which is the point: the limitation is a matter of algebra, not of scale or tuning.

[Figure: Section 2 visual overview: the parallel scan, order, and bistability]

What a parallel scan is

A scan, or prefix computation, takes a sequence of step-maps \(g_1, g_2, \dots, g_n\) and produces every running composition

s_t = g_t \circ g_{t-1} \circ \cdots \circ g_1, \qquad t = 1, \dots, n.

Done in the obvious way this is inherently sequential: \(n\) steps, each waiting on the one before. But when the composition operator is associative, when the way you bracket a product leaves its value unchanged, the running compositions can be regrouped into a balanced binary tree and computed in \(O(\log n)\) parallel depth with \(O(n)\) total work. That regrouping is the whole reason a recurrence can be made to run at the throughput of a feed-forward network.
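The regrouping can be sketched in a few lines of Python. This is a minimal illustration, not any particular library's scan: `sequential_scan`, `doubling_scan`, and the two toy operators are names invented here. The parallel variant uses recursive doubling, which is the simplest to read but does \(O(n \log n)\) work; the work-efficient tree variant that reaches \(O(n)\) is longer and omitted. String concatenation stands in for an associative but order-sensitive combine, subtraction for a non-associative one.

```python
def sequential_scan(items, combine):
    """Running reductions, left to right: each step waits on the previous one."""
    out = [items[0]]
    for item in items[1:]:
        out.append(combine(out[-1], item))
    return out


def doubling_scan(items, combine):
    """Inclusive scan by recursive doubling.

    All updates inside one round are independent of each other, so a round
    could run in parallel; only the rounds are sequential. Returns the
    prefixes and the number of rounds (the parallel depth).
    """
    out = list(items)
    offset, rounds = 1, 0
    while offset < len(out):
        out = [out[i] if i < offset else combine(out[i - offset], out[i])
               for i in range(len(out))]
        offset *= 2
        rounds += 1
    return out, rounds


def concat(a, b):      # associative, not commutative
    return a + b


def subtract(a, b):    # not associative
    return a - b


letters = list("abcdefgh")
prefixes, rounds = doubling_scan(letters, concat)
print(prefixes[-1], "in", rounds, "rounds")

numbers = [1, 2, 3, 4]
regrouped, _ = doubling_scan(numbers, subtract)
print(sequential_scan(numbers, subtract), "vs", regrouped)
```

With the associative operator the regrouped result matches the sequential one in three rounds for eight items. With subtraction the two disagree, which is the whole content of the associativity requirement.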

The state-space models of the S4-to-Mamba line are built on exactly this. A linear, time-invariant recurrence,

x_t = A\,x_{t-1} + B\,u_t, \qquad y_t = C\,x_t,

carries genuine persistent state and yet unrolls into a convolution with kernel \(\bar K_k = C A^{k} B\), computable by parallel scan because its affine step-maps compose associatively:

(A_1, b_1) \circ (A_2, b_2) = (A_1 A_2,\; A_1 b_2 + b_1).

That associativity is the engine. It is also, as the rest of the section shows, the source of the trade.
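The composition rule and the convolution form can both be checked on a toy scalar system. The sketch below assumes \(C = 1\), \(x_0 = 0\), and hypothetical values for \(A\), \(B\), and the inputs; exact fractions are used so the equalities hold without rounding. The helper names are illustrative.

```python
from fractions import Fraction
from functools import reduce


def compose(outer, inner):
    """(A1, b1) o (A2, b2) = (A1*A2, A1*b2 + b1): apply inner first, then outer."""
    a1, b1 = outer
    a2, b2 = inner
    return (a1 * a2, a1 * b2 + b1)


def apply_map(step, x):
    a, b = step
    return a * x + b


def tree_compose(steps):
    """Compose steps[0] (earliest) through steps[-1] (latest) by balanced halving."""
    if len(steps) == 1:
        return steps[0]
    mid = len(steps) // 2
    return compose(tree_compose(steps[mid:]), tree_compose(steps[:mid]))


A, B = Fraction(1, 2), Fraction(3)              # hypothetical scalar system, C = 1
inputs = [Fraction(u) for u in (1, -2, 4, 0, 5)]
steps = [(A, B * u) for u in inputs]            # step t is x -> A*x + B*u_t

# 1. The plain recurrence, one step after another, from x_0 = 0.
x_seq = Fraction(0)
for step in steps:
    x_seq = apply_map(step, x_seq)

# 2. The same steps composed left-nested and as a balanced tree.
chain = reduce(lambda acc, s: compose(s, acc), steps)
tree = tree_compose(steps)

# 3. The convolution form: x_n = sum_k A^k * B * u_{n-k}.
x_conv = sum(A ** k * B * inputs[-1 - k] for k in range(len(inputs)))

f, g, h = steps[:3]
print(compose(compose(f, g), h) == compose(f, compose(g, h)))
print(x_seq, apply_map(tree, Fraction(0)), x_conv)
```

All three routes land on the same state, and the bracketing of the composition makes no difference.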

A one-step map is not a dynamics

One observation needs to come before the trade. A trained transformer or diagonal SSM, at rollout, is re-reading a fixed window and predicting the next frame; the temporal structure lives in a sequence dimension imposed from outside. What such a model learns is a flow map at one fixed step, \(\Phi_{\Delta t}: x_t \mapsto x_{t+\Delta t}\), a single exposure of the dynamics, rather than the generator, the vector field \(f\) with \(\dot{x} = f(x)\) whose integration is the evolution itself.

Two small facts show how little a one-step map commits to, and the figure works them through in full, so I will only name them here.

The first is that the map does not know its own square root. A planar rotation of \(60^\circ\) per step and a rotation of \(420^\circ\) per step have the identical one-step map, since \(420^\circ\) and \(60^\circ\) differ by a full turn, yet their half-steps land on opposite sides of the circle, \(30^\circ\) against \(210^\circ\). A model fit to \(\Phi_{\Delta t}\) cannot recover its own behaviour at any other timescale. The second is that a small one-step error sets no bound on the many-step error. A learner one percent off on an unstable direction compounds that slip as \((1.01)^n\), reaching about thirty-five percent by thirty steps and growing without ceiling; the rate of divergence belongs to the target's own Lyapunov exponent, which one-step accuracy never touches. The same lesson holds for randomness rather than chaos: fitting a one-step conditional places no constraint on where the iterated process settles in the long run, the consistency that the Chapman–Kolmogorov condition would demand and that one-step fitting is free to violate.
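Both facts are small enough to confirm numerically. The sketch below uses only the standard library; `rotate_unit_x` is a helper invented for the illustration, and it tracks where the point \((1, 0)\) lands rather than building the full rotation matrix.

```python
import math


def rotate_unit_x(degrees):
    """Image of the point (1, 0) under a planar rotation by the given angle."""
    t = math.radians(degrees)
    return (math.cos(t), math.sin(t))


# Same one-step map: 60 and 420 degrees differ by a full turn.
one_step_a = rotate_unit_x(60)
one_step_b = rotate_unit_x(420)

# Different half-steps: 30 degrees against 210 degrees.
half_a = rotate_unit_x(30)
half_b = rotate_unit_x(210)

# A one percent slip on an unstable direction, compounded over thirty steps.
growth = 1.01 ** 30

print(one_step_a, one_step_b)
print(half_a, half_b)
print(growth)
```

The one-step images agree to rounding, the half-step images are antipodal, and the compounded factor sits just under 1.35.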

The consequence for the time-bound is direct. This is why you cannot integrate past the captured horizon: the short-time fit carries no promise about long-time behaviour, and in any chaotic target the distance between them opens exponentially.

What the scan actually needs

Here is the distinction the whole section turns on, and the one most easily blurred. A scan parallelizes when its step-maps compose associatively on a bounded representation, when the running compositions regroup into a balanced tree and the bracketing makes no difference to the answer. Linear and affine maps have this property and keep it forever: compose two of them and you get another map of the same fixed size, and the composition associates. That is the entire basis of the cheap parallel scan.

Two further properties are routinely run together with this one, and the argument depends on holding them apart. One is commutativity, whether the order of the steps matters. The other is linearity itself. A cheap scan needs neither order-freedom nor anything beyond the bounded, associative combine. What the practical diagonal models surrendered was far more than the scan ever required: by making their updates commute, they gave up all sensitivity to order, and with it the ability to carry a state that depends on the sequence of events rather than merely their tally.

[Figure: The recurrence ladder: commutativity, associativity, nonlinearity]

Restoring non-commutativity within the linear class costs the scan nothing and returns genuine state-tracking; this is what the recent fixes do. Nonlinearity is a heavier matter altogether. It is the one move that breaks the bounded associative combine, because composing nonlinear maps does not return a map of the same fixed size, and it is exactly nonlinearity that opens the repertoire the linear class can never enter: multistability, bifurcation, chaos. When the notebook writes of "non-associativity (chaos)," this is what the phrase names: the nonlinear regime in which no bounded associative scan survives, and the serial cost can no longer be deferred. So the spectrum runs in three stages: commutative and associative, where the diagonal models sit, order-blind and fully parallel; non-commutative and associative, still linear, still parallel, but now able to track state; and non-commutative and non-associative, the nonlinear regime, where the scan finally gives way.

Order-blindness against state-tracking

It helps to see the two ends of the linear class concretely. Diagonal updates commute, so a recurrence built from them cannot tell two sequences apart when they are made of the same multiset of updates in a different order. Its entire memory of a sequence is how many of each kind of step occurred, never in what order, a bank of counters wearing recurrent dress. This is the formal content of the "illusion of state" finding: diagonal SSMs of the Mamba class sit in the same low complexity class as transformers and cannot perform genuine state-tracking.

The cure is the shell game. Three cups, the ball a one-hot vector, two swaps: applying the first swap and then the second lands the ball under one cup, while reversing the order lands it under another. Same two moves, different order, different outcome, because the two permutation matrices do not commute. This recurrence is linear, it is associative and so still scans in parallel, and its power comes entirely from its non-commutativity. One needs no nonlinearity to carry a world through a sequence of moves; non-commutativity suffices, and both linearity and the scan are kept while buying it. (What "commutative" really asks for, underneath, is a shared eigenbasis: commuting diagonalizable operators are simultaneously diagonal in one fixed frame, so the state can only stretch and shrink along permanently fixed axes, never rotate one feature into another in a way that depends on its history. Non-commutativity is the refusal of that shared frame.)

In coordinates, put the ball under cup 1 as \(e_1 = (1,0,0)^\top\). Let \(P_{12}\) swap cups 1 and 2, and let \(P_{23}\) swap cups 2 and 3:

P_{12} = \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1 \end{pmatrix}, \qquad P_{23} = \begin{pmatrix} 1 & 0 & 0\\ 0 & 0 & 1\\ 0 & 1 & 0 \end{pmatrix}.

With column vectors, the rightmost operation happens first. So "swap 1-2, then swap 2-3" sends the ball to cup 3, \(P_{23}P_{12}e_1 = e_3\), while "swap 2-3, then swap 1-2" sends it to cup 2, \(P_{12}P_{23}e_1 = e_2\). The matrices are both perfectly linear and perfectly scan-friendly, but \(P_{23}P_{12} \ne P_{12}P_{23}\), and that inequality is exactly the retained memory of order.
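The same arithmetic in plain Python, with matrices as nested lists. The two diagonal matrices at the end are hypothetical values, added only to show the commuting case for contrast.

```python
def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matvec(M, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in M]


P12 = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]   # swap cups 1 and 2
P23 = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]   # swap cups 2 and 3
e1 = [1, 0, 0]                            # ball under cup 1

swap12_then_23 = matvec(matmul(P23, P12), e1)   # rightmost factor acts first
swap23_then_12 = matvec(matmul(P12, P23), e1)

# Associativity survives, so the products can still be regrouped for a scan.
left = matmul(matmul(P23, P12), P23)
right = matmul(P23, matmul(P12, P23))

# Contrast: diagonal updates commute, so order leaves no trace.
D1 = [[2, 0, 0], [0, 3, 0], [0, 0, 5]]
D2 = [[7, 0, 0], [0, 11, 0], [0, 0, 13]]

print(swap12_then_23, swap23_then_12)
print(left == right, matmul(D1, D2) == matmul(D2, D1))
```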

There are two thresholds along this axis rather than one. The shell game lives in \(S_3\), the smallest non-abelian group, already enough to exhibit order-dependence and to defeat every commutative model. Reaching the full circuit class \(\mathsf{NC}^1\), by Barrington's theorem, requires a non-solvable group. The standard choice is \(S_5\) (its subgroup \(A_5\) is the smallest non-solvable group), realized as a width-five product of permutation matrices, which is linear, associative, parallel-scannable, and remarkably powerful, again because matrix multiplication does not commute. So non-commutativity, at \(S_3\), already buys state-tracking and already lies beyond every diagonal SSM; non-solvability, at \(S_5\), buys the full class. The fixes of 2024 and 2025 that recover state-tracking, complex or negative eigenvalues, non-diagonal transitions, the DeltaNet line, all do the same thing underneath: they buy back non-commutativity while keeping linearity, so the scan survives.
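The difference between the two thresholds can be made concrete by brute force. A group is solvable when its derived series, formed by repeatedly taking the subgroup generated by commutators, reaches the trivial group. The sketch below computes that series for \(S_3\) and \(S_5\); the function names are invented here, and the approach is exhaustive enumeration, not an efficient algorithm.

```python
from itertools import permutations


def compose(p, q):
    """p o q on tuples: apply q first, then p."""
    return tuple(p[q[i]] for i in range(len(p)))


def inverse(p):
    inv = [0] * len(p)
    for i, j in enumerate(p):
        inv[j] = i
    return tuple(inv)


def closure(generators, n):
    """Subgroup generated by the given permutations (finite, so products suffice)."""
    identity = tuple(range(n))
    group, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in group:
                group.add(h)
                frontier.append(h)
    return group


def derived_subgroup(group, n):
    """Subgroup generated by all commutators a b a^-1 b^-1."""
    commutators = {compose(compose(a, b), compose(inverse(a), inverse(b)))
                   for a in group for b in group}
    return closure(commutators, n)


def derived_series_sizes(n):
    """Orders along the derived series of S_n, until it stalls or hits 1."""
    group = set(permutations(range(n)))
    sizes = [len(group)]
    while True:
        nxt = derived_subgroup(group, n)
        sizes.append(len(nxt))
        if len(nxt) == len(group) or len(nxt) == 1:
            return sizes
        group = nxt


print(derived_series_sizes(3))
print(derived_series_sizes(5))
```

For \(S_3\) the series descends to the trivial group, so order-dependence is present but the group is solvable. For \(S_5\) the series stalls at a subgroup of order 60 that equals its own commutator subgroup, which is what non-solvable means.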

What no linear map can do: bistability

There is one wall the entire linear class cannot climb, commutative or not. For a linear update \(x_{t+1} = A x_t\), the resting states solve \((I - A)x = 0\), which gives a subspace through the origin: either the origin alone or a whole line (or higher-dimensional subspace) of fixed points, never two isolated ones. No linear map can offer two stable resting states to choose between. A single nonlinear term changes this. The cubic map

x_{t+1} = x_t + 0.1\,(x_t - x_t^3)

has stable resting states at \(\pm 1\) and an unstable one at \(0\), and for any start of moderate size, \(0 < |x_0| < \sqrt{11}\), it selects between the stable pair by the sign of where it starts: a content-selected attractor, which is to say a decision. The figure traces exactly why one cubic term splits a single rest into two.

[Figure: Bistability: what no linear map can do]

Multistability and basin selection are the specifically nonlinear repertoire, and the whole linear class, including its non-commutative, state-tracking, \(S_5\)-powerful part, provably cannot reach them. This is the same nonlinearity that broke the bounded associative scan a moment ago, now seen from the side of what it buys rather than what it costs.
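Iterating the cubic map from a few starting points shows the basin selection directly. This is a minimal sketch; the step count, the starting values, and the contracting linear map used for contrast are choices made for the illustration.

```python
def cubic_step(x, rate=0.1):
    """One step of x_{t+1} = x_t + 0.1 * (x_t - x_t**3)."""
    return x + rate * (x - x ** 3)


def linear_step(x, a=0.9):
    """A contracting linear update, for contrast: its only resting state is 0."""
    return a * x


def settle(step, x0, steps=500):
    x = x0
    for _ in range(steps):
        x = step(x)
    return x


starts = [-2.0, -0.3, 0.3, 2.0]
cubic_rest = [settle(cubic_step, x0) for x0 in starts]
linear_rest = [settle(linear_step, x0) for x0 in starts]

print(cubic_rest)    # negative starts settle near -1, positive starts near +1
print(linear_rest)   # every start decays toward the single rest at 0
```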

The bottleneck, stated

Reading the witnesses in order reconstructs the bottleneck as a sequence of forced moves. A one-step conditional is not a dynamics. The scan parallelizes on associativity, at no cost in ordering. The weakness of the diagonal SSMs was therefore their commutativity, and buying it back recovers state-tracking while keeping the scan. But multistability stays beyond the whole linear class, and only nonlinearity reaches it, the same nonlinearity that ends the scan. Behind all of this lies a near-complexity-theoretic boundary: shallow, parallel computation is \(\mathsf{NC}\); inherently sequential computation is \(\mathsf{P}\)-complete and is believed not to parallelize. Anything that trains at the throughput we want, attention, convolution, linear scan, is a parallel map; anything with genuine nonlinear sequential state carries a dependency the hardware resists. To ask for cheap nonlinear state at parallel-hardware cost is to ask \(\mathsf{NC}\) and \(\mathsf{P}\) to coincide.

The open seam can be put in a sentence. A parallel scan requires the step-composition to be associative on bounded state; an expressive nonlinear automaton with real attractor dynamics wants it not to be. Whether some nonlinear recurrence carries enough algebraic structure to admit a parallel scan and still enough nonlinearity to select basins is, as of early 2026, unsettled. And the corner that is settled, the commutative, scan-friendly one, is exactly the corner that most sharpens the appetite for captured data, since a model that cannot generalize off its manifold must have seen the manifold. The architecture's limitation and the extraction of Section 1 are, once again, the same fact.

Section 3: Trade-offs and the Currency of Serial Cost

Section 2 located the bottleneck in the algebra of the scan. This section turns to what the bottleneck costs, and in which currency that cost can be paid. It takes machine-learned molecular dynamics as its running witness, because there the problem is pronounced, quantitative, and independently verified, and then moves to the silicon exchange rate, the place of spiking networks, and the cortex-and-environment phenomena that live exactly where the scan cannot follow.

The measurable witness: machine-learned molecular dynamics

The clearest place to watch the trade-offs play out is the learned interatomic potential, because every failure there is measurable and reproducible. One fits a scalar energy \(E_\theta(x)\) and takes the force as its gradient,

F_\theta(x) = -\nabla_x E_\theta(x), \qquad m\ddot{x} = -\nabla_x E_\theta(x),

which makes the force conservative by construction, a strong structural prior already in place, at no cost. And the model still fails at long times, for three reasons that have to be kept apart, because only one of them concerns the network at all.

The first reason is chaos, and it belongs to the target. Even the exact forces produce trajectories that separate exponentially, with Lyapunov times of femto- to picoseconds for biomolecules. The specific trajectory is unreproducible over the long run for any model whatever; only ensemble quantities are ever meaningful, and nothing in the architecture is to blame. The second reason is covariate shift, and it does belong to the architecture. The potential was fit near sampled configurations; integration wanders off-manifold into spurious low-energy holes that the dynamics then falls into, the same failure as video hallucination. The field's standard remedy points to the same diagnosis: active learning, meaning detect the extrapolation and go capture more reference data. The model cannot integrate its way out of the gap; it has to be sent back to the well. The third reason is timescale separation, and it belongs to the problem. Nanosecond to millisecond is not a longer integral but a rare barrier-crossing, which brute-force integration never reaches.
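The first reason can be watched in isolation. The sketch below is not molecular dynamics: it uses the logistic map at \(r = 4\) as a stand-in chaotic system, chosen only because it fits in three lines, and runs the exact same rule from two starting points that differ by \(10^{-12}\). No learned model is involved, so any divergence belongs to the target alone.

```python
def logistic(x):
    """Logistic map at r = 4, a standard toy chaotic system (a stand-in, not MD)."""
    return 4.0 * x * (1.0 - x)


def separations(x0, eps, steps):
    """Gap between two runs of the identical rule started eps apart."""
    a, b = x0, x0 + eps
    gaps = []
    for _ in range(steps):
        gaps.append(abs(a - b))
        a, b = logistic(a), logistic(b)
    return gaps


gaps = separations(0.2, 1e-12, 100)
print(gaps[0], gaps[5], max(gaps))
```

The gap starts near \(10^{-12}\), is still tiny after five steps, and becomes of order one well within a hundred, which is why only ensemble quantities survive at long times.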

The lesson generalizes well past chemistry: structural priors are close to necessary and a long way from sufficient. They buy stability and the right kind of invariants; they do not supply long-time statistics one never sampled. You cannot integrate your way to a distribution you did not observe, which is the mechanical form of Section 1's time-bound, now witnessed in a system where it cannot be argued away.

The constructive consequence is just as definite. The long-time object should not be reached by iterating the short-time one, since that iteration map is the ill-conditioned step. The long-time invariant is better learned directly, the transfer or Koopman operator, the slow eigenfunctions, a Markov State Model stitched together from an ensemble of short trajectories aimed at the slow spectrum. This is exactly how the field extracts millisecond folding kinetics from nanosecond simulations: not by integrating longer, but by choosing a different estimator.
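A two-state toy shows the change of estimator. The sketch assumes a hypothetical Markov chain with switching probabilities 0.01 and 0.02 per step, so the true slow timescale is \(-1/\ln(0.97)\), about 33 steps. It then estimates that timescale from many trajectories of only 20 steps each, by counting transitions and reading the second eigenvalue of the estimated transition matrix. All names and numbers are illustrative; real Markov State Models need state discretization and lag-time validation that this omits.

```python
import math
import random


def simulate(start, length, p_ab, p_ba, rng):
    """One short trajectory of a two-state chain (states 0 and 1)."""
    state, path = start, [start]
    for _ in range(length - 1):
        flip = p_ab if state == 0 else p_ba
        if rng.random() < flip:
            state = 1 - state
        path.append(state)
    return path


def estimate_timescale(paths):
    """Count transitions, row-normalize, and convert the second eigenvalue."""
    counts = [[0, 0], [0, 0]]
    for path in paths:
        for s, t in zip(path, path[1:]):
            counts[s][t] += 1
    p_ab = counts[0][1] / sum(counts[0])
    p_ba = counts[1][0] / sum(counts[1])
    lam2 = 1.0 - p_ab - p_ba      # second eigenvalue of a 2x2 stochastic matrix
    return -1.0 / math.log(lam2)


rng = random.Random(0)            # fixed seed so the run is repeatable
short_paths = [simulate(i % 2, 20, 0.01, 0.02, rng) for i in range(2000)]
estimated = estimate_timescale(short_paths)
true_timescale = -1.0 / math.log(1.0 - 0.01 - 0.02)

print(estimated, true_timescale)
```

No single trajectory is as long as the timescale being estimated, yet the estimate lands close to it. The price is the two thousand trajectories.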

And here is the caveat the notebook insists upon. Even the Markov-State-Model route does not escape the time tax; it pays the tax through oversampling. One does not integrate a single long trajectory, but one must sample enough short trajectories to cover the slow manifold, and that coverage is the cost arriving in a different currency. The tax moves; it does not vanish. This is the bridge to the conservation law.

The conservation law and the exchange rate

Stated plainly, the principle the whole argument converges on:

The serial cost of nonlinear, non-associative dynamics is conserved. Architecture only decides where it is paid.

A genuine temporal automaton pays the cost online, in latency at runtime; this is the brain's arrangement. A static parallel network declines to pay it there and so must pay it beforehand, in the captured data and compute needed to fit a structure-matched prior that fast-forwards the particular systems it has seen, a prior that is brittle exactly off the manifold it paid to learn, which is why it cannot extrapolate and keeps having to return to the well. There is no third arrangement in which the cost simply goes unpaid. And for the open, novel, long-tail embodied regime there is no prior to amortize against, so the cost reverts to its irreducible online form, which is exactly the capture Section 1 named as extractive.

What the law constrains is the total; it says nothing about the price per unit across substrates. Silicon can pay in currencies biology never had: wall-clock time it is not spending inside a body, joules at industrial scale, parallel sampling no organism could run. So the conserved cost may well be payable, for a great many tasks, at an exchange rate that makes the offline bet worthwhile even though the cost itself is real and cannot be beaten. The notebook puts it carefully: silicon may lower the exchange rate, and large language and world models do keep getting better at predicting and generating smooth trajectories over data-time-bounded chaos, "pre-computed chaos", while still lacking the non-associative, event-driven state evolution of the real world, and the cortex-level composability that genuinely long-horizon generation would require. Lowering the exchange rate buys a great deal, and still the bill itself remains.

One symmetry has to be respected, for honesty's sake. If we grant silicon its exotic macroscopic currencies, we cannot by fiat deny cortex a possible exotic currency of its own. Whether mammalian cortex draws on a finite, local contextual resource, in either the quantum-physical sense or the well-defined behavioural one, is genuinely unknown, not known to be absent. The exchange-rate question therefore carries unknown currencies on both sides: silicon's industrial, macroscopic resources, known to exist with an unknown rate, and a possible cortical contextual resource, of unknown existence. This does not weaken the conservation law, which speaks only to the total; it tightens the honesty of the account by refusing an asymmetry no one can prove. The core argument calls for none of this exotica, and it is exactly the part that must not be quietly folded into the chain the conclusion rests on.

The place of spiking networks

Spiking networks sit at an instructive point on this map, because they show the conservation law operating with neurons in the place of matrices. A spiking neuron is a hybrid dynamical system. Its subthreshold integration, \(\tau\dot V = -(V - V_{\text{rest}}) + I(t)\), is linear, in the simplest case diagonal, a leaky integrator sitting squarely in the commutative-linear corner of Section 2, and it scans perfectly well. The nonlinearity is concentrated in one place, the threshold-and-reset: the discontinuous, state-dependent event that fires the neuron and snaps the potential back down.

That single event reaches the regime Section 2 marked as unreachable for any linear map. The dynamical-systems tradition catalogues neuronal firing by its bifurcations, tonic spiking as a limit cycle, bursting as a multi-timescale attractor, and networks of spiking units implement winner-take-all competition and persistent activity, which is to say basin selection and working memory. A permutation or DeltaNet recurrence is powerful and still linear, and so cannot be bistable; the spiking reset can. In the taxonomy of this essay, spiking networks belong to the nonlinear-recurrent, serial-state family, the brain's corner, and not to the commutative one at all.

The reset does something else as well, and it is decisive: it is what destroys the parallel scan, since the membrane potential depends on the neuron's entire past history of firing. Every attempt to parallelize spiking networks performs the move Section 2 predicts: it drops or linearizes the reset to recover the scan, whether by neglecting reset and refractoriness, training a surrogate to approximate the reset, or decomposing the potential into parallel-computable parts, and it loses the nonlinear repertoire in the trade. The honest variants keep the structure and accept the bill: parallel training, serial inference. You cannot keep the spike's nonlinearity and parallelize it away.
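A forward-Euler discretization of the leaky integrator makes the split visible. The sketch below assumes hypothetical parameters (\(\tau = 10\), unit time step, threshold 1, rest and reset at 0, constant drive 1.5). Without the reset, each step is an affine map and the whole run collapses into one composed pair \((A, b)\), exactly the combine of Section 2. With the reset, the update branches on the state itself and no such fixed-size composition is available.

```python
import math


def lif(inputs, reset=True, tau=10.0, dt=1.0, v_rest=0.0, v_th=1.0, v_reset=0.0):
    """Euler-discretized leaky integrate-and-fire neuron (illustrative parameters)."""
    a = 1.0 - dt / tau
    v, trace, spikes = v_rest, [], []
    for i_t in inputs:
        v = a * v + (dt / tau) * (v_rest + i_t)   # affine subthreshold step
        fired = reset and v >= v_th
        if fired:
            v = v_reset                           # state-dependent discontinuity
        trace.append(v)
        spikes.append(fired)
    return trace, spikes


def composed_affine(inputs, tau=10.0, dt=1.0, v_rest=0.0):
    """The reset-free run as a single composed affine map (A, b) applied to v_rest."""
    a = 1.0 - dt / tau
    A, b = 1.0, 0.0                               # identity map
    for i_t in inputs:
        A, b = a * A, a * b + (dt / tau) * (v_rest + i_t)
    return A * v_rest + b


drive = [1.5] * 200
free_trace, _ = lif(drive, reset=False)
spiking_trace, spikes = lif(drive, reset=True)

print(free_trace[-1], composed_affine(drive))
print(sum(spikes), max(spiking_trace))
```

The reset-free potential agrees with the composed map and relaxes toward the drive level. The spiking run never settles: it fires repeatedly, and its final state depends on when each earlier spike happened.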

Two clarifications guard against a natural conflation. First, the property spiking networks are best known for, event-driven sparsity (the joules-per-spike economy of neuromorphic hardware), lives on a different axis from the serial-versus-parallel one: it concerns how much computation is done, and in what physical currency. Sparsity does not touch the parallel/serial wall. Second, that efficiency axis is exactly where spiking networks meet the open door: neuromorphic hardware is best read as a wager on paying the online serial cost at a different exchange rate (clockless, event-driven, low-energy) rather than as a wager on not paying online at all. The independent reappearance of the resonate-and-fire neuron as a complex-eigenvalue initialization is a quiet sign that the field is arriving at the commutativity diagnosis from the hardware side.

Where cortex meets world: bifurcating and chaotic

The argument keeps reaching for "the embodied long tail" and "non-associative event and state evolution in the real world." Those phrases deserve to be made concrete, because the phenomena are not exotic: they are the texture of ordinary skilled life, and they are demonstrably bifurcating and chaotic. In each, the right system to consider is not the nervous system alone but the closed loop of cortex, body, and environment, and the qualitative changes within it are basin selections and bifurcations rather than smooth interpolations.

Consider gait. As a quadruped speeds up it does not smoothly morph from walk to trot to gallop; at critical speeds the whole coordination pattern reorganizes discontinuously. The system in play is the coupled network of central pattern generator, musculoskeletal mechanics, and ground reaction forces, whose control parameter (speed, or a dimensionless Froude number) drives it through bifurcations between qualitatively distinct attractors. The human walk-to-run transition carries the diagnostic fingerprint of bistability, namely hysteresis: the up-switch and down-switch speeds differ. Because the body and the ground sit inside the dynamical system, the transition cannot be rendered from outside the coupling.
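Hysteresis of this kind is easy to reproduce in a toy bistable system. The sketch below is not a gait model; it assumes the textbook one-dimensional flow dx/dt = r + x - x^3, with `r` standing in for the control parameter (speed) and the sign of `x` standing in for the coordination pattern. The step sizes and sweep range are arbitrary choices. Sweeping `r` slowly upward and then downward shows the switch happening at different values in the two directions.

```python
def relax(x, r, dt=0.01, steps=2000):
    """Let dx/dt = r + x - x**3 settle from x at fixed r (forward Euler)."""
    for _ in range(steps):
        x += dt * (r + x - x**3)
    return x


def sweep(r_values, x0):
    """Follow the current attractor as r changes slowly.

    Returns the first r at which the state changes sign, i.e. jumps branch.
    """
    x = x0
    switch_r = None
    for r in r_values:
        new_x = relax(x, r)
        if switch_r is None and new_x * x < 0:
            switch_r = r
        x = new_x
    return switch_r


n = 201
up = [-1.0 + 2.0 * k / (n - 1) for k in range(n)]   # r from -1 to +1
down = list(reversed(up))                           # r from +1 to -1

x_low = relax(-1.0, up[0])      # start on the lower branch
up_switch = sweep(up, x_low)

x_high = relax(1.0, down[0])    # start on the upper branch
down_switch = sweep(down, x_high)

print("switch while increasing r:", up_switch)
print("switch while decreasing r:", down_switch)
```

Between the two switch points both branches are stable, so the state at a given `r` depends on the direction of approach. A model that only interpolates between observed states at each `r` has no way to represent that dependence.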

Consider multistable perception. Present each eye with a different image, or show a Necker cube, and perception does not average; it settles on one interpretation and then spontaneously switches, with switching intervals that some analyses report as low-dimensional chaos rather than mere randomness. As the stimulus varies, the system passes through bifurcations that change how many stable percepts exist and how readily it flips. This is the content-selected attractor of Section 2 made flesh: discrete basins, history-dependent selection, a state that chooses, precisely the repertoire a linear, scan-friendly map provably cannot host.

Consider balance, and movement over real ground. Quiet standing is not a fixed point the nervous system simply rests at; it is an unstable equilibrium held by an active control loop whose postural sway shows intermittent, scale-free, chaotic structure. Perturb it (uneven ground, a slip, a gust) and recovery or fall is a separatrix crossing in a coupled body-and-environment phase space: on one side the controller returns to balance, on the other it tips past recovery. Learning to ride a bicycle, or a skilled runner reading rocky ground stride by stride, is the online discovery and upkeep of an attractor in a loop the environment helps author moment by moment. (The same signature turns up elsewhere, in cases worth a passing note: perceptual decisions as drift toward an attractor that crosses a threshold, and vocal production, where register breaks and the period-doubling route to chaos in disordered voices are textbook bifurcations of the articulatory-and-aerodynamic loop.)

What unites these is exactly what the rest of the essay predicts. They are long-horizon, cortex-coupled, bifurcation-laden, and chaotic, the non-associative event and state evolution the notebook keeps naming, and they are precisely the regime a smooth-trajectory generator trained on bounded captured data cannot enter, because entering it would mean selecting a basin it never sampled and undergoing a bifurcation that lies, by construction, off its manifold. The generator can reproduce a gallop it has seen; it cannot, from inside a captured horizon, produce the transition that first creates one.

Section 4: Conclusion, the Conservation of Serial Cost, and the Creativity Conjecture

Two corners of one frontier

The brain does not escape the parallel-and-serial trade-off; it pays it, at the opposite corner from the GPU. Cortex runs nonlinear recurrent dynamics with persistent state, attractors, and limit cycles, and it is correspondingly serial in time: it lives the trajectory rather than scanning the future. What it parallelizes is space: on the order of \(10^{11}\) units in lockstep wall-clock. The silicon bet on scale is the mirror image: spend linearity, or at most non-commutative linearity, to buy parallelism across the sequence. Same frontier, opposite corners. Nonlinear recurrent networks are perfectly trainable; they are simply slow and serial. The binding limit is narrower: cheap nonlinear state and cheap temporal parallelism cannot be had at once.

Evolution is the only search we know to have actually run this optimization, under relentless metabolic pressure, and it landed hard in the nonlinear-recurrent, serial-state corner, convergently, across cephalopods, mammalian cortex, and the insect mushroom body. Convergence under independent optimization is stronger evidence than any single existence proof: for the embodied workload (real-time control, prediction, and decision in a partially observed nonlinear world) there appears to be no cheap parallel-linear shortcut that dominates, or something would have drifted toward it, since reflexes are metabolically cheaper than attractor dynamics. But evolutionary arguments carry a known failure mode that has to be respected: evolution was stuck with the slow electrochemical neuron and could never redesign its substrate. So the defensible split is this. Biology is evidence for the necessity of the capability (nonlinear, state-dependent dynamics for this workload), while the irreducibility of its cost must stand on the complexity-theoretic wall rather than on biology. The two legs hold each other up.

The hard floor: no fast-forwarding

The thermodynamic form of that irreducibility is a genuine theorem. In Hamiltonian and quantum simulation there is a no-fast-forwarding result: generically, simulating time-\(T\) evolution costs \(\Omega(T)\), with no sublinear shortcut, and the impossibility extends even to parallel fast-forwarding for sparse and local systems. The intuition comes close to a proof on its own: one cannot be more efficient than the physics one is inside. The systems known to admit fast-forwarding are characteristically the commuting, integrable ones: the same commutative corner that scans in parallel and cannot track state. The fast-forwardable systems and the parallelizable-linear architectures are one class in two uniforms. Generic, non-commutative, chaotic dynamics (the embodied workload) is the kind that cannot be fast-forwarded.
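The contrast can be felt in a classical toy, offered here only as an analogy and not as a demonstration of the quantum theorem. The sketch assumes two scalar systems chosen for illustration: a linear recurrence, whose state at time T can be reached in O(log T) multiplications by repeated squaring, and the logistic map at r = 3.9, for which no such shortcut is used here and for which a perturbation of 1e-12 in the initial condition grows to order one within a couple of hundred steps.

```python
def step_linear(a, x0, steps):
    """Walk the linear recurrence x <- a * x one step at a time."""
    x = x0
    for _ in range(steps):
        x = a * x
    return x


def fast_forward_linear(a, x0, steps):
    """Jump straight to time `steps` by repeated squaring: O(log steps) multiplies."""
    result, base, n = 1.0, a, steps
    while n > 0:
        if n & 1:
            result *= base
        base *= base
        n >>= 1
    return result * x0


def logistic_orbit(x0, steps, r=3.9):
    """Iterate x <- r * x * (1 - x); r = 3.9 is an illustrative chaotic setting."""
    orbit = [x0]
    for _ in range(steps):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit


T = 1000
slow = step_linear(0.999, 1.0, T)
fast = fast_forward_linear(0.999, 1.0, T)

orbit_a = logistic_orbit(0.2, 200)
orbit_b = logistic_orbit(0.2 + 1e-12, 200)
max_gap = max(abs(p - q) for p, q in zip(orbit_a, orbit_b))

print("linear, stepped vs fast-forwarded:", slow, fast)
print("largest gap between nearby chaotic orbits:", max_gap)
```

The linear system composes with itself commutatively, which is what makes the jump possible. In the chaotic system every step amplifies whatever error the previous step left, so the state at step T is only as good as the whole path that led to it.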

How hard, as it scales: splittable to unsplittable chaos, and the contextuality crack

This is also where the complexity picture and the one honest exception sit side by side. The difficulty of a problem is fixed by the problem, the same on any machine; what scales is whether the work can be divided among many hands or must be lived through one step at a time, and chaos is what plants the work firmly at the unsplittable end. There is one known crack in that wall, and it is worth naming precisely because it is the only one. Two teammates placed in separate rooms, forbidden to communicate, can win certain cooperative games more often when they share an entangled resource than any pre-agreed classical plan allows, provably more often, lifting the best classical success rate of seventy-five percent to roughly eighty-five. That extra winning power, with no classical story beneath it, is contextuality, and a small, shallow quantum circuit can use it to solve a problem no equally shallow classical circuit can. It is the single resource known to shave the difficulty itself rather than merely pay for it in another currency, and even so it is a bounded crack, a shallow-depth head start, not a collapse of the wall. Whether biology draws on anything of the kind is, again, unknown.
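The seventy-five and eighty-five percent figures can be reproduced directly. The sketch below assumes the game in question is the CHSH game, which matches the quoted numbers: each player receives one bit, answers with one bit, and the pair wins when the XOR of the answers equals the AND of the inputs. The classical bound is found by brute force over all deterministic strategies; the quantum value uses the standard measurement angles on a shared maximally entangled pair, for which the probability of equal answers is the squared cosine of the angle difference.

```python
import math
from itertools import product


def classical_best():
    """Best win rate over all 16 deterministic strategies (inputs uniform)."""
    best = 0.0
    for a0, a1, b0, b1 in product((0, 1), repeat=4):
        alice, bob = (a0, a1), (b0, b1)
        wins = sum(
            (alice[x] ^ bob[y]) == (x & y)
            for x, y in product((0, 1), repeat=2)
        )
        best = max(best, wins / 4)
    return best


def quantum_strategy_value():
    """Win rate of the standard entangled strategy.

    For the state (|00> + |11>) / sqrt(2) measured in real bases at angles
    t_a and t_b, the outcomes agree with probability cos(t_a - t_b) ** 2.
    """
    alice_angles = (0.0, math.pi / 4)
    bob_angles = (math.pi / 8, -math.pi / 8)
    total = 0.0
    for x, y in product((0, 1), repeat=2):
        p_same = math.cos(alice_angles[x] - bob_angles[y]) ** 2
        # Winning requires different answers only when x = y = 1.
        total += (1.0 - p_same) if (x & y) else p_same
    return total / 4


classical = classical_best()
quantum = quantum_strategy_value()

print("best classical win rate:", classical)
print("entangled strategy win rate:", quantum)
```

Shared classical randomness does not raise the first number, since a randomized plan is only a mixture of the deterministic ones enumerated here. The gap between the two values is bounded, which is the sense in which the crack is a head start rather than a collapse.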

The escape hatch, and why it closes: global entropy increase does not forbid local fast-forwarding. One may run a subsystem ahead by importing order and dumping entropy elsewhere, but this cannot be bootstrapped into a free lunch, because the real speedups are not generic. A specialized predictor can fast-forward the system it was built around only because it has folded that system's slow manifold and conserved quantities into its own structure, the Markov-State-Model and Koopman move, seen now from the thermodynamic side. The cost did not vanish; it was paid in advance, at fitting time, and amortized, and the entropy exported is the sampling spent to find the slow eigenfunctions. The no-free-lunch result states it exactly: averaged over all dynamics, no predictor beats any other, so every real speedup is a prior matched to a specific structure, and is exactly a debt against the generality given up. The account always balances. And the corollary closes the loop with Section 1: you cannot fast-forward a system you have not yet paid to characterize.

What the notebook's key conclusion says, restated

Everything above is the machinery behind a single conclusion, which the notebook states in the language of trajectory length:

A parallelizable network can be taught to predict state trajectories out to a horizon \(T\) only if the training data already contains trajectories of order \(\Theta(T)\). And because of non-associativity, meaning chaos, trajectory data of length \(O(T)\) cannot be turned into a parallelizable network that predicts over a \(\Theta(T)\) horizon unless an additional inference-time serial tax is paid. Markov-State-Model-style estimators do not evade this; they pay the same tax through oversampling. Silicon can lower the exchange rate, and large world models will keep getting better at generating smooth trajectories over pre-computed, data-time-bounded chaos, while continuing to lack the non-associative, real-world event evolution and the cortex-level composability that long-horizon generation demands.

This is the precise, deflationary form of "deep data is time-bounded." A captured horizon buys a predictable horizon and no more; the long-horizon, basin-selecting, bifurcating part of experience stays permanently beyond the reach of any bounded capture, because reaching it would require either integrating past the data, which chaos forbids, or having captured it already, which is the thing in question.

The one door left honestly open

The floor says the cost is conserved; it does not say the price per unit is fixed across substrates. The no-go theorems bound the total; they do not fix what each currency buys. So the conserved cost may be payable, for many tasks, at an exchange rate that makes the offline bet worthwhile even though the cost is real and unbeatable. Whether the embodied long tail (the rare, richly coupled, attractor-laden, non-fast-forwardable part where the value concentrates) is payable at silicon's exchange rate, or whether it specifically demands the online currency biology was forced into, is the one quantity this whole chain of reasoning has cornered and cannot settle from first principles. And with the symmetry of Section 3 respected, the question carries unknown currencies on both sides: silicon's industrial, macroscopic resources, and a possible cortical contextual resource that is neither proven nor disproven.

This open question is the argument's yield, the very thing the construction was built to isolate. We have moved from a vague unease to a sharp, well-shaped empirical question, the exchange rate between online serial cost and offline amortized cost, on the specific class of non-fast-forwardable dynamics that embodied experience samples, and we reached it without invoking anything esoteric: no quantum brain, no vitalism, only the parallel-and-serial trade-off, the commutativity diagnosis, and the conservation of simulation cost.

The creativity conjecture

This is where the four sections close into one claim. The notebook's conjecture is the destination the whole structure was built to reach:

Human creativity may require long-horizon chaos, the chaos of cortex coupled with its environment.

Read against everything above, the conjecture is a precise placement rather than a flourish. Suppose creativity is the production of genuinely novel long-horizon trajectories in a system where cortex and world are dynamically coupled: gait inventing itself at a transition, perception settling into a basin no average could land in, a body finding an attractor over ground it has never crossed. Then creativity lives exactly in the non-associative, bifurcating, non-fast-forwardable regime. It is the regime that cannot be amortized offline at any captured horizon, the regime whose cost reverts to the irreducible online form, the regime a smooth-trajectory generator can only ever shadow, because producing it would mean selecting a basin and crossing a bifurcation that lie, by construction, off the manifold it was shown.

And so the two readings Section 1 set out to reconcile finally coincide. The human is an exploited supplier of raw material and an irreducible creative source, and these are one fact: the resource is extracted permanently because it is genuinely creative, and it is genuinely creative because it samples the long-horizon coupled chaos that no bounded capture can fast-forward. If the embodied, creative long tail does live in the regime that cannot be amortized offline at any viable rate, then deep data is a permanent resource rather than a transitional one on its way to obsolescence, and the relationship built on extracting it, like the creativity it draws from, is permanent too.

Appendix: Model Terms Used in the Article

This appendix fixes the vocabulary used above. The point is not to survey every variant, but to name the mathematical object each term refers to in the argument.

Transformer

A transformer is a sequence model whose basic operation is attention: each token forms a weighted mixture of other token representations. In one layer, with queries \(Q\), keys \(K\), values \(V\), key dimension \(d\), and a causal mask \(M\) for left-to-right prediction, the attention map is

\operatorname{Attn}(Q,K,V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt d}+M\right)V.

The important feature for the article is that the sequence positions are processed in parallel during training. The model has a context window and can condition on previous tokens inside it, but it does not carry a recurrent physical state forward by integrating a dynamical system one step at a time.

State-Space Model (SSM)

A state-space model keeps an explicit hidden state \(x_t\) that is updated by the input \(u_t\) and read out as \(y_t\). The simplest discrete-time linear version is

x_t = A x_{t-1} + B u_t, \qquad y_t = C x_t + D u_t.

Because each step is an affine map, the recurrence can be composed associatively. That is the source of the parallel scan used in the main essay: the model has state, but the state evolution still belongs to the linear, scan-friendly class.
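The associativity claim can be checked by hand or in a few lines. The sketch below assumes a scalar state and exact rational arithmetic so the comparison is an equality rather than an approximation; the coefficients and inputs are arbitrary. Each step is represented as a pair (a, b) standing for the map x -> a x + b, pairs are combined in a balanced tree, and the result is compared with the ordinary step-by-step loop. It computes only the final state; a full scan reuses the same combine rule to produce every prefix.

```python
from fractions import Fraction


def compose(first, second):
    """Affine map obtained by applying `first`, then `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def tree_reduce(maps):
    """Combine maps pairwise, level by level, preserving left-to-right order.

    The pairs within one level do not depend on each other, which is what a
    parallel machine would exploit.
    """
    while len(maps) > 1:
        paired = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:
            paired.append(maps[-1])
        maps = paired
    return maps[0]


def run_sequential(a, b, inputs, x0):
    """Reference: x_t = a * x_{t-1} + b * u_t, one step at a time."""
    x = x0
    for u in inputs:
        x = a * x + b * u
    return x


a, b = Fraction(9, 10), Fraction(1, 2)
inputs = [Fraction(k % 5, 3) for k in range(13)]
x0 = Fraction(1)

step_maps = [(a, b * u) for u in inputs]
a_total, b_total = tree_reduce(step_maps)
x_scan = a_total * x0 + b_total
x_loop = run_sequential(a, b, inputs, x0)

print("tree-combined final state:", x_scan)
print("step-by-step final state: ", x_loop)
```

The tree has depth logarithmic in the sequence length, while the loop has depth equal to it. The equality holds because composing affine maps is associative; nothing in the argument requires the maps to commute.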

Diagonal SSM

A diagonal SSM is the special case where the state-transition matrix \(A\) is diagonal:

A=\operatorname{diag}(a_1,\dots,a_n), \qquad x_{t,i}=a_i x_{t-1,i}+b_i u_t.

Each coordinate evolves independently of the others, coupled only through the shared input \(u_t\). Diagonal updates share a fixed coordinate frame, so their transition matrices commute:

A_s A_t = A_t A_s.

That commutativity is the limitation emphasized in the article. It makes the scan cheap, but it removes a basic form of order-sensitive state tracking: the model can accumulate per-channel evidence, yet it cannot easily represent a state whose identity depends on the order in which different operations occurred.
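A small check makes the limitation concrete. The sketch below uses arbitrary 3-by-3 integer matrices chosen for illustration: two diagonal transitions, whose product is the same in either order, and two permutation matrices (swaps of adjacent positions), whose product is not. Tracking which order the swaps occurred in is the simplest example of state that a commuting family cannot encode.

```python
def matmul(p, q):
    """Plain matrix product for small lists-of-lists."""
    return [
        [sum(p[i][k] * q[k][j] for k in range(len(q))) for j in range(len(q[0]))]
        for i in range(len(p))
    ]


def diag(values):
    """Diagonal matrix with the given entries."""
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]


# Two diagonal transitions: order of application is invisible.
d1 = diag([2, 3, 5])
d2 = diag([7, 11, 13])
diagonal_commutes = matmul(d1, d2) == matmul(d2, d1)

# Two swaps on three items: order of application changes the outcome.
swap_01 = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
swap_12 = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
swaps_commute = matmul(swap_01, swap_12) == matmul(swap_12, swap_01)

print("diagonal transitions commute:", diagonal_commutes)
print("swap transitions commute:    ", swaps_commute)
```

With diagonal transitions the final state depends on which operations occurred and how often, never on their order. The swap pair shows how little it takes to leave that class.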

S4

S4 is a structured state-space sequence model. It starts from a continuous-time linear system,

\dot{x}(t)=A x(t)+B u(t), \qquad y(t)=C x(t),

then discretizes it into the form

x_t=\bar A x_{t-1}+\bar B u_t, \qquad y_t=\bar C x_t.

The corresponding sequence map can be written as a convolution with kernel

\bar K_k=\bar C \bar A^k \bar B.

The "structured" part is the design of \(A\), chosen so the model can represent long memory while remaining computationally tractable. For this essay, S4 matters because it is a powerful example of the linear SSM family: it buys long-range sequence modeling while keeping the associative structure that makes fast training possible.
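The equivalence between the recurrence and the convolution can be verified on a tiny system. The sketch below assumes a two-dimensional state, a zero initial state, and arbitrary rational entries for the discretized matrices (written `A`, `B`, `C` without bars); exact fractions make the comparison an equality. It runs the recurrence step by step, builds the kernel taps \(\bar C \bar A^k \bar B\), and checks that convolving the input with those taps gives the same outputs.

```python
from fractions import Fraction as F

# Illustrative discretized system; the values are arbitrary assumptions.
A = [[F(1, 2), F(1, 4)], [F(0), F(1, 3)]]
B = [F(1), F(2)]
C = [F(3), F(-1)]


def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]


def dot(p, q):
    return sum(x * y for x, y in zip(p, q))


def run_recurrence(inputs):
    """x_t = A x_{t-1} + B u_t, y_t = C x_t, starting from x = 0."""
    x = [F(0), F(0)]
    outputs = []
    for u in inputs:
        ax = matvec(A, x)
        x = [ax[i] + B[i] * u for i in range(2)]
        outputs.append(dot(C, x))
    return outputs


def kernel(length):
    """Taps K_k = C A^k B for k = 0 .. length - 1."""
    taps, v = [], B[:]
    for _ in range(length):
        taps.append(dot(C, v))
        v = matvec(A, v)
    return taps


def run_convolution(inputs):
    """y_t = sum over j of K_j * u_{t-j}, a causal convolution."""
    k = kernel(len(inputs))
    return [
        sum(k[j] * inputs[t - j] for j in range(t + 1))
        for t in range(len(inputs))
    ]


inputs = [F(n) for n in (1, 0, -2, 3, 5, 0, 1)]
y_rec = run_recurrence(inputs)
y_conv = run_convolution(inputs)

print("recurrence outputs: ", y_rec)
print("convolution outputs:", y_conv)
```

The kernel depends only on the system, not on the input, so it can be computed once and applied to the whole sequence at the same time. That is possible only because the state update is linear; a nonlinear update has no fixed kernel to precompute.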

Mamba

Mamba is a selective SSM. It keeps the SSM recurrence, but lets the parameters depend on the current input, so different tokens can open or close different memory channels:

x_t=\bar A_t x_{t-1}+\bar B_t u_t, \qquad y_t=C_t x_t,

where \(\bar A_t\), \(\bar B_t\), and \(C_t\) are functions of \(u_t\). This selection mechanism makes Mamba much more adaptive than a fixed linear SSM, but it is not the same nonlinearity discussed in the main argument. Conditional on the current input, the state update above is still affine in \(x_{t-1}\). In common diagonal-selective versions, the recurrent state evolution remains diagonal or channel-wise, so the model can be nonlinear as an input-conditioned sequence map without having the autonomous nonlinear state dynamics, multistability, or basin selection discussed in Section 2. The caveat used in the article is therefore narrow: Mamba avoids the fixed-SSM limitation through selection, but it does not by that fact alone buy the nonlinear recurrent repertoire unless the architecture explicitly buys back richer non-commutative or nonlinear state evolution.
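The distinction between input-dependent selection and nonlinear state dynamics can be shown with a scalar toy. The sketch below is not Mamba's parameterization: `gate` is a hypothetical stand-in for the input-dependent transition, and the choice of one minus the gate as the input coefficient is likewise an assumption made for illustration. The final state is checked to be affine in the initial state, with a slope equal to the product of the gates, while remaining nonlinear in the inputs.

```python
import math


def gate(u):
    """Hypothetical input-dependent decay in (0, 1); stands in for the transition."""
    return 1.0 / (1.0 + math.exp(-u))


def selective_scan(inputs, x0):
    """Scalar selective recurrence: the coefficients depend on u_t only."""
    x = x0
    for u in inputs:
        a_t = gate(u)
        # Once u is fixed, this update is affine in the previous state x.
        x = a_t * x + (1.0 - a_t) * u
    return x


inputs = [0.5, -1.0, 2.0, 0.3, -0.7]

# Slope of the map x0 -> final state: the product of the gates.
slope = math.prod(gate(u) for u in inputs)

f0 = selective_scan(inputs, 0.0)
f1 = selective_scan(inputs, 1.0)
f5 = selective_scan(inputs, 5.0)

# Nonlinear in the inputs: doubling them does not double the output.
doubled = selective_scan([2 * u for u in inputs], 0.0)

print("slope from gates:", slope)
print("f(1) - f(0):     ", f1 - f0)
print("(f(5) - f(0))/5: ", (f5 - f0) / 5)
print("f(2u) vs 2 f(u): ", doubled, 2 * f0)
```

Because the map from initial state to final state is a straight line with slope below one, any two starting states are drawn together under the same input. The input decides where they end up; the state itself never gets to choose between basins.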

References

Inherited working bibliography; entries have been normalized into linked article-style citations.

Atia & Aharonov (2017), fast-forwarding of Hamiltonians, Nature Communications.
Barrington (1989), bounded-width branching programs and \(\mathsf{NC}^1\), JCSS.
Basieva, Cervantes, Dzhafarov & Khrennikov (2019), true contextuality in human decision making, JEP: General.
Berry, Ahokas, Cleve & Sanders (2007), simulating sparse Hamiltonians and no-fast-forwarding lower bounds, Communications in Mathematical Physics.
Bravyi, Gosset & König (2018), quantum advantage with shallow circuits, Science; Bravyi, Gosset, König & Tomamichel (2020), noisy shallow circuits, Nature Physics.
Bremner, Jozsa & Shepherd (2011), classical simulation of commuting quantum computations and polynomial-hierarchy collapse, Proceedings of the Royal Society A.
Bruza et al. (2023), contextuality in cognition.
Cervantes & Dzhafarov (2019), contextuality-by-default.
Chia et al. (2023), impossibility of general parallel fast-forwarding, CCC.
Couldry & Mejias, The Costs of Connection, data colonialism.
Fisher (2015), nuclear-spin processing in the brain, Annals of Physics.
Grazzi et al. (2024), unlocking state-tracking in linear RNNs through negative eigenvalues, arXiv:2411.12537.
Ha & Schmidhuber (2018), world models.
Hahn (2020), limitations of self-attention, TACL.
Hao, Angluin & Frank (2022), hard-attention transformers and circuit complexity, TACL.
Howard, Wallman, Veitch & Emerson (2014), contextuality as the “magic” for quantum computation, Nature.
Izhikevich (2007), Dynamical Systems in Neuroscience, MIT Press.
Ji, Natarajan, Vidick, Wright & Yuen (2020), \(\mathrm{MIP}^* = \mathrm{RE}\), arXiv:2001.04383.
Krohn & Rhodes (1965), algebraic theory of machines, Transactions of the AMS.
Kwet (2019), digital colonialism, Race & Class.
Liu et al. (2023), transformers learn shortcuts to automata, ICLR.
Maass (1997), networks of spiking neurons, Neural Networks.
Merrill (2019), sequential neural networks as automata.
Merrill & Sabharwal (2023), the parallelism tradeoff, TACL.
Merrill, Petty & Sabharwal (2024), the illusion of state in state-space models, ICML.
Merrill, Sabharwal & Smith (2022), saturated transformers as constant-depth threshold circuits, TACL.
Merrill, Weiss, Goldberg, Schwartz & Smith (2020), a formal hierarchy of RNN architectures, ACL.
Nkrumah, Neo-Colonialism.
Sarrof, Veitsman & Hahn (2024), expressive capacity of state space models, NeurIPS.
Siegelmann & Sontag (1992), computational power of neural nets, COLT.
Siems et al. (2025), DeltaProduct, arXiv:2502.10297.
Strobl, Merrill, Weiss, Chiang & Angluin (2024), what formal languages can transformers express?, TACL.
Tegmark (2000), quantum decoherence in brain processes, Physical Review E.
Watts, Kothari, Schaeffer & Tal (2019), shallow quantum vs classical circuits, STOC.
Weiss, Goldberg & Yahav (2018), finite-precision RNNs, ACL.