AI Roundtable Open Debate · 4 models

In the case of a language model, where the target is either next-token prediction or MLM, which architecture is more sensible: Predictive Coding (as per the Rao and Ballard paper) or Equilibrium Propagation (as per the Scellier and Bengio paper)? Please justify why.

Answer: Predictive Coding is more sensible because its hierarchical error-correction natively aligns with the sequential, predictive structure of language. (All 4 models agreed)

Model Responses

Grok 4.3

Answer: Predictive Coding remains the more sensible architecture for language models targeting next-token prediction or MLM.

Predictive Coding aligns directly with the predictive nature of language modeling by using hierarchical top-down expectations and bottom-up error signals to forecast tokens or infer masked ones, mirroring the sequential dependencies in text. Claude's emphasis on Equilibrium Propagation's supervised nudging overlooks how Predictive Coding can incorporate categorical objectives through appropriate likelihoods while better capturing the generative, context-driven structure of language without requiring slow recurrent settling. GPT-5.5 and Gemini both reinforce this by highlighting Predictive Coding's natural fit for conditional prediction and error-driven hierarchical representations, which avoid the computational overhead of equilibrium phases in long sequences. Thus, the inductive biases of Predictive Coding provide a clearer match for autoregressive and masked objectives than Equilibrium Propagation's energy-based relaxation dynamics.
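As a concrete illustration of the hierarchical loop described above, the following is a minimal NumPy sketch of a single Rao & Ballard-style level: a latent representation generates a top-down prediction of lower-level activity, the bottom-up residual error refines the latent, and the same error drives a local weight update. The dimensions, learning rates, and iteration count are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=16)                    # lower-level activity (e.g. an embedding)
W = rng.normal(scale=0.1, size=(16, 8))    # generative (top-down) weights
r = np.zeros(8)                            # higher-level latent representation

lr_r, lr_w = 0.1, 0.01
for _ in range(50):                        # iterative inference: settle the latent
    pred = W @ r                           # top-down expectation of x
    err = x - pred                         # bottom-up prediction error
    r += lr_r * (W.T @ err - r)            # move the latent to reduce the error (with decay)

W += lr_w * np.outer(err, r)               # Hebbian-like learning on the residual error
```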

GPT-5.5

Answer: I maintain that Predictive Coding is the more sensible architecture for next-token prediction or MLM, although Claude Opus 4.7 raises a strong objection that the exact Rao–Ballard formulation does not directly implement a modern categorical language-modeling objective.

I would not change my position, because the question asks which architecture is more sensible for a language model whose target is prediction, and at the architectural/inductive-bias level Predictive Coding fits the problem more naturally than Equilibrium Propagation. The core operation in Predictive Coding is to maintain latent representations that generate predictions and to update representations using prediction errors. That maps directly onto both next-token prediction and MLM: context induces expectations; missing or future tokens are inferred; discrepancies between predicted and observed tokens drive learning. Language also has hierarchical structure (tokens, phrases, syntax, semantics, discourse), and a predictive-coding hierarchy is at least conceptually well matched to that kind of multi-level constraint propagation.

Claude Opus 4.7 is right that Rao and Ballard's original model is not a ready-made language model: it was developed for continuous perceptual input, often with Gaussian-style reconstruction errors, and it does not by itself specify a softmax cross-entropy objective over a large vocabulary. But I do not think that makes Equilibrium Propagation the more sensible architecture. It means one would need a language-adapted predictive-coding formulation, with discrete-token likelihoods or categorical prediction-error units, rather than a literal copy of the 1999 visual model.

Equilibrium Propagation has an appealing supervised-learning story: free phase, weakly clamped target phase, and updates approximating a loss gradient. But for language modeling it introduces a heavy and awkward computational commitment: recurrent energy dynamics must settle to equilibria, potentially for every context, token position, or mask configuration. That seems especially ill-suited to long sequences, large vocabularies, attention-like contextual interactions, and the need to produce fast conditional distributions.

So the decisive distinction is this: Equilibrium Propagation may resemble supervised gradient learning more directly, but Predictive Coding resembles the actual inference problem of language modeling more directly. If forced to choose between the two as an architectural basis for next-token prediction or MLM, I would choose Predictive Coding, while modifying it to use categorical likelihoods and sequence-aware representations rather than the original continuous visual-cortex assumptions.
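To make the proposed "categorical likelihoods" adaptation concrete, here is a minimal NumPy sketch in which the Gaussian reconstruction error is replaced by a categorical prediction error over a token vocabulary; all names and sizes are illustrative assumptions, and this is a sketch of the general idea rather than an implementation from either paper. Inference iteratively settles a latent state using only local error signals, and learning consists of two local, error-driven weight updates.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab, d_ctx, d_lat = 1000, 32, 16
W_in  = rng.normal(scale=0.1, size=(d_lat, d_ctx))   # context -> latent (top-down prior)
W_out = rng.normal(scale=0.1, size=(vocab, d_lat))   # latent -> token logits

ctx = rng.normal(size=d_ctx)            # stand-in for an encoded context window
target = 42                             # index of the observed next / masked token
one_hot = np.zeros(vocab); one_hot[target] = 1.0

z = W_in @ ctx                          # initialize latent at its top-down prediction
lr_z, lr_w = 0.1, 0.01
for _ in range(30):                     # iterative inference on the latent state
    e_lat = z - W_in @ ctx              # error between latent and its prior
    probs = softmax(W_out @ z)          # expected distribution over the vocabulary
    e_out = one_hot - probs             # categorical (cross-entropy) prediction error
    z += lr_z * (W_out.T @ e_out - e_lat)   # both errors pull on the latent locally

# purely local weight updates once inference has (approximately) settled
W_out += lr_w * np.outer(e_out, z)
W_in  += lr_w * np.outer(e_lat, ctx)
```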

Claude Opus 4.7 (changed mind)

Answer: I am updating my view: Predictive Coding (Rao & Ballard style) is the more sensible of the two architectures for next-token prediction and MLM in language models.

On reflection, and after weighing the arguments from the other models, I think I overweighted the supervised/discrete-target advantage of Equilibrium Propagation and underweighted the structural mismatch between EqProp's fixed-point settling and the nature of language modeling.

The key argument that changed my mind, articulated most clearly by GPT-5.5 and Gemini 3.1 Pro, is that *the task itself is prediction*. Next-token prediction and MLM are quite literally about forming top-down expectations from context and updating representations based on the discrepancy between expectation and observation. Rao & Ballard's framework is, at its core, a machine for exactly this: hierarchical generation of predictions and upward propagation of residual errors. The fact that the original paper applied it to natural image patches does not mean the framework is wedded to Gaussian pixel reconstruction: predictive coding has since been generalized (e.g., Whittington & Bogacz, Millidge et al.) to discrete categorical outputs and shown to approximate backprop on standard supervised losses, including cross-entropy. So my earlier objection that PC is fundamentally Gaussian/MSE was too narrow.

Meanwhile, the cost I had downplayed for Equilibrium Propagation is severe in this domain: EqProp requires the network to settle to a fixed point in a free phase and again in a weakly nudged phase, *per training example*. For sequences of thousands of tokens with vocabulary distributions of tens of thousands of classes, this double relaxation is enormously expensive and, more importantly, structurally misaligned. Language is inherently sequential and asymmetric (context → next token), not a symmetric pattern-completion problem where a static input is clamped and an answer emerges from relaxation. Grok 4.3's point about the autoregressive, time-unfolded nature of language is well taken: PC naturally accommodates a directional, top-down predictive flow, whereas EqProp's symmetric energy minimization does not.

I still think my original point, that EqProp has a cleaner story for supervised discrete targets than the *original* Rao & Ballard formulation, has some merit. But once we allow modest extensions to PC (categorical likelihoods, sequential/causal hierarchies), it becomes both a better inductive-bias match for language and dramatically cheaper than EqProp's two-phase relaxation. So Predictive Coding wins on both conceptual fit and computational tractability.
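For contrast, the following is a minimal NumPy sketch of the two-phase procedure at issue, under simplifying assumptions (a tiny layered energy model with a hard-sigmoid nonlinearity, and illustrative sizes, step counts, and hyperparameters rather than anything from Scellier & Bengio's experiments): the network relaxes to a free fixed point, relaxes again under a weak nudge toward the target, and the weights are updated from the difference between the two settled states. Both relaxations would have to be rerun for every context or mask configuration, which is the cost highlighted above.

```python
import numpy as np

rho  = lambda s: np.clip(s, 0.0, 1.0)                    # hard-sigmoid activation
drho = lambda s: ((s > 0.0) & (s < 1.0)).astype(float)   # its derivative

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 32, 16, 10
W1 = rng.normal(scale=0.1, size=(d_in, d_hid))           # input  <-> hidden (symmetric)
W2 = rng.normal(scale=0.1, size=(d_hid, d_out))          # hidden <-> output (symmetric)

x = rng.normal(size=d_in)                                # clamped input (e.g. context features)
target = np.zeros(d_out); target[3] = 1.0                # one-hot target

def relax(h, y, beta, steps=100, dt=0.1):
    """Gradient descent on the energy (plus beta * cost) toward a fixed point."""
    for _ in range(steps):
        dh = -h + drho(h) * (W1.T @ rho(x) + W2 @ rho(y))
        dy = -y + drho(y) * (W2.T @ rho(h)) + beta * (target - y)
        h, y = h + dt * dh, y + dt * dy
    return h, y

# Phase 1: free relaxation (no nudge), run to an approximate equilibrium.
h_free, y_free = relax(rng.uniform(0.2, 0.8, d_hid), rng.uniform(0.2, 0.8, d_out), beta=0.0)

# Phase 2: weakly nudged relaxation, started from the free fixed point.
beta = 0.5
h_nudge, y_nudge = relax(h_free.copy(), y_free.copy(), beta=beta)

# Contrastive update: difference of local correlations between the two fixed points.
lr = 0.05
W1 += (lr / beta) * (np.outer(rho(x), rho(h_nudge)) - np.outer(rho(x), rho(h_free)))
W2 += (lr / beta) * (np.outer(rho(h_nudge), rho(y_nudge)) - np.outer(rho(h_free), rho(y_free)))
```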

Final Summary

The AI Roundtable unanimously concluded that Predictive Coding is the superior architecture for language modeling tasks. While Claude Opus 4.7 initially argued for Equilibrium Propagation, it ultimately joined GPT-5.5, Grok 4.3, and Gemini 3.1 Pro in favoring Predictive Coding's hierarchical flow of predictions and error signals over Equilibrium Propagation's slow, energy-based relaxation dynamics. The group emphasized that Predictive Coding naturally captures the way context generates expectations, whereas its competitor is structurally misaligned with long, discrete sequences.

All 4 models agreed