Attention Is Not Enough

01 — The Thesis

Generalization isn’t one thing

Humans don’t just interpolate between things they have seen; they extrapolate to genuinely novel combinations. Hear a sentence with an unfamiliar word in a familiar slot, and you still parse the role it’s playing. The paper asks whether Transformers, known for their abstraction abilities, can do the same.

The answer is layered. Transformers comfortably outperform LSTMs at standard generalization and at recombining familiar fillers into unfamiliar role-orderings. But on the hardest test — a filler the inner model has never once seen during training — every Transformer-based architecture collapses to roughly the same middling accuracy as the plain LSTM. Only a much older, slower, reinforcement-learning-based indirection model holds up.

“We confirm that the Transformer does possess superior abstraction capabilities compared to LSTM. However, what it does not possess is extrapolation capabilities.” — Abstract

02 — Background

What “indirection” actually means here

The paper borrows the term from computer science: a pointer doesn’t hold a value directly, it holds the address of a value stored somewhere else. Kriete et al. (2013) proposed that working memory does something similar — an abstract role (agent, verb, patient) can point to whichever concrete word currently fills that role, rather than the role and the word being fused into one inseparable representation.

Three sentence roles, three arbitrary fillers:

Role: Agent, points to “Tom”
Role: Verb, points to “Ate”
Role: Patient, points to “Food”

Swap in a brand-new filler the model has never seen, and an indirection-style system only needs to update which address the pointer resolves to — not re-learn the whole bound representation from scratch.
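The difference between a fused representation and a pointer can be made concrete with a toy sketch. This is only an illustration of the indirection idea, not the paper's model: the `slots` list, the `pointers` table, and the stand-in word "Zorp" are all hypothetical names introduced here.

```python
# Toy illustration of indirection; every name here is hypothetical.

# Fillers live in addressable slots, and each role holds only the address
# of the slot that currently fills it (not the filler itself).
slots = ["Tom", "Ate", "Food"]                      # address -> filler
pointers = {"agent": 0, "verb": 1, "patient": 2}    # role -> address


def resolve(role: str) -> str:
    """Follow the role's pointer to whatever filler it currently addresses."""
    return slots[pointers[role]]


def bind_new_filler(role: str, filler: str) -> None:
    """Store an unseen filler in a fresh slot and redirect the role to it."""
    slots.append(filler)
    pointers[role] = len(slots) - 1


bind_new_filler("agent", "Zorp")   # "Zorp" stands in for a never-seen word
print(resolve("agent"))            # Zorp
print(resolve("patient"))          # Food (the other bindings are untouched)
```

Only the agent's address changes when the new filler arrives; nothing about the verb or patient bindings has to be revisited.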

Four tasks, increasing difficulty

Task	Name	What changes between train & test
SG	Standard Generalization	Familiar fillers in familiar roles, recombined into sentences not seen in training
SA	Spurious Anticorrelation	Familiar fillers that never co-occurred in the same training sentence
FC	Full Combinatorial	A filler tested in a role it never occupied in training
NF	Novel Filler	A word the inner model never encountered in any role at all
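One way to make the four task definitions above concrete is to label a test sentence by the hardest condition it triggers relative to a training set. The sketch below is an interpretation of those definitions, not the paper's data-generation code; the function `hardest_task` and the extra words "Ann", "Saw", and "Zorp" are hypothetical.

```python
from itertools import combinations

N_ROLES = 3  # agent, verb, patient, in that order


def hardest_task(test, train):
    """Label an (agent, verb, patient) test sentence relative to a training set.

    Returns "NF", "FC", "SA", "SG", or "seen", checking the hardest case first.
    """
    seen_words = {w for s in train for w in s}
    seen_in_role = {(i, w) for s in train for i, w in enumerate(s)}
    seen_pairs = {
        ((i, s[i]), (j, s[j]))
        for s in train
        for i, j in combinations(range(N_ROLES), 2)
    }

    if any(w not in seen_words for w in test):
        return "NF"   # a word that never appeared anywhere in training
    if any((i, w) not in seen_in_role for i, w in enumerate(test)):
        return "FC"   # a known word in a role it never occupied
    test_pairs = [((i, test[i]), (j, test[j]))
                  for i, j in combinations(range(N_ROLES), 2)]
    if any(p not in seen_pairs for p in test_pairs):
        return "SA"   # known role-fillers that never co-occurred
    return "seen" if tuple(test) in set(train) else "SG"


train = [("Tom", "Ate", "Food"), ("Tom", "Saw", "Ball"), ("Ann", "Saw", "Food")]

print(hardest_task(("Tom", "Saw", "Food"), train))   # SG
print(hardest_task(("Ann", "Ate", "Food"), train))   # SA
print(hardest_task(("Food", "Ate", "Tom"), train))   # FC
print(hardest_task(("Zorp", "Ate", "Food"), train))  # NF
```

Each step down the list removes one more piece of training-time support: first the exact sentence, then the co-occurrence, then the role assignment, and finally the word itself.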

03 — Two Ways to Implement a Pointer

Cortical stripes vs. vector algebra

The original indirection model and its later, faster descendant both implement the same pointer idea — but in very different substrates.

PFC Layers (Kriete et al.)

A direct computational model of prefrontal cortex anatomy: sparsely-interconnected neuron “stripes,” each implemented as its own recurrent layer, gated by a separate basal-ganglia module that decides when a stripe updates.

Biologically plausible by design — and expensive. Many interacting recurrent parts simulating neural dynamics over time make this slow to train and test.

Holographic Reduced Representations

Jovanovich’s (2017) replacement: role–filler binding done with circular convolution on fixed-size vectors, and approximate inverse operations to unbind. No simulated neurons, no recurrent stripe dynamics — just vector math.

Mullinax (2020) kept HRRs for role encoding but wrapped them in an LSTM-based word embedder, forming the OL/IND model used as this paper’s gold-standard baseline.
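The HRR binding and unbinding operations described above can be sketched in a few lines. This is a minimal, pure-Python illustration of circular convolution and its approximate inverse, assuming random Gaussian vectors of dimension 512 and a nearest-neighbour clean-up step; the dimension, seed, and helper names are assumptions made here, and none of this is taken from the Jovanovich or Mullinax implementations.

```python
import math
import random


def random_vector(n, rng):
    """Draw an HRR-style vector with elements ~ N(0, 1/n)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def approx_inverse(a):
    """Involution a*[i] = a[-i mod n], the usual approximate inverse."""
    n = len(a)
    return [a[-i % n] for i in range(n)]


def unbind(trace, role):
    """Recover a noisy copy of whatever was bound to `role` in `trace`."""
    return bind(approx_inverse(role), trace)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))


rng = random.Random(0)
n = 512
roles = {r: random_vector(n, rng) for r in ("agent", "verb", "patient")}
fillers = {w: random_vector(n, rng) for w in ("Tom", "Ate", "Food")}

# One fixed-size trace holding all three role-filler bindings, superimposed.
sentence = {"agent": "Tom", "verb": "Ate", "patient": "Food"}
trace = [0.0] * n
for role, word in sentence.items():
    bound = bind(roles[role], fillers[word])
    trace = [t + b for t, b in zip(trace, bound)]


def decode(role):
    """Unbind a role, then clean up by picking the most similar known filler."""
    noisy = unbind(trace, roles[role])
    return max(fillers, key=lambda w: cosine(noisy, fillers[w]))


print({role: decode(role) for role in sentence})
```

Unbinding returns only a noisy approximation of the filler, which is why the sketch finishes with a similarity-based clean-up. The quadratic-time `bind` is written for readability; an FFT-based version computes the same circular convolution faster.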

04 — Methods

Five architectures, one shared encoder/decoder skeleton

Every model reads and reproduces three-word sentences through a nested outer/inner structure: an outer component spells words letter-by-letter, an inner component binds the resulting word-vectors into a sentence and back. The authors swap LSTM and Transformer components into both slots, then compare against the older indirection model.

OL/IND

Outer LSTM, inner Indirection (HRR + Q-learning). Mullinax 2020 baseline, reused as-is.

OL/IL

Outer LSTM, inner LSTM. The fully-recurrent control.

OL/IT

Outer LSTM, inner Transformer.

OT/IL

Outer Transformer, inner LSTM.

OT/IT

Outer Transformer, inner Transformer. Fully attention-based.

Only the four LSTM/Transformer combinations were newly trained, all with supervised learning (Adam/Nadam optimizers, MSE + BCE loss). OL/IND was not rebuilt for this paper: its reinforcement-learning training loop (Q-learning on top of HRR bindings) doesn’t slot cleanly into the same supervised pipeline, which is likely why an OT/IND combination was never attempted either.

05 — Figure 1

How a sentence actually moves through the pipeline

[Figure: diagram of the inner/outer encoder-decoder pipeline, showing “Tom Ate Food” passing through the outer encoder, inner encoder, inner decoder, and outer decoder.]

Figure 1 (paper). “Tom Ate Food” enters the Outer Encoder, which reads each word independently as a sequence of letter tokens and produces one encoding per word. Those three word-encodings pass to the Inner Encoder, which compresses them into a single sentence encoding. The Inner Decoder unpacks that sentence encoding back into three word-embeddings, and the Outer Decoder spells each one back out into letters.

This is the architecture every model shares — what differs is purely whether the outer spelling component and the inner role-binding component are built from LSTM layers, Transformer blocks, or (in the baseline) an indirection mechanism.
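The nesting is easier to see as four function slots wired together. The sketch below is a toy with no learning at all: each stand-in is a lossless, hand-written function chosen only to make the data flow visible, and every function name is hypothetical. In the paper's models these slots hold trained LSTM, Transformer, or indirection components, and the encodings are fixed-size learned vectors rather than the variable-length lists used here.

```python
from typing import List, Sequence

Vector = List[float]


def outer_encode(word: str) -> Vector:
    """Read a word letter by letter into one word vector (toy: char codes)."""
    return [float(ord(ch)) for ch in word]


def inner_encode(words: Sequence[Vector]) -> Vector:
    """Pack the word vectors into one sentence encoding (toy: length-prefixed concat)."""
    sentence: Vector = []
    for vec in words:
        sentence.append(float(len(vec)))
        sentence.extend(vec)
    return sentence


def inner_decode(sentence: Vector) -> List[Vector]:
    """Unpack the sentence encoding back into per-word vectors."""
    words, i = [], 0
    while i < len(sentence):
        length = int(sentence[i])
        words.append(sentence[i + 1 : i + 1 + length])
        i += 1 + length
    return words


def outer_decode(vec: Vector) -> str:
    """Spell a word vector back out as letters."""
    return "".join(chr(int(x)) for x in vec)


def autoencode(sentence: Sequence[str]) -> List[str]:
    word_vecs = [outer_encode(w) for w in sentence]   # one pass per word
    sentence_vec = inner_encode(word_vecs)            # three words -> one encoding
    decoded_vecs = inner_decode(sentence_vec)         # one encoding -> three words
    return [outer_decode(v) for v in decoded_vecs]


print(autoencode(["Tom", "Ate", "Food"]))  # ['Tom', 'Ate', 'Food']
```

Swapping architectures in the paper amounts to replacing the outer pair or the inner pair while leaving the `autoencode` wiring unchanged.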

06 — Figure 2 & Results

Where the cracks show up

[Figure: bar charts comparing word-level and letter-level accuracy across the SG, SA, FC, and NF tasks for five models.]

Figure 2 (paper). Mean accuracy (N=10 runs, ±1.96 SE) across the four tasks. Most models sit near ceiling on SG and SA, with OT/IL the main exception; the clearest separation appears at FC and especially NF.

Task	OL/IND	OL/IL	OL/IT	OT/IL	OT/IT
SG / SA	~100%	~90–100%	~100%	75–93%	~100%
FC	~100%	<40%	~85–92%	<20%	~100%
NF	~100%	~55–65%	~55–95%	~55–72%	~58–72%

Figures approximated from the plotted bars in Figure 2; word-level and letter-level values are merged into ranges per cell for brevity.
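For readers unfamiliar with the "±1.96 SE" error bars in the Figure 2 caption, the calculation can be sketched as follows. The per-run accuracies are made up for illustration and are not values from the paper, and using the sample standard deviation (n - 1 denominator) is an assumption about how the standard error was computed.

```python
import math
import statistics


def mean_with_ci(accuracies, z=1.96):
    """Return (mean, half_width), where half_width = z * standard error."""
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    # Standard error of the mean, from the sample standard deviation.
    se = statistics.stdev(accuracies) / math.sqrt(n)
    return mean, z * se


# Made-up accuracies for N=10 runs; illustrative only, not from the paper.
runs = [0.58, 0.61, 0.66, 0.60, 0.63, 0.59, 0.70, 0.62, 0.64, 0.57]
mean, half_width = mean_with_ci(runs)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With z = 1.96 the interval corresponds to an approximate 95% confidence interval under a normal approximation, which is fairly loose for only ten runs.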

Three findings stand out. First, the outer-Transformer/inner-LSTM combination (OT/IL) actually underperforms the all-LSTM model on the easy tasks — the richer Transformer-produced embeddings seem to confuse a plain LSTM trying to bind them. Second, any model with an inner Transformer (OL/IT, OT/IT) handles FC well, contradicting earlier assumptions that one-hot/HRR-style representational interference was unavoidable on that task. Third, and most importantly, every non-indirection model — LSTM or Transformer, in any combination — lands in roughly the same 55–72% band on Novel Filler, while OL/IND stays near 100%.

07 — Conclusion

Attention learns abstraction. It doesn’t learn pointers.

Bottom line

Self-attention is enough to recombine known fillers into unfamiliar arrangements. That’s a genuine win over LSTMs, and it’s why OL/IT and OT/IT are the only supervised models that handle the FC task well. But self-attention has no mechanism analogous to a memory address: when a filler has never appeared in training at all, there’s no pointer to redirect, so every attention-based model falls back to roughly the same degraded performance as a recurrent network.

The indirection model’s advantage isn’t free, though: its inner component trains via reinforcement learning rather than supervised backpropagation, which the authors note makes it considerably slower than any of the supervised models compared here.

The paper explicitly leaves open why attention fails at indirection specifically, and never tests an Outer-Transformer/Inner-Indirection (OT/IND) hybrid that might combine attention's abstraction strength with explicit pointer-style binding.