arxiv:2602.22600

Transformers converge to invariant algorithmic cores

Published on Feb 26 · Submitted by Josh Schiffman on Mar 4

Abstract

Research reveals that independently trained transformers converge to shared algorithmic cores despite different weight configurations, indicating low-dimensional invariants in transformer computations.

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can implement the same function. Which internal structures reflect the computation, and which are accidents of a particular training run? This work extracts algorithmic cores: compact subspaces necessary and sufficient for task performance. Independently trained transformers learn different weights but converge to the same cores. Markov-chain transformers embed 3D cores in nearly orthogonal subspaces yet recover identical transition spectra. Modular-addition transformers discover compact cyclic operators at grokking that later inflate, yielding a predictive model of the memorization-to-generalization transition. GPT-2 language models govern subject-verb agreement through a single axis that, when flipped, inverts grammatical number throughout generation across scales. These results reveal low-dimensional invariants that persist across training runs and scales, suggesting that transformer computations are organized around compact, shared algorithmic structures. Mechanistic interpretability could benefit from targeting such invariants -- the computational essence -- rather than implementation-specific details.
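The claim that different weights can carry identical transition spectra rests on a basic linear-algebra fact: eigenvalues are unchanged by a change of basis. The sketch below illustrates only that fact, not the paper's core-extraction procedure. The 3-state matrix `P` and the helpers `random_basis` and `sorted_spectrum` are hypothetical, chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state Markov chain (each row sums to 1); not from the paper.
P = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

def random_basis(n, rng):
    """Return a random orthogonal n x n matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(n, n)))
    return q

def sorted_spectrum(m):
    """Eigenvalues of m in a canonical order, so spectra can be compared."""
    return np.sort_complex(np.linalg.eigvals(m))

# Two "implementations" of the same operator, written in unrelated bases.
Q1 = random_basis(3, rng)
Q2 = random_basis(3, rng)
A1 = Q1 @ P @ Q1.T
A2 = Q2 @ P @ Q2.T

spec_true = sorted_spectrum(P)
spec_1 = sorted_spectrum(A1)
spec_2 = sorted_spectrum(A2)

print("entries differ:", not np.allclose(A1, A2))
print("spectra agree: ", np.allclose(spec_1, spec_2))
```

The matrices `A1` and `A2` have different entries, yet both reproduce the spectrum of `P`, which is the sense in which a spectrum is an invariant while the weights are not.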

Community

Paper author Paper submitter

Training selects for behavior, not circuitry – so which internal structures reflect the computation, and which are accidents of a particular training run? Independently trained transformers – despite having very different weights – converge to shared low-dimensional algorithmic core subspaces that capture essential computations.

  • Three independently trained Markov-chain transformers (cosine similarity ~0.03) share a 3D core that recovers the ground-truth mechanism, matching the transition spectra to within 1%.
  • On modular addition, cores crystallize at grokking and automatically reveal rotational operators, yielding a predictive scaling law for grokking time: τ ∝ 1/(ωp), validated at R² > 0.99; grokking time scales inversely with task symmetry (p) and weight decay (ω).
  • Subject–verb agreement in GPT-2 Small/Medium/Large reduces to a single shared axis across scales -- flipping it inverts grammatical number throughout autoregressive generation.
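A law of the form τ ∝ 1/(ωp) can be checked by a linear fit in log space, where it predicts exponents of -1 on both ω and p. The sketch below shows that check on synthetic values generated from the law itself. The grid of `omegas` and `primes` and the constant `C` are assumptions for illustration; none of these numbers are measurements from the paper.

```python
import numpy as np

# Synthetic settings, NOT results from the paper.
omegas = np.array([0.1, 0.3, 1.0])      # assumed weight-decay values
primes = np.array([53.0, 97.0, 113.0])  # assumed moduli p
C = 1000.0                              # arbitrary proportionality constant

W, Pm = np.meshgrid(omegas, primes, indexing="ij")
tau = C / (W * Pm)  # grokking times generated from the stated law

# Fit log(tau) = a + b*log(omega) + c*log(p) by least squares.
X = np.column_stack([
    np.ones(W.size),
    np.log(W.ravel()),
    np.log(Pm.ravel()),
])
y = np.log(tau.ravel())
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, exp_omega, exp_p = coef

# Coefficient of determination of the fit.
residual = y - X @ coef
r_squared = 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)

print("exponent on omega:", exp_omega)
print("exponent on p:    ", exp_p)
```

With real grokking times in place of the synthetic `tau`, fitted exponents near -1 and a high `r_squared` would support the law; substantially different exponents would not.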

[Figure: Cyclic mechanism emerges in core at grokking]
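Why a rotational operator can implement modular addition: rotating by the angle 2πa/p and then by 2πb/p equals rotating by 2π(a+b)/p, and angles wrap around after a full turn, which is exactly reduction mod p. The sketch below checks this identity with 2x2 rotation matrices. The function `rotation` and the frequency parameter `k` are illustrative assumptions, not the operators extracted in the paper.

```python
import numpy as np

def rotation(a, p, k=1):
    """2x2 rotation by angle 2*pi*k*a/p (k is an assumed frequency)."""
    theta = 2.0 * np.pi * k * a / p
    return np.array([
        [np.cos(theta), -np.sin(theta)],
        [np.sin(theta),  np.cos(theta)],
    ])

p = 7
a, b = 5, 4

# Composing the rotations for a and b ...
composed = rotation(a, p) @ rotation(b, p)
# ... matches the rotation for (a + b) mod p.
direct = rotation((a + b) % p, p)

# Applying the unit rotation p times returns to the identity (a p-cycle).
full_cycle = np.linalg.matrix_power(rotation(1, p), p)

print("composition matches:", np.allclose(composed, direct))
print("p-fold cycle closes:", np.allclose(full_cycle, np.eye(2)))
```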

[Figure: Core steering inverts grammatical number in GPT-2]
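This page does not spell out how the axis is flipped. One natural reading is a reflection of the hidden state along a unit direction: the component along the axis changes sign and everything orthogonal to it is untouched. The sketch below implements that reflection (a Householder reflection) on random vectors. The function `flip_along_axis`, the dimension 8, and the direction `v` are hypothetical and do not come from GPT-2 or from the paper.

```python
import numpy as np

def flip_along_axis(h, v):
    """Reflect h across the hyperplane orthogonal to v.

    h: array of shape (d,) or (n, d), one hidden state per row.
    v: array of shape (d,), the axis to flip (normalized internally).
    """
    v = v / np.linalg.norm(v)
    proj = np.asarray(h @ v)[..., None]  # signed component along v
    return h - 2.0 * proj * v

rng = np.random.default_rng(0)
d = 8                          # assumed toy hidden size
v = rng.normal(size=d)         # hypothetical "number" axis
h = rng.normal(size=(4, d))    # four hypothetical hidden states

flipped = flip_along_axis(h, v)
unit = v / np.linalg.norm(v)

print("component along axis before:", h @ unit)
print("component along axis after: ", flipped @ unit)
```

Applying the flip twice returns the original states, so the intervention is its own inverse and leaves vector norms unchanged.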

Broader take: Mechanistic interpretability should target invariants – structure preserved across models – rather than implementation-specific details of any single model or checkpoint.

