Sequence Modelling from Markov Chains to GLM-5.2

Open-weight models such as GLM-5.2 make the gap between closed and open models feel much smaller. The useful way to read that history is not as a list of model names, but as a sequence modelling story.

A sequence model first chooses a representation, then a dependency graph, then a way to spend compute. Raw text is mapped into tokens $z_1,\ldots,z_T$, tokens become vectors $X\in\mathbb{R}^{T\times d}$, and the model repeatedly mixes information across positions and across channels.

For language modelling, the objective is usually autoregressive:

$$ p(x_1,\ldots,x_T)=\prod_{t=1}^{T}p(x_t\mid x_{<t}). $$

That equation hides the whole design problem. What counts as a token? Which previous positions may influence position $t$? Should the model build a bidirectional representation, generate left-to-right, or translate from one sequence into another?

Tokens And Positions

The first modelling decision happens before the network. Character sequences are too long, word vocabularies are too brittle, and bytes are too low-level. Subword tokenisation made neural language models practical by representing rare words as smaller reusable units [2]. SentencePiece made this cleaner by training tokenisers directly from raw text, without assuming pre-tokenised words [3].

$$ z_{1:T}=\operatorname{Tokenise}(s),\qquad X_0[t,:]=E[z_t]+P[t]. $$

The tokeniser is not a footnote. It sets the length of the sequence, the granularity of memory, and the surface on which the model must learn morphology, code, arithmetic, and multilingual structure.
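
To make the subword idea concrete, here is a minimal sketch of the merge loop used by byte-pair-encoding style tokenisers [2]. It assumes a toy word-frequency corpus, and the helper names `most_frequent_pair` and `merge_pair` are illustrative rather than taken from any library.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word so that each occurrence of `pair` becomes one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (an assumption): each word is split into characters and mapped to a count.
corpus = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)
```

Each merge shortens the sequences the model will see, which is the sense in which the tokeniser sets sequence length before any network weights are involved.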

Once tokens become vectors, most modern architectures alternate two operations:

$$ X' = X + S(\operatorname{Norm}(X)), \qquad X^+ = X' + C(\operatorname{Norm}(X')). $$

Here $S$ is sequence mixing: communication across positions. $C$ is channel mixing: computation inside each token vector. The vocabulary became common in Mixer-style models [10], but it is also a good way to understand Transformers. Attention is sequence mixing. The MLP is channel mixing.
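
The distinction can be sketched in NumPy. This is an illustration under simple assumptions, not any particular model: $S$ is a fixed causal averaging matrix applied across positions, $C$ is a single `tanh` layer applied to each token vector independently, and the normalisation is a plain RMS rescaling.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8
X = rng.normal(size=(T, d))


def norm(X, eps=1e-6):
    # Rescale each token vector to roughly unit root-mean-square.
    return X / np.sqrt((X ** 2).mean(axis=-1, keepdims=True) + eps)


# Sequence mixer: a causal, row-normalised matrix that acts across positions.
A = np.tril(rng.random((T, T)))
A /= A.sum(axis=1, keepdims=True)

# Channel mixer: a weight matrix that acts inside each token vector.
W = rng.normal(size=(d, d)) / np.sqrt(d)


def S(X):
    return A @ X           # multiplies on the left: mixes rows (positions)


def C(X):
    return np.tanh(X @ W)  # multiplies on the right: mixes columns (channels)


def block(X):
    X1 = X + S(norm(X))
    return X1 + C(norm(X1))


Y = block(X)
```

Left multiplication moves information between positions, right multiplication transforms each position in isolation, and the residual additions keep the original signal available to both.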

Dependence Graphs

The history of sequence modelling is largely the history of changing the graph over positions.

Markov and $n$-gram models use the smallest graph: position $t$ sees a fixed local window [1]:

$$ p(x_t\mid x_{<t})\approx p(x_t\mid x_{t-n+1:t-1}). $$
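
A count-based sketch makes the fixed window tangible. It assumes the common convention that an $n$-gram conditions on the previous $n-1$ tokens, uses a toy sentence, and applies no smoothing, so unseen continuations get probability zero. The function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def train_ngram(tokens, n):
    """Count next-token frequencies for each context of n-1 previous tokens."""
    counts = defaultdict(Counter)
    for t in range(n - 1, len(tokens)):
        context = tuple(tokens[t - n + 1:t])
        counts[context][tokens[t]] += 1
    return counts


def prob(counts, context, token):
    """Maximum-likelihood estimate of p(token | context)."""
    total = sum(counts[context].values())
    return counts[context][token] / total if total else 0.0


def log_likelihood(counts, tokens, n):
    """Sum of conditional log-probabilities, following the autoregressive product."""
    return sum(
        math.log(prob(counts, tuple(tokens[t - n + 1:t]), tokens[t]))
        for t in range(n - 1, len(tokens))
    )


tokens = "the cat sat on the mat the cat ran".split()
bigram = train_ngram(tokens, n=2)
p_cat = prob(bigram, ("the",), "cat")  # "the" is followed by "cat" twice and "mat" once
ll = log_likelihood(bigram, tokens, n=2)
```

The table has one row per observed context, which is why the approach generalises poorly: two contexts that mean the same thing but differ in one token share nothing.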

RNNs replaced the table with a learned state [4]:

$$ h_t=f_\theta(h_{t-1}, x_t), \qquad p(x_{t+1}\mid x_{\le t})=\operatorname{softmax}(W h_t). $$

This gives generalisation across contexts, but the graph is still a chain. Everything, including the gradient, must pass through the same narrow recurrent path:

$$ \begin{aligned} \frac{\partial L}{\partial h_t} &= \frac{\partial L}{\partial h_T} \prod_{i=t+1}^{T} \frac{\partial h_i}{\partial h_{i-1}}. \end{aligned} $$
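
A worked number shows why the product matters. The sketch below assumes the simplest possible recurrence, a scalar linear one $h_i = w\,h_{i-1} + x_i$, so every Jacobian factor equals $w$ and the product over $T-t$ steps is $w^{T-t}$.

```python
def gradient_scale(w, steps):
    """Product of `steps` identical Jacobians dh_i/dh_{i-1} = w."""
    scale = 1.0
    for _ in range(steps):
        scale *= w
    return scale


shrinking = gradient_scale(0.9, 100)  # about 2.7e-5: the signal vanishes
growing = gradient_scale(1.1, 100)    # about 1.4e4: the signal explodes
```

Over 100 steps, a factor only slightly below one erases the gradient and a factor only slightly above one blows it up. Real RNNs multiply matrices rather than scalars, but the same compounding applies.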

LSTMs softened the chain with gates and an explicit memory cell [5]:

$$ c_t=f_t\odot c_{t-1}+i_t\odot \tilde c_t,\qquad h_t=o_t\odot\tanh(c_t). $$
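
One step of this update can be sketched in NumPy. The stacked weight layout (forget, input, output, candidate) and the name `lstm_step` are assumptions made for illustration; real implementations order and fuse the gates in different ways.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x] to four stacked gate pre-activations."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:d])            # forget gate: how much of c_prev to keep
    i = sigmoid(z[d:2 * d])        # input gate: how much new content to write
    o = sigmoid(z[2 * d:3 * d])    # output gate: how much of the cell to expose
    c_tilde = np.tanh(z[3 * d:])   # candidate cell content
    c = f * c_prev + i * c_tilde
    h = o * np.tanh(c)
    return h, c


# With the forget gate saturated open and the input gate shut,
# the cell carries its contents forward almost unchanged.
d, d_x = 3, 2
W = np.zeros((4 * d, d + d_x))
b = np.zeros(4 * d)
b[0:d] = 20.0
b[d:2 * d] = -20.0
c_prev = np.array([0.5, -1.0, 2.0])
h, c = lstm_step(np.ones(d_x), np.zeros(d), c_prev, W, b)
```

The additive cell update is the point: when $f_t$ is close to one, the path from $c_{t-1}$ to $c_t$ is nearly an identity, so gradients along it are not repeatedly squashed.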

The recurrent lineage taught the right lesson: memory should be learned and controlled. The shape was the problem. A chain is hard to train over long horizons and hard to parallelise.

Attention Made Memory Addressable

Attention changed the graph. Instead of forcing the past through one vector, every token writes keys and values, and later tokens query them [6]:

$$ \operatorname{Attn}(Q,K,V)= \operatorname{softmax}\left(\frac{QK^\top}{\sqrt{d_h}} + M\right)V. $$

The causal mask $M$ keeps language modelling autoregressive. The important move is random access: a token can retrieve an earlier token directly, rather than hoping it survived inside $h_t$.

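import math
import torch

q = x @ W_q  # x: (T, d_model); W_q, W_k, W_v: (d_model, d_head)
k = x @ W_k
v = x @ W_v
scores = q @ k.T / math.sqrt(q.shape[-1])  # scale by sqrt(d_head)
causal_mask = torch.triu(torch.ones(len(x), len(x), device=x.device), diagonal=1).bool()  # True where the key lies in the future
scores = scores.masked_fill(causal_mask, float("-inf"))
y = torch.softmax(scores, dim=-1) @ v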

The Transformer stacked this retrieval operation with residual streams, normalisation, and feed-forward computation [6]:

$$ x' = x + \operatorname{Attn}(\operatorname{Norm}(x)), \qquad x^{+}=x' + \operatorname{MLP}(\operatorname{Norm}(x')). $$

The residual stream became the working memory. Attention moved information between positions. The MLP transformed information within each position. Depth repeated this routing again and again.

The same block supports different sequence modelling regimes by changing the mask. Encoders use bidirectional mixing and learn representations for all positions, as in BERT [7]. Decoders use causal mixing and generate one token at a time. Encoder-decoder models use one graph for the source sequence, one causal graph for the target, and cross-attention between them.
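
The three regimes can be written as three additive masks fed to one attention function. This is a NumPy sketch with toy sizes; `additive_mask` and `attention` are illustrative names, and learned projections are omitted so the masks stay in focus.

```python
import numpy as np


def additive_mask(allowed):
    """0 where attention is allowed, -inf where it is blocked."""
    return np.where(allowed, 0.0, -np.inf)


def attention(Q, K, V, M):
    scores = Q @ K.T / np.sqrt(Q.shape[-1]) + M
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V


T_src, T_tgt, d = 4, 3, 8

# Encoder: every source position sees every source position.
encoder_mask = additive_mask(np.ones((T_src, T_src), dtype=bool))
# Decoder: target position t sees only positions <= t.
decoder_mask = additive_mask(np.tril(np.ones((T_tgt, T_tgt), dtype=bool)))
# Cross-attention: every target position sees the whole source.
cross_mask = additive_mask(np.ones((T_tgt, T_src), dtype=bool))

rng = np.random.default_rng(0)
X_src = rng.normal(size=(T_src, d))
X_tgt = rng.normal(size=(T_tgt, d))

encoded = attention(X_src, X_src, X_src, encoder_mask)
causal_out = attention(X_tgt, X_tgt, X_tgt, decoder_mask)
cross_out = attention(X_tgt, X_src, X_src, cross_mask)
```

Only the mask and the source of the keys and values change between the three calls; the mixing operation itself is identical.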

GPT-style models made the decoder-only case the interface itself [8], [9]:

$$ p_\theta(x)=\prod_t p_\theta(x_t\mid x_{<t}). $$

That was a conceptual simplification. Prompt text specifies the task; the model continues the sequence. But it moved the pressure elsewhere: long contexts made the key-value cache and attention bandwidth expensive.
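
A back-of-envelope calculation shows where the pressure comes from. The configuration below is hypothetical, chosen only to give round numbers, and assumes 2 bytes per cached value (16-bit precision).

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per head, per layer, per token.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value


# Hypothetical dense decoder: 32 layers, 32 KV heads of width 128.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=4096)
longer = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=32768)

print(full / 2**30, "GiB")    # 2.0 GiB
print(longer / 2**30, "GiB")  # 16.0 GiB
```

The cache grows linearly with context length and must be read for every generated token, so decoding cost is dominated by memory traffic rather than arithmetic.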

Efficient Sequence And Channel Mixing

Llama-style models kept the decoder-only interface but made both mixers easier to scale [11]-[17]. The changes were not glamorous. They were engineering pressure turned into architecture.

Pre-normalisation stabilised deep optimisation [11]. RMSNorm removed unnecessary mean subtraction [12]:

$$ \operatorname{RMSNorm}(x)= \frac{x}{\sqrt{\frac{1}{d}\sum_i x_i^2+\epsilon}}\odot g. $$
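
The formula translates almost directly into NumPy. The default `eps` here is an assumption; models choose their own value.

```python
import numpy as np


def rms_norm(x, g, eps=1e-6):
    """Divide by the root-mean-square over the last axis, then apply the gain g."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * g


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16)) + 3.0  # deliberately non-zero mean
g = np.ones(16)
y = rms_norm(x, g)
```

Unlike LayerNorm, nothing is subtracted, so the output keeps a non-zero mean. Only the scale of each vector is fixed.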

RoPE made position relative inside the sequence mixer by rotating query and key channels [13]:

$$ \langle R_m q, R_n k\rangle \quad\text{depends on}\quad m-n. $$
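
The relative-position property can be checked numerically. This sketch assumes the interleaved pairing of channels and the base of 10000 from the RoFormer paper [13]; implementations differ in how they pair channels.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x by angles pos * theta_i."""
    d = x.shape[-1]  # must be even
    theta = base ** (-np.arange(0, d, 2) / d)  # one frequency per channel pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

same_offset_a = rope(q, 5) @ rope(k, 2)     # offset 3
same_offset_b = rope(q, 13) @ rope(k, 10)   # offset 3, shifted by 8
other_offset = rope(q, 5) @ rope(k, 3)      # offset 2
```

The first two scores agree because shifting both positions by the same amount rotates query and key together, which leaves their inner product unchanged.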

SwiGLU strengthened the channel mixer [14], while grouped-query attention reduced sequence-mixing bandwidth by using fewer key/value heads than query heads [15], [16]:

$$ n_Q > n_{KV} > 1. $$
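
Grouped-query attention can be sketched by letting each key/value head serve a block of query heads. The shapes and the function name are assumptions for illustration, and projections are omitted.

```python
import numpy as np


def grouped_query_attention(Q, K, V):
    """Q: (n_q, T, d_h); K, V: (n_kv, T, d_h), with n_q a multiple of n_kv."""
    n_q, T, d_h = Q.shape
    group = n_q // K.shape[0]
    # Each KV head is shared by `group` consecutive query heads.
    K = np.repeat(K, group, axis=0)
    V = np.repeat(V, group, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V


rng = np.random.default_rng(0)
n_q, n_kv, T, d_h = 8, 2, 5, 4
Q = rng.normal(size=(n_q, T, d_h))
K = rng.normal(size=(n_kv, T, d_h))
V = rng.normal(size=(n_kv, T, d_h))
out = grouped_query_attention(Q, K, V)
```

Only the `n_kv` heads need to be cached, so in this toy setting the cache is a quarter of the multi-head size while the number of query heads is unchanged.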

The result was not a new objective. It was the same causal Transformer made less brittle: better positional geometry, stronger per-token computation, more stable depth, and cheaper decoding.

DeepSeek Made Scale Conditional

Dense decoders are powerful, but every token activates essentially every parameter in every layer. DeepSeek-style systems changed the question from “how large can the dense model be?” to “where does full compute actually need to be spent?”

Mixture-of-experts layers make the channel mixer conditional. The parameter count can be large while only a few experts activate per token [18], [19]:

$$ y=\sum_{e\in \operatorname{TopK}(g(x))} g_e(x)E_e(x). $$
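
A single-token sketch of top-$k$ routing follows. Normalising the gate weights with a softmax over the selected experts is one common choice and is an assumption here; the cited systems add shared experts and load balancing, which are omitted.

```python
import numpy as np


def moe_layer(x, W_gate, experts, k=2):
    """Route one token vector x to its k highest-scoring experts."""
    logits = W_gate @ x
    top = np.argsort(logits)[-k:]  # indices of the k largest gate logits
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()           # renormalise over the selected experts only
    y = sum(g * experts[e](x) for g, e in zip(gates, top))
    return y, top


rng = np.random.default_rng(0)
d, n_experts = 8, 4
expert_weights = [rng.normal(size=(d, d)) for _ in range(n_experts)]
calls = []  # records which experts actually ran


def make_expert(i):
    def expert(x):
        calls.append(i)
        return expert_weights[i] @ x
    return expert


experts = [make_expert(i) for i in range(n_experts)]
W_gate = rng.normal(size=(n_experts, d))
x = rng.normal(size=d)
y, chosen = moe_layer(x, W_gate, experts, k=2)
```

All four experts hold parameters, but only two are evaluated for this token, which is how parameter count and per-token compute come apart.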

Multi-head Latent Attention makes the sequence mixer conditional in memory. Instead of storing full keys and values, the model stores a compressed latent and reconstructs what it needs [18], [19]:

$$ c_t^{KV}=W^{DKV}h_t, \qquad k_t=W^{UK}c_t^{KV}, \qquad v_t=W^{UV}c_t^{KV}. $$
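
The compression can be sketched for a single head in NumPy. The dimensions are toy values, and the decoupled rotary key used in DeepSeek-V2 [18] is left out, so this shows only the low-rank part of the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_model, d_latent, d_head = 6, 64, 8, 16

W_dkv = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_model)  # down-projection
W_uk = rng.normal(size=(d_head, d_latent)) / np.sqrt(d_latent)   # key up-projection
W_uv = rng.normal(size=(d_head, d_latent)) / np.sqrt(d_latent)   # value up-projection

H = rng.normal(size=(T, d_model))

# Only this (T, d_latent) array needs to live in the cache.
latent_cache = H @ W_dkv.T

# Keys and values are reconstructed from the latent when needed.
K = latent_cache @ W_uk.T
V = latent_cache @ W_uv.T

# The key up-projection can be folded into the query, so scores
# can be computed against the latent cache directly.
q = rng.normal(size=d_head)
scores_full = q @ K.T
scores_latent = (q @ W_uk) @ latent_cache.T
```

Here the cache holds 8 numbers per token instead of 32 for a separate key and value, and the score computation never has to materialise the full keys.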

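Sparse attention pushes the same idea into the dependence graph. Rather than attending densely over every previous token, an indexer $I$ scores the earlier positions and attention runs only over the top-scoring subset, with weights $\alpha_{tj}$ normalised over that subset [20]: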

$$ \operatorname{Attn}_{\mathrm{sparse}}(q_t,K,V)= \sum_{j\in \operatorname{TopK}(I(q_t,K))} \alpha_{tj}v_j. $$
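
A single-query sketch of the selection step is below. The indexer is a stand-in: a low-dimensional dot product through a hypothetical projection `W_idx`, not the indexer of any cited model. The attention weights are a softmax restricted to the selected keys.

```python
import numpy as np


def sparse_attention(q, K, V, index_scores, k):
    """Attend from one query over only the k keys the indexer ranks highest."""
    top = np.argsort(index_scores)[-k:]
    scores = K[top] @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[top], top


rng = np.random.default_rng(0)
T, d = 10, 8
K = rng.normal(size=(T, d))
V = rng.normal(size=(T, d))
q = rng.normal(size=d)

# Hypothetical cheap indexer: score keys in a 2-dimensional projected space.
W_idx = rng.normal(size=(d, 2))
index_scores = (K @ W_idx) @ (q @ W_idx)

out, selected = sparse_attention(q, K, V, index_scores, k=3)
dense_out, _ = sparse_attention(q, K, V, index_scores, k=T)
```

With `k` equal to the full length the result matches dense attention, so the sparse form is a strict restriction of the dense one. The saving comes from the indexer being much cheaper than the full score computation it replaces.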

That is the modern turn. Attention is no longer just a universal retrieval mechanism. It is also a routing problem. The model must decide which tokens, experts, and cached states are worth paying for.

GLM-5.2 And The Current Shape

GLM-5.2 sits on this side of the transition: open weights, long context, sparse attention, MoE-style active compute, and decoding optimisations [21], [23]. Z.ai reports a 1M-token context, MIT licensing, and a 753B-parameter model family [23].

The interesting detail is IndexShare. Sparse attention needs an indexer, but recomputing selection everywhere is costly. GLM-5.2 reuses sparse-attention indexers across groups of layers, reducing long-context overhead [22], [23].

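cached_idx = None  # sparse index shared by a group of layers
for layer in layers:
    if layer.refreshes_index or cached_idx is None:
        cached_idx = layer.indexer(q, k)  # recompute only on refresh layers
    x = x + sparse_mla_attention(x, cached_idx)
    x = x + moe_swiglu(x)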

This is why the current generation can feel both affordable and surprisingly capable. The lineage did not discard the old ideas. It compressed them.

Autoregressive modelling survived from Shannon. Subword tokenisation made the input tractable. Learned state survived from RNNs. Gating survived from LSTMs. Addressable memory survived from attention. Encoders, decoders, and encoder-decoders became different masks over the same sequence-mixing idea. Llama-style blocks made the dense core stable. DeepSeek-style routing made scale economical.

GLM-5.2 is one current expression of that stack. The frontier is no longer simply “make attention bigger”. It is allocation: choose the right tokens, choose the right dependency graph, mix sequence and channel information efficiently, and spend serious compute only where it buys predictive power.

References

[1] C. E. Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379-423, 1948. Available: https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

[2] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with Subword Units,” in Proceedings of ACL, pp. 1715-1725, 2016. Available: https://aclanthology.org/P16-1162/

[3] T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,” in Proceedings of EMNLP: System Demonstrations, pp. 66-71, 2018. Available: https://arxiv.org/abs/1808.06226

[4] J. L. Elman, “Finding Structure in Time,” Cognitive Science, vol. 14, no. 2, pp. 179-211, 1990. Available: https://doi.org/10.1207/s15516709cog1402_1

[5] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997. Available: https://doi.org/10.1162/neco.1997.9.8.1735

[6] A. Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems, 2017. Available: https://arxiv.org/abs/1706.03762

[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in NAACL, 2019. Available: https://arxiv.org/abs/1810.04805

[8] A. Radford et al., “Language Models are Unsupervised Multitask Learners,” OpenAI, 2019. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

[9] T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems, 2020. Available: https://arxiv.org/abs/2005.14165

[10] I. Tolstikhin et al., “MLP-Mixer: An all-MLP Architecture for Vision,” in Advances in Neural Information Processing Systems, 2021. Available: https://arxiv.org/abs/2105.01601

[11] R. Xiong et al., “On Layer Normalization in the Transformer Architecture,” in ICML, 2020. Available: https://arxiv.org/abs/2002.04745

[12] B. Zhang and R. Sennrich, “Root Mean Square Layer Normalization,” in NeurIPS, 2019. Available: https://arxiv.org/abs/1910.07467

[13] J. Su et al., “RoFormer: Enhanced Transformer with Rotary Position Embedding,” 2021. Available: https://arxiv.org/abs/2104.09864

[14] N. Shazeer, “GLU Variants Improve Transformer,” 2020. Available: https://arxiv.org/abs/2002.05202

[15] N. Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” 2019. Available: https://arxiv.org/abs/1911.02150

[16] J. Ainslie et al., “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints,” in EMNLP, 2023. Available: https://arxiv.org/abs/2305.13245

[17] H. Touvron et al., “LLaMA: Open and Efficient Foundation Language Models,” 2023. Available: https://arxiv.org/abs/2302.13971

[18] DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model,” 2024. Available: https://arxiv.org/abs/2405.04434

[19] DeepSeek-AI, “DeepSeek-V3 Technical Report,” 2024. Available: https://arxiv.org/abs/2412.19437

[20] DeepSeek-AI, “DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models,” 2025. Available: https://arxiv.org/abs/2512.02556

[21] GLM-5-Team et al., “GLM-5: from Vibe Coding to Agentic Engineering,” 2026. Available: https://arxiv.org/abs/2602.15763

[22] Y. Bai et al., “IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse,” 2026. Available: https://arxiv.org/abs/2603.12201

[23] Z.ai, “GLM-5.2,” Hugging Face model card, 2026. Available: https://huggingface.co/zai-org/GLM-5.2