<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Attention on Fred Wieser</title><link>https://fredericowieser.github.io/tags/attention/</link><description>Recent content in Attention on Fred Wieser</description><generator>Hugo</generator><language>en</language><managingEditor>frederico.wieser@proton.me (Fred Wieser)</managingEditor><webMaster>frederico.wieser@proton.me (Fred Wieser)</webMaster><lastBuildDate>Tue, 23 Jun 2026 01:14:06 +0100</lastBuildDate><atom:link href="https://fredericowieser.github.io/tags/attention/index.xml" rel="self" type="application/rss+xml"/><item><title>Sequence Modelling from Markov Chains to GLM-5.2</title><link>https://fredericowieser.github.io/posts/modern_transformer_pretraining/</link><pubDate>Sat, 20 Jun 2026 00:00:00 +0000</pubDate><author>frederico.wieser@proton.me (Fred Wieser)</author><guid>https://fredericowieser.github.io/posts/modern_transformer_pretraining/</guid><description>&lt;p&gt;Open-weight models such as GLM-5.2 make the gap between closed and open models feel much smaller. The useful way to read that history is not as a list of model names, but as a sequence modelling story.&lt;/p&gt;
&lt;p&gt;A sequence model first chooses a representation, then a dependency graph, then a way to spend compute. Raw text is mapped into tokens $z_1,\ldots,z_T$, the tokens are embedded as the rows of a matrix $X\in\mathbb{R}^{T\times d}$, and the model repeatedly mixes information across positions and across channels.&lt;/p&gt;</description></item></channel></rss>