@rasbt just dropped the perfect weekend article on gpt-oss architectures.

It’s wild to see how fast these architectures have leveled up since GPT-2 in 2019.

abs positional emb → RoPE
GELU MLP → Swish + SwiGLU
one dense feed-forward → MoE
Multi-head attention → GQA
…
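
To make the first swap concrete, here is a minimal pure-Python sketch of RoPE. It is an illustration under stated assumptions, not code from the article: the helper names `rope` and `dot`, the base of 10000, and the adjacent-pair layout are all choices made here (many real implementations pair the two halves of the vector instead). The point it shows is that after rotation, the query-key dot product depends only on the relative offset between positions, not on the absolute positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair of `vec` by an angle that grows with `pos`.

    Hypothetical helper for illustration; assumes an even-length vector and
    the adjacent-pair layout (x0, x1), (x2, x3), ...
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        # Pair i/2 uses frequency base^(-i/d), i.e. base^(-2*(i/2)/d).
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (3) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
```

With absolute positional embeddings the position is added to the token vector once at the input; here it is applied to queries and keys inside attention, which is why `score_near` and `score_far` agree up to floating-point error.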