A Geometric View of LLMs: Semantic Spaces and Denoising
Large Language Models (LLMs) are commonly described as "black boxes," for the same reasons cited for many other deep learning systems. Because their capabilities emerge from optimization rather than explicit design, we often rely entirely on output evaluation to judge a model, with little regard for internal explainability.
Such complex systems seem impervious to mechanistic questions: What exactly does this specific neuron do? Why is this parameter activated by one subset of inputs but not another? Why do we get this exact behavior in this specific edge case? While the billions of individual weights remain obscure, there is still ample room for understanding how these networks reason. Even if the individual parameters are opaque, the architecture itself is open to a high degree of structural interpretability.
Let's start by describing, on a high level, how a modern LLM works. The architecture is fundamentally based on Transformers, a deep learning model that makes heavy use of an attention mechanism. In simple terms, attention allows the model to look at a sequence of words and mathematically weigh the importance of each word relative to the others. It dynamically assesses which parts of a sentence are most relevant to each word being processed.
[ Token_1 ]   [ Token_2 ]   [ Token_3 ]
     |             |             |
     +-------------+-------------+
                   |
     +----------------------------+
     |    ATTENTION MECHANISM     |
     | (Mixes context via weights)|
     +----------------------------+
                   |
     +----------------------------+
     |   FEED-FORWARD NET (FFN)   |
     | (Non-linear Denoising step)|
     +----------------------------+
                   |
       [ Refined Internal State ]
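The weighting step above can be sketched in a few lines of NumPy. This is a deliberately minimal, single-head version with no learned projections: each token's vector is mixed with its context according to similarity-based weights.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(tokens):
    """Mix each token's vector with its context via relevance weights.

    tokens: (seq_len, d) array of token vectors.
    Returns a (seq_len, d) array of context-mixed vectors.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)  # pairwise relevance
    weights = softmax(scores)                # each row sums to 1
    return weights @ tokens                  # weighted mix of the context

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))             # [ Token_1 ] [ Token_2 ] [ Token_3 ]
mixed = attention(tokens)
print(mixed.shape)                           # each token now carries context
```

Real Transformers add learned query/key/value projections and multiple heads on top of this skeleton, but the core "weigh and mix" operation is the same.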
The fundamental view of an LLM that we are going to discuss is that of a machine operating on semantic spaces, which are defined by the high-dimensional vector spaces where these linguistic tokens live.
The Division of Labor: Attention and Non-linearity
The inputs to the system are contextual tokens. Each block of the Transformer combines the semantic information conveyed by these tokens using dynamic weights; this is the routing phase of the attention mechanism. Related tokens receive stronger weights, effectively producing a stronger signal, while unrelated tokens exert only marginal effects.
Latent Vector Space (Simplified 2D)
^
|        * [ Paris ]
|        ^
|        |
|        | [ Capital ]
|        |
|        + (Vector Addition)
|       ^
|      / [ France ]
|     /
+--------------------------------->
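The vector arithmetic the diagram hints at can be sketched with toy, hand-picked 2D vectors. These are purely illustrative values, not learned embeddings; the point is only that adding two concept vectors can land nearest to a third concept.

```python
import numpy as np

# Toy 2D "semantic" vectors, hand-picked for illustration only.
vectors = {
    "France":  np.array([1.0, 0.0]),   # country direction
    "Capital": np.array([0.0, 1.0]),   # seat-of-government direction
    "Paris":   np.array([1.0, 1.0]),   # roughly country + capital
    "Berlin":  np.array([0.2, 1.1]),   # a different capital, for contrast
}

query = vectors["France"] + vectors["Capital"]

def cosine(a, b):
    # Cosine similarity: direction match, ignoring magnitude.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Nearest candidate concept to France + Capital:
best = max(["Paris", "Berlin"], key=lambda w: cosine(query, vectors[w]))
print(best)  # Paris
```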
The result of this semantic mixing is then passed to a Feed-Forward Neural Network (FFN). The network uses its non-linear activation functions to process and effectively denoise the representation. This entire routing-and-processing operation is repeated across dozens of consecutive Transformer blocks.
"Paris" (City) \
"Capital" (Gov) ==> [ FFN Filter ] ==> [ Concept: FRANCE ]
Typo / Syntax /
[ NOISY CONTEXT ] [ Ambiguity ] [ REFINED SIGNAL ]
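The FFN's filtering role can be sketched as the standard expand/non-linearity/contract pattern. The weights here are random rather than trained, so this shows only the shape of the computation; note how the ReLU literally zeroes out part of the signal.

```python
import numpy as np

rng = np.random.default_rng(1)
d, hidden = 4, 16                         # model width, expanded FFN width
W1 = rng.normal(size=(d, hidden))         # untrained weights, for shape only
W2 = rng.normal(size=(hidden, d))

def ffn(x):
    # Position-wise feed-forward net: expand, apply a non-linearity, contract.
    # The ReLU discards negative components -- the "filtering" step.
    return np.maximum(x @ W1, 0.0) @ W2

mixed = rng.normal(size=(3, d))           # output of the attention mixing
refined = ffn(mixed)
print(refined.shape)                      # same shape, processed content
```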
Finally, the data reaches the language modeling head, which takes this highly refined, cleanly filtered internal representation and maps it back into our vocabulary, producing the final next-token prediction by sampling from the resulting distribution.
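That last step can be sketched as a projection into vocabulary logits followed by sampling. The vocabulary and weights below are toy stand-ins, not a real model's head:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = ["Paris", "Berlin", "London", "Rome"]  # toy vocabulary
d = 4
W_head = rng.normal(size=(d, len(vocab)))      # untrained LM head projection

def sample_next_token(hidden_state):
    logits = hidden_state @ W_head             # map state onto the vocabulary
    probs = np.exp(logits - logits.max())      # softmax into a distribution
    probs /= probs.sum()
    return rng.choice(vocab, p=probs)          # sample the discrete token

h = rng.normal(size=d)                         # the refined internal state
print(sample_next_token(h))
```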
Information Injection as Iterative Denoising
Let's look closer at how concepts merge in this space. Taking a classic conceptual example: combining the contextual tokens of France and Capital will predictably result in the injection of the concept of Paris into the latent representation.
When you think about it carefully, isn't this essentially a sophisticated form of signal denoising?
"We have information scattered and diffused across different tokens, possibly in ambiguous ways. The model reorganizes this information in a more coherent way."
A single word isolated in a vacuum is inherently noisy: it carries multiple dictionary definitions, cultural associations, and syntactic uses. The model takes this scattered, multi-dimensional data and compacts it with respect to its own internal representation. In doing so, it actively filters out secondary information and irrelevant nuances that could lead to ambiguity. By the time the final prediction is calculated, the Transformer has acted as an iterative semantic filter, continuously shedding noise to isolate the pure conceptual signal.
Transformers as Semantic Diffusion Models
If we step back and look purely at the geometry and the flow of information, we can make a bold conceptual leap: Transformers are, fundamentally, a type of diffusion model.
Traditional diffusion models (like those used for image generation) operate over time: they take a mathematical object and loop it through a network repeatedly, subtracting noise to recover a clear signal. An LLM operates similarly, but over its depth.
The diffusion process in an LLM happens sequentially from one Transformer block to another. Because of residual connections, each block takes a "noisy" semantic state and subtracts ambiguity. By the time the data reaches the final block, the model has iteratively diffused the scattered input into a single, highly refined conceptual representation.
This brings us to the language modeling head. The final Transformer block holds this fully "denoised" concept, but a concept is not yet a word. The model must now extract a discrete token from this continuous, multi-dimensional representation.
It is only when the system samples from the obtained distribution that this collapses into a single, discrete token. That token is then appended to the context, and the semantic diffusion process begins all over again.
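The overall generation loop described above can be sketched with the whole diffusion-over-depth pass abstracted into a single function. `denoise_and_sample` is a hypothetical stand-in for the full stack of Transformer blocks plus the LM head:

```python
# Sketch of the autoregressive loop; `denoise_and_sample` stands in for
# the full forward pass (all blocks plus the language modeling head).
def generate(context, denoise_and_sample, max_new_tokens=5):
    tokens = list(context)
    for _ in range(max_new_tokens):
        next_token = denoise_and_sample(tokens)  # full "denoising" pass
        tokens.append(next_token)                # re-enter the loop with it
    return tokens

# Toy stand-in model: always predicts the next letter of the alphabet.
demo = generate(["a"], lambda toks: chr(ord(toks[-1]) + 1))
print(demo)  # ['a', 'b', 'c', 'd', 'e', 'f']
```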