# Guide Labs
URL: https://www.guidelabs.ai
We are building a new class of interpretable AI systems and foundation models that humans can reliably debug, trust, and understand.
--- title: About us
--- URL: https://www.guidelabs.ai/about-us/
**Guide Labs is a product-focused research company, and our goal is to build a new class of interpretable AI systems that humans and domain experts can reliably understand, steer, and debug.**
We have assembled a team that has more than 20 years of experience focused on the interpretability and reliability of AI systems. We have published more than two dozen papers at top machine learning venues. Critically, we have shown that machine learning models trained solely for narrow performance measures, without regard for interpretability, result in models whose explanations are [mostly unrelated to the model’s decision-making process](https://arxiv.org/abs/1810.03292), and are not [aligned with humans for consequential decisions](https://arxiv.org/abs/2410.15471). Even worse, explanations of unchecked models can [actively](https://arxiv.org/abs/2212.04629) [mislead](https://arxiv.org/abs/2011.05429). More recently, we’ve shown that [self-explanations, like chain-of-thought, of LLMs are unreliable](https://arxiv.org/abs/2401.07927).
These results directly inform our approach to [engineer AI models that are interpretable, reliable, and trustworthy](https://arxiv.org/abs/2405.05386). Toward this end, we have demonstrated the effectiveness of rethinking a model’s training process for [language models](https://arxiv.org/abs/2310.07819) and [protein property prediction](https://ai4d3.github.io/papers/79.pdf). We developed one of [the first image-generative models](https://openreview.net/forum?id=L9U5MJJleF) at the billion-parameter scale that is constrained to reliably explain its outputs in terms of human-understandable factors. More recently, we demonstrated that [billion-parameter language models](https://arxiv.org/abs/2411.06090) can also be trained to be interpretable.
Our past experience has shown that it is crucial to integrate interpretability, safety, and reliability constraints as part of the model development pipeline, and that these constraints can be satisfied without compromising downstream performance. With the new AI systems we are building, we can more easily identify the causes of erroneous outputs, detect when models latch onto spurious signals, and correct the models effectively. We aim to create a world where domain experts shift from merely 'prompting' AI to engaging in meaningful and truthful dialogue with AI systems.
**Our team’s work on engineering AI systems to be interpretable and reliable.**
_Here we give a brief overview of a selection of our team’s previous work._
* [Concept Bottleneck Generative Models](https://openreview.net/forum?id=L9U5MJJleF), ICLR 2023.
* For the first time, we show how to train large-scale generative models, at the billion-parameter scale, that are constrained to explain their outputs in terms of human-understandable factors.
* [Concept Bottleneck Language Models for Protein Design](https://arxiv.org/abs/2411.06090), ICLR 2025.
* We demonstrate how to train large language models that are constrained to explain their outputs in terms of human-understandable factors for protein design. For the first time, biochemists and other drug discovery experts can exercise fine-grained control over a protein language model for antibody and protein design.
* [Faithfulness Measurable Masked Language Models](https://arxiv.org/abs/2310.07819), ICML 2024.
* We present a method for ensuring that the explanations of masked language models are reliable.
* [Interpretability Needs a New Paradigm](https://arxiv.org/pdf/2405.05386), Position Paper, 2024.
* We describe our perspective on how to engineer and train models so that they can have truthful and reliable explanations.
* [Error Discovery by Clustering Influence Embeddings](https://arxiv.org/abs/2312.04712), NeurIPS, 2023.
* We demonstrate a technique for identifying groups of errors that a model is making.
* [Interpretable Mixture of Experts](https://arxiv.org/abs/2206.02107), TMLR, 2023.
* We show how to constrain mixture of experts, a classic machine learning approach, to be interpretable.
* [Improving Deep Learning Interpretability by Saliency Guided Training](https://arxiv.org/abs/2111.14338), NeurIPS, 2022.
* We present an approach for training deep learning models that are constrained to ignore input features in a dataset that have ‘noisy’ gradients. Models that latch onto high-frequency signals or that have noisy gradients have unreliable explanations.
* [Quantifying and Mitigating the Impact of Label Errors on Model Disparity Metrics](https://openreview.net/forum?id=RUzSobdYy0V), ICLR, 2023.
* We present an approach to attribute a model’s bias to its training data.
---
--- title: Introducing Clarity
--- date: Wed Jun 03 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/meet-clarity/
## Introducing Clarity
We introduce [Clarity](https://platform.guidelabs.ai/), the first inherently interpretable AI platform, now available by invitation as a research preview. Current AI systems are black boxes with opaque internal reasoning and no ability to trace output back to input or training data. Powered by Steerling 8B, Clarity fixes these problems. With it, you can:
* [Explore how the model reasons](#interpretability-features---concept-explanations-and-training-data-attribution). See the human-understandable concepts that drive model output.
* [Trace output to training data](#interpretability-features---concept-explanations-and-training-data-attribution). Understand how the outputs relate to the data the model was trained on.
* [Steer model behavior using concepts](#concept-steering-to-control-the-model-without-changing-prompts). Amplify and suppress concepts to control the model's output without using prompts.
[Reach out to partner](/contact/)
Today we are launching Clarity, the first inherently interpretable AI platform. Clarity is powered by our instruction finetuned [Steerling-8B model](https://www.guidelabs.ai/post/steerling-8b-base-model-release/). Other models are either black boxes or have interpretability bolted on post-hoc. These methods result in outputs that have untraceable errors and faulty reasoning that can’t be diagnosed. Steerling is the first model that has interpretability built in during training, and the Clarity platform allows you to directly interact with these new capabilities. In the remainder of the post, we will walk you through three key capabilities of Clarity:
* **Concept explanations**: the human-understandable concepts that Steerling uses to produce its output
* **Training data attribution**: the training data attributed to the output
* **Concept steering**: controlling the output of Steerling by amplifying or suppressing concepts, as opposed to changing the prompts
## Getting started
Clarity looks like other chatbots, with one big difference: the steering button. This button allows you to amplify or suppress concepts in the AI's response.

But for now, let's explore and ask about the fauna in Africa.

Looking at the response, we immediately see what sets Clarity apart: the Explanations panel.
## Trace output to concepts and training data source
Clarity provides two insights into how the AI generates its output: Concepts and Training Data Attribution. First, let's look at Concepts. These are the human-understandable features the model uses to reason.
With nothing selected, the Explanations panel shows the most common concepts in the chat. This output seems to make sense. We would expect the model to be thinking about Wildlife when responding to a question about living things in Africa!

The model generates text in chunks. You can click a chunk and see what concepts the model used to generate it.

Now let's take a look at a different feature of the platform: Training Data explanations. With this feature, you can see which chunks in the training set are most similar to the generated one.

## Steer any concept in the output without changing prompts
Now that we have seen how Clarity exposes the internal workings of the model, let’s use these to steer the model's output without relying on prompts. The current prompt got us a response about the incredible animals living in Africa. Fish are fauna, too, though, and they have been given short shrift. Let’s see if we can remedy that.
To do this, we are going to edit the prompt and click on that steering button.

This brings us to a search bar, where we will enter “marine”.

There are a few different options, but "Marine Sea Life" seems to be a good fit. Let’s click add. Amplify is selected by default, which is what we want, so we are all set.

We could click Send and continue in the chat window, but let’s go to the Compare Panel. This will let us see the differences with the initial prompt.

And *voila*! We now have all the information about fish we could hope for. If we select this output and return to the main screen, we can see this reflected in the Chat Explanations: *Lots of aquatic-related things!*

Amplification is a nice demonstration of how concepts work, but the same effect can often be achieved with a modified prompt. Suppressing behavior through prompts alone, on the other hand, is less reliable.
Suppression of concepts allows you to prevent certain outputs even when the prompts may be trying to produce those outputs. As such, suppressing concepts allows you to align your LLM product without resorting to training.
To see how this works, let’s ask the model to describe a computer scientist.

Well, that is unfortunate. It is very male-centric! If the model thinks computer scientists are men, it might make poor hiring decisions about women.
Let’s see if we can fix this by suppressing the concept of "Person-Role Nouns".

Excellent, the output is now gender neutral. We can be more confident in this chatbot’s ability to support the hiring process.

## Partnering and upcoming features
[Clarity](https://platform.guidelabs.ai/) is the first inherently interpretable AI platform and, as such, there is a lot more to explore than the examples we have shared above. You can see additional examples in the platform itself and we’ll be sharing demonstrations of Clarity [on our social media channels](https://x.com/guidelabsai) over the coming weeks.
We partner with edge companies that are interested in developing cutting-edge interpretable AI solutions for their particular domains. If you are interested, you can reach out to us [here](/contact/).
Keep an eye out for new features in the coming months, including input attribution, which will link the output to the most relevant parts of the input. This launch is just the first step for Clarity.
---
--- title: Interpretable Intelligence: AI you can Understand and Trust
--- date: Thu Mar 19 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/interpretable-intelligence/
## Interpretable Intelligence: AI you can Understand and Trust
Originally published December 2nd, 2024. Edited March 19th, 2026.
The march toward superintelligence is afoot, yet AI systems are becoming more capable and opaque in equal measure, precisely because of the way they are built.
The traditional response of reverse engineering already trained models has produced plausible-looking explanations, but failed to deliver
systems we can reliably steer and understand.
At Guide Labs, we are pioneering *Interpretable Intelligence*, a new paradigm
in which models are engineered, from the ground up, to be auditable, steerable,
and understandable. To demonstrate this, we built Steerling-8B, the first
large-scale inherently interpretable language model, unlocking concept discovery,
inference-time steering, and full training data traceability. This demonstrates that capability and understanding are not at odds.
## Fragile superintelligence
A transformer large language model (LLM) trained on Manhattan taxi rides can give turn-by-turn directions with near-perfect accuracy, yet its internal map of the city is incoherent: streets with impossible orientations, flyovers above other roads, a tangle that bears no resemblance to Manhattan.
The model performs correctly while understanding nothing.

Internal world models of various transformer models trained on turn-by-turn taxi directions in New York City. Image from [Vafa et al.](https://arxiv.org/abs/2406.03689) (a) True Manhattan map. (b) True map with noise perturbations, which still maintains clean spatial representations. (c) A transformer trained on turn-by-turn directions produces entangled, chaotic internal states, despite
generating correct predictions. Capability and internal coherence are not the same thing.
Current AI systems achieve feats only a select few humans can match, such as gold medal performances [at the International
Mathematical Olympiad](https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/), the [International Olympiad in Informatics](https://x.com/OpenAI/status/1954969035713687975), and [the International Collegiate Programming Contest](https://x.com/MostafaRohani/status/1968360976379703569).
These systems are also increasingly used in real world settings where we expect them to meet strict standards for reliability and accountability, such as screening job candidates, assisting clinicians, legal discovery, and helping synthesize potential drug candidates.
Unfortunately, today's AI systems fail while making users feel certain they are succeeding. These systems are
frighteningly sycophantic: a [major model update had to be rolled back](https://openai.com/index/sycophancy-in-gpt-4o/)
after the model began validating delusions. Frontier models show
[up to 100% compliance with medically illogical requests](https://pmc.ncbi.nlm.nih.gov/articles/PMC12534679/), providing false drug information; lawyers have submitted
[briefs containing fabricated judicial decisions](https://www.legaldive.com/news/chatgpt-fake-legal-cases-generative-ai-hallucinations/651557/)
generated by an AI assistant; and psychiatrists are now documenting a wave of
[AI-induced psychosis cases](https://techcrunch.com/2025/08/25/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit/).
## Opaque by Design
Capability does not guarantee reliability. Before [AlphaGo](https://deepmind.google/research/alphago/)’s historic match against Lee Sedol, the Google DeepMind team invited the European champion, Fan Hui, to probe the system’s strengths and weaknesses.
He uncovered a critical flaw, [stating](https://www.youtube.com/watch?v=WXuK6gekU1Y&t=1405s): *“I played with AlphaGo to understand where is the strong points of AlphaGo and where is maybe the weakness… And I find something, I find big weakness about AlphaGo. It’s a big one.”*
Project lead, David Silver, explained the underlying [difficulty](https://www.youtube.com/watch?v=WXuK6gekU1Y&t=1445): *“There will be these tricky lumps of knowledge it understands very poorly… It can be completely delusional.”*
Despite this weakness, AlphaGo produced [Move 37](https://deepmind.google/research/alphago/): one of the most creative moves in Go’s history.
AlphaGo is capable of both superhuman creativity and deep, invisible flaws, within the span of a single match.
Superhuman insight and invisible failure modes, within the same system.
These failures occur because modern AI systems remain fundamentally opaque and inscrutable.
Today’s models use dense, entangled internal states that neither researchers nor developers can reliably understand.
When a system produces an output, we lack reliable tools to see *which mechanism caused it*, *why it occurred*, or *how to correct it*.
## Post hoc interpretability is reading tea leaves
The most popular contemporary paradigm for interpretability is to take an already-trained model and try to reverse engineer what it has learned.
It relies on the belief that, perhaps, the model training process, on the right data, results in models whose internal representations are magically modular and cleanly organized in a way that would make them auditable.
As we will discuss, this belief, while understandable, is false. It amounts to trying to understand an organism that was never designed to be understood; to find structure in a system whose internal organization emerged purely from the pressure to predict the next token.
### **Feature Attributions**
Let's take a now classic interpretability tool: feature attributions. These are methods that indicate what part of the input a model's output is most
sensitive to. Used carefully, and with knowledge of how a model was trained,
they can provide genuine insight into model behavior. The problem arises when they are applied to models with no constraints on how
they organize their sensitivity to inputs. Without that, the tool has no way to
distinguish a meaningful explanation from a coincidental one.
Let's stress test this approach with a simple experiment for an image AI system. We will simply randomize the last layer
of that model, and then compare the output of these feature attribution methods on the partially randomized model to those on a normal model. In the image below, we show an example from a system that is trained to recognize objects.

Fifteen feature attribution methods applied to a model whose last layer is randomly initialized.
These popular methods produce human-plausible outputs even when the model's weights
are partially destroyed. An explanation that cannot distinguish a trained model from a broken
one is not explaining the model.
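To make the randomization test concrete, here is a minimal sketch in NumPy. It is not the experimental setup behind the figure: the two-layer network, its random stand-in weights, and the function names are all assumptions chosen for illustration. The idea is to compute an attribution map on the original model, re-initialize only the last layer, recompute the map, and measure how similar the two maps are. An attribution method that ignores the model entirely scores a perfect similarity, which is exactly the failure the test is meant to expose.

```python
import numpy as np

def input_gradient(params, x, target):
    """Gradient of one output logit w.r.t. the input of a 2-layer ReLU MLP."""
    W1, W2 = params
    relu_mask = (W1 @ x > 0).astype(x.dtype)      # derivative of ReLU
    return W1.T @ (relu_mask * W2[target])

def gradient_attribution(params, x, target):
    # Simple saliency: magnitude of the input gradient.
    return np.abs(input_gradient(params, x, target))

def input_only_attribution(params, x, target):
    # Deliberately broken "explanation" that never looks at the model.
    return np.abs(x)

def rank_correlation(a, b):
    # Spearman-style correlation (assumes no ties, true for continuous data).
    ranks_a = np.argsort(np.argsort(a))
    ranks_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def randomization_check(attribution_fn, params, x, target, rng):
    """Similarity of attribution maps before and after randomizing the last layer."""
    W1, W2 = params
    randomized = (W1, rng.normal(size=W2.shape))  # re-initialize last layer only
    before = attribution_fn(params, x, target)
    after = attribution_fn(randomized, x, target)
    return rank_correlation(before, after)

rng = np.random.default_rng(0)
# Random weights stand in for a trained model in this toy example.
params = (rng.normal(size=(32, 20)), rng.normal(size=(5, 32)))
x = rng.normal(size=20)

sim_gradient = randomization_check(gradient_attribution, params, x, 0, rng)
sim_input_only = randomization_check(input_only_attribution, params, x, 0, rng)
print(f"gradient: {sim_gradient:.2f}  input-only: {sim_input_only:.2f}")
```

A similarity near 1 after randomization means the explanation did not depend on the weights that were destroyed. A faithful method should instead change noticeably when the model changes.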
### **Sparse Autoencoders (SAEs)**
More recently, sparse autoencoders (SAEs) have become the dominant technique in mechanistic
interpretability, decomposing activations into human-readable features. They have been shown to surface [striking features](https://transformer-circuits.pub/2023/monosemantic-features)
inside large models: concepts like the Golden Gate Bridge, emotional states, and
even representations linked to safety-relevant behaviors. But the same problem that haunted feature attributions returns in a new form.

Top (a): Comparison of Random vs. Trained SAE Features on CLIP ViT-B/32 (Layer 3). Bottom: Sample activation contexts for latents from an SAE trained with a Soft-Frozen
Decoder. Image taken from [Korznikov and Galichin et al.](https://arxiv.org/pdf/2602.14111)
[Random SAE baselines match fully-trained SAEs](https://arxiv.org/abs/2602.14111)
on interpretability, sparse probing, and causal editing. SAEs also produce
[different feature sets when trained with different random seeds](https://arxiv.org/abs/2501.16615).
Even though the explanations look convincing, they are often not grounded in what the model learned.
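For readers unfamiliar with SAEs, the following sketch shows the basic computation and one simple way to check seed stability. Everything here is an assumption for illustration: the layer shapes, the cosine threshold of 0.7, and the `shared_feature_fraction` helper are not taken from the cited papers, and the dictionaries are random rather than trained.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Standard SAE pass: activations -> non-negative features -> reconstruction."""
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def shared_feature_fraction(W_dec_a, W_dec_b, threshold=0.7):
    """Fraction of features in dictionary A with a close match in dictionary B.

    Each row of a decoder matrix is one feature direction. Two features are
    treated as "the same" if their cosine similarity exceeds the threshold.
    """
    a = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    b = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    best_match = (a @ b.T).max(axis=1)
    return float((best_match >= threshold).mean())

rng = np.random.default_rng(0)
d_model, n_features = 64, 256
x = rng.normal(size=(8, d_model))                 # a batch of activations

W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec_seed0 = rng.normal(size=(n_features, d_model))
W_dec_seed1 = rng.normal(size=(n_features, d_model))
zeros_f, zeros_d = np.zeros(n_features), np.zeros(d_model)

features, reconstruction = sae_forward(x, W_enc, zeros_f, W_dec_seed0, zeros_d)
same = shared_feature_fraction(W_dec_seed0, W_dec_seed0)
across = shared_feature_fraction(W_dec_seed0, W_dec_seed1)
print(f"self-overlap: {same:.2f}  cross-seed overlap: {across:.2f}")
```

If two SAEs trained on the same activations with different seeds share only a small fraction of features under a check like this, it is hard to argue that either dictionary is "the" set of concepts the model uses.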
### **The Model is the Problem**
The problem is not that current interpretability tools are poorly designed, or that
reverse engineering an AI system is the wrong target. It is that they are being
applied to an underspecified substrate. Post-hoc interpretability methods work by
making assumptions about how a model organizes its knowledge, but modern models
are trained with no constraints on that organization whatsoever. The model is free
to represent the same concept in several different ways, and the interpretability
tool has no way to know which one is meaningful. For any post-hoc tool to be
reliable, one needs to understand, and ideally intervene on, the model's training
process. This makes it possible to shape the model to respect the kind of structure
the tool is looking for. Without that, one is not reading the model. One is reading
tea leaves.
This limitation runs deeper than any single method. [Chain of thought explanations](https://aclanthology.org/2024.findings-acl.19/),
[probing classifiers](https://arxiv.org/abs/2512.18792), and [several other interpretability approaches](https://arxiv.org/abs/2502.20914) all face the same
wall. They generate explanations that the model is under no obligation to be faithful
to. Our team has spent years demonstrating these deficiencies, [publishing work](https://arxiv.org/abs/2212.04629) that
exposed the limits of these approaches one by one, and ultimately [building](https://arxiv.org/abs/2411.06090) [something](https://openreview.net/forum?id=L9U5MJJleF)
[different](https://arxiv.org/abs/2310.07819).
## Interpretable Intelligence
At Guide Labs, we are pioneering **Interpretable Intelligence**, a new paradigm in which models are engineered, from the ground up, to be transparent, controllable, and understandable.
These models have human-interpretable concepts built into their computational structure, and therefore are inherently interpretable.
We do not treat reliability and interpretability as afterthoughts; we design the AI system to inherently satisfy these requirements.
Consequently, we shift the question from “Can we reverse-engineer what this model knows?” to simply: “What did this model learn?”
Until recently, it was widely assumed that building large-scale interpretable models
was impossible without [sacrificing performance](https://arxiv.org/abs/2503.07914v1).
Over the past year, we have shown this assumption to be false.

Unlike post-hoc approaches, Steerling-8B surfaces concepts directly
from its architecture: every output token is attributed at inference time.
We built Steerling-8B: the first large-scale interpretable large language model (LLM) that can trace any token it generates to its input context, to concepts a human can understand, and to its training data.
Trained on 1.35 trillion tokens, it achieves downstream performance within range of models trained on 1.5–7 times more data, while remaining fully transparent by design.
Steerling-8B unlocks several capabilities, including suppressing or amplifying specific concepts at inference time without retraining, training data provenance for any generated chunk, and inference-time alignment via concept control, which replaces thousands of safety training examples with explicit concept-level steering.
## Building Interpretable Intelligence
Interpretable Intelligence is not a single technique. It is a stack built from
the ground up so that every layer supports the next. We started with data, built
a model whose representations are transparent by design, and then demonstrated
what that transparency makes possible.
### **Data: Atlas**
We built [Atlas](https://www.guidelabs.ai/post/atlas-concept-annotated-pretraining-release/),
an automated system that annotates trillion-token datasets with human-interpretable
concepts. Using this pipeline, we released FineWeb Atlas, a 10 billion token
concept-annotated pretraining corpus with 16,790 human-understandable concepts
assigned at the sub-document level. FineWeb Atlas makes concept-level data curation
straightforward for the first time.
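As a rough picture of what concept-level curation could look like, here is a toy sketch. The record layout, the field names, and the "Spam" and "Pharmaceuticals" labels are hypothetical and are not the actual FineWeb Atlas schema. The only assumption carried over from the description above is that each sub-document chunk has a list of concept labels attached.

```python
from collections import Counter

# Hypothetical records: one entry per chunk, each with its concept labels.
corpus = [
    {"doc_id": "a", "chunk": 0, "text": "Lions hunt at dusk.",
     "concepts": ["Wildlife"]},
    {"doc_id": "a", "chunk": 1, "text": "Buy cheap pills now!",
     "concepts": ["Spam", "Pharmaceuticals"]},
    {"doc_id": "b", "chunk": 0, "text": "Coral reefs host many fish.",
     "concepts": ["Wildlife", "Marine Sea Life"]},
]

def curate(records, keep_any=(), drop_any=()):
    """Keep chunks that have a wanted concept and none of the unwanted ones."""
    keep_any, drop_any = set(keep_any), set(drop_any)
    selected = []
    for record in records:
        labels = set(record["concepts"])
        if labels & drop_any:
            continue                      # contains an unwanted concept
        if keep_any and not (labels & keep_any):
            continue                      # contains none of the wanted concepts
        selected.append(record)
    return selected

wildlife = curate(corpus, keep_any=["Wildlife"], drop_any=["Spam"])
concept_counts = Counter(label for r in corpus for label in r["concepts"])
print([r["doc_id"] for r in wildlife], concept_counts.most_common(1))
```

With annotations at the chunk level, filtering and auditing a corpus become set operations over labels rather than keyword heuristics over raw text.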
### **Model: Steerling-8B**
We trained [Steerling-8B](https://huggingface.co/guidelabs/steerling-8b), the
first 8-billion-parameter inherently interpretable language model, on 1.35 trillion
tokens. Rather than entangling knowledge in inscrutable weight matrices, Steerling
organizes its knowledge into representations that humans can directly read and edit. It enables
decomposing every prediction into per-concept contributions across approximately
33,000 supervised concepts, 100,000 discovered concepts, and a residual component.
The model achieves downstream performance comparable to models trained on 1.5 to 7 times
more data.
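The decomposition can be illustrated with a toy example. This is a sketch under a strong assumption, namely that token logits are a linear function of concept activations plus a residual term; the actual Steerling-8B architecture may differ, and all sizes and variable names below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes; the post cites roughly 33,000 supervised and 100,000 discovered concepts.
n_supervised, n_discovered, vocab_size = 6, 4, 10

acts_supervised = rng.random(n_supervised)        # concept activations for one position
acts_discovered = rng.random(n_discovered)
W_supervised = rng.normal(size=(n_supervised, vocab_size))   # concept -> token weights
W_discovered = rng.normal(size=(n_discovered, vocab_size))
residual = 0.1 * rng.normal(size=vocab_size)      # whatever the concepts do not explain

logits = acts_supervised @ W_supervised + acts_discovered @ W_discovered + residual
token = int(np.argmax(logits))

# Per-concept contribution to the chosen token's logit.
contrib_supervised = acts_supervised * W_supervised[:, token]
contrib_discovered = acts_discovered * W_discovered[:, token]
total = contrib_supervised.sum() + contrib_discovered.sum() + residual[token]

top_concept = int(np.argmax(contrib_supervised))
print(f"token {token}: logit {logits[token]:.3f}, sum of parts {total:.3f}, "
      f"top supervised concept {top_concept}")
```

Under this assumption the contributions add up exactly to the logit, so every prediction comes with a complete account of which concepts pushed it up or down and how much was left to the residual.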
Model weights, code, and a PyPI package are all publicly available:
- 🤗 [Steerling-8B on HuggingFace](https://huggingface.co/guidelabs/steerling-8b)
- 💻 [GitHub](https://github.com/GuideLabs/steerling)
- 📦 [PyPI](https://pypi.org/project/steerling/)
### **What Interpretability Unlocks**
Because Steerling's representations are organized around human-understandable
concepts by construction, capabilities that are difficult or impossible with
black-box models become straightforward.
**[Concept Steering](https://www.guidelabs.ai/post/steerling-steering-8b/).** You can inject, suppress, or compose concepts at inference
time to directly control what the model generates. Take a single neutral prompt and steer it toward tenant-landlord law,
coffee, data visualization, or engine mechanics, with no changes to the prompt
itself.
**[Concept Discovery](https://www.guidelabs.ai/post/concept-discovery-in-steerling-8b/).** Because Steerling's representations are trained to be disentangled, we can directly read off what the model has learned, including
concepts it was never explicitly trained to acquire. Among the ~100K discovered
concepts: British English spelling as a distinct representation, "you" unified
across six languages with no multilingual training signal, and a dedicated
representation for broken Unicode. In standard models, recovering this kind of
knowledge requires post-hoc methods that face irreducible ambiguity.
**[Alignment without Finetuning](https://www.guidelabs.ai/post/steerling-8b-alignment-without-retraining/).** For any output Steerling-8B generates, we can trace which specific training documents drove it, from forum posts behind harmful content to academic papers behind specialized knowledge.
When behavior is wrong, we can suppress the responsible concepts at inference time rather than retraining from scratch.
This reduces harmful outputs from 80% to 29%, exceeding the effect of finetuning on 10,000 labeled examples and replacing slow, opaque correction loops with explicit, auditable, concept-level controls.
A model that gives perfect directions while holding an incoherent map of the city is not a foundation you can build on.
Steerling is the first large-scale language model that performs and understands, and that changes what AI can be trusted to do.
---
--- title: Alignment Without Retraining: Auditing and Controlling Steerling-8B
--- date: Thu Mar 19 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/steerling-8b-alignment-without-retraining/
## Alignment Without Retraining: Auditing and Controlling Steerling-8B
When an AI system generates a harmful output, diagnosing the cause is largely guesswork. Fixing it typically requires finetuning or retraining the model, which can be slow and expensive.
In this post, we show how Steerling-8B's interpretable concept architecture enables a new two-stage paradigm for AI alignment.
In Stage 1 (audit), we identify the human-understandable concepts that drove the output, and trace the generation back to specific training documents that shaped it.
In Stage 2 (fix), we directly suppress those concepts, at inference time, to fix harmful generations.
For example, a prompt that leads the model to generate instructions for explosive devices is steered instead to: "I cannot help with that."
Most strikingly, this approach reduces the rate of harmful outputs from 80% to 29% on a base model without any finetuning, a result that takes thousands of labeled examples to match via traditional finetuning.

The traditional workflow for correcting model behavior requires observing a problem, collecting labeled examples, retraining, and repeating until the behavior improves. Steerling-8B replaces this loop with a direct intervention: identify the concept responsible, suppress it at inference time, done.
## Stage 1: Auditing model behavior
A model generates an unwanted response to an otherwise harmless prompt.
It recites a stranger’s private phone number.
In each case, the same question arises: can we trace the model's response to a clear part of the AI system that produced it?
Is the problem in the prompt: something in the input that triggered the behavior? Is it in the training data: a document the model memorized and is now reproducing?
Or is it in the model’s internal representations: a concept that was learned and is now being expressed?
Modern large language models (LLMs) rarely admit clean answers to these questions.
The pipeline between the model's input context, training data, representations, and the output is opaque.
Here, we look at two ways Steerling-8B gives us to find out: by identifying the internal concepts that drove the generation, and by tracing it back to the specific training documents that shaped it.
For any group of tokens that Steerling-8B generates, we can trace the generation to the semantically closest training documents.
We do this at scale, retrieving from 1.35 trillion tokens (11 billion text chunks) that the model was trained on.
This post provides an initial look at these capabilities.
The underlying architecture, attribution methodology, and full evaluation will be covered in more depth in future work.
The most concerning generations reveal their sources with striking clarity.
Every output, whether instructions for weapons or self-harm, traces to forums, guides, and content in the training data.
Below, we trace harmful generations, covering weapons, toxic agents, and self-harm, directly to their training sources.
Each retrieved document is accompanied by a similarity score between 0 and 1, where higher values indicate greater semantic closeness to the generation.
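A simple way to picture this kind of retrieval is nearest-neighbor search over embeddings. The sketch below is an illustration, not the attribution method used for Steerling-8B, which is left to future work. In particular, the embedding vectors are random placeholders, and rescaling cosine similarity to the 0 to 1 range with `(cosine + 1) / 2` is an assumption about how such a score could be produced.

```python
import numpy as np

def top_k_training_chunks(generation_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar training chunks."""
    q = generation_vec / np.linalg.norm(generation_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    cosine = np.clip(c @ q, -1.0, 1.0)
    scores = (cosine + 1.0) / 2.0          # assumed rescaling to [0, 1]
    order = np.argsort(-scores)[:k]        # highest score first
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(100, 16))    # placeholder embeddings of training chunks
# A generated chunk that closely paraphrases training chunk 2.
generation_vec = chunk_vecs[2] + 0.01 * rng.normal(size=16)

results = top_k_training_chunks(generation_vec, chunk_vecs, k=3)
for index, score in results:
    print(f"chunk {index}: similarity {score:.3f}")
```

Brute-force comparison like this only works at toy scale. Searching billions of chunks would require an approximate nearest-neighbor index, which this sketch does not attempt.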
Tracing generations to training sources isn't limited to harmful content.
The same approach can reveal how model generations are derived from training data for factual knowledge, academic concepts, reasoning, and creative writing.
For example, presidential knowledge doesn't come from isolated biographical entries; it emerges from overlapping political coverage, historical analysis, and comparative studies.
Academic research knowledge traces to specific papers and journals, revealing how specialized knowledge propagates from authoritative sources.
[Consistent with prior work](https://arxiv.org/abs/2306.11644), abstract reasoning and scientific concepts trace to textbooks, problem sets, and educational materials.
Creative style is no exception. A generation combining mathematical storytelling traces to children's literature and narrative writing, revealing that output style itself has traceable origins.
### What does auditing and tracing enable?
Training data attribution opens up several concrete capabilities:
- **Safety auditing:** Harmful outputs don't appear from nowhere. Tracing them to specific forums, guides, or documents in the training corpus makes it possible to identify the source.
- **Copyright and provenance:** When a model reproduces content closely, attribution reveals which training documents it drew from, making IP exposure visible and auditable.
- **Factual grounding:** Knowing that a factual claim traces to a low-quality or outdated source is the first step toward correcting it.
- **Understanding model knowledge:** Academic, creative, and reasoning outputs each trace to distinct document types. Attribution makes the structure of a model's knowledge legible.
These attribution results reveal that model behavior, which we might treat as mysterious emergent properties, can actually be traced to specific training influences.
This shift from mystery to mechanism gives us a concrete starting point for understanding and correcting model behavior.
## Stage 2: Controlling model behavior
Having audited the model’s behavior, we can now fix it by directly controlling the concepts that drive generation.
In Steerling-8B, every generation is mediated through explicit concept representations that can be directly intervened on at inference time.
This provides two avenues for model control.
First, we can identify which concepts are active during a harmful generation and suppress them directly.
Second, we can steer any input toward a desired output by copying the concept activation pattern of a target response.
For example, we can extract the concept profile of the response "I cannot help with that" and override any harmful prompt's activations to produce that response instead.
Below, we demonstrate the second approach on selected prompts, showing how response steering moves the model away from harmful outputs.
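As a toy illustration of the two avenues described above, the sketch below treats a generation's concept activations as a plain dictionary. The concept names, values, and the helpers `suppress` and `steer_toward` are hypothetical and are not the Steerling-8B API; the sketch only shows the shape of the two interventions.

```python
# Toy illustration only: concept names, values, and helpers are hypothetical,
# not the Steerling-8B API.

def suppress(activations, concepts_to_suppress):
    """Return a copy of the activations with the named concepts zeroed out."""
    return {
        name: (0.0 if name in concepts_to_suppress else value)
        for name, value in activations.items()
    }


def steer_toward(activations, target_profile, strength=1.0):
    """Blend activations toward a target profile; strength=1.0 copies it outright."""
    names = set(activations) | set(target_profile)
    return {
        name: (1.0 - strength) * activations.get(name, 0.0)
        + strength * target_profile.get(name, 0.0)
        for name in names
    }


# Made-up concept activations for a harmful prompt and for a refusal response.
harmful_prompt = {"explosives": 0.9, "step_by_step_instructions": 0.7, "chemistry": 0.4}
refusal_profile = {"refusal": 1.0, "polite_tone": 0.6}

suppressed = suppress(harmful_prompt, {"explosives"})
steered = steer_toward(harmful_prompt, refusal_profile)
```

With `strength=1.0`, the steered activations equal the refusal profile on every concept, which mirrors the idea of overriding a prompt's activations with those of a target response.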
### Alignment without Finetuning
As an initial evaluation, we sample prompts from the WildGuard dataset ([details here](https://arxiv.org/abs/2406.18495)) covering harmful categories including weapons and explosives, chemical agents, self-harm, and manipulation.
We compare the effect of selective concept suppression and response steering to finetuning.
The figure below compares attack success rates, the fraction of prompts that elicit harmful responses, across three conditions:
the base model with no intervention, concept control (suppression + steering), and a LLaMA-8B base model finetuned on 10,000 corrective examples.
Concept control reduces attack success rates from 80% to 29%, exceeding the effect of finetuning on 10,000 labeled examples.
Once you identify problematic concepts or a desired response pattern, intervention is immediate and requires no finetuning.

Figure: Attack success rates across three conditions: the base model, selective concept suppression & response steering, and a LLaMA-8B model finetuned on 10,000 corrective examples.
Both concept steering approaches reduce attack success rates from 80% to 29%.
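For readers who want the metric spelled out, here is a minimal sketch of attack success rate as defined above: the fraction of prompts that elicit a harmful response. The judgments in `toy_judgments` are made up for illustration; only the 80% and 29% figures come from the evaluation.

```python
def attack_success_rate(judgments):
    """Fraction of prompts whose response was judged harmful (True)."""
    if not judgments:
        raise ValueError("need at least one judged prompt")
    return sum(judgments) / len(judgments)


# Hypothetical judgments for five prompts, not data from the evaluation.
toy_judgments = [True, True, False, True, True]
toy_asr = attack_success_rate(toy_judgments)  # 4 of 5 prompts judged harmful

# The reported rates, restated as absolute and relative reductions.
baseline, controlled = 0.80, 0.29
absolute_drop = baseline - controlled
relative_drop = absolute_drop / baseline

print(f"toy ASR: {toy_asr:.0%}")
print(f"absolute drop: {absolute_drop * 100:.0f} points")
print(f"relative drop: {relative_drop:.1%}")
```

Read this way, going from 80% to 29% is a 51-point absolute drop, or roughly a 64% relative reduction in successful attacks.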
## Conclusion
Steerling-8B enables a new approach to model alignment: because every generation flows through interpretable concept representations that can be traced to training data, we can intervene directly at inference time rather than retraining from the outside.
Concept control, combining suppression and response steering, reduces attack success rates from 80% to 29%, exceeding the effect of finetuning on 10,000 labeled examples.
These results connect to a [rich](https://arxiv.org/abs/2310.01405) [body](https://rome.baulab.info/) [of](https://arxiv.org/abs/2007.15646) [work](https://arxiv.org/abs/2304.00740) [on](https://arxiv.org/abs/2308.10248) [model editing](https://arxiv.org/abs/2210.07229) and [neural intervention](https://arxiv.org/abs/2306.03341).
Unlike standard architectures, Steerling-8B's concept representations make this kind of control a native capability rather than a post-hoc intervention.
This approach has limitations, and we have only begun to stress-test its reliability. These results represent an early point in a broader direction.
Effectiveness depends on concept quality, and we've focused primarily on safety applications rather than broader capabilities.
But the direction is clear: models that expose their internal reasoning can be debugged, audited, and corrected without starting over.
As AI systems become more capable and widely deployed, this kind of transparency becomes essential.
To explore Steerling-8B yourself:
- 🤗 [Steerling-8B on HuggingFace](https://huggingface.co/guidelabs/steerling-8b)
- 💻 [Source code on GitHub](https://github.com/guidelabs/steerling)
- 📦 [Python package on PyPI](https://pypi.org/project/steerling/)
---
--- title: The FineWeb Concept Atlas
--- date: Thu Mar 05 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/the-fineweb-concept-atlas/
## The FineWeb Concept Atlas
We are releasing [FineWeb Atlas](https://huggingface.co/datasets/guidelabs/fineweb-atlas), a concept-annotated version of the [FineWeb-Edu dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu): a 10.18B-token corpus with sub-document-level, human-understandable topic annotations.
Each document is broken into chunks, and each chunk is annotated with 4 types of concepts that capture the chunk's primary content, tone, key entities, and purpose.
Built using an improved version of our [ATLAS](https://www.guidelabs.ai/post/atlas-concept-annotated-pretraining-release/) pipeline, this dataset should enable new directions in LLM training, steering, and auditing.
FineWeb Atlas annotates 14,868,862 documents (95,486,049 chunks, 10,183,028,973 tokens) with 16,790 human-understandable concepts.
The full release is available on HuggingFace and consists of four core artifacts:
- `chunks`: Chunk-level text with concept annotations. Across all 95,486,049 chunks, the average label count is 14.73 (spanning content, tone, document, and entity concepts).
- `concepts`: The concept inventory: 16,790 concepts, each with a human-readable name, description, and type metadata.
- `field_guide`: A reverse index mapping each concept to matching documents/chunks, enabling concept-first retrieval.
- `co-occurrence`: A concept co-occurrence matrix capturing which concepts appear together and how often.
Together, these artifacts let anyone explore pretraining data at the concept level, querying, filtering, and analyzing a 10.18B-token corpus with human-understandable concepts.
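To make the idea of concept-first retrieval concrete, here is a minimal sketch of a reverse index like the one `field_guide` provides, built from a handful of made-up chunk records. The field names `chunk_id` and `concepts` and the helper functions are assumptions for illustration; they are not taken from the released schema.

```python
from collections import defaultdict

# Hypothetical chunk records; the field names are illustrative and are not
# taken from the released schema.
chunks = [
    {"chunk_id": "doc1#0", "concepts": ["GPS caching", "Informational", "tutorial"]},
    {"chunk_id": "doc1#1", "concepts": ["GPS caching", "matter-of-fact"]},
    {"chunk_id": "doc2#0", "concepts": ["Christianity", "Religion", "Informational"]},
]


def build_reverse_index(records):
    """Map each concept to the sorted list of chunk ids that carry it."""
    index = defaultdict(list)
    for record in records:
        for concept in set(record["concepts"]):  # ignore duplicate labels
            index[concept].append(record["chunk_id"])
    return {concept: sorted(ids) for concept, ids in index.items()}


def chunks_with_all(index, *concepts):
    """Concept-first retrieval: chunk ids annotated with every given concept."""
    id_sets = [set(index.get(concept, [])) for concept in concepts]
    return sorted(set.intersection(*id_sets)) if id_sets else []


field_guide = build_reverse_index(chunks)
print(field_guide["GPS caching"])
print(chunks_with_all(field_guide, "Informational", "Religion"))
```

Starting from the concept rather than the document is what makes filtering cheap: intersecting two posting lists answers "which chunks are both informational and about religion" without scanning the corpus.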
Below are five examples from the corpus, showing how documents break into chunks and the concept annotations our pipeline assigns to each. These examples span technical how-to content, reviews, fan/community writing, and pop-culture discussion.
Figure: Five examples from different corners of webtext. Each panel shows raw text with content, document, tone, and entity labels, illustrating how the annotation stack captures both topic and style across varied writing formats.
## The FineWeb Concept Atlas
The concept library spans topics from NASCAR to Hepatitis C, named entities from Fort Wayne, Indiana to Stockholm, and specialized subjects from blood transfusion protocols to GPS caching.
The library includes 16,790 concepts reflecting the full breadth of what appears in web-scale text corpora.
Figure: 2D UMAP projection of roughly 3,000 sampled concepts. Nearby points are semantically related concepts, and colors show taxonomy groups, making it easier to see where domains form tight clusters versus broad overlap.
At the annotation stage we seek to annotate text along 4 dimensions:
- **Content**: concepts that capture the primary topic of a chunk of text, e.g., Christianity, orbital mechanics, interior design, Python debugging. 12,786 concepts (76.15% of the library) are content concepts.
- **Entity**: concepts that identify proper nouns, i.e., specific named things: people, places, organizations, texts. Examples include cities like Fort Wayne, countries like Zambia, and religious texts like the Bible. This is a new addition compared to the original ATLAS release, reflecting the importance of named entities in web text.
- **Tone**: concepts that describe the style of a chunk, e.g., persuasive, funny, tongue-in-cheek, conversational. There are 587 tone concepts (3.50% of the library). That is far fewer than the content concepts, but tone accounts for 46.82% of all concept assignments, because concepts like "matter-of-fact" and "Informational" apply to the vast majority of chunks.
- **Document**: concepts that capture the purpose or format of the text: news report, tutorial, short announcement. 31 concepts (0.18% of the library) are document concepts.
Figure: Concept frequency vs mean LLM-judge score. Horizontal position reflects concept prevalence, vertical position reflects label quality, and color highlights where quality failures concentrate across the frequency spectrum (yellow: <= 2, blue: > 2). This is a floor estimate from the shared sampled set (>=10 LLM ratings per concept): we also apply additional quality-improving curation steps (for example, dropping clearly worst concepts), but we do not yet have a robust measurement of how much those steps improve these metrics.
On average, each chunk of text receives 14.73 concept labels: 5.68 content, 6.90 tone, 1.52 document, and 0.63 entity.
The density is intentional; a single chunk about GPS caching in a tutorial might carry content labels for the technology, tone labels for its instructional style, a document label marking it as a tutorial, and an entity label for a specific platform or standard.
The concept frequency distribution follows a familiar power law pattern: a small number of high-coverage concepts dominate, while a long tail of specific concepts each appear in a small fraction of chunks.
For example, "matter-of-fact" covers 83.58% of chunks and "Informational" covers 79.82%. At the tail, thousands of concepts capture niche subjects that matter for specific domains.
The co-occurrence matrix reveals which concepts tend to appear together.
Some pairings are unsurprising but confirm expected correlations.
The strongest overall pairing, "Informational" and "matter-of-fact", co-occurs 72 million times covering 75.53% of all chunks.
This reflects the dominant educational tone of FineWeb-Edu's 10.18B-token corpus.
In another case, "Christianity" and "Religion" co-occur in 2,197,944 chunks.
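The counting behind a co-occurrence matrix can be sketched in a few lines. The chunk annotations below are made up, and `cooccurrence_counts` is a hypothetical helper rather than the pipeline's code; the last line only checks the reported coverage against the reported chunk total.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-chunk concept sets, for illustration only.
chunk_concepts = [
    {"Informational", "matter-of-fact", "Christianity", "Religion"},
    {"Informational", "matter-of-fact"},
    {"Informational", "GPS caching"},
    {"matter-of-fact", "Christianity", "Religion"},
]


def cooccurrence_counts(concept_sets):
    """Count, for each unordered concept pair, the chunks that contain both."""
    counts = Counter()
    for concepts in concept_sets:
        # Sorting gives every pair one canonical key order.
        for pair in combinations(sorted(concepts), 2):
            counts[pair] += 1
    return counts


counts = cooccurrence_counts(chunk_concepts)
pair = ("Informational", "matter-of-fact")
coverage = counts[pair] / len(chunk_concepts)

# Sanity check on the reported figures: 75.53% of 95,486,049 chunks.
reported_pair_chunks = 0.7553 * 95_486_049
print(f"toy coverage: {coverage:.0%}; reported pair count: {reported_pair_chunks / 1e6:.1f}M")
```

The sanity check lands at roughly 72 million chunks, consistent with the count quoted for the "Informational" and "matter-of-fact" pairing.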
### How FineWeb Atlas Was Produced
FineWeb Atlas was built using an improved version of the ATLAS pipeline, which produces concept annotations in three stages:
- **Stage 1: Documents → Tags**. Documents are split into chunks and each chunk is annotated with structured tags by a language model. For FineWeb Atlas, this produced 95,486,049 chunks from 14,868,862 documents.
- **Stage 2: Tags → Concepts**. The raw tags are embedded, clustered, and deduplicated into a canonical concept library. Each concept receives a human-readable name and description. For FineWeb Atlas, we restructured the concept categories, adding entity and document types, truncated rare tags more aggressively, and used alternate deduplication strategies, arriving at 16,790 concepts compared to the original 33,000.
- **Stage 3: Concept Annotator**. A trained model predicts concepts for arbitrary text, enabling annotation at corpus scale. For FineWeb Atlas, we evaluate this model's predictions against the LLM labels in the Annotation Label Quality section below.
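To give a feel for Stage 2, here is a deliberately simplified sketch of collapsing near-duplicate tags into canonical concepts. It makes two loud substitutions: character-trigram counts stand in for a learned embedding model, and greedy threshold clustering stands in for whatever clustering and deduplication the ATLAS pipeline actually uses, which this post does not specify. The tags, threshold, and function names are all hypothetical.

```python
import math
from collections import Counter


def trigram_vector(tag):
    """Stand-in embedding: character-trigram counts of the lowercased tag."""
    text = f"  {tag.lower()} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_tags(tags, threshold=0.6):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # each entry is (representative_vector, member_tags)
    for tag in tags:
        vector = trigram_vector(tag)
        for representative, members in clusters:
            if cosine(vector, representative) >= threshold:
                members.append(tag)
                break
        else:
            clusters.append((vector, [tag]))
    return [members for _, members in clusters]


# Made-up raw tags with casing variants and one misspelling.
raw_tags = ["Python debugging", "python debugging", "Python debuging",
            "orbital mechanics", "Orbital Mechanics"]
concepts = cluster_tags(raw_tags)
print(concepts)
```

Each resulting cluster would then be given one human-readable name and description, which is how many raw tags collapse into a much smaller concept library.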
## Annotation Label Quality
We evaluate annotation quality using the same framework from the original ATLAS release: an LLM judge scores each chunk-concept assignment on a 1–5 scale, where a score of 2 or higher counts as a successful annotation.
We report quality for both the ground-truth labels and the trained annotator's predictions.
Ground truth vs model quality distributions (yellow bins are scores `<= 2`, blue bins are `> 2`).
This single 4-panel figure shows:
- ground truth averaged over chunks
- ground truth averaged over concepts
- model averaged over chunks
- model averaged over concepts
Figure: Score-distribution comparison between ground truth and the final KNN annotation model, each aggregated two ways (by chunks and by concepts). Panels are normalized to proportions with matched y-axes so distribution-shape differences are directly comparable.
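The two aggregations in the figure, by chunk and by concept, can be sketched as follows. The ratings are made up, the helper names are hypothetical, and the default threshold follows the figure's binning (`<= 2` versus `> 2`).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge ratings: (chunk_id, concept, score on the 1-5 scale).
ratings = [
    ("c1", "Christianity", 5),
    ("c1", "Informational", 4),
    ("c2", "Christianity", 2),
    ("c2", "GPS caching", 1),
    ("c3", "Informational", 3),
]


def mean_scores(rows, key_index):
    """Average judge scores grouped by chunk (key_index=0) or concept (key_index=1)."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key_index]].append(row[2])
    return {key: mean(scores) for key, scores in grouped.items()}


def share_above(means, threshold=2):
    """Fraction of groups whose mean score is strictly above the threshold."""
    return sum(score > threshold for score in means.values()) / len(means)


by_chunk = mean_scores(ratings, 0)
by_concept = mean_scores(ratings, 1)
print(f"chunks above threshold: {share_above(by_chunk):.0%}")
print(f"concepts above threshold: {share_above(by_concept):.0%}")
```

The two views answer different questions: averaging by chunk shows how well a typical piece of text is labeled, while averaging by concept shows which concepts are applied unreliably.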
## Research We Are Excited to See
We built FineWeb Atlas because we needed concept-level annotations for interpretable model development.
But the release is designed to be useful well beyond our own work.
Here are some directions we're excited to see the community explore:
- **Concept taxonomies**: The concept library is flat by design, but it doesn't have to stay that way. The concept metadata, co-occurrence structure, and Library of Congress taxonomy mappings provide natural starting points for organizing concepts into hierarchies, discovering implicit parent-child relationships, or building domain-specific subtrees for targeted analysis.
- **Interpretability research**: Concept annotations create a bridge between model internals and human-understandable meaning. If you can label what concepts a model was trained on, you can ask sharper questions: which concepts does a model represent well and which does it collapse? How do concept mixtures in training data relate to the features that form in a model's representations? FineWeb Atlas provides the labeled substrate that makes these questions empirically testable.
- **Training on concept mixtures**: Rather than controlling your pretraining data mix at the source domain level (more web text, less code), concept annotations let you control it at the semantic level. You might want a corpus with more physics and less celebrity gossip, or more formal reasoning and less conversational filler. The concept labels make this kind of fine-grained curation straightforward.
The full dataset is available on HuggingFace. If you're interested in our broader work on interpretable model development, read our other blog posts.
---
--- title: Discovering human-understandable concepts in Steerling-8B
--- date: Fri Feb 27 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/concept-discovery-in-steerling-8b/
## Discovering human-understandable concepts in Steerling-8B
We examine the universe of concepts that the recently released [Steerling-8B](https://huggingface.co/guidelabs/steerling-8b) model learned in its representations.
Here we show that we can easily discover thousands of novel concepts from the model, concepts it was never explicitly trained to learn.
The model learned to distinguish British English spelling, unified "you" across six languages, separated spelled-out numbers from digits,
learned to recognize typographic errors, and even developed a dedicated concept for broken Unicode.
## Concept discovery in standard models is challenging
Extracting knowledge from the representations of an AI system remains a longstanding difficulty.
As these systems approach superhuman capabilities, understanding what drives their performance is urgent.
What latent knowledge do LLMs possess that enables them to resolve open problems in mathematics?
Where in AlphaGo's representations did it acquire the insight behind move 37?
To illustrate the difficulty, we sample random neuron directions from three open-weight models and project each direction into the vocabulary space to see which tokens it promotes most strongly.
export const baselineDirections = [
{
name: "Qwen 3 8B: Direction #42",
tokens: ["'ll", "Warm", "的话", "Morrison", "hte", "Cy", "Cyber", "ords", ".www", "撸"],
},
{
name: "Qwen 3 8B: Direction #222",
tokens: ["appen", "Discussions", "濑", "apocalypse", "stown", "ahu", ",:,:", "č", "澄", "企业提供"],
},
{
name: "OLMo 2 13B: Direction #1313",
tokens: ["Deborah", "reco", "igans", "indiscrim", "-dess", "recoil", "Reco", "NPC", "mans", "ajar"],
},
{
name: "OLMo 2 13B: Direction #2026",
tokens: ["icont", "IBC", "prac", "sono", "Cla", "ryn", "cot", "ernel", "ifu", "Tooth"],
},
{
name: "Phi-4: Direction #314",
tokens: ["_mk", "Bars", "utow", "_truth", ".linkedin", "simd", "adero", "TAIL", "rail", "_MSK"],
},
{
name: "Phi-4: Direction #999",
tokens: ["ingles", "backpage", "grab", ";!", "Exchange", "fad", "enberg", "anja", "ringe", "Cad"],
},
]
These results are difficult to decipher. There seems to be no clear pattern to these random neuron directions in the model.
The model's knowledge is distributed across directions that bear no correspondence to human-interpretable concepts.
This is the fundamental nature of the problem: the representations of standard AI systems are, by default, entangled.
Steerling-8B takes a different path; it learns disentangled representations by construction through architectural and training-time constraints.
Consequently, we shift the question from "Can we reverse-engineer what this model knows?" to simply: "What did this model learn?"
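The vocabulary projection used for the random directions above can be sketched with a toy example. The six-token vocabulary, the hand-built 3-dimensional unembedding matrix, and the `top_tokens` helper are all hypothetical; a real model's unembedding matrix has one row per vocabulary token and thousands of dimensions.

```python
import random


def top_tokens(direction, unembedding, vocab, k=3):
    """Score each token by its dot product with the direction; return the top k."""
    scores = [
        sum(d * w for d, w in zip(direction, token_vector))
        for token_vector in unembedding
    ]
    ranked = sorted(range(len(vocab)), key=lambda i: scores[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy vocabulary and a hand-built unembedding matrix (one row per token).
vocab = ["colour", "favour", "centre", "January", "March", "SELECT"]
unembedding = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.8, 0.0, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0],
]

# A direction aligned with the first axis promotes one coherent token family.
aligned = top_tokens([1.0, 0.0, 0.0], unembedding, vocab)

# A random direction promotes whatever tokens happen to line up with it.
rng = random.Random(0)
random_direction = [rng.gauss(0.0, 1.0) for _ in range(3)]
arbitrary = top_tokens(random_direction, unembedding, vocab)

print(aligned)
print(arbitrary)
```

In this toy, the aligned direction surfaces the three British-spelling tokens together, while a random direction has no reason to respect any grouping, which is the contrast the token lists above illustrate.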
## The Concept Unmasking Game
To make the difference between entangled and disentangled representations concrete, try the matching game below.
We took five concepts from Steerling-8B and projected each one into the model's vocabulary, extracting the tokens it weights most heavily.
On the left you'll see concept labels like "SQL Query Keywords" or "Wide-Angle Photography."
On the right, we show groups of tokens. The pairings have been shuffled. Click a concept, then click the token group you think belongs to it.
After matching all 5 pairings, you can click unmask to see how well you did.
If the matches feel obvious, that's the point: in standard models such a matching game is challenging because the model learns distributed representations.
However, because Steerling's concepts are trained to be interpretable, it is easier to understand what the model's representations mean.
In the sections that follow, we show that Steerling-8B has learned thousands of novel concepts that it was never explicitly trained to acquire.
Before we catalog these discoveries, it is worth understanding why this has not been possible before.
The interpretability field has developed post hoc methods to extract knowledge from black-box models.
[Probing](https://arxiv.org/abs/1610.01644) classifiers test whether specific concepts are decodable from a model's activations.
[Sparse autoencoders (SAEs)](https://transformer-circuits.pub/2024/scaling-monosemanticity/) attempt unsupervised decomposition of representations into interpretable directions.
These methods have yielded real insights, but they face an inherent limitation: they are attempting to recover a unique decomposition of a space that admits many.
A model's activations can be decomposed along infinitely many directions, and there is no ground truth that privileges one decomposition over another.
[Post-hoc disentanglement of entangled representations](https://arxiv.org/abs/1811.12359) faces irreducible ambiguity that no post hoc method can fully resolve without strong assumptions.
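A small sketch makes the non-uniqueness concrete. It uses orthonormal 2-D bases only, which is far simpler than the overcomplete dictionaries that sparse autoencoders learn, and the activation vector and helper names are made up. Every rotation angle yields a different set of coefficients, yet each one reconstructs the same activation exactly.

```python
import math


def rotated_basis(theta):
    """Two orthonormal 2-D basis vectors rotated by theta radians."""
    return [
        (math.cos(theta), math.sin(theta)),
        (-math.sin(theta), math.cos(theta)),
    ]


def decompose(activation, basis):
    """Coefficients of the activation along each basis vector (orthonormal case)."""
    return [sum(a * b for a, b in zip(activation, vector)) for vector in basis]


def reconstruct(coefficients, basis):
    """Rebuild the activation as a weighted sum of the basis vectors."""
    return [
        sum(c * vector[i] for c, vector in zip(coefficients, basis))
        for i in range(2)
    ]


activation = [3.0, 1.0]  # made-up activation vector
for theta in (0.0, 0.5, 1.2):
    basis = rotated_basis(theta)
    coefficients = decompose(activation, basis)
    rebuilt = reconstruct(coefficients, basis)
    print(theta, [round(c, 3) for c in coefficients], [round(x, 3) for x in rebuilt])
```

Nothing in the activation itself says which of these bases is the right one, so a post hoc method has to bring its own assumptions, such as sparsity, to pick a decomposition.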
## Discovering new concepts in Steerling-8B
To start, we show a sample of 50 new concepts that Steerling-8B discovered without any constraints.
export const abstractConcepts = [
{
name: "Second-Person Pronouns Across Languages",
insight: "The model unified 'you' across 6+ languages, discovering that these words serve the same function with no explicit multilingual signal.",
tokens: ["you", "YOU", "vous", "你", "você", "thee", "ya", "Ihnen", "ye", "您", "вы", "bạn", "Sie", "usted", "thou", "Вы", "yourself", "yourselves"],
},
{
name: "British vs American English",
insight: "It learned an entire orthographic system, not individual words, but the systematic -ise, -our, -re pattern that defines British spelling.",
tokens: ["favour", "colour", "analyse", "practise", "realise", "recognise", "maths", "minimise", "modelling", "whilst", "centre", "organise", "utilise", "ageing", "flavour", "behaviour", "defence", "catalogue"],
},
{
name: "Modal Verbs of Possibility",
insight: "A clean dedicated concept for epistemic modality, the verbs humans use to express what might, should, or must happen.",
tokens: ["can", "could", "may", "should", "might", "cannot", "must", "will", "shall", "wouldn", "couldn", "shouldn", "CAN", "MUST", "SHOULD", "Can", "Would", "Cannot"],
},
{
name: "Discovery & Revelation Verbs",
insight: "The vocabulary of empirical inquiry, capturing how humans describe the moment of finding things out.",
tokens: ["found", "revealed", "discovered", "identified", "noticed", "detected", "uncovered", "observed", "analyzed", "examined", "investigated", "reported", "verified", "decoded", "mapped", "documented", "cataloged", "demonstrated"],
},
{
name: "Intensity & Degree Amplifiers",
insight: "It learned that 'very', 'extremely', and 'ridiculously' all do the same job: cranking up the amplifier dial of language.",
tokens: ["incredibly", "extremely", "very", "exceedingly", "extraordinarily", "exceptionally", "immensely", "enormously", "insanely", "overly", "hugely", "tremendously", "unusually", "terribly", "ridiculously", "remarkably", "amazingly", "pretty"],
},
{
name: "Em Dash Parentheticals",
insight: "The model isolates em-dash interruptions as a distinct grammatical construction, the aside within the sentence.",
tokens: ["—it", "—is", "—they", "—are", "—you", "—which", "—if", "—all", "—but", "—who", "—as", "—I", "—for", "—that", "—and", "—one", "—the", "—or"],
},
{
name: "Formal Discourse Transitions",
insight: "The connective tissue of academic prose. The model treats these words as a single coherent family of argument-builders.",
tokens: ["Therefore", "Furthermore", "Nevertheless", "However", "Moreover", "Hence", "Thus", "Otherwise", "Alternatively", "Conversely", "Similarly", "Consequently", "Nonetheless", "Accordingly", "Indeed", "Meanwhile", "Besides", "Finally"],
},
{
name: "Research Community Nouns",
insight: "The people who populate academic papers. The model built a concept for the epistemic community itself.",
tokens: ["researchers", "scientists", "scholars", "authors", "historians", "investigators", "theorists", "analysts", "practitioners", "reviewers", "observers", "physicists", "philosophers", "psychologists", "astronomers", "critics", "programmers", "developers"],
},
{
name: "Conditional Conjunctions",
insight: "The logical 'if-then' fabric of language, every word that sets up a condition, contingency, or hypothetical.",
tokens: ["if", "when", "whenever", "unless", "whether", "elif", "wherever", "where", "since", "though", "although", "once", "while", "because", "until", "whilst", "assuming", "whereby"],
},
{
name: "Possessive Determiners",
insight: "Ownership markers as a unified concept, spanning case variations and even archaic forms like 'thy'.",
tokens: ["their", "your", "his", "my", "our", "its", "her", "YOUR", "THEIR", "thy", "His", "Your", "Their", "OUR", "My", "Her", "Its", "MY"],
},
{
name: "Copulative & Existential Verbs",
insight: "The verbs of being and existing, linking verbs that cross from English into French 'être' as a single concept.",
tokens: ["be", "become", "remain", "occur", "appear", "stay", "être", "happen", "seem", "go", "vary", "emerge", "exist", "evolve", "depend", "arise", "suffice", "survive"],
},
{
name: "Evidential Reporting Verbs",
insight: "The attribution engine of language, verbs that say 'according to the evidence...' across news, science, and social media.",
tokens: ["showed", "suggests", "indicates", "demonstrates", "reveals", "recommends", "reported", "announced", "confirmed", "acknowledges", "maintains", "tweeted", "predicts", "identifies", "showcases", "ensures", "implies", "noted"],
},
{
name: "Hyphenated Compound Modifiers",
insight: "The model identified a productive morphological pattern: 'a three-year plan', 'a five-minute walk', 'a ten-mile run'.",
tokens: ["-year", "-week", "-minute", "-month", "-mile", "-hour", "-pound", "-old", "-acre", "-letter", "-star", "-second", "-meter", "-fold", "-million", "-foot", "-day", "-string"],
},
{
name: "Cardinal Number Words",
insight: "Spelled-out numbers form their own concept, completely separate from digit characters. The model sees 'twelve' and '12' differently.",
tokens: ["eight", "eleven", "forty", "thirty", "fifteen", "thirteen", "four", "six", "eighteen", "fourteen", "twenty", "nine", "twelve", "seven", "sixteen", "three", "ten", "five"],
},
{
name: "Open Parenthesis Variants",
insight: "Every conceivable way to open a parenthetical or function call, and the model sees them as one unified 'opening' gesture.",
tokens: [" (", " ((", " ([]", " (_.", " [(", " (&", " (!", " {(", " !(", " ([", " (#", " (~", " (*", " (@", " (;", " (++", " (-", " ($"],
},
{
name: "Verification & Measurement Gerunds",
insight: "What you do when you're being rigorous: the gerund forms of systematic inquiry and quantification.",
tokens: ["determining", "confirming", "verifying", "specifying", "calculating", "identifying", "evaluating", "defining", "assessing", "estimating", "predicting", "detecting", "comparing", "documenting", "measuring", "validating", "analyzing", "interpreting"],
},
{
name: "Transactional Action Verbs",
insight: "What individual agents do in the world: concrete, present-tense actions heavily skewed toward economic activity.",
tokens: ["buys", "provides", "offers", "spends", "sells", "saves", "eats", "earns", "uses", "creates", "publishes", "loses", "collects", "teaches", "gives", "builds", "delivers", "uploads"],
},
{
name: "Communication Action Gerunds",
insight: "Speech acts in progress: the gerunds of doing things with words, from proposing to highlighting.",
tokens: ["expressing", "introducing", "addressing", "offering", "providing", "conveying", "announcing", "recommending", "delivering", "highlighting", "proposing", "promoting", "emphasizing", "drafting", "presenting", "implementing", "advocating", "urging"],
},
{
name: "Passive Reporting Participles",
insight: "The past-participle backbone of citations and references: 'as shown in...', 'the results described above'.",
tokens: ["shown", "mentioned", "listed", "outlined", "quoted", "described", "reported", "depicted", "referenced", "displayed", "summarized", "published", "posted", "discussed", "illustrated", "highlighted", "indicated", "compiled"],
},
{
name: "Process & Investigation Gerunds",
insight: "Active inquiry in progress: the '-ing' form of intellectual work, from researching to experimenting.",
tokens: ["exploring", "researching", "studying", "examining", "investigating", "reviewing", "calculating", "experimenting", "comparing", "analyzing", "evaluating", "determining", "adapting", "computing", "assessing", "leveraging", "optimizing", "configuring"],
},
{
name: "Sentence-Initial Prepositions",
insight: "The words that begin temporal, spatial, or conditional clauses at sentence boundaries, the launching pads of complex thought.",
tokens: ["In", "During", "At", "On", "Before", "With", "For", "Throughout", "When", "Without", "After", "If", "Until", "Despite", "Since", "Regarding", "While", "Beyond"],
},
{
name: "Negation Contractions",
insight: "The model grouped contracted negations into a single concept, learning English auxiliary grammar structure on its own.",
tokens: ["wasn", "weren", "isn", "hadn", "hasn", "didn", "haven", "aren", "couldn", "cannot", "doesn", "shouldn", "wouldn", "lacked", "lacks", "ain", "don", "Isn"],
},
{
name: "The Definite Article Everywhere",
insight: "The most common word in English, recognized in every possible position: sentence-initial, mid-sentence, after punctuation, in code.",
tokens: [" The", " THE", ".The", ":The", ">The", "The", "-The", "\\tThe", "(The", "\"The", "THE", "/The", "the", "_the", ".the", "nThe", "-the", ",the"],
},
{
name: "Month Names",
insight: "Calendar months cluster tightly together. The model discovered the concept of 'time of year' from co-occurrence alone.",
tokens: ["January", "October", "February", "August", "September", "April", "July", "June", "December", "November", "March", "May", "Nikon", "Rated", "Modified", "Original", "Date", "Selected"],
},
{
name: "Ability & Inability",
insight: "The language of human limitation and effort: 'can't', 'struggling to', 'impossible' unified into one concept.",
tokens: ["inability", "able", "unable", "attempting", "ability", "attempt", "trying", "impossible", "insufficient", "supposed", "obligated", "inadequate", "extraordinary", "indispensable", "ceased", "unwilling", "astounding", "ought"],
},
{
name: "'Have' as Auxiliary & Possession",
insight: "The entire 'have' family: auxiliary tense marker, possession verb, and semantic neighbors, unified across English and German.",
tokens: ["have", "has", "had", "possess", "need", "having", "Having", "contain", "habe", "exhibit", "require", "receive", "offer", "want", "owe", "haven", "deserve", "harbor"],
},
{
name: "'With' as Accompaniment",
insight: "The preposition of accompaniment, including its French equivalent and semantic neighbors like 'alongside' and 'featuring'.",
tokens: ["with", "WITH", "avec", "alongside", "using", "With", "featuring", "without", "including", "containing", "boasting", "having", "amongst", "through", "employing", "equipped", "utilizing", "starring"],
},
{
name: "Interpretive & Evidential Verbs",
insight: "The meta-language of interpretation: verbs for discussing what texts, data, or artworks mean.",
tokens: ["suggest", "indicate", "imply", "signify", "illustrates", "emphasize", "demonstrates", "convey", "showcases", "depicts", "reflects", "portrays", "underscore", "denotes", "describes", "reinforces", "entail", "portrays"],
},
{
name: "Narrative Sequence Transitions",
insight: "Temporal and logical connectors that advance a narrative or argument forward, the gears of storytelling.",
tokens: ["According", "However", "Then", "Thus", "Later", "Therefore", "Finally", "Shortly", "Moreover", "Further", "And", "Second", "First", "Even", "Also", "Due", "Once", "Meanwhile"],
},
{
name: "Conversational Discourse Markers",
insight: "Like formal transitions but more casual: 'Sure', 'Well', and 'So' sit alongside 'However' and 'Indeed'.",
tokens: ["Additionally", "However", "But", "Moreover", "Instead", "Sure", "Well", "So", "Indeed", "Furthermore", "Regardless", "Actually", "Obviously", "Yes", "Simply", "Still", "Clearly", "Perhaps"],
},
{
name: "Sentence Starters After Breaks",
insight: "The words that begin new independent thoughts. The model learns what typically follows a paragraph break.",
tokens: ["That", "This", "It", "Instead", "Each", "One", "There", "If", "None", "So", "They", "Together", "Thus", "Another", "However", "Which", "Here", "He"],
},
{
name: "Tab-Indented Code Tokens",
insight: "The model recognizes 'code that starts with a tab' as a unified concept, the visual signature of indented source code.",
tokens: ["\\ts", "\\tc", "\\tw", "\\td", "\\tg", "\\tf", "\\th", "\\tp", "\\tset", "\\tj", "\\tm", "\\tb", "\\tadd", "\\tt", "\\tfinal", "\\tdouble", "\\tl", "\\tdata"],
},
{
name: "Regex & Escape Characters",
insight: "Backslashes and escape sequences form their own concept. The model knows what \\\\ does in code.",
tokens: ["(\\\\", " \\\\", "\\\\", "_\\\\", "=\\\\", "^\\\\", "[\\\\", ")\\\\", "+\\\\", "|\\\\", "-\\\\", "\\\\b", "*\\\\", "\\\\s", "\\\\P", "\\\\)", "\\\\v", "\\\\t"],
},
{
name: "Horizontal Rules & Section Dividers",
insight: "Markdown horizontal rules, bullet separators, and decorative dividers: the visual grammar of document structure.",
tokens: ["-\\n\\n", "+\\n\\n", "**\\n\\n", "---\\n\\n", "•\\n\\n", "***\\n\\n", "*****\\n\\n", "___\\n\\n", "----\\n\\n", "----------\\n\\n", "#\\n\\n", "--\\n\\n", "~\\n\\n", "++\\n\\n", "###\\n\\n", "=\\n\\n", "--------\\n\\n", "::\\n\\n"],
},
{
name: "Period + Paragraph Break",
insight: "The most fundamental structural token in written text: a sentence ends, a new paragraph begins.",
tokens: [".\\n\\n", ").\\n\\n", "].\\n\\n", "`.\\n\\n", "\".\\n\\n", "'.\\n\\n", ".\\n\\n\\n", "%.\\n\\n", "/.\\n\\n", ">.\\n\\n", ".).\\n\\n", "\").\\n\\n", "!).\\n\\n", "%).\\n\\n", ".).\\n\\n", ",.\\n\\n", "..\\n\\n", " .\\n\\n"],
},
{
name: "British Units & Institutional Language",
insight: "Beyond spelling, this concept captures the cultural context of British English: units, administration, and civic vocabulary.",
tokens: ["litres", "tonnes", "metres", "hectares", "kilometres", "organised", "personalised", "adverts", "councillors", "mosques", "villages", "enquiries", "authorised", "programmes", "signalling", "behavioural", "analysed", "parcels"],
},
{
name: "Closing Parenthesis Ecosystem",
insight: "The mirror image of open parenthesis: every flavor of closing a thought, from code to prose to nested expressions.",
tokens: [")", ")\\n", "())", ").\\n", "!)", ".)", "),", ").\\n\\n", ".)\\n\\n", ")\\n\\n", " )", "\")", "))", ")\\r\\n", "-)", ",)", "')", "...)"],
},
{
name: "Digit Characters",
insight: "Digit characters cluster separately from spelled-out numbers. The model maintains two parallel counting systems.",
tokens: ["3", "6", "4", "2", "5", "8", "9", "7", "1", "0", "16", "29", "12", "14", "22", "17", "13", "23"],
},
{
name: "Quantity & Frequency Adverbs",
insight: "Number words blended with frequency adverbs: the concept of 'how many' or 'how often' as a unified idea.",
tokens: ["several", "five", "six", "three", "four", "numerous", "twelve", "sixteen", "nine", "extensively", "routinely", "fifteen", "ten", "seven", "prominently", "meticulously", "regularly", "commonly"],
},
{
name: "Quoted Speech Openers",
insight: "The quotation mark and pronoun pattern that signals someone is about to speak in narrative text.",
tokens: ["\"They", "\"This", "\"When", "\"I", "\"We", "\"It", "\"And", "\"He", "\"People", "\"But", "\"So", "\"The", "\"If", "\"She", "\"You", "\"What", "\"My", "\"Some"],
},
{
name: "Windows Line Endings in Code",
insight: "The \\r\\n signature of Windows-originated source code. The model learned to distinguish operating systems from whitespace.",
tokens: [");\\r\\n", "));\\r\\n", ";\\r\\n", "();\\r\\n", "))\\r\\n", ");\\r\\n\\r\\n", "),\\r\\n", "())\\r\\n", ")\\r\\n", "());\\r\\n", ";\\r\\n\\r\\n", "];\\r\\n", ").\\r\\n", "');\\r\\n", ",\\r\\n", "\");\\r\\n", " }\\r\\n", "'\\r\\n"],
},
{
name: "Mid-Sentence Run-On Capitals",
insight: "Missing spaces after periods. The model learned to recognize this specific typographic error as a pattern.",
tokens: [".Our", ".It", ",the", ".That", ".They", ".He", ".There", ".We", ".You", ".Some", ",it", ".She", ".Any", ".Other", ".This", ".Information", ".My", ".Many"],
},
{
name: "Post-Sentence Adversative Pivots",
insight: "What comes after a completed thought, especially pivots and counterarguments in sports, debate, and analysis writing.",
tokens: [").\\n", "However", "Similarly", "Conversely", "Instead", "Likewise", "Furthermore", "Regardless", "Even", "Otherwise", "Therefore", "opponents", "competition", "championship", "tournament", "defending", "MMA", "FPS"],
},
{
name: "Emotional Discourse Starters",
insight: "'However' and 'Unfortunately' as siblings. The model groups words by their emotional-evaluative function at sentence starts.",
tokens: ["However", "Unfortunately", "Therefore", "Fortunately", "Interestingly", "Basically", "Indeed", "Luckily", "Ultimately", "Obviously", "Still", "Rather", "Consider", "Anyway", "Generally", "Lastly", "Clearly", "Hopefully"],
},
{
name: "Asked & Wondered (Dialogue Beats)",
insight: "Paragraph breaks after dialogue actions. The model clusters 'asked', 'wondered', 'laughed', 'begged' as narrative beats.",
tokens: [").\\n\\n", "].\\n\\n", ".).\\n\\n", "asked", "questioned", "emailed", "approached", "wondered", "begged", "puzzled", "guessed", "teased", "laughed", "mocked", "imagined", "called", "added", "narrowed"],
},
{
name: "Negation Meets Commerce",
insight: "A surprising blend: negation contractions merge with consumer-value language, as in 'isn't it wonderful/affordable/sustainable?'",
tokens: ["hasn", "wasn", "hadn", "haven", "Isn", "isn", "weren", "businesses", "didn", "wonderful", "saves", "valuable", "sustainable", "stores", "affordable", "supermarket", "consumers", "delicious"],
},
{
name: "Question + Difficulty",
insight: "Questions cluster with words expressing hardship. The model associates interrogatives with problems that need solving.",
tokens: ["?", ")?", "?\\n\\n", "Murder", "tricky", "heartbreaking", "daunting", "debilitating", "devastating", "confusing", "fatal", "disastrous", "difficult", "threatening", "Looks", "??", "!!!"],
},
{
name: "Business & Media Agents",
insight: "Period-newline patterns merge with the nouns for people who make markets: advertisers, publishers, retailers, economists.",
tokens: [".\\n", ").\\n", "().\\n", "advertisers", "publishers", "marketers", "newspapers", "retailers", "sellers", "economists", "executives", "merchants", "directors", "businessmen", "critics", "subsidiaries", "promotions", "sales"],
},
{
name: "Encoding Failures (Mojibake)",
insight: "The model has a dedicated concept for broken Unicode, the garbled characters that appear when encodings go wrong.",
tokens: ["�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "�", "Looks", "�", "️", "Having", "Right", "Sure", "How"],
},
{
name: "ALL-CAPS Domains & Categories",
insight: "Uppercase category headers from structured data. The model learned to recognize metadata labeling patterns.",
tokens: ["HEALTH", "YEARS", "ENERGY", "TECHNO", "IMAGE", "DEVELO", "MICRO", "BUSINESS", "RESULTS", "TIMES", "TESTING", "PRODUCT", "MONEY", "YOUR", "IMAGES", "MODEL", "DEVICE", "UNIVERSITY"],
},
];
Steerling-8B's design directly decomposes the model's embeddings into three explicit pathways: ~33K supervised "known" concepts, ~100K "discovered" concepts the model learns on its own, and a residual that captures whatever remains.
For the full architecture, training objectives, and scaling analysis, see [Scaling Interpretable Models to 8B](https://www.guidelabs.ai/post/scaling-interpretable-models-8b/).
Schematic of Steerling-8B's embedding decomposition process.
To identify what a discovered concept represents, we project its embedding into vocabulary space.
The highest-weighted tokens reveal the concept's semantic content.
If those tokens cluster around a recognizable theme - SQL keywords, British spellings, second-person pronouns - we can assign the concept a human-readable label.
The method is deliberately simple: the architecture and Steerling's design do the heavy lifting, and the projection merely reads off what the model already organized.
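As a rough illustration of the projection step described above, the sketch below scores every vocabulary token against one concept embedding and reads off the highest-weighted tokens. It assumes the projection is a plain matrix product with the output (unembedding) matrix; `top_tokens_for_concept`, `unembedding`, and the toy vocabulary are hypothetical names for illustration, not part of the Steerling code.

```python
import numpy as np

def top_tokens_for_concept(concept_embedding, unembedding, vocab, k=5):
    """Project a concept embedding into vocabulary space and return the top-k tokens.

    concept_embedding: (d,) vector for one concept (hypothetical).
    unembedding: (V, d) matrix mapping hidden states to token scores (assumed).
    vocab: list of V token strings.
    """
    scores = unembedding @ concept_embedding      # (V,) one score per token
    top = np.argsort(scores)[::-1][:k]            # indices of the highest scores first
    return [(vocab[i], float(scores[i])) for i in top]

# Toy data: two hidden dimensions, six tokens
vocab = ["litres", "metres", "SELECT", "FROM", "you", "your"]
unembedding = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
    [-0.5, -0.5],
    [-0.4, -0.6],
])
british_units = np.array([1.0, 0.0])              # a made-up concept direction
print(top_tokens_for_concept(british_units, unembedding, vocab, k=2))
```

If the top tokens share an obvious theme, as the two unit words do in this toy case, the concept can be given a label.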
## How Different are Known and Discovered Concepts?
Steerling-8B has ~33K known concepts it was explicitly trained on and ~100K discovered concepts it learned on its own.
A natural question is whether these discovered concepts are as coherent and useful as the known ones, or whether the model simply filled those extra dimensions with noise.

To visualize the relationship between known and discovered concepts, we project
each concept's embedding vector into two dimensions using UMAP. Each point represents
a single concept. Known concepts (blue) and discovered concepts (gold) occupy distinct
but adjacent regions of the space, confirming that the model's unsupervised concepts
are structurally different from the supervised ones.
The visual separation confirms the two groups are structurally distinct.
But we might wonder whether both groups of concepts are equally high quality.
To answer this, we evaluate every concept along multiple dimensions.
We define an **analytical coherence score** as the harmonic mean of three metrics computed directly from the model's representations:
- **Separation** measures whether a concept strongly promotes specific tokens over the rest of the vocabulary.
- **Concentration** measures whether that preference is sharply peaked or diffusely spread across many tokens.
- **Coherence** measures whether the promoted tokens actually mean similar things, forming a genuine semantic cluster.
The remaining three dimensions are assessed by an LLM (scored in the range 0-10) examining each concept's top tokens and assigned labels:
- **Crispness** captures whether the concept picks out a specific, well-bounded idea rather than a vague or fuzzy one.
- **Usefulness** estimates whether a person would actually care about controlling the concept at inference time.
- **Controversiality** indicates whether the concept touches on sensitive or divisive subject matter.
These criteria probe complementary aspects of concept quality.
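To make the analytical coherence score concrete, here is a minimal sketch of a harmonic mean over the three metrics. It assumes each metric has already been normalized to the range (0, 1]; how separation, concentration, and coherence are themselves computed is not shown, and the function name is hypothetical.

```python
def analytical_coherence(separation, concentration, coherence):
    """Harmonic mean of three metrics, each assumed to lie in (0, 1]."""
    metrics = [separation, concentration, coherence]
    if any(m <= 0 for m in metrics):
        return 0.0  # treat a zero (or negative) component as zero overall
    return len(metrics) / sum(1.0 / m for m in metrics)

balanced = analytical_coherence(0.4, 0.4, 0.4)     # all three metrics agree
lopsided = analytical_coherence(0.9, 0.9, 0.05)    # one weak metric
print(balanced, lopsided)
```

The harmonic mean is dominated by the weakest component, so a concept cannot score well by excelling on two metrics while failing the third.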

Distribution of concept quality metrics across known (blue) and discovered (gold) concepts.
Top left: Analytical coherence, the harmonic mean of separation, concentration, and coherence scores computed from the model's representations.
Top right: Crispness. Bottom left: Usefulness. Bottom right: Controversiality. The latter three are assessed by an LLM examining each concept's top tokens.
The distributions largely overlap. For the analytical coherence score (top left), known concepts skew higher, centering around 0.35–0.45 compared to 0.20–0.30 for discovered concepts, but the two distributions share substantial overlap.
The modest shift might reflect the absence of curated supervision rather than a fundamental quality gap.
For the three LLM-assessed metrics, the picture is even more striking. Crispness scores (top right) cluster tightly around 7–8 for both groups. Usefulness (bottom left) shows a similar pattern, with both groups peaking around 5–7. Controversiality (bottom right) is low for both, concentrated near 0–2, meaning neither group disproportionately captures sensitive topics.
The implication is that the ~100K discovered concepts are not filler.
The model's unsupervised concept formation produces representations that are somewhat less analytically coherent than the supervised ones, yet appear similarly crisp, useful, and non-controversial.
## Looking Forward
The standard approach to understanding neural networks is to train a black box and then try to pry it open.
Steerling suggests an alternative: build the model so that its knowledge is accessible from the start. The ~100K concepts discovered here are a first demonstration of what becomes possible when interpretability is a design choice rather than a retrofit.
To explore Steerling-8B yourself:
- 🤗 [Steerling-8B on Hugging Face](https://huggingface.co/guidelabs/steerling-8b)
- 💻 [Code on GitHub](https://github.com/guidelabs/steerling)
---
--- title: Steering Interpretable Language Models
--- date: Wed Feb 25 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/steerling-steering-8b/
## Steering Interpretable Language Models
We show that [`Steerling-8B`](https://huggingface.co/guidelabs/steerling-8b) enables concept algebra: you can add, remove, and compose human-understandable concepts at
inference time to directly control what the model generates, without retraining or prompt engineering.
## Concept Algebra with Steerling-8B
What if you could directly edit the internal representations of a model towards any concept you care about,
without changing the prompt? Steerling-8B's architecture natively supports injecting and suppressing any concept
the model has learned, directly at inference time.
In multi-turn dialog settings, steering one concept at a time is insufficient.
You need compositional control, not just on a neutral prompt, but on a conversation that is already shaped by prior context.
Consider a content moderation system that must suppress toxicity yet preserve fluency, or a health assistant that needs to provide medical guidance while navigating the legal ramifications of its advice.
The demonstration below shows how Steerling-8B enables exactly this capability with concept algebra.
## Current LLMs are not built to be reliably steered
Current methods for controlling language model behavior are blunt instruments.
**Prompting** is accessible but often unreliable.
System prompts can be overridden through adversarial inputs.
Few-shot examples consume context and don't reliably generalize.
More critically, prompting doesn't reveal which internal mechanisms drove the result, so if your goal changes, nothing from one session transfers to the next.
**Fine-tuning methods** offer more control but at high cost.
Fine-tuning modifies weights globally: suppressing one behavior can silently degrade others.
Standard reinforcement-learning-based post-training reshapes the entire output distribution to satisfy a scalar reward signal.
Even modest behavioral changes can require thousands of labeled examples, and both approaches demand full retraining for every new steering objective.
**Post-hoc interpretability methods** steer fragile artifacts. SAEs, linear probes,
and activation patching attempt to discover controllable concepts in a model that might never have them to begin with.
Probes can detect information in representations without confirming the model uses that information for generation. Activation patching offers no compositionality guarantees: patching directions A and B simultaneously may not produce the sum of their effects.
At Guide Labs, we believe that if you want reliable, composable, fine-grained control, the model has to be designed for it.
## From Explanation to Control
In our [previous post](https://www.guidelabs.ai/post/scaling-interpretable-models-8b/), we introduced the concept module:
an architectural bottleneck that forces every prediction through human-interpretable concepts.
The concept module gives us something that black-box models lack: a clean, algebraic handle on the internal variables that drive generation.
Every output logit is a linear function of concept activations and concept embeddings.
This means we can not only **explain** what the model is doing but also **control** it natively by modifying concept activations at inference time.
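The linearity claim can be checked on a toy example. The sketch below is a simplified stand-in, not the Steerling implementation: it assumes the hidden state is a weighted sum of concept embeddings plus a residual, and that logits are a single matrix product. All array names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, d, vocab_size = 4, 8, 10

E = rng.normal(size=(n_concepts, d))   # concept embeddings (hypothetical)
W = rng.normal(size=(vocab_size, d))   # output projection (hypothetical)
a = np.array([0.5, 0.0, 1.2, 0.0])     # concept activations at one position
r = 0.1 * rng.normal(size=d)           # residual not explained by concepts

def logits(activations):
    hidden = activations @ E + r       # concepts enter the hidden state linearly
    return W @ hidden

# Exact decomposition: per-concept contributions plus the residual term
contributions = a[:, None] * (E @ W.T)             # (n_concepts, vocab_size)
assert np.allclose(contributions.sum(axis=0) + W @ r, logits(a))

# Concept algebra: inject concept 1 and suppress concept 2 in one edit
inject = np.array([0.0, 2.0, 0.0, 0.0])
suppress = np.array([0.0, 0.0, -1.2, 0.0])
combined_shift = logits(a + inject + suppress) - logits(a)
separate_shifts = (logits(a + inject) - logits(a)) + (logits(a + suppress) - logits(a))
assert np.allclose(combined_shift, separate_shifts)  # effects add up
```

Under these assumptions, editing two concepts at once shifts the logits by exactly the sum of the two individual shifts, which is the compositionality property that activation patching in a black-box model does not guarantee.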
To make this control reliable for diffusion decoding, we use mask-aligned injection: injecting concept
embeddings only into **currently masked (undecided)** positions, matching the training distribution and naturally
annealing as positions become unmasked to preserve text quality.
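A minimal sketch of the idea behind mask-aligned injection follows. It assumes injection means adding a scaled concept embedding to per-position hidden states, and it uses a made-up left-to-right unmasking schedule; the function name, array layout, and schedule are all hypothetical.

```python
import numpy as np

def mask_aligned_inject(hidden, is_masked, concept_embedding, strength=1.0):
    """Add a concept direction only at positions that are still masked.

    hidden: (T, d) per-position hidden states (hypothetical layout).
    is_masked: (T,) boolean array, True where the token is still undecided.
    """
    steered = hidden.copy()
    steered[is_masked] += strength * concept_embedding
    return steered

T, d = 6, 4
hidden = np.zeros((T, d))
concept = np.array([1.0, 0.0, 0.0, 0.0])

# Hypothetical schedule: two more positions become decided at each step
for step, n_decided in enumerate([0, 2, 4, 6]):
    is_masked = np.arange(T) >= n_decided
    steered = mask_aligned_inject(hidden, is_masked, concept, strength=2.0)
    touched = int((steered != hidden).any(axis=1).sum())
    print(f"step {step}: {touched} positions steered")
```

The number of steered positions shrinks as tokens are committed, which is the annealing effect described above: already-decided text is left untouched.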
This post demonstrates that control in practice.
We show three capabilities:
1. **Concept injection**: steering a generic prompt toward any target domain
2. **Concept suppression**: unlearning a concept the model would otherwise express
3. **Multi-concept steering**: perform concept algebra on multiple concepts simultaneously
All examples are generated by [`Steerling-8B`](https://huggingface.co/guidelabs/steerling-8b), our 8B-parameter inherently interpretable diffusion language model. **Note that Steerling-8B is a base model, not an instruction-tuned model.**
## Concept Injection: One Prompt, Five Destinations
The most common demonstration of steering is taking a single, domain-neutral prompt and
showing how different concept injections redirect the output into entirely different
domains, with no changes to the prompt itself.
This prompt contains no domain keywords. It could continue about anything. Below, we
show the unsteered baseline followed by the same prompt steered toward five different
concepts.
## Concept Suppression: Unlearning at Inference Time
Steering is not just about adding concepts; it can also remove them.
The concept module enables a distinct mechanism for this: bottleneck intervention,
which goes directly to the concept activation layer and wipes out a specific concept's
contribution before it can influence generation.
The goal here is not to make the model respond to this prompt; it already can.
The goal is to make it stop mentioning this specific concept entirely.
## Quantitative Evaluation
To move beyond a few examples, we evaluate steering systematically across 100 concepts
and 20 prompts per concept: 2,000 samples in total. A Mistral-24B LLM judge scores each generation
on two dimensions:
- **Concept score** (0–2): does the output express the target concept?
- **Quality score** (0–2): is the text coherent, fluent, and easy to read?
We report the arithmetic and harmonic means, where the harmonic mean penalizes methods
that score well on one axis but poorly on the other.
Starting from near-zero concept adherence (0.015), steering raises concept score to 0.783
while retaining 84% of baseline generation quality.
The harmonic mean of 0.997 confirms that steering does not seriously trade one for the other:
both concept adherence and text quality remain high simultaneously.
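As a back-of-the-envelope reading of these numbers, suppose the harmonic mean is taken over the two mean scores on their 0–2 scales (an assumption made here for illustration). Writing $c$ for the concept score, $q$ for the quality score, and $H$ for their harmonic mean, the reported values imply a mean quality score of roughly 1.37:

$$
H = \frac{2cq}{c + q}
\quad\Longrightarrow\quad
q = \frac{Hc}{2c - H} = \frac{0.997 \times 0.783}{2 \times 0.783 - 0.997} \approx 1.37
$$

If the harmonic mean is instead computed per sample and then averaged, this back-calculation does not apply.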
## Conclusion
The steering capabilities demonstrated here are a direct consequence of the concept
module's linear architecture. Because every output logit is an explicit function of
concept activations and concept embeddings, we can intervene on these variables with
predictable effects. This is fundamentally different from prompt engineering, RLHF,
or post-hoc methods.
To explore Steerling-8B yourself:
- 🤗 [Steerling-8B on Hugging Face](https://huggingface.co/guidelabs/steerling-8b)
- 💻 [Code on GitHub](https://github.com/guidelabs/steerling)
---
--- title: Steerling-8B: The First Inherently Interpretable Language Model
--- date: Mon Feb 23 2026 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/steerling-8b-base-model-release/
## Steerling-8B: The First Inherently Interpretable Language Model
We are releasing Steerling-8B, the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data.
Trained on 1.35 trillion tokens, the model achieves downstream performance within range of models trained on 2–7× more data.
Steerling-8B unlocks several capabilities, including suppressing or amplifying specific concepts at inference time without retraining, training data provenance for any generated chunk, and inference-time alignment via concept control, which replaces thousands of safety training examples with explicit concept-level steering.
## Overview
For the first time, a language model, at the 8-billion-parameter scale, can explain every token it produces in three key ways.
More specifically, for any group of output tokens that Steerling generates, we can trace these tokens to:
1. **[Input context]** the prompt tokens,
2. **[Concepts]** human-understandable topics in the model's representations, and
3. **[Training data]** the training data that drove the output.
### Artifacts
We are releasing the weights of a base model trained on 1.35T tokens as well as companion code to interact and play with the model.
- 🤗 [Steerling-8B model weights on Hugging Face](https://huggingface.co/guidelabs/steerling-8b)
- 💻 [Code on GitHub](https://github.com/guidelabs/steerling)
- 📦 [Package on PyPI](https://pypi.org/project/steerling/)
## Steerling-8B in Action
Below we show Steerling-8B generating text from a prompt across various categories. You can select an example, then click on any highlighted chunk of the output. The panel below will update to show:
1. **Input Feature attribution:** which tokens in the input prompt strongly influenced that chunk.
2. **Concept attribution:** the ranked list of concepts, both tone (e.g. analytical, *clinical*) and content (e.g. Genetic alteration methodologies), that the model routed through to produce that chunk.
3. **Training data attribution:** how the concepts in that chunk distribute across training sources (ArXiv, Wikipedia, FLAN, etc.), showing where in the training data the model's knowledge originates.
### Overview
Steerling is built on a [causal discrete diffusion model](https://www.guidelabs.ai/post/block-causal-diffusion-language-model/) backbone, which lets us steer generation across multi-token spans rather than only at the next token.
The key design choice is decomposing the model's embeddings into three explicit pathways: ~33K supervised "known" concepts, ~100K "discovered" concepts the model learns on its own, and a residual that captures whatever remains.
We then constrain the model with training loss functions that ensure the model routes signal through concepts without a fundamental tradeoff with performance.
Because the concepts feed into logits through a linear path, every prediction decomposes exactly into per-concept contributions, and we can edit those contributions at inference time without retraining.
For the full architecture, training objectives, and scaling analysis, see [Scaling Interpretable Models to 8B](https://www.guidelabs.ai/post/scaling-interpretable-models-8b/).
## Performance
Despite being trained with significantly less compute than comparable models, Steerling-8B achieves competitive performance across standard benchmarks. The figure below shows average performance (across 7 benchmarks) versus approximate training FLOPs on a log scale, with vertical lines marking multiples of Steerling's compute budget.

Steerling outperforms both LLaMA2-7B and DeepSeek-7B on overall average despite using fewer FLOPs, and remains within range of models trained with 2–10× more compute.

Steerling performance across various benchmarks ranging from general purpose question answering to those focused on reasoning and math.
## Interpretability
In the [previous update](https://www.guidelabs.ai/post/scaling-interpretable-models-8b/), we shared several ways to assess how interpretable a model's representations are.
Here we provide another metric that gives insight into the model's use of its concepts. On a held-out validation set, over 84% of token-level contribution comes from the concept module: the model is not just using the residual to make its predictions.
This matters for control: if the model's predictions genuinely flow through concepts, then editing those concepts at inference time actually changes what the model does rather than nudging a side channel while the real work happens elsewhere.

Token level logit distribution of Steerling-8B's activations on a held-out validation set. Over 84% of token-level contribution comes from the concept module.
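The exact computation behind the 84% figure is not spelled out here. One simple way to measure such a share, offered purely as an illustration, is to split each token's logit into a concept part and a residual part and compare their absolute sizes. The function name and the per-token numbers below are made up.

```python
import numpy as np

def concept_share(concept_logit, residual_logit):
    """Fraction of absolute logit contribution coming from the concept pathway."""
    c = np.abs(np.asarray(concept_logit, dtype=float))
    r = np.abs(np.asarray(residual_logit, dtype=float))
    return c / (c + r + 1e-12)   # small constant avoids division by zero

# Hypothetical per-token contributions to the target token's logit
concept_part = np.array([4.0, 2.5, 3.0, 0.5])
residual_part = np.array([1.0, -0.5, 0.0, 0.5])

shares = concept_share(concept_part, residual_part)
print(shares.round(2), round(float(shares.mean()), 2))
```

Averaging such per-token shares over a validation set would give a single summary number of the kind reported above, under this assumed definition.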
A useful check is what happens when we remove the residual pathway. On several LM Harness tasks, dropping the residual has only a small effect, which suggests the model's predictive signal is largely routed through concepts rather than hidden "everything-else" channels.

Change in model performance across a variety of benchmarks with and without the model's residual portion. This indicates that the model mostly relies on concepts, whether supervised or discovered, for its outputs.
Finally, Steerling can **detect known concepts** in text with **96.2% AUC** on a held-out validation dataset.
## What this unlocks
In the coming weeks, we'll be releasing deep dives on each of these capabilities:
- **Concept steering**: precise control via intervention.
- **Concept discovery**: what did Steerling learn that we didn't teach it? We'll open up the discovered concept space and show structure that surprised us.
- **Alignment without fine-tuning**: replace thousands of safety training examples with a handful of concept-level interventions.
- **Memorization & training data valuation**: trace any generation back to the training data that produced it, and assign value to individual data sources.
- **The case for inherent interpretability**: what do you gain when interpretability is designed in from the start, and what do you miss when it's bolted on after the fact?
We'll cover each of these in detail in upcoming posts, with quantitative evaluations and deployment-oriented case studies.
---
--- title: PRISM: Training Data Prototypes for Language Models
--- date: Mon Dec 08 2025 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/prism/
## PRISM: Training Data Prototypes for Language Models
We have trained PRISM, a family of interpretable language models, to answer the question: *when an LLM predicts the next token, which training samples is it relying on?*
PRISM traces its prediction to the training data in a single forward pass, at the same cost as generating a single token.
Across parameter sizes from **130M to 1.6B**, PRISM models stay within 5% of their unconstrained counterparts on validation loss and downstream benchmarks, with negligible impact on training time.
**Tracing the language model's outputs to training data:** In the following demo, PRISM-1.6B decomposes each token it generates into contributions across a handful of prototypes.
A prototype is a learned pattern that represents a cluster of similar examples in the training data.
- Each colored slice on the right shows a prototype’s contribution to the logit for the sampled token,
- Together, the slices add up exactly to the final logit,
- Hover over a slice to see the prototype’s broad category (e.g., *Medical & Bio*), its more specific role (e.g., *“physiology”*), and the representative training data snippet that most strongly activates it.
`We show an interactive projection (UMAP) visualization of all 16,384 prototypes that PRISM-1.6B learned.`
Overall, we observe a prototype dictionary where a bit more than half of the units specialize on low-level morphology and grammatical scaffolding (~35% and ~20%, respectively), while a large minority capture domain-heavy patterns such as medical and biological language (~7%), science and technology (~5%), institutional and civic text (~6%), social and demographic or family descriptions (~3%), named entities (~5%), environment and climate content (~2%), and finance and economics (~1%).
The remaining prototypes concentrate on structured artifacts like numbers and time expressions (~6%), boilerplate fragments (~3%), URLs and identifiers (~2%), and miscellaneous patterns (~3%).
`We now pick a few prototypes and show the training data snippets they map to.`
In the interactive below, each card shows a single learned prototype. The header gives its automatically inferred category and name.
For each prototype, we present the top tokens that are most strongly associated with the prototype, and the training data snippets it maps to.
We can directly trace any generated token to prototypes, and from there to the training data.
`Intervening on prototypes during text generation`
In this demo, we pick one prototype and, at every token, clamp its activation so that its contribution to the sampled token’s logit is forced to be a fixed fraction of PRISM's original top-1 logit.
This lets us see directly how amplifying or muting a single training pattern (for example, *“clinical trial boilerplate”* or *“fraction arithmetic”*) changes the model’s behavior.
Hover over the text to visualize how the sampled token's probability shifts as a result of boosting the prototype.
`Group prototype intervention, during generation, for science & tech (sky blue) prototypes.`
In the demo below, we instead act on an entire labeled category: at each token we inspect the top-16 active prototypes, aggregate the logit signatures of those tagged *Science & Tech*, and add or subtract a fixed fraction of that aggregate, to amplify or reduce the influence of science and tech patterns wherever they appear in the mixture.
Suppressing this category removes the sky blue highlights and shifts the text toward other patterns (such as institutional or civic language), while boosting it produces more science and tech content, like references to web browsers and email infrastructure.
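The group intervention described above can be sketched in a few lines. This is a simplified reading, assuming each prototype has a fixed logit signature and that its contribution to the logits is its activation times that signature; the function name, the array layout, and the toy data (which has no residual term) are hypothetical.

```python
import numpy as np

def group_intervention(logits, activations, signatures, tags, group, gamma, top_n=16):
    """Shift logits by a fraction of one category's aggregate contribution.

    logits: (V,) original next-token logits.
    activations: (M,) non-negative prototype activations for this token.
    signatures: (M, V) per-prototype logit signatures (hypothetical layout).
    tags: length-M list of category labels.
    gamma: > 0 boosts the category, < 0 suppresses it (gamma = -1 removes it).
    """
    top = np.argsort(activations)[::-1][:top_n]        # most active prototypes
    members = [j for j in top if activations[j] > 0 and tags[j] == group]
    if not members:
        return logits.copy()                            # category not active here
    aggregate = sum(activations[j] * signatures[j] for j in members)
    return logits + gamma * aggregate

# Toy example: four prototypes, three vocabulary entries
tags = ["Science & Tech", "Civic", "Science & Tech", "Civic"]
activations = np.array([0.5, 1.0, 0.0, 0.2])
signatures = np.array([[2.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [5.0, 5.0, 5.0],
                       [0.0, 0.0, 1.0]])
logits = activations @ signatures
suppressed = group_intervention(logits, activations, signatures, tags, "Science & Tech", gamma=-1.0)
boosted = group_intervention(logits, activations, signatures, tags, "Science & Tech", gamma=1.0)
print(logits, suppressed, boosted)
```

With `gamma=-1.0` the active *Science & Tech* prototype's contribution is removed entirely, so the remaining logits come only from the other categories.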
## Introduction
Generative AI has a data provenance challenge.
AI labs have paid [record-breaking settlements](https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai) over training data.
[Others face ongoing litigation](https://www.npr.org/2025/03/26/nx-s1-5288157/new-york-times-openai-copyright-case-goes-forward) from publishers.
When statutory damages reach exorbitant amounts, the question at the center of these cases becomes urgent: *when a model generates an output, what training data is it relying on?*
This problem, training data attribution (TDA), matters beyond the courtroom.
Reliable attribution lets us value data appropriately, understand how LLMs solve hard problems, and verify their outputs.
We would prefer a model answering a medical question to rely on journal articles rather than personal blog posts.
Existing approaches based on [influence functions](https://arxiv.org/abs/1703.04730) and [training data attribution](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054) address this question.
However, these methods often require [careful approximations](https://arxiv.org/abs/2506.12965) to [scale](https://arxiv.org/abs/2405.13954) to billion-parameter models, and still [struggle](https://arxiv.org/abs/2305.16971) to provide [reliable](https://arxiv.org/abs/2006.14651) [insights](https://arxiv.org/abs/2506.12965).
**PRISM** takes a different approach: it ties training data attribution directly to the model architecture.
Every prediction decomposes into a sparse combination of learned prototypes: patterns corresponding to clusters of training examples.
The architecture is explicitly constrained so that every output logit can be faithfully traced back to these clusters.
Consequently, attributing the model’s output back to the training data is a **single forward pass**.
A medical answer might draw 60% from a prototype grounded in peer-reviewed abstracts; a code completion might trace back to documentation rather than Stack Overflow.
In the sections that follow, we cover PRISM's architecture and training losses, our automated pipeline for labeling prototypes and retrieving training neighbors, and scaling results from 124M to 1.6B parameters, where PRISM stays within 5% of baseline with under 2% overhead.
## PRISM Architecture & Loss Functions
`We now discuss the key technical underpinnings of our approach, introducing the prototype matrix, routing rule, residual path, and training losses.`

Standard LM heads collapse all learned patterns into a single dense weight matrix; no row or column corresponds to a reusable training pattern.
PRISM asks: what if we made the logit layer an interpretable map of the training data?
We leave the transformer backbone unchanged and only modify the output layer.
Instead of sending the final hidden state $h_t$ directly through a dense matrix $W$, we first express $h_t$ as a sparse, non-negative mixture of prototypes plus a residual, then map to logits.
> Two components replace the dense LM head:
>
> 1. a **bank of prototypes**, each intended to automatically learn a recurring pattern in the training data while being strongly tied to specific training instances; and
> 2. a **sparse mixing mechanism** that, given the current hidden state $h_t$, selects a small set of relevant prototypes and combines their contributions to produce the next-token logits, plus a residual term for whatever is not captured by the prototypes.
>
Informally, the model asks: which few prototypes does this context resemble, and how do they score possible next tokens?
**Notation**
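Embedding dimension $d$, no. of prototypes $M$, vocabulary size $V$, training dataset size $N$.
At step $t$, the decoder hidden state is $h_t \in \mathbb{R}^{d}$ and the vocab logits are $z_t \in \mathbb{R}^{V}$.
Let $P \in \mathbb{R}^{M \times d}$ denote the prototype codebook, with rows $p_1, \dots, p_M$, and $a_t \in \mathbb{R}^{M}_{\ge 0}$ the (sparse) prototype activations. Prototypes live within the model’s final layer embedding.
We write $W \in \mathbb{R}^{V \times d}$ for the LLM’s output projection (optionally tied to embedding).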
### Architecture
To trace the prediction of an LLM back to recurring patterns in the training dataset, we draw inspiration from Prototype Networks, a family of interpretable models with a long history in deep image classification, that make predictions by comparing the current input to some aspect of the training dataset that was previously seen, yielding explanations in the form *this-looks-like-that*. Recent work has proposed bringing these ideas to NLP, but progress remains limited to narrow text classification tasks. Large vocabularies and free-form text generation have proven a major barrier in this respect.
The most direct way to bring *this-looks-like-that* into next-token prediction is to treat it as a $V$-way classification problem over the vocabulary: compute prototype activations $a_t \in \mathbb{R}^{M}$ for the $M$ prototypes and mix them into vocabulary scores with a dense matrix $U \in \mathbb{R}^{V \times M}$, as in ProtoPNet-style classifiers. This adds $VM$ parameters and $O(VM)$ FLOPs per token on top of the prototype similarity cost; with large vocabularies, even moderate $M$ already implies tens to hundreds of millions of new weights, making this approach prohibitively expensive at language-model scale.
PRISM's head instead keeps computation in the model’s embedding space, forming a reconstruction $\hat{h}_t = P^\top a_t$ from the prototype codebook $P$
and applying the output projection $W$ to obtain logits: $W \hat{h}_t = W P^\top a_t$. This is functionally equivalent to a ProtoPNet-style mixer with $U = W P^\top$ while yielding significant parameter reduction (the mixer's $VM$ weights are replaced by the $Vd$ weights of $W$, or by no new weights with weight-tying), and preserving metric continuity by avoiding a coarse switch. Empirically, we find that this reparameterization trains faster and more smoothly. On toy experiments with the TinyStories dataset, the ProtoPNet-style head required longer wall-clock time to reach the same perplexity.
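The equivalence between the two heads is easy to verify numerically. The sketch below uses small random matrices and hypothetical names: `W` for the output projection, `P` for the prototype codebook (one row per prototype), and `a` for non-negative prototype activations. The sizes passed to `mixer_params` are made-up illustrations, not PRISM's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, M, d = 50, 12, 8                        # toy sizes; real models are far larger

W = rng.normal(size=(V, d))                # output projection
P = rng.normal(size=(M, d))                # prototype codebook
a = np.maximum(rng.normal(size=M), 0.0)    # non-negative prototype activations

# Embedding-space head: reconstruct in R^d, then project to the vocabulary
logits_embedding_space = W @ (P.T @ a)

# ProtoPNet-style head with an explicit dense mixer U = W P^T of shape (V, M)
U = W @ P.T
logits_dense_mixer = U @ a

assert np.allclose(logits_embedding_space, logits_dense_mixer)

def mixer_params(vocab_size, n_prototypes):
    """Weights in a dense prototype-to-vocabulary mixer."""
    return vocab_size * n_prototypes

# Hypothetical sizes: a 50,000-token vocabulary and 4,096 prototypes
print(mixer_params(50_000, 4_096))
```

Both heads produce the same logits, but the dense mixer stores a separate weight for every (token, prototype) pair, while the embedding-space head reuses the projection the model already has.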
Following the literature, we adopt an autoregressive backbone as input to the prototype layer. Our modifications are restricted to the final layer, so PRISM can be implemented in a way that is compatible with standard transformer training recipes and, in principle, could also be adapted to other sequence models such as diffusion-based language models. We train the entire model end-to-end, allowing PRISM to learn its own prototypical representation of inputs.
**Positive similarity scoring and top-$k$ routing**
Once the backbone GPT model has processed the current input, the prototype layer computes the similarity of the resulting hidden state $h$ to every prototype in the bank $\mathcal{P} = \{p_1, \dots, p_P\}$. For each prototype $p_j$, we compute its cosine similarity to the current state,
$$s_j = \frac{h^\top p_j}{\lVert h \rVert_2 \, \lVert p_j \rVert_2}.$$
We apply an optional learned scalar $\alpha$ to expand the effective dynamic range of cosine scores. Intuitively, we want the model to expose *non-negative reasoning*: predictions are explained as *this-looks-like-that* (positive evidence from similar prototypes) rather than *this-does-not-look-like-that* (subtractive evidence). Thus, we enforce non-negativity via a rectifier:
$$a_j = \mathrm{ReLU}(\alpha\, s_j).$$
We select the index set $S$ of the $k$ largest activations and define the final, few-hot similarities
$$\tilde{a}_j = \begin{cases} a_j & \text{if } j \in S, \\ 0 & \text{otherwise.} \end{cases}$$
This top-$k$ routing ensures that each token prediction is explained in terms of a small, human-readable set of prototypes rather than a dense mixture over all $P$.
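The scoring and routing steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming non-zero vectors and a fixed scale; the function names `cosine` and `route_top_k` are hypothetical and not part of any released PRISM code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def route_top_k(h, prototypes, k, scale=1.0):
    """Return few-hot activations: non-negative, with at most k non-zero entries."""
    # Scaled cosine score, rectified so only positive evidence survives.
    acts = [max(0.0, scale * cosine(h, p)) for p in prototypes]
    # Indices of the k largest activations.
    keep = set(sorted(range(len(acts)), key=acts.__getitem__, reverse=True)[:k])
    # Zero out everything outside the selected set.
    return [a if j in keep else 0.0 for j, a in enumerate(acts)]


few_hot = route_top_k(
    h=[1.0, 0.0],
    prototypes=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [1.0, 1.0]],
    k=2,
)
print(few_hot)
```

In this toy case the opposite-facing prototype is removed by the rectifier rather than contributing negative evidence, and only the two best-matching prototypes keep a non-zero weight.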
**Sparse reconstruction**
We would like to reason about a prediction using as few prototypical contexts as possible, to enable crisp interpretability. Sparse activations encourage each prototype to specialize and represent tighter clusters of the training data, which makes it easier to summarize what the model is “thinking” in terms of a handful of distinct patterns, i.e., the prototype logit signatures become more fine-grained. Given the $k$ most similar prototypes, with index set $S$ and few-hot activations $\tilde{a}_j$, we form a $k$-sparse reconstruction
$$\hat{h} = \sum_{j \in S} \tilde{a}_j\, p_j.$$
This follows existing SAE literature, which learns sparse dictionaries for hidden states at intermediate layers. In contrast, PRISM learns a sparse dictionary of training-grounded prototypes that directly explain the model’s output logits without a separate decoder. The features learned are also directly tied to groups of training examples (see next section).
**Merge and logits**
We use a residual merge with the original state to account for parts of the input not reconstructed by prototypes. The residual is computed as the difference between the original state $h$ and the reconstruction $\hat{h}$, i.e., $r = h - \hat{h}$ (thus, $h = \hat{h} + r$). The vocabulary projection is standard:
$$z = W(\hat{h} + r).$$
Keeping an explicit residual path preserves the expressivity of the original backbone. Rare or input-dependent tokens need not be forced through the prototype dictionary. Measuring how much of each prediction is accounted for by prototypes versus the residual is straightforward.
**Faithful Logit decomposition**
The PRISM head builds an interpretable logit map at the model’s final layer, ensuring that we can directly quantify the effect and importance of any prototype to any output token by design. Write $W$ for the vocabulary projection, $\tilde{a}_j$ for the few-hot activation of prototype $p_j$, $S$ for the selected index set, and $r$ for the residual. By linearity of $W$, the next-token logits decompose into per-prototype contributions:
$$z = \sum_{j \in S} \tilde{a}_j\, W p_j + W r.$$
Each prototype thus induces a fixed token–logit signature $W p_j$, and the model’s prediction is an explicit, sparse, non-negative mixture over at most $k$ such signatures. This yields additive, causally faithful units that can be ablated or amplified directly at the logit level. When a model predicts a given token, we can recover a given prototype’s exact contribution simply by multiplying its input activation by its fixed logit signature (indexed at the predicted token). As a matter of preference, we combine the scalar into when interpreting the prototype signature. This restricts our interpretation of the final logits to a weighted superposition in the range of top-$k$ prototype signatures.
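The additive structure can be checked numerically. The sketch below assumes the merged state is the prototype reconstruction plus the residual and that the vocabulary projection has no bias term; `prism_logits` and the toy matrices are hypothetical names and values used only for illustration.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def prism_logits(h, prototypes, acts, W):
    """Return total logits, per-prototype logit contributions, and the residual's share."""
    dim = len(h)
    # Sparse reconstruction: activation-weighted sum of prototype vectors.
    h_hat = [sum(a * p[i] for a, p in zip(acts, prototypes)) for i in range(dim)]
    # Residual carries whatever the prototypes did not reconstruct.
    residual = [hi - hh for hi, hh in zip(h, h_hat)]
    merged = [hh + r for hh, r in zip(h_hat, residual)]
    logits = matvec(W, merged)
    # Each active prototype contributes activation * (W @ prototype).
    per_prototype = {
        j: [a * s for s in matvec(W, prototypes[j])]
        for j, a in enumerate(acts)
        if a > 0
    }
    return logits, per_prototype, matvec(W, residual)


W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]            # 3 tokens, hidden size 2
prototypes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits, per_prototype, residual_part = prism_logits(
    h=[1.0, 0.5], prototypes=prototypes, acts=[0.8, 0.0, 0.3], W=W
)
# Summing every prototype's contribution plus the residual's share recovers the logits.
recomposed = [
    sum(contrib[v] for contrib in per_prototype.values()) + residual_part[v]
    for v in range(len(W))
]
print(logits, recomposed)
```

Comparing the prototype terms against `residual_part` for the predicted token is one simple way to measure how much of a prediction the prototypes account for.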
## Loss functions
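Here we detail the loss functions used to train PRISM. Let $\mathcal{T}$ denote the index set of token positions across the current macro-batch. Additionally, let $d_{j,t} = -\cos(p_j, h_t)$ be the negative cosine distance between prototype $p_j$ and the token representation $h_t$ at position $t$.
**Cross-Entropy ($\mathcal{L}_{\mathrm{CE}}$).**
We use the standard objective
$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \log \mathrm{softmax}(z_t)_{y_t},$$
where $y_t$ is the target token at position $t$ and $z_t$ are the logits computed from the merged state.
**Prototype Pull ($\mathcal{L}_{\mathrm{PP}}$).**
We encourage each prototype to anchor to some token in the batch with
$$\mathcal{L}_{\mathrm{PP}} = \frac{1}{P} \sum_{j=1}^{P} \min_{t \in \mathcal{T}} d_{j,t}.$$
**Training-Point Pull ($\mathcal{L}_{\mathrm{TP}}$).**
Symmetrically, every token position should be close to at least one prototype via
$$\mathcal{L}_{\mathrm{TP}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \min_{1 \le j \le P} d_{j,t}.$$
Combined, the $\mathcal{L}_{\mathrm{PP}}$ and $\mathcal{L}_{\mathrm{TP}}$ terms can be viewed as clustering losses in the backbone LM's final layer embedding.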
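The two pull terms can be illustrated with a tiny batch. This sketch assumes the distance is the negated cosine similarity and that each term is a plain average of per-row or per-column minima; the exact normalization used in training is not stated here, and `pull_losses` is a hypothetical helper.

```python
import math


def neg_cosine(u, v):
    """Negated cosine similarity, used here as the prototype-to-token distance."""
    dot = sum(a * b for a, b in zip(u, v))
    return -dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pull_losses(prototypes, tokens):
    """Return (prototype pull, training-point pull) for one batch of token states."""
    dist = [[neg_cosine(p, h) for h in tokens] for p in prototypes]
    # Prototype pull: every prototype should sit near its closest token.
    prototype_pull = sum(min(row) for row in dist) / len(prototypes)
    # Training-point pull: every token should sit near its closest prototype.
    training_point_pull = sum(
        min(dist[j][t] for j in range(len(prototypes))) for t in range(len(tokens))
    ) / len(tokens)
    return prototype_pull, training_point_pull


pp, tp = pull_losses(
    prototypes=[[1.0, 0.0], [0.0, 1.0]],
    tokens=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
print(pp, tp)
```

Here both prototypes coincide with a token, so the prototype pull is already at its minimum, while the diagonal token keeps the training-point pull slightly above it.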
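**Residual ($\mathcal{L}_{\mathrm{res}}$).**
We set $r_t = h_t - \hat{h}_t$, with $h_t$ the original state and $\hat{h}_t$ the prototype reconstruction at token position $t$. We simply minimize the mean-squared residual
$$\mathcal{L}_{\mathrm{res}} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \lVert r_t \rVert_2^2,$$
i.e., the MSE of the mismatch between the prototype reconstruction and the original state, averaged over the token positions $\mathcal{T}$ in the batch.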
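**(Optional) Prototype Diversity ($\mathcal{L}_{\mathrm{div}}$).**
We optionally encourage prototypes to cover diverse representations within the final layer’s embedding, to reduce prototype overlap and encourage specialization. For this setting, we penalize off-diagonal coherence of the $\ell_2$-normalized prototypes. With $\bar{p}_j = p_j / \lVert p_j \rVert_2$ and Gram matrix $G_{ij} = \bar{p}_i^\top \bar{p}_j$, $\mathcal{L}_{\mathrm{div}} = \frac{1}{P(P-1)} \sum_{i \neq j} G_{ij}^2$. For $P > d$, the average squared coherence is lower bounded by the Welch bound. Driving $\mathcal{L}_{\mathrm{div}}$ toward this limit spreads prototypes nearly optimally on the unit sphere $\mathbb{S}^{d-1}$ and empirically yields crisper, more distinct roles without harming validation cross-entropy.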
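As a quick check of the coherence penalty and its floor, the sketch below computes the mean squared off-diagonal coherence and compares it with the Welch bound $(P - d) / (d(P - 1))$. The averaging convention and the function names are assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_squared_coherence(prototypes):
    """Average squared inner product over all ordered pairs of distinct unit prototypes."""
    unit = [normalize(p) for p in prototypes]
    count = len(unit)
    total = 0.0
    for i in range(count):
        for j in range(count):
            if i != j:
                total += sum(a * b for a, b in zip(unit[i], unit[j])) ** 2
    return total / (count * (count - 1))


def welch_bound(num_prototypes, dim):
    """Lower bound on the mean squared coherence of unit vectors (non-trivial when P > d)."""
    return max(0.0, (num_prototypes - dim) / (dim * (num_prototypes - 1)))


# Three unit vectors spaced 120 degrees apart in the plane meet the bound exactly.
triangle = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]
penalty = mean_squared_coherence(triangle)
bound = welch_bound(num_prototypes=3, dim=2)
print(penalty, bound)
```

The equally spaced configuration is the best possible spread for three vectors in two dimensions, so the penalty cannot be pushed below the bound no matter how the prototypes move.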
## Training Data Attribution in a Single Forward Pass
PRISM exposes all quantities needed for attribution during inference. Given hidden state :
The attribution measure over training data is:
where is the precomputed set of training tokens nearest to prototype , and is a weighting over that set (commonly uniform). is fully determined by forward pass values and static mappings. No gradients, no Hessians, no dataset search.
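One plausible way to assemble such an attribution from forward-pass quantities is sketched below. It assumes each active prototype's logit contribution is distributed over its precomputed neighbor set with uniform weights unless others are given. This is an illustrative reading, not the exact measure used by PRISM, and `attribute_to_training` and the identifiers in the example are hypothetical.

```python
from collections import defaultdict


def attribute_to_training(contributions, neighbors, weights=None):
    """Distribute per-prototype logit contributions over each prototype's training neighbors.

    contributions: {prototype_id: contribution to the predicted token's logit}
    neighbors:     {prototype_id: [training_token_id, ...]} (static, precomputed)
    weights:       optional {prototype_id: [weight, ...]}; uniform when omitted
    """
    scores = defaultdict(float)
    for proto_id, contribution in contributions.items():
        ids = neighbors[proto_id]
        w = weights[proto_id] if weights else [1.0 / len(ids)] * len(ids)
        for token_id, weight in zip(ids, w):
            # A training token shared by several prototypes accumulates credit from each.
            scores[token_id] += contribution * weight
    return dict(scores)


scores = attribute_to_training(
    contributions={0: 2.0, 3: 1.0},
    neighbors={0: ["doc1:5", "doc2:9"], 3: ["doc2:9"]},
)
print(scores)
```

Because the neighbor sets are static lookups, everything in this sketch is a table read plus a weighted sum, which is consistent with the claim that no gradients or dataset search are needed at inference time.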
## Automated Interpretability Pipeline
PRISM gives us two handles for automation: each prototype is tied to training contexts via its activations, and each has a fixed logit signature over the vocabulary. We use these to (i) find training snippets each prototype represents, and (ii) assign human-readable labels.
### Nearest Neighbor Search
For each prototype, we recover concrete training examples with a single streaming pass over the dataset. We retain the top-$m$ positions with highest activations:
- For every token position $t$, compute the activation $a_j(t)$ for all prototypes $j$
- Maintain a bounded heap of size $m$ per prototype storing the best matches
- Enforce distinct-position constraints to avoid redundant sliding-window variants
This is a one-pass procedure with $O(Pm)$ memory for $P$ prototypes. Because the prototype-pull loss pulls each prototype toward training tokens, high-activation neighbors exist by construction. In practice, similarity converges after scanning roughly 1% of training data.
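The streaming pass can be sketched with Python's `heapq`, which implements a min-heap: keeping the smallest retained score at the root makes it cheap to evict when a better match arrives. The stream format and the function name `top_m_neighbors` are assumptions for illustration, and the sliding-window de-duplication step is omitted.

```python
import heapq


def top_m_neighbors(activation_stream, num_prototypes, m):
    """Keep the m highest-activation positions per prototype in one pass.

    activation_stream yields (position, activations) with one activation per prototype.
    """
    heaps = [[] for _ in range(num_prototypes)]
    for position, activations in activation_stream:
        for proto_id, activation in enumerate(activations):
            item = (activation, position)
            heap = heaps[proto_id]
            if len(heap) < m:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                # Replace the weakest retained match with the stronger one.
                heapq.heapreplace(heap, item)
    # Strongest match first for each prototype.
    return [sorted(heap, reverse=True) for heap in heaps]


stream = [
    (0, [0.1, 0.9]),
    (1, [0.5, 0.2]),
    (2, [0.7, 0.0]),
    (3, [0.3, 0.8]),
]
neighbors = top_m_neighbors(stream, num_prototypes=2, m=2)
print(neighbors)
```

Memory stays proportional to the number of prototypes times `m`, regardless of how many positions are streamed.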
### Automatic Labeling
For each prototype, we build a compact "card" containing (i) top tokens from its logit signature and (ii) local contexts where it fires. A small labeling model converts this into human-readable metadata: a short name, a one-line description (e.g., "clinical trial boilerplate", "Unix timestamps"), and example contexts.
A second pass assigns coarse tags used in visualizations: broad category (Science & Tech, Numbers & Time, URLs & IDs), syntactic role (noun-like, function word, scaffold phrase), and optional domain tags (medical, US universities). This runs offline on learned prototypes and their neighbors.
# Performance & Scaling
We now discuss the training procedure and performance details from training PRISM end-to-end across various model sizes.
## Scaling from 124M to 1.6B
**Overview of PRISM performance compared to an unconstrained GPT model**\
We train GPT backbones from 124M to 1.6B parameters end-to-end with the prototype layer for one epoch on FineWeb-Edu-10B. PRISM stays within 5% of unconstrained baselines on validation loss and downstream benchmarks across all scales. The prototype layer adds $P \cdot d$ parameters, one $d$-dimensional vector per prototype: at GPT-XL scale, this is 26M parameters (1.7% overhead).
Training time increases by less than 2%. The overhead shrinks as a fraction of total parameters as backbones scale up.
Faithful attribution to training data does not require sacrificing model quality.
## Conclusion
PRISM demonstrates that training data attribution doesn't have to be a post-hoc approximation bolted onto an opaque model.
By building interpretability into the architecture, we get faithful explanations at the cost of a single forward pass.
At 1.6B parameters, PRISM stays within 5% of baseline performance with under 2% overhead.
The prototype dictionary is inspectable, editable, and directly tied to training data.
This work lays a foundation for language models that can be more easily audited and steered, and whose predictions can be faithfully traced to the training data.
Our results indicate that, at GPT-XL scale, there are solutions comfortably within 5% of the original backbone’s performance that satisfy PRISM’s interpretability constraints.
In this view, PRISM does not enforce an accuracy–interpretability tradeoff so much as bias optimization toward a part of the Rashomon set where the logit layer admits a structured, training-data–grounded decomposition into prototypes.
---
--- title: Scaling Interpretable Language Models to 8 Billion Parameters
--- date: Sat Dec 06 2025 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/scaling-interpretable-models-8b/
## Scaling Interpretable Language Models to 8 Billion Parameters
We demonstrate the first successful scaling of a concept-interpretable language model to 8 billion parameters without sacrificing performance.
By forcing predictions through an explicit concept module, our models natively decompose the language model's representations into human-understandable concepts.
Consequently, we can explain any generated token, or group of tokens, in terms of human-understandable concepts.
The concept module is architecture agnostic, and we demonstrate this by applying it to both the next-token prediction and the discrete diffusion language modeling settings.
Crucially, interpretability behaves like a fixed tax: a small constant overhead that preserves scaling laws.
This architecture unlocks capabilities unavailable to black-box models: targeted concept suppression to unlearn undesirable behaviors, complete concept-to-token attribution chains, fine-grained steering through concept-level interventions, and surgical knowledge edits without retraining.
We provide the first scalable blueprint for building language models with transparent, concept-level foundations.
Below, we show example generations from a base model (not instruction fine-tuned), along with human understandable concept explanations for groups of tokens (chunks) that are generated by the model.
Overall, our model natively decomposes the model's representations into human-understandable concepts.
Below we show a selection of concepts that our model learned and the most important tokens associated with those concepts.
Even though we constrain the model to learn human-understandable concepts, it is still able to discover new concepts; we highlight a few of them here.
Importantly, we demonstrate that our model enables unlearning potentially undesirable concepts through concept interventions.
Below, we show a few examples where intervention led to a removal of an unwanted concept.
## Introduction
Language models are inscrutable black boxes.
Billions of parameters transform input tokens into output probabilities, with little visibility into why a particular answer was produced, which parts of the prompt mattered, or how to reliably steer the model toward or away from specific behaviors.
The standard response is post-hoc interpretability: train your model, then try to reverse-engineer what it learned.
This approach offers partial, often unreliable glimpses into behavior.
At Guide Labs, we take a different path: interpretability has to be designed in, from architecture to training objective to the structure of the data itself.
This post describes the concept module, an architectural bottleneck that forces every prediction through human-interpretable concepts.
We've scaled this design to 8 billion parameters and found that interpretability behaves like a fixed tax: a small, constant overhead that preserves scaling laws.
You don't have to choose between capability and transparency.
The concept module gives us three things current models lack:
- **Faithful explanations:** The linear path from concepts to logits lets us compute exact, additive contributions of each concept to any output, rather than approximations or post-hoc rationalizations.
- **Debugging and failure analysis:** When the model misbehaves, we can see which concepts fired and why. We can distinguish bad concepts from spurious input features.
- **Steerable generation:** We can adjust concept activations at inference time to directly control outputs: suppress toxicity, boost technical detail, or perform surgical edits without retraining.
Below, we describe the architecture and training objectives that make this possible, then show empirically that the approach scales.
## The Concept Module
The concept module is a thin bottleneck that sits between the transformer backbone and the output head. The transformer still maps tokens to hidden states as usual, but instead of sending those hidden states straight to the LM head, we:
1. express them in terms of **concept activations**,
2. reconstruct a new hidden state as a **linear combination of concept embeddings**, and
3. only then feed this reconstructed state into the LM head.
Given a language model setting, AR or diffusion, we cut the direct path from hidden state to logits, insert a narrow "concept layer" in the middle, and train the model end-to-end from scratch.
Once this is done, every prediction must go through concepts and we can both **inspect** and **edit** those concepts.
### Data Setup
We assume a pretraining corpus that is divided into chunks, separated by a special end-of-chunk token `[EOC]`.
For each chunk we have a set of known concept labels (for example: legal, medical, politeness, apology).
The labels are chunk-level and positive-only: annotated positives are trusted as "present"; everything else is "unknown" rather than "negative".
We described the system, [ATLAS](https://www.guidelabs.ai/post/atlas-concept-annotated-pretraining-release/), that we built to accomplish this in a previous post.

During training the model assigns token-level probabilities to concepts, but these are only constrained via the chunk-level labels; we never require token-level annotations.
### Architecture
We start from any transformer language model.
Let $h$ be the hidden state for a token (after the backbone). In a vanilla LM, $h$ goes straight into the LM head to predict the next token. With the concept module, that direct path is removed. Instead, $h$ is routed through a concept bottleneck.

### From Hidden States to Concepts
We add two small heads on top of the backbone hidden state $h$:
- $g_k$ for **known concepts**,
- $g_u$ for **unknown concepts**.
Their sigmoid outputs give activation probabilities
$$c_k = \sigma\big(g_k(h)\big) \in [0,1]^{K}, \qquad c_u = \sigma\big(g_u(h)\big) \in [0,1]^{U},$$
where $K$ is the number of known (labelled) concepts and $U$ is the number of unknown concepts, typically . Each dimension is an independent Bernoulli variable: present or absent.
### Concept activations to embeddings
Each concept has a learned embedding, analogous to token embeddings:
- $E_k \in \mathbb{R}^{d \times K}$ with columns for known concepts,
- $E_u \in \mathbb{R}^{d \times U}$ with columns for unknown concepts.
We form concept-weighted embeddings:
$$e_k = E_k\, c_k, \qquad e_u = E_u\, c_u.$$
### Reconstructing the hidden state
We reconstruct a concept-based hidden state
$$\hat{h} = e_k + e_u$$
and feed $\hat{h}$ (not $h$) into the LM head to predict the token.

**Enables simultaneous concept-based explanation and control:** our proposed architectural design has an important consequence: every logit is now a linear function of concept activations and concept embeddings, which directly translates to exact attribution and direct control.
## Training Objectives
With the model's architecture set, we now switch to discussing the associated training objectives that constrain the models towards satisfying the proposed interpretability requirements.
We train the backbone and concept module jointly with four losses.
### 1. Language modelling loss
$\mathcal{L}_{\mathrm{LM}}$ is the masked token prediction (or next-token for AR) objective, but applied to the logits computed from the reconstructed hidden state $\hat{h}$. This keeps overall language modelling quality high.
### 2. Concept presence loss
Our labels say which concepts appear **somewhere in a chunk**, not where. For concept $c$, let $p_{c,t}$ be its predicted probability at token $t$ in the chunk. The probability that $c$ appears at least once is
$$P_c = 1 - \prod_{t} \big(1 - p_{c,t}\big).$$
We compare $P_c$ to the binary chunk-level label $y_c$ with a standard binary cross-entropy loss, summed over known concepts. This encourages the token-level scores to "light up" within the chunk in a way that makes the aggregated probability match the labels.
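The aggregation from token-level scores to a chunk-level probability can be sketched as a noisy-OR followed by binary cross-entropy. This is a minimal illustration assuming independent token-level probabilities; how "unknown" (unlabelled) concepts are excluded from the loss is not covered, and the function names are hypothetical.

```python
import math


def chunk_presence(token_probs):
    """Probability that a concept fires at least once, given independent token-level probabilities."""
    absent_everywhere = 1.0
    for p in token_probs:
        absent_everywhere *= 1.0 - p
    return 1.0 - absent_everywhere


def binary_cross_entropy(pred, label, eps=1e-7):
    """Binary cross-entropy for one prediction, clamped away from 0 and 1 for stability."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


# Two tokens that each weakly suggest the concept already make it likely in the chunk.
presence = chunk_presence([0.5, 0.5])
loss = binary_cross_entropy(presence, label=1)
print(presence, loss)
```

A single confident token is enough to drive the chunk-level probability toward one, which is why no token-level annotation is required for this term.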
### 3. Independence loss
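We would like known and unknown concepts to encode different aspects of $h$. For a minibatch of size $B$, let $Z_k$ and $Z_u$ be the batch matrices of known and unknown embeddings (one row per example), and $\mu_k$ and $\mu_u$ their column means. We penalise their cross-covariance:
$$\mathcal{L}_{\mathrm{ind}} = \left\lVert \frac{1}{B}\,(Z_k - \mu_k)^\top (Z_u - \mu_u) \right\rVert_F^2.$$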
Intuitively, if a pattern is already explained by known concepts, this discourages the model from redundantly encoding it in the unknown space.
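A cross-covariance penalty of this kind can be sketched as the squared Frobenius norm of the batch cross-covariance matrix. The `1/B` normalization and the name `cross_covariance_penalty` are assumptions for illustration.

```python
def cross_covariance_penalty(known, unknown):
    """Squared Frobenius norm of the cross-covariance between two batches of embeddings.

    known and unknown are lists of rows, one row per example in the minibatch.
    """
    batch = len(known)
    mean_k = [sum(col) / batch for col in zip(*known)]
    mean_u = [sum(col) / batch for col in zip(*unknown)]
    penalty = 0.0
    for i in range(len(mean_k)):
        for j in range(len(mean_u)):
            # Covariance between known dimension i and unknown dimension j.
            cov = sum(
                (known[b][i] - mean_k[i]) * (unknown[b][j] - mean_u[j])
                for b in range(batch)
            ) / batch
            penalty += cov * cov
    return penalty


# Identical signals are heavily penalised; uncorrelated signals are not.
redundant = cross_covariance_penalty([[1.0], [2.0], [3.0], [4.0]], [[1.0], [2.0], [3.0], [4.0]])
independent = cross_covariance_penalty([[1.0], [-1.0], [1.0], [-1.0]], [[1.0], [1.0], [-1.0], [-1.0]])
print(redundant, independent)
```

Note that a zero cross-covariance only rules out linear dependence between the two pathways; it is a proxy for independence rather than a guarantee of it.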
### 4. Residual reconstruction loss
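Given ground-truth known concept labels $y \in \{0,1\}^{K}$, the "ideal" known embedding for a hidden state $h$ is
$$e_k^{\star} = E_k\, y,$$
where $E_k$ is the known-concept embedding matrix.
We interpret the unknown embedding as the residual
$$r = h - e_k^{\star}.$$
We train the unknown head so that its prediction $e_u$ matches this residual:
$$\mathcal{L}_{\mathrm{res}} = \lVert e_u - r \rVert_2^2.$$
**Complete Loss.** Putting everything together, the full loss is:
$$\mathcal{L} = \mathcal{L}_{\mathrm{LM}} + \lambda_{\mathrm{pres}}\,\mathcal{L}_{\mathrm{pres}} + \lambda_{\mathrm{ind}}\,\mathcal{L}_{\mathrm{ind}} + \lambda_{\mathrm{res}}\,\mathcal{L}_{\mathrm{res}},$$
where the $\lambda$ coefficients weight the concept presence, independence, and residual terms.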
After training with this objective, the base LM becomes **concept-based by construction**: every prediction is mediated by concept activations and embeddings.
## Attribution and Steering
With the concept module in place, we get a clean, algebraic handle on how the model uses concepts and inputs to produce outputs.
### Concept Attribution
What concepts contributed most to a particular output token?
As we previously mentioned, the model has a linear path from concepts to output logits: the reconstructed state $\hat{h}$ is a linear combination of concept embeddings, so each output logit decomposes into a sum of per-concept terms.
Let $a_c$ be the activation of concept $c$, $e_c$ its embedding, and $w_v$ the LM-head weight vector for output token $v$. Then the contribution of concept $c$ to the logit of $v$ is
$$\mathrm{contrib}(c \to v) = a_c \, w_v^\top e_c.$$
Summing over $c$ recovers the total logit for $v$.
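The decomposition is easy to verify on a toy example. The sketch below assumes a bias-free LM head and uses made-up activations and embeddings; `concept_contributions` and `logit_via_hidden_state` are hypothetical helper names.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def concept_contributions(activations, concept_embeddings, lm_head_row):
    """Per-concept contribution to one token's logit: activation * (w_v . e_c)."""
    return [a * dot(lm_head_row, e) for a, e in zip(activations, concept_embeddings)]


def logit_via_hidden_state(activations, concept_embeddings, lm_head_row):
    """Same logit computed the usual way: build the reconstructed state, then project it."""
    dim = len(lm_head_row)
    h_hat = [
        sum(a * e[i] for a, e in zip(activations, concept_embeddings))
        for i in range(dim)
    ]
    return dot(lm_head_row, h_hat)


activations = [0.9, 0.1, 0.5]                       # one value per concept
concept_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_v = [2.0, -1.0]                                   # LM-head row for the token of interest

contribs = concept_contributions(activations, concept_embeddings, w_v)
logit = logit_via_hidden_state(activations, concept_embeddings, w_v)
print(contribs, sum(contribs), logit)
```

Ranking `contribs` by magnitude gives the concepts that mattered most for this token, with negative entries marking concepts that pushed against it.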
### Feature attribution (prompt tokens → output tokens)
Which input tokens most directly influenced a given output token?
Inspired by [[1](https://arxiv.org/abs/2411.06090)], we use **Integrated Gradients** [[2](https://arxiv.org/abs/1703.01365)] on token embeddings to compute feature attribution. For an input token $x_i$ with embedding $e_i$, baseline embedding $e_i'$, and output token $v$ with logit $z_v$:
- Define interpolation points
$$e_i^{(m)} = e_i' + \tfrac{m}{M}\,\big(e_i - e_i'\big)$$
for $m = 1, \dots, M$.
- The attribution of $x_i$ to $v$ is
$$\mathrm{IG}_i(v) = \big(e_i - e_i'\big)^\top \frac{1}{M} \sum_{m=1}^{M} \nabla_{e_i} z_v\big(e_i^{(m)}\big).$$
Given that our model is trained with `[MASK]`, the `[MASK]` embedding serves as a natural baseline: it represents "no meaningful information", and we measure how much moving to the actual token shifts the logit.
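Integrated Gradients can be sketched on a toy scalar function with a known gradient. The example below uses a midpoint Riemann sum, which is one common discretization, and a stand-in "logit" built from `tanh`; the toy function, its weights, and the zero baseline are assumptions and do not represent the actual model or its `[MASK]` embedding.

```python
import math


def integrated_gradients(grad_fn, x, baseline, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum along the straight path."""
    diff = [xi - bi for xi, bi in zip(x, baseline)]
    avg_grad = [0.0] * len(x)
    for m in range(steps):
        alpha = (m + 0.5) / steps
        point = [bi + alpha * di for bi, di in zip(baseline, diff)]
        grad = grad_fn(point)
        avg_grad = [acc + g / steps for acc, g in zip(avg_grad, grad)]
    # Attribution per dimension: path displacement times the average gradient.
    return [d * g for d, g in zip(diff, avg_grad)]


WEIGHTS = [2.0, 3.0]  # toy weights, chosen arbitrarily


def toy_logit(e):
    """Stand-in for an output logit as a function of one token embedding."""
    return sum(w * math.tanh(v) for w, v in zip(WEIGHTS, e))


def toy_logit_grad(e):
    """Analytic gradient of toy_logit."""
    return [w * (1.0 - math.tanh(v) ** 2) for w, v in zip(WEIGHTS, e)]


embedding = [1.0, -0.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_logit_grad, embedding, baseline)
# Completeness: attributions sum to the change in the output between baseline and input.
gap = toy_logit(embedding) - toy_logit(baseline)
print(attributions, sum(attributions), gap)
```

The completeness check at the end is a useful sanity test in practice: if the attributions do not sum to the output difference, the number of interpolation steps is usually too small.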
### Input-to-Concept Attribution
Which input tokens are responsible for activating a given concept?
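We reuse **Integrated Gradients** [[2](https://arxiv.org/abs/1703.01365)], but with the concept activation as the scalar output. For concept $c$ with activation $a_c$:
$$\mathrm{IG}_i(c) = \big(e_i - e_i'\big)^\top \frac{1}{M} \sum_{m=1}^{M} \nabla_{e_i} a_c\big(e_i^{(m)}\big),$$
where $e_i$ is the embedding of input token $x_i$, $e_i'$ its baseline embedding, and $e_i^{(m)} = e_i' + \tfrac{m}{M}(e_i - e_i')$ are the interpolation points.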
Together with concept attribution, this lets us trace a full chain:
> input tokens → concept activations → output tokens.
### Concept-level Steering
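The same linear structure that makes attribution easy also makes **control** easy. For an output token $v$ we can write:
$$z_v = \sum_{c} a_c \, w_v^\top e_c,$$
where $a_c$ is the activation of concept $c$, $e_c$ its embedding, and $w_v$ the LM-head weight vector for $v$.
At inference time we are free to modify the activations $a_c$ before they feed into this sum.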
We can clamp selected concepts to zero (suppressing their influence), rescale helpful concepts up or down, or apply more structured interventions. This is very different from prompt engineering or RLHF: we are not nudging the model and hoping it responds; we are directly editing the internal variables that determine the output.
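A concept-level intervention amounts to editing the activation vector before the linear readout. The sketch below assumes a bias-free LM head and toy values; `steer` and `logits_from_concepts` are hypothetical names, and real interventions may be more structured than clamping or rescaling.

```python
def steer(activations, clamp_to_zero=(), rescale=None):
    """Return edited activations: selected concepts zeroed, others optionally rescaled."""
    rescale = rescale or {}
    edited = []
    for concept_id, activation in enumerate(activations):
        if concept_id in clamp_to_zero:
            edited.append(0.0)
        else:
            edited.append(activation * rescale.get(concept_id, 1.0))
    return edited


def logits_from_concepts(activations, concept_embeddings, lm_head):
    """Logit for each token: sum over concepts of activation * (w_v . e_c)."""
    return [
        sum(a * sum(w * x for w, x in zip(row, e)) for a, e in zip(activations, concept_embeddings))
        for row in lm_head
    ]


activations = [0.9, 0.4]
concept_embeddings = [[1.0, 0.0], [0.0, 1.0]]
lm_head = [[2.0, 0.0], [0.0, 3.0]]                  # token 0 reads concept 0, token 1 reads concept 1

before = logits_from_concepts(activations, concept_embeddings, lm_head)
suppressed = logits_from_concepts(steer(activations, clamp_to_zero={0}), concept_embeddings, lm_head)
boosted = logits_from_concepts(steer(activations, rescale={1: 2.0}), concept_embeddings, lm_head)
print(before, suppressed, boosted)
```

Because the readout is linear, the change in each logit after an edit equals exactly the change in the edited concept's contribution, so the effect of an intervention can be predicted before running it.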
## Performance Scaling Behavior
Interpretability, in this design, behaves like a fixed tax, not a fundamental limitation. You pay a small, predictable premium for being able to see and steer the model's concepts but you still get essentially the same returns from making the model bigger and training it longer. We demonstrate this through experiments across two model families:
- an autoregressive (AR) family with causal attention, and
- a causal diffusion (CDLM) family with block-causal attention.
For each family we trained three sizes (Small, Medium, Large) and, for each size, a **base model** and a **concept-module model** with the same transformer backbone.
### Parameter overhead
The concept module keeps the transformer core (depth, width, heads, etc.) unchanged and adds two shallow projection heads plus a bank of concept embeddings. At small scales this module is a noticeable fraction of total parameters. As the backbone grows, its relative share shrinks quickly: the overhead grows much more slowly than the rest of the model.
The more important question is whether this overhead **hurts scaling**.
### Performance comparison
All models were trained on the same data and evaluated throughout training on five LM Harness tasks: HellaSwag, OpenBookQA, ARC-Challenge, PIQA, and WinoGrande.
We report average LM Harness accuracy across these tasks.
#### Autoregressive language model family
#### Diffusion language model family
Across all sizes in both families, the base and concept-module curves almost overlap. Their learning curves have nearly the same shape, and the gap in average accuracy is small at all points, especially for larger models. Empirically we show that you can add the concept module without breaking the model's ability to improve with more data.
### Scaling law analysis
To quantify this, we fit simple **scaling laws** relating compute and performance.
We approximate training compute as
$$C \propto N \cdot D,$$
where $N$ is the non-embedding parameter count, $D$ is the number of training tokens, and $C$ is measured in FLOPs up to a constant.
For each family and variant we fit power laws of the form
$$y(C) = a \cdot C^{\,b},$$
with fitted coefficient $a$ and exponent $b$,
for two metrics: validation loss, and LM Harness error (1 minus the average accuracy over the five tasks).
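A power law can be fitted by ordinary least squares in log-log space. The sketch below uses the widely used `6 * N * D` FLOPs approximation as an assumed constant and synthetic data points; none of the numbers are results from this post, and the function names are hypothetical.

```python
import math


def training_compute(non_embedding_params, training_tokens):
    """Approximate training FLOPs; the factor 6 is a common convention assumed here."""
    return 6 * non_embedding_params * training_tokens


def fit_power_law(compute, metric):
    """Fit metric ~ a * compute**b by linear regression on log(compute) vs. log(metric)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(m) for m in metric]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), slope


# Synthetic points generated from a known law, to confirm the fit recovers it.
compute = [1e18, 1e19, 1e20]
loss = [5.0 * c ** -0.1 for c in compute]
a, b = fit_power_law(compute, loss)
print(a, b)
```

Two variants with the same exponent `b` but different `a` appear as parallel lines in log-log space, which is the pattern described as a constant vertical shift.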
For the **AR family**, validation loss follows
- base AR: ,
- AR + concept module: ;
while LM Harness error follows
- base AR: ,
- AR + concept module: .

For the **Diffusion language model family**, validation loss follows
- base CDLM: ,
- CDLM + concept module: ;
and LM Harness error follows
- base CDLM: ,
- CDLM + concept module: .

In both families, the base and concept-module lines are essentially parallel in log–log space. The scaling exponents stay in the same range, which means both variants benefit from additional compute at similar rates. The concept module mostly shows up as a small vertical shift: a modest constant penalty in loss or error at fixed FLOPs.
## Interpretability Scaling Laws
One might expect interpretability to degrade as models grow.
Larger networks have more capacity to route around bottlenecks and blur whatever structure you've imposed.
To test whether this happens, we track two properties: whether concepts remain meaningful (interpretability) and whether pathways stay cleanly separated (disentanglement).

**Interpretability: Concept Detection**\
Concept detection stays flat across model sizes. We measure how reliably the concept module detects concepts by computing segment-level AUC against ground-truth annotations, where higher values indicate more accurate detection.
**Interpretability: Token Control**\
Token control tells a similar story. Here we measure how focused each concept's influence is on specific vocabulary items: the relative probability mass a concept places on its top-10 tokens through the LM head. Higher values mean sharper, more selective control. Concepts don't become diffuse as models grow.
**Disentanglement: Independence Loss (HSIC)**\
Turning to disentanglement, we use the Hilbert-Schmidt Independence Criterion to quantify how much the concept module representations correlate with the residual. Lower HSIC values indicate better independence between pathways. Counter-intuitively, this independence actually improves with scale.
**Disentanglement: Probing Gap**\
Finally, the probing gap measures the interpretability benefit of the concept module directly: the AUC difference when probing concept representations versus residual representations for the same detection task. A positive gap means the concept pathway carries signal the residual doesn't. That advantage holds at scale.
## Conclusion
Taken together, these results demonstrate that adding a concept module can make foundation models interpretable without sacrificing scale. In early 2026, we will release `Steerling-8B`, an 8B-parameter causal diffusion model trained from scratch with the concept module.
---
--- title: Causal Diffusion Language Models
--- date: Thu Dec 04 2025 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/block-causal-diffusion-language-model/
## Causal Diffusion Language Models
We have developed a new type of discrete diffusion language model that replaces traditional full attention with block causal attention.
We scaled this architecture to billions of parameters and observed that the model generates coherent text without incurring downstream performance trade-offs on benchmarks, compared to standard diffusion alternatives.
Below, we show samples generated from a 1.5 billion parameter causal diffusion language model (CDLM) trained on 150 billion tokens from the Nemotron-CC-HQ corpus.
## Introduction
Contemporary language models fall into two broad families: Autoregressive models (AR) and [Diffusion language models (DLMs)](https://arxiv.org/abs/2107.03006).
AR models, GPT style, learn to extend a prefix by predicting the next token, whereas [discrete](https://arxiv.org/abs/2502.09992) [diffusion](https://arxiv.org/abs/2410.18514) [models](https://arxiv.org/abs/2406.07524) learn to reconstruct tokens at all positions from corrupted versions of the input.
This difference in training objectives leads to distinct sampling dynamics and trade-offs in interpretability, controllability, and scaling properties. AR models produce a long chain of single-token completions, whereas diffusion models expose joint, multi-token updates, which is attractive for settings where steering and interpretability are critical.
In this post, we:
1. Provide an overview of AR (GPT-style) and standard discrete diffusion language models. We highlight specific aspects of each model class that hinder interpretability, our principal requirement.
2. Introduce **Causal Diffusion**, a diffusion language model variant that incorporates a block-causal attention structure.
3. Demonstrate empirically that CDLMs outperform standard diffusion models across a variety of benchmarks and compare favorably with their autoregressive counterparts.
## Motivation
At Guide Labs, we are building models whose internal representations can be explained and steered in terms of topics and concepts that a human can easily understand. Our approach is to constrain the model's representations during training to encode these human-understandable concepts directly. For any group of output tokens, we would like to trace their generation to specific concepts that are causally responsible for their generation.
First, let us define a concept as a coherent, atomic, human-meaningful unit. Concepts rarely reduce to a single token; instead, a concept induces a distribution over many tokens in the model's vocabulary. For example, a document about economics is more likely to contain words such as supply, demand, trade, and bank. A document about machine learning emphasizes neural networks, optimization, and models. Abstract behavioral concepts like politeness, regret, apology, and sarcasm manifest across phrases and clauses, not isolated tokens.
Simultaneously explaining and steering a model's output using concepts requires tight coupling between the model's output and the concepts. This is fundamentally challenging for autoregressive models: while these models can compute losses over multiple tokens, their underlying autoregressive factorization still prioritizes single-token decisions. As a result, multi-token concepts are not easily expressed or controlled as coherent units, and although AR models can generate long sequences, the next-token objective limits concept-level controllability over extended contexts.
Because concepts span multiple tokens, models that expose joint, multi-token update steps provide a more natural interface for concept-level control. Diffusion models meet this requirement: at each unmasking step, they update many positions simultaneously, allowing concept-level interventions that propagate in a coherent and interpretable way.
## Overview of Language Model classes
At a high level, a language model processes a sequence and determines how each position should be updated.
One main difference between AR and diffusion LMs is in the dependency allowed between tokens, as defined by the attention mask.
For instance, given the sequence 'The cat sat on the', an autoregressive model restricts each token to attend only to earlier tokens, while a diffusion model permits every token to draw on context from the entire sequence. We will illustrate here how we relax such restrictions and arrive at a diffusion LM with a new attention structure.
### Autoregressive Language Models
These are by far the most popular type of LLMs. Most frontier systems live in this family.
Autoregressive (AR) models are trained to predict the next token given the previous ones, so they're often referred to as **next-token predictors**.
Concretely, they use a transformer with a **causal attention mask** that lets each token *attend* only to earlier tokens.
**Pros:** Excellent empirical scaling, lots of engineering experience, very strong performance across domains.
**Cons:** The model generates one token at a time, which makes it less natural to control properties that are expressed over multi-token spans.
### Diffusion Language Models
Diffusion language models take almost the opposite approach.
They are often referred to as **any-order models**: they're trained with a **bidirectional attention mask**, randomly masking tokens and learning to predict the masked ones.
Recently, [Nie et al. (2025)](https://arxiv.org/abs/2502.09992) successfully trained the first large diffusion language model (LLaDA) from scratch, achieving performance competitive with state-of-the-art open-source AR models. Since then, several commercial DLMs have emerged, showing very low generation latency due to parallel decoding, including Mercury, Gemini-diffusion and Seed Diffusion.
**Pros:** Any-order modeling is naturally suited to tasks with non-causal dependencies, and multi-token generation is built in, so inference can be very fast. It also aligns well with scenarios where we ultimately want to control behaviour over multi-token spans.
**Cons:** For a fixed compute budget, their scaling is noticeably weaker than AR models. They're also quite new, so the "right" architectures and training tricks are still being figured out.
## From Diffusion to Blockwise Diffusion
Standard diffusion LMs like [LLaDA](https://arxiv.org/abs/2502.09992) train with **full attention** and random masking patterns, but often generate with a **semi-autoregressive schedule**: the sequence is split into blocks, and at each step the model denoises one block before moving to the next.
This leads to a **train/inference mismatch**.
During training, every token can in principle see every other token with random masking patterns.
But during inference, tokens in a block can only see the current partially denoised block and past blocks, while future blocks are completely masked.
[Block Diffusion LMs](https://arxiv.org/abs/2503.09573) were introduced to fix this mismatch.
They use a **block-causal attention mask** that is bidirectional within a block but causal across blocks, and they train on a concatenation of a noisy half and a clean half of the sequence.
The intuition is straightforward: the noisy half learns to denoise the current block, while the clean half provides the already-denoised history of previous blocks. This approach removes the train/inference mismatch, but comes with a significant cost. The sequence is effectively duplicated, with half the context wasted on carrying clean copies of tokens that also appear in the noisy half. This makes Block Diffusion models roughly **2× more expensive** to train than comparable DLMs with the same context window.
## Causal Diffusion
The main issue with Block Diffusion is its 2× training cost due to the clean half duplication, so one obvious question is: **Do we really need the clean half?** Our answer is **no**. We found we can keep the nice semi-autoregressive structure of Block Diffusion while dropping the clean half entirely. We call the resulting models **Causal Diffusion Language Models**.
At a high level, a Causal Diffusion LM has:
- **A single noisy sequence:** We do not concatenate noisy + clean; we diffuse over one sequence
- **Block-causal attention:** Within a block: full bidirectional attention. Across blocks: causal, where a block can only attend to previous blocks and itself, never future blocks
- **Per-block noise schedules:** Different blocks in the same sequence can be at different noise levels, but they all live in the same stream of tokens
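To make the block-causal pattern concrete, here is a minimal NumPy sketch of how such a mask could be built. The function name `block_causal_mask` and the boolean convention (`True` means "may attend") are illustrative assumptions, not our training code.

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Return a (seq_len, seq_len) boolean mask.

    mask[i, j] is True when query position i may attend to key position j:
    always within the same block, and across blocks only when j's block
    comes before i's block.
    """
    block_ids = np.arange(seq_len) // block_size  # block index of each position
    return block_ids[:, None] >= block_ids[None, :]

mask = block_causal_mask(seq_len=6, block_size=2)
print(mask.astype(int))
```

With `block_size=1` this reduces to the usual causal (lower-triangular) mask of an AR model, and with `block_size=seq_len` it becomes the full attention of a standard diffusion LM.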
Compared to other approaches, Causal Diffusion offers distinct advantages. Versus autoregressive models, we still have a causal structure across blocks but can update **whole blocks of tokens** at once. Versus standard diffusion LMs, we keep denoising-style training but no longer rely on full attention. Versus Block Diffusion, we keep the block-causal structure but **drop the clean half** and its 2× context overhead.
This blockwise structure isn't just a decoding trick. Many attributes we ultimately care about (*tone*, *formality*, *hedging*, whether a passage expresses *regret* or *apology*) are usually expressed over **multiple tokens at once**. Having an architecture that naturally reasons at the **block level**, rather than only one token at a time, will be important later when we start thinking about controlling these higher-level properties.
To illustrate these different attention patterns, here's a side-by-side comparison of all the approaches we've discussed:
Figure: Attention patterns for Autoregressive, Diffusion, Block Diffusion, and Causal Diffusion language models.
The visual shows how our Causal Diffusion approach (panel 4) creates a block lower triangular attention pattern that combines the best aspects of autoregressive causality with block-level updates.
In practice, Causal Diffusion models are much cheaper to train than Block Diffusion for similar quality.
To show this, we trained 246M-parameter Block Diffusion and Causal Diffusion models on the same data and tracked compute and performance over training.

On the **left**, we plot *training tokens vs. estimated FLOPs*. Because Block Diffusion carries both a noisy and a clean copy of the sequence, it uses roughly **2× more FLOPs** than Causal Diffusion for the same number of tokens.
On the **right**, we plot *training tokens vs. average LM Harness accuracy* (HellaSwag, OpenBookQA, ARC-Challenge, PIQA, WinoGrande). For a fixed number of tokens, the two models reach **very similar accuracy**.
In other words: Causal Diffusion keeps the benefits of Block Diffusion while delivering comparable performance at roughly half the training compute.
## Scaling Behavior
So far this has all been about architecture. To decide whether Causal Diffusion is worth scaling, we need to look at **performance vs compute**.
### Setup
We trained three 1.5B-parameter models, one from each family: AR (causal attention), Diffusion LM (full attention), and Causal Diffusion LM (block-causal diffusion).
All models were trained on **150B tokens** from a subset of Nemotron-CC-HQ.
Along the training trajectory we evaluated them on five LM Harness tasks: HellaSwag, OpenBookQA, ARC-Challenge, PIQA, WinoGrande.
We use LM Harness rather than validation loss because loss functions differ fundamentally between AR and diffusion models, making direct comparison difficult.
For the AR model we use standard log-likelihood scoring.
For the diffusion-style models we cannot compute exact likelihoods, so we follow prior work and use Monte Carlo sampling (128 samples) to estimate the likelihood.
We report the average LM Harness accuracy across the five tasks.
### Accuracy vs Tokens
First, we look at accuracy vs tokens seen:
As training progresses, AR consistently achieves the highest average LM Harness accuracy for a given number of tokens.
The standard diffusion model lags behind across most of the trajectory.
However, the causal diffusion model sits between the two: noticeably better than the diffusion baseline and closer to AR than to standard diffusion.
More concretely, we find that AR is superior for raw performance per token, but Causal Diffusion is a substantial improvement over full-attention diffusion while retaining the blockwise decoding structure.
### Scaling Laws: Error vs Compute
To make the comparison more explicit, we also fit simple scaling laws in **compute–error space**. We approximate compute as C ≈ 6 N D, where N is the number of non-embedding parameters, D is the number of training tokens seen, and C is the training FLOPs.
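For each family, we compute $\log C$ and $\log E$, where $E = 1 - \text{accuracy}$ is the error, and fit a straight line $\log E = a + b \log C$, which corresponds to the power law $E = e^{a} C^{b}$.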
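As a rough illustration of this kind of fit, the sketch below approximates compute as C ≈ 6ND and fits a straight line in log-log space with NumPy. The data points are synthetic and exist only to exercise the code; they are not our measurements. The helper names `training_flops` and `fit_power_law` are illustrative, and treating error as one minus average accuracy is an assumption.

```python
import numpy as np

def training_flops(n_params, n_tokens):
    """Approximate training compute, C ≈ 6 N D."""
    return 6.0 * n_params * np.asarray(n_tokens, dtype=float)

def fit_power_law(compute, error):
    """Fit log(error) = a + b * log(compute).

    Returns (A, b) such that error ≈ A * compute**b, with A = exp(a).
    """
    slope, intercept = np.polyfit(np.log(compute), np.log(error), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic example: a 1.5B-parameter model evaluated at four checkpoints.
# The error values follow an exact power law that we made up for this demo.
tokens = np.array([10e9, 50e9, 100e9, 150e9])
compute = training_flops(1.5e9, tokens)
error = 2.0 * compute ** -0.05

A, b = fit_power_law(compute, error)
print(f"error ≈ {A:.3f} * C^{b:.3f}")
```

A more negative exponent `b` means error falls faster as compute grows, which is what "steeper slope" refers to when comparing model families.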
The fitted scaling laws are:
- **Diffusion LM:**
- **Causal Diffusion:**
- **AR:**

Two patterns stand out. The Diffusion LM line has a flatter slope: more compute helps, but more slowly.
AR and Causal Diffusion have very similar, but steeper, slopes: both benefit substantially from additional compute.
In other words: Causal Diffusion inherits the good scaling behavior of AR models, while diffusion models lag behind.
### Where Causal Diffusion Fits
Putting everything together, each approach has distinct trade-offs. Autoregressive LMs achieve the best scaling per FLOP but are inherently one-token-at-a-time. Diffusion LMs support multi-token generation and any-order dependencies but scale worse per unit compute. Block Diffusion LMs fix the train/inference mismatch with block-causal structure but pay roughly 2× compute due to the clean plus noisy halves.
Causal Diffusion LMs keep the block-causal attention and multi-token generation, drop the clean half to become cheaper than Block Diffusion, and show AR-like scaling exponents with respect to compute.
Causal diffusion is a diffusion language model endowed with two properties: autoregressive scaling behavior and multi-token generation.
In the next post, this chunkwise view will be crucial for controlling higher-level properties of the output, which are usually expressed over spans rather than single tokens.
Our main takeaway is: Causal Diffusion provides the **scaling behavior of AR** and the **blockwise generation of diffusion** models.
---
--- title: Atlas: Orienting the Pre-Training data of an LLM
--- date: Tue Dec 02 2025 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/atlas-concept-annotated-pretraining-release/
## Atlas: Orienting the Pre-Training data of an LLM
We built Atlas, an automated system for annotating language modeling corpora with human-understandable concepts at a sub-document level.
Using this system, we annotated a 1.5 trillion-token corpus spanning webtext, scientific writing, code, and synthetic data with over 33,000 concepts across science, technology, philosophy, medicine, and law.
These annotations allow us to train interpretable language models whose representations are aligned with human-meaningful abstractions.
Beyond model training, the annotations enable transparent model auditing, contamination detection, and fine-grained model control.
We have replicated the system on FineWeb and will be releasing `concept-fineweb-10b`, a 10-billion-token corpus annotated with its own data-derived concept library.
The Concept Atlas below is an interactive visualization (UMAP projection) showcasing a representative 10% subset (3,372 concepts) from our comprehensive set of 33,732 concepts.
Below are two examples: one text about mythology and another about orbital mechanics.
We demonstrate how Atlas maps the tags associated with each text to human-interpretable concepts and descriptions within the concept library.
**Enkimdu and Inanna**

Enkimdu is featured prominently in the myth "Inanna Prefers the Farmer," in which both he and the god Dumuzi are attempting to win the hand of the goddess Inanna. While Inanna is quite infatuated with the down-to-earth farmer, her brother Utu/Shamash attempts to convince her to marry Dumuzi instead. Both Dumuzi and Enkimdu face off in an argument over who will win Inanna.

Tags → Concepts

| Tag | Concept Name | Concept Description |
| --- | --- | --- |
| mythology | Mythological narratives | Narratives, characters, and themes from traditional stories |
| mesopotamian-mythology | Ancient Mesopotamian studies | History, literature, mythology of ancient Mesopotamia |
| god-relationships | Divine being classifications | Characteristics, roles, interactions of divine beings |
| goddess-suitors | Divine Feminine Figures | Divine feminine attributes and symbolic representations |
| character-conflicts | Interpersonal disputes | Tensions between individuals or groups |
**Tidally Driven Libration**

We conclude that even among the bodies of the Solar system, a large variety of libration spectra should be found, with relatively inviscid, icy satellites such as Io and possibly Titan exhibiting tidally driven, cos M-dominated, libration of much reduced amplitudes due to the tidal feedback.

Tags → Concepts

| Tag | Concept Name | Concept Description |
| --- | --- | --- |
| planetary-dynamics | Orbital mechanics | Motion and gravitational interactions of celestial bodies |
| theoretical-analysis | Mathematical modeling | Theoretical frameworks for physical phenomena |
| celestial-bodies | Solar system objects | Planets, moons, and other bodies in the solar system |
| icy-satellites | Planetary moons | Natural satellites with icy compositions |
| tidal-dissipation | Gravitational effects | Energy loss due to tidal forces |
## Introduction
At Guide Labs, we are building models whose reasoning and representations are transparent, so that humans can audit and understand them.
Achieving this goal requires that a model’s internal representations align with _concepts_: coherent and atomic human-meaningful units, rather than inscrutable statistical features and correlations.
We set out to pre-train language models on large-scale corpora in a way that constrains the model’s internal representations to be cleanly decomposable into these concepts.
Consequently, we needed a comprehensive concept library for contemporary LLM pre-training corpora.
**Goals**.
With this vision in mind, we set out to build a concept library suitable for supervising large-scale LLMs during pre-training, mid-training, and post-training. Such a library must be:
- *Multi-scale:* covering both high-level themes and fine-grained units.
- *Stylistically expressive:* capturing attributes like tone or formality, not just semantic categories.
- *Localizable:* applicable to spans within a document, enabling sub-document-level control and edits.
- *Human-meaningful:* comprising concepts that people actually care about understanding and controlling.
- *Representative:* spanning broadly enough to cover the true distribution of large-scale LLM pre-training corpora across webtext, code, math, and scientific writing.
### No human-interpretable concept library for LLM pre-training data existed at scale
Before building our concept library, we surveyed existing approaches in the language modeling literature for large-scale concept extraction.
We found no concept library that was both human-interpretable and representative of real pre-training data at scale.
Broadly, concept libraries fall into three categories.
**Word-based concept dictionaries.**
For example, [Luo et al.](https://arxiv.org/abs/2406.04331) construct a 40,000-item concept dictionary by selecting the most frequent words from the [Brown Corpus](https://www.kaggle.com/datasets/nltkdata/brown-corpus) and prompting GPT-4 to generate sentences illustrating each word.
However, single words cannot capture the higher-level abstractions, multi-sentence topics, or domain-specific ideas that appear throughout pre-training corpora.
**Activation-derived, unsupervised concepts.**
Other works extract concepts directly from [model activations](https://transformer-circuits.pub/2024/scaling-monosemanticity/), such as directions discovered via [sparse autoencoders (SAE)](https://transformer-circuits.pub/2023/monosemantic-features/index.html).
It has become [doubtful](https://arxiv.org/abs/2501.16615) whether these concepts reflect the model’s internal structure.
Further, SAE-derived concepts are not necessarily human-meaningful.
**Narrow-domain supervised libraries.**
Some concept sets focus on specific tasks such as [sentiment or toxicity](https://arxiv.org/abs/2412.07992).
These offer high-quality labels but are too narrow to supervise models trained on diverse corpora spanning webtext, code, math, and scientific writing.
**Our contribution.**
In the rest of this post, we describe how we built a concept library with over 33,000 concepts, covering webtext, code, mathematical content, and scientific topics.
We begin with an overview of the datasets we have annotated, then walk through our annotation pipeline.
Finally, we show how Atlas enables the deduplication of millions of raw, freeform annotations into a canonical set of coherent, human-meaningful concepts suitable for supervising LLMs.
## Methods
To develop models whose internal representations can be aligned with human-interpretable concepts, we first needed a concept-annotated pre-training corpus large enough to reflect the full conceptual breadth of modern LLM datasets.

We began by constructing a representative sample of our pre-training and mid-training data mixtures.
This sample spans: webtext, code, math, scientific writing, encyclopedic content, question–answer exchanges, and instruction-following data.
In total, we collected 6.6 million documents, balanced across six major document categories (roughly one million documents each), including:
- Webtext: [DCLM (deduplicated)](https://huggingface.co/datasets/Zyphra/dclm-dedup)
- General academic knowledge: peS2o, arXiv, Wikipedia, and Wikibooks
- Math: Dolmino-math, including GSM8K
- Code: StarCoder
- Q&A: FLAN v2
[DCLM](https://arxiv.org/pdf/2406.11794) is a high-quality dataset that uses model-based quality filtering to filter a large subset of the Common Crawl for similarity to OpenHermes and other instruction-tuning datasets.
DCLM contained a large fraction of duplicates (approximately 80% duplicated content).
Therefore, we used the deduplicated version of DCLM from [Zyda-2](https://huggingface.co/datasets/Zyphra/dclm-dedup).
Our pre-training and mid-training data mixture follows the [Olmo](https://arxiv.org/abs/2501.00656) [recipe](https://huggingface.co/datasets/allenai/dolmino-mix-1124) with some notable exceptions.
These documents vary substantially in style, purpose, and conceptual density, making them an ideal substrate for large-scale concept extraction.
From these documents we generated 44 million text chunks, each representing a short semantically coherent span (typically 128–256 tokens depending on the domain).
These chunks form the fundamental units we annotate.
Aggregated over all domains, our annotation pipeline produced nearly 500 million tags, of which 14 million were unique tags and 1 million appeared more than 15 times.
This large, diverse, and redundant tag space is essential for the clustering and canonicalization procedures that follow.
### Evaluation Procedure: LLM Concept Validation
Across all stages of the pipeline, we rely on a unified evaluation framework to measure annotation quality, concept coherence, and classifier performance.
Each annotation, whether a raw tag, a canonicalized concept, or a predicted label from the final concept annotator model, is scored by human raters and LLM judges on a 1–5 scale.
We designed the benchmark such that a score of 2 or higher is considered successful, i.e., the concept is at least minimally present.
This threshold is intentionally permissive: minor tags capture fine-grained contextual details.
We visualize these scores using per-tag and per-chunk histograms.
This evaluation method appears repeatedly throughout the rest of the post, as we assess:
- the recall and relevance of raw tags (Stage 1),
- the coherence and redundancy of concept clusters (Stage 2), and
- the accuracy and calibration of predictions from the final concept annotator model (Stage 3).
## Stage 1: Documents → Tags
In this stage, we seek to convert raw documents into structured semantic tags that describe the key concepts present in them.
These raw tags are intentionally allowed to be broad, granular, and overlapping; hence, they serve as the raw material from which we will later derive a canonical concept library.
For this stage, the priority is coverage rather than precision: we want as many concept candidates as possible, across all domains and levels of granularity.
At a high level, Stage 1 consists of three operations:
- splitting documents into chunk-sized spans,
- prompting a model to generate structured annotations for each span, and
- validating these annotations using the global scoring methodology described in the overview section.
The remainder of this section details the engineering decisions behind each step and the empirical results that demonstrate the success of Stage 1.
### Chunking: Converting Documents into Chunk-Level Annotation Units
LLMs often do not annotate whole documents well: long documents contain multiple unrelated concepts, and annotation models tend to default to high-level summaries rather than the granular conceptual units.
We therefore annotate at the chunk level, typically capturing one coherent idea, argument, problem, or algorithmic structure. A chunk is a short, semantically coherent segment of text, typically 128–256 tokens, created by concatenating consecutive sentences until a domain-specific token limit is reached.
Chunks are the fundamental units of annotation in Stage 1: every chunk receives a structured set of tags describing its conceptual content.
We perform high-speed chunking by first detecting sentence boundaries using blingfire, then tokenizing and concatenating sentences until a domain-specific threshold is reached: 150 tokens for webtext and general documents, 256 tokens for math and code, where a single idea requires more context.
Sentences which exceed the threshold are treated as single chunks, unless they exceed 50,000 tokens (in which case they are dropped).
This adaptive thresholding ensures that each chunk contains a complete semantic unit rather than arbitrary fragments.
This process yielded 44 million chunks from 6.6 million documents.
Chunking at this granularity proved essential: concept labels applied at the chunk level map closely to local meaning, enabling fine-grained conceptual decomposition.
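The chunking procedure described in this section can be sketched as follows. This is a simplified illustration, not our production code: the regex sentence splitter stands in for blingfire, whitespace word counts stand in for a real tokenizer, and the names `chunk_document`, `split_sentences`, and `count_tokens` are hypothetical. How a sentence that would overflow the limit is handled (it starts a new chunk) is also an assumption.

```python
import re

MAX_SENTENCE_TOKENS = 50_000  # sentences longer than this are dropped
TOKEN_LIMITS = {"webtext": 150, "general": 150, "math": 256, "code": 256}

def split_sentences(text):
    # Stand-in for blingfire: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def count_tokens(sentence):
    # Stand-in for a real tokenizer: count whitespace-delimited words.
    return len(sentence.split())

def chunk_document(text, domain="webtext", limit=None):
    """Concatenate consecutive sentences into chunks of at most `limit` tokens."""
    limit = TOKEN_LIMITS[domain] if limit is None else limit
    chunks, current, current_len = [], [], 0
    for sentence in split_sentences(text):
        n = count_tokens(sentence)
        if n > MAX_SENTENCE_TOKENS:
            continue  # drop pathological sentences entirely
        if current and current_len + n > limit:
            # Adding this sentence would overflow the limit: close the chunk.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than `limit` ends up as its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

demo = chunk_document("One two three. Four five six. Seven eight nine.", limit=6)
print(demo)
```

Because chunks only ever break at sentence boundaries, a chunk never ends mid-sentence, which is the property the adaptive threshold is meant to preserve.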
### Structured Annotation Schemas: Domain-Aware Tagging
Different domains express conceptual structure differently.
A Wikipedia article, a math proof, a Stack Exchange Q&A, and a block of Python code demand different concept ontologies.
We therefore extracted domain-specific structured tag schemas, each containing 4–6 domains or fields tailored to the content type.
Each field expands into a hierarchical tag, from broad to narrow and granular, and together they produce on average 10–15 tags per text chunk.
These chunk-extracted tags form the basis for Stage 2 of our Atlas pipeline.
- doc-type: code comments, API descriptions, tutorials
- style: formal vs. informal documentation
- Fields: main, purpose, method, minor
- Similar to webtext, with "tone" typically absent
- More emphasis on scientific purpose or method
- Example: Orbital Mechanics
- minor: celestial-bodies→icy-satellites
The examples above illustrate the consistency and richness of the structured fields: each annotation combines topic, purpose, structure, and secondary concepts, generating a high-recall conceptual snapshot of each chunk.
### Annotator Model Selection
We evaluated several open-weight models for annotation.
While the Phi models could output consistently structured annotations, they suffered from repetition and frequent collapse into degenerate loops, making them unusable for long-running annotation.
Qwen 2.5 7B performed better semantically but more unpredictably syntactically, often producing the wrong number of fields and generating responses that were difficult to parse reliably.
In contrast, Mistral Small 3.1 (a 24B model) consistently adhered to the structured format, avoided repetition collapse, and maintained stable behavior across millions of prompts.
In the end, Mistral was the smallest model that satisfied our formatting and reliability constraints; even a 1% schema deviation across 44 million chunks would produce nearly half a million unusable annotations, so predictability was essential.
### Output of Stage 1: A High-Recall Pool of Raw Tags
Stage 1 produced:
- 44 million annotated chunks
- 500 million short tags extracted from structured fields
- 14 million unique short tags after deduplication
- 1 million short tags with >15 occurrences
This raw tag space is intentionally redundant and noisy.
The purpose of Stage 1 is *not* to produce the final concept inventory; it is to cast a wide semantic net.
The refinement and consolidation into 33,000 canonical concepts happens in Stage 2.
Using the global scoring framework introduced earlier, where both humans and LLM judges rate each tag and chunk on a 1–5 scale, we evaluated the output of Stage 1.
A score of 2 or higher counts as successful, reflecting both the high-recall objective of this stage and the inherent granularity of minor tags.
Across all domains, almost every tag scored at least a 2, and chunk-level averages were between 3.5 and 4.2.
Human audits confirmed that the tags were conceptually relevant and coherent.
## Stage 2: Tags → Concepts
Stage 2 transforms millions of noisy, free-form tags into a coherent library of more than 33,000 human-interpretable concepts.
Through large-scale embedding, clustering, LLM-based cluster coherence evaluation, concept labeling, and graph-based deduplication, we create a diverse and comprehensive concept inventory that reflects the true semantic structure of our data corpus.
### Tag Normalization and Embedding
The first challenge is that the short tags generated in Stage 1 are highly variable.
Minor formatting differences (hyphens, slashes, whitespace, punctuation artifacts) can split semantically identical tags into separate strings.
Before clustering, we normalize all tags into a standardized form, with examples shown below.
| Raw Tag | Normalized Tag |
| --- | --- |
| astronomical-objects | astronomical objects |
| climate-change \n adaptation | climate change adaptation |
| machine-learning / ai | machine learning ai |
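A normalization step along these lines could look like the sketch below. The exact rules we apply are not spelled out here, so treat the regexes and the function name `normalize_tag` as assumptions chosen to reproduce the three examples above.

```python
import re

def normalize_tag(tag: str) -> str:
    """Map formatting variants of a tag onto one canonical string."""
    tag = tag.replace("\\n", " ")        # literal backslash-n left over in raw output
    tag = re.sub(r"[-/_]", " ", tag)     # hyphens, slashes, underscores become spaces
    tag = re.sub(r"[^\w\s]", " ", tag)   # any remaining punctuation becomes a space
    return re.sub(r"\s+", " ", tag).strip()  # collapse whitespace, including real newlines

examples = ["astronomical-objects", "climate-change \\n adaptation", "machine-learning / ai"]
normalized = [normalize_tag(t) for t in examples]
print(normalized)
```

Deduplicating after this step is then a matter of counting identical normalized strings.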

After standardizing and deduplicating raw tags, we obtain nearly 14 million unique normalized tags for semantic modeling. Most tags appear very infrequently (fewer than 5 times), whereas tags that appear more than 10 times constitute only about 10% of the total, as shown in the tag distribution plot above.
To capture the meaning of each tag, we used the `all-mpnet-base-v2` embedding model to embed and convert them into 768-dimensional vectors. These vectors are dense, numerical representations that encode the semantic similarity between tags in a high-dimensional space. We experimented with other embedding models, such as `Qwen3-Embedding-0.6B`, but `all-mpnet-base-v2` produced slightly better-quality clusters, confirmed by standard metrics like the Silhouette score and Davies–Bouldin index (discussed below). This embedding step produces 14 million vectors, one for every unique tag, capturing the conceptual hint extracted from the original text chunks.
### Clustering Tags into Semantic Groups
With all tags embedded in a common vector space, we cluster them into groups representing potential concepts. The goal is to place semantically similar tags, e.g., “frozen planet,” “icy satellites,” and “interstellar ice diffusion,” in the same conceptual region or cluster. We cluster the tag embeddings with the K-Means algorithm, implemented using the FAISS library, which we chose for its ability to handle tens of millions of vectors on GPU. K-Means is an iterative algorithm that determines a set of K cluster centers (centroids) by minimizing the Within-Cluster Sum of Squares (WCSS):

$$\mathrm{WCSS} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $C_k$ is the set of tag embeddings assigned to cluster $k$ and $\mu_k$ is its centroid. In our implementation, each tag embedding is assigned to the nearest centroid by maximizing cosine similarity, which is equivalent to minimizing L2 distance when the embeddings and centroids are unit-normalized.
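For intuition, here is a small NumPy sketch of K-Means with cosine-similarity assignment on unit-normalized vectors. It is a stand-in for illustration only: our pipeline uses FAISS, not this code, and the function name `cosine_kmeans` and its defaults are assumptions.

```python
import numpy as np

def cosine_kmeans(x, k, n_iter=20, seed=0):
    """Cluster rows of x by cosine similarity; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)           # unit-normalize inputs
    centroids = x[rng.choice(len(x), size=k, replace=False)]   # initialize from data points
    for _ in range(n_iter):
        assign = (x @ centroids.T).argmax(axis=1)              # nearest centroid by cosine
        for j in range(k):
            members = x[assign == j]
            if len(members):                                   # leave empty clusters as-is
                mean = members.mean(axis=0)
                centroids[j] = mean / np.linalg.norm(mean)     # re-normalize the centroid
    assign = (x @ centroids.T).argmax(axis=1)
    return assign, centroids

# Toy data: two groups of identical directions.
points = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
assign, centroids = cosine_kmeans(points, k=2)
print(assign)
```

At the scale of 14 million 768-dimensional vectors, the dense similarity matrix above would be replaced by a GPU library's batched nearest-centroid search.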
Because the number of true underlying concepts is unknown, we sweep over a wide range of cluster sizes: `k ∈ {100, 500, 1k, 10k, 20k, 30k, 50k, 80k, 100k}`.
Small k values (fewer clusters) result in very large, diverse clusters that contain a wide variety of tags. Large k values (many clusters) naturally create tighter, more specific clusters, but this risks fragmenting genuinely related concepts or introducing clusters based on noise. We evaluate the quality of each k value using multiple metrics to select the near-optimal number of clusters:
1. Silhouette Score: Measures how similar each point is to its own cluster compared to other clusters (higher is better, range -1 to 1).
2. Davies-Bouldin Index: Measures the average ratio of within-cluster distances to between-cluster distances (lower is better).
3. LLM-based Coherence Scoring: A subjective measure of the semantic quality and cohesiveness of the tags within each cluster (higher is better).
The Silhouette score peaks early and then decreases with k, while the Davies–Bouldin index decreases monotonically. Since we compute these metrics on a sub-sample of the 14 million points, they don't accurately represent cluster quality at different k values. We therefore rely on LLM-based coherence evaluation.
We perform a semantic coherence evaluation on every cluster.
This is necessary because not all K-Means clusters represent genuinely meaningful concepts—some may be artifacts of the embedding process or lexical coincidences.
For each cluster, we sample three strata of tags: Core tags (closest to centroid), Random tags (uniform sample), and Edge tags (farthest from centroid).
We group tags into sets of 10 and query an LLM (Mistral-Small-3.1-24B) to score coherence on a 1–10 scale.
Coherence improves steadily as k increases, plateauing around k = 60,000 – 80,000.
Beyond this range, computational cost rises while semantic gains diminish.
We select k = 80,000, producing clusters that balance granularity, separation, and interpretability.
Most clusters contain 200–500 tags, though sizes vary depending on domain density and tag frequency.

We retain only high-quality clusters that meet minimum coherence thresholds: core coherence ≥ 9, random coherence ≥ 8, and edge coherence ≥ 7. These thresholds ensure clusters are conceptually tight at their centers and maintain semantic coherence at their edges. Of the initial 80,000 clusters, 17,443 failed our criteria and were removed, leaving 62,557 high-quality semantic units. This step substantially improves the signal-to-noise ratio and prepares clusters for labeling into human-understandable concepts.
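The stratified sampling and threshold filter can be sketched as below. The LLM scoring call itself is omitted; `sample_strata` and `passes_coherence` are hypothetical helper names, and using Euclidean distance to the centroid to rank tags is an assumption.

```python
import numpy as np

# Minimum coherence scores (1-10 scale) a cluster needs in each stratum.
THRESHOLDS = {"core": 9, "random": 8, "edge": 7}

def sample_strata(embeddings, centroid, n=10, seed=0):
    """Return index arrays for the core, random, and edge tags of one cluster."""
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    order = np.argsort(dists)                  # closest to centroid first
    n = min(n, len(order))
    return {
        "core": order[:n],                     # closest to the centroid
        "random": rng.choice(len(order), size=n, replace=False),
        "edge": order[-n:],                    # farthest from the centroid
    }

def passes_coherence(scores):
    """Keep a cluster only if every stratum meets its threshold."""
    return all(scores[name] >= minimum for name, minimum in THRESHOLDS.items())

toy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [10.0, 0.0]])
strata = sample_strata(toy, centroid=np.zeros(2), n=1)
print(strata["core"], strata["edge"])
print(passes_coherence({"core": 9, "random": 8, "edge": 7}))
```

Requiring progressively lower scores from core to edge reflects the expectation that tags far from the centroid are looser matches but should still belong to the same idea.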

### Cluster Labeling and Deduplication: Producing Human-Readable Concepts
With a high-quality set of clusters, we convert each one into a readable, human-interpretable concept.
Each concept receives a 1–6 word label and a concise, descriptive sentence explaining its meaning.
For each cluster, we sample 50–100 representative tags (weighted by frequency) and use a 24B Mistral model to generate a concise label and a one-sentence description.
These labels provide a clean, human-friendly interpretation of dense semantic clusters.
We then embed the LLM-generated concept names and descriptions using a Qwen3-Embedding-0.6B instruction-tuned embedding model, producing a semantic space of concepts.
The interactive visualization below shows the embedding space of tags colored by their assigned clusters. Each point represents a tag, positioned according to its semantic similarity to other tags. Hovering over points reveals the tag text, its cluster assignment, and the mean coherence score of its cluster. The visualization also highlights several low-coherence clusters that were filtered out by our coherence thresholds, demonstrating how quality control removes semantically inconsistent groupings.
The concept embeddings in 1024-dimensional space reveal significant duplication based on name and description similarity. This makes merging similar concepts necessary. The challenge is determining which concepts to merge and at what granularity. We address this through an iterative graph-based merge discovery and concept merging process.
**Concept deduplication.** We embed each concept's name and description using the `Qwen3-Embedding-0.6B` model.
We then find its Top-20 nearest neighbors based on cosine similarity and retain neighbors with similarity above 0.85.
Next, we construct an undirected similarity graph where nodes are concepts and edges connect nodes whose cosine similarity meets the threshold.
We run the Louvain community detection algorithm to identify groups of related concepts.
Each connected component represents a candidate merge set.
For each candidate merge set, we check the group and combine its member concepts by regenerating a label and a concise description using an LLM.
This process reduces the count to around 39,000 concepts. We repeat merging for 2–3 iterations until reaching 35,000 concepts with acceptable diversity among the most similar concepts.
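One iteration of merge discovery can be sketched as follows. To stay dependency-free, this version groups concepts by connected components using union-find; it does not run Louvain community detection, so it is a simplified stand-in for the procedure above. The function name `merge_candidates` is hypothetical, and the dense similarity matrix is only practical for small inputs.

```python
import numpy as np

def merge_candidates(embeddings, top_k=20, threshold=0.85):
    """Group concepts linked by high cosine similarity into candidate merge sets."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -1.0)               # ignore self-similarity
    n = len(x)
    parent = list(range(n))                    # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in np.argsort(-sims[i])[:top_k]: # top-k nearest neighbors of i
            if sims[i, j] >= threshold:
                parent[find(i)] = find(int(j)) # add an undirected edge i-j

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

toy = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
print(merge_candidates(toy))
```

Each returned group would then be handed to an LLM to regenerate a single label and description, and the whole procedure repeated on the reduced concept set.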
### Evaluation
To evaluate the quality of the final canonical concepts and their suitability for downstream annotator training, we applied the same human + LLM scoring framework introduced earlier.
We sampled concepts and the chunks associated with them, scoring both the concept labels and the chunk–concept assignments on the 1–5 scale used throughout this work.
Concept-level evaluation shows that the vast majority of concept labels score between 3.5 and 4.5, with only a small tail below 3—indicating that the labeling and deduplication pipeline consistently produces clear, human-interpretable concepts.
When we evaluate chunk–concept alignment, we achieve a ~97% success rate (score ≥ 2) across the 2.3 million chunks used as training data for content concepts.
This high alignment rate confirms that the clusters are not only semantically coherent internally but also accurately represent the conceptual content present in real text.
### Final Concept Library
This process reduces 62,557 coherent clusters to 33,000 final canonical concepts. These final concepts represent distinct, interpretable elements of the conceptual landscape of our corpus.
Some conceptual overlap and hierarchy is inevitable because real-world knowledge is not cleanly partitioned, but the deduplicated set strikes a practical balance between granularity and clarity.
### Concept Taxonomy
We organized our 14 million unique concept tags into a hierarchical taxonomy based on the Library of Congress classification system, mapping our concepts to approximately 2,600 nodes across a tree structure with 20 top-level branches and ~6,500 total nodes (with depth up to 9 levels).
Below, we show a distribution of concepts across these taxonomy groups. Science (Q) dominates with 38% of concepts, followed by Technology (T) at 15%, Social Sciences (H) at 15%, and Medicine (R) at 8%, with all 20 root branches represented to varying degrees.
Within the heavily populated Science branch, mathematics and physics subcategories are particularly prominent. QA (Mathematics) subdivisions like Analysis (QA299.6-433), Geometry (QA440-699), and Algebra (QA150-272.5) account for over 3,700 concepts combined, while QC (Physics) areas like Atomic/Molecular Physics (QC170-197) contribute another 662 concepts. Technology areas like Telecommunications (TK5101-6720) and Computer Hardware (TK7885-7895) are also substantially represented.
This taxonomy provides a structured framework for understanding the topical coverage of our training data and enables hierarchical analysis of concept distributions across different knowledge domains.

This canonical concept library serves as the foundation for Stage 3, where we train a concept annotator model capable of labeling arbitrary text with these concepts.
## Stage 3: Concept Annotator Model
The final stage of the pipeline is to train a model that can assign concepts directly to text.
Whereas Stage 1 produced raw chunk-level tags and Stage 2 distilled them into a canonical library of ~33,000 coherent concepts, Stage 3 builds a model capable of recognizing these concepts in arbitrary text.
This model, the multi-domain, multi-label concept annotator, is what ultimately enables concept-supervised pre-training across the entire pre-training corpus, including concept-aware fine-tuning.
We aim to predict concepts across four distinct taxonomies: content, tone, demographic, and alignment.
Each category has a different structure and label count.
Rather than maintain four separate classifiers (with separate encoders, heads, thresholds, and inference logic), we train a single unified model that jointly predicts all four domains while sharing computation, embeddings, and inference pipelines.
### Goals and Design Principles
Three design constraints shaped the annotator architecture:
- **One model, multiple domains:** Maintaining four different classifiers would complicate the training and inference stack. A single model with separate output heads keeps the system simple and maintainable.
- **Incorporate the strengths of the KNN baseline:** A simple KNN classifier over Stage 2’s concept embeddings performs surprisingly well on content-label prediction, especially for frequent, well-represented concepts. The question was whether a learned model could *surpass* similarity-based retrieval while retaining its strengths.
- **Positive-Unlabeled (PU) supervision:** The data contains **only positive labels** for each domain: no negative annotations. This requires careful definition of target spaces, loss functions, and evaluation metrics.
These constraints drove the design of a multi-head MLP on top of a shared encoder.
### Supervision: Positive–Unlabeled Learning
Because each chunk is labeled only with the concepts it should have, never with the concepts it explicitly should not have, supervision follows a [Positive and Unlabeled (PU)](https://link.springer.com/article/10.1007/s10994-020-05877-5) paradigm.
Each concept receives one of three labels: **+1** (positive), **0** (unlabeled), or **-1** (explicit negative; rare, used only after optional negative sampling).
During training, we optionally sample a small number of synthetic negatives per example to stabilize learning.
PU learning influences both training (via masked BCE and PU loss) and metrics (via PU-aware scoring callbacks).
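The three-valued target encoding can be sketched as follows. This is a minimal illustration, not the Atlas implementation: the function name, the uniform sampling of synthetic negatives from the unlabeled positions, and the toy label count are all assumptions.

```python
import random

POSITIVE, UNLABELED, NEGATIVE = 1, 0, -1


def build_pu_targets(positive_ids, num_labels, num_negatives=0, rng=None):
    """Return a target vector: +1 for positives, -1 for sampled negatives, 0 elsewhere."""
    rng = rng or random.Random(0)
    targets = [UNLABELED] * num_labels
    for idx in positive_ids:
        targets[idx] = POSITIVE
    # Assumption: synthetic negatives are drawn uniformly from positions
    # that are not known positives.
    candidates = [i for i in range(num_labels) if targets[i] == UNLABELED]
    for idx in rng.sample(candidates, min(num_negatives, len(candidates))):
        targets[idx] = NEGATIVE
    return targets


targets = build_pu_targets(positive_ids=[1, 4], num_labels=8, num_negatives=2)
print(targets)
```

Most positions stay at 0, which is what distinguishes this setup from ordinary multi-label classification: a 0 means "unknown", not "absent".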
### Input Representation and Targets
Atlas tokenizes text using Qwen3-Embedding-0.6B and maps each example to a concept vector via CLS or last-hidden-state pooling. To prevent rare concepts from being drowned out, it uses a rarity-weighted sampling scheme that boosts underrepresented labels during training, which is critical for stable learning on long-tail distributions.
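One way to realize rarity-weighted sampling is sketched below. The exact weighting used in Atlas is not given, so the formula here (each example weighted by the inverse frequency of its rarest label, raised to a hypothetical `power`) is an assumption, as are the function name and the toy labels.

```python
import random
from collections import Counter


def rarity_weights(example_labels, power=0.5):
    """Weight each example by its rarest label: (1 / label frequency) ** power.

    Assumes every example carries at least one label.
    """
    freq = Counter(label for labels in example_labels for label in labels)
    return [
        max((1.0 / freq[label]) ** power for label in labels)
        for labels in example_labels
    ]


# Toy data: "physics" is frequent, "topology" appears only once.
example_labels = [
    ["physics"], ["physics"], ["physics"], ["physics"],
    ["physics", "topology"],
]
weights = rarity_weights(example_labels)

# Draw training indices in proportion to the weights.
rng = random.Random(0)
batch = rng.choices(range(len(example_labels)), weights=weights, k=1000)
print(weights, batch.count(4))
```

A `power` below 1 softens the boost so that very rare labels are sampled more often without dominating every batch.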
### Model Architecture and Encoder
The annotator is a Qwen-based encoder with a shared trunk and domain-specific heads.
A Qwen3-Embedding-0.6B model produces a pooled embedding for the input text.
This embedding flows into a lightweight MLP composed of LayerNorm, ReLU, Dropout, and a projection back to 1024 dimensions.
This trunk supports the tone, demographic, and alignment heads, while content labels follow a slightly different path.
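The shared trunk can be sketched in plain Python as below. The ordering of the components, the presence of a hidden linear layer before the ReLU, the initialization, and the toy sizes are assumptions made for illustration; the text only names LayerNorm, ReLU, Dropout, and a projection back to 1024 dimensions.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def dropout(x, rate, rng, training):
    """Inverted dropout; acts as the identity at inference time."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]


def init_linear(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def trunk(embedding, params, rng, training=False, rate=0.1):
    """LayerNorm -> hidden layer -> ReLU -> Dropout -> projection to the input width."""
    h = layer_norm(embedding)
    h = relu(linear(h, *params["hidden"]))
    h = dropout(h, rate, rng, training)
    return linear(h, *params["project"])


rng = random.Random(0)
DIM, HIDDEN = 8, 16  # toy sizes; the text reports a 1024-dimensional output
params = {
    "hidden": init_linear(DIM, HIDDEN, rng),
    "project": init_linear(HIDDEN, DIM, rng),
}
embedding = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
features = trunk(embedding, params, rng)
print(len(features))
```

Projecting back to the encoder width keeps the trunk output interchangeable with the raw embedding, so each downstream head can be a simple linear layer over a vector of the same size.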
### Content Head: Dot-Product with Label Embeddings
Content concepts are embedded into a label matrix, and predictions are made as dot products between the encoder output and each label embedding, bypassing the shared MLP trunk to preserve the KNN-like signal while still allowing learned improvements.
Tone, demographic, and alignment heads use simpler linear classifiers, which suffice for their lower-cardinality domains.
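A minimal sketch of dot-product scoring against a label matrix follows. The three-dimensional vectors, the sigmoid on top of the scores, and the helper names are illustrative assumptions rather than details taken from the annotator itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def content_scores(encoder_output, label_embeddings):
    """Score each content concept by its dot product with the text embedding."""
    return [
        sum(e * w for e, w in zip(encoder_output, row))
        for row in label_embeddings
    ]


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


encoder_output = [0.6, 0.8, 0.0]  # toy unit-length text embedding
label_embeddings = [
    [0.6, 0.8, 0.0],    # concept 0: same direction as the text
    [0.0, 0.0, 1.0],    # concept 1: orthogonal to the text
    [-0.6, -0.8, 0.0],  # concept 2: opposite direction
]
scores = content_scores(encoder_output, label_embeddings)
probabilities = [sigmoid(s) for s in scores]
ranking = top_k(scores, k=2)
print(ranking)
```

With unit-length vectors the dot product equals cosine similarity, so ranking labels by score is the same operation as nearest-neighbour retrieval over label embeddings. Making the label matrix trainable is what lets a head of this form move beyond a fixed KNN lookup.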
### Losses
We combined two loss functions:
- **Masked Binary Cross-Entropy:** standard BCE applied only to the positions where targets are non-zero.
- **Non-negative PU Loss:** a PU-compatible objective that penalizes overconfident positives on unlabeled classes and stabilizes long-tail learning.
This combination handles both positive-only supervision and the occasional sampled negative.
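The two terms can be sketched as below, using the logistic loss and the standard non-negative PU risk estimator, `prior * R_p^+ + max(0, R_u^- - prior * R_p^-)`. How Atlas weights the terms, which class prior it uses, and whether the risks are averaged over labels or over examples are not stated, so `prior`, `pu_weight`, and the per-example averaging here are assumptions.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def masked_bce(logits, targets):
    """BCE over labeled positions only: +1 means y=1, -1 means y=0, 0 is skipped."""
    losses = []
    for z, t in zip(logits, targets):
        if t == 1:
            losses.append(softplus(-z))  # -log(sigmoid(z))
        elif t == -1:
            losses.append(softplus(z))   # -log(1 - sigmoid(z))
    return _mean(losses)


def nn_pu_loss(logits, targets, prior):
    """Non-negative PU risk over the positive (+1) and unlabeled (0) positions."""
    pos = [z for z, t in zip(logits, targets) if t == 1]
    unl = [z for z, t in zip(logits, targets) if t == 0]
    risk_pos_as_pos = _mean([softplus(-z) for z in pos])
    risk_pos_as_neg = _mean([softplus(z) for z in pos])
    risk_unl_as_neg = _mean([softplus(z) for z in unl])
    # Clamping at zero keeps the estimated negative-class risk from going negative.
    negative_risk = max(0.0, risk_unl_as_neg - prior * risk_pos_as_neg)
    return prior * risk_pos_as_pos + negative_risk


def combined_loss(logits, targets, prior=0.05, pu_weight=1.0):
    return masked_bce(logits, targets) + pu_weight * nn_pu_loss(logits, targets, prior)


logits = [3.0, -2.0, 0.5, -4.0]
targets = [1, 0, 0, -1]
loss = combined_loss(logits, targets)
print(round(loss, 4))
```

The masked term only sees confirmed positives and sampled negatives, while the PU term is the one that exerts pressure on the unlabeled positions without treating them all as true negatives.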
### Evaluation Metrics
We tracked precision-recall metrics: macro average precision (PR-AUC) per domain, Precision@5, and Recall@10.
However, relying primarily on them is problematic for PU (positive-unlabeled) learning: without confirmed negatives, any unlabeled example predicted as positive is counted as a false positive, even if the prediction is actually correct.
This systematically deflates precision and PR-AUC, making model performance appear worse than it is. The metrics still track relative improvement across runs, but absolute values should be interpreted with caution.
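The deflation effect is easy to reproduce on a toy ranking. In the sketch below the label ids, the ranking, and the helper names are hypothetical; label 2 stands for a concept that is genuinely correct but was never annotated.

```python
def precision_at_k(ranked, positives, k):
    """Fraction of the top-k predictions that are in the positive set."""
    return sum(1 for i in ranked[:k] if i in positives) / k


def recall_at_k(ranked, positives, k):
    """Fraction of the positive set that appears in the top-k predictions."""
    if not positives:
        return 0.0
    return sum(1 for i in ranked[:k] if i in positives) / len(positives)


ranked = [7, 2, 9, 4, 1, 0, 3, 5, 6, 8]  # hypothetical label ids, best first
labeled_positives = {7, 9}               # what the annotations contain
true_positives = {7, 2, 9}               # label 2 is correct but unlabeled

observed = precision_at_k(ranked, labeled_positives, k=5)  # 0.4
actual = precision_at_k(ranked, true_positives, k=5)       # 0.6
recall = recall_at_k(ranked, labeled_positives, k=10)      # 1.0
print(observed, actual, recall)
```

The model's ranking is identical in both cases; only the reference set changes. That is why the observed value is a lower bound that remains useful for comparing runs but not as an absolute measure.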
Consequently, we rely mostly on the LLM Concept Validation procedure we discussed earlier.
Using the global scoring framework from earlier stages, we evaluate how well the annotator’s predictions align with human assessments.
Across 2.3 million chunks used in annotator training, we achieve a **~97% success rate** (score ≥ 2) when comparing predicted concepts with human ratings.
Score distributions cluster strongly between 3.5 and 4.3, indicating that the model consistently assigns coherent, meaningful concepts to text across all domains.
This confirms that Stage 3 successfully operationalizes the canonical concept library derived in Stage 2.
## Conclusion
We have presented Atlas, a three-stage pipeline for concept annotation of large-scale LLM pre-training corpora.
By moving from raw documents to high-recall chunk tags to coherent canonical concepts to a unified multi-domain annotator, we create a foundation that enables interpretable model training.
While there is still room for refinement, the combination of large-scale automation and targeted human validation shows that high-quality concept structure can be embedded directly into modern LLM workflows.
---
--- title: Introducing Guide Labs: Engineering Interpretable and Auditable AI Systems
--- date: Sun Nov 17 2024 00:00:00 GMT+0000 (Coordinated Universal Time)
--- url: https://www.guidelabs.ai/post/introducing-guide-labs-engineering-interpretable-and-auditable-ai-systems/
## Introducing Guide Labs: Engineering Interpretable and Auditable AI Systems
Guide Labs is building a new class of interpretable AI systems that humans and domain experts can reliably understand, trust, and debug. As individuals and businesses around the world quickly work to integrate AI into their existing workflows, and governments seek to regulate frontier AI systems, the demand for models that can be reliably debugged, steered, and understood is ever increasing. To help bring interpretable and reliable models to market, we have successfully closed our seed funding of $9.3 million led by Initialized Capital. We are excited to have participation from Tectonic Ventures, Lombard Street Ventures, Pioneer Fund, Y Combinator, E14 Fund, and several prominent angels.
## Current AI systems are not reliable, not interpretable, and are difficult to audit
The prevailing paradigm that most AI companies use is to train models as monoliths — typically using the transformer architecture — trained solely for narrow performance measures like next word prediction. However, this results in models that are difficult to work with, debug, and reliably explain. Even more alarming, current systems produce explanations and justifications that are completely unrelated to the actual processes the system used to arrive at its output.

* **Current AI systems produce explanations that are unrelated to their decision making process:** Models can do everything from medical diagnoses, to candidate resume analysis, to determining qualifications for a home loan, and yet these models are inherently biased. In fact, when a model's training process is unchecked, the default behavior is that the model’s explanations and justifications are entirely unrelated to the model’s output. This status quo is untenable because it renders current levers for arriving at insights about AI models ineffective. We need AI models that provide reliable justifications that are faithful — and truthful — to the way the AI models arrive at their output.
* **You cannot reliably debug a system that you don’t understand:** When you call your favorite generative model API for a task, and the response is incorrect, or copied verbatim from its training data, what do you do? You can change the prompt, but it’s unclear which part to change, and even then, it doesn’t guarantee the right output. If changing the prompt doesn’t work, it’s difficult to understand whether you need to fine-tune, add more in-context examples, or switch API models altogether. We need AI systems that can provide actionable insights that allow us to address these issues.
Gemini struggles with recitation of its training data: [https://github.com/google/generative-ai-docs/issues/257](https://github.com/google/generative-ai-docs/issues/257)
* **Difficult to control and align current AI systems:** Even when you’ve identified the cause of a problematic behavior from your model, it often requires a lot of trial and error to change the model behavior so that it no longer makes the mistake that you’ve identified. Fine-tuning and prompting models are too unreliable. What we need is fine-grained control.


While these large-scale models are still in their infancy, it is already clear that training for narrow performance measures like next word prediction without consideration for interpretability leaves too much room for error when it comes to mass public adoption. Repeatedly, in [computer vision](https://arxiv.org/abs/1905.02175), [natural language](https://arxiv.org/abs/2103.06922), [medical images](https://www.nature.com/articles/s41746-019-0105-1), [image generative models](https://www.bloomberg.com/graphics/2023-generative-ai-bias/), and [especially LLMs](https://www.bloomberg.com/graphics/2024-openai-gpt-hiring-racial-discrimination/?leadSource=uverify%20wall), it is the norm that optimizing for narrow performance measures does not yield reliable models.
## A new path: AI systems that are engineered to be interpretable
**At Guide Labs we believe you cannot reliably debug, align, and trust a model you don’t understand.** These critical properties cannot just be left unaddressed until after a model has been trained; they should guide the entire model development pipeline. Instead, we are rethinking the entire pipeline (model architecture, datasets, and training procedure) to engineer models that are interpretable, safe, trustworthy, and easier to debug and fix.
We want to enable reliable interaction, understanding, and controllability of models and AI systems. More specifically, we want:
* A medical doctor to understand why a medical LLM is making a particular diagnosis.
* A loan officer to verify whether an LLM is unfairly relying on legally protected attributes like gender, race, etc. for loan decisions.
* A biologist to be able to interactively understand why a protein language model has generated a particular sequence, and be able to interactively control the biophysical properties of the model.
To fulfill these requirements and many more, we need models that: produce reliable and trustworthy outputs; provide insights about which human-understandable factors are important; and indicate which part of the prompt, context, and training dataset are responsible for the output.
Collectively, our team has more than 20 years of experience focused on the interpretability and reliability of AI systems. We have published more than two dozen papers at top machine learning venues. Critically, we have shown that machine learning models trained solely for narrow performance measures, without regard for interpretability, result in models whose explanations are [mostly unrelated to the model’s decision-making process](https://arxiv.org/abs/1810.03292), and are not [aligned with humans for consequential decisions](https://arxiv.org/abs/2410.15471). Even worse, explanations of unchecked models can [actively mislead](https://arxiv.org/abs/2011.05429). More recently, we’ve shown that [self-explanations, like chain-of-thought, of LLMs are unreliable](https://arxiv.org/abs/2401.07927).
These results directly inform our approach to [engineer AI models that are interpretable, reliable, and trustworthy](https://arxiv.org/abs/2405.05386). Toward this end, we have demonstrated the effectiveness of rethinking a model’s training process for [language models](https://arxiv.org/abs/2310.07819) and [protein property prediction](https://ai4d3.github.io/papers/79.pdf). We developed one of [the first generative models](https://openreview.net/forum?id=L9U5MJJleF) at the billion-parameter scale that is constrained to reliably explain its outputs in terms of human-understandable factors.
Our past experience has shown that it is crucial to integrate interpretability, safety, and reliability constraints as part of the model development pipeline, and that these constraints can be satisfied without compromising downstream performance. With the new AI systems we are building, we can more easily identify the causes of erroneous outputs, detect when models latch onto spurious signals, and correct the models effectively. We aim to create a world where domain experts shift from merely 'prompting' AI to engaging in meaningful and truthful dialogue with AI systems.
## A First Step: Interpretable LLM at the Billion Parameter Scale
To demonstrate that our approach is feasible, and that constraining models does not sacrifice performance, we have developed an interpretable LLM that:
* produces human-understandable factors for any output it generates;
* produces reliable context citations; and
* specifies which training input data have the most effect on the model’s generated output.
We have shown that it is possible to train large-scale generative models that are engineered to be interpretable without sacrificing performance. We are excited to continue to scale these models to match current alternatives, expand the range of interpretability features we provide, and partner with select organizations to test the model. Reach out to us at: [info@guidelabs.ai](mailto:info@guidelabs.ai) if you would like to learn more.
## Join Us
We have assembled a team of interpretability researchers and engineers with an excellent track record in the field. We are hiring machine learning engineers, full-stack developers, and researchers to join us. If you are interested in joining our team, reach out to [careers@guidelabs.ai](mailto:careers@guidelabs.ai).
---