Interpretable Intelligence: AI You Can Understand and Trust


Author: Julius Adebayo, Co-Founder & CEO
Published: March 19, 2026

Originally published December 2nd, 2024. Edited March 19th, 2026.

The march toward superintelligence is afoot, yet AI systems are growing more capable and more opaque in equal measure, precisely because of the way they are built. The traditional response, reverse engineering already-trained models, has produced plausible-looking explanations but has failed to deliver systems we can reliably steer and understand.

At Guide Labs, we are pioneering Interpretable Intelligence, a new paradigm in which models are engineered, from the ground up, to be auditable, steerable, and understandable. To demonstrate this, we built Steerling-8B, the first large-scale inherently interpretable language model, unlocking concept discovery, inference-time steering, and full training data traceability. This demonstrates that capability and understanding are not at odds.


Fragile superintelligence

A transformer large language model (LLM) trained on Manhattan taxi rides can give turn-by-turn directions with near-perfect accuracy, yet its internal map of the city is incoherent: streets with impossible orientations, flyovers above other roads, a tangle that bears no resemblance to Manhattan. The model performs correctly while understanding nothing.

Internal world models of transformer models trained on turn-by-turn taxi directions in New York City. Image from Vafa et al. (a) The true Manhattan map. (b) The true map under noise perturbations, which still maintains clean spatial representations. (c) A transformer trained on turn-by-turn directions produces entangled, chaotic internal states despite generating correct predictions. Capability and internal coherence are not the same thing.

Current AI systems achieve feats only a select few humans can match, such as gold medal performances at the International Mathematical Olympiad, the International Olympiad in Informatics, and the International Collegiate Programming Contest. These systems are also increasingly used in real world settings where we expect them to meet strict standards for reliability and accountability, such as screening job candidates, assisting clinicians, legal discovery, and helping synthesize potential drug candidates.

Unfortunately, today’s AI systems fail while making users feel certain they are succeeding. These systems are frighteningly sycophantic: a major model update had to be rolled back after the model began validating delusions. Frontier models show up to 100% compliance with medically illogical requests, providing false drug information; lawyers have submitted briefs containing fabricated judicial decisions generated by an AI assistant; and psychiatrists are now documenting a wave of AI-induced psychosis cases.

Opaque by Design

Capability does not guarantee reliability. Before AlphaGo’s historic match against Lee Sedol, the Google DeepMind team invited European champion Fan Hui to probe the system’s strengths and weaknesses. He uncovered a critical flaw, stating: “I played with AlphaGo to understand where is the strong points of AlphaGo and where is maybe the weakness… And I find something, I find big weakness about AlphaGo. It’s a big one.” Project lead David Silver explained the underlying difficulty: “There will be these tricky lumps of knowledge it understands very poorly… It can be completely delusional.”

Despite this weakness, AlphaGo produced Move 37: one of the most creative moves in Go’s history. Within a single match, the same system exhibited superhuman insight and deep, invisible failure modes.

These failures occur because modern AI systems remain fundamentally opaque and inscrutable. Today’s models use dense, entangled internal states that neither researchers nor developers can reliably understand. When a system produces an output, we lack reliable tools to see which mechanism caused it, why it occurred, or how to correct it.

Post-hoc interpretability is reading tea leaves

The most popular contemporary paradigm for interpretability is to take an already-trained model and try to reverse engineer what it has learned. It relies on the hope that training, on the right data, happens to produce models whose internal representations are modular and cleanly organized enough to be auditable. As we will show, this hope, while appealing, is misplaced. It amounts to trying to understand an organism that was never designed to be understood; to finding structure in a system whose internal organization emerged purely from the pressure to predict the next token.

Feature Attributions

Let’s take a now-classic interpretability tool: feature attributions. These methods indicate which parts of the input a model’s output is most sensitive to. Used carefully, and with knowledge of how a model was trained, they can provide genuine insight into model behavior. The problem arises when they are applied to models with no constraints on how they organize their sensitivity to inputs. Without such constraints, the tool has no way to distinguish a meaningful explanation from a coincidental one.

Let’s stress-test this approach with a simple experiment on an image recognition model. We randomize the model’s last layer, then compare the outputs of feature attribution methods on the partially randomized model against those on the intact model. In the image below, we show an example from a system trained to recognize objects.


Fifteen feature attribution methods applied to a model whose last layer is randomly initialized.

These popular methods produce human-plausible outputs even when the model’s weights are partially destroyed. An explanation that cannot distinguish a trained model from a broken one is not explaining the model.
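The randomization test can be sketched in a few lines of Python. The two-layer toy network below is a hypothetical stand-in for a trained classifier, and the attribution method is plain input gradients; the structure of the check is what matters, not the specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer stand-in for a classifier: x -> relu(W1 @ x) -> w2 . h
# (weights are random here purely for illustration)
W1 = rng.normal(size=(16, 8))
w2 = rng.normal(size=16)

def input_gradient(x, W1, w2):
    """Plain gradient attribution: d(output)/d(input) via the chain rule."""
    h = W1 @ x
    relu_mask = (h > 0).astype(float)
    return (relu_mask * w2) @ W1  # shape (8,): one score per input feature

x = rng.normal(size=8)
attr_trained = input_gradient(x, W1, w2)

# The sanity check: re-randomize the last layer and recompute attributions
w2_random = rng.normal(size=16)
attr_random = input_gradient(x, W1, w2_random)

# If the two attribution maps remain highly similar, the method is not
# actually explaining the (now destroyed) last-layer weights
similarity = np.corrcoef(attr_trained, attr_random)[0, 1]
print(f"correlation after last-layer randomization: {similarity:.2f}")
```

An attribution method that passes this check should produce maps that change substantially when the weights it claims to explain are destroyed.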

Sparse Autoencoders (SAEs)

More recently, sparse autoencoders (SAEs) have become the dominant technique in mechanistic interpretability, decomposing model activations into human-readable features. They have been shown to surface striking features inside large models: concepts like the Golden Gate Bridge, emotional states, and even representations linked to safety-relevant behaviors. But the same problem that haunted feature attributions returns in a new form.


Top (a): Comparison of random vs. trained SAE features on CLIP ViT-B/32 (layer 3). Bottom: Sample activation contexts for latents from an SAE trained with a Soft-Frozen Decoder. Image from Korznikov and Galichin et al.

Random SAE baselines match fully trained SAEs on interpretability, sparse probing, and causal editing. SAEs also produce different feature sets when trained with different random seeds. Even though the explanations look convincing, they are often not grounded in what the model learned.
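For concreteness, here is a minimal sketch of the SAE decomposition step. A trained SAE also optimizes a reconstruction-plus-sparsity objective; the sizes and weights below are illustrative, and the weights are deliberately left untrained, since the critique above is precisely that even random SAEs can yield plausible-looking features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128  # illustrative sizes; real SAEs are far wider

# Untrained (random) SAE weights
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)

def sae_decompose(activation):
    """Encode an activation into sparse feature codes, then reconstruct it."""
    codes = np.maximum(W_enc @ activation + b_enc, 0.0)  # ReLU encoder
    reconstruction = W_dec @ codes                        # linear decoder
    return codes, reconstruction

activation = rng.normal(size=d_model)  # stand-in for a residual-stream vector
codes, reconstruction = sae_decompose(activation)

active_features = np.flatnonzero(codes)  # "features" firing on this input
print(f"{active_features.size} of {d_sae} features active")
```

Nothing in this procedure ties a "feature" to anything the underlying model actually learned; that grounding has to come from elsewhere.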

The Model is the Problem

The problem is not that current interpretability tools are poorly designed, or that reverse engineering an AI system is the wrong target. It is that they are being applied to an underspecified substrate. Post-hoc interpretability methods work by making assumptions about how a model organizes its knowledge, but modern models are trained with no constraints on that organization whatsoever. The model is free to represent the same concept in several different ways, and the interpretability tool has no way to know which one is meaningful. For any post-hoc tool to be reliable, one needs to understand, and ideally intervene on, the model’s training process. This makes it possible to shape the model to respect the kind of structure the tool is looking for. Without that, one is not reading the model. One is reading tea leaves.

This limitation runs deeper than any single method. Chain of thought explanations, probing classifiers, and several other interpretability approaches all face the same wall. They generate explanations that the model is under no obligation to be faithful to. Our team has spent years demonstrating these deficiencies, publishing work that exposed the limits of these approaches one by one, and ultimately building something different.

Interpretable Intelligence

At Guide Labs, we are pioneering Interpretable Intelligence, a new paradigm in which models are engineered, from the ground up, to be transparent, controllable, and understandable. These models have human-interpretable concepts built into their computational structure, and are therefore inherently interpretable. We do not treat reliability and interpretability as afterthoughts; we design the AI system to satisfy these requirements from the start.

Consequently, we shift the question from “Can we reverse-engineer what this model knows?” to simply: “What did this model learn?” Until recently, it was widely assumed that building large-scale interpretable models was impossible without sacrificing performance. Over the past year, we have shown this assumption to be false.

Diagram comparing post-hoc and inherent interpretability

Unlike post-hoc approaches, Steerling-8B surfaces concepts directly from its architecture: every output token is attributed at inference time.

We built Steerling-8B: the first large-scale interpretable large language model (LLM). Trained on 1.35 trillion tokens, it achieves downstream performance within range of models trained on 1.5–7 times more data, while remaining fully transparent by design. Any token it generates can be traced to its input context, to human-understandable concepts, and to its training data. Steerling-8B unlocks several capabilities, including suppressing or amplifying specific concepts at inference time without retraining, training data provenance for any generated chunk, and inference-time alignment via concept control, replacing thousands of safety training examples with explicit concept-level steering.

Building Interpretable Intelligence

Interpretable Intelligence is not a single technique. It is a stack built from the ground up so that every layer supports the next. We started with data, built a model whose representations are transparent by design, and then demonstrated what that transparency makes possible.

Data: Atlas

We built Atlas, an automated system that annotates trillion-token datasets with human-interpretable concepts. Using this pipeline, we released FineWeb Atlas, a 10 billion token concept-annotated pretraining corpus with 16,790 human-understandable concepts assigned at the sub-document level. FineWeb Atlas makes concept-level data curation straightforward for the first time.
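To make sub-document concept annotation concrete, a record in such a corpus might look like the following sketch. The field names and values are hypothetical illustrations, not the actual FineWeb Atlas schema.

```python
# Hypothetical shape of one concept-annotated record -- the field names are
# illustrative, not the released FineWeb Atlas schema
record = {
    "doc_id": "fineweb-000123",
    "text": "Espresso extraction depends on grind size and water temperature.",
    "spans": [
        {"start": 0, "end": 63, "concepts": ["coffee brewing", "food science"]},
    ],
}

def has_concept(rec, name):
    """Concept-level curation reduces to filtering on span annotations."""
    return any(name in span["concepts"] for span in rec["spans"])

keep = has_concept(record, "coffee brewing")  # retain for a coffee-heavy mix
```

Once annotations like these exist at the sub-document level, curating a pretraining mix by concept becomes a straightforward filter rather than a heuristic over URLs or keywords.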

Model: Steerling-8B

We trained Steerling-8B, the first 8-billion-parameter inherently interpretable language model, on 1.35 trillion tokens. Rather than entangling knowledge in inscrutable weight matrices, Steerling organizes its knowledge into representations that humans can directly read and edit. It enables decomposing every prediction into per-concept contributions across approximately 33,000 supervised concepts, 100,000 discovered concepts, and a residual component. The model achieves downstream performance comparable to models trained on 1.5 to 7 times more data.
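The additive structure this decomposition relies on can be sketched as follows. The concept directions, sizes, and least-squares projection below are illustrative assumptions, not Steerling's actual parameterization; the point is that per-concept contributions plus a residual reconstruct the prediction exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, d_model, vocab = 6, 16, 10  # tiny illustrative sizes

# Hypothetical concept directions and unembedding matrix
concept_dirs = rng.normal(size=(n_concepts, d_model))
unembed = rng.normal(size=(d_model, vocab))

hidden = rng.normal(size=d_model)  # stand-in for a hidden state

# Least-squares fit of the hidden state onto the concept directions,
# leaving a residual component the concepts do not explain
coeffs, *_ = np.linalg.lstsq(concept_dirs.T, hidden, rcond=None)
per_concept = coeffs[:, None] * concept_dirs   # (n_concepts, d_model)
residual = hidden - per_concept.sum(axis=0)

# Logits decompose additively into per-concept shares plus a residual share
logit_contribs = per_concept @ unembed         # (n_concepts, vocab)
logits = logit_contribs.sum(axis=0) + residual @ unembed
assert np.allclose(logits, hidden @ unembed)   # exact decomposition
```

Because the decomposition is exact, each concept's share of a prediction can be read off directly rather than estimated after the fact.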

Model weights, code, and a PyPI package are all publicly available.

What Interpretability Unlocks

Because Steerling’s representations are organized around human-understandable concepts by construction, capabilities that are difficult or impossible with black-box models become straightforward.

Concept Steering. You can inject, suppress, or compose concepts at inference time to directly control what the model generates. Take a single neutral prompt and steer it toward tenant-landlord law, coffee, data visualization, or engine mechanics, with no changes to the prompt itself.
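Mechanically, concept steering amounts to shifting a hidden state along a concept direction at inference time. The dictionary of directions below is a hypothetical stand-in for illustration; Steerling's actual steering operates on its built-in concept representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical concept directions (an inherently interpretable model exposes
# these directly; they are random vectors here purely for illustration)
concepts = {
    "coffee": rng.normal(size=d_model),
    "tenant_landlord_law": rng.normal(size=d_model),
}

def steer(hidden, name, strength):
    """Shift a hidden state along a named concept direction.

    Positive strength amplifies the concept; negative strength suppresses it.
    """
    direction = concepts[name] / np.linalg.norm(concepts[name])
    return hidden + strength * direction

hidden = rng.normal(size=d_model)
toward_coffee = steer(hidden, "coffee", strength=4.0)                # amplify
away_from_law = steer(hidden, "tenant_landlord_law", strength=-4.0)  # suppress
```

Composing concepts is then just applying several such shifts, with no change to the prompt and no retraining.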

Concept Discovery. Because Steerling’s representations are trained to be disentangled, we can directly read off what the model has learned, including concepts it was never explicitly trained to acquire. Among the ~100K discovered concepts: British English spelling as a distinct representation, “you” unified across six languages with no multilingual training signal, and a dedicated representation for broken Unicode. In standard models, recovering this kind of knowledge requires post-hoc methods that face irreducible ambiguity.

Alignment without Finetuning. For any output Steerling-8B generates, we can trace which specific training documents drove it, from forum posts behind harmful content to academic papers behind specialized knowledge. When behavior is wrong, we can suppress the responsible concepts at inference time rather than retraining from scratch. This reduces harmful outputs from 80% to 29%, exceeding the effect of finetuning on 10,000 labeled examples and replacing slow, opaque correction loops with explicit, auditable, concept-level controls.

A model that gives perfect directions while holding an incoherent map of the city is not a foundation you can build on. Steerling is the first large-scale language model that performs and understands, and that changes what AI can be trusted to do.