Interpretable Intelligence: AI You Can Understand and Trust


Author: Julius Adebayo, Co-Founder & CEO
Published: March 19, 2026

Originally published December 2nd, 2024. Edited March 19th, 2026.

The march toward superintelligence is afoot, yet AI systems are growing more capable and more opaque in equal measure, precisely because of the way they are built. The traditional response, reverse engineering already-trained models, has produced plausible-looking explanations but has failed to deliver systems we can reliably steer and understand.

At Guide Labs, we are pioneering Interpretable Intelligence, a new paradigm in which models are engineered, from the ground up, to be auditable, steerable, and understandable. To demonstrate this, we built Steerling-8B, the first large-scale inherently interpretable language model, unlocking concept discovery, inference-time steering, and full training data traceability. This demonstrates that capability and understanding are not at odds.


Fragile superintelligence

A transformer large language model (LLM) trained on Manhattan taxi rides can give turn-by-turn directions with near-perfect accuracy, yet its internal map of the city is incoherent: streets with impossible orientations, flyovers above other roads, a tangle that bears no resemblance to Manhattan. The model performs correctly while understanding nothing.

Internal world models of transformer models trained on turn-by-turn taxi directions in New York City. Image from Vafa et al. (a) The true Manhattan map. (b) The true map under noise perturbations, which still maintains clean spatial representations. (c) A transformer trained on turn-by-turn directions produces entangled, chaotic internal states despite generating correct predictions. Capability and internal coherence are not the same thing.

Current AI systems achieve feats only a select few humans can match, such as gold medal performances at the International Mathematical Olympiad, the International Olympiad in Informatics, and the International Collegiate Programming Contest. These systems are also increasingly used in real world settings where we expect them to meet strict standards for reliability and accountability, such as screening job candidates, assisting clinicians, legal discovery, and helping synthesize potential drug candidates.

Unfortunately, today’s AI systems fail while making users feel certain they are succeeding. These systems are frighteningly sycophantic: a major model update had to be rolled back after the model began validating delusions. Frontier models show up to 100% compliance with medically illogical requests, providing false drug information; lawyers have submitted briefs containing fabricated judicial decisions generated by an AI assistant; and psychiatrists are now documenting a wave of AI-induced psychosis cases.

Opaque by Design

Capability does not guarantee reliability. Before AlphaGo’s historic match against Lee Sedol, the Google DeepMind team invited European champion Fan Hui to probe the system’s strengths and weaknesses. He uncovered a critical flaw, stating: “I played with AlphaGo to understand where is the strong points of AlphaGo and where is maybe the weakness… And I find something, I find big weakness about AlphaGo. It’s a big one.” Project lead David Silver explained the underlying difficulty: “There will be these tricky lumps of knowledge it understands very poorly… It can be completely delusional.”

Despite this weakness, AlphaGo produced Move 37: one of the most creative moves in Go’s history. Within a single match, the same system exhibited superhuman insight and deep, invisible failure modes.

These failures occur because modern AI systems remain fundamentally opaque and inscrutable. Today’s models use dense, entangled internal states that neither researchers nor developers can reliably understand. When a system produces an output, we lack reliable tools to see which mechanism caused it, why it occurred, or how to correct it.

Post-hoc interpretability is reading tea leaves

The most popular contemporary paradigm for interpretability is to take an already-trained model and try to reverse engineer what it has learned. It relies on the hope that training, on the right data, happens to produce models whose internal representations are modular and cleanly organized enough to be auditable. As we will show, this hope, while appealing, is misplaced. It amounts to trying to understand an organism that was never designed to be understood; to finding structure in a system whose internal organization emerged purely from the pressure to predict the next token.

Feature Attributions

Let’s take a now-classic interpretability tool: feature attributions. These methods indicate which parts of the input a model’s output is most sensitive to. Used carefully, and with knowledge of how a model was trained, they can provide genuine insight into model behavior. The problem arises when they are applied to models with no constraints on how they organize their sensitivity to inputs. Without such constraints, the tool has no way to distinguish a meaningful explanation from a coincidental one.

Let’s stress-test this approach with a simple experiment on an image recognition model. We randomize the model’s last layer, then compare the outputs of feature attribution methods on the partially randomized model against those on the intact model. In the image below, we show an example from a system trained to recognize objects.


Fifteen feature attribution methods applied to a model whose last layer is randomly initialized.

These popular methods produce human-plausible outputs even when the model’s weights are partially destroyed. An explanation that cannot distinguish a trained model from a broken one is not explaining the model.
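The randomization test can be sketched in a few lines of Python. The two-layer toy network below is a hypothetical stand-in for a trained classifier, and the attribution method is plain input gradients; the structure of the check is what matters, not the specific model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer stand-in for a classifier: x -> relu(W1 @ x) -> w2 . h
# (weights are random here purely for illustration)
W1 = rng.normal(size=(16, 8))
w2 = rng.normal(size=16)

def input_gradient(x, W1, w2):
    """Plain gradient attribution: d(output)/d(input) via the chain rule."""
    h = W1 @ x
    relu_mask = (h > 0).astype(float)
    return (relu_mask * w2) @ W1  # shape (8,): one score per input feature

x = rng.normal(size=8)
attr_trained = input_gradient(x, W1, w2)

# The sanity check: re-randomize the last layer and recompute attributions
w2_random = rng.normal(size=16)
attr_random = input_gradient(x, W1, w2_random)

# If the two attribution maps remain highly similar, the method is not
# actually explaining the (now destroyed) last-layer weights
similarity = np.corrcoef(attr_trained, attr_random)[0, 1]
print(f"correlation after last-layer randomization: {similarity:.2f}")
```

An attribution method that passes this check should produce maps that change substantially when the weights it claims to explain are destroyed.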

Sparse Autoencoders (SAEs)

More recently, sparse autoencoders (SAEs) have become the dominant technique in mechanistic interpretability, decomposing model activations into human-readable features. They have been shown to surface striking features inside large models: concepts like the Golden Gate Bridge, emotional states, and even representations linked to safety-relevant behaviors. But the same problem that haunted feature attributions returns in a new form.


Top (a): Comparison of random vs. trained SAE features on CLIP ViT-B/32 (layer 3). Bottom: Sample activation contexts for latents from an SAE trained with a Soft-Frozen Decoder. Image from Korznikov and Galichin et al.

Random SAE baselines match fully trained SAEs on interpretability, sparse probing, and causal editing. SAEs also produce different feature sets when trained with different random seeds. Even though the explanations look convincing, they are often not grounded in what the model learned.
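For concreteness, here is a minimal sketch of the SAE decomposition step. A trained SAE also optimizes a reconstruction-plus-sparsity objective; the sizes and weights below are illustrative, and the weights are deliberately left untrained, since the critique above is precisely that even random SAEs can yield plausible-looking features.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 32, 128  # illustrative sizes; real SAEs are far wider

# Untrained (random) SAE weights
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)

def sae_decompose(activation):
    """Encode an activation into sparse feature codes, then reconstruct it."""
    codes = np.maximum(W_enc @ activation + b_enc, 0.0)  # ReLU encoder
    reconstruction = W_dec @ codes                        # linear decoder
    return codes, reconstruction

activation = rng.normal(size=d_model)  # stand-in for a residual-stream vector
codes, reconstruction = sae_decompose(activation)

active_features = np.flatnonzero(codes)  # "features" firing on this input
print(f"{active_features.size} of {d_sae} features active")
```

Nothing in this procedure ties a "feature" to anything the underlying model actually learned; that grounding has to come from elsewhere.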

The Model is the Problem

The problem is not that current interpretability tools are poorly designed, or that reverse engineering an AI system is the wrong target. It is that they are being applied to an underspecified substrate. Post-hoc interpretability methods work by making assumptions about how a model organizes its knowledge, but modern models are trained with no constraints on that organization whatsoever. The model is free to represent the same concept in several different ways, and the interpretability tool has no way to know which one is meaningful. For any post-hoc tool to be reliable, one needs to understand, and ideally intervene on, the model’s training process. This makes it possible to shape the model to respect the kind of structure the tool is looking for. Without that, one is not reading the model. One is reading tea leaves.

This limitation runs deeper than any single method. Chain of thought explanations, probing classifiers, and several other interpretability approaches all face the same wall. They generate explanations that the model is under no obligation to be faithful to. Our team has spent years demonstrating these deficiencies, publishing work that exposed the limits of these approaches one by one, and ultimately building something different.

Interpretable Intelligence

At Guide Labs, we are pioneering Interpretable Intelligence, a new paradigm in which models are engineered, from the ground up, to be transparent, controllable, and understandable. These models have human-interpretable concepts built into their computational structure, and are therefore inherently interpretable. We do not treat reliability and interpretability as afterthoughts; we design the AI system to satisfy these requirements from the start.

Consequently, we shift the question from “Can we reverse-engineer what this model knows?” to simply: “What did this model learn?” Until recently, it was widely assumed that building large-scale interpretable models was impossible without sacrificing performance. Over the past year, we have shown this assumption to be false.

Diagram comparing post-hoc and inherent interpretability

Unlike post-hoc approaches, Steerling-8B surfaces concepts directly from its architecture: every output token is attributed at inference time.

We built Steerling-8B: the first large-scale interpretable large language model (LLM). Trained on 1.35 trillion tokens, it achieves downstream performance within range of models trained on 1.5–7 times more data, while remaining fully transparent by design. Any token it generates can be traced to its input context, to human-understandable concepts, and to its training data. Steerling-8B unlocks several capabilities, including suppressing or amplifying specific concepts at inference time without retraining, training data provenance for any generated chunk, and inference-time alignment via concept control, replacing thousands of safety training examples with explicit concept-level steering.

Building Interpretable Intelligence

Interpretable Intelligence is not a single technique. It is a stack built from the ground up so that every layer supports the next. We started with data, built a model whose representations are transparent by design, and then demonstrated what that transparency makes possible.

Data: Atlas

We built Atlas, an automated system that annotates trillion-token datasets with human-interpretable concepts. Using this pipeline, we released FineWeb Atlas, a 10 billion token concept-annotated pretraining corpus with 16,790 human-understandable concepts assigned at the sub-document level. FineWeb Atlas makes concept-level data curation straightforward for the first time.
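To make sub-document concept annotation concrete, a record in such a corpus might look like the following sketch. The field names and values are hypothetical illustrations, not the actual FineWeb Atlas schema.

```python
# Hypothetical shape of one concept-annotated record -- the field names are
# illustrative, not the released FineWeb Atlas schema
record = {
    "doc_id": "fineweb-000123",
    "text": "Espresso extraction depends on grind size and water temperature.",
    "spans": [
        {"start": 0, "end": 63, "concepts": ["coffee brewing", "food science"]},
    ],
}

def has_concept(rec, name):
    """Concept-level curation reduces to filtering on span annotations."""
    return any(name in span["concepts"] for span in rec["spans"])

keep = has_concept(record, "coffee brewing")  # retain for a coffee-heavy mix
```

Once annotations like these exist at the sub-document level, curating a pretraining mix by concept becomes a straightforward filter rather than a heuristic over URLs or keywords.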

Model: Steerling-8B

We trained Steerling-8B, the first 8-billion-parameter inherently interpretable language model, on 1.35 trillion tokens. Rather than entangling knowledge in inscrutable weight matrices, Steerling organizes its knowledge into representations that humans can directly read and edit. It enables decomposing every prediction into per-concept contributions across approximately 33,000 supervised concepts, 100,000 discovered concepts, and a residual component. The model achieves downstream performance comparable to models trained on 1.5 to 7 times more data.
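The additive structure this decomposition relies on can be sketched as follows. The concept directions, sizes, and least-squares projection below are illustrative assumptions, not Steerling's actual parameterization; the point is that per-concept contributions plus a residual reconstruct the prediction exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_concepts, d_model, vocab = 6, 16, 10  # tiny illustrative sizes

# Hypothetical concept directions and unembedding matrix
concept_dirs = rng.normal(size=(n_concepts, d_model))
unembed = rng.normal(size=(d_model, vocab))

hidden = rng.normal(size=d_model)  # stand-in for a hidden state

# Least-squares fit of the hidden state onto the concept directions,
# leaving a residual component the concepts do not explain
coeffs, *_ = np.linalg.lstsq(concept_dirs.T, hidden, rcond=None)
per_concept = coeffs[:, None] * concept_dirs   # (n_concepts, d_model)
residual = hidden - per_concept.sum(axis=0)

# Logits decompose additively into per-concept shares plus a residual share
logit_contribs = per_concept @ unembed         # (n_concepts, vocab)
logits = logit_contribs.sum(axis=0) + residual @ unembed
assert np.allclose(logits, hidden @ unembed)   # exact decomposition
```

Because the decomposition is exact, each concept's share of a prediction can be read off directly rather than estimated after the fact.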

Model weights, code, and a PyPI package are all publicly available.

What Interpretability Unlocks

Because Steerling’s representations are organized around human-understandable concepts by construction, capabilities that are difficult or impossible with black-box models become straightforward.

Concept Steering. You can inject, suppress, or compose concepts at inference time to directly control what the model generates. Take a single neutral prompt and steer it toward tenant-landlord law, coffee, data visualization, or engine mechanics, with no changes to the prompt itself.
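Mechanically, concept steering amounts to shifting a hidden state along a concept direction at inference time. The dictionary of directions below is a hypothetical stand-in for illustration; Steerling's actual steering operates on its built-in concept representations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical concept directions (an inherently interpretable model exposes
# these directly; they are random vectors here purely for illustration)
concepts = {
    "coffee": rng.normal(size=d_model),
    "tenant_landlord_law": rng.normal(size=d_model),
}

def steer(hidden, name, strength):
    """Shift a hidden state along a named concept direction.

    Positive strength amplifies the concept; negative strength suppresses it.
    """
    direction = concepts[name] / np.linalg.norm(concepts[name])
    return hidden + strength * direction

hidden = rng.normal(size=d_model)
toward_coffee = steer(hidden, "coffee", strength=4.0)                # amplify
away_from_law = steer(hidden, "tenant_landlord_law", strength=-4.0)  # suppress
```

Composing concepts is then just applying several such shifts, with no change to the prompt and no retraining.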

Concept Discovery. Because Steerling’s representations are trained to be disentangled, we can directly read off what the model has learned, including concepts it was never explicitly trained to acquire. Among the ~100K discovered concepts: British English spelling as a distinct representation, “you” unified across six languages with no multilingual training signal, and a dedicated representation for broken Unicode. In standard models, recovering this kind of knowledge requires post-hoc methods that face irreducible ambiguity.

Alignment without Finetuning. For any output Steerling-8B generates, we can trace which specific training documents drove it, from forum posts behind harmful content to academic papers behind specialized knowledge. When behavior is wrong, we can suppress the responsible concepts at inference time rather than retraining from scratch. This reduces harmful outputs from 80% to 29%, exceeding the effect of finetuning on 10,000 labeled examples and replacing slow, opaque correction loops with explicit, auditable, concept-level controls.

A model that gives perfect directions while holding an incoherent map of the city is not a foundation you can build on. Steerling is the first large-scale language model that performs and understands, and that changes what AI can be trusted to do.