The march toward superintelligence is afoot, yet the AI systems leading it remain fundamentally opaque. At Guide Labs, we are building large-scale AI systems that are auditable, transparent, and steerable by design.
We have developed the first large-scale interpretable LLM, an 8-billion-parameter model capable of explaining its outputs through mechanisms that humans can understand. Over the coming days, we will be sharing the key steps that led to this achievement.
Increasingly, current AI systems achieve feats only a select few humans can match, such as gold medal performances at the International Mathematical Olympiad, the International Olympiad in Informatics, and the International Collegiate Programming Contest. These systems are also being deployed in real-world settings where we expect them to meet strict standards for reliability and accountability, such as screening job candidates, informing credit scoring, assisting clinicians, supporting legal discovery, and helping synthesize potential drug candidates.
Unfortunately, current AI systems fall short of these standards, and failures appear in unexpected, sometimes costly ways.
Lawyers have submitted briefs containing fabricated judicial decisions generated by an AI assistant.
AI customer-support agents have issued unauthorized refunds and promised benefits that do not exist.
AI agents have deleted production databases, wiping out critical systems despite explicit prompts intended to limit their actions.
These failures may appear paradoxical given the impressive capabilities of modern AI systems, but we have seen this duality before. Capability does not guarantee reliability. Before AlphaGo’s historic match against Lee Sedol, the Google DeepMind team invited European champion Fan Hui to probe the system’s strengths and weaknesses. He uncovered a critical flaw, stating: “I played with AlphaGo to understand where is the strong points of AlphaGo and where is maybe the weakness… And I find something, I find big weakness about AlphaGo. It’s a big one.” Project lead David Silver explained the underlying difficulty: “There will be these tricky lumps of knowledge it understands very poorly… It can be completely delusional.”
Despite this weakness, AlphaGo produced Move 37, one of the most creative moves in Go’s history and one most human professionals would never play. Within the span of a single match, the same system showed superhuman insight and deep, invisible failure modes.
These failures occur because modern AI systems remain fundamentally opaque and inscrutable. Today’s models use dense, entangled internal states that neither researchers nor developers can reliably understand. When a system produces an output, we lack reliable tools to see which mechanism caused it, why it occurred, or how to correct it.
Until recently, it was widely assumed that building large-scale interpretable models was impossible without sacrificing performance. Over the past year, at Guide Labs, we have shown this assumption to be false.
Contemporary interpretability research tries to answer such questions post hoc, interrogating models only after they have been trained. At best, this approach offers partial, and often unfaithful, glimpses into a system’s behavior; it has not proven capable of reliably and completely explaining what current AI systems do. As AI grows more capable and becomes further embedded in critical infrastructure, this opacity is no longer a scientific inconvenience; it is a structural risk.
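To make concrete what “post hoc” means here, the sketch below computes a gradient-times-input saliency score on an already-trained model, one of the most common post-hoc techniques. The checkpoint name is a stand-in; any causal language model would do.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Post-hoc attribution on a frozen, already-trained model ("gpt2" is a stand-in).
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The court cited a precedent that", return_tensors="pt")

# Embed tokens manually so gradients can flow back to the input embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach().requires_grad_(True)
logits = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"]).logits

# Backpropagate from the score of the model's top next-token choice.
logits[0, -1].max().backward()

# Gradient-times-input saliency, one score per input token.
saliency = (embeds.grad * embeds).abs().sum(dim=-1).squeeze(0)
for token, score in zip(tok.convert_ids_to_tokens(inputs["input_ids"][0]), saliency.tolist()):
    print(f"{token:>12}  {score:.3f}")
```

Scores like these are exactly the partial glimpses described above: they correlate with the model’s behavior, but carry no guarantee of reflecting the mechanism that actually produced the output.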
At Guide Labs, we are pioneering Interpretable Intelligence, a new paradigm in which models are constructed, from the ground up, to be transparent, controllable, and causally understandable. These models have human-interpretable concepts built into their computational structure and therefore are inherently interpretable.
Today, we are sharing how we did it.
We built Atlas, an automated system that annotates trillion-token datasets with human-interpretable concepts. These annotations enable interpretable model training, transparent dataset auditing, contamination detection, and fine-grained control over internal representations.
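The post does not describe how Atlas works internally, so the following is only a minimal sketch, under our own assumptions, of what concept-level dataset annotation can look like: each chunk of pretraining text gets a set of human-interpretable concept labels stored alongside it. All names here (CONCEPT_KEYWORDS, annotate_chunk, AnnotatedChunk) are hypothetical, and the keyword matcher stands in for whatever annotator does the real labeling at trillion-token scale.

```python
from dataclasses import dataclass

# Hypothetical concept vocabulary; a real system would use a far larger taxonomy
# and a learned annotator rather than keyword matching.
CONCEPT_KEYWORDS = {
    "legal_citation": ["plaintiff", "court held", "v."],
    "medical_claim": ["diagnosis", "dosage", "contraindicated"],
    "code": ["def ", "return", "import "],
    "personal_data": ["ssn", "date of birth"],
}

@dataclass
class AnnotatedChunk:
    text: str
    concepts: list[str]  # human-interpretable labels attached to this chunk

def annotate_chunk(text: str) -> AnnotatedChunk:
    """Attach concept labels to one chunk of pretraining text."""
    lowered = text.lower()
    labels = [
        concept
        for concept, keywords in CONCEPT_KEYWORDS.items()
        if any(k.lower() in lowered for k in keywords)
    ]
    return AnnotatedChunk(text=text, concepts=labels)

corpus = [
    "The court held that the plaintiff's claim was time-barred.",
    "def add(a, b):\n    return a + b",
]
for chunk in map(annotate_chunk, corpus):
    print(chunk.concepts, "<-", chunk.text[:40])
```

Once labels like these sit next to the tokens, the uses listed above, such as dataset auditing and contamination detection, reduce to filtering and counting over the annotations.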
We developed a new discrete diffusion architecture using block causal attention, scaling it to billions of parameters while preserving coherent text generation and incurring no performance trade-offs relative to standard diffusion language models.
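The architecture itself is not published here, but block causal attention is a well-defined pattern: tokens attend bidirectionally within their own block and causally to all earlier blocks, which lets a diffusion model denoise one block at a time while conditioning on already-generated text. A minimal mask construction, assuming a fixed block size:

```python
import torch

def block_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    """Boolean attention mask: entry [i, j] is True iff query i may attend to key j.

    Attention is allowed when j's block index is <= i's block index, i.e. full
    bidirectional attention inside a block and causal attention across blocks.
    """
    block_ids = torch.arange(seq_len) // block_size    # block index of each position
    return block_ids[None, :] <= block_ids[:, None]    # shape (seq_len, seq_len)

print(block_causal_mask(seq_len=8, block_size=4).int())
# Rows 0-3 attend only within block 0; rows 4-7 attend to blocks 0 and 1.
```

A boolean mask in this form (True meaning “may attend”) can be passed as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention.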
We trained the first 8-billion-parameter interpretable language model from scratch, demonstrating that concept-constrained architectures scale cleanly across both autoregressive and diffusion settings, and do so without the long-assumed interpretability–performance trade-off.
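The post does not spell out how concepts enter the computation, so the snippet below is only a sketch of one established way to do it (a concept-bottleneck-style layer), not a description of the actual architecture: the hidden state is routed through explicit, named concept activations that can be read or overridden at inference time. The module name, concept list, and dimensions are all made up.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    """Hidden state -> named concept activations -> hidden state.

    Because the intermediate vector has one coordinate per named concept,
    it can be inspected, logged, or clamped directly while the model runs.
    """
    def __init__(self, hidden_dim: int, concept_names: list):
        super().__init__()
        self.concept_names = concept_names
        self.to_concepts = nn.Linear(hidden_dim, len(concept_names))
        self.from_concepts = nn.Linear(len(concept_names), hidden_dim)

    def forward(self, h, clamp=None):
        concepts = torch.sigmoid(self.to_concepts(h))    # (batch, seq, n_concepts)
        if clamp:                                        # steer by overriding named concepts
            concepts = concepts.clone()
            for name, value in clamp.items():
                concepts[..., self.concept_names.index(name)] = value
        return self.from_concepts(concepts), concepts

layer = ConceptBottleneck(hidden_dim=32, concept_names=["legal_citation", "arithmetic", "code"])
h = torch.randn(1, 5, 32)
out, concept_acts = layer(h, clamp={"arithmetic": 0.0})  # suppress one concept
print(concept_acts.shape)                                # torch.Size([1, 5, 3])
```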
We built PRISM, a family of 130M–1.6B-parameter models that reveal which training-data patterns drive each next-token prediction. PRISM matches baseline quality within 5%, adds under 2% parameter overhead, and shows that transparent attribution can be achieved with negligible training cost.
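PRISM’s mechanism is not described in this post, so the snippet below is not its design. It is only a back-of-the-envelope sketch of why a pattern-attribution readout can stay within a small parameter budget: scoring the hidden state against a fixed bank of pattern vectors adds parameters that are tiny next to the base model. All sizes and names are invented.

```python
import torch
import torch.nn as nn

class PatternAttributionHead(nn.Module):
    """Scores a fixed bank of 'training-data pattern' slots against each hidden state.

    The resulting per-token distribution can be surfaced alongside the
    next-token prediction as an attribution signal.
    """
    def __init__(self, hidden_dim: int, num_patterns: int):
        super().__init__()
        self.pattern_bank = nn.Parameter(torch.randn(num_patterns, hidden_dim) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # (batch, seq, hidden) @ (hidden, num_patterns) -> per-token pattern scores
        return torch.softmax(h @ self.pattern_bank.T, dim=-1)

hidden_dim, num_patterns = 2048, 8192
head = PatternAttributionHead(hidden_dim, num_patterns)
print(head(torch.randn(1, 10, hidden_dim)).shape)   # torch.Size([1, 10, 8192])

# Overhead check: the bank adds num_patterns * hidden_dim parameters,
# about 16.8M, or roughly 1% of a 1.6B-parameter base model.
extra = num_patterns * hidden_dim
print(f"{100 * extra / 1.6e9:.2f}% parameter overhead")   # 1.05%
```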
Taken together, these advances mark the first time that large-scale language models have achieved native, inherent interpretability. We have demonstrated that understanding, reliability, and auditability can be fundamental properties of models.
If you’d like early access, you can sign up here: signup link. We’ll be opening our model to researchers, practitioners, and organizations as we scale.