Introducing Guide Labs: Engineering Interpretable and Auditable AI Systems


Author: Guide Labs Team
Published: November 17, 2024

Guide Labs is building a new class of interpretable AI systems that humans and domain experts can reliably understand, trust, and debug. As individuals and businesses around the world work to integrate AI into their existing workflows, and governments seek to regulate frontier AI systems, the demand for models that can be reliably debugged, steered, and understood continues to grow. To help bring interpretable and reliable models to market, we have closed a $9.3 million seed round led by Initialized Capital, with participation from Tectonic Ventures, Lombard Street Ventures, Pioneer Fund, Y Combinator, E14 Fund, and several prominent angels.

Current AI systems are not reliable, not interpretable, and difficult to audit

The prevailing paradigm at most AI companies is to train monolithic models, typically built on the transformer architecture, solely for narrow performance measures like next-word prediction. This yields models that are difficult to work with, debug, and reliably explain. Even more alarming, current systems produce explanations and justifications that are completely unrelated to the process the system actually used to arrive at its output.

Recent examples include: a Bloomberg Technology + Equality article titled "Humans are biased. Generative AI is even worse"; Gemini struggling to recite its own training data and abruptly stopping generation mid-response (https://github.com/google/generative-ai-docs/issues/257); users unable to stop a model from claiming it is affiliated with OpenAI; and AI-generated images of US senators from the 1800s that depict Black, Asian, and Indigenous people.

While these large-scale models are still in their infancy, it is already clear that optimizing for narrow performance measures like next-word prediction, without consideration for interpretability, leaves too much room for error for mass public adoption. Across computer vision, natural language processing, medical imaging, image generation, and especially LLMs, the norm is that optimizing for narrow performance measures alone does not yield reliable models.

A new path: AI systems that are engineered to be interpretable

At Guide Labs we believe you cannot reliably debug, align, and trust a model you don't understand. These critical properties cannot be deferred until after a model has been trained; they should guide the entire model development pipeline. We are therefore rethinking that pipeline, from model architecture to datasets to training procedure, to engineer models that are interpretable, safe, trustworthy, and easier to debug and fix.

We want to enable reliable interaction with, understanding of, and control over models and AI systems.

To fulfill these requirements, we need models that produce reliable and trustworthy outputs; provide insight into which human-understandable factors matter; and indicate which parts of the prompt, context, and training dataset are responsible for the output.
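As a rough illustration of what such an interface could look like in practice, the sketch below bundles a model's answer with the three kinds of explanation listed above. The class names, fields, and scores are hypothetical placeholders for the sake of the example, not Guide Labs' actual API.

```python
# A minimal, hypothetical sketch of querying an interpretable model.
# All names and values below are illustrative placeholders.
from dataclasses import dataclass, field


@dataclass
class InterpretableOutput:
    """A model answer bundled with the explanations described above."""
    text: str
    # Human-understandable factors (concepts) and their importance scores.
    concept_attributions: dict[str, float] = field(default_factory=dict)
    # Parts of the prompt/context that most influenced the output.
    prompt_attributions: dict[str, float] = field(default_factory=dict)
    # Training examples most responsible for the output.
    influential_training_examples: list[str] = field(default_factory=list)


class HypotheticalInterpretableLLM:
    def generate(self, prompt: str) -> InterpretableOutput:
        # A real system would run constrained generation here; this stub
        # only illustrates the shape of the returned explanation.
        return InterpretableOutput(
            text="<model answer>",
            concept_attributions={"tone: formal": 0.62, "topic: finance": 0.31},
            prompt_attributions={"context paragraph 2": 0.70},
            influential_training_examples=["training document #1042"],
        )


if __name__ == "__main__":
    output = HypotheticalInterpretableLLM().generate("Summarize the quarterly report.")
    print(output.text, output.concept_attributions)
```

The point of the sketch is that the attributions are returned alongside the answer as part of the same forward pass, rather than produced as a post-hoc justification by asking the model to explain itself.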

Collectively, our team has more than 20 years of experience focused on the interpretability and reliability of AI systems, and we have published more than two dozen papers at top machine learning venues. Critically, we have shown that machine learning models trained solely for narrow performance measures, without regard for interpretability, produce explanations that are mostly unrelated to their decision-making process and that do not align with human reasoning in consequential decisions. Even worse, explanations from such unchecked models can actively mislead. More recently, we have shown that LLM self-explanations, such as chain-of-thought, are unreliable.

These results directly inform our approach to engineer AI models that are interpretable, reliable, and trustworthy. Toward this end, we have demonstrated the effectiveness of rethinking a model’s training process for language models and protein property prediction. We developed one of the first generative models at the billion-parameter scale that is constrained to reliably explain its outputs in terms of human-understandable factors.
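One published line of work that matches this description is the concept-bottleneck idea, in which a model's prediction is forced to pass through a small layer of named, human-understandable factors. The PyTorch snippet below is a minimal sketch of that general idea under assumed dimensions and concept names; it is not Guide Labs' actual architecture or training procedure.

```python
# A minimal sketch of a concept-bottleneck prediction head: the next-token
# logits are computed only from named, human-understandable concept scores,
# so those scores can be inspected alongside every output.
import torch
import torch.nn as nn


class ConceptBottleneckHead(nn.Module):
    def __init__(self, hidden_dim: int, concept_names: list[str], vocab_size: int):
        super().__init__()
        self.concept_names = concept_names
        # Map hidden states to interpretable concept scores...
        self.to_concepts = nn.Linear(hidden_dim, len(concept_names))
        # ...and predict the output only from those scores.
        self.to_logits = nn.Linear(len(concept_names), vocab_size)

    def forward(self, hidden: torch.Tensor):
        concept_scores = torch.sigmoid(self.to_concepts(hidden))
        logits = self.to_logits(concept_scores)
        # Returning the scores lets callers see which factors drove the output.
        return logits, dict(zip(self.concept_names, concept_scores.unbind(-1)))


# Illustrative dimensions and concept names only.
head = ConceptBottleneckHead(hidden_dim=768,
                             concept_names=["sentiment", "formality", "topic:medical"],
                             vocab_size=32_000)
logits, concepts = head(torch.randn(1, 768))
```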

Our past experience has shown that it is crucial to integrate interpretability, safety, and reliability constraints as part of the model development pipeline, and that these constraints can be satisfied without compromising downstream performance. With the new AI systems we are building, we can more easily identify the causes of erroneous outputs, detect when models latch onto spurious signals, and correct the models effectively. We aim to create a world where domain experts shift from merely ‘prompting’ AI to engaging in meaningful and truthful dialogue with AI systems.

A First Step: An Interpretable LLM at the Billion-Parameter Scale

To demonstrate that our approach is feasible, and that constraining models in this way does not sacrifice performance, we have developed an interpretable LLM with the properties described above.

We have shown that it is possible to train large-scale generative models that are engineered to be interpretable without sacrificing performance. We are excited to continue to scale these models to match current alternatives, expand the range of interpretability features we provide, and partner with select organizations to test the model. Reach out to us at: [email protected] if you would like to learn more.

Join Us

We have assembled a team of interpretability researchers and engineers with an excellent track record in the field. We are hiring machine learning engineers, full-stack developers, and researchers to join us. If you are interested in joining our team, reach out to [email protected].