Large language models (LLMs) have become very good at generating text and code, translating languages, and writing different kinds of creative content. However, the inner workings of these models are hard to understand, even for the researchers who train them.
This lack of interpretability poses challenges to using LLMs in critical applications that have a low tolerance for mistakes and require transparency. To address this challenge, Google DeepMind has released Gemma Scope, a new set of tools that sheds light on the decision-making process of Gemma 2 models.
Gemma Scope builds on top of JumpReLU sparse autoencoders (SAEs), a deep learning architecture that DeepMind recently proposed.
Understanding LLM activations with sparse autoencoders
When an LLM receives an input, it processes it through a complex network of artificial neurons. The values emitted by these neurons, known as “activations,” represent the model’s understanding of the input and guide its response.
By studying these activations, researchers can gain insights into how LLMs process information and make decisions. Ideally, we should be able to understand which neurons correspond to which concepts.
However, interpreting these activations is a major challenge because LLMs have billions of neurons, and each inference produces a massive jumble of activation values at each layer of the model. Each concept can trigger millions of activations in different LLM layers, and each neuron might activate across various concepts.
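As a rough illustration of what researchers are working with, activations can be captured by attaching a hook to one of the model's layers. The sketch below assumes a Hugging Face Transformers checkpoint of Gemma 2 2B; the chosen layer index and the way the layer output is indexed are illustrative and may differ across library versions.

```python
# Minimal sketch: capturing activations from one transformer layer with a
# PyTorch forward hook. The model name, layer index, and output indexing are
# assumptions about the Transformers implementation, not DeepMind's tooling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-2b"  # assumes access to the Gemma 2 2B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

captured = {}

def hook(module, inputs, output):
    # For a decoder layer, the first element of the output tuple is the
    # hidden state with shape (batch, seq_len, hidden_dim).
    captured["acts"] = output[0].detach()

layer_idx = 12  # arbitrary middle layer, for illustration only
handle = model.model.layers[layer_idx].register_forward_hook(hook)

tokens = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**tokens)
handle.remove()

print(captured["acts"].shape)  # e.g. (1, num_tokens, 2304) for Gemma 2 2B
```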
One of the leading methods for interpreting LLM activations is to use sparse autoencoders (SAEs). SAEs are models that help interpret LLMs by studying the activations in their different layers, an approach sometimes referred to as “mechanistic interpretability.” SAEs are usually trained on the activations of a single layer in a deep learning model.
The SAE tries to represent the input activations with a sparse set of features — typically a much wider dictionary in which only a few features fire for any given input — and then reconstruct the original activations from those features. By doing this repeatedly, the SAE learns to decompose the dense activations into a more interpretable form, making it easier to understand which features in the input are activating different parts of the LLM.
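The sketch below illustrates this basic recipe under simplified assumptions: an encoder maps a batch of captured activations into a wide feature dictionary, a decoder reconstructs the activations, and the training loss combines reconstruction error with an L1 sparsity penalty. The dimensions and penalty weight are placeholders, and Gemma Scope's actual SAEs differ in their activation function and training details, as described below.

```python
# Minimal sketch of a sparse autoencoder trained on LLM activations.
# Dimensions and the L1 penalty are illustrative; Gemma Scope's SAEs use a
# JumpReLU activation and different training details.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        # The feature dictionary is much wider than the activation space,
        # but the sparsity penalty keeps only a few features active per input.
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))  # sparse feature activations
        recon = self.decoder(features)             # reconstructed LLM activations
        return recon, features

sae = SparseAutoencoder(d_model=2304, d_features=16384)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, 2304)  # stand-in for a batch of captured activations
optimizer.zero_grad()
recon, features = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + sparsity
loss.backward()
optimizer.step()
```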
Gemma Scope
Previous research on SAEs mostly focused on studying tiny language models or a single layer in larger models. However, DeepMind’s Gemma Scope takes a more comprehensive approach by providing SAEs for every layer and sublayer of its Gemma 2 2B and 9B models.
Gemma Scope comprises more than 400 SAEs, which collectively represent more than 30 million learned features from the Gemma 2 models. This will allow researchers to study how different features evolve and interact across different layers of the LLM, providing a much richer understanding of the model’s decision-making process.
“This tool will enable researchers to study how features evolve throughout the model and interact and compose to make more complex features,” DeepMind says in a blog post.
Gemma Scope uses DeepMind’s new architecture called JumpReLU SAE. Previous SAE architectures used the rectified linear unit (ReLU) function to enforce sparsity. ReLU zeroes out all activation values below a certain threshold, which helps to identify the most important features. However, ReLU also makes it difficult to estimate the strength of those features because any value below the threshold is set to zero.
JumpReLU addresses this limitation by enabling the SAE to learn a different activation threshold for each feature. This small change makes it easier for the SAE to strike a balance between detecting which features are present and estimating their strength. JumpReLU also helps keep the number of active features low while increasing reconstruction fidelity, one of the endemic trade-offs in SAE design.
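The core idea can be sketched as follows, with the caveat that this omits the straight-through gradient estimators DeepMind uses to train the thresholds: each feature gets its own learnable threshold, values below it are zeroed out, and values above it pass through unchanged, so the feature's strength is preserved rather than shifted toward zero as with a plain ReLU.

```python
# Sketch of the JumpReLU idea: a learned threshold per feature. Values below
# the threshold are zeroed; values above pass through unchanged. The
# straight-through estimators used to train the thresholds are omitted here.
import torch
import torch.nn as nn

class JumpReLU(nn.Module):
    def __init__(self, d_features: int):
        super().__init__()
        # One learnable threshold per feature, log-parameterized to stay positive.
        self.log_threshold = nn.Parameter(torch.zeros(d_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        threshold = self.log_threshold.exp()
        # Plain ReLU would return torch.relu(x), which cannot separate
        # "is this feature present?" from "how strong is it?".
        # JumpReLU keeps x unchanged wherever it clears the per-feature threshold.
        return x * (x > threshold)
```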
Toward more robust and transparent LLMs
DeepMind has released Gemma Scope on Hugging Face, making it publicly available for researchers to use.
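As a rough sketch of how a researcher might fetch one of these SAEs, the snippet below uses the huggingface_hub client. The repository name and file path are assumptions about the release's layout, so the Gemma Scope model card on Hugging Face should be consulted for the actual structure and parameter format.

```python
# Sketch of downloading Gemma Scope SAE parameters from Hugging Face.
# The repo_id and filename below are illustrative placeholders; check the
# Gemma Scope model card for the real repository layout.
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="google/gemma-scope-2b-pt-res",                   # assumed repo id
    filename="layer_12/width_16k/average_l0_71/params.npz",   # assumed path
)
params = np.load(path)
print(list(params.keys()))  # e.g. encoder/decoder weights and thresholds
```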
“We hope today’s release enables more ambitious interpretability research,” DeepMind says. “Further research has the potential to help the field build more robust systems, develop better safeguards against model hallucinations, and protect against risks from autonomous AI agents like deception or manipulation.”
As LLMs continue to advance and become more widely adopted in enterprise applications, AI labs are racing to provide tools that can help them better understand and control the behavior of these models.
SAEs such as the suite of models provided in Gemma Scope have emerged as one of the most promising directions of research. They can help develop techniques to discover and block unwanted behavior in LLMs, such as generating harmful or biased content. The release of Gemma Scope can support work in areas such as detecting and mitigating LLM jailbreaks, steering model behavior, red-teaming SAEs, and discovering interesting properties of language models, such as how they learn specific tasks.
Anthropic and OpenAI are also working on their own SAE research and have released multiple papers in recent months. At the same time, scientists are exploring non-mechanistic techniques that can help better understand the inner workings of LLMs. One example is a recent technique developed by OpenAI that pairs two models to verify each other's responses, using a gamified process that encourages the models to produce answers that are verifiable and legible.