|
---
license: llama3.3
library_name: goodfire-llama-3.3-70b-instruct-sae-l50
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---
|
|
|
## Model Information |
|
|
|
The Goodfire SAE (Sparse Autoencoder) for [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) is an interpreter model designed to analyze and understand the model's internal representations. This SAE is trained on activations from layer 50 of Llama 3.3 70B and achieves an L0 of 121 (on average, 121 features are active per token), enabling the decomposition of complex neural activations into interpretable features. The model is optimized for interpretability and model-steering applications, allowing researchers and developers to gain insight into Llama's internal processing and behavior. As an open-source tool, it serves as a foundation for advancing interpretability research and improving control over large language models.
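
To make the idea of decomposing activations into features concrete, here is a minimal sketch of a standard ReLU sparse autoencoder applied to a batch of layer-50 hidden states. The layer width (8192) matches Llama 3.3 70B; the feature count, module layout, and parameter names are illustrative assumptions, not the exact architecture or state-dict keys of this release.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Illustrative SAE: encode a hidden state into sparse features, then reconstruct it."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # hidden state -> feature pre-activations
        self.decoder = nn.Linear(d_features, d_model)  # feature activations -> reconstruction

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU zeroes out most features, leaving a sparse activation vector.
        return torch.relu(self.encoder(x))

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)

    def forward(self, x: torch.Tensor):
        features = self.encode(x)
        return self.decode(features), features


# d_model is 8192 for Llama 3.3 70B; the feature count below is illustrative, not the released size.
sae = SparseAutoencoder(d_model=8192, d_features=65536)
hidden_states = torch.randn(16, 8192)           # a batch of layer-50 activations, one row per token
reconstruction, features = sae(hidden_states)
l0 = (features > 0).float().sum(dim=-1).mean()  # the L0 statistic: average active features per token
print(l0)
```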
|
|
|
__Model Creator__: [Goodfire](https://huggingface.co/Goodfire), built to work with [Meta's Llama models](https://huggingface.co/meta-llama) |
|
|
|
By using __Goodfire/Llama-3.3-70B-Instruct-SAE-l50__, you agree to the [LLAMA 3.3 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE)
|
|
|
|
|
## Intended Use |
|
|
|
By open-sourcing SAEs for leading open models, especially large-scale models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.
|
|
|
Our initial work with these SAEs has revealed promising applications in model steering, strengthening safeguards against jailbreaking, and building interpretable classifiers. We look forward to seeing how the research community builds upon these foundations and uncovers new applications.
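
As a toy illustration of the interpretable-classification direction mentioned above, the sketch below fits a sparse linear classifier on SAE feature activations, so each learned weight corresponds to a single labeled feature and the classifier can be inspected feature by feature. The data and dimensions here are synthetic and illustrative; in practice the inputs would be this SAE's feature activations (for example, mean-pooled over a prompt's tokens) paired with task labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for SAE feature activations: sparse, non-negative, one row per example.
# (The real SAE has far more features than used here.)
rng = np.random.default_rng(0)
n_examples, n_features = 200, 4096
X = np.maximum(rng.standard_normal((n_examples, n_features)) - 1.0, 0.0)
y = rng.integers(0, 2, size=n_examples)  # e.g. "jailbreak attempt" vs. benign, labels are illustrative

# L1 regularization keeps only a handful of features; since each feature carries a human-readable
# label, the fitted classifier can be read off weight by weight.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
top_features = np.argsort(-np.abs(clf.coef_[0]))[:10]  # indices of the most influential features
print(top_features)
```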
|
|
|
#### Feature labels |
|
|
|
To explore the feature labels, check out the [Goodfire Ember SDK](https://www.goodfire.ai/blog/announcing-goodfire-ember/), the first hosted mechanistic interpretability API. The SDK provides an intuitive interface for interacting with these features, allowing you to investigate how Llama processes information and even steer its behavior. You can explore the SDK documentation at [docs.goodfire.ai](https://docs.goodfire.ai).
|
|
|
## How to use |
|
|
|
View the notebook guide below to get started. |
|
|
|
<a href="https://colab.research.google.com/drive/1IBMQtJqy8JiRk1Q48jDEgTISmtxhlCRL" target="_blank">
  <img
    src="https://colab.research.google.com/assets/colab-badge.svg"
    alt="Open in Colab"
    width="200px"
    style="pointer-events: none;"
  />
</a>
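
If you prefer to work outside the notebook, the sketch below fetches the released checkpoint from the Hub and inspects its contents. The filename is an assumption; check the repository's file listing (or the notebook above) for the actual name and the expected state-dict layout.

```python
import torch
from huggingface_hub import hf_hub_download

# NOTE: the filename is an assumption -- see the repo's "Files and versions" tab for the real one.
checkpoint_path = hf_hub_download(
    repo_id="Goodfire/Llama-3.3-70B-Instruct-SAE-l50",
    filename="Llama-3.3-70B-Instruct-SAE-l50.pth",
)

# Load on CPU and print the parameter names and shapes to see the encoder/decoder layout.
state_dict = torch.load(checkpoint_path, map_location="cpu", weights_only=True)
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```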
|
|
|
## Training |
|
|
|
We trained our SAE on activations harvested from Llama-3.3-70B-Instruct on the [LMSYS-Chat-1M dataset](https://arxiv.org/pdf/2309.11998). |
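
For reference, the sketch below shows one way to harvest layer-50 hidden states from Llama-3.3-70B-Instruct with `transformers`. Whether the SAE reads the residual stream at exactly this point, and the batching and preprocessing used during training, are not specified in this card, so treat it as illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
LAYER = 50  # the layer this SAE was trained on

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Explain what a sparse autoencoder does.", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output, so index LAYER is the output of decoder layer 50.
layer_50_activations = outputs.hidden_states[LAYER]  # shape: (batch, seq_len, 8192)
print(layer_50_activations.shape)
```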
|
|
|
## Responsibility & Safety |
|
|
|
Safety is at the core of everything we do at Goodfire. As a public benefit corporation, we’re dedicated to understanding AI models to enable safer, more reliable generative AI. You can read more about our comprehensive approach to safety and responsible development in our detailed [safety overview](https://www.goodfire.ai/blog/our-approach-to-safety/).
|
|
|
Toxic features were removed prior to the release of this SAE. If you are a safety researcher who would like access to the features we’ve removed, please reach out to <a href="mailto:[email protected]">[email protected]</a>.