---
license: llama3.3
library_name: goodfire-llama-3.3-70b-instruct-sae-l50
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---
## Model Information
The Goodfire SAE (Sparse Autoencoder) for [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)
is an interpreter model designed to analyze and understand
Llama 3.3 70B's internal representations. This SAE is trained specifically on layer 50 of
Llama 3.3 70B and achieves an average L0 of 121 (the mean number of features active per token), enabling the decomposition
of complex neural activations into interpretable features. The model is optimized for interpretability and model-steering applications,
allowing researchers and developers to gain insight into the model's internal processing and behavior patterns.
As an open-source tool, it serves as a foundation for advancing interpretability research and enhancing control
over large language model operations.
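To make the decomposition concrete, the sketch below shows the standard encode/decode structure of a sparse autoencoder of this kind. It is a minimal illustration, not the exact layout of this checkpoint; the class name, activation function, and dictionary width are assumptions.

```python
import torch


class SparseAutoencoder(torch.nn.Module):
    """Minimal SAE sketch: dense activation -> sparse features -> reconstruction."""

    def __init__(self, d_model: int = 8192, d_features: int = 65536):
        # d_features is an assumed dictionary width used only for illustration.
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)  # activation -> feature space
        self.decoder = torch.nn.Linear(d_features, d_model)  # feature space -> activation

    def encode(self, activations: torch.Tensor) -> torch.Tensor:
        # A ReLU keeps only positively firing features, so most entries are zero;
        # the mean number of nonzeros per token is the L0 count (about 121 here).
        return torch.relu(self.encoder(activations))

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        return self.decoder(features)

    def forward(self, activations: torch.Tensor):
        features = self.encode(activations)
        return self.decode(features), features
```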
__Model Creator__: [Goodfire](https://huggingface.co/Goodfire), built to work with [Meta's Llama models](https://huggingface.co/meta-llama).

By using __Goodfire/Llama-3.3-70B-Instruct-SAE-l50__, you agree to the [LLAMA 3.3 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE).
## Intended Use
By open-sourcing SAEs for leading open models, especially large-scale
models like Llama 3.3 70B, we aim to accelerate progress in interpretability research.
Our initial work with these SAEs has revealed promising applications in model steering,
strengthening safeguards against jailbreaks, and interpretable classification methods.
We look forward to seeing how the research community builds upon these
foundations and uncovers new applications.
#### Feature labels
To explore the feature labels check out the [Goodfire Ember SDK](https://www.goodfire.ai/blog/announcing-goodfire-ember/),
the first hosted mechanistic interpretability API.
The SDK provides an intuitive interface for interacting with these
features, allowing you to investigate how Llama processes information
and even steer its behavior. You can explore the SDK documentation at [docs.goodfire.ai](https://docs.goodfire.ai).
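As a rough illustration of that workflow, feature search and steering with the SDK looks something like the sketch below. The client methods and signatures shown are assumptions based on the public documentation; check [docs.goodfire.ai](https://docs.goodfire.ai) for the current interface.

```python
# Hedged sketch of exploring feature labels with the Goodfire Ember SDK.
# Method names and signatures are assumptions drawn from docs.goodfire.ai;
# verify them against the current documentation before use.
import goodfire

client = goodfire.Client(api_key="YOUR_GOODFIRE_API_KEY")  # placeholder key
variant = goodfire.Variant("meta-llama/Llama-3.3-70B-Instruct")

# Search the labeled SAE features by natural-language description.
features = client.features.search("sarcasm", model=variant, top_k=5)
for feature in features:
    print(feature)

# Steer the model by nudging one of the discovered features before generating.
variant.set(features[0], 0.5)
response = client.chat.completions.create(
    messages=[{"role": "user", "content": "How was your day?"}],
    model=variant,
)
print(response)
```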
## How to use
View the notebook guide below to get started.
<a href="https://colab.research.google.com/drive/1IBMQtJqy8JiRk1Q48jDEgTISmtxhlCRL" target="_blank">
<img
src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open in Colab"
width="200px"
    style="pointer-events: none;"
/>
</a>
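If you would rather work with the raw weights directly, the sketch below shows one plausible way to download the checkpoint and encode layer-50 activations with it. The file name and state-dict keys are assumptions; the notebook above contains the exact loading code.

```python
# Hedged sketch: download the SAE checkpoint and encode layer-50 activations.
# The file name and state-dict keys below are assumptions; see the Colab
# notebook for the exact loading procedure.
import torch
from huggingface_hub import hf_hub_download

sae_path = hf_hub_download(
    repo_id="Goodfire/Llama-3.3-70B-Instruct-SAE-l50",
    filename="sae.pth",  # assumed file name; check the repository file listing
)
state_dict = torch.load(sae_path, map_location="cpu")
enc_w = state_dict["encoder.weight"]  # assumed key, shape [d_features, d_model]
enc_b = state_dict["encoder.bias"]    # assumed key, shape [d_features]

# Dummy activations standing in for layer-50 hidden states: [n_tokens, d_model]
activations = torch.randn(4, enc_w.shape[1])
features = torch.relu(activations @ enc_w.T + enc_b)
print("mean L0 per token:", (features > 0).float().sum(-1).mean().item())
```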
## Training
We trained our SAE on activations harvested from Llama-3.3-70B-Instruct on the [LMSYS-Chat-1M dataset](https://arxiv.org/pdf/2309.11998).
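For context, activation harvesting of this kind is typically done with a forward hook on the target decoder layer. The sketch below is a simplified illustration (a single prompt, no dataset streaming, batching, or sharding), not the actual training pipeline.

```python
# Hedged sketch of harvesting hidden states from decoder layer 50 with a
# forward hook; in practice this would run over the LMSYS-Chat-1M prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

captured = []

def save_activations(module, inputs, output):
    # Llama decoder layers return a tuple; the hidden states are element 0.
    captured.append(output[0].detach().cpu())

hook = model.model.layers[50].register_forward_hook(save_activations)
batch = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**batch)
hook.remove()

activations = torch.cat(captured)  # [batch, seq_len, d_model] inputs for SAE training
```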
## Responsibility & Safety
Safety is at the core of everything we do at Goodfire. As a public benefit
corporation, we’re dedicated to understanding AI models to enable safer, more reliable
generative AI. You can read more about our comprehensive approach to
safety and responsible development in our detailed [safety overview](https://www.goodfire.ai/blog/our-approach-to-safety/).
Toxic features were removed prior to the release of this SAE. If you are a safety researcher who
would like access to the features we’ve removed, you can reach out to <a href="mailto:[email protected]">[email protected]</a>.