---
license: llama3.3
library_name: goodfire-llama-3.3-70b-instruct-sae-l50
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---

## Model Information

The Goodfire SAE (Sparse Autoencoder) for [meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) is an interpreter model for analyzing the base model's internal representations. It is trained on activations from layer 50 of Llama 3.3 70B and achieves an L0 of 121, decomposing complex neural activations into sparse, interpretable features. The SAE is intended for interpretability and model-steering applications, giving researchers and developers insight into the model's internal processing and behavior. As an open-source tool, it serves as a foundation for advancing interpretability research and improving control over large language models.
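
For intuition about what the SAE computes, here is a minimal sketch of an encode/decode pass and how the L0 statistic is measured. The class layout, parameter names, and feature dimension are illustrative assumptions, not the released checkpoint's exact architecture (only the hidden size of 8192 matches Llama 3.3 70B).

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode activations into sparse features, then reconstruct.

    The feature dimension and parameter layout here are assumptions for illustration.
    """

    def __init__(self, d_model: int = 8192, d_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU keeps only positively activated features, producing a sparse code.
        return torch.relu(self.encoder(x))

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(x))

sae = SparseAutoencoder()
acts = torch.randn(4, 8192)                      # placeholder layer-50 activations
features = sae.encode(acts)
l0 = (features > 0).float().sum(dim=-1).mean()   # average number of active features per token
print(f"L0 ~ {l0.item():.1f}")                   # the released SAE reports an L0 of 121
```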

__Model Creator__: [Goodfire](https://huggingface.co/Goodfire), built to work with [Meta's Llama models](https://huggingface.co/meta-llama)

By using __Goodfire/Llama-3.3-70B-Instruct-SAE-l50__ you agree to the [LLAMA 3.3 COMMUNITY LICENSE AGREEMENT](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE)


## Intended Use

By open-sourcing SAEs for leading open models, especially large-scale 
models like Llama 3.3 70B, we aim to accelerate progress in interpretability research. 

Our initial work with these SAEs has revealed promising applications in model steering, 
enhancing jailbreaking safeguards, and interpretable classification methods. 
We look forward to seeing how the research community builds upon these 
foundations and uncovers new applications.

#### Feature labels

To explore the feature labels, check out the [Goodfire Ember SDK](https://www.goodfire.ai/blog/announcing-goodfire-ember/), 
the first hosted mechanistic interpretability API. 
The SDK provides an intuitive interface for interacting with these 
features, allowing you to investigate how Llama processes information 
and even steer its behavior. You can explore the SDK documentation at [docs.goodfire.ai](https://docs.goodfire.ai).

## How to use

View the notebook guide below to get started.

<a href="https://colab.research.google.com/drive/1IBMQtJqy8JiRk1Q48jDEgTISmtxhlCRL" target="_blank">
  <img
    src="https://colab.research.google.com/assets/colab-badge.svg"
    alt="Open in Colab"
    width="200px"
    style="pointer-events: none;"
  />
</a>
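
If you prefer to work outside the notebook, a minimal sketch of fetching the checkpoint with `huggingface_hub` is shown below. The filename is an assumption for illustration; check the repository's file listing for the actual artifact name.

```python
import torch
from huggingface_hub import hf_hub_download

# Download the SAE checkpoint from the Hub.
# NOTE: the filename below is an assumption; see the repo's "Files" tab for the real name.
ckpt_path = hf_hub_download(
    repo_id="Goodfire/Llama-3.3-70B-Instruct-SAE-l50",
    filename="Llama-3.3-70B-Instruct-SAE-l50.pth",
)

# Inspect the raw state dict; keys and shapes depend on how the SAE was saved.
state_dict = torch.load(ckpt_path, map_location="cpu")
for name, tensor in state_dict.items():
    print(name, tuple(tensor.shape))
```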

## Training

We trained our SAE on activations harvested from Llama-3.3-70B-Instruct on the [LMSYS-Chat-1M dataset](https://arxiv.org/pdf/2309.11998).
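
As a rough illustration of what harvesting activations involves, the sketch below registers a forward hook on decoder layer 50 and collects its hidden-state outputs for a single prompt. Model-loading details (dtype, device map, sharding) are simplified assumptions; in practice, activations are gathered at much larger scale across the LMSYS-Chat-1M conversations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.3-70B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Simplified loading; a 70B model realistically needs multi-GPU sharding or offloading.
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

collected = []

def hook(module, inputs, output):
    # Decoder layers return a tuple; the first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    collected.append(hidden.detach().to("cpu"))

# Hook layer 50 of the decoder stack (index assumed to match the SAE's training layer).
handle = model.model.layers[50].register_forward_hook(hook)

prompt = "Explain sparse autoencoders in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**inputs)

handle.remove()
print(collected[0].shape)  # (batch, seq_len, hidden_size)
```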

## Responsibility & Safety

Safety is at the core of everything we do at Goodfire. As a public benefit 
corporation, we’re dedicated to understanding AI models to enable safer, more reliable 
generative AI. You can read more about our comprehensive approach to 
safety and responsible development in our detailed [safety overview](https://www.goodfire.ai/blog/our-approach-to-safety/).

Toxic features were removed prior to the release of this SAE. If you are a safety researcher who 
would like access to the features we’ve removed, you can reach out at <a href="mailto:[email protected]">[email protected]</a> for access.