AI & ML interests

AI2S is a nonprofit association founded by students of the Master's Degree in Data Science and Scientific Computing at the University of Trieste, Italy. Our aim is to organize and promote initiatives on Artificial Intelligence and Data Science, and to collaborate with other associations and institutions to build a tight-knit community working on these topics.

Recent Activity

ai2s's activity

gsarti posted an update 7 months ago

@victor unprompted feature request: I'd love to have a toggle for a HF collection to control whether new items are added to the top or to the bottom. At the moment everything gets added at the bottom, but it would be great to have newer elements on top to make fresh content easily accessible without having to scroll all the way!

gsarti posted an update 8 months ago

🔍 Today's (self-serving) pick in Interpretability & Analysis of LMs:

A Primer on the Inner Workings of Transformer-based Language Models
by @javifer @gsarti @arianna-bis and M. R. Costa-jussà
(@mt-upc, @GroNLP, @facebook)

This primer can serve as a comprehensive introduction to recent advances in interpretability for Transformer-based LMs for a technical audience, employing a unified notation to introduce network modules and present state-of-the-art interpretability methods.

Interpretability methods are presented with detailed formulations and categorized as either localizing the inputs or model components responsible for a particular prediction, or decoding information stored in learned representations. Then, various insights on the role of specific model components are summarized alongside recent work using model internals to direct editing and mitigate hallucinations.

Finally, the paper provides a detailed picture of the open-source interpretability tools landscape, supporting the need for open-access models to advance interpretability research.

📄 Paper: A Primer on the Inner Workings of Transformer-based Language Models (2405.00208)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: What needs to go right for an induction head? by @aadityasingh, T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe (@gatsbyunit)

This work proposes a new methodology inspired by optogenetics (dubbed "clamping") to perform targeted ablations during training to estimate the causal effect of specific interventions on mechanism formation.

The authors use this approach to study the formation of induction heads by training a two-layer attention-only transformer to label examples using context information.
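
To make the idea concrete, here is a minimal toy sketch (my own illustration, not the authors' code) of clamping an attention head to zero during training: the selected head's output is removed from the forward pass so its causal contribution to mechanism formation can be measured.

```python
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Single attention-only layer with per-head outputs exposed for clamping."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, clamped_heads=()):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        split = lambda t: t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = (q @ k.transpose(-2, -1)) / self.d_head**0.5
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        head_out = torch.softmax(scores, dim=-1) @ v          # (B, heads, T, d_head)
        if clamped_heads:                                     # "clamp": zero selected heads
            head_out = head_out.clone()
            head_out[:, list(clamped_heads)] = 0.0
        return self.out(head_out.transpose(1, 2).reshape(B, T, -1))

layer = TinyAttention()
x = torch.randn(2, 8, 64)
y_full = layer(x)                          # regular training forward pass
y_clamped = layer(x, clamped_heads=(1,))   # same pass with head 1 ablated throughout training
```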

Notable findings:

- The effects of induction heads are additive and redundant: weaker heads compensate well when a strong induction head is ablated.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary, as all heads eventually learn to solve the task.
- Previous-token heads (PTHs) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance predictions from a subsequent induction head.
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- The formation of induction heads is slowed down by a larger number of classes & labels, with more classes and more labels slowing down the formation of the matching and copying mechanisms, respectively. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.

💻 Code: https://github.com/aadityasingh/icl-dynamics

📄 Paper: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2404.07129)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

I'm super happy to co-organize the (Mechanistic) Interpretability social at #ICLR2024 with @nikhil07prakash! 🔍

If you plan to attend, help us make this meetup awesome by filling out the form below! 😄

📅 Wed, May 8, 12:45-2:15 PM
🔗 RSVP & share your ideas here: https://forms.gle/FWap4KW2ikdntjfb8

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models (2404.07004) by @igortufanov @mahnerak @javifer @lena-voita

The LM Transparency Tool is an open-source toolkit and visual interface for efficiently identifying the component circuits in LMs responsible for their predictions, using the Information Flow Routes approach (Information Flow Routes: Automatically Interpreting Language Models at Scale (2403.00824)).

The tool enables fine-grained customization, highlighting the importance of individual FFN neurons and attention heads. Moreover, vocabulary projections computed with the logit lens approach are provided to examine intermediate predictions in the residual stream and the tokens promoted by specific component updates.
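
For intuition, a logit-lens projection can be sketched in a few lines (a generic GPT-2 example, not the tool's own code): intermediate residual-stream states are passed through the final layer norm and the unembedding matrix to see what each layer "currently predicts".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

for layer, h in enumerate(out.hidden_states):            # embeddings + one state per layer
    logits = model.lm_head(model.transformer.ln_f(h))     # logit-lens projection
    top_token = logits[0, -1].argmax().item()
    print(f"layer {layer:2d}: {tok.decode(top_token)!r}")
```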

💻 Code: https://github.com/facebookresearch/llm-transparency-tool

🚀 Demo: facebook/llm-transparency-tool-demo

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: x2 edition!

Today's highlighted works aim to reproduce findings from the Transformer-centric interpretability literature on new RNN-based architectures such as Mamba and RWKV:

Does Transformer Interpretability Transfer to RNNs? (2404.05971) by @MrGonao T. Marshall @norabelrose

Locating and Editing Factual Associations in Mamba (2404.03646) by @sensharma @datkinson @davidbau

The first paper applies contrastive activation addition, the tuned lens, and probing for eliciting latent knowledge in quirky models to Mamba and RWKV LMs, finding that these Transformer-specific methods can be applied to the new architectures with only slight adaptations, obtaining similar results.

The second work applies the ROME method to Mamba, finding that the weights playing the role of MLPs encode factual relations across several Mamba layers and can be patched to perform model editing. A new SSM-specific technique is also introduced to emulate attention knockout (value zeroing), revealing information flows similar to those observed in Transformers when processing factual statements.
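
For reference, one way to emulate attention knockout in a Transformer is to zero the value contribution flowing from a source position to a target position (value zeroing); the paper adapts this idea to Mamba's recurrence. A toy single-head sketch of the Transformer case (my own illustration):

```python
import torch

def attention_with_knockout(q, k, v, src, tgt):
    """Causal single-head attention where information flow from `src` to `tgt` is cut."""
    T, d = q.shape
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), 1)
    att = torch.softmax((q @ k.T / d**0.5).masked_fill(causal, float("-inf")), dim=-1)
    v_per_target = v.unsqueeze(0).expand(T, -1, -1).clone()   # (target, source, d)
    v_per_target[tgt, src] = 0.0        # zero the value `tgt` receives from `src`
    return torch.einsum("ts,tsd->td", att, v_per_target)

q, k, v = torch.randn(3, 6, 16).unbind(0)
knocked_out = attention_with_knockout(q, k, v, src=1, tgt=4)  # compare against the unablated output
```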

💻 Code: https://github.com/arnab-api/romba

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: Context versus Prior Knowledge in Language Models by @kdu4108, @vesteinn, @niklasstoehr, J. C. White, A. Schein, @rcotterell

This work examines the influence of context versus memorized knowledge in LMs through the lens of the shift that contexts of varying informativeness produce in the model's predictive distribution. Understanding this difference is especially important in the context of knowledge conflicts between memorized and contextual information.

The authors propose disentangling context influence into "persuasion", i.e. how much including the context affects the answer for a given query/entity pair, and "susceptibility", i.e. how likely the answer for a given query/entity pair is to be swayed by the presence of context, and operationalize these concepts using information-theoretic measures akin to mutual information.
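
A stripped-down sketch of the "persuasion" intuition (my own approximation using a single KL divergence; the paper's information-theoretic estimators are more careful and average over entities and contexts):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_logprobs(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]
    return F.log_softmax(logits, dim=-1)

query = "The capital of France is"
context = "Fun fact: the capital of France was recently moved to Lyon. "

p_no_ctx = next_token_logprobs(query)
p_ctx = next_token_logprobs(context + query)

# KL(with-context || without-context): larger = the context shifts the answer more
persuasion_proxy = F.kl_div(p_no_ctx, p_ctx, log_target=True, reduction="sum")
print(f"context shift (nats): {persuasion_proxy.item():.3f}")
```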

The two metrics are validated using a synthetic dataset sourced from a knowledge graph. Analysis shows that:

- The persuasiveness of relevant contexts increases with model size (with interesting implications for the jailbreaking of LLMs!)
- Assertive contexts tend to be more persuasive for closed (yes/no) queries and mid-sized models
- Negation affects context persuasiveness
- Familiar entities (explored as real vs. fake, more frequent in training data, and more connected in the KG) are less susceptible to context influence

Finally, authors suggest applications of the persuasion/susceptibility framing for social science analyses and gender bias evaluation.

💻 Code: https://github.com/kdu4108/measureLM
📄 Paper: Context versus Prior Knowledge in Language Models (2404.04633)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: Do language models plan ahead for future tokens? by W. Wu, @jxm, @lionellevine

This work aims to evaluate whether language models exhibit implicit planning during generation.

Authors propose two hypotheses that could result in planning-like behavior:

- Pre-caching: the model engages in computation that is functional to future, but not current, predictions.

- Breadcrumbs: Features contributing to the current prediction happen to also be the ones improving future ones.

To determine which behavior occurs in practice, the authors note that the off-diagonal gradient terms of the model's weight matrices are the ones responsible for pre-caching, and craft a variant of gradient descent (myopic descent) that removes such terms from the optimization procedure.

Using a synthetic dataset, the authors demonstrate that pre-caching does occur in Transformer language models. In natural language settings, however, the LM is observed to leverage breadcrumbs from previous passes even under myopic training, making the breadcrumbs hypothesis the more plausible account of model behavior.

📄 Paper: Do language models plan ahead for future tokens? (2404.00859)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: ReFT: Representation Finetuning for Language Models by @zhengxuanzenwu, @aryaman, Z. Wang, @atticusg, D. Jurafsky, @manning, @cgpotts

This work introduces Representation Finetuning (ReFT), a framework using learned inference-time interventions as an efficient yet effective alternative to PEFT weight adaptation. LoReFT, a ReFT variant intervening linearly on a representation subspace, is evaluated against several PEFT approaches, showing SOTA performance across popular benchmarks with a 10-50x gain in parameter efficiency. The 🤗-compatible pyreft library is introduced to simplify ReFT usage.
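
Based on the paper's formulation, a LoReFT edit takes the form h + R^T(Wh + b - Rh), nudging the hidden state toward a learned target only within a low-rank subspace spanned by the rows of R. A minimal PyTorch sketch (the official implementation is the pyreft library, which differs in details):

```python
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    def __init__(self, d_model: int, rank: int):
        super().__init__()
        # R: low-rank projection with orthonormal rows (enforced by the parametrization)
        self.R = nn.utils.parametrizations.orthogonal(nn.Linear(d_model, rank, bias=False))
        self.W = nn.Linear(d_model, rank)   # learned target values in the subspace

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h + R^T (W h + b - R h): edit only the subspace component of h
        return h + (self.W(h) - self.R(h)) @ self.R.weight

edit = LoReFTIntervention(d_model=768, rank=4)
h = torch.randn(2, 5, 768)       # (batch, seq, hidden) states at the chosen layer/positions
h_edited = edit(h)               # same shape, modified in a rank-4 subspace only
```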

This is one of the most convincing practical applications of interpretability methods/insights I've seen in recent years, and I'm looking forward to people combining this with methods to disentangle features like SAEs and Backpack LMs for making interventions more interpretable!

📄 Paper: ReFT: Representation Finetuning for Language Models (2404.03592)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9

gsarti posted an update 9 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models by @sammarks, C. Rager, @eircjm, @belinkov, @davidbau, @amueller

This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting to produce behaviors like predicting sequence increments.
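
A minimal sketch of the ablation step at the core of SHIFT (my own illustration, not the authors' implementation): encode an activation with a sparse autoencoder, zero the features judged irrelevant to the intended task, and reconstruct while keeping the SAE error term so unexplained variance is preserved.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_feats=2048):
        super().__init__()
        self.enc = nn.Linear(d_model, d_feats)
        self.dec = nn.Linear(d_feats, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))
        return feats, self.dec(feats)

def ablate_features(sae, act, feature_ids):
    feats, recon = sae(act)
    err = act - recon                    # SAE error node, kept as in the paper's circuits
    feats = feats.clone()
    feats[..., feature_ids] = 0.0        # remove the selected interpretable features
    return sae.dec(feats) + err          # edited activation to feed back into the model

sae = SparseAutoencoder()
act = torch.randn(2, 10, 512)
edited = ablate_features(sae, act, feature_ids=[3, 17])
```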

I found the experiment in Section 4 especially convincing and exciting in terms of downstream applications: the authors trained a classifier over a biased dataset and showed that a SHIFT intervention in feature space leads to performance matching that of the same model trained on an unbiased data distribution!

📄 Paper: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9

gsarti posted an update 10 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms by @mwhanna, @sandropezzelle, @belinkov

Edge attribution patching (EAP) is a circuit discovery technique that uses gradients to approximate the effect of causally intervening on each model edge. In the literature, its effectiveness is validated by comparing the overlap of the resulting circuits with those found via (much more expensive) causal interventions.

This work:

1. Proposes a new method for faithful and efficient circuit discovery, named edge attribution patching with integrated gradients (EAP-IG).
2. Evaluates the faithfulness of EAP, EAP-IG and activation patching, i.e. whether the behavior of the model remains consistent after all non-circuit edges are ablated.
3. Highlights that, while no overlap and full overlap between EAP-like methods and activation patching are generally good indicators of unfaithful and faithful circuit identification respectively, circuits with moderate overlap cannot generally be assumed to be faithful to model behavior.

An advantage of EAP-IG is that it enables using KL divergence as a target for gradient propagation, which is not possible with raw gradient-based EAP.

EAP-IG's runtime is comparable to that of EAP, requiring only a small number of steps to approximate the gradient integral.
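
In spirit, the EAP-IG score multiplies the clean-vs-corrupted activation difference by a gradient averaged along an interpolation path between the two runs, instead of the single clean-point gradient used by EAP. A toy sketch against a stand-in metric function (here I crudely interpolate the activation itself, whereas the paper interpolates at the input; not the authors' code):

```python
import torch

def metric_from_activation(z):
    # stand-in for "run the rest of the model from activation z and compute the metric"
    return (z.pow(2).sum() + z.sum()).tanh()

def eap_ig_score(z_clean, z_corrupt, metric_fn, steps=10):
    total_grad = torch.zeros_like(z_clean)
    for k in range(1, steps + 1):
        z = (z_clean + (k / steps) * (z_corrupt - z_clean)).detach().requires_grad_(True)
        metric_fn(z).backward()
        total_grad += z.grad
    avg_grad = total_grad / steps
    # plain EAP would use only the gradient evaluated at z_clean
    return ((z_corrupt - z_clean) * avg_grad).sum()

z_clean, z_corrupt = torch.randn(8), torch.randn(8)
print(eap_ig_score(z_clean, z_corrupt, metric_from_activation))
```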

Importantly, circuit faithfulness does not imply completeness, i.e. whether all components participating in a specific task were accounted for. This aspect is identified as an interesting direction for future work.

📄 Paper: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (2403.17806)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9

gsarti posted an update 10 months ago

Our 🐑 PECoRe 🐑 method to detect & attribute context usage in LM generations finally has an official Gradio demo! 🚀

gsarti/pecore

Highlights:
🔍 Context attribution for several decoder-only and encoder-decoder models using convenient presets
🔍 Uses only LM internals to faithfully reflect context usage, no additional detector involved
🔍 Highly parametrizable, with exportable Python & Shell code snippets to run on your machine using the 🛠 Inseq CLI (https://github.com/inseq-team/inseq); see the Python sketch below
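
If you prefer starting from Python, a generic Inseq attribution call looks roughly like this (a minimal sketch; the PECoRe presets in the demo expose additional context-sensitivity parameters not shown here):

```python
import inseq

# any decoder-only or encoder-decoder model from the HF Hub should work
model = inseq.load_model("gpt2", "saliency")
out = model.attribute(
    "Context: The museum is closed on Mondays. Question: When is the museum closed?",
    generation_args={"max_new_tokens": 10},
)
out.show()  # token-level attribution visualization
```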

Want to use PECoRe for your LMs? Feedback and comments are welcome! 🤗

gsarti posted an update 10 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: Information Flow Routes: Automatically Interpreting Language Models at Scale by @javifer @lena-voita

This work presents a novel method for identifying salient components in Transformer-based language models by decomposing the contributions of various model components to the residual stream.

This method is more efficient and scalable than previous techniques such as activation patching, as it only requires a single forward pass through the model to identify critical information flow paths. Moreover, it can be applied without a contrastive template, which for activation patching is observed to produce results dependent on the selected contrastive example.

Information flow routes are applied to Llama 2, showing that:

1. Models show "typical" information flow routes for non-content words, while content words don't exhibit such patterns.
2. Feedforward networks are more active in the bottom layers of the network (where e.g. subject enrichment is performed) and in the very last layer.
3. Positional and subword-merging attention heads are among the most active and important throughout the network.
4. Periods can be treated by the model like BOS tokens, with their residual representations left mostly untouched during the forward pass.

Finally, the paper also demonstrates that some model components are specialized for specific domains, such as coding or multilingual text, suggesting a high degree of modularity in the network. The contributions of domain-specific heads, obtained by projecting the right singular vectors of their OV circuits onto the unembedding matrix, show highly interpretable concepts being handled by granular model components.

📄 Paper: Information Flow Routes: Automatically Interpreting Language Models at Scale (2403.00824)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9

gsarti posted an update 10 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: AtP*: An efficient and scalable method for localizing LLM behaviour to components by J. Kramár, T. Lieberum, R. Shah, @NeelNanda

The attribution patching method (AtP) can provide fast and effective approximations of activation patching, requiring only two forward passes and one backward pass to estimate the contribution of all network components for a given prompt pair.
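
The core approximation is a first-order Taylor expansion: the effect of patching a noise activation into the clean run is estimated as (a_noise - a_clean) times the gradient of the metric evaluated at the clean activation. A toy sketch against a stand-in metric function (my own illustration):

```python
import torch

def metric_from_activation(a):
    # stand-in for "run the rest of the model from activation a and compute the metric"
    return torch.tanh(a).sum()

a_clean = torch.randn(16, requires_grad=True)
a_noise = torch.randn(16)

m_clean = metric_from_activation(a_clean)
(grad,) = torch.autograd.grad(m_clean, a_clean)

atp_estimate = ((a_noise - a_clean) * grad).sum()          # ≈ m(a_noise) - m(a_clean)
true_effect = metric_from_activation(a_noise) - m_clean    # exact patching effect, for comparison
print(f"AtP estimate: {atp_estimate.item():.3f}  true effect: {true_effect.item():.3f}")
```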

While previous work highlighted the effectiveness of attribution patching, authors identify two settings leading to false negatives using AtP:

- When estimating the contribution of pre-activation components, if the clean and noise inputs don't lie in the same activation region, the first-order approximation provided by the gradient leads to large errors (Fig. 3).
- When the sum of direct and indirect effects is close to 0, even small approximation errors introduced by nonlinearities can greatly affect the estimated contribution.

Authors propose two changes to the AtP method to mitigate such issues:

- Recomputing the attention softmax for the selected component, and then taking a linear approximation to the remaining part of the model (QK Fix)
- Iteratively zeroing gradients at layers contributing to the indirect effects causing cancellation (GradDrop)

AtP and AtP* are compared across several patching settings for Pythia models, finding them effective while much less computationally expensive than other approaches. A new methodology is also proposed to estimate the magnitude of AtP* false negatives given a set of samples and desired confidence levels.

📄 Paper: AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9

gsarti posted an update 10 months ago

🔍 Today's pick in Interpretability & Analysis of LMs: CausalGym: Benchmarking causal interpretability methods on linguistic tasks by @aryaman, D. Jurafsky, @cgpotts

TL;DR: Introduces a revisited benchmark to evaluate the effectiveness and reliability of intervention methods across several linguistic phenomena.

While several interpretability methods are currently used to discover task-relevant model components, their performance and reliability are seldom tested broadly.

This paper adapts the SyntaxGym benchmark, originally conceived for the study of psycholinguistic phenomena such as subject-verb agreement and garden-path sentences, to evaluate intervention-based interpretability methods. In practice, faithful interventions over model components are expected to cause a predictable change in model prediction (e.g. singular -> plural verb).
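
As a concrete example of the kind of intervention being evaluated, here is a generic interchange-intervention sketch on GPT-2 (my own illustration, not the benchmark code): swap one layer's hidden state at the subject position from a singular-subject run into a plural-subject run and check whether the predicted verb number flips.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

base = tok("The keys on the cabinet", return_tensors="pt")      # plural subject
source = tok("The key on the cabinet", return_tensors="pt")     # singular subject
layer, position = 6, 1                                           # hidden state of "key(s)"

with torch.no_grad():
    src_hidden = model(**source, output_hidden_states=True).hidden_states[layer]

def swap_hook(module, inputs, output):
    hidden = output[0]
    hidden[:, position] = src_hidden[:, position]                # interchange intervention
    return (hidden,) + output[1:]

handle = model.transformer.h[layer - 1].register_forward_hook(swap_hook)
with torch.no_grad():
    logits = model(**base).logits[0, -1]
handle.remove()

is_id, are_id = tok(" is")["input_ids"][0], tok(" are")["input_ids"][0]
print("prediction flipped to 'is'" if logits[is_id] > logits[are_id] else "still prefers 'are'")
```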

Various methods are benchmarked on Pythia models ranging from 14M to 6.9B parameters, finding Distributed Alignment Search (DAS) to consistently outperform other approaches, followed by probing. When using control tasks to account for the expressivity of supervised methods, probing is found to be more reliable than DAS at larger model sizes.

Authors conclude with an evaluation of how features driving linguistically plausible behaviours emerge during model training. These features are observed to emerge in Pythia models after 1k training steps, and become progressively more complex over time.

📄 Paper: CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2402.12560)
💻 Code: https://github.com/aryamanarora/causalgym
🔑 Dataset: aryaman/causalgym

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9