Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs Paper • 2407.15549 • Published Jul 22, 2024
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability Paper • 2405.10927 • Published May 17, 2024