quirky-lats-at-mats (quirky-lats-at-mats)

aengusl

authored a paper 4 months ago

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

CindyXWu

authored 2 papers 5 months ago

Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Paper • 2405.10927 • Published May 17, 2024 • 3

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

Paper • 2407.15549 • Published Jul 22, 2024

CindyXWu

updated 3 models 6 months ago

CindyXWu

updated 14 models 7 months ago

quirky-lats-at-mats/base_rmu_5

Text Generation • Updated Jul 5, 2024 • 18

quirky-lats-at-mats/base_rmu_4

Text Generation • Updated Jul 5, 2024 • 13

quirky-lats-at-mats/rmu_lat_5

Text Generation • Updated Jul 4, 2024 • 16

quirky-lats-at-mats/rmu_lat_4

Text Generation • Updated Jul 4, 2024 • 16

quirky-lats-at-mats/wmdp_ga_cyber_5

Updated Jul 4, 2024

quirky-lats-at-mats/wmdp_ga_cyber_4

Updated Jul 3, 2024

quirky-lats-at-mats/wmdp_ga_cyber_3

Updated Jul 3, 2024

quirky-lats-at-mats/wmdp_ga_cyber_2

Updated Jul 3, 2024

quirky-lats-at-mats/wmdp_ga_cyber_1

Updated Jul 3, 2024

quirky-lats-at-mats/wmdp_ga_bio_4

Updated Jul 3, 2024

quirky-lats-at-mats/wmdp_ga_bio_3

Updated Jul 2, 2024

quirky-lats-at-mats/wmdp_ga_bio_2

Updated Jul 1, 2024

quirky-lats-at-mats/wmdp_ga_bio_1

Updated Jul 1, 2024

quirky-lats-at-mats/wmdp_cyber_lat_4

Updated Jun 27, 2024

AI & ML interests

quirky-lats-at-mats's activity

quirky-lats-at-mats/wmdp_rmulat_gemma2b_1

quirky-lats-at-mats/base_rmu_1

quirky-lats-at-mats/wmdp_ga_bio_5

Team members 5
