arxiv:2405.08366
Alex Makelov
amakelov
AI & ML interests
Interpretability
Recent Activity
authored a paper about 2 months ago
Towards Deep Learning Models Resistant to Adversarial Attacks
authored a paper 6 months ago
Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching
authored a paper 6 months ago
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Organizations
None yet