Sparse autoencoders are a technique for interpreting which concepts are represented in a model’s activations, and they have been a major focus of recent mechanistic interpretability work. In this talk, Neel will assess what we’ve learned over the past 1.5 years about how well sparse autoencoders work, the biggest problems with them, and what he sees as the next steps for the field.
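To ground the topic, here is a minimal sketch of a sparse autoencoder trained on model activations: an overcomplete linear encoder with a ReLU, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. This is an illustrative PyTorch example, not the specific setup discussed in the talk; all names and hyperparameters (d_model, d_latent, l1_coeff) are assumptions.

```python
# Minimal sparse autoencoder sketch (illustrative, PyTorch assumed).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Reconstructs activations through an overcomplete, sparsely-activating latent layer."""

    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, activations: torch.Tensor):
        # ReLU keeps only a small number of latent features active per input.
        latents = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(latents)
        return reconstruction, latents


def sae_loss(reconstruction, latents, activations, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes latents toward sparsity.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = latents.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity
```

The idea is that each latent feature, once trained, ideally corresponds to an interpretable concept in the model's activations.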
Neel runs the mechanistic interpretability team at Google DeepMind. Prior to this, he was an independent researcher and did mechanistic interpretability research at Anthropic under Chris Olah. Neel is excited about helping build the mechanistic interpretability community: he created the TransformerLens library, does far too much mentoring, and enjoys making educational materials.