Conor Houghton
The latest news (2024-09-09)

New preprint: Residual stream analysis with multi-layer SAEs

I am excited and lucky to be involved in this work, which is almost entirely due to my collaborators. Like everyone else, we are interested in how transformers work and, in particular, in how they represent and manipulate the features of language. It turns out that sparse autoencoders are a useful way to find out what these features are. So far this approach has used a different autoencoder on each layer of a transformer. However, the residual stream is often thought of as a sort of scratch pad: a representation of the input which gets acted on and updated across successive processing steps. If this is true, then it should be possible to train a single autoencoder on activations from every layer and see the same feature crop up, sometimes in one layer, sometimes in another. This is what we did, and it is roughly what we saw: while there seem to be some features that are layer-specific, others occur across different layers. Our preprint has graphs to show this and, more importantly, now that we know the approach works, we can turn to understanding what language looks like to a transformer!
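To make the idea concrete, here is a minimal sketch of what a single, layer-agnostic sparse autoencoder might look like. The architecture, names, and hyperparameters are illustrative assumptions on my part, not taken from the preprint; the point is only that one dictionary of features is trained on residual-stream vectors drawn from every layer at once.

```python
# A minimal sketch: one sparse autoencoder trained on residual-stream
# activations pooled across all transformer layers, instead of a
# separate SAE per layer. All sizes and coefficients are illustrative.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoding gives the (hopefully sparse) feature activations.
        features = torch.relu(self.encoder(x))
        return self.decoder(features), features


def train_step(sae, optimiser, acts, l1_coeff=1e-3):
    # acts: [batch, d_model] residual-stream vectors sampled from *all*
    # layers, so a single feature dictionary must serve every layer.
    recon, features = sae(acts)
    # Reconstruction error plus an L1 penalty encouraging sparsity.
    loss = ((recon - acts) ** 2).mean() + l1_coeff * features.abs().sum(-1).mean()
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()


sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(64, 512)  # stand-in for real residual-stream activations
print(train_step(sae, opt, acts))
```

With a model like this, tagging each activation by the layer it came from lets you ask, for any learned feature, at which layers it fires: features that fire at a single layer look layer-specific, while features that fire at several layers are the cross-layer ones described above.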