Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

Published:

Recommended citation: Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, Christopher Olah (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. https://transformer-circuits.pub/2023/monosemantic-features

Summary:
