Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
We identify the broad structure of a circuit associated with correctly predicting a gendered pronoun given the subject of a rhetorical question. We make progress towards identifying this circuit using existing tools, namely Conmy’s Automatic Circuit Discovery and Nanda’s Exploratory Analysis tools.
We present this report not only as a preliminary understanding of the broad structure of a gendered pronoun circuit, but also as a structured, re-implementable procedure (or perhaps just naive inspiration) for identifying circuits for other tasks in large transformer language models.
Further work is warranted to refine the proposed circuit and to better understand the associated human-interpretable algorithm.
Comments
Note dumb bug: the names dataset should not include the names Carol, Karen, Julie and Judy. These are each tokenized as 2 tokens (unlike the others, which are 1 token). Leaving this error in the notebook so the results match the report, but watch out for these token errors :')
I believe prepend_bos should also be False across the notebook to match the ACDC implementation. Again, leaving it as is in the notebook to match the report. This work was produced as part of a weekend hackathon, so beware of bugs :D
The two token name fact doesn't seem true? https://colab.research.google.com/drive/17pU4A_DHH6GczbCwoVQcuQAIspjGhkRV?usp=sh...
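One way to settle the token-count question for any names list is to count tokens per name directly. The snippet below is a minimal sketch, not the notebook's code: `split_by_token_count` and `toy_tokenize` are hypothetical names introduced here, and the toy tokenizer is only a stand-in so the example runs without downloading a model. The default leading-space prefix is an assumption about how names appear mid-sentence in the prompts; GPT-2's BPE can split the same name differently with and without it.

```python
from typing import Callable, Iterable, List, Tuple


def split_by_token_count(
    names: Iterable[str],
    tokenize: Callable[[str], List],
    prefix: str = " ",
) -> Tuple[List[str], List[str]]:
    """Return (single_token_names, multi_token_names).

    `tokenize` is any callable mapping a string to a list of tokens.
    `prefix` is prepended to each name before tokenizing (assumption:
    names follow a space in the prompt).
    """
    single, multi = [], []
    for name in names:
        n_tokens = len(tokenize(prefix + name))
        (single if n_tokens == 1 else multi).append(name)
    return single, multi


# Hypothetical stand-in tokenizer, for illustration only. It splits any
# word longer than five characters into two pieces. It is NOT GPT-2's BPE.
def toy_tokenize(text: str) -> List[str]:
    word = text.strip()
    return [word] if len(word) <= 5 else [word[:5], word[5:]]


single, multi = split_by_token_count(["Mary", "John", "Jennifer"], toy_tokenize)
print(single, multi)
```

To check the real dataset, pass the GPT-2 tokenizer's encoding function in place of `toy_tokenize`, and make sure it does not add a BOS token to the count (relevant to the prepend_bos note above). Trying both `prefix=" "` and `prefix=""` may show whether the disagreement comes from the leading space.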
Impressive work!