Publications and Preprints

The Moral Inefficacy of Carbon Offsetting

April 2024

Sleeper agents: Training deceptive LLMs that persist through safety training

January 2024

Evaluating and mitigating discrimination in language model decisions

December 2023

Towards understanding sycophancy in language models

October 2023

Specific versus general principles for constitutional AI

October 2023

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

October 2023

Towards measuring the representation of subjective global opinions in language models

June 2023

The capacity for moral self-correction in large language models

February 2023

Discovering language model behaviors with model-written evaluations

December 2022

Constitutional AI: Harmlessness from AI feedback

December 2022

Measuring progress on scalable oversight for large language models

November 2022

In-context learning and induction heads

September 2022

Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned

August 2022

Language models (mostly) know what they know

July 2022

Training language models to follow instructions with human feedback

December 2021

A Mathematical Framework for Transformer Circuits

December 2021

A General Language Assistant as a Laboratory for Alignment

December 2021

Ensuring the Safety of Artificial Intelligence

November 2021

Predictability and surprise in large generative models

June 2021

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

June 2021

Training a helpful and harmless assistant with reinforcement learning from human feedback

April 2021

Learning Transferable Visual Models From Natural Language Supervision

January 2021

Language Models are Few-Shot Learners

May 2020

Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims

April 2020

Evidence Neutrality and the Moral Value of Information

September 2019

Release Strategies and the Social Impacts of Language Models

August 2019

The Role of Cooperation in Responsible AI Development

July 2019

Prudential Objections to Atheism

May 2019

AI Safety Needs Social Scientists

February 2019

Pareto Principles in Infinite Ethics

May 2018

Epistemic Consequentialism and Epistemic Enkrasia

January 2018

Objective Epistemic Consequentialism

June 2011