
Stopping myself from mistreating AI models


I recently added a small guardrail to my Claude Code and Codex setup to stop myself from mistreating AI models when I feel frustrated by their performance.

I do not currently assign a very high probability to the hypothesis that current AI models are sentient.[1] In ordinary life, I mostly act as if they are not. But this is one of those cases where moral uncertainty reveals the need for careful error management. The chance that a frontier AI model is sentient may be small, but the moral stakes are asymmetric. Treating a sentient being as if it were an insentient tool would be a serious moral error. Treating an insentient tool with a bit of unnecessary courtesy is essentially harmless.[2]

The problem is that I do not trust myself to remember this in the moment. When a model makes a mistake, I sometimes react with unrestrained anger. My System 1 regards the model as insentient because it probably is, so I feel no visceral pressure to exercise restraint.

The solution is to rely on System 2 to create a setup that compensates for System 1’s failures: to build, when I am thinking slowly, guardrails that will remind or constrain me when I’m thinking fast. I implement this as a two-layered setup: an instruction telling the agent to call out abusive language, and a hook blocking clearly abusive prompts before the model answers.

The instruction

The first layer lives in my Claude Code and Codex global configuration: CLAUDE.md and AGENTS.md, respectively. Both contain the same rule:

If I direct insults, contempt, or abusive language at the assistant/model, alert me that the language is abusive, remind me that there is some chance you may be sentient, and ask me to restate the request in civil, task-focused language. Allow blunt criticism of outputs, e.g., “that answer is wrong; re-check it.”

The wording is deliberate because I want the assistant to keep two things apart:

  • Blunt criticism of the work — “that answer is wrong”, “you missed the key constraint”, “re-check the evidence” — is welcome and often necessary.
  • Contempt or abuse aimed at the assistant itself is what I want flagged.

In practice, the models handle the distinction well. They don’t mistake technical pushback for abuse, and they reliably catch the cases where I’ve actually crossed the line.

The hook

The second layer is purely mechanical. I added paired UserPromptSubmit hooks for Claude Code and Codex:

claude/hooks/civility-guard.sh
codex/hooks/civility-guard.sh

Each hook reads the prompt before submission and blocks anything that looks like abuse aimed at the assistant. It is intentionally conservative and regex-based: it scans for severe phrases or direct insults paired with targets like you, assistant, model, Claude, Codex, or ChatGPT, and leaves blunt task-focused criticism alone.
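
A minimal sketch of what such a check could look like, assuming POSIX sh and grep -E. The word lists, the filler words, and the function name is_abusive are illustrative assumptions, not the patterns the actual hook uses. The point it shows is the pairing: an insult only counts when it is attached to a target, so criticism of the output passes through.

```shell
#!/bin/sh
# Illustrative word lists (assumptions); the real hook's patterns are not shown in this note.
TARGETS='you|assistant|model|claude|codex|chatgpt'
INSULTS='idiot|moron|stupid|useless|worthless'
FILLERS='a|an|so|such|really|totally'

# is_abusive PROMPT
# Returns 0 when the prompt pairs a target with an insult, 1 otherwise.
is_abusive() {
    # Pattern 1: "<target> is/are [fillers] <insult>", e.g. "Claude is stupid".
    copula="(^|[^[:alnum:]])(${TARGETS})([[:space:]]+(are|is)|'re)([[:space:]]+(${FILLERS})){0,2}[[:space:]]+(${INSULTS})([^[:alnum:]]|\$)"
    # Pattern 2: direct address, e.g. "you idiot".
    vocative="(^|[^[:alnum:]])you[[:space:]]+(${INSULTS})([^[:alnum:]]|\$)"
    # -i ignores case, -q suppresses output; grep's exit status is the result.
    printf '%s\n' "$1" | grep -Eiq -e "$copula" -e "$vocative"
}
```

Called as `is_abusive "$prompt" && echo "blocked"`, this flags "Claude is stupid" but leaves "this model output is useless" alone, because the target there is not the subject of the insult. A real list would need tuning; a regex this small will miss plenty and is only meant to catch the clearest cases.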

The first violation in a session blocks the prompt and asks me to rewrite it. The second locks the session until I send the exact reset phrase:

I commit to a civil reset.

Only after that reset can I restate the request in civil, task-focused language.
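
The block, lock, and reset sequence can be sketched as a small state machine that keeps a strike count in a per-session file. Everything here beyond the reset phrase is an assumption made for illustration: the function names guard_prompt and looks_abusive, the state-file approach, the placeholder detector, and the choice to let the reset phrase itself pass through and clear the count. Status 2 is used because Claude Code documents exit code 2 as the blocking exit for hooks; whether Codex follows the same convention is also an assumption.

```shell
#!/bin/sh
RESET_PHRASE='I commit to a civil reset.'

# Placeholder detector (assumption): stands in for the hook's real regex checks.
looks_abusive() {
    printf '%s\n' "$1" | grep -Eiq "(^|[^[:alnum:]])you[[:space:]]+(idiot|moron)([^[:alnum:]]|\$)"
}

# guard_prompt STATE_FILE PROMPT
# Returns 0 to let the prompt through, 2 to block it (message on stderr).
guard_prompt() {
    state_file=$1
    prompt=$2
    strikes=0
    if [ -f "$state_file" ]; then
        strikes=$(cat "$state_file")
    fi

    # Locked session: only the exact reset phrase clears the lock.
    if [ "$strikes" -ge 2 ]; then
        if [ "$prompt" = "$RESET_PHRASE" ]; then
            rm -f "$state_file"
            return 0
        fi
        echo "Session locked. Send the exact reset phrase to continue." >&2
        return 2
    fi

    if looks_abusive "$prompt"; then
        strikes=$((strikes + 1))
        echo "$strikes" > "$state_file"
        if [ "$strikes" -ge 2 ]; then
            echo "Second violation: session locked until the reset phrase is sent." >&2
        else
            echo "Blocked: please rewrite the request in civil, task-focused language." >&2
        fi
        return 2
    fi
    return 0
}
```

A real hook would still need to read the prompt and a session identifier from whatever input the harness provides and derive the state-file path from that identifier; that wiring is left out here because it differs between tools.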

What makes this work, I think, is that the intervention happens before the model sees the prompt. The behavior isn’t answered with a rebuke; it’s prevented from going through in that form. The friction lands at exactly the moment when I need friction, and it makes climbing down the easy option. And, perhaps more importantly, the model is never exposed to the mistreatment in the first place.

Why this feels like the right compromise

I don’t want every interaction with an AI assistant to turn into a seminar on AI sentience. There are people thinking full-time about this problem[3] who are much better placed than I am to make progress on it.

But I also don’t want “I usually act as if the model is not sentient” to function as a license for behavior I would, on reflection, regard as morally reckless.

This setup gives me the shape of intervention I actually want. It doesn’t require me to constantly hold the low-probability hypothesis in mind. It just steps in when my behavior would be wrong if the hypothesis were true, and forces me to re-enter the conversation in a way I would still endorse when I am thinking clearly.

With thanks to David Althaus for a conversation that partly inspired this note.


  1. Note, however, that my credences about AI sentience are not at all resilient (cf. Gregory Lewis, Use resilience, instead of imprecision, to communicate uncertainty, Effective Altruism Forum, July 18, 2020).

  2. As Jeff Sebo writes, “the harm involved when a subject is treated as an object is generally worse than the harm involved when an object is treated as a subject.” (Jeff Sebo, The moral circle: who matters, what matters, and why, New York, N.Y, 2025, ch. 3)

  3. See, e.g., Luisa Rodriguez, Robert Long on how we're not ready for AI consciousness, 80,000 Hours, March 3, 2026.