<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>AI journal · Pablo Stafforini</title><link>https://stafforini.com/ai/</link><description>An informal study diary about using AI tools better.</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Sat, 18 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://stafforini.com/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>The Codex rescue: when the handoff prompt is the product</title><link>https://stafforini.com/ai/the-codex-rescue-when-the-handoff-prompt-is-the-product/</link><pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate><guid>https://stafforini.com/ai/the-codex-rescue-when-the-handoff-prompt-is-the-product/</guid><description>&lt;![CDATA[For two weeks I had been trying to fix a PnL bug in a prediction-market analysis pipeline. The reported PnL disagreed with the Polymarket ground truth for many traders, and the gaps were not small: for one address the scored figure was `$32.8M` against an actual `-$8.3M`. I had been debugging with Claude Code the whole time, and the sessions had turned into an expensive merry-go-round: a plausible hypothesis, a partial fix, a fresh anomaly, another hypothesis.
At some point I gave up and switched to Codex. What I want to note is not that Codex is better — I am not convinced it generally is — but that the thing that actually worked was the handoff prompt, not the model.
## What I wrote {#what-i-wrote}
The first message I sent Codex was, in effect, an incident postmortem. It opened with:
> For the last two weeks, I've been trying to fix a recalcitrant bug […] I've been debugging with Claude Code, but Claude's performance has been atrocious. I am hoping that you can do better.
Then four sections, in this order:
1. **Why it's a full rebuild.** Pointed at the exact commit (`36269eb`) that bumped `PIPELINE_VERSION` from 1 to 2, quoted the runtime line where `_can_run_incremental()` logs `pipeline version mismatch (stored=1, current=2). Falling back to full rebuild.`, and noted that the bump was technically correct because `mint_cost` had changed formula.
2. **Per-step timing**, instrumented in the session I had just killed. A table of the twelve pipeline steps with their wall times: step 5 at `5h 15m`, step 6 at `4h 55m`, step 8 at `2h 33m`, step 9 still running past four hours, totalling more than `40h`.
3. **Architecture problem.** The pipeline had exactly two modes, incremental and full rebuild. A one-line semantic change in a mid-pipeline step forced everything to be redone, including the five hours of phantom-sell reconstruction that had nothing to do with `mint_cost`. I laid out which steps the change actually invalidated (5b, 6, maybe 9–12) and which it did not (5, 7, 8).
4. **Workaround built this session.** A per-trader verification script that reconstructed PnL directly from raw data in 30–60 seconds and bypassed the pipeline tables entirely.
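The version gate in point 1 is, in spirit, a one-bit decision. A minimal sketch of what that check looks like (names and the log message reconstructed from the quoted runtime line, not the actual pipeline code):

```python
# Hypothetical reconstruction of the all-or-nothing version gate:
# any mismatch, however narrow the semantic change, forces a full rebuild.
PIPELINE_VERSION = 2  # bumped from 1 when the mint_cost formula changed


def can_run_incremental(stored_version: int) -> bool:
    """Return True only when the stored pipeline version matches exactly."""
    if stored_version != PIPELINE_VERSION:
        print(
            f"pipeline version mismatch (stored={stored_version}, "
            f"current={PIPELINE_VERSION}). Falling back to full rebuild."
        )
        return False
    return True
```

The defect Codex pointed at is visible right here: the return type is a boolean, so there is no way to express "steps 5b and 6 are invalid, steps 5, 7 and 8 are not."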
Codex's first response was not a patch. It was a plan that reframed the bug. Instead of treating the 40-hour rebuild as a performance problem to optimise, it pointed at the full-vs-incremental dichotomy as the actual defect. From there the session converged quickly.
## What I think is going on {#what-i-think-is-going-on}
I don't think Codex had some special insight that Claude lacked. The two models are close enough in capability that I would not bet on one reliably outperforming the other on a bug like this. What changed was the prompt. My Claude sessions had grown organically — I would describe the next symptom, Claude would propose the next fix, and the accumulated context was a pile of partial hypotheses. When I sat down to write to Codex, the fact that I was switching models forced me to compress: I had to state, in one message, what was true. That reconstruction is what produced the timing table, the invalidation matrix, and the workaround inventory.
In other words: the handoff prompt is the product. If I had written that same prompt to Claude in a fresh session, I suspect the outcome would have been similar.
## What I'd do differently {#what-i-d-do-differently}
I want this to be a reflex, not a two-week-delayed rescue. Two concrete takeaways:
- When a debugging session starts to feel like a merry-go-round — three or four failed hypotheses in a row — that is the signal to stop and write the postmortem. Not "let me try one more thing," because trying one more thing is how you lose another two weeks.
- The postmortem template is basically what I sent Codex: _what's actually happening, measured; what invalidates what; what I've already built as a workaround; what the goal is._ The structure forces you to separate facts from guesses, and the agent you send it to — any agent — has an easier job.
Still an open question for me: is there a good way to get the current session's own transcript into the reset prompt automatically, rather than reconstructing it from scratch? That would remove the main cost of the "just start over" move.]]></description></item><item><title>An AI journal that reads its own logs</title><link>https://stafforini.com/ai/an-ai-journal-that-reads-its-own-logs/</link><pubDate>Sat, 18 Apr 2026 00:00:00 +0000</pubDate><guid>https://stafforini.com/ai/an-ai-journal-that-reads-its-own-logs/</guid><description>&lt;![CDATA[This post is the first entry in the journal, and it exists because a skill I wrote scanned my own session logs and suggested itself as a candidate. I want to note the setup briefly, because the move — treating session logs as a first-class artefact the agent can mine — has turned out to be more useful than I expected.
## The idea {#the-idea}
I've been running Claude Code and Codex heavily for months. Every session leaves a `.jsonl` transcript on disk: `~/.claude/projects/<slug>/<session-id>.jsonl` for Claude, `~/.codex/sessions/YYYY/MM/DD/rollout-*.jsonl` for Codex. I already download these locally with [agent-log](/notes/agent-log/) so I can grep through them. The thought was: if I'm going to keep a public diary about AI, the diary shouldn't be a blank page I stare at once a week. It should start from what I actually did.
So I wrote a Claude skill — `/ai-journal` — that:
1. Reads a "last run" timestamp from a small state file.
2. Scans both log trees for sessions modified since that cutoff.
3. Reads the promising ones in full, groups related sessions, and returns a ranked list of candidate write-ups.
4. Asks me which to draft, writes drafts into the site as org source files, and bumps the cutoff.
The first scan surfaced eight candidates from about 50 sessions. One of them was this skill. The loop closed neatly.
## What made it work {#what-made-it-work}
A few small choices mattered more than I expected.
**Separate mtime filtering from content timestamps.** The scan script filters session files by file mtime (so it only opens files touched since the last run) but reports each session's start time from the first user message in the transcript. This matters because Claude touches a lot of cached subagent files on every session startup, and those files contain old timestamps. If you filter only by mtime you get thousands of "sessions" from January; if you filter only by content timestamp you miss the ones that are actually active. Both filters together give you the right set.
**Delegate triage to a subagent.** Fifty sessions is already too many to read in the main context without drowning it. The skill spawns a subagent whose only job is to skim the TSV of session excerpts, sample the promising `.jsonl` files with `offset` / `limit` reads, and return 3–10 candidates in a fixed format. The main conversation never sees the raw transcripts; it only sees the shortlist.
**Ground every claim in the transcript.** The skill's instructions are explicit that posts must not invent details. If I need something that isn't in the session, the skill asks rather than fills in plausible-sounding text. This constraint is the difference between a diary and a content farm.
## What I'd do differently {#what-i-d-do-differently}
A few things I already want to revise after one run:
- The first draft of the skill told the agent to write markdown directly into `content/ai/`. That was wrong: my site is ox-hugo-driven, `content/` is gitignored, and the canonical source for every page is an org file under `~/My Drive/notes/`. So the skill now writes posts as org files under `~/My Drive/notes/ai-journal/` with `:EXPORT_HUGO_SECTION: ai` set on the heading, and the normal notes-export pipeline picks them up. Worth remembering: when you build a skill that ends in "commit a file somewhere," take one pass through the destination's actual publishing pipeline before you write the instructions.
- The scan script returns file mtime-matched sessions whose content timestamps are ancient (thousands of them, mostly subagent warmup stubs). The candidate triage subagent has to ignore them. A second-pass filter that drops sessions whose content start is more than, say, 24h before the cutoff would trim the list at source.
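The proposed second-pass filter is small enough to write down. A sketch, with the session dict shape (`content_start` as a Unix timestamp) assumed for illustration:

```python
# Drop sessions whose content start is more than 24h before the cutoff:
# these are the stale subagent warmup stubs whose files were merely touched.
DAY = 24 * 60 * 60


def second_pass(sessions, cutoff_ts):
    return [s for s in sessions if s["content_start"] >= cutoff_ts - DAY]
```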
## The meta-observation {#the-meta-observation}
What I notice from building this: local session logs are crucial because they turn "find the interesting thing I did last week" into an AI task. Without them, I'd be relying on memory, and the interesting things are precisely the ones I wouldn't have the context to remember.]]></description></item></channel></rss>