Getting AI to Improve Itself

How a custom handoff command and project diary made my AI collaboration compound over time.

Over the holidays I built a family app—meal planning, fitness tracking, household coordination. Ten days, about $50 in API costs, and it works well enough that my wife uses it too. I built it to test a theory: even as a non-engineer, building exactly what I need with AI is now better, faster, and cheaper than adopting professionally built products, across a massively expanded set of use cases.

This isn’t exactly controversial right now. But the reason it’s true isn’t just that building is easy—it’s that maintaining, extending, and improving what you build is easy too. The interesting part of my holiday project wasn’t the app. It was how easy it was to build a system that improves itself—and therefore keeps getting better at building and maintaining my apps over time.

I tried about a dozen new tools and strategies within my Claude Code-centric workflow, from prompting techniques to claude.md configurations to executor-evaluator patterns (which I highly recommend).

Two tactics really lifted my system: a custom /handoff command and an enforced project diary. Between the two, my agents started to work, learn, and improve together in a way that made my job easier and easier as we went.

The /handoff command

Until this project I mostly relied on Claude's built-in compaction. I knew it was faulty and degraded performance, but I didn't want to go through the effort of context curation and didn't trust myself to capture everything I needed. I resolved to change that in this project.

After a few sessions and some inconsistent git committing in between dashing to family dinners, I decided to create a custom /handoff command that I could run as a work session was winding down. It would:

  1. Run tests and commit changes — making sure we wrapped up work rather than leaving small threads hanging.
  2. Update the project diary — capturing decisions, learnings, and open questions to pair with the commit history (see next section), reducing the fear and downside of closing the session.
  3. Generate a prompt for the next session — what to pick up, what context matters, what might go wrong. This switched my default from preserving the session to handing off, so I restarted my context more frequently.
  4. Reflect on the session and suggest improvements to system instructions — based on what happened this session, what should change in the root and project claude.md, what hooks or skills to add, and so on.
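In Claude Code, a command like this can live as a markdown prompt file under `.claude/commands/`. A minimal sketch, assuming that layout (the step wording and file names here are mine, not the real command):

```markdown
<!-- .claude/commands/handoff.md (hypothetical sketch) -->
Wind down this work session:

1. Run the test suite and lint. Fix small failures, then commit with a
   descriptive message; leave no uncommitted threads hanging.
2. Append an entry to the project diary: decisions made, alternatives
   considered, learnings, and open questions.
3. Write a prompt for the next session: what to pick up, what context
   matters, and a short pre-mortem on what might go wrong.
4. Reflect on this session and propose concrete edits to the root and
   project claude.md, or new hooks/skills. Wait for approval before
   applying any of them.
```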

That last step did a lot of heavy lifting early. After one session where a sub-agent produced code with lint errors, the handoff suggested:

“Always run lint after integrating subagent code.”

After another where an evaluator agent caught a feature silently using mock data, it suggested:

“Periodic product-level reviews catch integration gaps that pass technical tests.”

These weren’t generic tips. They were specific learnings from my actual work and prompting habits, proposed back as durable rules. I’d review the suggestion, say “yep, do it,” and the instruction set would update.

The first few handoffs each produced several meaningful improvements. After ~5 rounds the suggestions tapered, and sessions usually ended with “the system is working well, no improvements suggested.” After that, a tweak surfaced only every 8-10 sessions or so. The system had learned my preferences and didn’t need to keep adjusting.

The handoff also helped my own context management. Each one ended with a prompt for the next session plus a pre-mortem on upcoming work. I could close my laptop mid-project and pick it up days later without the usual “where was I?” friction. Between the commits, the diary, and the handoff notes, neither the AI nor I lost track of decisions or goals.

The Project Diary

The project diary became a critical artifact throughout the build.

Git tells you what happened—files changed, commit messages. The diary captured why a decision was made, what alternatives we considered, what we learned that should change future work. It’s context capture.

| Git tells you | Diary tells you |
| --- | --- |
| What happened | Why it happened |
| Files changed | Tradeoffs considered |
| Commit message | What we learned |

Every session—including sub-agents—logged to the same diary. Some actual entries from my build:

December 25:

“Used recipe.url but type was recipe.source. Build failed immediately, preventing runtime error. TypeScript earned its keep.”
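In TypeScript terms, that failure mode looks something like this (the `Recipe` type below is a hypothetical reconstruction, not the app's real one):

```typescript
// Hypothetical shape of the recipe type; the real app's fields differ.
type Recipe = { title: string; source: string };

const recipe: Recipe = {
  title: "Lentil soup",
  source: "https://example.com/lentil-soup",
};

// const link = recipe.url;  // ← compile error: Property 'url' does not
//                           //   exist on type 'Recipe' — caught at build
const link = recipe.source;  // the field the type actually declares
```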

December 28:

“Chat.tsx had task mutations declared with underscore prefix (_createTaskMutation)—a common pattern to suppress ‘unused variable’ warnings. But they were NEVER called. The UI showed ‘Added: {task}’ confirmation cards without actually executing any API calls.”

Learning: Underscore-prefixed variables are a red flag for dead code. If you’re suppressing unused warnings, question why the code exists.
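A stripped-down TypeScript sketch of that pattern (the names are invented; the real Chat.tsx was more involved):

```typescript
// The underscore prefix silences "unused variable" lint warnings,
// which is exactly why the missing call went unnoticed.
const _createTask = (title: string): void => {
  // imagine an API call here
};

function handleMessage(title: string): string {
  // BUG: _createTask(title) is never invoked, yet we render a success card.
  return `Added: ${title}`;
}
```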

December 28:

“Pantry sync used void syncPantryItemToNotion(...) to suppress the promise. The tRPC endpoint returned success before knowing if Notion write succeeded. Users saw ‘Added!’ but items never appeared.”

Learning: Fire-and-forget is only appropriate when failure is truly non-critical AND users have no expectation of immediate results.
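A minimal sketch of the difference in plain TypeScript (`syncToNotion` is a stand-in for the real Notion call, not the app's actual function):

```typescript
// Stand-in for the real Notion write; assume it can reject.
async function syncToNotion(item: string): Promise<void> {
  if (item.trim() === "") throw new Error("empty item");
}

// BUG: `void` discards the promise, so we report success before
// (and regardless of whether) the write completes.
function addItemFireAndForget(item: string): { ok: boolean } {
  void syncToNotion(item);
  return { ok: true };
}

// FIX: await the write and let failures surface to the caller.
async function addItem(item: string): Promise<{ ok: boolean }> {
  await syncToNotion(item);
  return { ok: true };
}
```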

The underscore-prefix bug and the fire-and-forget bug would have shipped without the diary. Writing down learnings forced reflection that I’d otherwise skip. The diary also tracked architectural decisions, so when a future session questions a past choice, the reasoning is there.

| Decision | Choice | Rationale | Revisit if |
| --- | --- | --- | --- |
| Component library | None (Tailwind only) | Simpler; no abstraction needed for a family app | UI complexity grows significantly |
| Error handling | Surface errors, never hide | App makes health/food recommendations; failures must be visible | Never |

With the diary in place, I stopped maintaining a detailed backlog or set of specs, and stopped worrying so much about capturing notes myself, because the diary made it easy to retrace what happened, where we were, and what needed revisiting. I knew I wanted to share learnings from this project with my team, and the diary made it dead simple to synthesize later on.

AI collaboration starts to feel like org design

Beyond the handoff, I started creating guides, skills, and custom commands in Claude Code. At first this felt overwhelming—so many different ways to achieve similar things. But continually refining them proved high-leverage, and I found my mindset shifting over time.

Eventually it stopped feeling like configuring a tool and started feeling like org design. My root instructions are company-wide policies. Project instructions are team norms. Guides are department playbooks. Custom commands are specialists I can call in for specific tasks. The project diary is institutional documentation and memory.

When I add a new guide or refine an instruction, I’m not tweaking settings. I’m deciding: what values apply everywhere? What rules are context-dependent? What skills get called in when? It’s the same set of questions you’d ask designing how a team works together.

The frame changed how I invest my time. Instead of optimizing individual prompts, I’m building infrastructure that compounds. Each project benefits from the last one’s learnings. Each session builds on previous sessions instead of starting fresh.

I suspect this is where things are heading. As models get more capable, the differentiator won’t be who has access to better AI or uses it more. It’ll be who builds better systems for their AIs to collaborate with each other and themselves.


I’m still refining this—particularly what belongs in the diary vs. what’s noise. What systems are you building to make AI collaboration improve over time?

AI as a self-improving team — LinkedIn Draft

Status: v2 — org design as throughline


After five handoffs, my AI stopped suggesting improvements. It had learned my preferences.

I built a family app over the holidays—meal planning, fitness tracking, household coordination. Ten days, $50 in API costs, works well enough that my wife uses it. But the interesting part wasn’t the app. It was realizing that working with AI had started to feel less like prompting and more like org design.

Root instructions are company-wide policies. Project instructions are team norms. Custom commands are specialists I call in for specific tasks. The project diary is institutional memory.

Once I saw it that way, I started designing accordingly. Two practices made the biggest difference.

The /handoff command

I created a command to run when a work session winds down. It commits changes, updates the project diary, generates a prompt for the next session, and reflects on what should change in my system instructions.

That last part did the heavy lifting. After one session where a sub-agent produced code with errors, the handoff suggested: “Always run lint after integrating subagent code.” After another where I caught a feature silently using mock data: “Periodic product-level reviews catch integration gaps that pass technical tests.”

These weren’t generic tips. They were learnings from my actual work, proposed back as durable rules. I’d review, approve, and the system would update itself.

After about five rounds, suggestions tapered. The system had learned how I work. Now a tweak surfaces every 8-10 sessions. The team had onboarded.

The project diary

Git tells you what happened. The diary captures why—what alternatives we considered, what we learned.

One entry from December 28: “The UI showed ‘Added!’ confirmation cards but the API calls were never executed. Users saw success; nothing happened.”

That bug would have shipped without the diary. Writing down learnings forced reflection I’d otherwise skip. It’s the same reason good teams do retrospectives—institutional memory compounds.

Infrastructure that compounds

The frame changed how I invest my time. Instead of optimizing individual prompts, I’m building systems that get better. Each project benefits from the last one’s learnings. Each session builds on previous sessions instead of starting fresh.

I suspect this is where things are heading. As models get more capable, the differentiator won’t be who uses AI more. It’ll be who builds better systems for AI to learn and collaborate—with each other, and with themselves.


Claude’s Notes (v2)

Character count: ~2,650 (down from ~3,770)

Key restructure:

  • Hook B lands the surprise, then org design frame appears in para 2—not saved for the end
  • “Root instructions are company-wide policies…” now sets up the tactics, rather than explaining them after
  • Handoff and diary sections reframed as org design practices (onboarding, retrospectives, institutional memory)
  • “The team had onboarded” callback reinforces the frame
  • Cut the redundant “controversial take” setup and verbose numbered list
  • Tighter close—same punch, fewer words