Hey friends,

How can we work effectively with AI? What’s the workflow and how does it scale? And ideally, it should compound. Every finished artifact—code, docs, analysis, decisions—becomes context for the next session. And each correction updates a config that reduces future errors. While I’m still learning, I’ve repeated my answers often enough that I’m writing them down here so the next time I’m asked I can share a link instead.

I appreciate you receiving this, but if you want to stop, simply unsubscribe.

👉 Read in browser for best experience (web version has extras & images) 👈

If you use AI regularly, you likely already apply many of these practices. Nonetheless, I believe the underlying principles apply broadly: provide good context, encode your taste as config, make verification easy, delegate bigger tasks, and close the loop. If a practice does not fit, adapt the principle and invent your own. Also notice, as you read, that none of this is specific to AI. It’s simply how you onboard and work with any new collaborator.

Context as infrastructure

Help models navigate your context. For example, all my code lives in

Connect models to your organization’s context. Models can benefit from organizational knowledge, which likely lives in Slack, Drive, Mail, etc. Most have MCPs for Claude Code, Cowork, Claude.ai. On top of these, I also maintain a

Onboard each new session like a new hire. With each new session, the model starts with a blank slate. Thus, it helps to treat the per-project

Build your memory layer. By default, models don’t remember what happened in the last session, so anything worth persisting should be written to disk. I split my memory layer into two buckets.

Taste as configuration

Start with

Scope it by directory: global, then repo, then project. Put preferences that apply everywhere (e.g., behavior, long-term goals, teaching) in

When CLAUDE.md gets too long, split it out. A long

If you do something ≥ once a week, make it a skill.
A skill is a markdown file with a name, trigger, and procedure that the model loads on demand. Think of skills as workflows written in markdown. They can include logic. For example, my
I tend to keep

Bootstrap skills by doing the task once and then asking the model to make it a skill. This is how I build most skills. First, I do the task once, interactively, in a normal session. Then, I ask the model to turn what we just did into a skill. Next, I run the skill on the same or a similar task. Inevitably, I’ll need to correct the output, which I do in the same session so the feedback is logged in the session transcript. Finally, I ask the model to update the skill based on the corrections and feedback. You can also seed a skill with examples of the desired output. Ask the model to extract the patterns, like how you organize your code, or the structure and tone of your docs.

Refine skills via the transcript, not the file directly. The first version of a skill rarely works perfectly because it overfits the original session. This is normal. When you run it and need to update the output, correct it within the session. Try not to open and edit

Nonetheless, not every task needs this context. For brainstorming, exploration, and rough drafts, I enjoy using simple mode (

Verification for autonomy

Shift verification left; catch errors at write time. I think of verification as a ladder. The bottom is cheap and deterministic; the top is expensive and requires judgement. We want to address issues at the lowest possible rung. Near the bottom are post-edit hooks that run

Make it easy for the model to verify the work. Give the model feedback loops to improve its output. If the system produces a metric, let the model run the eval and optimize it. If the output renders in a browser, let the model inspect it via Claude in Chrome. If neither, let the model run it and read the error. For example, when building Docker images, I let the model build, read the error, edit the Dockerfile, and rebuild. If I’m tuning a harness, the model runs evals, reads the transcripts, and fixes failures.
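The build, read the error, edit, and rebuild loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `run_and_capture`, `verify_loop`, and the `propose_fix` callback are hypothetical names, and `propose_fix` stands in for whatever step lets the model edit files in response to the error text. None of this is an API of Claude Code or Docker.

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_and_capture(cmd: List[str], tail_lines: int = 20) -> Tuple[bool, str]:
    """Run a command and return (succeeded, last lines of its combined output)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    return proc.returncode == 0, "\n".join(lines[-tail_lines:])


def verify_loop(
    cmd: List[str],
    propose_fix: Callable[[str], None],
    max_attempts: int = 3,
) -> bool:
    """Run cmd; on failure, pass the error text to propose_fix and try again.

    propose_fix is a hypothetical callback (an assumption of this sketch):
    in practice it would be the model editing files based on the log.
    """
    for attempt in range(max_attempts):
        ok, log = run_and_capture(cmd)
        if ok:
            return True
        # Only ask for a fix if another attempt is left to verify it.
        if attempt < max_attempts - 1:
            propose_fix(log)
    return False


if __name__ == "__main__":
    # Toy demo: a command that always fails, with a fix step that only prints.
    failing = [sys.executable, "-c", "import sys; sys.exit('build failed')"]
    print(verify_loop(failing, lambda log: print("model sees:", log), max_attempts=2))
```

With a real build, `cmd` might be something like `["docker", "build", "."]`. Capping the attempts keeps a stuck loop from running indefinitely, and returning only the tail of the output keeps the error text short enough to hand back as context.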
When building a dashboard, the model checks in Chrome that tooltips render, labels don’t overlap, and the narrative matches the numbers.

For long-running tasks, have models watch models. Long sessions can drift as errors build up. One fix is to run a secondary session with fresh context to read the original spec and the recent turns of the primary session. My minimal setup uses two tmux panes, one for the primary dev, one for the pair programmer. Initial instructions and follow-up prompts are appended to a shared file. Periodically, the pair programmer spins up, checks the spec against the primary’s recent transcript, and if something’s off, provides feedback to course correct.

We can do this in various ways. For example, the pair programmer can watch for execution drift—is the model doing the task right? This is local and tactical, like ignoring an error, reporting a bad metric, or diverging from the spec. There’s also direction drift—is the model doing the right task? This is bigger picture and strategic, and occurs when the model misinterprets the original intent and spends hours building the wrong thing. Check for execution drift often and direction drift occasionally.

Scaling via delegation

Delegate increasingly bigger chunks of work. Sometimes, we pair-program with models: short tasks, fast feedback, staying in the loop. This works well for fast iterations, exploratory analysis, and prototyping. But with increasingly stronger models, we should aim to delegate bigger tasks. Explain your intent, constraints, and success criteria upfront, then let the model work. You can’t delegate what you can’t verify, so this requires first defining success criteria and metrics. The shift is from giving instructions, one at a time, to fleshing out plans and letting the model execute them end to end: “Given these eval suites, build isolated containers per suite and smoke-test that each builds.
Then, do the full run, log the eval metrics and transcripts, and use subagents to read the transcripts and confirm the evals ran correctly. Run each eval n times for confidence intervals. Finally, generate the report, verify it follows the report guide, and Slack me the results and report URL.”

Run sessions in parallel and find the bottleneck. Delegating bigger tasks means we can run more at once. Claude says I typically run three to six sessions simultaneously. The bottleneck has shifted from doing the work to writing clear specs and reviewing outputs fast enough to keep the pipeline moving—the middle is hollowing out. If parallel sessions share a repo, use git worktrees so each session gets its own checkout and sessions don’t overwrite each other’s changes.

Make sessions easy to observe. When running multiple sessions, I need to know their state and which one needs attention. On my Mac, a stop hook plays a sound when a session finishes (example below). My tmux window titles use a status emoji (⏳ working; 🟢 complete) and a short Haiku-generated label so I know what each pane is doing. The Claude Code status line shows context usage and the current mode. Together, the stop-hook sound signals a finished task, the tmux titles show which one, and the status line provides the details. You can check in even if AFK.

Closing the loop

Keep the context rich by working in the open. When we do our work in shared docs, repos, and channels, it is easier for everyone—including models—to retrieve and benefit from the context. What we share today becomes part of the org context tomorrow. Try this simple test: could a new teammate replicate your work from last week using only the shared context? If yes, you’re contributing well to the org context; if not, that precious context is stuck in your head. I automate this somewhat via instructions in my

Mine your transcripts for config updates. Have the model read past session transcripts to find gaps.
When I scanned ~2,500 of my past user turns, a sizable percentage contained phrases like “can you also…”, “did you check…”, “still wrong”, etc. These suggest that the model should have done something unprompted, and I should update the

Refactor and prune periodically. As configs grow, they can overlap or conflict with each other. So if the model ignores a rule, it may be because another rule contradicts it. Fix this by refactoring periodically. Each rule or preference should live in exactly one place (though critical instructions can be repeated in the main

• • •

While the specific setup will likely change as models get better, I think the principles will remain relevant: provide good context, encode your taste, make verification cheap, delegate more, and close the loop. What we’re doing is training a collaborator, one piece of feedback at a time. And if you think about it, these principles apply to how we work with a human team too.

To get started, have your model read this SETUP.txt and help you apply it. Also, I’d love to learn what practices or principles you’ve found valuable—please comment below or reach out!

p.s. This isn’t just about personal tooling. It’s also how you’d design agent harnesses, set team norms, and build org infrastructure. Try reading it again with those layers in mind.
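As a rough illustration of the transcript mining described earlier, here is a minimal Python sketch that counts how many user turns contain correction phrases. It assumes you have already extracted your user turns into plain strings (the transcript file format is not covered here), and the function names `count_correction_phrases` and `share_of_flagged_turns` are hypothetical. The phrase list is just the three examples quoted above; extend it with your own.

```python
from typing import Dict, Iterable, List

# Phrases quoted in the post; treat this list as a starting point only.
CORRECTION_PHRASES: List[str] = ["can you also", "did you check", "still wrong"]


def count_correction_phrases(
    user_turns: Iterable[str],
    phrases: List[str] = CORRECTION_PHRASES,
) -> Dict[str, int]:
    """Count how many user turns contain each phrase (case-insensitive)."""
    counts = {phrase: 0 for phrase in phrases}
    for turn in user_turns:
        text = turn.lower()
        for phrase in phrases:
            if phrase in text:
                counts[phrase] += 1
    return counts


def share_of_flagged_turns(
    user_turns: Iterable[str],
    phrases: List[str] = CORRECTION_PHRASES,
) -> float:
    """Return the fraction of user turns containing at least one phrase."""
    turns = list(user_turns)
    if not turns:
        return 0.0
    flagged = sum(any(p in turn.lower() for p in phrases) for turn in turns)
    return flagged / len(turns)


if __name__ == "__main__":
    sample_turns = [
        "Can you also add tests?",
        "Looks good, thanks.",
        "Still wrong. Did you check the logs?",
        "Ship it.",
    ]
    print(count_correction_phrases(sample_turns))
    print(share_of_flagged_turns(sample_turns))
```

Plain substring matching is crude, so the counts are best used to pick which sessions to reread. Deciding what to change in a config still takes reading the flagged turns in context.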
I build ML, RecSys, and LLM systems that serve customers at scale, and write about what I learn along the way. Join 7,500+ subscribers!